
NoReGeo: Non-Reasoning Geometry Benchmark

Teaser figure: Evaluation samples from the NoReGeo benchmark. Each problem is shown in three formats -- (a) text-only, (b) text with a dotted image (points only), and (c) text with a full image (points plus connecting lines) -- together with the golden answer (yellow) and the model's prediction.

NoReGeo is a benchmark for evaluating intrinsic geometric understanding in LLMs -- without reasoning, algebra, or chain-of-thought. It contains 2,500 trivial geometry problems across 25 categories, designed to test whether models natively encode spatial relations and geometric properties. Across state-of-the-art LLMs, accuracy peaks at ~65%, and ablations show that such understanding does not emerge from fine-tuning alone, pointing to the need for geometry-aware training from the start.

Updates

  • [in-progress] 🌟🌟🌟 Release of the dataset on Hugging Face Datasets 🤗.
  • [24/11/2025] 🌟🌟🌟 NoReGeo evaluation code and task-generation code released.
  • [08/11/2025] 🎉🎉🎉 NoReGeo has been accepted to the AAAI 2026 Main track.

Evaluation with vLLM

NoReGeo has been integrated with the vLLM library for automatic evaluation. With the benchmark files and models from Hugging Face, you can start evaluating automatically once all required libraries are properly installed.

1. Install

Please install the vLLM library as described here (preferably version 10.0.0).
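A minimal install sketch, assuming a plain pip installation from PyPI into an existing Python environment (pin whichever release you need to reproduce):

pip install vllm
# or pin a specific release, e.g.:
# pip install "vllm==<version>"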

2. Online OpenAI-compatible server

vLLM provides an HTTP server that implements OpenAI's Completions and Chat APIs. This lets you serve models and interact with them through any OpenAI-compatible HTTP client.

In your terminal, start the server with the vllm serve command as follows (remember to replace the path-to-benchmark-data argument with your local path to the benchmark):

vllm serve OpenGVLab/InternVL3-14B -tp 1 \
    --gpu-memory-utilization 0.85 \
    --port 8005 \
    --max-num-batched-tokens 20192 \
    --enable-chunked-prefill \
    --guided_decoding_backend xgrammar \
    --max_num_seqs 100 \
    --dtype bfloat16  \
    --allowed-local-media-path path-to-benchmark-data \
    --trust-remote-code  \
    --disable-log-requests

or use the script already provided in the scripts folder:

sh scripts/vllm_server.sh
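Once the server is up, you can sanity-check the OpenAI-compatible endpoint with a plain HTTP request before launching the full evaluation (a minimal sketch; the port and model name must match the vllm serve command above):

curl http://localhost:8005/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "OpenGVLab/InternVL3-14B",
          "messages": [{"role": "user", "content": "Say hello."}],
          "max_tokens": 16
        }'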

3. Inference

To run inference, query the running server with the evaluation script.

  • For the text-only setup, add the --use_text_only flag.
  • For multimodal setups, add the --use_full_images flag to use full images, and simply omit it for the dotted-images variant.
  • Accordingly, pass one of the following to --exp_name: eval_with_dot_images for dotted-image evaluation or eval_with_full_images for full-image evaluation.
  • Replace the path-to-benchmark-data argument with your local path to the benchmark.
  • Add your Hugging Face API token.

Example for multimodal eval with full images:

python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10  \
    --batch_size 20  \
    --data_path path-to-benchmark-data  \
    --exp_name eval_with_full_images  \
    --use_full_images \
    --api_key "your-hf-token"

Example for multimodal eval with dotted images:

python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10  \
    --batch_size 20  \
    --data_path path-to-benchmark-data  \
    --exp_name eval_with_dot_images  \
    --api_key "your-hf-token"

Example for text-only eval:

python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10  \
    --batch_size 60  \
    --data_path path-to-benchmark-data  \
    --use_text_only \
    --exp_name eval_with_text_only \
    --api_key "your-hf-token"

or use the inference.sh script already provided in the scripts folder:

sh scripts/inference.sh

Benchmark structure

Schemas

The dataset contains three answer schema types:

  • classification: choose an answer from the provided options list;
  • point: generate x and y coordinates for the answer;
  • number: generate a numerical answer.

Ground-truth answer formats

  • classification: {'answer': <answer_value>}
  • point: {'x': <x_coordinate>, 'y': <y_coordinate>}
  • number: {'answer': <answer_value>}

Additional prompts for answers

  1. classification
"Provide your answer as JSON: {'answer': <value>}, where <value> is from the options: [comma-separated list of options]. Return only that object."
  2. point
"Provide your answer as JSON with keys: 'x' and 'y' for point coordinates. Return only that object."
  3. number
"Provide your answer as JSON: {'answer': <value>}, where <value> is a floating point or integer number. Return only that object."

Task Creation

To generate the dataset, you can run the script in the dataset_creation directory.

To launch it, simply run the following command in your terminal from the project root directory:

sh dataset_creation/run_all_generators.sh

The generation script for each task is in the dataset_creation/generators/ subdirectory.

The --num_samples 100 argument defines how many samples are generated for each task.
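If you only need a single task type, you can also call an individual generator directly. This is a hypothetical sketch: <task_generator>.py stands for any script in dataset_creation/generators/, assuming each generator accepts the --num_samples flag described above:

python dataset_creation/generators/<task_generator>.py --num_samples 100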

Linear Probing of Vision Encoders

The linear_probing directory contains scripts to evaluate the capabilities of pretrained vision models on the generated geometric tasks.

Training and Evaluation

To train and evaluate a linear probe, use the train_model.py and evaluate_model.py scripts.
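A minimal sketch for getting started (assuming both scripts are standard command-line tools in the linear_probing directory; the exact arguments are defined by the scripts themselves):

python linear_probing/train_model.py --help
python linear_probing/evaluate_model.py --help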
