Evaluation samples from the NoReGeo benchmark. Each problem is shown in three formats -- (a) text-only, (b) text with a dotted image (points only), and (c) text with a full image (points plus connecting lines) -- together with the golden answer (yellow) and the model's prediction.
NoReGeo is a benchmark for evaluating intrinsic geometric understanding in LLMs, without reasoning, algebra, or chain-of-thought. It contains 2,500 trivial geometry problems across 25 categories, designed to test whether models natively encode spatial relations and geometric properties. Across state-of-the-art LLMs, accuracy peaks at ~65%, and ablations show that such understanding does not emerge from fine-tuning alone, pointing to the need for geometry-aware training from the start.
- [in-progress] Release of the dataset on Hugging Face Datasets.
- [24/11/2025] NoReGeo evaluation code and task-generation code released.
- [08/11/2025] NoReGeo has been accepted to the AAAI 2026 Main Track.
NoReGeo has been integrated with the vLLM library for automatic evaluation. With the benchmark files and models from Hugging Face, you can start automatic evaluation once all required libraries are properly installed.
Please install the vLLM library as described here (preferably version == 0.10.0).
vLLM provides an HTTP server that implements OpenAI's Completions and Chat APIs. This lets you serve models and interact with them through an HTTP client.
In your terminal, start the server with the `vllm serve` command as follows (do not forget to replace the `path-to-benchmark-data` argument with your local path to the benchmark):
```bash
vllm serve OpenGVLab/InternVL3-14B -tp 1 \
    --gpu-memory-utilization 0.85 \
    --port 8005 \
    --max-num-batched-tokens 8000 \
    --enable-chunked-prefill \
    --guided_decoding_backend xgrammar \
    --max_num_seqs 100 \
    --dtype bfloat16 \
    --allowed-local-media-path path-to-benchmark-data \
    --trust-remote-code \
    --disable-log-requests
```

Or use the script already provided in the `scripts` folder:
```bash
sh scripts/vllm_server.sh
```

To run inference, you need to call the server.
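Before launching the full evaluation, it can be useful to sanity-check the server with a single request through its OpenAI-compatible API. The sketch below assumes the port and model from the serve example above and uses the `openai` Python client:

```python
# Minimal sanity check against the locally running vLLM server through its
# OpenAI-compatible API. Port and model name are taken from the serve example
# above; the api_key can be any placeholder unless the server enforces one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8005/v1", api_key="dummy-key")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-14B",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=10,
)
print(response.choices[0].message.content)
```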
Then configure the evaluation script `evaluation/openai_server_inference.py` with the following options:

- For the text-only setup, add the `--use_text_only` flag.
- For multimodal setups, add the `--use_full_images` flag to use full images, and simply omit it for the dotted-images variant.
- Consequently, pass one of `eval_with_dot_images` (dotted-images eval) or `eval_with_full_images` (full-images eval) to `--exp_name`.
- Replace the `path-to-benchmark-data` argument with your local path to the benchmark.
- Add your Hugging Face API token.
Example for multimodal eval with full images:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 20 \
    --data_path path-to-benchmark-data \
    --exp_name eval_with_full_images \
    --use_full_images \
    --api_key "your-hf-token"
```

Example for multimodal eval with dotted images:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 20 \
    --data_path path-to-benchmark-data \
    --exp_name eval_with_dot_images \
    --api_key "your-hf-token"
```

Example for text-only eval:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 60 \
    --data_path path-to-benchmark-data \
    --use_text_only \
    --exp_name eval_with_text_only \
    --api_key "your-hf-token"
```

Or use the script `inference.sh` already provided in the `scripts` folder:
```bash
sh scripts/inference.sh
```

The dataset contains three answer schema types:
- `classification`: choose an answer from the provided options list;
- `point`: generate `x` and `y` coordinates for the answer;
- `number`: generate a numerical answer.
The expected answer formats are:

- `classification`: `{'answer': <answer_value>}`
- `point`: `{'x': <x_coordinate>, 'y': <y_coordinate>}`
- `number`: `{'answer': <answer_value>}`
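For reference, a minimal parser for these three answer formats could look like the sketch below; the function name and the strict-JSON assumption are illustrative, not the actual evaluation code:

```python
import json

# Illustrative parser for the three answer schemas. It assumes the model returned
# exactly the JSON object requested by the prompt; single quotes are naively
# converted to double quotes, which works for these simple schemas but is not
# robust for arbitrary string values.
def parse_answer(raw: str, schema_type: str):
    obj = json.loads(raw.strip().replace("'", '"'))
    if schema_type in ("classification", "number"):
        return obj["answer"]
    if schema_type == "point":
        return float(obj["x"]), float(obj["y"])
    raise ValueError(f"Unknown schema type: {schema_type}")

print(parse_answer("{'x': 1.5, 'y': -2.0}", "point"))   # (1.5, -2.0)
print(parse_answer("{'answer': 42}", "number"))          # 42
```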
The corresponding prompt instructions for each answer type are:

- classification:
  "Provide your answer as JSON: {'answer': <value>}, where <value> is from the options: [comma-separated list of options]. Return only that object."
- point:
  "Provide your answer as JSON with keys: 'x' and 'y' for point coordinates. Return only that object."
- number:
  "Provide your answer as JSON: {'answer': <value>}, where <value> is a floating point or integer number. Return only that object."
To generate the dataset, you can run the script in the dataset_creation directory.
To launch it, simply run the following command in your terminal from the project root directory:
```bash
sh dataset_creation/run_all_generators.sh
```

The scripts for creating each task are located in the `dataset_creation/generators/` subdirectory.
The `--num_samples 100` argument defines how many samples are generated for each task.
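As a rough illustration of what a generator does, the sketch below produces samples for a toy classification task (which of three labeled points is leftmost) and honors a `--num_samples` argument; the task and output fields are hypothetical, so consult the real generators in `dataset_creation/generators/` for the actual schemas:

```python
import argparse
import json
import random

# Toy generator sketch: "which labeled point is leftmost?" as a classification task.
# The sample fields below are illustrative, not the benchmark's real schema.
def generate_sample(rng: random.Random) -> dict:
    labels = ["A", "B", "C"]
    points = {lbl: (round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for lbl in labels}
    answer = min(points, key=lambda lbl: points[lbl][0])  # smallest x-coordinate
    return {
        "question": f"Points are given as {points}. Which point is leftmost?",
        "options": labels,
        "answer_type": "classification",
        "answer": answer,
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=100)
    args = parser.parse_args()

    rng = random.Random(0)  # fixed seed for reproducibility
    with open("leftmost_point_samples.jsonl", "w") as f:
        for _ in range(args.num_samples):
            f.write(json.dumps(generate_sample(rng)) + "\n")
```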
The linear_probing directory contains scripts to evaluate the capabilities of pretrained vision models on the generated geometric tasks.
To train and evaluate a linear probe, use the `train_model.py` and `evaluate_model.py` scripts.
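For orientation, a generic linear-probing pipeline looks roughly like the sketch below (frozen backbone features plus a logistic-regression head, here with torchvision's ResNet-50 and scikit-learn); it is not the code in `train_model.py`/`evaluate_model.py`, which may use different backbones and training loops:

```python
# Generic linear-probing sketch: frozen vision backbone -> features -> linear classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.linear_model import LogisticRegression

# Frozen backbone: ResNet-50 with the classification head replaced by identity.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    return backbone(batch).numpy()

# train_paths/train_labels and test_paths/test_labels are placeholders for the
# generated geometric task images and their answers.
# probe = LogisticRegression(max_iter=1000).fit(extract_features(train_paths), train_labels)
# print("probe accuracy:", probe.score(extract_features(test_paths), test_labels))
```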