Evaluation samples from the NoReGeo benchmark. Each problem is shown in three formats -- (a) text-only, (b) text with a dotted image (points only), and (c) text with a full image (points plus connecting lines) -- together with the golden answer (yellow) and the model's prediction.
NoReGeo is a benchmark for evaluating intrinsic geometric understanding in LLMs, without reasoning, algebra, or chain-of-thought. It contains 2,500 trivial geometry problems across 25 categories, designed to test whether models natively encode spatial relations and geometric properties. Across state-of-the-art LLMs, accuracy peaks at ~65%, and ablations show that such understanding does not emerge from fine-tuning alone, pointing to the need for geometry-aware training from the start.
- [in-progress] Release of the dataset on Hugging Face Datasets.
- [24/11/2025] NoReGeo evaluation code and task-generation code released.
- [08/11/2025] NoReGeo has been accepted to the AAAI 2026 Main Track.
NoReGeo has been integrated with the vLLM library for automatic evaluation. With the benchmark files and models from Hugging Face, you can start automatic evaluation once all required libraries are properly installed.
Please install the vLLM library as described here (preferably version == 0.10.0).
vLLM provides an HTTP server that implements OpenAI's Completions and Chat APIs. This lets you serve models and interact with them through an HTTP client.
In your terminal, start the server with the `vllm serve` command as follows (do not forget to replace the `path-to-benchmark-data` argument with your local path to the benchmark):
```bash
vllm serve OpenGVLab/InternVL3-14B -tp 1 \
    --gpu-memory-utilization 0.85 \
    --port 8005 \
    --max-num-batched-tokens 8000 \
    --enable-chunked-prefill \
    --guided_decoding_backend xgrammar \
    --max_num_seqs 100 \
    --dtype bfloat16 \
    --allowed-local-media-path path-to-benchmark-data \
    --trust-remote-code \
    --disable-log-requests
```

Or use the script already provided in the `scripts` folder:
```bash
sh scripts/vllm_server.sh
```

To run inference, you need to call the server.
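Before launching the full evaluation, it can be useful to sanity-check the server with a single request through its OpenAI-compatible API. The sketch below assumes the port and model from the serve example above and uses the `openai` Python client:

```python
# Minimal sanity check against the locally running vLLM server through its
# OpenAI-compatible API. Port and model name are taken from the serve example
# above; the api_key can be any placeholder unless the server enforces one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8005/v1", api_key="dummy-key")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-14B",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=10,
)
print(response.choices[0].message.content)
```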
Then configure the evaluation script `evaluation/openai_server_inference.py` with the following options:

- For the text-only setup, add the `--use_text_only` flag.
- For multimodal setups, add the `--use_full_images` flag to use full images, and simply omit it for the dotted-images variant.
- Consequently, pass one of `eval_with_dot_images` (dotted-images eval) or `eval_with_full_images` (full-images eval) to `--exp_name`.
- Replace the `path-to-benchmark-data` argument with your local path to the benchmark.
- Add your Hugging Face API token.
Example for multimodal eval with full images:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 20 \
    --data_path path-to-benchmark-data \
    --exp_name eval_with_full_images \
    --use_full_images \
    --api_key "your-hf-token"
```

Example for multimodal eval with dotted images:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 20 \
    --data_path path-to-benchmark-data \
    --exp_name eval_with_dot_images \
    --api_key "your-hf-token"
```

Example for text-only eval:
```bash
python evaluation/openai_server_inference.py \
    --port 8005 \
    --semaphore_limit 10 \
    --batch_size 60 \
    --data_path path-to-benchmark-data \
    --use_text_only \
    --exp_name eval_with_text_only \
    --api_key "your-hf-token"
```

Or use the script `inference.sh` already provided in the `scripts` folder:
```bash
sh scripts/inference.sh
```

The dataset contains three answer schema types:
- `classification`: choose an answer from the provided options list;
- `point`: generate `x` and `y` coordinates for the answer;
- `number`: generate a numerical answer.
The expected answer formats are:

- `classification`: `{'answer': <answer_value>}`
- `point`: `{'x': <x_coordinate>, 'y': <y_coordinate>}`
- `number`: `{'answer': <answer_value>}`
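For reference, a minimal parser for these three answer formats could look like the sketch below; the function name and the strict-JSON assumption are illustrative, not the actual evaluation code:

```python
import json

# Illustrative parser for the three answer schemas. It assumes the model returned
# exactly the JSON object requested by the prompt; single quotes are naively
# converted to double quotes, which works for these simple schemas but is not
# robust for arbitrary string values.
def parse_answer(raw: str, schema_type: str):
    obj = json.loads(raw.strip().replace("'", '"'))
    if schema_type in ("classification", "number"):
        return obj["answer"]
    if schema_type == "point":
        return float(obj["x"]), float(obj["y"])
    raise ValueError(f"Unknown schema type: {schema_type}")

print(parse_answer("{'x': 1.5, 'y': -2.0}", "point"))   # (1.5, -2.0)
print(parse_answer("{'answer': 42}", "number"))          # 42
```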
The corresponding prompt instructions for each answer type are:

- classification:
  "Provide your answer as JSON: {'answer': <value>}, where <value> is from the options: [comma-separated list of options]. Return only that object."
- point:
  "Provide your answer as JSON with keys: 'x' and 'y' for point coordinates. Return only that object."
- number:
  "Provide your answer as JSON: {'answer': <value>}, where <value> is a floating point or integer number. Return only that object."
To generate the dataset, you can run the script in the dataset_creation directory.
To launch it, simply run the following command in your terminal from the project root directory:
```bash
sh dataset_creation/run_all_generators.sh
```

The scripts for creating each task are located in the `dataset_creation/generators/` subdirectory.
The `--num_samples 100` argument defines how many samples are generated for each task.
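As a rough illustration of what a generator does, the sketch below produces samples for a toy classification task (which of three labeled points is leftmost) and honors a `--num_samples` argument; the task and output fields are hypothetical, so consult the real generators in `dataset_creation/generators/` for the actual schemas:

```python
import argparse
import json
import random

# Toy generator sketch: "which labeled point is leftmost?" as a classification task.
# The sample fields below are illustrative, not the benchmark's real schema.
def generate_sample(rng: random.Random) -> dict:
    labels = ["A", "B", "C"]
    points = {lbl: (round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for lbl in labels}
    answer = min(points, key=lambda lbl: points[lbl][0])  # smallest x-coordinate
    return {
        "question": f"Points are given as {points}. Which point is leftmost?",
        "options": labels,
        "answer_type": "classification",
        "answer": answer,
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=100)
    args = parser.parse_args()

    rng = random.Random(0)  # fixed seed for reproducibility
    with open("leftmost_point_samples.jsonl", "w") as f:
        for _ in range(args.num_samples):
            f.write(json.dumps(generate_sample(rng)) + "\n")
```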
The linear_probing directory contains scripts to evaluate the capabilities of pretrained vision models on the generated geometric tasks.
To train and evaluate a linear probe, use the `train_model.py` and `evaluate_model.py` scripts.
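For orientation, a generic linear-probing pipeline looks roughly like the sketch below (frozen backbone features plus a logistic-regression head, here with torchvision's ResNet-50 and scikit-learn); it is not the code in `train_model.py`/`evaluate_model.py`, which may use different backbones and training loops:

```python
# Generic linear-probing sketch: frozen vision backbone -> features -> linear classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.linear_model import LogisticRegression

# Frozen backbone: ResNet-50 with the classification head replaced by identity.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    return backbone(batch).numpy()

# train_paths/train_labels and test_paths/test_labels are placeholders for the
# generated geometric task images and their answers.
# probe = LogisticRegression(max_iter=1000).fit(extract_features(train_paths), train_labels)
# print("probe accuracy:", probe.score(extract_features(test_paths), test_labels))
```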