SCOPE: Spatial COnsistency across PErspectives and Viewpoints
A benchmark dataset for evaluating spatial reasoning consistency across viewpoints in Large Vision-Language Models.
SCOPE discretizes the full 360° field of view into 8 viewpoints at 45° increments and evaluates both egocentric and allocentric spatial reasoning.
All tasks query spatial relations between a target object and a reference object from multiple viewpoints, enabling systematic diagnosis of viewpoint-consistent reasoning.
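To make the setup concrete, here is a minimal Python sketch, not part of the benchmark's released tooling, that enumerates the 8 viewpoint headings and classifies the egocentric relation of a target relative to a reference from each one. The relation vocabulary, function names, and axis convention are illustrative assumptions.

```python
import math

# The 8 viewpoint headings implied by the 45-degree discretization.
VIEWPOINT_HEADINGS = [i * 45.0 for i in range(8)]  # 0, 45, ..., 315

def egocentric_relation(target_xy, reference_xy, heading_deg):
    """Classify where the target lies relative to the reference, as seen by a
    viewer at the reference facing along heading_deg (0 = +x axis,
    counter-clockwise positive -- an assumed convention)."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    bearing = math.degrees(math.atan2(dy, dx)) % 360.0  # allocentric bearing
    relative = (bearing - heading_deg) % 360.0          # bearing in the viewer's frame
    if relative < 45 or relative >= 315:
        return "in front of"
    if relative < 135:
        return "left of"
    if relative < 225:
        return "behind"
    return "right of"

# The same scene yields a different egocentric answer from each viewpoint,
# and a consistent model must keep all 8 answers mutually coherent.
for heading in VIEWPOINT_HEADINGS:
    print(f"{heading:5.1f} deg: target is "
          f"{egocentric_relation((1, 0), (0, 0), heading)} the reference")
```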
Four data sources spanning real-world, synthetic, and large-scale 3D scenes
Three tasks probe distinct aspects of viewpoint-consistent spatial reasoning
Spatial Consistency: the same spatial relation is queried from all 8 viewpoints; correct answers must be coherent across every view.
Spatial Updating: models infer object positions after an imagined counter-clockwise rotation around a midpoint by a given angle (see the sketch below).
Spatial Integration: given 4 images from different angles, models must identify which view (or object) satisfies a spatial statement.
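The imagined rotation in the Spatial Updating task reduces to plain 2D geometry. The sketch below rotates object positions counter-clockwise about their midpoint; the top-down coordinate frame and all names are assumptions for illustration, not the benchmark's API.

```python
import math

def rotate_about_midpoint(points, angle_deg):
    """Rotate 2D points counter-clockwise by angle_deg around their midpoint,
    mirroring the imagined rotation in the Spatial Updating task."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy  # offset from the midpoint
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated

# A target at (1, 0) and a reference at (-1, 0): after a 90-degree
# counter-clockwise rotation about their midpoint (the origin),
# the target moves to (0, 1) and the reference to (0, -1).
print(rotate_about_midpoint([(1.0, 0.0), (-1.0, 0.0)], 90.0))
```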
Zero-shot evaluation across 26 LVLMs. Human performance is shown as an upper bound.
| Model | Avg | Spatial Consistency (Ego) | Spatial Consistency (Allo) | Spatial Updating (Ego) | Spatial Updating (Allo) | Spatial Integration (Ego) | Spatial Integration (Allo) |
|---|---|---|---|---|---|---|---|
| 👤 Human | 91.24 | 97.66 | 83.12 | 88.67 | 90.00 | 94.67 | 93.33 |
| **Baselines** | | | | | | | |
| Random (25%) | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
| GPT-5.2 (text-only) | 22.28 | 23.84 | 27.19 | 25.26 | 22.94 | 22.48 | 11.96 |
| **Proprietary Models** | | | | | | | |
| 🥇 GPT-5-mini | 58.55 | 89.55 | 34.90 | 48.45 | 26.24 | 83.56 | 68.57 |
| 🥈 Gemini-2.5-Pro | 51.26 | 88.89 | 22.40 | 47.37 | 26.19 | 25.00 | 97.68 |
| Gemini-2.5-Flash | 50.05 | 85.40 | 25.92 | 46.78 | 25.26 | 22.29 | 94.64 |
| GPT-5-nano | 42.67 | 75.45 | 30.01 | 29.65 | 24.45 | 46.25 | 50.18 |
| GPT-5.2 | 40.68 | 66.11 | 26.71 | 28.57 | 25.00 | 50.75 | 46.96 |
| Gemini-2.5-Flash-Lite | 32.26 | 32.43 | 25.49 | 24.88 | 24.08 | 25.40 | 61.25 |
| **Spatial-Specialized Models** | | | | | | | |
| 🥇 RoboBrain-32B | 45.37 | 87.43 | 25.51 | 23.73 | 24.75 | 28.63 | 82.14 |
| RoboBrain-7B | 40.92 | 81.49 | 25.76 | 17.39 | 26.68 | 25.28 | 68.93 |
| RoboBrain-3B | 35.49 | 71.06 | 26.33 | 16.45 | 24.62 | 26.18 | 48.30 |
| SpatialRGPT | 34.07 | 58.14 | 25.22 | 22.84 | 23.10 | 25.00 | 50.09 |
| SpatialReasoner | 31.51 | 47.74 | 26.43 | 22.55 | 25.82 | 24.59 | 41.94 |
| **Open-Source Models** | | | | | | | |
| 🥇 InternVL2.5-38B | 39.98 | 75.52 | 25.82 | 26.49 | 24.78 | 24.50 | 62.77 |
| 🥈 InternVL2.5-14B | 38.83 | 74.86 | 24.46 | 20.50 | 25.10 | 26.27 | 61.76 |
| Qwen2.5-VL-72B | 37.78 | 68.23 | 25.48 | 26.75 | 24.30 | 33.99 | 47.95 |
| InternVL2.5-78B | 36.47 | 78.94 | 25.51 | 26.82 | 25.82 | 27.10 | 34.64 |
| LLaVA-OV-8B | 35.56 | 57.05 | 25.10 | 24.88 | 25.63 | 24.53 | 56.16 |
| Qwen2.5-VL-32B | 36.23 | 64.36 | 25.38 | 24.93 | 16.28 | 25.73 | 60.71 |
| InternVL2.5-8B | 33.22 | 67.03 | 25.06 | 25.17 | 25.41 | 23.44 | 45.18 |
| Qwen2.5-VL-7B | 34.52 | 58.51 | 24.90 | 20.68 | 25.06 | 26.08 | 51.88 |
| Qwen2.5-VL-3B | 33.71 | 46.37 | 25.79 | 24.25 | 24.49 | 25.14 | 56.20 |
| LLaVA-OV-4B | 32.08 | 56.16 | 26.11 | 24.72 | 26.90 | 26.82 | 31.79 |
| InternVL2.5-2B | 30.26 | 44.32 | 25.00 | 20.99 | 24.65 | 25.00 | 41.61 |
| LLaMA-3.2-11B | 29.82 | 35.87 | 26.24 | 21.49 | 25.03 | 25.57 | 44.69 |
| Gemma-3-12B | 28.91 | 25.00 | 25.13 | 26.44 | 24.11 | 21.70 | 51.07 |
| Gemma-3-27B | 28.44 | 25.00 | 13.20 | 25.83 | 26.40 | 27.00 | 53.21 |
| Gemma-3-4B | 30.36 | 25.00 | 22.59 | 24.88 | 25.38 | 24.50 | 59.82 |
🥇 Best · 🥈 Second-best · 👤 Human performance (upper bound)
Coming soon — the dataset will be released on HuggingFace and GitHub
JSONL + PNG images (see the loading sketch below)
~2.4 GB total
CC BY 4.0
Free for research use
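Once the files are available, loading the JSONL + PNG layout could look like the sketch below. The field names ("image", "question", "viewpoint") are hypothetical placeholders, since the schema has not been published yet.

```python
import json
from pathlib import Path

def load_scope_examples(jsonl_path, image_dir):
    """Hypothetical loader for the announced JSONL + PNG release.
    One JSON record per line; each record is assumed to reference a PNG file."""
    examples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # "image" is a placeholder field name, not a confirmed schema key.
            record["image_path"] = Path(image_dir) / record["image"]
            examples.append(record)
    return examples

# Example usage once the files are released:
# examples = load_scope_examples("scope.jsonl", "images/")
# print(examples[0]["question"], examples[0]["viewpoint"])
```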
Contact: yoonjikim@yonsei.ac.kr