📄 ACL 2026

SCOPE Dataset

Spatial COnsistency across PErspectives and Viewpoints

A benchmark dataset for evaluating spatial reasoning consistency across viewpoints in Large Vision-Language Models.

Yoonji Kim¹ · Jieun Kim² · Yujin Jeong¹ · Sung-Bae Cho¹

¹Dept. of Computer Science  ·  ²Dept. of Artificial Intelligence  ·  Yonsei University, Seoul, South Korea

20.1K VQA Pairs  ·  939 Unique Scenes  ·  8 Viewpoints (360°)  ·  7,512 Views

Overview

SCOPE discretizes the full 360° circle around each scene into 8 viewpoints at 45° increments and evaluates both egocentric and allocentric spatial reasoning.

[Figure: SCOPE overview]

All tasks query spatial relations between a target object and a reference object from multiple viewpoints, enabling systematic diagnosis of viewpoint-consistent reasoning.
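To make the setting concrete: the correct answer to the same world-space question changes with camera yaw. The sketch below is a hypothetical illustration, not SCOPE's annotation code; the top-down 2D layout, the yaw convention, and the helper name are all assumptions.

```python
import math

def egocentric_relation(target_xy, reference_xy, camera_yaw_deg):
    """Four-way egocentric label for a target relative to a reference.

    Hypothetical sketch: assumes a top-down 2D world where yaw 0 looks
    along +y and yaw increases counter-clockwise. Not SCOPE's code.
    """
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Rotate the world-space displacement into the camera frame.
    theta = math.radians(-camera_yaw_deg)
    cam_x = dx * math.cos(theta) - dy * math.sin(theta)  # + = camera right
    cam_y = dx * math.sin(theta) + dy * math.cos(theta)  # + = farther away
    if abs(cam_x) >= abs(cam_y):
        return "right" if cam_x > 0 else "left"
    return "behind" if cam_y > 0 else "front"

# The same object pair yields a different label at each of the 8 yaws:
for yaw in range(0, 360, 45):
    print(f"{yaw:3d}° -> {egocentric_relation((1.0, 0.0), (0.0, 0.0), yaw)}")
```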

Dataset Composition

Four data sources spanning real-world, synthetic, and large-scale 3D scenes

WildRGB-D (52.65%): Real-world tabletop RGB-D scenes with diverse spatial configurations
SCOPE-Synthetic (35.12%): Blender-rendered indoor/outdoor scenes with precise 3D ground-truth metadata
SCOPE-Real (9.67%): Expert-curated real-world photographs with verified spatial annotations
DL3DV-10K (2.56%): Large-scale point-of-interest scenes for diverse scene coverage
📐 Balanced Directions: SCOPE maintains a near-uniform 12.5% distribution across all 8 spatial directions

Task Groups

Three tasks probe distinct aspects of viewpoint-consistent spatial reasoning

🔄
Spatial Consistency

The same spatial relation is queried from all 8 viewpoints. Correct answers must be coherent across every view.

Factors: Relational · Depth · Orientation

🧭
Spatial Updating

Models infer object positions after an imagined counter-clockwise rotation about a midpoint by a given angle (see the geometry sketch after these task cards).

Factors: Relational · Depth · Orientation · Occlusion

🗺️
Spatial Integration

Given 4 images from different angles, models must identify which view (or object) satisfies a spatial statement.

Factors: Relational · Depth · Orientation · Occlusion
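
The imagined transform behind Spatial Updating is plain 2D rotation geometry. A minimal sketch of rotating a point counter-clockwise about a midpoint (illustrative only; not the SCOPE generation pipeline):

```python
import math

def rotate_ccw_about(point, center, angle_deg):
    """Rotate `point` counter-clockwise about `center` by `angle_deg` degrees."""
    theta = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(theta) - dy * math.sin(theta),
            center[1] + dx * math.sin(theta) + dy * math.cos(theta))

# Rotating 90° CCW about the midpoint between two objects remaps the
# spatial relations accordingly, e.g. a point to the east moves north.
print(rotate_ccw_about((1.0, 0.0), (0.0, 0.0), 90))  # ≈ (0.0, 1.0)
```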

Main Results

Zero-shot evaluation across 26 LVLMs. Human performance shown as upper bound.

| Model | Avg | SC (Ego) | SC (Allo) | SU (Ego) | SU (Allo) | SI (Ego) | SI (Allo) |
|---|---|---|---|---|---|---|---|
| 👤 Human | 91.24 | 97.66 | 83.12 | 88.67 | 90.00 | 94.67 | 93.33 |
| **Baselines** |  |  |  |  |  |  |  |
| Random (25%) | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
| GPT-5.2 (text-only) | 22.28 | 23.84 | 27.19 | 25.26 | 22.94 | 22.48 | 11.96 |
| **Proprietary Models** |  |  |  |  |  |  |  |
| 🥇 GPT-5-mini | 58.55 | 89.55 | 34.90 | 48.45 | 26.24 | 83.56 | 68.57 |
| 🥈 Gemini-2.5-Pro | 51.26 | 88.89 | 22.40 | 47.37 | 26.19 | 25.00 | 97.68 |
| Gemini-2.5-Flash | 50.05 | 85.40 | 25.92 | 46.78 | 25.26 | 22.29 | 94.64 |
| GPT-5-nano | 42.67 | 75.45 | 30.01 | 29.65 | 24.45 | 46.25 | 50.18 |
| GPT-5.2 | 40.68 | 66.11 | 26.71 | 28.57 | 25.00 | 50.75 | 46.96 |
| Gemini-2.5-Flash-Lite | 32.26 | 32.43 | 25.49 | 24.88 | 24.08 | 25.40 | 61.25 |
| **Spatial-Specialized Models** |  |  |  |  |  |  |  |
| 🥇 RoboBrain-32B | 45.37 | 87.43 | 25.51 | 23.73 | 24.75 | 28.63 | 82.14 |
| RoboBrain-7B | 40.92 | 81.49 | 25.76 | 17.39 | 26.68 | 25.28 | 68.93 |
| RoboBrain-3B | 35.49 | 71.06 | 26.33 | 16.45 | 24.62 | 26.18 | 48.30 |
| SpatialRGPT | 34.07 | 58.14 | 25.22 | 22.84 | 23.10 | 25.00 | 50.09 |
| SpatialReasoner | 31.51 | 47.74 | 26.43 | 22.55 | 25.82 | 24.59 | 41.94 |
| **Open-Source Models** |  |  |  |  |  |  |  |
| 🥇 InternVL2.5-38B | 39.98 | 75.52 | 25.82 | 26.49 | 24.78 | 24.50 | 62.77 |
| 🥈 InternVL2.5-14B | 38.83 | 74.86 | 24.46 | 20.50 | 25.10 | 26.27 | 61.76 |
| Qwen2.5-VL-72B | 37.78 | 68.23 | 25.48 | 26.75 | 24.30 | 33.99 | 47.95 |
| InternVL2.5-78B | 36.47 | 78.94 | 25.51 | 26.82 | 25.82 | 27.10 | 34.64 |
| LLaVA-OV-8B | 35.56 | 57.05 | 25.10 | 24.88 | 25.63 | 24.53 | 56.16 |
| Qwen2.5-VL-32B | 36.23 | 64.36 | 25.38 | 24.93 | 16.28 | 25.73 | 60.71 |
| InternVL2.5-8B | 33.22 | 67.03 | 25.06 | 25.17 | 25.41 | 23.44 | 45.18 |
| Qwen2.5-VL-7B | 34.52 | 58.51 | 24.90 | 20.68 | 25.06 | 26.08 | 51.88 |
| Qwen2.5-VL-3B | 33.71 | 46.37 | 25.79 | 24.25 | 24.49 | 25.14 | 56.20 |
| LLaVA-OV-4B | 32.08 | 56.16 | 26.11 | 24.72 | 26.90 | 26.82 | 31.79 |
| InternVL2.5-2B | 30.26 | 44.32 | 25.00 | 20.99 | 24.65 | 25.00 | 41.61 |
| LLaMA-3.2-11B | 29.82 | 35.87 | 26.24 | 21.49 | 25.03 | 25.57 | 44.69 |
| Gemma-3-12B | 28.91 | 25.00 | 25.13 | 26.44 | 24.11 | 21.70 | 51.07 |
| Gemma-3-27B | 28.44 | 25.00 | 13.20 | 25.83 | 26.40 | 27.00 | 53.21 |
| Gemma-3-4B | 30.36 | 25.00 | 22.59 | 24.88 | 25.38 | 24.50 | 59.82 |

SC = Spatial Consistency · SU = Spatial Updating · SI = Spatial Integration · Ego/Allo = egocentric/allocentric

🥇 Best in group  ·  🥈 Second-best in group  ·  👤 Human performance upper bound

Download & Usage

Coming soon — the dataset will be released on HuggingFace and GitHub

⏳  Dataset release is pending ACL 2026 camera-ready. Check back soon!

File Structure

```
SCOPE/
├── dataset/
│   ├── task1.jsonl   (Ego Spatial Consistency)
│   ├── task2.jsonl   (Allo Spatial Consistency)
│   ├── task3.jsonl   (Ego Spatial Updating)
│   ├── task4.jsonl   (Ego Spatial Integration)
│   ├── task5.jsonl   (Allo Spatial Updating)
│   ├── task6.jsonl   (Allo Spatial Integration)
│   └── image/
│       ├── {scene_id}/
│       │   ├── frame_0deg.png
│       │   ├── frame_45deg.png
│       │   ├── ...
│       │   └── frame_315deg.png
│       └── occlusion/
│           └── {scene_id}/
│               └── frame_{deg}deg.png
├── evaluate/
│   ├── src/
│   └── scripts/
└── inference/
    ├── multi_integration/
    └── viewpoint_invariance_and_spatial_updating/
```
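
Once released, a task split should be easy to iterate with only the standard library. A minimal sketch assuming the layout above; the field names follow the sample entry shown next:

```python
import json
from pathlib import Path

root = Path("SCOPE/dataset")  # adjust to wherever the archive is unpacked

# Each task file is JSONL: one VQA entry per line.
with (root / "task1.jsonl").open() as f:
    for line in f:
        entry = json.loads(line)
        image_path = root / "image" / entry["image"]  # e.g. dl3dv_2/frame_0deg.png
        # entry["question"], entry["options"], and entry["answer"]
        # provide everything needed to prompt and score a model.
```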

Sample Entry

{ "image": "dl3dv_2/frame_0deg.png", "camera_yaw_deg": 0, "source_folder": "dl3dv_2", "object_subject": "bench", "object_reference": "trash can", "question": "From the camera's perspective, where is bench relative to trash can?", "options": { "A": "front", "B": "behind", "C": "left", "D": "right" }, "answer": "C", "answer_text": "left", "question_type": "mcq_4" }

Quick Start

Load with HuggingFace Datasets (coming soon):
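
The official repository ID is not yet published; `Yonsei-AI/SCOPE` below is a placeholder assumption, as are the split name and the accessed fields (taken from the sample entry above):

```python
from datasets import load_dataset

# Placeholder repo ID and split; swap in the official ones once released.
ds = load_dataset("Yonsei-AI/SCOPE", split="test")

sample = ds[0]
print(sample["question"], sample["options"], sample["answer"])
```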
📦 Format: JSONL + PNG images, ~2.4 GB total

⚖️ License: CC BY 4.0, free for research use

📬 Contact: yoonjikim@yonsei.ac.kr

BibTeX

The citation entry will be posted with the ACL 2026 camera-ready release.