📄 ACL 2026

SCOPE Dataset

Spatial COnsistency across PErspectives and Viewpoints

A benchmark dataset for evaluating spatial reasoning consistency across viewpoints in Large Vision-Language Models.

Yoonji Kim¹ · Jieun Kim² · Yujin Jeong¹ · Sung-Bae Cho¹

¹Dept. of Computer Science  ·  ²Dept. of Artificial Intelligence  ·  Yonsei University, Seoul, South Korea

20.1K VQA Pairs  ·  939 Unique Scenes  ·  8 Viewpoints (360°)  ·  7,512 Views

Overview

SCOPE discretizes the full 360° circle around each scene into 8 viewpoints at 45° increments and evaluates both egocentric and allocentric spatial reasoning.

[Figure: SCOPE overview]

All tasks query spatial relations between a target object and a reference object from multiple viewpoints, enabling systematic diagnosis of viewpoint-consistent reasoning.
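To make the setting concrete: the correct answer to the same world-space question changes with camera yaw. The sketch below is a hypothetical illustration, not SCOPE's annotation code; the top-down 2D layout, the yaw convention, and the helper name are all assumptions.

```python
import math

def egocentric_relation(target_xy, reference_xy, camera_yaw_deg):
    """Four-way egocentric label for a target relative to a reference.

    Hypothetical sketch: assumes a top-down 2D world where yaw 0 looks
    along +y and yaw increases counter-clockwise. Not SCOPE's code.
    """
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Rotate the world-space displacement into the camera frame.
    theta = math.radians(-camera_yaw_deg)
    cam_x = dx * math.cos(theta) - dy * math.sin(theta)  # + = camera right
    cam_y = dx * math.sin(theta) + dy * math.cos(theta)  # + = farther away
    if abs(cam_x) >= abs(cam_y):
        return "right" if cam_x > 0 else "left"
    return "behind" if cam_y > 0 else "front"

# The same object pair yields a different label at each of the 8 yaws:
for yaw in range(0, 360, 45):
    print(f"{yaw:3d}° -> {egocentric_relation((1.0, 0.0), (0.0, 0.0), yaw)}")
```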

Dataset Composition

Four data sources spanning real-world, synthetic, and large-scale 3D scenes

WildRGB-D (52.65%): Real-world tabletop RGB-D scenes with diverse spatial configurations
SCOPE-Synthetic (35.12%): Blender-rendered indoor/outdoor scenes with precise 3D ground-truth metadata
SCOPE-Real (9.67%): Expert-curated real-world photographs with verified spatial annotations
DL3DV-10K (2.56%): Large-scale point-of-interest scenes for diverse scene coverage
📐 Balanced Directions: SCOPE maintains a near-uniform 12.5% distribution across all 8 spatial directions

Task Groups

Three tasks probe distinct aspects of viewpoint-consistent spatial reasoning

🔄
Spatial Consistency

The same spatial relation is queried from all 8 viewpoints. Correct answers must be coherent across every view.

Factors: Relational · Depth · Orientation

🧭
Spatial Updating

Models infer object positions after an imagined counter-clockwise rotation about a midpoint by a given angle (see the geometry sketch after these task cards).

Factors: Relational · Depth · Orientation · Occlusion

🗺️
Spatial Integration

Given 4 images from different angles, models must identify which view (or object) satisfies a spatial statement.

Factors: Relational · Depth · Orientation · Occlusion
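
The imagined transform behind Spatial Updating is plain 2D rotation geometry. A minimal sketch of rotating a point counter-clockwise about a midpoint (illustrative only; not the SCOPE generation pipeline):

```python
import math

def rotate_ccw_about(point, center, angle_deg):
    """Rotate `point` counter-clockwise about `center` by `angle_deg` degrees."""
    theta = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(theta) - dy * math.sin(theta),
            center[1] + dx * math.sin(theta) + dy * math.cos(theta))

# Rotating 90° CCW about the midpoint between two objects remaps the
# spatial relations accordingly, e.g. a point to the east moves north.
print(rotate_ccw_about((1.0, 0.0), (0.0, 0.0), 90))  # ≈ (0.0, 1.0)
```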

Main Results

Zero-shot evaluation across 26 LVLMs. Human performance shown as upper bound.

| Model | Avg | SC (Ego) | SC (Allo) | SU (Ego) | SU (Allo) | SI (Ego) | SI (Allo) |
|---|---|---|---|---|---|---|---|
| 👤 Human | 91.24 | 97.66 | 83.12 | 88.67 | 90.00 | 94.67 | 93.33 |
| **Baselines** |  |  |  |  |  |  |  |
| Random (25%) | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |
| GPT-5.2 (text-only) | 22.28 | 23.84 | 27.19 | 25.26 | 22.94 | 22.48 | 11.96 |
| **Proprietary Models** |  |  |  |  |  |  |  |
| 🥇 GPT-5-mini | 58.55 | 89.55 | 34.90 | 48.45 | 26.24 | 83.56 | 68.57 |
| 🥈 Gemini-2.5-Pro | 51.26 | 88.89 | 22.40 | 47.37 | 26.19 | 25.00 | 97.68 |
| Gemini-2.5-Flash | 50.05 | 85.40 | 25.92 | 46.78 | 25.26 | 22.29 | 94.64 |
| GPT-5-nano | 42.67 | 75.45 | 30.01 | 29.65 | 24.45 | 46.25 | 50.18 |
| GPT-5.2 | 40.68 | 66.11 | 26.71 | 28.57 | 25.00 | 50.75 | 46.96 |
| Gemini-2.5-Flash-Lite | 32.26 | 32.43 | 25.49 | 24.88 | 24.08 | 25.40 | 61.25 |
| **Spatial-Specialized Models** |  |  |  |  |  |  |  |
| 🥇 RoboBrain-32B | 45.37 | 87.43 | 25.51 | 23.73 | 24.75 | 28.63 | 82.14 |
| RoboBrain-7B | 40.92 | 81.49 | 25.76 | 17.39 | 26.68 | 25.28 | 68.93 |
| RoboBrain-3B | 35.49 | 71.06 | 26.33 | 16.45 | 24.62 | 26.18 | 48.30 |
| SpatialRGPT | 34.07 | 58.14 | 25.22 | 22.84 | 23.10 | 25.00 | 50.09 |
| SpatialReasoner | 31.51 | 47.74 | 26.43 | 22.55 | 25.82 | 24.59 | 41.94 |
| **Open-Source Models** |  |  |  |  |  |  |  |
| 🥇 InternVL2.5-38B | 39.98 | 75.52 | 25.82 | 26.49 | 24.78 | 24.50 | 62.77 |
| 🥈 InternVL2.5-14B | 38.83 | 74.86 | 24.46 | 20.50 | 25.10 | 26.27 | 61.76 |
| Qwen2.5-VL-72B | 37.78 | 68.23 | 25.48 | 26.75 | 24.30 | 33.99 | 47.95 |
| InternVL2.5-78B | 36.47 | 78.94 | 25.51 | 26.82 | 25.82 | 27.10 | 34.64 |
| LLaVA-OV-8B | 35.56 | 57.05 | 25.10 | 24.88 | 25.63 | 24.53 | 56.16 |
| Qwen2.5-VL-32B | 36.23 | 64.36 | 25.38 | 24.93 | 16.28 | 25.73 | 60.71 |
| InternVL2.5-8B | 33.22 | 67.03 | 25.06 | 25.17 | 25.41 | 23.44 | 45.18 |
| Qwen2.5-VL-7B | 34.52 | 58.51 | 24.90 | 20.68 | 25.06 | 26.08 | 51.88 |
| Qwen2.5-VL-3B | 33.71 | 46.37 | 25.79 | 24.25 | 24.49 | 25.14 | 56.20 |
| LLaVA-OV-4B | 32.08 | 56.16 | 26.11 | 24.72 | 26.90 | 26.82 | 31.79 |
| InternVL2.5-2B | 30.26 | 44.32 | 25.00 | 20.99 | 24.65 | 25.00 | 41.61 |
| LLaMA-3.2-11B | 29.82 | 35.87 | 26.24 | 21.49 | 25.03 | 25.57 | 44.69 |
| Gemma-3-12B | 28.91 | 25.00 | 25.13 | 26.44 | 24.11 | 21.70 | 51.07 |
| Gemma-3-27B | 28.44 | 25.00 | 13.20 | 25.83 | 26.40 | 27.00 | 53.21 |
| Gemma-3-4B | 30.36 | 25.00 | 22.59 | 24.88 | 25.38 | 24.50 | 59.82 |

SC = Spatial Consistency · SU = Spatial Updating · SI = Spatial Integration · Ego/Allo = egocentric/allocentric

🥇 Best in group  ·  🥈 Second-best in group  ·  👤 Human performance upper bound

Download & Usage

Coming soon — the dataset will be released on HuggingFace and GitHub

⏳  Dataset release is pending ACL 2026 camera-ready. Check back soon!

File Structure

```
SCOPE/
├── dataset/
│   ├── task1.jsonl   (Ego Spatial Consistency)
│   ├── task2.jsonl   (Allo Spatial Consistency)
│   ├── task3.jsonl   (Ego Spatial Updating)
│   ├── task4.jsonl   (Ego Spatial Integration)
│   ├── task5.jsonl   (Allo Spatial Updating)
│   ├── task6.jsonl   (Allo Spatial Integration)
│   └── image/
│       ├── {scene_id}/
│       │   ├── frame_0deg.png
│       │   ├── frame_45deg.png
│       │   ├── ...
│       │   └── frame_315deg.png
│       └── occlusion/
│           └── {scene_id}/
│               └── frame_{deg}deg.png
├── evaluate/
│   ├── src/
│   └── scripts/
└── inference/
    ├── multi_integration/
    └── viewpoint_invariance_and_spatial_updating/
```
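
Once released, a task split should be easy to iterate with only the standard library. A minimal sketch assuming the layout above; the field names follow the sample entry shown next:

```python
import json
from pathlib import Path

root = Path("SCOPE/dataset")  # adjust to wherever the archive is unpacked

# Each task file is JSONL: one VQA entry per line.
with (root / "task1.jsonl").open() as f:
    for line in f:
        entry = json.loads(line)
        image_path = root / "image" / entry["image"]  # e.g. dl3dv_2/frame_0deg.png
        # entry["question"], entry["options"], and entry["answer"]
        # provide everything needed to prompt and score a model.
```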

Sample Entry

{ "image": "dl3dv_2/frame_0deg.png", "camera_yaw_deg": 0, "source_folder": "dl3dv_2", "object_subject": "bench", "object_reference": "trash can", "question": "From the camera's perspective, where is bench relative to trash can?", "options": { "A": "front", "B": "behind", "C": "left", "D": "right" }, "answer": "C", "answer_text": "left", "question_type": "mcq_4" }

Quick Start

Load with HuggingFace Datasets (coming soon):
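
The official repository ID is not yet published; `Yonsei-AI/SCOPE` below is a placeholder assumption, as are the split name and the accessed fields (taken from the sample entry above):

```python
from datasets import load_dataset

# Placeholder repo ID and split; swap in the official ones once released.
ds = load_dataset("Yonsei-AI/SCOPE", split="test")

sample = ds[0]
print(sample["question"], sample["options"], sample["answer"])
```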
📦 Format: JSONL + PNG images, ~2.4 GB total

⚖️ License: CC BY 4.0, free for research use

📬 Contact: yoonjikim@yonsei.ac.kr

BibTeX

The citation entry will be posted with the ACL 2026 camera-ready release.