Beyond Referring Expressions: Scenario Comprehension Visual Grounding

He, Ruozhen; Shah, Nisarg A.; Dong, Qihua; Xiao, Zilin; Koo, Jaywon; Ordonez, Vicente

arXiv:2604.02323 (cs)

[Submitted on 2 Apr 2026]

Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting, scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. Its queries are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting and combines supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.

Comments: 20 pages, 18 figures, Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.02323 [cs.CV] (or arXiv:2604.02323v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.02323

Submission history

From: Nisarg Shah
[v1] Thu, 2 Apr 2026 17:59:08 UTC (2,857 KB)
