Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Ruozhen He¹, Nisarg A. Shah², Qihua Dong³, Zilin Xiao¹, Jaywon Koo¹, Vicente Ordonez¹
¹Rice University ²Johns Hopkins University ³Northeastern University
{catherine.he,zilin,jk125,vicenteor}@rice.edu
snisarg812@gmail.com dongqh078@gmail.com

Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.


Figure 1: Referring Scenario Comprehension (RSC) vs. traditional referring expression comprehension (REC). Each row shows the same target object under both paradigms. Traditional REC queries often name the target category directly, allowing success via lexical matching. RSC instead pairs each image with a lengthy scenario-based query specifying a user role, goal, and multiple disambiguating cues, including explicit contrasts against competing objects, and requires output identifying both the target object and its bounding box. The RSC difficulty tags (U/C/S/O/P: Uniqueness, Clutter, Size, Overlap, Position) characterize each instance, enabling fine-grained training and evaluation.

Table 1: Comparison of referring expression grounding benchmarks. We compare existing datasets with RSC across dataset scale, evaluation splits, referring/query style, reasoning supervision, difficulty annotations, and competing-object mentions. Most prior datasets rely on short literal phrases and lack explicit reasoning traces or OOD evaluation. RSC introduces scenario-based queries written as natural language paragraphs and provides per-instance reasoning trace annotations. It further labels five interpretable difficulty factors (U/C/S/O/P) and includes both in-distribution (ID) and out-of-distribution (OOD) category test sets to evaluate reasoning and generalization.

Benchmark	Train	ID Test	OOD Test	Referring Style	Query Style	Avg. $\|q\|$	Reasoning Traces	Difficulty Tags	Competing Mentions
RefCOCO Kazemzadeh et al. (2014)	120,624	10,752	✗	Literal	Phrase	3.5	✗	✗	✗
RefCOCO+ Kazemzadeh et al. (2014)	120,191	10,615	✗	Literal	Phrase	3.5	✗	✗	✗
RefCOCOg Mao et al. (2016)	80,512	9,602	✗	Descriptive literal	Phrase	8.2	✗	✗	✓
Cops-Ref Chen et al. (2020)	119,603	12,586	✗	Compositional literal	Template	14.4	✗	✗	✓
Ref-Reasoning Yang et al. (2020)	721,164	34,609	✗	Literal	Template	8.5	Graph	Graph layout	✓
SK-VG Chen et al. (2023b)	23,403	6,597	✗	Scene knowledge Q&A	Q&A	5.3	✗	E/M/H	✓
FineCops-Ref Liu et al. (2024a)	163,792	9,605	✗	Compositional literal	Phrase	12.2	✗	1/2/3	✓
EgoIntention Sun et al. (2025)	15,667	9,892	✗	Egocentric intention query	Sentence	18.8	✗	✗	✗
RSC (Ours)	31,342	4,038	3,247	Scenario-based query	Paragraph	52.7	Paragraph	U/C/S/O/P	✓

1 Introduction

Visual grounding, the task of associating an image region with a natural language description, is a core capability for embodied assistants and multimodal models Geng et al. (2025); Shen et al. (2021). Despite rapid progress in building capable vision-and-language models (VLMs), strong performance on standard referring expression comprehension (REC) often relies on explicit lexical cues, particularly category references and salient attributes that can be matched to candidate regions without deep understanding Yu et al. (2016); Mao et al. (2016). This protocol creates a systematic blind spot: models optimized for literal phrase matching may fail when users express a need for an object by describing a situation instead of naming the object explicitly.

Real referring behavior is not necessarily concise and direct. Natural language allows users to describe a target through a situational need, an object role, or a user goal. For example, a user who wants to know the time may prompt a model with “checking the time” instead of explicitly asking for “the clock”. Such scenario-based queries may contain multiple disambiguating cues, demanding reasoning over intent and visual context, not just category lookup. Despite being a natural and valid form of reference, this type of query is absent from existing grounding benchmarks, leaving a gap in how we evaluate and develop grounding models. As illustrated in Figure 1, the same target object is described in completely different ways under traditional REC and scenario-based grounding: the former can succeed via lexical matching, while the latter requires integrating relational, spatial, and intentional cues while actively ignoring explicitly mentioned distractors.

We introduce Referring Scenario Comprehension (RSC), a benchmark designed to study this under-examined setting. RSC replaces referring phrases with scenario-based queries that describe a user role, goal, and at least three disambiguating cues, and deliberately mentions competing objects to require deep understanding. Each instance is annotated with reasoning traces and five interpretable difficulty tags (Uniqueness, Clutter, Size, Overlap, and Position), which expose distinct failure modes and support fine-grained curriculum design and evaluation. RSC contains 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories, enabling evaluation of both in-domain disambiguation and cross-category generalization.

Table 1 situates RSC among existing visual grounding benchmarks. Most prior resources contain short literal phrases, are annotated via crowdsourced expression games or scene-graph templates, and provide no out-of-distribution (OOD) evaluation split. Two recent works move toward richer queries: SK-VG Chen et al. (2023b) requires reasoning over external knowledge paragraphs, and EgoIntention Sun et al. (2025) targets implicit affordance grounding in egocentric images. RSC addresses a complementary gap, exocentric scenario grounding, where the challenge is reasoning over paragraph-length scenarios and disambiguating among visually similar instances using relational and intentional cues rather than inferring affordances or external knowledge. RSC also provides an OOD split, multi-axis difficulty tags, and per-instance reasoning trace annotations to support diverse learning and analysis.

Beyond introducing the benchmark, we also propose ScenGround, a strong two-stage curriculum reasoning baseline for scenario-based visual grounding. In Stage 1, Thought-Primed SFT (TP-SFT) aligns the model to the output schema and elicits faithful reasoning traces before a structured answer, using the easier RSC slices to stabilize interface learning. In Stage 2, Incentive-Curriculum GRPO (IC-GRPO) refines localization and disambiguation via shaped rewards that couple geometry terms (a smooth IoU reward with center-consistency and out-of-bounds penalties) with alias-aware category rewards. Training follows a tag-aware curriculum that feeds more difficult non-unique, cluttered, overlapping, and off-center targets in the later stage. A prompt-template ensemble further improves robustness across query surface forms. ScenGround serves as a well-characterized baseline demonstrating that difficulty-aware curriculum training substantially improves scenario grounding and transfers the improvement to standard benchmarks.

Our main contributions are as follows:

• We introduce RSC, a scenario-based visual grounding benchmark with difficulty-controlled instances, per-instance reasoning trace annotations, and an OOD split with disjoint object categories.

• We propose ScenGround, a curriculum reasoning method combining supervised warm-starting with difficulty-aware reinforcement learning, providing a strong and well-characterized reference point for scenario-based grounding.

• Experiments demonstrate that RSC exposes failure modes invisible to standard benchmarks, and that difficulty-aware curriculum training transfers improvements to standard referring benchmarks.

2 Related Work

Referring Expression Datasets and Methods. Referring expression comprehension aims to localize image regions described by natural language. Early datasets Yu et al. (2016); Kazemzadeh et al. (2014); Mao et al. (2016); Plummer et al. (2015) introduced the task using short, literal phrases that directly name the target objects. Subsequent benchmarks expanded the scope along multiple axes: compositional reasoning with hard negatives Liu et al. (2019); Chen et al. (2020); Liu et al. (2024a); Dong et al. (2026), structured reasoning over scene graphs Yang et al. (2020), segmentation-based grounding Lai et al. (2024); Wu et al. (2020), 3D referring expressions Chen and Chang (2020); Achlioptas et al. (2020), and GUI domains You et al. (2024). More recently, SK-VG Chen et al. (2023b) introduces grounding conditioned on long-form scene knowledge, and EgoIntention Sun et al. (2025) targets egocentric intention grounding where models must infer the user intent from first-person views. To the best of our knowledge, no existing benchmark evaluates grounding from scenario-based queries: paragraph-length descriptions specifying user roles, goals, and explicit distractor contrasts, where the challenge lies in reasoning over rich intentional context to rule out distractors.

Visual Grounding with Vision-and-Language Models. Early visual grounding methods adopted modular pipelines pairing region proposals with language encoders Kamath et al. (2021); Li et al. (2022). Transformer-based architectures Deng et al. (2021); Li et al. (2024); Zhan et al. (2025) later reformulated referring expression comprehension as an end-to-end task. Grounding DINO Liu et al. (2024b) further unified open-set object detection with text-conditioned localization. With the rise of large vision-language models (LVLMs), grounding has become an inherent capability within general-purpose multimodal systems. Models such as Ferret You et al. (2023), Shikra Chen et al. (2023a), and Kosmos-2 Peng et al. (2023) pioneered region-level referring and grounding in LVLMs by representing spatial coordinates as part of the text generation process. Subsequent work scaled these ideas: Qwen-VL Wang et al. (2024) and Qwen2.5-VL Bai et al. (2025a) align image–caption–box tuples during pretraining to support open-vocabulary grounding, while InternVL Chen et al. (2024c, b) and VisionLLM Wu et al. (2024) extend grounding to hundreds of vision-language tasks. Most recently, UniVG-R1 Bai et al. (2025b) incorporates reinforcement learning with Chain-of-Thought reasoning Wei et al. (2022) for universal grounding. Despite these advances, existing methods are predominantly evaluated on short object-centric queries, and the scenario-based grounding setting, where success requires reasoning over roles, goals, and relational context, remains unstudied. RSC is designed to fill this gap, and ScenGround provides a strong curriculum reasoning baseline tailored for this setting.

3 Referring Scenario Comprehension

Referring Scenario Comprehension (RSC) asks a model to localize an object from a scenario that describes a user role, a goal, distractors, and non-literal disambiguating cues. Given an image $x\in\mathbb{R}^{H\times W\times 3}$ and a scenario $s$ , a model $\phi$ predicts:

$$\phi(x,s)\;\to\;(\widehat{y},\,\widehat{b}),\quad\widehat{b}=(\hat{x},\hat{y},\hat{w},\hat{h})\in\mathbb{Z}_{\geq 0}^{4}, \tag{1}$$

where $\widehat{b}$ is an xywh box clipped to $[0,W]\!\times\![0,H]$ . The dataset provides tuples

$$\mathcal{D}=\big\{(x_{i},\,s_{i},\,y_{i},\,b_{i},\,\mathcal{A}_{i},\,e_{i},\,r_{i},\,\tau_{i})\big\}_{i=1}^{N}, \tag{2}$$

with gold category $y_{i}$ , box $b_{i}$ , acceptable aliases $\mathcal{A}_{i}$ , a concise referring expression $e_{i}$ , a reasoning trace $r_{i}$ , and interpretable difficulty tags $\tau_{i}=(U_{i},C_{i},S_{i},O_{i},P_{i})$ for Uniqueness, Clutter, Size, Overlap, and Position. By construction, $s_{i}$ does not reveal $y_{i}$ . The curation pipeline proceeds in three phases, illustrated in Figure 2.
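As a reading aid, the tuple in Eq. (2) and the box clipping mentioned after Eq. (1) can be sketched as a small Python record. This is only an illustration under assumptions: the field names, the choice to store the image as a path, and the helper names `clip_xywh` and `alias_consistent` are hypothetical and are not taken from the RSC release.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels


@dataclass(frozen=True)
class RSCInstance:
    """One element of D in Eq. (2); field names are illustrative only."""
    image_path: str                       # x_i, stored as a path (assumption)
    scenario: str                         # s_i, category-free scenario text
    category: str                         # y_i, gold category
    box: Box                              # b_i, gold box in xywh
    aliases: FrozenSet[str]               # A_i, acceptable category names
    expression: str                       # e_i, concise referring expression
    reasoning: str                        # r_i, reasoning trace
    tags: Tuple[str, str, str, str, str]  # tau_i = (U, C, S, O, P)


def clip_xywh(box: Box, width: int, height: int) -> Box:
    """Clip an xywh box to [0, W] x [0, H] by clamping its two corners."""
    x, y, w, h = box
    x0 = min(max(x, 0), width)
    y0 = min(max(y, 0), height)
    x1 = min(max(x + w, 0), width)
    y1 = min(max(y + h, 0), height)
    return (x0, y0, x1 - x0, y1 - y0)


def alias_consistent(inst: RSCInstance) -> bool:
    """Check the alias-consistency condition y_i in A_i."""
    return inst.category in inst.aliases
```

For example, `clip_xywh((-10, 5, 50, 200), 100, 100)` clamps the box to the image and yields `(0, 5, 40, 95)`.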

Phase 1: Source Filtering and Balancing.

We source instances from MS-COCO Lin et al. (2014), preferring images that overlap with RefCOCO/+/g for visual recognizability, and prevent leakage by excluding evaluation image IDs, retaining category labels only as hidden construction signals, and de-duplicating by image hash and annotation ID before splitting. The OOD split draws from LVIS Gupta et al. (2019) with COCO overlaps and synonym collisions removed to ensure strict category disjointness at both string and synset levels.

For each instance $i$ with box $b_{i}=(x_{i},y_{i},w_{i},h_{i})$ and center $c_{i}$ , we compute:

$$a_{i}=\frac{w_{i}h_{i}}{HW},\ d_{i}=\frac{\lVert c_{i}-(\tfrac{W}{2},\tfrac{H}{2})\rVert_{2}^{2}}{W^{2}+H^{2}},\ o_{i}=\sum_{j\neq i}\mathrm{IoU}(b_{i},b_{j}). \tag{3}$$

Difficulty tags are assigned by quantile binning estimated on the candidate pool before splitting: Size (S) via tertiles of $a_{i}$ ; Overlap (O) via low/mid/high percentiles of $o_{i}$ ; Position (P) via a median split of $d_{i}$ ; Uniqueness (U) by the presence of same-category distractors ( $m_{i}{=}0\Rightarrow$ U1; $m_{i}{\geq}1\Rightarrow$ U2); and Clutter (C) by binning the per-image instance count $N_{\mathrm{img}}$ . These axes expose ambiguity (U), scene density (C), scale (S), occlusion and congestion (O), and off-center placement (P) in a way that is both interpretable and controllable.
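The per-instance statistics in Eq. (3) and the Uniqueness rule can be sketched in a few lines of Python. This is a minimal illustration rather than the curation code: the function names are hypothetical, boxes are assumed to be xywh tuples for one image, and the quantile binning for S, O, P, and C is omitted because the cut points are estimated on the candidate pool.

```python
def iou_xywh(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def geometry_stats(boxes, i, width, height):
    """Return (a_i, d_i, o_i) from Eq. (3) for the i-th box of an image."""
    x, y, w, h = boxes[i]
    area = (w * h) / (width * height)                 # a_i: relative area
    cx, cy = x + w / 2, y + h / 2                     # c_i: box center
    dist = ((cx - width / 2) ** 2 + (cy - height / 2) ** 2) / (width ** 2 + height ** 2)
    overlap = sum(iou_xywh(boxes[i], b) for j, b in enumerate(boxes) if j != i)
    return area, dist, overlap


def uniqueness_tag(categories, i):
    """U1 if no other instance shares the category (m_i = 0), else U2."""
    m = sum(1 for j, c in enumerate(categories) if j != i and c == categories[i])
    return "U1" if m == 0 else "U2"
```

Note that $d_{i}$ follows Eq. (3) as printed, that is, a squared center distance normalized by $W^{2}+H^{2}$.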

To form a balanced candidate pool, we target split-wise marginal proportions $\Pi=\{\pi_{U},\pi_{C},\pi_{S},\pi_{O},\pi_{P}\}$ and enforce per-category quotas to avoid category dominance. Allocation proceeds hierarchically: (i) assign a quota per category using long-tail-favoring priorities; (ii) partition each category’s quota across tag combinations according to $\Pi$ ; (iii) sample within each bin. A continuous difficulty score $D_{i}\in[0,1]$ uses a monotone combination of quantile-normalized $(a_{i},o_{i},d_{i})$ , distractor count, and category rarity, ordering instances within each bin.

Phase 2: Annotation.

Inspired by prior model-in-the-loop pipelines Kirillov et al. (2023); Wang et al. (2023); Xu et al. (2025), our scenario generation follows a two-stage process to ensure annotation quality before scaling. In a small-scale refinement loop, we first generate scenarios for a small set of random samples, then conduct iterative system refinement and a human audit to validate and improve the generation prompt. Generation proceeds to large scale only once the audited pass rate exceeds a quality threshold of 90%.

In large-scale generation, an LLM receives two views of each target image: the full image and the same image with a red rectangle prior marking the target region (the rectangle is not part of the object). It then returns a structured JSON containing: a concise category-free referring expression ( $\leq$ 25 words), a user-driven scenario (role, goal, and $\geq$ 3 disambiguating cues from attributes, relations, position, and affordance), a reasoning trace explaining how the scenario maps to visual evidence, canonical object attributes, an alias set $\mathcal{A}_{i}$ , and a predicted bounding box $\hat{b}_{i}$ . The prompt enforces that category names are absent from the scenario and expression, and requires at least one explicit contrast with a plausible distractor.

Table 2: RSC data retention across quality control stages. After LLM GT Filter: instances passing the automatic schema, box IoU, alias consistency, and leakage gates. After Quality Filter: instances passing the full judge-and-audit pipeline.

Split	Initial	After GT Filter	Kept (% of Initial)	After Quality Filter	Kept (% of GT-filtered)
Train (SFT)	30,000	29,551	98.5%	23,802	80.5%
Train (RL)	10,000	9,884	98.8%	7,540	76.3%
Test (ID)	5,000	4,915	98.3%	4,038	82.2%
Test (OOD)	5,000	3,983	79.7%	3,247	81.5%

Phase 3: Audit and Filter.

Instances passing the automatic gates in Phase 2 form a rough annotation pool, which undergoes dual-track quality control summarized in Table 2. The LLM GT filter, comprising schema validation, box IoU gate, alias consistency, and leakage detection, retains 98–99% of ID instances and 79.7% of OOD instances. The lower OOD retention reflects the greater visual ambiguity of LVIS categories, for which the LLM more frequently fails to produce an accurate box or alias set. This suggests that the filter performs meaningful quality control on harder categories rather than passing annotations indiscriminately.

Instances passing the LLM GT filter then enter the quality filter stage. A quality judge automatically scores each annotation on whether the scenario refers to a unique target and whether the bounding box matches that target. Borderline cases enter a judge refinement loop applying targeted system and human correction before re-scoring. After this stage, 76–82% of LLM-filtered instances are retained across splits, yielding final counts of 23,802 training (SFT), 7,540 training (RL), 4,038 in-domain test, and 3,247 OOD test examples.

In parallel, three independently drawn samples of 100 instances each undergo human audit by three expert annotators, who independently verify scenario non-leakage, alias consistency ( $y_{i}\in\mathcal{A}_{i}$ ), box correctness ( $\mathrm{IoU}(\hat{b}_{i},b_{i})\geq\theta_{\mathrm{bbox}}$ ), and attribute faithfulness. Majority-vote accuracy was 95.7% overall (per-rater: 94%/96%/97%), with substantial agreement (Fleiss’ $\kappa{=}\,$ 0.94). The residual 4% errors were attributable to LLM-hallucinated distractor descriptions ( ${\sim}3\%$ ) and minor attribute misdescriptions ( ${\sim}1\%$ ).

Each released RSC instance provides a scenario query, a reasoning trace, acceptable category names, a ground-truth bounding box, and difficulty tags, with annotation logs and metadata included for research use. More details are provided in the appendix.

4 ScenGround

ScenGround is a two-stage curriculum reasoning method for scenario-based grounding on RSC. Thought-Primed SFT (TP-SFT) aligns the interface and elicits faithful <think> traces that precede <answer>. Incentive-Curriculum GRPO (IC-GRPO) then refines reasoning and localization by optimizing shaped rewards over RSC’s difficulty-stratified curriculum.

4.1 Thought-Primed SFT

Given image $x$ , scenario $s$ , and target text $\mathbf{y}$ (the concatenation of <think> and <answer> spans), we optimize the standard next-token loss:

$$\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,s,\mathbf{y})\sim\mathcal{D}_{\mathrm{sft}}}\left[\sum_{t=1}^{|\mathbf{y}|}\log p_{\theta}\!\left(y_{t}\mid x,s,\mathbf{y}_{<t}\right)\right]. \tag{4}$$

The output schema requires a single JSON inside <answer> with keys target_object and bbox (xywh, pixel integers). Training uses the easier RSC slices ( $D_{i}$ easy percentiles) to stabilize schema learning before RL. TP-SFT teaches the output schema, elicits faithful reasoning traces inside <think>, and provides a stable reference policy $\pi_{\mathrm{ref}}$ for the subsequent RL stage.
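To make the output schema concrete, the following sketch parses a model completion and validates the JSON inside <answer>. It is an illustration only: the closing tags `</think>` and `</answer>`, the function name `parse_answer`, and the choice to reject outputs with more than one answer span are assumptions, not details confirmed by the text above.

```python
import json
import re

# Assumes the answer span is closed with </answer> (not stated in the text).
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_answer(text):
    """Return (target_object, bbox) or None if the output violates the schema."""
    spans = _ANSWER_RE.findall(text)
    if len(spans) != 1:          # require exactly one answer span
        return None
    try:
        obj = json.loads(spans[0].strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name = obj.get("target_object")
    bbox = obj.get("bbox")
    if not isinstance(name, str) or not isinstance(bbox, list) or len(bbox) != 4:
        return None
    # bbox must be four non-negative pixel integers in xywh order
    if not all(isinstance(v, int) and not isinstance(v, bool) and v >= 0 for v in bbox):
        return None
    return name, tuple(bbox)
```

A completion such as `<think>The wall item shows the hour.</think><answer>{"target_object": "clock", "bbox": [10, 20, 30, 40]}</answer>` parses to `("clock", (10, 20, 30, 40))`, while a non-JSON answer span returns `None`.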

4.2 Incentive-Curriculum GRPO

We fine-tune $\pi_{\theta}$ using GRPO Guo et al. (2025) with group-relative advantages (Eq. 6) and a KL-regularized objective with adaptive $\beta$ tracking a target KL band throughout training (see Appendix D).
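Since Eq. 6 and Appendix D are not reproduced in this section, the sketch below uses the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation) and a simple multiplicative controller for $\beta$. Both are assumptions for illustration: the function names, the `eps` constant, the KL band limits, and the adjustment factor are hypothetical and may differ from the paper's exact formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the K rollout rewards of one prompt within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def adapt_beta(beta, observed_kl, kl_low, kl_high, factor=1.5):
    """Hypothetical controller that keeps the KL inside [kl_low, kl_high]."""
    if observed_kl > kl_high:   # policy drifting too far: strengthen the penalty
        return beta * factor
    if observed_kl < kl_low:    # policy too constrained: relax the penalty
        return beta / factor
    return beta
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one way to see why sparse rewards on hard instances slow learning.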

Shaped rewards.

The scalar reward combines four components (Eq. 13). The geometry reward $r_{\mathrm{iou}}$ combines a base IoU term with smooth logistic bonuses near two operating points, a small center-consistency term, and a penalty for out-of-bounds predictions. The category reward $r_{\mathrm{cat}}$ is alias-aware: it awards full credit for canonical names, partial credit for accepted aliases and token-overlap matches, and is gated by a minimum IoU threshold to discourage well-labelled but poorly localized predictions. The format and structure rewards enforce schema compliance, rewarding parseable JSON inside <answer> with the required keys and penalizing malformed outputs. Reward weights are linearly annealed to increase geometry emphasis in later training.
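The geometry and category rewards described above might look roughly like the sketch below. Every numeric constant here (the operating points 0.5 and 0.7, the bonus size, the logistic sharpness, the center weight, the out-of-bounds penalty, the IoU gate, and the partial-credit values) is a placeholder chosen for illustration, and the function names are hypothetical; the actual values are given by Eq. 13 and the appendix, which this section does not reproduce.

```python
import math


def iou_xywh(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def geometry_reward(pred, gold, width, height, points=(0.5, 0.7),
                    bonus=0.1, sharp=20.0, center_w=0.05, oob_pen=0.2):
    """Base IoU + logistic bonuses + center term - out-of-bounds penalty."""
    iou = iou_xywh(pred, gold)
    reward = iou + sum(bonus / (1.0 + math.exp(-sharp * (iou - p))) for p in points)
    # Center consistency: 1 when centers coincide, 0 at one image diagonal apart.
    dx = (pred[0] + pred[2] / 2) - (gold[0] + gold[2] / 2)
    dy = (pred[1] + pred[3] / 2) - (gold[1] + gold[3] / 2)
    reward += center_w * (1.0 - min(1.0, math.hypot(dx, dy) / math.hypot(width, height)))
    x, y, w, h = pred
    if x < 0 or y < 0 or x + w > width or y + h > height:
        reward -= oob_pen
    return reward


def category_reward(pred_name, canonical, aliases, iou,
                    iou_gate=0.3, alias_credit=0.7, overlap_credit=0.3):
    """Alias-aware credit, gated by a minimum IoU."""
    if iou < iou_gate:
        return 0.0  # well-labelled but poorly localized: no credit
    name = pred_name.strip().lower()
    if name == canonical.lower():
        return 1.0
    if name in {a.lower() for a in aliases}:
        return alias_credit
    if set(name.split()) & set(canonical.lower().split()):
        return overlap_credit
    return 0.0
```

The IoU gate is the design choice that ties the two rewards together: a correct name earns nothing until the box is at least roughly right.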

Tag-aware curriculum.

IC-GRPO samples from RSC using difficulty scores $D_{i}$ . Stage 1 draws predominantly from easy-to-medium slices; Stage 2 shifts toward harder instances with non-unique categories (U2), high clutter (C3), high overlap (O2), and off-center placement (P1). This progressive schedule addresses reward sparsity: easy instances first establish reliable IoU signals before the policy encounters harder disambiguations where the category reward requires a cleared IoU gate.

Prompt-template ensemble.

To reduce sensitivity to surface-level query phrasing, we uniformly sample from eight prompt paraphrases (PTE-8) per training step. All templates share the same output schema; rewards are logged per template for analysis but supervision is identical, improving robustness without changing the learning objective.

Table 3: Performance on RSC (ID and OOD). Metrics: mIoU and $\mathrm{Acc}@\{0.5,0.7\}$ (higher is better); Cat Acc = category accuracy. ^‡Oracle settings, not directly comparable: Grounding DINO receives privileged inputs unavailable at inference: cat token provides the gold category name directly, and ref. cue feeds the more concise reasoning trace conclusion "referring expression cue" (see Figure 2) as the open-vocabulary query.

	RSC-ID				RSC-OOD
Model	mIoU	Acc@0.5	Acc@0.7	Cat Acc	mIoU	Acc@0.5	Acc@0.7	Cat Acc
GPT-4o Hurst et al. (2024)	19.41	13.23	5.37	79.45	16.57	9.55	3.08	62.00
Claude 3.7 Anthropic (2025)	16.64	8.32	3.71	89.67	12.04	5.54	1.87	58.98
Grounding DINO (cat token)^‡	44.60	47.55	42.03	—	32.18	31.99	27.89	—
Grounding DINO (ref. cue)^‡	48.99	51.84	46.02	—	38.12	38.26	34.07	—
InternVL2.5 8B Chen et al. (2024a)	16.76	11.88	6.74	81.70	8.08	3.64	1.61	36.50
Qwen3-VL 8B Team (2025)	15.46	11.17	6.05	75.04	7.38	3.70	1.48	46.97
Qwen2.5-VL 7B Bai et al. (2025a)	30.31	27.42	15.66	30.86	21.54	15.88	9.19	20.82
ScenGround (Ours)	55.68	60.90	42.32	94.23	38.37	38.11	22.64	21.13
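For reference, the box metrics in Table 3 can be computed along the following lines. This is a sketch under assumptions: boxes are xywh tuples, scores are reported as percentages, and a prediction counts as correct at threshold $t$ when its IoU is at least $t$ (whether the comparison is inclusive is not stated in the text). The function names are hypothetical.

```python
def iou_xywh(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def box_metrics(preds, golds, thresholds=(0.5, 0.7)):
    """Return (mIoU, {t: Acc@t}) in percent over paired predictions and gold boxes."""
    ious = [iou_xywh(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    miou = 100.0 * sum(ious) / n
    acc = {t: 100.0 * sum(1 for v in ious if v >= t) / n for t in thresholds}
    return miou, acc
```

With one exact match and one prediction at IoU 1/3, `box_metrics` reports an mIoU of about 66.67 and 50.0 for both Acc@0.5 and Acc@0.7.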

Table 4: Ablation of curriculum in IC-GRPO on RSC in-domain (RSC-ID) and out-of-domain (RSC-OOD). Metrics: mIoU and $\mathrm{Acc}@\{0.5,0.7\}$ for boxes; Cat Acc for category naming. Single Stage trains on the union of all RL samples without curriculum; Stage-$k$ trains on the stage-$k$ slice, all other settings unchanged.

	RSC-ID				RSC-OOD
Method	mIoU	Acc@0.5	Acc@0.7	Cat Acc	mIoU	Acc@0.5	Acc@0.7	Cat Acc
Qwen2.5-VL 7B Bai et al. (2025a)	30.31	27.42	15.66	30.86	21.54	15.88	9.19	20.82
SFT	55.01	60.51	42.59	89.05	33.20	34.03	20.36	12.53
GRPO, Single Stage	54.04	59.04	38.43	94.90	37.66	36.99	20.82	20.05
GRPO, Stage 1	55.68	60.95	41.88	94.04	36.93	36.59	21.37	19.00
GRPO, Stage 2	55.68	60.90	42.32	94.23	38.37	38.11	22.64	21.13

5 Experimental Setup

Implementation details.

Annotations are generated by GPT-4o Hurst et al. (2024) and the quality judge relies on Gemini-2.5-Pro Comanici et al. (2025). We train ScenGround on Qwen2.5-VL-7B-Instruct Bai et al. (2025a) in two stages. TP-SFT uses AdamW with learning rate $5{\times}10^{-6}$ for 5 epochs. GRPO Stage 1 uses AdamW (lr $1{\times}10^{-6}$ ) for 5 epochs with $K{=}6$ rollouts per prompt; Stage 2 uses lr $2{\times}10^{-6}$ for 2 epochs with $K{=}12$ rollouts, generation batch 288, temperature $0.9$ , top- $p$ $0.92$ , and max completion length 160. The KL coefficient $\beta$ is initialized at $2{\times}10^{-2}$ and adapted online in both stages. All remaining hyperparameters are provided in Appendix G.

Data for curriculum learning.

We partition the raw instances at the image level into disjoint sets $\mathcal{S}_{\mathrm{SFT}}$ , $\mathcal{S}_{\mathrm{RL}}$ , and $\mathcal{S}_{\mathrm{test}}$ , enforcing no image leakage across splits. After RSC curation filters, we retain 23k instances for SFT and 7k for RL. Difficulty is binned by global quantiles of $D_{i}$ into easy, medium, and hard buckets. The SFT split targets an easy-purity of at least 0.70. RL Stage 1 uses mixture $\Pi^{(1)}{=}(0.70,\,0.30,\,0.00)$ and Stage 2 shifts to $\Pi^{(2)}{=}(0.20,\,0.60,\,0.20)$ for (easy, medium, hard).
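The stage-wise mixtures $\Pi^{(1)}$ and $\Pi^{(2)}$ can be read as sampling weights over the three difficulty buckets. The sketch below illustrates that reading; it assumes tertile cut points for the easy/medium/hard bins (the text only says "global quantiles"), samples with replacement, and uses hypothetical function names.

```python
import random

BUCKETS = ("easy", "medium", "hard")
# (easy, medium, hard) mixtures for RL Stage 1 and Stage 2, as stated above.
STAGE_MIX = {1: (0.70, 0.30, 0.00), 2: (0.20, 0.60, 0.20)}


def bucket_by_quantile(scores):
    """Split instance indices into three rank-based bins of D_i (tertiles assumed)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    buckets = {name: [] for name in BUCKETS}
    for rank, idx in enumerate(order):
        buckets[BUCKETS[min(2, rank * 3 // n)]].append(idx)
    return buckets


def sample_batch(buckets, stage, batch_size, rng):
    """Draw a batch whose bucket proportions follow the stage mixture in expectation."""
    names = rng.choices(BUCKETS, weights=STAGE_MIX[stage], k=batch_size)
    return [rng.choice(buckets[name]) for name in names]
```

For instance, with `rng = random.Random(0)`, a Stage 1 batch never contains an index from the hard bucket because its weight is zero, while a Stage 2 batch draws roughly one fifth of its items from it. Every bucket with non-zero weight must be non-empty for `sample_batch` to succeed.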

6 Experiments

Table 5: Referring expression comprehension on RefCOCO+ and RefCOCOg. Metric: Acc@0.5 (%, $\uparrow$). ^†Uses the official grounding pipeline and prompt Bai et al. (2025a). ^∗Uses a custom prompt.

	RefCOCO+			RefCOCOg
Method	Val	Test A	Test B	Val	Test
Qwen2.5-VL^†	84.20	89.10	76.90	87.20	87.20
Qwen2.5-VL^∗	52.54	56.75	52.53	52.46	51.40
ScenGround^∗, SFT	50.86	43.84	50.74	65.45	63.39
ScenGround^∗, GRPO	70.16	74.63	70.05	78.19	75.61

Comparison on Referring Scenario Comprehension.

Table 3 evaluates two closed-source LLMs Hurst et al. (2024); Anthropic (2025), one open-vocabulary grounding model Liu et al. (2024b), and three open-source LVLMs Bai et al. (2025a); Team (2025); Chen et al. (2024a) on RSC. The results reveal a consistent pattern across baselines: models with strong category accuracy tend to lag on localization, while strong detectors lack semantic reasoning. This trade-off reflects the core challenge RSC is designed to expose: scenario grounding requires jointly strong spatial localization and scenario-aware semantic inference, a combination underemphasized in prior benchmarks.

The Grounding DINO results illustrate the gap between detection-style grounding and full scenario reasoning. Under the oracle cat token setting, which directly supplies the gold category name, Grounding DINO achieves strong localization. Under the ref. cue setting, which supplies a short descriptive non-literal phrase rather than the full scenario, performance improves further, particularly on OOD. These gains confirm that Grounding DINO’s architecture is well-optimized for direct-cue detection, but neither setting reflects the full RSC protocol given to other models.

Among models that receive only the full scenario query $s$ and must jointly predict the target category and bounding box, ScenGround substantially outperforms all baselines on ID mIoU (55.68 vs. 30.31 for Qwen2.5-VL 7B, the strongest such baseline). Closed-source models achieve high category accuracy but poor localization, indicating that understanding a scenario does not automatically translate to spatial grounding. ScenGround's localization gains hold on both ID and OOD splits, and its ID category accuracy (94.23) is the highest in Table 3; its OOD category accuracy (21.13), however, stays close to that of its Qwen2.5-VL 7B base (20.82) and below several baselines. On ID data, this reduces the baseline trade-off between localization and category prediction under the full RSC protocol.

Effectiveness of GRPO and Curriculum Learning.

Table 4 ablates training stages. TP-SFT already delivers a large ID jump over the off-the-shelf VLM Bai et al. (2025a), showing that aligning the model with the reasoning schema and the RSC objective teaches strong box prediction. However, OOD category accuracy remains the dominant failure mode after SFT. Moving to IC-GRPO without curriculum (Single Stage) improves OOD across both localization and category naming, but slightly regresses ID localization relative to SFT. This is consistent with reward sparsity: when easy and hard cases are mixed, many early trajectories fall below the IoU gate that modulates the alias reward, yielding high-variance gradients and modest gains.

The tag-aware curriculum addresses this. Stage 1 (easy to medium) preserves the strong ID accuracy of SFT and improves OOD over SFT, but falls short of Stage 2. Stage 2 (medium to hard) yields the best results across all OOD metrics while keeping ID performance essentially unchanged. As geometric accuracy improves on harder, more ambiguous instances, the policy learns to map scenario cues to both where and what under category shift. However, ScenGround’s OOD category accuracy remains only marginally above the untuned baseline, indicating that cross-category semantic naming is largely unsolved by curriculum training alone. This may be attributed to the fine-grained LVIS categories in the OOD split (see Appendix A), and suggests the need for stronger semantic generalization or open-vocabulary training strategies. Overall, SFT establishes the interface and ID competence, while IC-GRPO with a difficulty-aware curriculum closes the OOD localization gap.

Comparisons on standard referring-expression benchmarks.

Table 5 reports Acc@0.5 on RefCOCO+ Yu et al. (2016) and RefCOCOg Mao et al. (2016). The first row (Qwen2.5-VL^†) reflects the model’s official phrase-grounding pipeline and prompt Bai et al. (2025a). To standardize interfaces for cross-model comparison, we also evaluate Qwen2.5-VL under a custom prompt (row 2, ^∗); this prompt departs from its native detection-style prompt and substantially lowers accuracy, illustrating the sensitivity of phrase-grounding systems to prompt and decoding design.

Under this custom prompt, ScenGround shows two clear trends. First, SFT underperforms on RefCOCO+ but is notably stronger on RefCOCOg, consistent with RefCOCOg’s longer, more descriptive phrases being closer to RSC’s scenario style. Second, IC-GRPO yields large, consistent improvements across all splits, narrowing the gap to task-specialized pipelines. This suggests that geometry and schema rewards, together with the tag-aware curriculum, help the model resolve intra-class ambiguity and attend to disambiguating cues that typical phrase-grounding prompts would otherwise supply explicitly. Overall, while ScenGround is not optimized for RefCOCO-style prompting, its GRPO stage transfers robustly to these benchmarks, suggesting that curriculum training develops transferable reasoning skills rather than overfitting to a specific prompt template.

Analysis on localization ability.

Table 6 isolates localization by asking models to predict a box only, without category prediction. Qwen2.5-VL’s box-only performance closely mirrors its full-task localization, suggesting that category prediction is not its primary bottleneck. By contrast, ScenGround delivers strong OOD localization in the box-only regime, substantially outperforming Qwen2.5-VL and confirming that it can ground scenarios from visual cues even when the target category is unseen. The small gap between ScenGround’s box-only and full-task results further implies that most remaining OOD degradation comes from naming rather than spatial grounding. Notably, SFT is marginally stronger than GRPO in the box-only setting, while GRPO surpasses SFT on the full task. This suggests that IC-GRPO reallocates capacity to couple localization with scenario understanding rather than optimizing pure box regression.

Table 6: Localization under the box-only setting on RSC in-distribution (RSC-ID) and out-of-distribution (RSC-OOD). Metrics are mIoU and $\mathrm{Acc}@\{0.5,0.7\}$.

	RSC-ID			RSC-OOD
Method	mIoU	Acc@0.5	Acc@0.7	mIoU	Acc@0.5	Acc@0.7
Qwen2.5-VL 7B	27.37	24.09	13.42	19.11	13.50	7.70
ScenGround, SFT	51.18	55.80	37.99	40.71	42.65	25.86
ScenGround, GRPO	50.90	54.38	37.01	40.11	40.30	24.50
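
For readers who want to reproduce numbers of the kind reported in Table 6, the two metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes boxes in pixel xywh format, that Acc@$\tau$ counts predictions with IoU $\geq\tau$ (the paper may use a strict inequality), and that values are reported as percentages. The function names `iou_xywh` and `localization_metrics` are hypothetical.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as (x, y, w, h) in pixels."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def localization_metrics(preds, gts, thresholds=(0.5, 0.7)):
    """Mean IoU and Acc@threshold, both as percentages (assumed convention)."""
    ious = [iou_xywh(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    out = {"mIoU": 100.0 * sum(ious) / n}
    for t in thresholds:
        # Assumption: a prediction counts as correct when IoU >= t.
        out[f"Acc@{t}"] = 100.0 * sum(i >= t for i in ious) / n
    return out


# Toy data, not taken from RSC: one exact match and one half-shifted box.
metrics = localization_metrics(
    preds=[(0, 0, 10, 10), (0, 0, 10, 10)],
    gts=[(0, 0, 10, 10), (5, 0, 10, 10)],
)
```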

Analysis on reasoning ability.

Table 7 analyzes category prediction under three settings: Text Only (scenario $s$ only), Text-Image (scenario + image, no box), and Standard (full task). Removing the localization requirement substantially boosts Qwen2.5-VL on RSC-ID but yields a smaller gain on RSC-OOD, showing that its category decisions are sensitive to the localization constraint. By contrast, ScenGround exhibits a small gap between Text-Image and Standard, indicating that it can jointly localize and classify with minimal degradation. Notably, SFT’s strong ID-category accuracy comes at the cost of OOD text-only generalization, while GRPO closes this gap, suggesting that RL improves scenario understanding beyond what schema alignment alone provides. Finally, Text-Image outperforms Text Only across all models and splits, confirming that visual evidence materially helps resolve scenario ambiguities and that models are not relying solely on language priors.

Table 7: Category accuracy on RSC in-distribution (ID) and out-of-distribution (OOD) under three input–output settings. Text Only: predict category from scenario only. Text-Image: predict category from scenario+image (no box). Standard: full task.

	Text Only		Text-Image		Standard
Method	ID	OOD	ID	OOD	ID	OOD
Qwen2.5-VL 7B	56.24	22.34	77.54	33.33	29.65	19.28
ScenGround-SFT	81.03	10.69	93.66	14.92	92.72	12.30
ScenGround-GRPO	85.13	20.22	94.73	22.68	93.71	20.67

Qualitative results.

Figure 4 illustrates how ScenGround grounds user-driven scenarios. In the example shown, ScenGround correctly distinguishes the illustrated animal on the book cover from the real dog by leveraging the contextual cues in the scenario. More qualitative examples are in Appendix F.

7 Conclusion

We introduce RSC, a benchmark for scenario-based visual grounding where targets are identified from paragraph-length queries specifying user roles, goals, and explicit distractor contrasts. RSC provides per-instance difficulty tags, paired reasoning traces, and an OOD split. Evaluations reveal that strong performance on standard referring-expression benchmarks does not transfer to scenario-based queries, exposing a systematic failure mode invisible to prior evaluation protocols. We further propose ScenGround, a curriculum reasoning baseline combining supervised warm-starting with difficulty-aware reinforcement learning, which substantially reduces the localization–semantics trade-off on RSC. We hope RSC provides a useful testbed for studying referential reasoning, with natural extensions toward multi-object, temporal, and interactive grounding settings.

References

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas (2020) Referit3d: neural listeners for fine-grained 3d object identification in real-world scenes. In European conference on computer vision, pp. 422–440. Cited by: §2.
Anthropic (2025) Claude 3.7. Note: Large language model Cited by: Table 3, §6.
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025a) Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Cited by: Appendix G, §2, Table 3, Table 4, §5, §6, §6, §6, Table 5.
S. Bai, M. Li, Y. Liu, J. Tang, H. Zhang, L. Sun, X. Chu, and Y. Tang (2025b) Univg-r1: reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231. Cited by: §2.
D. Z. Chen and A. X. Chang (2020) ScanRefer: 3d object localization in rgb-d scans using natural language. In ECCV, pp. 202–221. Cited by: §2.
K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao (2023a) Shikra: unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195. Cited by: §2.
Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu, et al. (2024a) Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271. Cited by: Table 3, §6.
Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, et al. (2024b) How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821. Cited by: §2.
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. (2024c) Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24185–24198. Cited by: §2.
Z. Chen, P. Wang, L. Ma, K. K. Wong, and Q. Wu (2020) Cited by: Table 1, §2.
Z. Chen, R. Zhang, Y. Song, X. Wan, and G. Li (2023b) Advancing visual grounding with scene knowledge: benchmark and method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15039–15049. Cited by: Table 1, §1, §2.
H. H. Clark and D. Wilkes-Gibbs (1986) Referring as a collaborative process. Cognition 22 (1), pp. 1–39. Cited by: Appendix A.
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261. Cited by: §5.
J. Deng, Z. Yang, T. Chen, W. Zhou, and H. Li (2021) Transvg: end-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1769–1779. Cited by: §2.
Q. Dong, K. Yang, L. Ju, H. Zhao, Y. Zhang, Y. Wang, H. Zeng, J. Lu, and Y. Fu (2026) Ref-adv: exploring mllm visual reasoning in referring expression tasks. arXiv preprint arXiv:2602.23898. Cited by: §2.
H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y. Wang, et al. (2025) Roboverse: towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904. Cited by: §1.
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §4.2.
A. Gupta, P. Dollar, and R. Girshick (2019) Lvis: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5356–5364. Cited by: §3.
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276. Cited by: Appendix B, Table 3, §5, §6.
A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion (2021) Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1780–1790. Cited by: §2.
S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798. Cited by: Table 1, Table 1, §2.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, et al. (2023) Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4015–4026. Cited by: §3.
X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia (2024) Lisa: reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9579–9589. Cited by: §2.
L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, et al. (2022) Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10965–10975. Cited by: §2.
Z. Li, Q. Xu, D. Zhang, H. Song, Y. Cai, Q. Qi, R. Zhou, J. Pan, Z. Li, V. Tu, et al. (2024) Groundinggpt: language enhanced multi-modal grounding model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6657–6678. Cited by: §2.
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.
J. Liu, X. Yang, W. Li, and P. Wang (2024a) Finecops-ref: a new dataset and task for fine-grained compositional referring expression comprehension. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15440–15457. Cited by: Table 1, §2.
R. Liu, C. Liu, Y. Bai, and A. L. Yuille (2019) Clevr-ref+: diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4185–4194. Cited by: §2.
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. (2024b) Grounding dino: marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pp. 38–55. Cited by: §2, §6.
J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20. Cited by: Table 1, §1, §2, §6.
Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei (2023) Kosmos-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824. Cited by: §2.
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §2.
B. Shen, F. Xia, C. Li, R. Martín-Martín, L. Fan, G. Wang, C. Pérez-D’Arpino, S. Buch, S. Srivastava, L. Tchapmi, et al. (2021) IGibson 1.0: a simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7520–7527. Cited by: §1.
P. Sun, J. Xiao, T. H. E. Tse, Y. Li, A. Akula, and A. Yao (2025) Visual intention grounding for egocentric assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2512–2522. Cited by: Table 1, §1, §2.
Q. Team (2025) Qwen3-vl. Note: Vision-Language Model Cited by: Table 3, §6.
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin (2024) Qwen2-vl: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191. Cited by: §2.
Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2023) Self-instruct: aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pp. 13484–13508. Cited by: §3.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: §2.
C. Wu, Z. Lin, S. Cohen, T. Bui, and S. Maji (2020) Phrasecut: language-based image segmentation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10216–10225. Cited by: §2.
J. Wu, M. Zhong, S. Xing, Z. Lai, Z. Liu, Z. Chen, W. Wang, X. Zhu, L. Lu, T. Lu, et al. (2024) Visionllm v2: an end-to-end generalist multimodal large language model for hundreds of vision-language tasks. Advances in Neural Information Processing Systems 37, pp. 69925–69975. Cited by: §2.
C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang (2025) Wizardlm: empowering large pre-trained language models to follow complex instructions. URL https://arxiv.org/abs/2304.12244. Cited by: §3.
S. Yang, G. Li, and Y. Yu (2020) Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9952–9961. Cited by: Table 1, §2.
H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S. Chang, and Y. Yang (2023) Ferret: refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704. Cited by: §2.
K. You, H. Zhang, E. Schoop, F. Weers, A. Swearngin, J. Nichols, Y. Yang, and Z. Gan (2024) Ferret-ui: grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, pp. 240–255. Cited by: §2.
L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg (2016) Modeling context in referring expressions. In ECCV, pp. 69–85. Cited by: §1, §2, §6.
Y. Zhan, S. Zheng, Y. Zhu, H. Zhao, F. Yang, M. Tang, and J. Wang (2025) Griffon v2: advancing multimodal perception with high-resolution scaling and visual-language co-referring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22947–22957. Cited by: §2.

In this appendix, we provide statistical details of RSC in Section A, RSC data curation details in Section B, RSC data examples in Section C, more details of ScenGround in Section D, tag-level analysis in Section E, ScenGround qualitative examples in Section F, training details in Section G, limitations and future work in Section H, and broader impact in Section I.

Appendix A RSC Statistics

Comparisons to RefCOCO+ and RefCOCOg.

We compare the query length and instance size distributions. Figure 5 compares query length distributions across RSC, RefCOCO+, and RefCOCOg, illustrating that RSC’s scenario-based queries are substantially longer than the short literal phrases used in prior benchmarks. Figure 6 compares instance size distributions, showing that RSC covers a broader range of target scales. Together, these distributions reflect RSC’s design goals: longer queries force models to reason over rich contextual descriptions rather than match salient keywords, while the wider size range stresses localization under various controlled conditions. Table 8 further shows that RSC’s unique query vocabulary (9,086 tokens) is roughly twice that of RefCOCO+ and RefCOCOg, confirming that the length difference reflects genuine linguistic diversity rather than repetitive padding.

Table 8: Vocabulary size comparison. Number of unique words across all queries in each dataset’s test split.

Dataset	Unique Vocabulary
RefCOCO+	4,240
RefCOCOg	4,917
RSC (Ours)	9,086

RSC difficulty tag distribution.

Figure 7 shows the difficulty tag distributions for RSC-ID and RSC-OOD test splits. The ID split maintains near-balanced marginals across all five axes, reflecting the tag-balanced sampling strategy described in Section 3. The OOD split shows a notable skew toward non-unique (U2) and smaller (S/M) instances, which partially explains the consistently lower accuracy on the OOD split. Distributions across Clutter (C), Overlap (O), and Position (P) remain broadly comparable between ID and OOD, confirming that the primary challenge of the OOD split is category shift rather than a confounding change in scene difficulty along other axes.

RSC scenario linguistic analysis.

Figure 8 shows the top-10 words per part-of-speech category across RSC scenario queries. The noun distribution is dominated by general referential terms (item, object) and scene participants (man, animal, photographer), confirming that scenarios describe roles and contexts rather than naming target categories directly. The verb distribution reflects the action- and goal-oriented nature of scenario queries, with terms such as distinguish, contrast, highlight, and review indicating that queries explicitly require disambiguation reasoning. The preposition distribution is rich in spatial terms (on, behind, from, by), consistent with RSC’s emphasis on relational and positional cues for grounding. Together, these distributions support the claim that RSC queries demand a qualitatively different form of language understanding than the short attribute-and-location phrase queries in standard REC benchmarks.

RSC category distribution.

Figure 9 lists the 20 most frequent categories in each test split, and Figures 10 and 11 show the top 80 frequency distributions. RSC-ID covers 79 COCO categories across 4,038 instances, with a moderately long-tailed distribution dominated by common household objects such as dining table, cup, and chair. RSC-OOD draws from 395 LVIS categories across 3,247 instances, with a substantially longer-tailed distribution reflecting LVIS’s fine-grained vocabulary, where top categories include clothing items (jersey, jean, shirt) and household furnishings (cabinet, pillow, curtain). Crucially, all OOD categories are disjoint from RSC-ID at both string and synset levels, ensuring that OOD evaluation measures genuine cross-category generalization rather than near-duplicate transfer.

Comparison to natural referential language.

RSC’s linguistic profile aligns with natural referential tendencies identified in prior work. RefCOCOg, the most naturalistically annotated standard benchmark, has an average query length of 8.4 words and a vocabulary of 4,917 unique tokens; RSC extends this tendency toward longer, richer descriptions (52.7 words, 9,086 tokens), consistent with the shift from object-naming to situation-description that characterizes more complex referential contexts. The dominance of relational verbs (distinguish, contrast, highlight) and spatial prepositions (behind, beside, from) in RSC queries mirrors patterns documented in human referential communication studies, where disambiguating reference in cluttered scenes naturally recruits relational and contextual cues rather than bare category names Clark and Wilkes-Gibbs (1986). While RSC scenarios are LLM-generated, the linguistic structure they exhibit reflects these well-documented tendencies rather than purely synthetic artifacts.

To assess ecological validity, we collected human-authored scenario queries for a subset of 100 RSC instances, asking three annotators to describe the target object without naming its category. Writing a scenario took approximately three minutes per instance, and annotators reported finding the task genuinely challenging: composing a coherent third-person description that provides sufficient disambiguating cues without revealing the target category requires the same kind of intentional reasoning that RSC is designed to evaluate. Comparing human-authored and LLM-generated scenarios for the same targets reveals both differences and commonalities. Human scenarios favor the present tense and exhibit more diverse structural patterns, whereas LLM-generated scenarios follow a more consistent template that typically opens with a role description, followed by distractor contrasts and attribute cues. Annotators noted that the synthetic scenarios are less stylistically varied but more detailed and factually accurate.

To verify that both query types convey equivalent referential content, we asked the three annotators to localize the target from the LLM-generated scenario for the same instances they had authored queries for. All annotators successfully identified the correct target from LLM-generated scenarios, confirming that the referential intent is recoverable from either formulation. Example pairs for the same targets are shown in Table 10, illustrating that while surface form differs, the underlying disambiguating content is equivalent. Table 9 shows that model rankings are preserved across both query types and absolute performance levels are comparable, supporting the claim that RSC’s LLM-generated scenarios probe the same grounding capability as human-authored ones. RSC is designed to test whether models can perform this localization task; what matters is that the scenario accurately identifies the target through relational and contextual cues, regardless of whether it is human- or LLM-authored.

Table 9: Model performance on human-authored vs. LLM-generated scenarios for the same 100 RSC instances. The human-authored scenario query is randomly drawn from one of the three annotators. ScenGround performs comparably on human-authored queries, supporting ecological validity and confirming that training does not overfit to LLM-generated patterns.

	LLM-generated		Human-authored
Method	mIoU	Cat. Acc	mIoU	Cat. Acc
Qwen2.5-VL	33.4	31.0	31.3	32.0
ScenGround	58.0	96.0	59.2	98.0

Table 10: Human-authored vs. LLM-generated scenario pairs for the same targets. Both convey equivalent referential content through different surface forms.

Human-authored	LLM-generated
There is a tourist walking down the street who needs to cross at the intersection. You are looking for a green light on a pole, away from the main intersection light. It’s closer to the left side of the street and looks higher than the building.	As a pedestrian waiting to cross, you want to check if it’s safe. You look for the green-lit signal on the rightmost pole, above a blue sign, away from the main intersection lights. Its color, position, and separation from the cluster help you identify it.
A parent is watching from the beach, making sure the kid with dark hair is safe. You notice most of his body is submerged, he’s close to an adult, and he’s the leftmost in the group. Please identify the kid in the crowd.	As a parent watching from the shore, you want to check if your child, who has dark hair and is the leftmost in the group, is safe. They are closest to the water’s edge, mostly submerged, and positioned to the left of the shirtless swimmer.
A student is organizing the desk. They are looking for a green-handled item that sits on the desk. There are orange scissors and colorful pens in a mug, but those are not needed. Find the target object to the left of the mug that can be used to cut things.	As a teacher organizing supplies, you need the small green-handled item lying on the desk, not the larger orange-handled ones in the mug. Its bright green color, flat position, and separation from the mug make it easy to spot.

Appendix B RSC Curation Details

Annotation prompts.

Figure 12 shows the prompt used for RSC annotation generation. For each target, the annotator LLM is shown two views of the same image: the full frame, and the frame with a red rectangle prior $R(b^{\ast})$ over the target region (the rectangle is not part of the object). The prompt supplies context for reasoning only (category hint, image size $H{\times}W$, pixel box $[x,y,w,h]$, and its normalized form $[x/W,y/H,w/W,h/H]$) and then instructs the model to (i) write one category-free referring expression ($\leq 25$ words), (ii) write one user-driven scenario (3–5 sentences specifying role/goal and $\geq 3$ disambiguating cues from attributes/relations/position/affordance), and (iii) produce a brief thought explaining how the scenario maps to visual evidence. Outputs must be JSON-only with the fixed schema: object_name, acceptable_names (alias list for internal validation), bbox_normalized $[\tfrac{x}{W},\tfrac{y}{H},\tfrac{w}{W},\tfrac{h}{H}]$, object_attributes {color, material, shape, texture, size, position, affordance, relation}, referring_expression, scenario, and thought. The template enforces JSON validity, forbids category leakage in the expression/scenario, requires at least one explicit contrast with plausible distractors, and discourages coordinates in prose in favor of relational cues.
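
As a small worked example of the bbox_normalized field, horizontal quantities ($x$, $w$) are divided by the image width $W$ and vertical quantities ($y$, $h$) by the image height $H$. The helper name `normalize_xywh` and the sample values are illustrative assumptions, not part of the annotation pipeline.

```python
def normalize_xywh(box, width, height):
    """Map a pixel box (x, y, w, h) to [x/W, y/H, w/W, h/H]."""
    x, y, w, h = box
    return [x / width, y / height, w / width, h / height]


# Hypothetical 640x480 image with a 128x96 box whose top-left corner is (64, 48).
bbox_normalized = normalize_xywh((64, 48, 128, 96), width=640, height=480)
```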

Filtering and quality judge.

Filtering proceeds in two steps. First, we compute the IoU between the predicted box and the GT bounding boxes from Phase 2. Second, we provide the image, category label, and the generated acceptable name list to GPT-4o Hurst et al. (2024). To reduce the risk of annotation leakage from the localization prior, we explicitly instruct the generator to avoid coordinate-based descriptions in favor of relational cues (e.g., beside, behind, closest to), and we also apply a leakage-detection check that rejects any scenario that references the rectangle or uses spatial coordinates verbatim.

Figure 13 shows the prompt used by the quality judge in Phase 3. The system prompt instructs the judge to act as a rigorous visual-grounding quality assessor, determining whether the scenario uniquely and unambiguously identifies exactly one object in the image. The evaluation prompt asks the judge to assess three criteria: Unique (whether exactly one object in the image plausibly matches the scenario), Accuracy (whether the bounding box correctly and tightly encloses the matching object), and Coherent (whether the scenario is internally consistent with no contradictory cues). The judge returns a structured JSON response including per-criterion verdicts, a plausible candidate count, a confidence score, and a one-sentence reason. An instance is kept if and only if all three criteria are satisfied (Unique $\wedge$ Accuracy $\wedge$ Coherent); borderline cases with low confidence enter the judge refinement loop described in Section 3.

Human audit.

Human verification follows the same three-criterion protocol as the quality judge: annotators assess Unique, Accuracy, and Coherent for each instance, with a keep decision requiring all three to be satisfied. Three computer science PhD students conducted the audit independently, without knowledge of each other’s decisions. To reduce cultural and linguistic bias, annotators were recruited from three different countries (two male, one female). Each annotator reviewed a stratified random sample of 100 instances per audit round. Majority-vote accuracy was 95.6% overall (per-rater: 94%/96%/97%) with high inter-annotator agreement (Fleiss’ $\kappa=0.94$).
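
For reference, Fleiss' $\kappa$ for a fixed number of raters per item can be computed as in the sketch below. The count matrix is toy data chosen for illustration and is not the audit data; the function name `fleiss_kappa` is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters n >= 2.
    """
    num_items = len(counts)
    n = sum(counts[0])
    num_cats = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (num_items * n) for j in range(num_cats)]
    # Per-item observed agreement.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / num_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 3 raters, 2 categories (keep / reject), 4 items.
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```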

Appendix C Data Examples

We include additional data examples to illustrate the user-driven scenarios, reasoning traces, and tag structure in RSC. Figure 14 samples from the training split and, for each instance, shows the natural image, the user-driven scenario, and the reasoning process. From the held-out test split, Figures 15, 16, and 17 present easy, medium, and hard examples, respectively, reflecting increasing ambiguity (U2), scene clutter (C2/C3), occlusion/overlap (O1/O2), and off-center placement (P1). All examples satisfy the curation gates (alias consistency and IoU threshold; see main text) and omit category names to keep the scenarios category-free.

Appendix D ScenGround: Implementation Details

This appendix gives a complete, symbol-level description of ScenGround. All equations correspond one-to-one with our implementation; concrete hyperparameter values appear in Section 5.

D.1 Thought-Primed SFT (TP-SFT)

Objective.

Given image $x$ , scenario $s$ , and target text $\mathbf{y}$ (the concatenation of <think> and <answer> spans), we minimize:

\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,s,\mathbf{y})\sim\mathcal{D}_{\mathrm{sft}}}\left[\sum_{t=1}^{|\mathbf{y}|}\log p_{\theta}\!\left(y_{t}\mid x,s,\mathbf{y}_{<t}\right)\right].

(5)
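
Eq. (5) is the standard token-level negative log-likelihood. A minimal sketch, assuming the per-token probabilities $p_{\theta}(y_{t}\mid x,s,\mathbf{y}_{<t})$ have already been read off the model and approximating the expectation by a batch mean (function names are hypothetical):

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one target sequence: -sum_t log p(y_t | ...)."""
    return -sum(math.log(p) for p in token_probs)


def sft_loss(batch_token_probs):
    """Batch-mean estimate of the expectation in Eq. (5)."""
    return sum(sequence_nll(tp) for tp in batch_token_probs) / len(batch_token_probs)


# Toy probabilities for two target sequences of lengths 2 and 1.
loss = sft_loss([[0.5, 0.25], [1.0]])
```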

Output schema.

The instruction requires a single JSON inside <answer>…</answer> with two keys: target_object (canonical category string) and bbox (four pixel integers in xywh, clipped to image bounds). Several key aliases are accepted at scoring time for robustness.

Curriculum.

TP-SFT draws from the easy RSC slice (instances with $D_{i}$ below the easy-percentile threshold $\delta_{\mathrm{easy}}$ ) to stabilize schema and trace learning before RL. The trained checkpoint is saved as the reference policy $\pi_{\mathrm{ref}}$ that anchors KL regularization in Stage 2.

D.2 Incentive-Curriculum GRPO (IC-GRPO)

GRPO objective.

For each item $(x_{i},s_{i})$ we sample $K$ completions $\{c_{i,k}\}_{k=1}^{K}\sim\pi_{\theta}(\cdot\mid x_{i},s_{i})$ and compute group-relative advantages:

A_{i,k}=r_{i,k}-\frac{1}{K}\sum_{k^{\prime}=1}^{K}r_{i,k^{\prime}}.

(6)
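
Eq. (6) mean-centers the rewards within each group of $K$ completions. Note that, as written, it does not divide by the group standard deviation, which some GRPO variants do. A minimal sketch with a hypothetical function name:

```python
def group_relative_advantages(rewards):
    """A_{i,k} = r_{i,k} - mean_k'(r_{i,k'}) for one prompt's K sampled completions."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Toy rewards for K = 4 completions of the same (image, scenario) pair.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the advantages sum to zero within a group, completions are only rewarded relative to their siblings for the same item.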

The KL-regularized objective is:

\mathcal{J}(\theta)=\mathbb{E}_{i}\!\left[\frac{1}{K}\sum_{k=1}^{K}A_{i,k}\log\pi_{\theta}\!\left(c_{i,k}\mid x_{i},s_{i}\right)\right]-\beta\,\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right).

(7)

The KL coefficient $\beta$ is adapted online by the AdaptiveKLScheduler: if the observed KL exceeds the target band $[\kappa_{\mathrm{tgt}}-\kappa_{\mathrm{tol}},\,\kappa_{\mathrm{tgt}}+\kappa_{\mathrm{tol}}]$ , $\beta$ is multiplied by $\mu_{\uparrow}$ ; if it falls below, by $\mu_{\downarrow}$ ; and $\beta$ is clipped to $[\beta_{\min},\beta_{\max}]$ .
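
The update rule described above can be sketched as a single function. This is an illustration of the stated rule only; the actual AdaptiveKLScheduler interface is not given in the text, and the function name, argument names, and the numeric values in the example are assumptions.

```python
def update_beta(beta, observed_kl, kl_target, kl_tol, mu_up, mu_down, beta_min, beta_max):
    """One adaptive-KL step: scale beta when KL leaves the target band, then clip."""
    if observed_kl > kl_target + kl_tol:
        beta *= mu_up        # KL too large: strengthen the penalty
    elif observed_kl < kl_target - kl_tol:
        beta *= mu_down      # KL too small: relax the penalty
    return min(max(beta, beta_min), beta_max)


# Hypothetical settings, chosen only to exercise each branch.
cfg = dict(kl_target=0.1, kl_tol=0.02, mu_up=2.0, mu_down=0.5, beta_min=0.01, beta_max=1.0)
beta_after_high_kl = update_beta(0.1, observed_kl=0.5, **cfg)
beta_after_low_kl = update_beta(0.1, observed_kl=0.01, **cfg)
beta_in_band = update_beta(0.1, observed_kl=0.11, **cfg)
```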

Box normalization.

A predicted 4-vector $\widehat{b}=[x,y,a,b^{\prime}]$ is mapped to a valid box $\widetilde{b}\in\mathbb{Z}_{\geq 0}^{4}$ by the following priority logic:

1. Prefer xywh: if $(x,y)$ lie inside the image and $(a,b^{\prime})$ are valid width/height that fit within bounds, interpret as $(x,y,w,h)$.

2. Try xyxy: else if $a>x$ and $b^{\prime}>y$ and all values are within image bounds, convert $(x_{1},y_{1},x_{2},y_{2})$ to xywh.

3. Clamp: otherwise clamp aggressively to image bounds with minimum $1$-px side length.

The out-of-bounds indicator is $\mathbb{1}_{\mathrm{OOB}}=\mathbf{1}\{\widehat{b}\neq\widetilde{b}\}$ . All IoU values are computed in xyxy after conversion.
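
The priority logic above can be sketched as follows. Several details are assumptions rather than statements from the text: inputs are rounded to integers first, "inside the image" is read as $0\leq x<W$ and $0\leq y<H$, and the clamp branch interprets the vector as xywh before clamping. The function returns the branch taken instead of the out-of-bounds indicator, so the indicator's exact definition is left to the text.

```python
def normalize_box(pred, width, height):
    """Map a predicted 4-vector to a valid integer xywh box; also return the branch used."""
    x, y, a, b = (int(round(v)) for v in pred)

    # 1) Prefer xywh: origin inside the image, positive size, box fits within bounds.
    if 0 <= x < width and 0 <= y < height and a >= 1 and b >= 1 \
            and x + a <= width and y + b <= height:
        return (x, y, a, b), "xywh"

    # 2) Try xyxy: second corner strictly beyond the first, all values within bounds.
    if x >= 0 and y >= 0 and a > x and b > y and a <= width and b <= height:
        return (x, y, a - x, b - y), "xyxy"

    # 3) Clamp (assumed xywh reading): force the origin inside the image
    #    and keep each side at least 1 px.
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    w = min(max(a, 1), width - x)
    h = min(max(b, 1), height - y)
    return (x, y, w, h), "clamp"


# Toy predictions on a hypothetical 100x100 image, one per branch.
box_xywh = normalize_box([10, 10, 20, 20], 100, 100)
box_xyxy = normalize_box([50, 50, 90, 90], 100, 100)
box_clamped = normalize_box([-5, 120, 300, 0], 100, 100)
```

The ordering matters: a vector such as [10, 10, 20, 20] is valid under both readings, and the priority rule resolves it as xywh.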

Geometry reward.

Let $\sigma(z)=(1+e^{-z})^{-1}$ be the logistic function, $c(\cdot)$ the box center, and $\mathrm{diag}=\sqrt{W^{2}+H^{2}}$ the image diagonal. Define the normalized center distance $d=\|c(\widetilde{b})-c(b^{\ast})\|_{2}/\mathrm{diag}$ . The geometry reward is:

r_{\mathrm{iou}}=\min\!\Big\{1,\;\underbrace{\mathrm{IoU}(\widetilde{b},b^{\ast})}_{\text{base}}+\alpha_{1}\,\sigma\!\left(\tfrac{\mathrm{IoU}(\widetilde{b},b^{\ast})-\tau_{1}}{\kappa}\right)+\alpha_{2}\,\sigma\!\left(\tfrac{\mathrm{IoU}(\widetilde{b},b^{\ast})-\tau_{2}}{\kappa}\right)+\alpha_{c}\,\exp\!\left(-\tfrac{d^{2}}{2\sigma_{c}^{2}}\right)\Big\}-\alpha_{\mathrm{oob}}\,\mathbb{1}_{\mathrm{OOB}},

(8)

where $\tau_{1}<\tau_{2}$ are the two IoU operating points, $\kappa$ controls logistic width, $\alpha_{1},\alpha_{2}$ are the corresponding bonus magnitudes, $\alpha_{c}$ and $\sigma_{c}$ govern the center-consistency term, and $\alpha_{\mathrm{oob}}$ is the out-of-bounds penalty.
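
A direct transcription of Eq. (8) is sketched below, assuming boxes in pixel xywh format. The hyperparameter values in `HP` are placeholders picked for the example and are not the values used in the paper; the out-of-bounds flag is passed in by the caller.

```python
import math

# Placeholder hyperparameters (assumptions, not the paper's settings).
HP = dict(tau1=0.5, tau2=0.75, kappa=0.05, alpha1=0.1, alpha2=0.1,
          alpha_c=0.05, sigma_c=0.1, alpha_oob=0.2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def iou_xywh(a, b):
    """IoU of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def geometry_reward(pred, gt, width, height, oob, *, tau1, tau2, kappa,
                    alpha1, alpha2, alpha_c, sigma_c, alpha_oob):
    """Eq. (8): IoU base + two logistic bonuses + center term, capped at 1, minus OOB penalty."""
    iou = iou_xywh(pred, gt)
    diag = math.hypot(width, height)
    dx = (pred[0] + pred[2] / 2.0) - (gt[0] + gt[2] / 2.0)
    dy = (pred[1] + pred[3] / 2.0) - (gt[1] + gt[3] / 2.0)
    d = math.hypot(dx, dy) / diag  # normalized center distance
    shaped = (iou
              + alpha1 * sigmoid((iou - tau1) / kappa)
              + alpha2 * sigmoid((iou - tau2) / kappa)
              + alpha_c * math.exp(-d * d / (2.0 * sigma_c ** 2)))
    return min(1.0, shaped) - alpha_oob * (1.0 if oob else 0.0)


# Toy boxes on a hypothetical 100x100 image.
r_perfect = geometry_reward((10, 10, 20, 20), (10, 10, 20, 20), 100, 100, False, **HP)
r_perfect_oob = geometry_reward((10, 10, 20, 20), (10, 10, 20, 20), 100, 100, True, **HP)
r_far = geometry_reward((0, 0, 10, 10), (80, 80, 10, 10), 100, 100, False, **HP)
```

The cap at 1 applies before the penalty, so an out-of-bounds prediction can never reach the maximum reward even with a perfect post-clamp overlap.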

Category reward.

Let $\mathcal{A}^{\mathrm{can}}$ be the canonical name set and $\mathcal{A}^{\mathrm{norm}}$ the normalized alias set. Define an IoU gate:

g_{\mathrm{iou}}=\begin{cases}1,&\mathrm{IoU}(\widetilde{b},b^{\ast})\geq\tau_{g},\\ g,&\mathrm{IoU}(\widetilde{b},b^{\ast})<\tau_{g},\end{cases}

(9)

where $\tau_{g}$ is the gate threshold and $g<1$ is the halving factor. With token-level Jaccard similarity $\mathrm{Jac}(\cdot,\cdot)$ computed after lowercasing and light stemming:

r_{\mathrm{cat}}=g_{\mathrm{iou}}\times\begin{cases}1,&\widehat{y}\in\mathcal{A}^{\mathrm{can}},\\[3.0pt] \eta,&\widehat{y}\in\mathcal{A}^{\mathrm{norm}},\\[3.0pt] \rho_{l}+\rho_{s}\,\max\limits_{a\in\mathcal{A}^{\mathrm{norm}}}\mathrm{Jac}(\widehat{y},a),&\text{otherwise,}\end{cases}

(10)

where $\eta$ is the alias credit, and $\rho_{l},\rho_{s}$ map Jaccard $\in[0,1]$ to the soft-overlap range $[\rho_{l},\,\rho_{l}+\rho_{s}]$ .
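
Eqs. (9) and (10) can be sketched as below. The text does not specify the stemmer, so a crude plural-stripping rule stands in for "light stemming"; string membership is tested after lowercasing and whitespace normalization. All function names and numeric values are assumptions made for this example.

```python
def _norm(name):
    return " ".join(name.lower().split())


def _tokens(name):
    # Crude stand-in for "light stemming": drop a trailing "s" on longer tokens.
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in _norm(name).split()}


def jaccard(a, b):
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def category_reward(pred, canonical, aliases, iou, *, tau_g, g, eta, rho_l, rho_s):
    """Eq. (10) scaled by the IoU gate of Eq. (9)."""
    gate = 1.0 if iou >= tau_g else g
    p = _norm(pred)
    if p in {_norm(c) for c in canonical}:
        base = 1.0                      # exact canonical match
    elif p in {_norm(a) for a in aliases}:
        base = eta                      # alias credit
    else:
        best = max((jaccard(p, a) for a in aliases), default=0.0)
        base = rho_l + rho_s * best     # soft token overlap
    return gate * base


# Placeholder settings and names, for illustration only.
cfg = dict(tau_g=0.5, g=0.5, eta=0.8, rho_l=0.0, rho_s=0.5)
canonical = {"dining table"}
aliases = {"dining table", "table", "dinner table"}
r_exact = category_reward("Dining Table", canonical, aliases, iou=0.9, **cfg)
r_alias = category_reward("table", canonical, aliases, iou=0.9, **cfg)
r_alias_gated = category_reward("table", canonical, aliases, iou=0.2, **cfg)
r_soft = category_reward("wooden table", canonical, aliases, iou=0.9, **cfg)
```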

Format and structure rewards.

Let $\delta_{\mathrm{tag}}\in\{0,1\}$ indicate the presence of answer tags, $\delta_{\mathrm{json}}\in\{0,1\}$ whether a parseable JSON is found inside, and $\delta_{\mathrm{keys}}\in\{0,1\}$ whether both a bbox key and a class key are present:

r_{\mathrm{fmt}}=\begin{cases}+1,&\delta_{\mathrm{tag}}=1\;\text{and}\;\delta_{\mathrm{json}}=1,\\ -1,&\text{otherwise,}\end{cases}

(11)

r_{\mathrm{struct}}=\max\!\left\{\gamma_{\mathrm{tag}}\,\delta_{\mathrm{tag}}+\gamma_{\mathrm{key}}\,\delta_{\mathrm{keys}},\;\gamma_{\min}\right\},

(12)

where $\gamma_{\mathrm{tag}}$, $\gamma_{\mathrm{key}}$, and $\gamma_{\min}$ are the tag, key, and floor coefficients.
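A minimal sketch of Eqs. (11) and (12) is given below, with the coefficients from Appendix G. The answer-tag name `<answer>` and the literal key names `bbox` and `class` are assumptions made for illustration, as is the function name; the text above only states that answer tags, a bbox key, and a class key are checked.

```python
import json
import re

# Assumption: answers are wrapped in <answer>...</answer>; the tag name is illustrative.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_and_structure(completion, gamma_tag=0.25, gamma_key=0.75, gamma_min=-0.50):
    """Return (r_fmt, r_struct) for one model completion."""
    match = ANSWER_RE.search(completion)
    d_tag = 1 if match else 0
    d_json, d_keys = 0, 0
    if match:
        try:
            payload = json.loads(match.group(1))
            d_json = 1
            # Assumption: the required keys are literally named "bbox" and "class".
            if isinstance(payload, dict) and "bbox" in payload and "class" in payload:
                d_keys = 1
        except json.JSONDecodeError:
            pass  # tags present but the content is not parseable JSON
    r_fmt = 1.0 if (d_tag and d_json) else -1.0
    r_struct = max(gamma_tag * d_tag + gamma_key * d_keys, gamma_min)
    return r_fmt, r_struct
```

With the Appendix G coefficients the weighted sum in Eq. (12) is never negative, so the floor $\gamma_{\min}$ does not bind in this sketch.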

Total reward and weight annealing.

The scalar reward is:

$\displaystyle r=w_{\mathrm{iou}}\,r_{\mathrm{iou}}+w_{\mathrm{cat}}\,r_{\mathrm{cat}}+w_{\mathrm{fmt}}\,r_{\mathrm{fmt}}+w_{\mathrm{struct}}\,r_{\mathrm{struct}}.$	(13)

Weights are linearly annealed from start to late values over the first $p_{\mathrm{anneal}}$ fraction of training:

$\displaystyle w_{h}(s)=w_{h}^{\mathrm{start}}+p(s)\,\bigl(w_{h}^{\mathrm{late}}-w_{h}^{\mathrm{start}}\bigr),\qquad p(s)=\min\!\left\{1,\,\frac{s}{p_{\mathrm{anneal}}\cdot S}\right\},$	(14)

where $h\in\{\mathrm{iou},\mathrm{cat},\mathrm{fmt},\mathrm{struct}\}$ indexes the reward components, $s$ is the global step, and $S$ is the total number of steps.
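The annealing rule in Eq. (14) and the weighted sum in Eq. (13) amount to a few lines of arithmetic. The sketch below uses the Stage 2 start and late weights and $p_{\mathrm{anneal}}=0.60$ reported in Appendix G; the function and variable names are hypothetical.

```python
STAGE2_START = {"iou": 0.55, "cat": 0.25, "fmt": 0.12, "struct": 0.08}
STAGE2_LATE = {"iou": 0.75, "cat": 0.20, "fmt": 0.04, "struct": 0.01}

def annealed_weights(step, total_steps, start, late, p_anneal=0.60):
    """Eq. (14): linear interpolation from start to late weights, then constant."""
    p = min(1.0, step / (p_anneal * total_steps))
    return {h: start[h] + p * (late[h] - start[h]) for h in start}

def total_reward(components, weights):
    """Eq. (13): weighted sum of the four reward components."""
    return sum(weights[h] * components[h] for h in weights)
```

For a run of 1000 steps, `annealed_weights(300, 1000, STAGE2_START, STAGE2_LATE)` is halfway through the schedule, which gives an IoU weight of about 0.65. Because both endpoints sum to 1, the interpolated weights sum to 1 at every step.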

Tag-aware curriculum.

IC-GRPO samples RSC using difficulty scores $D_{i}$ and tag-based marginals. Let $\Pi^{(m)}=(\pi^{(m)}_{\mathrm{easy}},\,\pi^{(m)}_{\mathrm{med}},\,\pi^{(m)}_{\mathrm{hard}})$ denote the easy/medium/hard sampling mixture for RL stage $m$. The mixture shifts between stages to progressively expose the policy to harder U2, C3, O2, and P1 instances, addressing reward sparsity by first establishing reliable IoU signals on easier cases.
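One simple way to realize such a stage-wise mixture is to draw a difficulty bucket according to $\Pi^{(m)}$ and then draw an instance uniformly within it. The sketch below does this with the mixtures from Appendix G. It assumes instances have already been assigned to easy/medium/hard pools, and it omits the tag-based marginals; the names `MIXTURES`, `BUCKETS`, and `sample_batch` are hypothetical.

```python
import random

# Stage-wise (easy, medium, hard) mixtures as reported in Appendix G.
MIXTURES = {1: (0.70, 0.30, 0.00), 2: (0.20, 0.60, 0.20)}
BUCKETS = ("easy", "med", "hard")

def sample_batch(pools, stage, batch_size, rng):
    """Pick a bucket per slot from the stage mixture, then a uniform instance from it.

    Assumption: `pools` maps each bucket name to a non-empty list of instances
    that were bucketed beforehand from their difficulty scores.
    """
    names = rng.choices(BUCKETS, weights=MIXTURES[stage], k=batch_size)
    return [rng.choice(pools[name]) for name in names]
```

A call such as `sample_batch(pools, 1, 240, random.Random(0))` then yields a Stage 1 batch with no hard instances, since the hard bucket has zero weight in $\Pi^{(1)}$.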

Prompt-template ensemble (PTE-8).

At each training step we uniformly sample one of $T$ prompt paraphrases. All templates enforce the same output schema; rewards are logged per template for analysis but supervision is identical, improving robustness to surface query variation without changing the learning objective.

Appendix E Tag-level Analysis

Table 11 breaks down performance by difficulty tag on RSC-OOD. For mIoU, GRPO improves over SFT on every tag. Relative to the baseline, the largest relative gain is on small instances (S: 2.56 $\to$ 17.18, roughly $6.7\times$), and high-overlap instances also improve substantially (O2: 29.40 $\to$ 43.74), indicating that the difficulty-aware curriculum meaningfully improves localization on the hardest cases. Large objects (L) achieve the highest absolute mIoU (60.67), as expected from their visual salience. For category accuracy, a notable pattern emerges: SFT drops below the baseline on all 13 tags, indicating that schema alignment reduces OOD category generalization. GRPO improves over SFT on every tag and surpasses the baseline on 8 of the 13 tags (U2, C1, C2, S, M, O0, O1, P0), demonstrating that RL restores, and in most cases extends, semantic understanding on unseen categories beyond what SFT alone provides. High-overlap instances (O2) have the lowest category accuracy under both SFT (8.91) and GRPO (16.21), suggesting that visual congestion impairs localization and semantic inference simultaneously.

Table 11: Tag-level analysis on RSC-OOD. mIoU and category accuracy (Cat Acc) broken down by difficulty tag for the baseline (Qwen2.5-VL), SFT, and GRPO.

	mIoU (%)			Cat Acc (%)
Tag	Baseline	SFT	GRPO	Baseline	SFT	GRPO
U1	27.24	41.71	46.78	26.24	13.57	22.20
U2	16.98	26.44	31.67	16.52	11.71	20.28
C1	22.85	36.06	39.54	12.32	10.53	19.26
C2	21.30	33.00	39.29	19.93	12.22	21.33
C3	20.87	31.44	36.40	27.82	14.31	22.18
S	2.56	14.85	17.18	14.85	12.41	21.87
M	16.50	33.49	38.78	20.56	12.60	21.00
L	49.59	52.64	60.67	27.67	12.57	20.51
O0	18.03	30.11	34.69	20.30	14.81	22.77
O1	21.18	33.53	39.29	19.80	12.08	21.83
O2	29.40	38.73	43.74	24.15	8.91	16.21
P0	16.38	33.01	37.36	21.34	13.96	22.38
P1	26.77	33.39	39.38	20.29	11.08	19.85

Appendix F Qualitative Examples

Figure 18 presents further qualitative examples illustrating how our ScenGround-trained model uses scenario-aware reasoning to improve grounding performance. For each case, we show the input image together with the user scenario, the ground-truth target region, and the model’s prediction. The examples span challenging settings such as cluttered scenes, small or partially occluded objects, and pragmatically underspecified descriptions.

Appendix G Experimental Setup

Base model.

All experiments use Qwen2.5-VL-7B-Instruct (Bai et al., 2025a) as the backbone, with images resized to $512{\times}512$.

TP-SFT hyperparameters.

We use the AdamW optimizer with learning rate $5{\times}10^{-6}$, a cosine schedule with warmup ratio $0.15$, 5 epochs, and $4$ gradient accumulation steps. The SFT split targets easy-purity $\delta_{\mathrm{easy}}\geq 0.70$.

IC-GRPO hyperparameters.

We run two RL stages. Stage 1: AdamW, lr $1{\times}10^{-6}$, cosine schedule, warmup ratio $0.05$, 5 epochs, $K{=}6$ rollouts per prompt, generation batch 240, temperature $0.8$, top-$p$ $0.95$, max completion length 256. Stage 2: AdamW, lr $2{\times}10^{-6}$, cosine schedule, warmup ratio $0.05$, 2 epochs, $K{=}12$ rollouts per prompt, generation batch 288, temperature $0.9$, top-$p$ $0.92$, max completion length 160. Both stages use DeepSpeed ZeRO-3, gradient checkpointing, and bf16 precision.

Adaptive KL scheduler.

Initial $\beta_{0}=2{\times}10^{-2}$. Stage 1: $\kappa_{\mathrm{tgt}}{=}0.13$, $\kappa_{\mathrm{tol}}{=}0.03$, $\mu_{\uparrow}{=}1.5$, $\mu_{\downarrow}{=}0.66$. Stage 2: $\kappa_{\mathrm{tgt}}{=}0.15$, $\kappa_{\mathrm{tol}}{=}0.03$, $\mu_{\uparrow}{=}1.6$, $\mu_{\downarrow}{=}0.66$. Both stages clip $\beta\in[5{\times}10^{-4},\,5{\times}10^{-2}]$.
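To make the role of these constants concrete, the sketch below shows one plausible band-based update. It rests on an assumption about the rule itself, which is not restated in this appendix: $\beta$ is multiplied by $\mu_{\uparrow}$ when the observed KL exceeds $\kappa_{\mathrm{tgt}}+\kappa_{\mathrm{tol}}$, multiplied by $\mu_{\downarrow}$ when it falls below $\kappa_{\mathrm{tgt}}-\kappa_{\mathrm{tol}}$, left unchanged inside the band, and then clipped. The function name `update_beta` is hypothetical.

```python
def update_beta(beta, observed_kl, kl_target, kl_tol, mu_up, mu_down,
                beta_min=5e-4, beta_max=5e-2):
    """One scheduler step under the assumed band rule (not confirmed by the source)."""
    if observed_kl > kl_target + kl_tol:
        beta *= mu_up      # policy drifted too far: strengthen the KL penalty
    elif observed_kl < kl_target - kl_tol:
        beta *= mu_down    # policy is overly constrained: relax the KL penalty
    return min(beta_max, max(beta_min, beta))
```

With the Stage 1 settings, an observed KL of 0.20 lies above the band $[0.10, 0.16]$, so under this rule `update_beta(0.02, 0.20, 0.13, 0.03, 1.5, 0.66)` raises $\beta$ from 0.02 to about 0.03.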

Geometry reward parameters.

IoU operating points $\tau_{1}{=}0.50$, $\tau_{2}{=}0.70$; logistic width $\kappa{=}0.03$; bonus magnitudes $\alpha_{1}{=}0.30$, $\alpha_{2}{=}0.50$; center coefficient $\alpha_{c}{=}0.02$ with bandwidth $\sigma_{c}{=}0.20$ (as a fraction of image diagonal); OOB penalty $\alpha_{\mathrm{oob}}{=}0.05$.

Category reward parameters.

IoU gate threshold $\tau_{g}{=}0.30$, gate factor $g{=}0.5$; alias credit $\eta{=}0.80$; soft-overlap range $\rho_{l}{=}0.40$, $\rho_{s}{=}0.30$ (mapping Jaccard $\in[0,1]$ to $[0.40,\,0.70]$).

Format and structure reward parameters.

$\gamma_{\mathrm{tag}}{=}0.25$, $\gamma_{\mathrm{key}}{=}0.75$, $\gamma_{\min}{=}{-}0.50$.

Reward weight schedule.

Annealing completes after a fraction $p_{\mathrm{anneal}}{=}0.60$ of the total training steps. Concrete start and late values are listed in Table 12.

Table 12: Reward weight schedules. Stage 2 weights are set by the RewardWeightScheduler from step 0, overriding the initialization values.

	$w_{\mathrm{iou}}$	$w_{\mathrm{cat}}$	$w_{\mathrm{fmt}}$	$w_{\mathrm{struct}}$
Stage 1	0.75	0.15	0.07	0.03
Stage 2 (start)	0.55	0.25	0.12	0.08
Stage 2 (late)	0.75	0.20	0.04	0.01

Curriculum mixtures.

Stage 1 RL uses $\Pi^{(1)}{=}(0.70,\,0.30,\,0.00)$; Stage 2 RL uses $\Pi^{(2)}{=}(0.20,\,0.60,\,0.20)$ for (easy, medium, hard) respectively.

Prompt-template ensemble.

$T{=}8$ paraphrase templates (PTE-8).

Appendix H Limitations and Future Work

RSC scenarios are LLM-generated rather than collected from real users. Although our ecological validity study shows that model rankings are preserved across human- and LLM-authored queries for the same instances, the synthetic scenarios exhibit less stylistic diversity than naturalistic language. RSC is also limited to single-target grounding on MS-COCO and LVIS images, so results may not generalize to specialized domains or to multi-object, video-based, and interactive settings. Future work could complement RSC with human-authored queries to strengthen ecological validity, extend the benchmark to diverse image domains, and explore multi-object and dialogue-based scenario grounding.

Appendix I Broader Impact

RSC is intended to support the development of multimodal models that reason over rich natural language, with applications in embodied assistants and accessibility tools. The benchmark inherits demographic and geographic biases from MS-COCO and LVIS, and LLM-generated scenarios may carry residual hallucinations or culturally specific assumptions despite quality controls. We release annotation logs and metadata to support bias analysis. RSC contains no personally identifiable information, and we do not anticipate direct misuse risks, though we encourage downstream users to consider the implications of scenario-based grounding capabilities in sensitive deployment contexts.