RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

1Xi'an Jiaotong University, 2Institute of Artificial Intelligence (TeleAI), China Telecom, 3Shanghai Jiao Tong University, 4University of Science and Technology Beijing
*Equal Contributions, Corresponding Authors
Overview of our RefBench-PRO benchmark and the underlying RefObjects-200k dataset. Starting from 12 million high-resolution images in FineHARD, we construct RefObjects-200k, a challenging referring expression comprehension dataset with 203,985 high-quality instances spanning two core dimensions—perception and reasoning—which are further decomposed into six sub-dimensions. RefBench-PRO then selects 6,000 carefully curated samples from RefObjects-200k, 1,000 per category, to rigorously evaluate the referring expression comprehension capabilities of modern MLLMs.

Abstract

Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, so they cannot reveal the grounding capabilities of Multi-modal Large Language Models (MLLMs) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark that decomposes referring expressions into two core dimensions, i.e., perception and reasoning, and further subdivides them into six progressively challenging tasks: attribute, position, interaction, commonsense, relation, and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, we propose Ref-R1, an RL-based learning scheme that incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions, establishing a stronger baseline for REC. Extensive experiments demonstrate that RefBench-PRO enables interpretable evaluation of MLLMs on referring expression comprehension, posing greater challenges in both perception and reasoning.
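
To make the reward design concrete, below is a minimal sketch of the kind of IoU-based reward a Dynamic IoU-based GRPO scheme such as Ref-R1 could build on: a predicted box earns its IoU as reward only once it clears a threshold that tightens over training, and rewards are normalized within each group of rollouts. The linear threshold schedule, the (x1, y1, x2, y2) box format, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): an IoU reward whose
# acceptance threshold tightens as training progresses, in the spirit of
# "Dynamic IoU-based GRPO". Box format (x1, y1, x2, y2) is an assumption.

def box_iou(pred, gt):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_g = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred, gt, step, total_steps, start_thr=0.5, end_thr=0.9):
    """Reward = IoU if it clears a threshold that rises over training, else 0.
    The linear schedule and thresholds are illustrative assumptions."""
    thr = start_thr + (end_thr - start_thr) * (step / max(1, total_steps))
    iou = box_iou(pred, gt)
    return iou if iou >= thr else 0.0

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards/advantages for a group of sampled boxes at mid-training.
# rs = [dynamic_iou_reward(b, gt_box, step=500, total_steps=1000) for b in boxes]
# advs = group_advantages(rs)
```

Tightening the threshold keeps pressure on localization precision late in training, rather than letting the policy plateau once coarse boxes already earn reward; this rationale is part of the sketch, not a claim about the exact schedule used in Ref-R1.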

Benchmark Statistics

RefBench-PRO contains 6,000 pairs distributed across six sub-categories: Attribute, Position, Interaction, Relation, Commonsense, and Rejection. The benchmark features high-resolution images, covers over 1,000 distinct object types, and emphasizes small or marginally visible targets, with an average target-object area ratio of 10%. Compared with existing REC benchmarks, RefBench-PRO exhibits a broader distribution of target sizes, placing greater emphasis on objects of small relative size. In addition, its referring expressions are comparably long and incorporate discriminative visual cues, achieving higher information density.
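
For concreteness, the snippet below shows one way a target-object area ratio like the ~10% average above can be computed from box annotations. The field names and the (x1, y1, x2, y2) box format are assumed for illustration and are not necessarily the RefObjects-200k schema.

```python
# Illustrative only: target-object area ratio = box area / image area.
# Field names and the (x1, y1, x2, y2) box format are assumptions.

def area_ratio(box, img_w, img_h):
    x1, y1, x2, y2 = box
    box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return box_area / (img_w * img_h)

def mean_area_ratio(samples):
    """samples: iterable of dicts like {"box": (x1, y1, x2, y2), "width": W, "height": H}."""
    ratios = [area_ratio(s["box"], s["width"], s["height"]) for s in samples]
    return sum(ratios) / len(ratios)

# A value of roughly 0.10 would correspond to the reported ~10% average.
```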

Benchmark Example

Leaderboard

Columns: Overall (average score without / with the Reject task), Visual-cue Perception (Attribute, Position, Interaction, and their average), and Compositional Reasoning (Commonsense, Relation, Reject, and the average of Commonsense and Relation). "-" marks models without a reported Reject score.

| Model | Date | Overall | Overall (w/ Rej.) | Attr. | Pos. | Inter. | Perc. Avg. | Comm. | Rel. | Rej. | Reas. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GLEE | 2023-12 | 36.1 | 31.2 | 48.2 | 38.4 | 34.5 | 40.4 | 31.4 | 27.9 | 7.1 | 29.7 |
| Grounding DINO L | 2023-03 | 37.6 | 31.3 | 47.5 | 43.3 | 31.8 | 40.9 | 35.0 | 30.3 | 0.1 | 32.7 |
| Gemini-2.5-pro | 2025-03 | 9.6 | 8.0 | 10.4 | 11.5 | 10.8 | 10.9 | 7.2 | 8.2 | - | 7.7 |
| GPT-4o | 2024-05 | 12.1 | 10.1 | 11.7 | 12.8 | 11.9 | 12.1 | 12.4 | 11.6 | - | 12.0 |
| GPT-5 | 2025-08 | 26.1 | 21.8 | 29.2 | 25.5 | 27.0 | 27.2 | 26.2 | 22.9 | - | 24.6 |
| PaDT-3B | 2025-10 | 26.6 | 22.2 | 30.4 | 28.0 | 30.4 | 29.6 | 23.7 | 20.8 | - | 22.3 |
| VLM-R1-3B | 2025-04 | 54.4 | 45.3 | 59.0 | 58.0 | 54.1 | 57.0 | 47.8 | 53.2 | - | 50.5 |
| ChatRex-7B | 2024-11 | 49.5 | 41.3 | 54.7 | 51.1 | 53.2 | 53.0 | 45.1 | 43.4 | - | 44.2 |
| Migician-7B | 2025-01 | 52.3 | 43.6 | 57.3 | 59.7 | 52.8 | 56.6 | 45.4 | 46.1 | - | 45.7 |
| UniVG-R1-7B | 2025-05 | 53.0 | 44.2 | 59.4 | 57.2 | 55.1 | 57.2 | 48.3 | 44.9 | - | 46.6 |
| Rex-Thinker-7B | 2025-06 | 63.6 | 53.0 | 67.1 | 64.5 | 61.7 | 64.4 | 59.3 | 65.6 | - | 62.4 |
| CogVLM-Grounding-17B | 2023-11 | 57.1 | 47.5 | 62.4 | 62.4 | 55.9 | 60.2 | 49.4 | 55.2 | - | 52.3 |
| Qwen2-VL-7B | 2024-09 | 45.4 | 42.6 | 55.3 | 47.8 | 37.8 | 47.0 | 39.4 | 46.5 | 28.5 | 43.0 |
| Mimo-VL-RL-7B | 2025-06 | 56.3 | 46.9 | 60.9 | 58.4 | 57.3 | 58.9 | 51.8 | 52.9 | 0.1 | 52.4 |
| Qwen2.5-VL-7B | 2025-02 | 57.6 | 48.5 | 61.7 | 63.0 | 58.6 | 61.1 | 49.1 | 55.6 | 3.1 | 52.3 |
| InternVL3-8B | 2025-04 | 20.1 | 20.3 | 24.9 | 18.4 | 22.3 | 21.9 | 19.8 | 15.0 | 21.3 | 17.4 |
| InternVL3.5-8B | 2025-08 | 41.5 | 34.6 | 45.7 | 41.2 | 45.3 | 44.1 | 37.8 | 37.3 | - | 37.5 |
| LLaVA-OneVision-1.5-8B | 2025-09 | 50.7 | 42.3 | 54.5 | 54.0 | 48.4 | 52.3 | 48.1 | 48.6 | - | 48.3 |
| Qwen3-VL-8B | 2025-10 | 71.4 | 62.2 | 76.6 | 76.1 | 67.3 | 73.3 | 68.9 | 68.3 | 15.8 | 68.6 |
| Ovis2.5-9B | 2025-08 | 61.7 | 51.5 | 65.7 | 63.6 | 59.7 | 63.0 | 58.7 | 61.0 | - | 59.9 |
| GLM-4.1V-Base-9B | 2025-07 | 60.1 | 50.1 | 62.9 | 61.0 | 57.7 | 60.5 | 57.0 | 61.9 | - | 59.4 |
| LLaVA-OneVision-72B | 2024-08 | 56.5 | 47.1 | 60.1 | 59.4 | 53.7 | 57.7 | 54.2 | 55.0 | - | 54.6 |
| Qwen2.5-VL-72B | 2025-02 | 66.7 | 59.5 | 68.6 | 69.1 | 69.4 | 69.1 | 61.6 | 64.8 | 23.6 | 63.2 |
| InternVL3-78B | 2025-04 | 21.8 | 22.3 | 35.0 | 24.9 | 28.2 | 29.4 | 24.3 | 20.8 | 24.8 | 22.5 |
| Qwen3-VL-32B | 2025-10 | 79.0 | 66.6 | 82.7 | 80.3 | 76.3 | 79.8 | 74.1 | 81.6 | 4.5 | 77.9 |
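
For readers parsing the table, the sketch below illustrates one plausible way the aggregate columns relate to the six per-category scores: unweighted means over the equally sized (1,000-sample) categories, with Reject excluded from the per-dimension averages and entering only the second overall column. This is an inferred illustration under those assumptions, not the benchmark's official scoring script.

```python
# Illustrative aggregation (an assumption, not the official scoring code):
# unweighted means over the six 1,000-sample categories.

def aggregate(scores):
    """scores: dict with keys 'attribute', 'position', 'interaction',
    'commonsense', 'relation', 'reject' (accuracy in %; 'reject' may be None)."""
    perception = [scores["attribute"], scores["position"], scores["interaction"]]
    reasoning = [scores["commonsense"], scores["relation"]]
    reject = scores.get("reject") or 0.0          # "-" treated as 0 here (assumption)
    perc_avg = sum(perception) / len(perception)  # Perception group average
    reas_avg = sum(reasoning) / len(reasoning)    # Reasoning group average (w/o Reject)
    overall = sum(perception + reasoning) / 5             # Overall, without Reject
    overall_rej = (sum(perception + reasoning) + reject) / 6  # Overall, with Reject
    return perc_avg, reas_avg, overall, overall_rej
```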

BibTeX citation

@article{gao2025refbench,
title={RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension},
author={Gao, Tianyi and Li, Hao and Fang, Han and Wei, Xin and Dong, Xiaodong and Sun, Hongbo and Yuan, Ye and He, Zhongjiang and Xu, Jinglin and Xin, Jingmin and others},
journal={arXiv preprint arXiv:2512.06276},
year={2025}
}