```
.
├── README.md
├── config
├── document
├── figure
├── process
├── requirements.txt
├── run_script.sh
└── tools
```

- config: the prompts to use, parameters to set, etc.
- document: the model's final performance, examiner priority, and position bias.
- figure: figures used in the paper.
- process: the code of AutoBench-V.
- tools: common utilities, such as image-to-base64 conversion and data visualization (see the sketch after this list).
- run_script.sh: specifies the API to use.
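For example, the image-to-base64 conversion mentioned above typically looks like the minimal sketch below. This is a generic illustration, not the repository's actual implementation, and the function name `encode_image` is hypothetical:

```python
import base64

def encode_image(image_path: str) -> str:
    # Hypothetical helper: read an image file and return its contents as a
    # base64 string, e.g. for embedding in a request to a vision-language model.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```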
```bash
pip install -r requirements.txt
./run_script.sh
python pipeline.py
```

Remember to set the parameters user_input and generate_type when running pipeline.py.
There are five options for user_input:

- basic_understanding
- spatial_understanding
- semantic_understanding
- reasoning_capacity
- atmospheric_understanding
For a complete pipeline, run the seven values of generate_type in the following order (see the sketch after this list):

- aspect: generate aspects
- prompts: generate image descriptions
- images: generate images based on the descriptions
- alignment: test the alignment of images and descriptions via VQA
- questions: generate questions to test LVLMs
- adjust: adjust the option distribution of the questions
- answers: answer the questions and score
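The sketch below runs all seven stages in order for one evaluation aspect. It assumes pipeline.py accepts user_input and generate_type as command-line flags; that is an assumption made for illustration, and in the actual repository these parameters may instead be set inside pipeline.py itself:

```python
# Minimal sketch, assuming pipeline.py takes --user_input and
# --generate_type as command-line flags (hypothetical; the real script
# may expect these parameters to be edited in the file instead).
import subprocess

USER_INPUT = "basic_understanding"  # one of the five user_input options above
STAGES = ["aspect", "prompts", "images", "alignment",
          "questions", "adjust", "answers"]  # the seven generate_type values

for stage in STAGES:
    # Each stage consumes the previous stage's output, so run them in order.
    subprocess.run(
        ["python", "pipeline.py",
         "--user_input", USER_INPUT,
         "--generate_type", stage],
        check=True,  # abort the pipeline if any stage fails
    )
```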
```bibtex
@misc{bao2025autobenchvlargevisionlanguagemodels,
      title={AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?},
      author={Han Bao and Yue Huang and Yanbo Wang and Jiayi Ye and Xiangqi Wang and Xiuying Chen and Yue Zhao and Tianyi Zhou and Mohamed Elhoseiny and Xiangliang Zhang},
      year={2025},
      eprint={2410.21259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.21259},
}
```
If you have any questions or suggestions, or would like to collaborate, please feel free to reach out to us via email at hbao@nd.edu.

