PlanBench: A Comprehensive Benchmark for Urban Planning

PlanBench: Planning Knowledge Benchmark

A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models

Abstract

Urban planning is a highly interdisciplinary, practice-oriented field. It demands more than knowledge recall: practitioners must also exercise situational judgment, interpret policy, reason about spatial logic, and weigh competing values. Planning texts are characterized by dense terminology, complex structures, and long reasoning chains. A dedicated benchmark can help improve how well large language models adapt to planning tasks in the following areas:

Deconstruction of planning texts (e.g., regulation breakdown, indicator interpretation)
Multi-level spatial governance logic (national - city - community)
Situational policy judgment and plan generation (e.g., site selection, land allocation, industry recommendations)

Text-based benchmarks serve as the linguistic foundation for "multimodal urban intelligence." When text is later integrated with maps, charts, and spatial models, text comprehension remains the basis for linking the three elements of "text-image-policy."

Architecture

Figure 1: PlanBench Text Benchmark Architecture.

📅 Release Date: May 19, 2025

PlanBench-V: Planning Visual Recognition Benchmark

Multimodal Multi-image Understanding for Evaluating Multimodal Large Language Models

🔗Homepage

Abstract

National spatial planning maps visually present the concepts, goals, strategies, and specific measures of spatial planning, serving as a guide for coordinating various spatial development, protection, and utilization activities. They are not only crucial for planning decisions but also important tools for public participation and oversight of planning implementation. Planning is a highly interdisciplinary and specialized task; understanding planning maps requires grasping detailed elements (symbols, legends, geographic features) and the ability to conduct comprehensive analysis and judgment in conjunction with policies. This complexity makes understanding planning maps challenging. With the rapid development of multimodal large language models (MLLMs), we have established a benchmark for national spatial planning maps to evaluate MLLMs' capabilities in understanding these maps. Our contributions are as follows:

(1) Data: We constructed the Spatial Planning Map Database (SPMD), featuring diverse image content and high-quality annotations provided by experts in the field of planning.
(2) Framework: We proposed a comprehensive framework grounded in the planning discipline that measures MLLMs' understanding of planning maps from four perspectives (perception, reasoning, association, and application), comprising eight subcategories.
(3) Experiments: By constructing question-answer tasks from an authoritative question bank (China's Registered Urban Planner Qualification Examination), we significantly reduced the proportion of "hallucination-style normative citations" in model outputs.
(4) Results: All models performed worst in the application dimension, with Qwen2.5-VL-32B-Instruct achieving the highest overall score across all four dimensions.

Architecture

Figure 2: PlanBench-V Architecture.

📅 Release Date: May 19, 2025

Benchmark Results

PlanBench-V Results (Vision, Judge: gpt-4o-mini, 300 items)

Rank	Model	Overall	Description	Type	Evaluation	Decision-making	Professional Reasoning	Association	Spatial Relations	Elements
🥇	gemini-2.5-pro	1.472/2 (73.6%)	1.775	1.656	1.439	1.525	1.425	1.468	1.444	1.408
🥈	gpt-5.4	1.431/2 (71.6%)	1.900	1.562	1.586	1.508	1.486	1.438	1.383	1.233
🥉	claude-opus-4.7	1.384/2 (69.2%)	1.825	1.320	1.434	1.321	1.558	1.493	1.295	1.186
4	gpt-4o-mini	1.084/2 (54.2%)	1.244	1.342	0.901	1.155	1.079	1.110	1.151	0.918
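
The Overall column pairs a mean judge score on a 0-2 scale with the same value expressed as a percentage. The minimal sketch below reproduces that conversion for the four rows above. The function name and the half-up rounding rule are assumptions made for illustration; the table itself does not state how the percentages were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed from the "/2" shown in the Overall column.
MAX_JUDGE_SCORE = Decimal("2")


def overall_percent(mean_score: str) -> Decimal:
    """Convert a mean judge score on the 0-2 scale to a one-decimal percentage."""
    pct = Decimal(mean_score) / MAX_JUDGE_SCORE * 100
    # Half-up rounding is an assumption; it matches the figures in the table.
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# (mean score, reported percentage) copied from the table above.
reported = {
    "gemini-2.5-pro": ("1.472", "73.6"),
    "gpt-5.4": ("1.431", "71.6"),
    "claude-opus-4.7": ("1.384", "69.2"),
    "gpt-4o-mini": ("1.084", "54.2"),
}

for model, (score, pct) in reported.items():
    print(f"{model}: computed {overall_percent(score)}%, reported {pct}%")
```

Decimal arithmetic is used instead of floats so that a borderline value such as 71.55 rounds predictably.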

PlanBench Results (Text, Judge: gpt-4o-mini, 405 items)

Rank	Model	Score	Remember	Understand	Apply	Analyze	Evaluate
1	Qwen3-32B	80.9%	97.5	86.4	95.1	86.1	39.5
2	Qwen3-14B	80.6%	97.5	77.8	92.6	86.8	48.1
3	QwQ-32B	80.4%	95.1	85.2	91.4	91.9	38.3
4	Qwen3-8B	80.0%	93.8	80.2	90.1	90.4	45.7
5	Qwen3-4B	78.8%	95.1	72.8	90.1	89.3	46.9
6	Qwen3-30B-A3B	78.4%	97.5	79.0	88.9	89.5	37.0
7	Qwen3-1.7B	74.1%	95.1	79.0	76.5	85.1	34.6
8	glm-4-9b-chat	73.3%	91.4	72.8	84.0	79.9	38.3
9	Meta-Llama-3-8B-Instruct	70.6%	95.1	58.0	72.8	78.8	48.1
10	Qwen2.5-3B-Instruct	70.3%	98.8	66.7	92.6	64.0	29.6
11	Qwen2.5-7B-Instruct	69.5%	98.8	70.4	81.5	65.9	30.9
12	Qwen2-VL-7B-Instruct	68.2%	93.8	65.4	76.5	65.7	39.5
13	DeepSeek-R1-Distill-Llama-8B	68.1%	93.8	64.2	75.3	78.8	28.4
14	DeepSeek-R1-Distill-Qwen-7B	68.0%	96.3	69.1	77.8	73.4	23.5
15	Qwen3-0.6B	55.9%	90.1	55.6	46.9	74.8	12.3
16	Llama-3.1-Tulu-3-8B	49.0%	60.5	56.8	30.9	80.8	16.0
17	chatglm3-6b	48.3%	80.2	37.5	44.4	58.3	21.0
18	Qwen2.5-0.5B-Instruct	39.3%	65.4	21.0	25.9	69.4	14.8
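
In the table above, each Score value appears to equal the unweighted mean of the five dimension columns (Remember, Understand, Apply, Analyze, Evaluate). This is an inference from the reported numbers, not an aggregation rule stated by the benchmark. The sketch below checks it on three rows; the helper names and the 0.05 tolerance (to absorb one-decimal rounding) are assumptions for illustration.

```python
DIMENSIONS = ("Remember", "Understand", "Apply", "Analyze", "Evaluate")

# model -> (reported Score, per-dimension scores), copied from the table above.
rows = {
    "Qwen3-32B": (80.9, (97.5, 86.4, 95.1, 86.1, 39.5)),
    "Qwen3-14B": (80.6, (97.5, 77.8, 92.6, 86.8, 48.1)),
    "Qwen2.5-0.5B-Instruct": (39.3, (65.4, 21.0, 25.9, 69.4, 14.8)),
}


def unweighted_mean(scores):
    """Average the dimension scores with equal weight."""
    return sum(scores) / len(scores)


def matches_reported(reported, scores, tol=0.05):
    """True if the unweighted mean agrees with the reported Score within rounding."""
    return abs(unweighted_mean(scores) - reported) <= tol


for model, (reported, scores) in rows.items():
    assert len(scores) == len(DIMENSIONS)
    mean = unweighted_mean(scores)
    print(f"{model}: mean {mean:.2f}, reported {reported}, match={matches_reported(reported, scores)}")
```

If the reading holds, a model's rank is driven as much by the weak Evaluate column as by the near-saturated Remember column, since each dimension counts equally.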

Citation

@misc{zhu2024plangptenhancingurbanplanning,
      title={PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval}, 
      author={He Zhu and Wenjia Zhang and Nuoxian Huang and Boyang Li and Luyao Niu and Zipei Fan and Tianle Lun and Yicheng Tao and Junyou Su and Zhaoya Gong and Chenyu Fang and Xing Liu},
      year={2024},
      eprint={2402.19273},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.19273}, 
}

@misc{deng2025urban,
    title={Urban Planning Bench: A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models},
    author={Yijie Deng and He Zhu and Wen Wang and Minxin Chen and Junyou Su and Wenjia Zhang},
    year={2025},
    institution={Behavioral and Spatial AI Lab, Tongji University},
}

Members

He Zhu, Minxin Chen, Yijie Deng, Junyou Su, Wen Wang, Yurun Wang, Yulin Wu, Caicheng Niu, Tianhua Lu, Chengcheng Liu, Boyang Li, Nuoxian Huang, Ying'er Cai, Yue Wei, Sizheng Yang, Luyao Niu, Jiayu Gu, Yuhan Zou, Fenghong An, Siqi Cha, Chuang Deng, Hanying Li, Hongzhou Zheng and Qi Wang.