IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1
A Survey of World Models for
Autonomous Driving
Tuo Feng, Wenguan Wang, Senior Member, IEEE, Yi Yang, Senior Member, IEEE
Abstract—Recent breakthroughs in autonomous driving have revolutionized the way vehicles perceive and interact with their
surroundings. In particular, world models have emerged as a linchpin technology, offering high-fidelity representations of the driving
environment that integrate multi-sensor data, semantic cues, and temporal dynamics. Such models unify perception, prediction, and
planning, thereby enabling autonomous systems to make rapid, informed decisions under complex and often unpredictable conditions.
Research trends span diverse areas, including 4D occupancy prediction and generative data synthesis, all of which bolster scene
understanding and trajectory forecasting. Notably, recent works exploit large-scale pretraining and advanced self-supervised learning
to scale up models’ capacity for rare-event simulation and real-time interaction. In addressing key challenges – ranging from domain
adaptation and long-tail anomaly detection to multimodal fusion – these world models pave the way for more robust, reliable, and
adaptable autonomous driving solutions. This survey systematically reviews the state of the art, categorizing techniques by their focus
on future prediction, behavior planning, and the interaction between the two. We also identify potential directions for future research,
emphasizing holistic integration, improved computational efficiency, and advanced simulation. Our comprehensive analysis
underscores the transformative role of world models in driving next-generation autonomous systems toward safer and more equitable
mobility.
Index Terms—Autonomous Driving, World Models, Self-Supervised Learning, Behavior Planning, Generative Approaches
✦
1 INTRODUCTION
1.1 Overview
The quest for fully autonomous driving has rapidly
become a global focal point in both scientific research
and industry endeavors. At its core lies the ambition to
simultaneously reduce traffic accidents, alleviate conges-
tion, and enhance mobility for diverse societal groups [1].
Current statistics underscore that human error remains the
principal cause of accidents on the road [2], indicating
that minimizing human intervention could significantly
lower the incidence of traffic-related fatalities and injuries.
Beyond safety, economic factors (e.g., reducing congestion
and optimizing logistics) further propel the development of
autonomous driving technologies [3].
Despite these compelling incentives, achieving high-
level autonomy demands overcoming substantial technical
hurdles. Foremost among these is perceiving and under-
standing dynamic traffic scenarios, which requires fusing
heterogeneous sensor streams (e.g., LiDAR, radar, cam-
eras) into a cohesive environmental representation [4],
[5]. From complex urban layouts to high-speed high-
ways, autonomous vehicles must rapidly assimilate multi-
modal data, detect salient objects (vehicles, pedestrians,
cyclists), and anticipate their motion under varying con-
ditions – such as inclement weather, unstructured roads,
or heavy traffic [6], [7]. Furthermore, real-time decision-
making introduces stringent computational constraints,
imposing millisecond-level responsiveness to address unex-
pected obstacles or anomalous behaviors in the driving en-
vironment [8], [9]. Equally pivotal is the system’s resilience
in extreme or long-tail scenarios (e.g., severe weather, con-
struction zones, or erratic driving behaviors), where perfor-
mance shortfalls can compromise overall safety [10], [11].
•T. Feng is with the ReLER Lab, Australian Artificial Intelligence Institute
(AAII), University of Technology Sydney, NSW, Australia (e-mail:
feng.tuo@student.uts.edu.au).
•W. Wang and Y. Yang are with the Collaborative Innovation Center of
Artificial Intelligence (CCAI), Zhejiang University, China (e-mail:
wenguanwang.ai@gmail.com, yangyics@zju.edu.cn).
Within this context, constructing robust and stable world
models has emerged as a foundational element. The no-
tion of a world model involves creating a high-fidelity
representation of the driving environment – encompassing
static structures (e.g., roads, buildings) and dynamic enti-
ties (e.g., vehicles, pedestrians) [3], [8]. A comprehensive
world model continuously captures semantic and geometric
information while updating these representations in real-
time, thereby informing downstream tasks such as physi-
cal world prediction [12], [13]. Recent advances refine these
representations by integrating multi-sensor data, e.g.,
generative approaches [14], [15] that simulate the physical
world for training, and methods that unify heterogeneous
sensor inputs into consistent top-down perspectives [16], [17].
In turn, these robust world models leverage environ-
mental representations to optimize the behavior planning
of intelligent agents, providing the keystone for safer and
more efficient autonomous driving applications. By en-
abling proactive trajectory optimization, real-time hazard
detection, and adaptive route planning, they directly mit-
igate risks posed by unforeseen hazards [5] and align
with evolving vehicle-to-everything (V2X) systems [9]. Ul-
timately, world models facilitate more cohesive integration
between perception and control subsystems, streamlining
the closed-loop autonomy pipeline [18], [19].
arXiv:2501.11260v1 [cs.RO] 20 Jan 2025

Existing surveys on world models that involve autonomous
driving can generally be classified into two categories.
The mainstream category focuses on describing
general world models that find applications across mul-
tiple fields [20]–[22], with autonomous driving being just
one of the specific areas. The second category [23], [24]
concentrates on the application of world models within the
autonomous driving sector and attempts to summarize the
current state of the field. The few existing surveys on world
models in autonomous driving tend to broadly categorize
these studies: they often focus solely on world simulation
or lack discussion of the interaction between behavior
planning and physical-world prediction, leaving the field
without a clear taxonomy. In this
paper, we aim not only to define and categorize world
models for autonomous driving formally but also to pro-
vide a comprehensive review of recent technical progress
and explore their extensive applications in various sectors,
particularly emphasizing their transformative potential in
autonomous driving. This structured taxonomy allows us
to highlight how these models are shaped by and adapt to
the challenges of the automotive sector.
1.2 Contributions
Guided by the principle that the world model is central
to the understanding of dynamic scenes, this survey aims
to provide a comprehensive, structured review of existing
methodologies. We categorize state-of-the-art research into
three key areas: Future Prediction of the Physical World:
Focusing on the physical world evolution of both dynamic
objects and static entities [11], [25]; Behavior Planning for
Intelligent Agents: Examining generative and rule-based
planning methods that produce safe, efficient paths under
uncertain driving conditions [12], [13]; Interaction Between
Behavior Planning and Future Prediction: Highlighting
how unified frameworks can capture agent interactions
and leverage predictive insights for collaborative optimiza-
tion [18], [26], [27]. Specifically, we provide:
•An In-Depth Analysis of Future Prediction Models:
We discuss how Image-/BEV-/OG-/PC-based methods
achieve geometric and semantic fidelity in dynamic
scenes, including 4D occupancy forecasting and diffusion-
based generation.
•Investigation of Behavior Planning: We explore behavior
planning through both rule-based and learning-
based approaches, demonstrating notable improvements
in robustness and collision avoidance.
•Proposition of Interactive Model Research: We system-
atically review interactive models that jointly address
future prediction and agent behavior, indicating how this
synergy can vastly enhance real-world adaptability and
operational safety.
We conclude by identifying open challenges, such as
seamless integration of self-supervised approaches [26],
large-scale simulation for rare-event augmentation [10], [28],
and real-time multi-agent coordination [27], offering direc-
tions for future exploration. With the expanding research
landscape and the urgency of real-world adoption, this
survey aspires to serve as a valuable reference point for
researchers and practitioners, laying the groundwork for
safer, more robust autonomous driving solutions.
1.3 Structure
A summary of the structure of this paper can be found in
Fig. 1, organized as follows. Sec. 1 introduces the
significance of world models in autonomous driving and
outlines the societal and technical challenges they address.
Sec. 2 provides a comprehensive background on the for-
mulation and core tasks of world models in autonomous
driving, specifically focusing on the future prediction of the
physical world and behavior planning for intelligent agents.
Sec. 3 details the taxonomy of methods: Sec. 3.1 delves
into methods for future prediction of the physical world,
discussing physical-world evolution of dynamic objects and
static entities. Sec. 3.2 discusses advanced behavior-planning
approaches that emphasize the generation of safe, effec-
tive driving strategies. Sec. 3.3 investigates the interactive
relationship between future prediction and behavior plan-
ning, highlighting collaborative optimization techniques for
complex scenarios. Sec. 4 explores different approaches to
data and training paradigms, including supervised and
self-supervised learning, and data-generation techniques.
Sec. 5 examines the application areas and tasks where world
models can be applied, discussing the impact of these tech-
nologies across diverse domains including perception, pre-
diction, simulation, and system integration. Sec. 6 provides a
detailed evaluation of world models for autonomous driv-
ing, assessing their effectiveness across various tasks and
metrics. Sec. 7 explores open challenges, potential research
avenues, and promising directions for further innovation
in autonomous driving technologies. Sec. 8 concludes the
survey and summarizes key findings, reiterating the impor-
tance of robust world models for autonomous driving.
2 BACKGROUND
In this section, we first provide a detailed problem formula-
tion for world models in autonomous driving (Sec. 2.1), en-
compassing two key tasks: future prediction of the physical
world and behavior planning for intelligent agents. Then, in
Sec. 2.2, we introduce key terminologies and concepts rel-
evant to world models, such as representation spaces, gen-
erative models, and spatiotemporal modeling techniques.
These aspects lay the foundation for understanding state-
of-the-art methods.
2.1 Problem Formulation
2.1.1 Core Tasks in World Models
In autonomous driving, a critical aspect is accurately pre-
dicting the future states of both the ego vehicle and its
surrounding environment. To address these core tasks, a
world model w in autonomous driving takes sensor inputs
(including a set of multi-view images I and a set of LiDAR
points P) collected from previous frames and infers the
scene and trajectory for the next frames. Specifically, the ego
trajectory at time T+1, denoted as τ_{T+1}, is predicted
alongside the surrounding scene z_{T+1}; w models the
coupled dynamics of the ego vehicle’s motion and the
environment’s evolution. Formally, the function w is given by:

z_{T+1}, τ_{T+1} = w((I_T, ..., I_{T−t}), (P_T, ..., P_{T−t})).   (1)
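To make Eq. (1) concrete, the sketch below illustrates the interface of w under this formulation. The encoder, temporal fusion, and output heads are placeholder computations (not any method from the literature), and all tensor shapes are illustrative assumptions.

```python
import numpy as np

def world_model(images, points, t=3):
    """Illustrative interface for w in Eq. (1): maps the last t+1 frames of
    multi-view images I and LiDAR points P to the next scene latent
    z_{T+1} and ego waypoint tau_{T+1}. Internals are placeholders."""
    assert len(images) == t + 1 and len(points) == t + 1
    # Encode each frame into a latent vector (stand-in for a real encoder).
    latents = [np.concatenate([im.mean(axis=(0, 1, 2)),  # per-channel image stats
                               pc.mean(axis=0)])          # point-cloud centroid
               for im, pc in zip(images, points)]
    # Fuse the temporal history (stand-in for a recurrent/attention backbone).
    h = np.stack(latents).mean(axis=0)
    z_next = h          # predicted scene representation z_{T+1}
    tau_next = h[:2]    # predicted ego waypoint (x, y) at T+1
    return z_next, tau_next

# Usage: four frames of 6-camera RGB images and 1000-point LiDAR sweeps.
I = [np.random.rand(6, 32, 32, 3) for _ in range(4)]
P = [np.random.rand(1000, 3) for _ in range(4)]
z, tau = world_model(I, P)
```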
[Fig. 1 diagram: paper structure — Introduction (§1), Background (§2),
Taxonomy (§3: Physical World Prediction §3.1 with Image/BEV/OG/PC
representations; Behavior Planning §3.2, rule-based (e.g., HD map) and
learning-based; Interaction Between Agents & World §3.3, with a world
predictor and planner in continuous trajectory rollout), Data & Training
(§4: self-supervised learning, 4D pre-training, data generation for
training), Application Areas and Tasks (§5), Performance Comparison (§6),
Future Research Directions (§7), Conclusions (§8).]
Fig. 1. Structure of the overall review. The first row shows the structure of the paper. The second and third rows detail future prediction of the
physical world, behavior planning for intelligent agents, and interaction between behavior planning and future prediction. The fourth row highlights
various methodologies for training models in autonomous driving, covering self-supervised learning paradigms, pretraining strategies, and
innovative approaches for data generation.
The first core task is future prediction of the physical
world [17], [27], [29], [30], which involves forecasting the
future states of dynamic entities such as vehicles, pedestri-
ans, and traffic elements. This task emphasizes capturing
potential interactions, stochastic behaviors, and uncertain-
ties within rapidly changing and complex scenes. Advanced
techniques, such as 4D occupancy prediction and generative
models, play a crucial role in addressing this challenge by
leveraging multi-modal sensor data and probabilistic fore-
casting frameworks. The second core task is behavior plan-
ning for intelligent agents [18], [19], [31], [32], focusing on
generating optimal and feasible trajectories for the ego ve-
hicle. It requires accounting for safety constraints, dynamic
obstacles, traffic regulations, and real-time adaptability. Be-
havior planning is often achieved through a combination
of model-based and learning-based approaches that ensure
robustness and responsiveness in diverse driving scenarios.
2.2 Context and Terminology
2.2.1 Representation Spaces
Occupancy Grid (OG) Representation. An OG represen-
tation partitions the environment into discrete cells, each
annotated with a probability of occupancy, thereby offering
a unified representation for static and dynamic objects in
3D space. Although OG approaches are highly descriptive,
they typically require large memory and computational
resources, which can limit their applicability in real-time
autonomous systems [29].
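As a minimal illustration of the OG idea described above, the sketch below maintains a discretized 3D grid of per-cell occupancy probabilities via a standard Bayesian log-odds update; the grid size and sensor-model constants are illustrative assumptions, not values from any cited method.

```python
import numpy as np

# Occupancy grid in log-odds form: 0 corresponds to p = 0.5 (unknown).
grid = np.zeros((50, 50, 10))

def update_cell(grid, idx, hit, l_occ=0.85, l_free=-0.4):
    """Accumulate log-odds evidence for one observed cell (hit or miss)."""
    grid[idx] += l_occ if hit else l_free

def occupancy_prob(grid):
    """Convert log-odds back to occupancy probabilities (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-grid))

update_cell(grid, (10, 20, 3), hit=True)    # LiDAR return in this cell
update_cell(grid, (10, 19, 3), hit=False)   # ray passed through this cell
p = occupancy_prob(grid)
```

The additive log-odds form is what makes OG updates cheap per observation; the memory cost the text mentions comes from storing every cell, observed or not.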
Bird’s-Eye View (BEV) Representation. A BEV Representa-
tion converts multi-modal sensor data into a top-down view,
facilitating more intuitive spatial understanding, particu-
larly for motion prediction and trajectory planning. How-
ever, BEV representations may have difficulty capturing
fine-grained 3D geometries, especially in environments with
complex depth relationships [14], [27].
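The core of the BEV conversion — collapsing 3D sensor data onto a top-down plane — can be sketched as follows. The range, resolution, and use of simple point counts are illustrative assumptions; the example also makes the stated limitation visible, since the z coordinate is simply discarded.

```python
import numpy as np

def points_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), res=0.5):
    """Rasterize an (N, 3) LiDAR cloud into a 2D top-down count map."""
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)
    bev, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                               bins=(nx, ny), range=(x_range, y_range))
    return bev  # (nx, ny) counts; fine-grained z structure is lost

pts = np.array([[0.0, 0.0, 1.5],      # e.g., a point on a vehicle roof
                [0.1, 0.1, -0.2],     # and one on the ground below it
                [30.0, -10.0, 0.3]])
bev = points_to_bev(pts)              # the first two share one BEV cell
```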
Point Cloud (PC) Representation. A PC Representation
uses raw 3D point data collected from LiDAR sensors to en-
code the spatial and geometric structure of the environment.
Point clouds provide fine-grained 3D details and are inher-
ently suited for capturing both static and dynamic objects in
high-resolution environments. Despite their precision, point
cloud processing is computationally intensive due to data
sparsity and the high dimensionality of the input [17].
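To illustrate one common way of taming the computational burden noted above, the sketch below performs voxel downsampling, keeping a single centroid per occupied voxel; the 0.2 m voxel size and centroid aggregation are illustrative choices.

```python
import numpy as np

def voxel_downsample(points, voxel=0.2):
    """Reduce an (N, 3) cloud to one centroid per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    keys -= keys.min(axis=0)                 # shift indices to start at 0
    dims = keys.max(axis=0) + 1
    # Flatten 3D voxel indices to scalar keys, then group points by key.
    flat = (keys[:, 0] * dims[1] + keys[:, 1]) * dims[2] + keys[:, 2]
    _, inv = np.unique(flat, return_inverse=True)
    n = inv.max() + 1
    sums, counts = np.zeros((n, 3)), np.zeros(n)
    np.add.at(sums, inv, points)             # accumulate per-voxel sums
    np.add.at(counts, inv, 1)
    return sums / counts[:, None]            # per-voxel centroids

dense = np.random.rand(10000, 3)             # 10k points in a unit cube
sparse = voxel_downsample(dense)             # at most 5*5*5 = 125 centroids
```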
2.2.2 Generative Models
Generative models, such as VAEs and diffusion architec-
tures [14], [33], play a pivotal role in simulating future
driving environments by facilitating trajectory prediction,
rare-event synthesis, and uncertainty modeling through di-
verse scenario generation. For instance, OccSora [34] intro-
duces a diffusion-based 4D occupancy generation model
that yields realistic, temporally consistent 3D driving sim-
ulations, while InfinityDrive [28] pushes temporal limits by
producing long-duration, high-resolution video sequences
of future states. Despite these advancements, balancing
high-fidelity outputs with computational efficiency remains
an active research challenge.
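As background for the diffusion-based generators discussed above, the sketch below implements only the fixed forward (noising) process of a standard DDPM on a stand-in occupancy slice; the schedule values are illustrative, and the learned reverse (denoising) model that such systems actually train is omitted.

```python
import numpy as np

# Variance-preserving noise schedule (illustrative DDPM-style values).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # signal fraction remaining at step t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                  # stand-in for a clean occupancy slice
x_early = q_sample(x0, 5, rng)        # mostly signal
x_late = q_sample(x0, 99, rng)        # mostly noise; signal has decayed
```

Generation then amounts to learning a network that inverts this process step by step, which is where the fidelity-versus-cost trade-off mentioned above arises.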
2.2.3 Spatiotemporal Modeling Techniques
Spatiotemporal modeling addresses how entities evolve in
space and time within dynamic driving environments.
Transformers employ attention mechanisms to capture
long-range temporal dependencies and multi-agent inter-
actions, facilitating improved motion prediction and scene
understanding [35].
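The mechanism referred to here reduces to a few lines: scaled dot-product self-attention over a sequence of per-frame tokens. The token dimensionality is an illustrative assumption, and the learned query/key/value projections of a full Transformer are omitted for brevity.

```python
import numpy as np

def self_attention(X):
    """X: (T, d) sequence of frame embeddings -> (T, d) context features."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # pairwise frame affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over frames
    return weights @ X                            # attention-weighted mixture

X = np.random.default_rng(1).standard_normal((6, 16))  # 6 frames, 16-d tokens
ctx = self_attention(X)   # each output row attends over all 6 frames
```

Because every output token mixes information from every input frame, long-range temporal dependencies are captured in a single layer, at quadratic cost in sequence length.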
Slot Attention, on the other hand, segments the environ-
ment into interpretable entities, enabling more structured