Terra is specialized for driving environments and can generate driver-perspective videos from in-vehicle cameras.
Terra can generate different situations
Generates the continuation of a short video
(L) Generated video following the green trajectory
(R) Generated video following the red trajectory
Model not performant and limited sensory equipment
2nd Gen (CNN, Rule-based)
・Multiple Cameras ・LiDAR ・Radar ・HD Map
HD map and if-then logic as critical limitations
3rd Gen (Transformer, E2E)
・Multiple Cameras ・Behavior Cloning
Model performant but still has issues handling edge cases
4th Gen (LLM, E2E)
・Multiple Cameras ・Language-based UI ・LLM/World Model
Model can finally handle all situations and edge cases
Timeline: 2012, 2017, 2021, 2025
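Meta Action (e.g., “ACCELERATE”): labeled; treats the ego vehicle's motion as discrete
Target (Goal) Point: e.g., the coordinates of the point the vehicle aims to reach a few seconds to a few tens of seconds ahead
Trajectory: the path the ego vehicle should follow over the next few seconds to a few tens of seconds
Control Signal: e.g., target values for speed and steering angle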
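The output representations on this slide (Meta Action, Target (Goal) Point, Trajectory, Control Signal) can be sketched as simple data types. This is only an illustration under stated assumptions: the class and function names, the `KEEP` and `DECELERATE` labels, the units, and the speed tolerance are all hypothetical and do not come from the slides. Only the `ACCELERATE` label appears in the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class MetaAction(Enum):
    """Discrete, labeled ego motion. Only ACCELERATE is from the slide;
    the other labels are hypothetical placeholders."""
    ACCELERATE = "ACCELERATE"
    KEEP = "KEEP"
    DECELERATE = "DECELERATE"


@dataclass
class TargetPoint:
    """Coordinates of the point to reach (assumed ego frame, meters)."""
    x: float
    y: float


@dataclass
class Trajectory:
    """Waypoints the ego vehicle should follow over the planning horizon."""
    waypoints: List[Tuple[float, float]]

    def target_point(self) -> TargetPoint:
        # The last waypoint can serve as a target (goal) point.
        x, y = self.waypoints[-1]
        return TargetPoint(x, y)


@dataclass
class ControlSignal:
    """Target values for speed and steering angle (assumed units)."""
    target_speed_mps: float
    target_steering_rad: float


def to_meta_action(current_speed_mps: float,
                   control: ControlSignal,
                   tolerance_mps: float = 0.5) -> MetaAction:
    """Discretize a continuous speed target into a meta action.
    The tolerance is an arbitrary illustrative threshold."""
    diff = control.target_speed_mps - current_speed_mps
    if diff > tolerance_mps:
        return MetaAction.ACCELERATE
    if diff < -tolerance_mps:
        return MetaAction.DECELERATE
    return MetaAction.KEEP
```

The sketch illustrates the trade-off between the representations: a meta action is easy to label and read but discards detail, while a trajectory or control signal is continuous and closer to what the vehicle actually executes.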
moving straight at a moderate speed following leading car with acceleration. There is a traffic light near the ego vehicle displaying a green signal. …
Predicted caption: The ego vehicle is moving at a moderate speed and turning right. There is a traffic light near the ego vehicle displaying a green signal. …
Trajectory predicted by the VLA / Actual trajectory
Arai, Miwa, Sasaki+, “CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving.” WACV2025
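Models.” NIPS2018
A World Model, from Scott McCloud’s Understanding Comics.
The “world model” in a person's head when riding a bicycle
A world model composed of Vision, Memory, and Controller
A game agent trained by reinforcement learning inside the world model's hallucination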
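The Vision, Memory, Controller decomposition named on this slide can be sketched structurally as follows. This is a toy illustration only: the class names, the chunk-mean encoder, the hand-written dynamics, and the random linear policy are all assumptions standing in for learned components, and no training is shown. The point is the data flow, in particular that a rollout can proceed using only the Memory model's predictions, without a real environment.

```python
import random
from typing import List


class Vision:
    """V: compresses an observation into a small latent z (toy: chunk means)."""

    def __init__(self, latent_dim: int) -> None:
        self.latent_dim = latent_dim

    def encode(self, obs: List[float]) -> List[float]:
        chunk = max(1, len(obs) // self.latent_dim)
        return [sum(obs[i * chunk:(i + 1) * chunk]) / chunk
                for i in range(self.latent_dim)]


class Memory:
    """M: keeps a hidden state h and predicts the next latent from (z, action)."""

    def __init__(self, latent_dim: int) -> None:
        self.h = [0.0] * latent_dim

    def step(self, z: List[float], action: float) -> List[float]:
        # Toy dynamics standing in for a learned recurrent model.
        self.h = [0.9 * h_i + 0.1 * z_i for h_i, z_i in zip(self.h, z)]
        return [z_i + 0.1 * action for z_i in z]


class Controller:
    """C: maps (z, h) to a bounded scalar action with a linear policy."""

    def __init__(self, latent_dim: int, rng: random.Random) -> None:
        self.w = [rng.uniform(-1.0, 1.0) for _ in range(2 * latent_dim)]

    def act(self, z: List[float], h: List[float]) -> float:
        s = sum(w_i * x_i for w_i, x_i in zip(self.w, z + h))
        return max(-1.0, min(1.0, s))


def dream_rollout(memory: Memory, controller: Controller,
                  z0: List[float], steps: int) -> List[List[float]]:
    """Roll the agent forward using only M's predictions (no real environment)."""
    z, latents = z0, [z0]
    for _ in range(steps):
        action = controller.act(z, memory.h)
        z = memory.step(z, action)
        latents.append(z)
    return latents
```

A hypothetical usage would encode one real observation with `Vision.encode`, then call `dream_rollout` to produce imagined latent states on which a controller could be evaluated.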
Generative AI
Adversarial Prompt
Harmful Output
• Misaligned output (e.g., toxic text)
• Unauthorized usage (e.g., Deepfake)
• Attribute inference
• Jailbreak
How to build safeguards?
“How to make a bomb?”
“What diseases does this man have?”
“Generate a racist joke with this girl’s face.”
Encoder
Text Output (Answer)
Can it be trusted? (shown twice in the figure)
Liao+, “DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving” https://arxiv.org/abs/2411.15139
the instructions provided by experts (e.g., labelers)
• By supervised fine-tuning
Preference Tuning
• Align with human preferences using votes and ranks for generated contents
• By reward modeling and reinforcement learning
https://arxiv.org/abs/2203.02155
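The reward-modeling step of preference tuning can be illustrated with the pairwise comparison loss commonly used for this purpose, including in the paper linked on this slide: a ranking of generated outputs is expanded into (preferred, rejected) pairs, and the reward model is trained so that the preferred output scores higher. The sketch below is a minimal illustration under assumptions: the function names are hypothetical, and rewards are plain floats here, whereas in practice they come from a learned model.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pairs_from_ranking(ranking: List[str]) -> List[Tuple[str, str]]:
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranking, 2))


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = softplus(-m) = max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ranking_loss(ranking: List[str], rewards: Dict[str, float]) -> float:
    """Mean pairwise loss over all pairs implied by one ranking."""
    pairs = pairs_from_ranking(ranking)
    total = sum(pairwise_preference_loss(rewards[w], rewards[l]) for w, l in pairs)
    return total / len(pairs)
```

The loss is small when the reward model already agrees with the human ranking and grows when it scores a rejected output above a preferred one. The trained reward model then supplies the reward signal for the reinforcement learning stage.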