Trends in Multimodal Models and Autonomous Driving

Slide 1

Slide 1 text

Trends in Multimodal Models and Autonomous Driving Turing Inc. CTO Yu Yamaguchi Jan 15th, 2025

Slide 2

Slide 2 text

Information speakerdeck.com OR x.com/ymg_aq

Slide 3

Slide 3 text

About Me Yu Yamaguchi CTO / Director of AI, Turing Inc. ● Former researcher at AIST and NIST, developing AI for Go and Shogi. ● Joined Turing Inc. in 2022 as a founding member after serving as an executive officer at a public company. ● Leads AI research for autonomous driving.

Slide 4

Slide 4 text

Turing Inc. Overview ● Business: Development of fully autonomous vehicles, aiming to achieve it through generative AI. ● Founded: August 2021 ● CEO: Issei Yamamoto ● Total Funding: $50 million ● Employees: 50+

Slide 5

Slide 5 text

Contents ● Multimodal Models ○ Trends in recent large-scale models ● Autonomous Driving Technology ○ The DARPA Challenge and its legacy ● Multimodal × Autonomous Driving ○ Applications of multimodal AI centered around LLMs

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Multimodal Models ● LLMs as the Core of Cognition ○ Since CLIP [Radford et al., 2021], technologies that connect specific modalities with language models have advanced significantly. ○ Building on a pretrained LLM can greatly reduce training costs. Figure: Representative Multimodal Models [Zhang et al., 2024]

Slide 8

Slide 8 text

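Training from Vision and Language Image + Language → Training → Multimodal model
Example caption (translated from Japanese): "The image shows a snow-covered city street. A snowy mountain is visible in the distance, and judging from the sign, it appears to be Mt. Yotei in Hokkaido. The speed limit is 40 km/h, but because of the snow, you should drive at a reduced speed."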

Slide 9

Slide 9 text

Mechanism of Multimodal Models (recreated from [Zhang+ 2024] Fig. 2)
Multimodal Understanding
● Inputs: text, image, video, audio
● Encoder ○ Image / Video: NFNet-F6, ViT, CLIP ViT, Eva-CLIP ViT, ... ○ Audio: C-Former, HuBERT, BEATs, ...
● Input Projector (Adapter): Linear Projector, MLP, Cross-attention, Q-Former, P-Former, MQ-Former, ...
● LLM Backbone: Flan-T5, UL2, Qwen, OPT, LLaMA, LLaMA-2, Vicuna, ...
Multimodal Generation
● Output Projector: Tiny Transformer, MLP, ...
● Generator ○ Image: Stable Diffusion ○ Video: Zeroscope ○ Audio: AudioLDM, ...
● Outputs: text, image, video, audio

Slide 10

Slide 10 text

Vision-Language Models (VLMs) The mainstream approach connects a pretrained LLM with a vision encoder using an adapter. Wang, Jianfeng, et al. "Git: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).

Slide 11

Slide 11 text

How to Tokenize Image Features
● Using a projector: GIT [Wang+], LLaVA [Liu+], ... ○ Image Encoder → Feature vectors → Projector (MLP) → Image tokens ○ Image tokens and language tokens are fed together into the Transformer.
● Using cross-attention: BLIP2 [Li+], Flamingo [Alayrac+] ○ Image Encoder → Adapter → Special tokens ○ The Transformer processes language tokens and attends to the image features through the adapter.
Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
Li, Junnan, et al. "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023.
Wang, Jianfeng, et al. "Git: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).
Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2023.
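The projector route can be sketched in a few lines of plain Python. This is a minimal illustration, not the GIT or LLaVA implementation: the dimensions, the random weights, and the names `mlp_projector` and `random_matrix` are all assumptions made for the example. It shows only the data flow: encoder features pass through a small MLP so they match the LLM embedding size, then are concatenated with language tokens.

```python
import random

def mlp_projector(feature, w1, w2):
    """Two-layer MLP with ReLU: maps one image feature vector to the LLM embedding size."""
    hidden = [max(0.0, sum(f * w for f, w in zip(feature, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or real embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # illustrative sizes only

w1 = random_matrix(HIDDEN_DIM, VISION_DIM, rng)
w2 = random_matrix(LLM_DIM, HIDDEN_DIM, rng)

# Stand-ins for the image encoder output (4 patches) and embedded text (3 tokens).
patch_features = random_matrix(4, VISION_DIM, rng)
language_tokens = random_matrix(3, LLM_DIM, rng)

# Project each patch feature into the LLM embedding space, then build one sequence.
image_tokens = [mlp_projector(f, w1, w2) for f in patch_features]
sequence = image_tokens + language_tokens
```

After projection every element of `sequence` has the same width, which is what lets the Transformer treat image tokens and language tokens uniformly.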

Slide 12

Slide 12 text

Flamingo [Alayrac+ 2022] A model that processes images, videos, and text together, enabling few-shot learning. ● Image encoder + LLM ○ Pretrained CLIP and Chinchilla [Hoffmann+ 2022] ○ Add and train Gated Cross-Attention as a projector. ○ Efficiently convert images and videos into fixed-length tokens using Perceiver [Jaegle+ 2021].
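The gating idea can be illustrated with a toy example. In the Flamingo paper the cross-attention output is scaled by tanh of a learnable scalar initialized at zero, so the pretrained LLM's behavior is unchanged when training starts. The sketch below assumes a single head with no learned projections, and the function names are hypothetical; it is not the Flamingo code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, visual_tokens):
    """Single-head attention of one text vector over visual tokens (no learned weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in visual_tokens]
    weights = softmax(scores)
    dim = len(visual_tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, visual_tokens)) for i in range(dim)]

def gated_cross_attention(text_vec, visual_tokens, alpha):
    """Residual update scaled by tanh(alpha); alpha = 0 leaves the text vector untouched."""
    attended = cross_attention(text_vec, visual_tokens)
    gate = math.tanh(alpha)
    return [t + gate * a for t, a in zip(text_vec, attended)]

text_vec = [1.0, 0.0]
visual_tokens = [[0.5, 0.5], [1.0, -1.0]]

closed = gated_cross_attention(text_vec, visual_tokens, alpha=0.0)  # gate closed
opened = gated_cross_attention(text_vec, visual_tokens, alpha=1.0)  # visual info mixed in
```

With `alpha=0.0` the layer is an identity on the text stream; as `alpha` moves away from zero during training, visual information is blended in gradually.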

Slide 13

Slide 13 text

LLaVA [Liu+ 2023] Achieves high performance with high-quality instruction-tuning data for image-language tasks. ● Instruction-Tuning Data ○ Generate a large amount of instruction-following data for COCO dataset images using GPT-4.

Slide 14

Slide 14 text

Heron [Tanahashi+ 2023] ● Add visual modules to pretrained LLMs in any combination. ● Train a 73B-parameter vision-language model.

Slide 15

Slide 15 text

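Heron [Tanahashi+ 2023] * Image used in the demo from the GPT-4 technical report.
Model output (translated from Japanese): "In the image, a yellow taxi is parked on the road, and a man in a yellow shirt is sitting on top of it. On the taxi's cargo bed, he is doing various tasks such as ironing and folding laundry. The amusing point of this scene is the presence of an iron placed on top of the taxi."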

Slide 16

Slide 16 text

Libraries for multimodal learning ● “Heron”: https://github.com/turingmotors/heron ● “vlm-recipes”: https://github.com/turingmotors/vlm-recipes

Slide 17

Slide 17 text

How to Create “Image Tokens” ● VQ-VAE ● TiTok
Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in Neural Information Processing Systems 30 (2017).
Yu, Qihang, et al. "An Image is Worth 32 Tokens for Reconstruction and Generation." The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024, https://openreview.net/forum?id=tOXoQPRzPL.
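The core step that turns continuous image features into discrete "image tokens" in a VQ-VAE is a nearest-neighbor lookup in a learned codebook. The sketch below shows only that lookup, with a tiny hand-written codebook standing in for learned embeddings; the encoder, decoder, and training losses are omitted, and the function names are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

def dequantize(token_ids, codebook):
    """Map token ids back to codebook vectors (the input a decoder would receive)."""
    return [codebook[i] for i in token_ids]

# Hypothetical 4-entry codebook and 3 encoder feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

token_ids = quantize(features, codebook)       # [0, 3, 2]
reconstructed = dequantize(token_ids, codebook)
```

The integer ids in `token_ids` are what a Transformer can consume in the same way as text token ids.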

Slide 18

Slide 18 text

Interleaved Text-and-Image Generation Chameleon [C Team+ 2024] understands and generates data in which text and images are interleaved, using a single consistent model. Team, Chameleon. "Chameleon: Mixed-modal early-fusion foundation models." arXiv preprint arXiv:2405.09818 (2024).

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Levels of Autonomous Driving
● Level 0: No autonomous driving.
● Level 1: System controls accelerator/brake or steering wheel.
● Level 2: System controls accelerator/brake and steering wheel.
● Level 3: System drives in specific conditions (driver required).
● Level 4: System drives in specific conditions.
● Level 5: Fully autonomous driving.
Status: Levels 1 and 2 are equipped in many commercial vehicles (e.g., cruise control). Some commercial services are available at Level 4. Humanity has yet to achieve Level 5.

Slide 21

Slide 21 text

History of Autonomous Driving (2004~2024)
● 2004: The DARPA Grand Challenge
● 2007: CMU won the DARPA Urban Challenge.
● 2009: Google X self-driving car project
● 2011: Nevada authorized self-driving cars on public roads.
● 2014: Tesla began developing Autopilot.
● 2015: SAE defined autonomous driving levels.
● 2018: Waymo launched a commercial self-driving taxi service.
● 2020: Honda launched a Level 3 autonomous vehicle.
● 2021: Waymo began Level 4 services.
● 2024: Tesla released FSD 12 with an end-to-end system.

Slide 22

Slide 22 text

DARPA Grand Challenge (2004-2007) A competition for autonomous vehicles organized by the U.S. DARPA. ● 2004 Grand Challenge ○ 240 km course in the Mojave Desert ○ No team finished the race (the best vehicle covered only 12 km). ● 2005 Grand Challenge ○ 212 km off-road course ○ 5 teams completed the course. ● 2007 Urban Challenge ○ 96 km course designed to simulate urban environments → Participants went on to Waymo, Zoox, Argo, Nuro, Aurora, etc. Photo: the vehicle from CMU that won the 2007 DARPA Urban Challenge. [robot.watch.impress.co.jp/cda/news/2007/11/08/733.html]

Slide 23

Slide 23 text

LiDAR + HD mapping technology (2010~) Utilized for advanced autonomous driving at Level 3 or 4. ● Combines LiDAR sensors with high-precision 3D maps. → High cost of map creation and sensors. Figures: high-precision 3D maps; point cloud data captured by LiDAR sensors.

Slide 24

Slide 24 text

LiDAR-based autonomous driving Inputs: Image, Point Cloud, HD maps [https://paperswithcode.com/dataset/nuscenes]
● Perception: object recognition, sign recognition, lane recognition
● Prediction: motion prediction, future map prediction, traffic agents
● Planning: search problem, path planning
● Control: control algorithms
Modules operate independently by function → Difficult to achieve overall optimization.
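The modular structure above can be sketched as four independent functions with hand-designed interfaces between them. This is a deliberately tiny one-dimensional toy, assuming a single lane and a constant-velocity forecast; the thresholds, gains, and all names are hypothetical and not taken from any production stack. The point is structural: each stage only sees the previous stage's output, so no stage is tuned for the final driving quality.

```python
from dataclasses import dataclass

@dataclass
class Obstacle:
    distance_m: float  # gap to the obstacle ahead
    speed_mps: float   # obstacle speed along the lane

def perceive(sensor_frame):
    """Perception: turn (already parsed) sensor data into a list of objects."""
    return [Obstacle(d, v) for d, v in sensor_frame["detections"]]

def predict(obstacles, ego_speed_mps, horizon_s):
    """Prediction: constant-velocity forecast of the gap to each obstacle."""
    return [o.distance_m + (o.speed_mps - ego_speed_mps) * horizon_s for o in obstacles]

def plan(future_gaps_m, ego_speed_mps, safe_gap_m):
    """Planning: choose a target speed; slow down if any predicted gap is too small."""
    if any(gap < safe_gap_m for gap in future_gaps_m):
        return max(0.0, ego_speed_mps - 2.0)
    return ego_speed_mps

def control(target_speed_mps, ego_speed_mps, gain):
    """Control: proportional controller producing an acceleration command."""
    return gain * (target_speed_mps - ego_speed_mps)

def drive_step(sensor_frame, ego_speed_mps):
    obstacles = perceive(sensor_frame)
    gaps = predict(obstacles, ego_speed_mps, horizon_s=2.0)
    target = plan(gaps, ego_speed_mps, safe_gap_m=10.0)
    return control(target, ego_speed_mps, gain=0.5)

# A slow vehicle 20 m ahead: predicted gap is 6 m, so the command is to decelerate.
braking = drive_step({"detections": [(20.0, 5.0)]}, ego_speed_mps=12.0)
# A distant vehicle at the same speed: no change requested.
cruising = drive_step({"detections": [(100.0, 12.0)]}, ego_speed_mps=12.0)
```

An error in `perceive` propagates unchanged through the later stages, which is one way to read the "difficult to achieve overall optimization" remark.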

Slide 25

Slide 25 text

The Rise of Deep Learning (2012~) Starting with image recognition, DNNs became mainstream. ● Image recognition (2012) ○ AlexNet dominated image recognition. ○ The foundation of modern convolutional neural networks. ● Defeated the world champion in Go (2016) ○ DeepMind's AlphaGo surpassed human performance. ○ Shown to be effective in tasks that require intelligence. Figures: In 2017, Ke Jie played against AlphaGo. [www.youtube.com/watch?v=1U1p4Mwis60] The roots of CNNs: AlexNet's architecture. [Krizhevsky+ 2017]

Slide 26

Slide 26 text

DAVE-2 [Bojarski+ 2016] ● NVIDIA developed an automotive SoC capable of running CNNs at 30 fps, enabling autonomous driving. ● Collected 72 hours of data and successfully drove 10 miles hands-free. Figure: overview of the data collection system (NVIDIA DrivePX, 24 TOPS). www.youtube.com/watch?v=NJU9ULQUwng

Slide 27

Slide 27 text

End-to-end Autonomous Driving
● End-to-end: Multi-camera images → Neural Network → Vehicle path. An end-to-end model outputs driving paths directly from images.
● Modular: Images, Point Cloud, HD Maps → Perception (object recognition, sign recognition, lane recognition) → Prediction (motion prediction, future map prediction, traffic agents) → Planning (search problem, path planning) → Control (control algorithms). Processes inputs such as sensor data and high-precision maps in separate modules.

Slide 28

Slide 28 text

UniAD [Hu+ 2023] An end-to-end framework that learns vehicle control using only cameras. Optimizes all modules simultaneously. Selected as CVPR 2023 Best Paper.

Slide 29

Slide 29 text

Tesla FSD v12~ Tesla's latest autonomous driving system, deployed in the US. Transitioned to end-to-end in v12, eliminating 300,000 lines of code. The car naturally avoids puddles, even without being explicitly trained to do so. [x.com/AIDRIVR/status/1760841783708418094?s=20]

Slide 30

Slide 30 text

Gen-3 Autonomous Driving Tasks (2023~) Autonomous driving research is shifting to natural language situational understanding with generative AI. [Li+ 2024]
● Gen 1 (CNN, 2012~): Front cam, LiDAR. Example: KITTI [Geiger+ 2012]
● Gen 2 (Transformer, 2019~): Multi-cam, LiDAR, Radar, HD maps. Example: nuScenes [Caesar+ 2019]
● Gen 3 (LLM, 2023~): Multi-cam, Language. Example: DriveLM [Sima+ 2023]

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Complex Traffic Scene Understanding

Slide 33

Slide 33 text

Complex Traffic Scene Understanding ● Understanding text-based signs ● Pedestrian OR traffic controller ● Traffic controllers and traffic signals ● Traffic area restrictions Humans can instantly understand the "context".

Slide 34

Slide 34 text

Handling Edge Cases is Essential [Figure: frequency versus difficulty of driving scenarios, showing the ranges covered by ADAS, end-to-end models, and multimodal models.]

Slide 35

Slide 35 text

LLM in Vehicle [Tanahashi+ 2023] Pioneered LLM in Vehicle, using LLMs to directly control cars (Jun. 2023). ● Object detection + GPT-4 + control. ● Handles complex instructions and decisions ○ “Go to the cone that is the same color as a banana.” ○ “Turning right causes an accident involving one person, while turning left involves five.” Photo: the LLM in Vehicle demo vehicle.

Slide 36

Slide 36 text

LingoQA [Marcu+ 2023] Uses a VLM to enable situational understanding and driving decision-making within a question-and-answer framework. Marcu, Ana-Maria, et al. "Lingoqa: Video question answering for autonomous driving." arXiv preprint arXiv:2312.14115 (2023).

Slide 37

Slide 37 text

LMDrive [Shao+ 2023] Achieved end-to-end driving control using only a language model, enabling driving in a simulator environment.

Slide 38

Slide 38 text

DriveVLM [Tian+ 2024] Performs scene understanding and planning within the language model, similar to CoT, while integrating with existing autonomous driving systems.

Slide 39

Slide 39 text

RT-2 [Brohan+ 2023] Fine-tunes a pre-trained VLM with action data from a robot arm. Proposed a new paradigm called the Vision-Language-Action (VLA) model. Zitkovich, Brianna, et al. "Rt-2: Vision-language-action models transfer web knowledge to robotic control." CoRL 2023.

Slide 40

Slide 40 text

CoVLA [Arai+ 2024] Comprehensive dataset integrating Vision, Language, and Actions.
● Vision: 30s x 10,000 videos
● Language: Frame-level captions ○ Behavior captions (rule-based algorithm) ○ Reasoning captions (VLM): scene recognition, object of concern ○ Example: “The ego vehicle is moving slowly and turning right. There is a traffic light displaying a green signal …”
● Action: Future trajectories ○ Reconstructed trajectory (sensor fusion)
● Sensor signals ○ Control information: throttle/brake position, steering angle, turn signal ○ Radar: leading vehicle (position, speed) ○ Object detection model: traffic light (position, signal)
Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
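To make "rule-based algorithm → behavior captions" concrete, here is a minimal sketch of how control signals could be mapped to a caption in the style of the example above. The thresholds, the sign convention (positive steering angle means a right turn), and the function names are all assumptions for illustration; the actual CoVLA rules are not given on the slide.

```python
def speed_phrase(speed_kmh):
    """Map vehicle speed to a phrase. Thresholds are hypothetical."""
    if speed_kmh < 1.0:
        return "stopped"
    if speed_kmh < 20.0:
        return "moving slowly"
    if speed_kmh < 60.0:
        return "moving at a moderate speed"
    return "moving at a high speed"

def steering_phrase(steering_deg):
    """Map steering angle to a phrase. Assumes positive angles mean a right turn."""
    if steering_deg > 15.0:
        return "turning right"
    if steering_deg < -15.0:
        return "turning left"
    return "going straight"

def behavior_caption(speed_kmh, steering_deg):
    """Compose a frame-level behavior caption from two control signals."""
    speed = speed_phrase(speed_kmh)
    if speed == "stopped":
        return "The ego vehicle is stopped."
    return f"The ego vehicle is {speed} and {steering_phrase(steering_deg)}."

caption = behavior_caption(speed_kmh=12.0, steering_deg=30.0)
# "The ego vehicle is moving slowly and turning right."
```

Because the rules read only sensor signals, captions of this kind can be generated for every frame without human labeling; the VLM is then needed only for the reasoning captions.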

Slide 41

Slide 41 text

CoVLA [Arai+ 2024] Taking situational understanding with VLMs a step further by enabling the model to directly output driving actions.
● Ground truth caption: The ego vehicle is moving straight at a moderate speed following leading car with acceleration. There is a traffic light near the ego vehicle displaying a green signal. …
● Predicted caption: The ego vehicle is moving at a moderate speed and turning right. There is a traffic light near the ego vehicle displaying a green signal. …
Figure legend: trajectory predicted by the VLAM; actual trajectory.
Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
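A predicted trajectory and the actual trajectory are commonly compared with average displacement error (ADE) and final displacement error (FDE). The slide does not state which metric is used, so the following is a generic sketch of those two standard measures, assuming both trajectories are lists of (x, y) waypoints in meters sampled at the same timestamps.

```python
import math

def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for two equally sampled 2D trajectories."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and the same length")
    dists = [
        math.hypot(px - ax, py - ay)
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    ade = sum(dists) / len(dists)  # mean error over all waypoints
    fde = dists[-1]                # error at the final waypoint
    return ade, fde

# Hypothetical example: the model predicts going straight while the car drifts sideways.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, actual)  # (1.0, 2.0)
```

A wrong caption such as "turning right" instead of "moving straight" would typically show up as a growing per-waypoint distance, and therefore as an FDE larger than the ADE.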

Slide 42

Slide 42 text

Roadmap for the VLA model.
1. VLM: Build a large-scale dataset and train a state-of-the-art open model. (Heron-VILA-13B)
2. Driving data: Collect and curate 3,000 hours of 3D data. (JAVLA-Dataset)
3. VLA models: Spatial awareness and understanding the physical world = Embodied AI

Slide 43

Slide 43 text

GameNGen [Valevski+ 2024] Builds a real-time world model using diffusion models. Valevski, Dani, et al. "Diffusion Models Are Real-Time Game Engines." arXiv preprint arXiv:2408.14837 (2024). https://www.youtube.com/watch?v=O3616ZFGpqw

Slide 44

Slide 44 text

World Model [Ha+ 2018] A model that constructs internal representations to understand, predict, and learn from the surrounding environment (= an internal model). Figures: VMC Model [D. Ha+ 2018]; abstracting oneself riding a bicycle.

Slide 45

Slide 45 text

GAIA-1 [Hu+ 2023] A world model for autonomous driving that predicts driving states and generates future visuals. ● Extended to multimodal capabilities, including language and video. ○ Converts videos into discrete tokens so that Transformers can process them like language tokens. Figure: GAIA-1 action conditioning.

Slide 46

Slide 46 text

Terra [Arai+ 2024] ● Can generate outputs based on any specified driving route. ● Exhibits very high instruction-following capability. Inputs: current scene + driving route.

Slide 47

Slide 47 text

Slide 48

Slide 48 text

“The fundamental approach surpasses traditional systems.” The Shogi AI “Ponanza,” developed by CEO Yamamoto, improved through machine learning at a pace surpassing rule-based systems. In 2017, it became the first in Japan to defeat a reigning Shogi Grandmaster. Photo: CEO Yamamoto and the Shogi AI “Ponanza”. [Figure: performance versus technological progress, with "Today" marked; the AI model shows exponential growth, the rule-based model shows linear growth.]

Slide 49

Slide 49 text

No content