• Former researcher at AIST and NIST, developing AI for Go and Shogi.
• Joined Turing Inc. in 2022 as a founding member after serving as an executive officer at a public company.
• Leads AI research for autonomous driving.
• Autonomous Driving Technology
  ◦ The DARPA Challenge and its legacy
• Multimodal × Autonomous Driving
  ◦ Applications of multimodal AI centered around LLMs
Since CLIP [Radford et al., 2021], technologies that connect specific modalities with language models have advanced significantly.
◦ Using LLMs can greatly reduce training costs.
[Figure: Representative multimodal models. [Zhang et al., 2024]]
GIT [Wang+ 2022]: connects a language model with a vision encoder using an adapter.
Wang, Jianfeng, et al. "GIT: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).
Two common ways to connect an image encoder to an LLM:
• Using a projector: image tokens from the encoder are projected and fed into the Transformer together with the language tokens. GIT [Wang+], LLaVA [Liu+] (a minimal sketch follows below).
• Using cross-attention: the language tokens attend to the encoder's image features through added attention layers. BLIP-2 [Li+], Flamingo [Alayrac+].
Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023.
Wang, Jianfeng, et al. "GIT: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).
Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2024.
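To make the projector route concrete, here is a minimal sketch assuming a LLaVA-style adapter; the dimensions and the MLP shape are illustrative assumptions, not any particular model's exact configuration:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative LLaVA/GIT-style adapter: maps vision-encoder
    features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP; a single linear layer is also common.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(image_feats)  # (batch, num_patches, llm_dim)

# The projected "image tokens" are concatenated with the text-token
# embeddings and fed to the LLM as one sequence.
image_tokens = VisionProjector()(torch.randn(1, 256, 1024))
text_embeds = torch.randn(1, 32, 4096)  # stand-in for LLM token embeddings
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
```

Only the projector needs to be trained from scratch, which is one reason this route greatly reduces training costs.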
Flamingo [Alayrac+ 2022]: interleaved inputs of images and text simultaneously enable few-shot learning.
• Image encoder + LLM
  ◦ Pretrained CLIP and Chinchilla [Hoffmann+ 2022].
  ◦ Adds and trains Gated Cross-Attention as a projector (see the sketch after this list).
  ◦ Efficiently converts images and videos into fixed-length tokens using a Perceiver [Jaegle+ 2021].
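A minimal sketch of the gated cross-attention idea, with dimensions chosen purely for illustration; the zero-initialized tanh gate is the detail that lets the frozen LLM start out unchanged:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention sketch: language tokens
    attend to visual tokens, and a tanh gate initialized at zero
    blends the visual pathway in gradually during training."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return text + torch.tanh(self.gate) * attended

# The visual tokens would come from a Perceiver-style resampler that
# compresses any number of frame features to a fixed-length set;
# random tensors stand in for both streams here.
text = torch.randn(1, 32, 512)
visual = torch.randn(1, 64, 512)
out = GatedCrossAttention()(text, visual)
```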
Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in Neural Information Processing Systems 30 (2017).
Yu, Qihang, et al. "An Image is Worth 32 Tokens for Reconstruction and Generation." NeurIPS 2024, https://openreview.net/forum?id=tOXoQPRzPL.
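As a reminder of what the VQ-VAE line of work cited above provides, here is a minimal sketch of the core quantization step; the codebook size and latent shape are arbitrary for illustration:

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour lookup at the heart of VQ-VAE: each continuous
    latent vector is replaced by the id of the closest codebook entry,
    turning an image into discrete tokens a Transformer can model
    like text.

    z:        (N, D) continuous latents from the encoder
    codebook: (K, D) learned embedding table
    returns:  (N,)   integer token ids
    """
    dists = torch.cdist(z, codebook)  # pairwise L2 distances, (N, K)
    return dists.argmin(dim=1)

codebook = torch.randn(1024, 64)             # K = 1024 codes, D = 64
latents = torch.randn(16 * 16, 64)           # a 16x16 latent grid
tokens = vector_quantize(latents, codebook)  # 256 discrete image tokens
```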
• Autonomous Driving Technology
  ◦ The DARPA Challenge and its legacy
• Multimodal × Autonomous Driving
  ◦ Applications of multimodal AI centered around LLMs
SAE levels of driving automation:
• Level 0: No autonomous driving.
• Level 1: System controls accelerator/brake or steering wheel. Equipped in many commercial vehicles (e.g., cruise control).
• Level 2: System controls accelerator/brake and steering wheel.
• Level 3: System drives in specific conditions (driver required).
• Level 4: System drives in specific conditions. Some commercial services available.
• Level 5: Fully autonomous driving. Humanity has yet to achieve this.
2007: CMU won the DARPA Urban Challenge.
2009: Google X started its self-driving car project.
2011: Nevada authorized self-driving cars on public roads.
2014: Tesla began developing Autopilot.
2015: SAE defined autonomous driving levels.
2018: Waymo launched a commercial self-driving taxi service.
2020: Honda launched a Level 3 autonomous vehicle.
2021: Waymo began Level 4 services.
2024: Tesla released FSD v12 with an end-to-end system.
Autonomous driving competitions held by the U.S. DARPA.
• 2004 Grand Challenge
  ◦ 240 km course in the Mojave Desert.
  ◦ No team finished the race (the best vehicle covered only 12 km).
• 2005 Grand Challenge
  ◦ 212 km off-road course.
  ◦ 5 teams completed the course.
• 2007 Urban Challenge
  ◦ 96 km course designed to simulate urban environments.
→ Alumni went on to create Waymo, Zoox, Argo, Nuro, Aurora, etc.
[Photo: The vehicle from CMU that won the 2007 DARPA Urban Challenge. robot.watch.impress.co.jp/cda/news/2007/11/08/733.html]
Conventional autonomous driving at Level 3 or 4 combines LiDAR sensors with high-precision 3D maps.
→ High cost of map creation and sensors.
[Figure: High-precision 3D maps; point cloud data captured by LiDAR sensors.]
DNNs became the mainstream.
• Image recognition (2012)
  ◦ AlexNet dominated image recognition.
  ◦ The foundation of modern convolutional neural networks.
• Defeated the world champion in Go (2016)
  ◦ DeepMind's AlphaGo surpassed human performance.
  ◦ Showed that deep learning is effective in intelligent tasks.
[Photo: In 2017, Ke Jie played against AlphaGo. www.youtube.com/watch?v=1U1p4Mwis60]
[Figure: The roots of modern CNNs: AlexNet's architecture. [Krizhevsky+ 2017]]
Hardware capable of running CNNs at 30 fps enabled end-to-end autonomous driving.
• Collected 72 hours of data and successfully drove 10 miles hands-free (a rough network sketch follows below).
[Figure: Overview of the data collection system (NVIDIA Drive PX, 24 TOPS). www.youtube.com/watch?v=NJU9ULQUwng]
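A rough sketch of this kind of image-to-steering network, loosely following NVIDIA's PilotNet; the layer sizes and the 66x200 input resolution approximate the 2016 paper and should be treated as assumptions:

```python
import torch
import torch.nn as nn

# Convolutions over the camera frame regress a single steering value;
# the whole pipeline is trained end to end from human driving data.
pilotnet = nn.Sequential(
    nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
    nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
    nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
    nn.Conv2d(48, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 1),  # predicted steering angle
)

frame = torch.randn(1, 3, 66, 200)  # one preprocessed camera frame
steering = pilotnet(frame)
```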
Tesla FSD v12: Tesla's latest autonomous driving system, deployed in the US.
• Transitioned to an end-to-end architecture in v12, replacing some 300,000 lines of hand-written code with a network that learns driving directly from data.
[Video: x.com/AIDRIVR/status/1760841783708418094?s=20]
• Autonomous Driving Technology
  ◦ The DARPA Challenge and its legacy
• Multimodal × Autonomous Driving
  ◦ Applications of multimodal AI centered around LLMs
LLM in Vehicle: used LLMs to directly control cars (Jun. 2023).
• Object detection + GPT-4 + control (a hypothetical pipeline sketch follows below).
• Handles complex instructions and decisions:
  ◦ “Go to the cone that is the same color as a banana.”
  ◦ “Turning right causes an accident involving one person, while turning left involves five.”
[Photo: Demo vehicle for LLM in Vehicle]
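A hypothetical sketch of such a perception + LLM + control pipeline; the detection schema, the JSON command format, and the `llm` callable are all invented for illustration (any chat-completion API could fill that role):

```python
import json

def decide_action(detections: list, instruction: str, llm) -> dict:
    """Serialize perception output into the prompt and ask the LLM
    for a structured driving command."""
    prompt = (
        "You control a car. Detected objects (class, color, bearing_deg):\n"
        f"{json.dumps(detections)}\n"
        f"Instruction: {instruction}\n"
        'Reply only as JSON: {"steer": -1..1, "speed_mps": float, "reason": str}'
    )
    return json.loads(llm(prompt))

# Example with a stubbed LLM standing in for GPT-4:
fake_llm = lambda p: '{"steer": 0.4, "speed_mps": 2.0, "reason": "yellow cone ahead right"}'
cmd = decide_action(
    [{"class": "cone", "color": "yellow", "bearing_deg": 30},
     {"class": "cone", "color": "blue", "bearing_deg": -20}],
    "Go to the cone that is the same color as a banana.",
    fake_llm,
)
```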
LingoQA [Marcu+ 2023]: frames autonomous driving decision-making within a question-and-answer framework.
Marcu, Ana-Maria, et al. "LingoQA: Video question answering for autonomous driving." arXiv preprint arXiv:2312.14115 (2023).
RT-2 [Zitkovich+ 2023]: a vision-language model fine-tuned to output action tokens for a robot arm, transferring web knowledge to robotic control.
• Proposed a new paradigm called the Vision-Language-Action (VLA) model (an action-tokenization sketch follows below).
Zitkovich, Brianna, et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." CoRL 2023.
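The key trick is representing actions as tokens. A sketch of RT-2-style action discretization, where each continuous dimension maps to one of 256 bins; the action dimensions and value ranges below are invented for illustration:

```python
import numpy as np

N_BINS = 256  # RT-2 discretizes each action dimension into 256 bins

def action_to_tokens(action, low, high):
    """Map continuous action values to integer bin ids the model can emit."""
    normed = (np.asarray(action) - low) / (high - low)  # -> [0, 1]
    return np.clip((normed * (N_BINS - 1)).round(), 0, N_BINS - 1).astype(int)

def tokens_to_action(tokens, low, high):
    """Invert the mapping when decoding the model's output tokens."""
    return low + (np.asarray(tokens) / (N_BINS - 1)) * (high - low)

low, high = np.array([-1.0, 0.0]), np.array([1.0, 30.0])  # e.g. steer, speed
tokens = action_to_tokens([0.25, 12.0], low, high)
recovered = tokens_to_action(tokens, low, high)  # ~[0.25, 12.0]
```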
CoVLA: a comprehensive dataset integrating Vision, Language, and Action (a hypothetical record layout is sketched below).
• Vision: 30 s × 10,000 videos, with radar (leading-vehicle position and speed) and an object detection model (e.g., traffic-light position and signal).
• Language: frame-level captions (scene recognition, behavior, reasoning) generated by a rule-based algorithm and a VLM. Example: “The ego vehicle is moving slowly and turning right. There is a traffic light displaying a green signal …”
• Action: future trajectories, reconstructed by sensor fusion from sensor signals and control information (throttle/brake position, steering angle, turn signal), plus annotated objects of concern.
Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
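A hypothetical record layout mirroring the components above; the field names are illustrative, not CoVLA's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VLASample:
    video_path: str        # 30 s front-camera clip
    caption: str           # scene / behavior / reasoning text
    trajectory: list       # future (x, y) waypoints
    throttle: float        # control signals logged from the vehicle
    steering_angle: float
    turn_signal: int       # 0 = off, 1 = left, 2 = right

sample = VLASample(
    video_path="clip_0001.mp4",
    caption="The ego vehicle is moving slowly and turning right. "
            "There is a traffic light displaying a green signal.",
    trajectory=[(0.0, 0.0), (1.2, 0.4), (2.3, 1.1)],
    throttle=0.15,
    steering_angle=8.5,
    turn_signal=2,
)
```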
CoVLA [Arai+ 2024]: trained a VLA model on the dataset, enabling the model to directly output driving actions.
Ground-truth caption: “The ego vehicle is moving straight at a moderate speed, following the leading car with acceleration. There is a traffic light near the ego vehicle displaying a green signal. …”
Predicted caption: “The ego vehicle is moving at a moderate speed and turning right. There is a traffic light near the ego vehicle displaying a green signal. …”
[Figure: Trajectory predicted by the VLA model vs. the actual trajectory.]
Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
3. VLA models: build a large-scale dataset and train a state-of-the-art open model.
• Collect and curate 3,000 hours of 3D data (JAVLA-Dataset).
• Spatial awareness and understanding of the physical world = Embodied AI (Heron-VILA-13B).
A world model that predicts driving states and generates future visuals.
• Extended to multimodal capabilities, including language and video.
  ◦ Converts videos into discrete tokens so that a Transformer can process them like language tokens (a minimal sketch follows below).
[Figure: GAIA-1 action conditioning]
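A minimal sketch of this token-based world modeling, assuming GAIA-1-style autoregression; sharing one vocabulary for image and action tokens, and all sizes here, are simplifying assumptions:

```python
import torch
import torch.nn as nn

VOCAB = 1024 + 256  # image codes plus action bins in one vocabulary

class TinyWorldModel(nn.Module):
    """Decoder-only Transformer over interleaved frame and action
    tokens; predicting the next frame's tokens amounts to predicting
    the future state of the scene."""

    def __init__(self, dim: int = 256, layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.transformer(x, mask=mask)  # causal self-attention
        return self.head(x)                 # next-token logits

# [frame0 tokens | action0 | frame1 tokens | action1 | ...] -> next tokens
seq = torch.randint(0, VOCAB, (1, 3 * 65))  # 64 image tokens + 1 action each
logits = TinyWorldModel()(seq)
```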
• Autonomous Driving Technology
  ◦ The DARPA Challenge and its legacy
• Multimodal × Autonomous Driving
  ◦ Applications of multimodal AI centered around LLMs
The Shogi AI “Ponanza,” developed by CEO Yamamoto, improved through machine learning at a pace surpassing rule-based systems. In 2017, it became the first AI in Japan to defeat a reigning Shogi grandmaster.
[Photo: CEO Yamamoto and the Shogi AI “Ponanza”]
[Chart: Performance over technological progress; the AI model grows exponentially while the rule-based model grows linearly, with the AI model ahead today.]