
BEV Perception with Transformers
- 3D recognition with BEV: BEVFormer [Li+, ECCV2022]
- End-to-end autonomous driving with BEV: UniAD [Hu+, CVPR2023], PARA-Drive [Weng+, CVPR2024]
- End-to-end + LLM autonomous driving with BEV: TOKEN [Tian+, CoRL2024]
- Self-supervised learning of BEV: ViDAR [Yang+, CVPR2024]

Hironobu Fujiyoshi

October 23, 2024

Transcript

1. Vision Transformer (ViT) [Dosovitskiy+, ICLR2021]
   - A model that applies the Transformer to image recognition
   - Splits the image into fixed-size patches
   - Computes features via self-attention between the patches (sketch below)
   - Achieves SoTA on classification tasks such as ImageNet
   (Figure: how the Vision Transformer works — linear projection of flattened patches, patch + position embedding, Transformer encoder, MLP head; scaled dot-product and multi-head attention)
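A minimal sketch of the ViT flow described on this slide (patch split → self-attention between patches → classification head). The hyperparameters (image size 224, patch 16, 256-dim tokens, 4 layers) are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8, n_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # "Linear projection of flattened patches" implemented as a strided convolution
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, img):                                    # img: (B, 3, H, W)
        x = self.to_patches(img).flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed        # patch + position embedding
        x = self.encoder(x)                                    # self-attention between patches
        return self.head(x[:, 0])                              # classify from the [class] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                # -> (2, 1000)
```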
2. Vision and Language Model (VLM)
   - An AI model that processes visual and text information together to generate output
   - Self-Attention: represents features from the relations between image patches (input tokens)
   - Cross-Attention: obtains features from the relations between the image's output tokens and the question text (sketch below)
   (Figure: Image Encoder (ViT) producing output tokens that a Text Decoder (LLM) attends to via cross-attention; text input = question, text output = answer)
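A hedged sketch of the VLM pattern on this slide: text tokens use self-attention among themselves and cross-attention over the image encoder's output tokens. Shapes and the use of `nn.MultiheadAttention` are illustrative assumptions, not a specific model's API.

```python
import torch
import torch.nn as nn

dim = 256
image_tokens = torch.randn(1, 196, dim)   # output tokens of the image encoder (ViT)
text_tokens = torch.randn(1, 12, dim)     # embedded question tokens

self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

# Self-attention: relations among the text tokens themselves
text_ctx, _ = self_attn(text_tokens, text_tokens, text_tokens)
# Cross-attention: text tokens query the image tokens (keys/values)
fused, _ = cross_attn(query=text_ctx, key=image_tokens, value=image_tokens)
print(fused.shape)  # torch.Size([1, 12, 256])
```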
3. BEV Perception with Transformers (outline)
   - What is Bird's-Eye-View (BEV) space?
   - BEV Perception
     - 3D recognition with BEV: BEVFormer [Li+, ECCV2022]
     - End-to-end autonomous driving with BEV: UniAD [Hu+, CVPR2023], PARA-Drive [Weng+, CVPR2024]
     - End-to-end + LLM autonomous driving with BEV: TOKEN [Tian+, CoRL2024]
     - Self-supervised learning of BEV: ViDAR [Yang+, CVPR2024]
4. How BEV space is constructed
   - Various methods have been proposed depending on the input data and the information available (depth estimation, camera matrices)
   - PointPillars [Lang+, CVPR2019] (point clouds): splits the point cloud into pillars and aggregates the features of the points inside each pillar (sketch below)
   - Lift, Splat, Shoot [Philion+, ECCV2020] (monocular camera): estimates depth from the image, lifts pixels to a point cloud (h, w, d), transforms it to (x, y, z) with the camera matrix, then applies PointPillars
   - BEVFormer [Li+, ECCV2022] (multi-camera): projects images into BEV space with Transformer cross-attention; BEV reference points are transformed into image space with the camera matrices and cross-attention is applied there
   (Figure: PointPillars pipeline — pillar feature net, pseudo image, 2D CNN backbone, SSD detection head; BEVFormer encoder — temporal self-attention and spatial cross-attention)
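A rough sketch of the pillar idea above: points are bucketed into BEV grid cells ("pillars") and per-pillar features are aggregated, here with a simple max. The grid extents and the max aggregation are illustrative assumptions (PointPillars itself learns per-pillar features with a small network before scattering).

```python
import torch

def pillarize(points, feats, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), grid=(200, 200)):
    """points: (N, 3) xyz, feats: (N, C) per-point features -> BEV feature map (C, H, W)."""
    H, W = grid
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * W).long().clamp(0, W - 1)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * H).long().clamp(0, H - 1)
    cell = ys * W + xs                                   # flat pillar index for each point
    bev = torch.zeros(feats.size(1), H * W)
    # scatter each point feature into its pillar, keeping the per-channel max
    bev.scatter_reduce_(1, cell.unsqueeze(0).expand(feats.size(1), -1),
                        feats.t(), reduce="amax", include_self=True)
    return bev.view(feats.size(1), H, W)

bev = pillarize(torch.rand(1000, 3) * 100 - 50, torch.rand(1000, 8))
print(bev.shape)  # torch.Size([8, 200, 200])
```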
5. BEV Perception from images
   - 3D object detection from a set of images becomes possible via BEV space
   - Conventional BEV perception: point cloud -> BEV -> 3D object detection (e.g., PointPillars [Lang+, CVPR2019])
   - Image-based BEV perception: images -> BEV -> 3D object detection (e.g., BEVFormer [Li+, ECCV2022])
   (Figure: pipeline comparison — point cloud -> BEV -> 3D detection (PointPillars); image -> 2D detection (YOLO [Redmon+, CVPR2016]); multi-view images -> BEV -> 3D detection (BEVFormer))
6. BEVFormer [Li+, ECCV2022]
   - BEV can represent object positions and sizes accurately, making it well suited to autonomous-driving tasks
   - Existing methods that extract BEV features from images depend on depth information -> a depth-independent method is proposed
   - BEVFormer extracts features using BEV queries, the RGB images, and the BEV features from one timestep earlier
     - Spatial Attention: extracts features by relating the BEV queries to the images
     - Temporal Attention: extracts features by relating the BEV queries to the BEV features from the previous timestep
   (Figure: multi-camera images at timestamp t, BEV queries with lookup & aggregate, spatial attention, and temporal attention with the BEV at timestamp t-1)
7. BEVFormer model architecture
   - The BEVFormer encoder is built on the conventional Transformer (decoder) structure
     - Introduces BEV queries, a temporal self-attention layer, and a spatial cross-attention layer
     - Image features are extracted with a backbone (e.g., ResNet)
   - After feature extraction by the 6-layer encoder, task-specific heads predict the recognition results (sketch below)
   (Figure: overall architecture — backbone, multi-camera features F_t, history BEV B_{t-1}, BEV queries Q, x6 encoder blocks of temporal self-attention, spatial cross-attention, and feed-forward with Add & Norm, followed by detection & segmentation heads)
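A schematic of one encoder layer as described above (temporal self-attention → spatial cross-attention → feed-forward, each with residual + LayerNorm), stacked six times. Ordinary multi-head attention stands in for the deformable-attention modules of the paper; this is not the official implementation.

```python
import torch
import torch.nn as nn

class BEVFormerLayerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, bev_q, bev_prev, img_feats):
        # Temporal self-attention: BEV queries attend to {Q, B_{t-1}}
        hist = torch.cat([bev_q, bev_prev], dim=1)
        bev_q = self.n1(bev_q + self.temporal(bev_q, hist, hist)[0])
        # Spatial cross-attention: BEV queries attend to the multi-camera features F_t
        bev_q = self.n2(bev_q + self.spatial(bev_q, img_feats, img_feats)[0])
        return self.n3(bev_q + self.ffn(bev_q))

layers = nn.ModuleList(BEVFormerLayerSketch() for _ in range(6))   # x6 encoder layers
bev_q = torch.randn(1, 50 * 50, 256)          # BEV queries (H*W tokens)
bev_prev = torch.randn(1, 50 * 50, 256)       # history BEV B_{t-1}
img_feats = torch.randn(1, 6 * 300, 256)      # flattened multi-camera features F_t
for layer in layers:
    bev_q = layer(bev_q, bev_prev, img_feats)  # the current BEV B_t goes to det/seg heads
```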
8. BEVFormer model architecture: BEV queries
   - A BEV space centered on the ego vehicle, made of grid-shaped learnable parameters
   - A learnable position embedding is applied before the queries enter BEVFormer (sketch below)
   - Q ∈ R^{H×W×C}, where H, W give the size of the BEV space and C is the dimensionality of each query (BEV feature)
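A short sketch of the BEV queries defined above: a learnable H × W × C grid centered on the ego vehicle, plus a learnable position embedding added before the encoder. The concrete values of H, W, C are illustrative assumptions.

```python
import torch
import torch.nn as nn

H, W, C = 200, 200, 256
bev_queries = nn.Parameter(torch.zeros(H * W, C))       # Q in R^{H x W x C}, flattened
bev_pos_embed = nn.Parameter(torch.zeros(H * W, C))     # learnable position embedding
nn.init.normal_(bev_queries, std=0.02)
nn.init.normal_(bev_pos_embed, std=0.02)

encoder_input = (bev_queries + bev_pos_embed).unsqueeze(0)  # (1, H*W, C) fed to the encoder
```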
9. BEVFormer model architecture: Spatial Cross-Attention
   - Projects the image-space features F_t into BEV space with cross-attention
   - Because the multi-view input consists of several images, the computational cost of multi-head attention grows
     -> Deformable Attention, which was proposed to reduce computational cost, is used instead
   - The BEV queries serve as the queries; the image-space features serve as the keys and values
   (Figure: overall architecture with the spatial cross-attention block highlighted)
10. BEVFormer model architecture: Spatial Cross-Attention
   - 3D reference points on each BEV query are projected onto the images to set the reference points for cross-attention (projection sketch below)

     SCA(Q_p, F_t) = \frac{1}{|\mathcal{V}_{hit}|} \sum_{i \in \mathcal{V}_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}(Q_p, \mathcal{P}(p, i, j), F_t^i)

     - V_hit: the views (images) that the reference points project into (accounts for the multi-view input)
     - N_ref: the number of reference points per query (accounts for the height direction of the BEV space)
     - Q_p: the BEV query at position p = (x', y')
     - F_t^i: the features of the i-th image at time t
   - A 3D reference point is projected into image i using the camera matrix T_i:

     z_{ij} \cdot [x_{ij}\; y_{ij}\; 1]^T = T_i \cdot [x'\; y'\; z'_j\; 1]^T
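A small numeric sketch of the projection equation above: a 3D reference point (x', y', z'_j) in ego/BEV coordinates is mapped to an image pixel (x_ij, y_ij) with a 3×4 projection matrix T_i; points that land behind the camera do not contribute to V_hit. The matrix values and coordinate convention are made-up assumptions.

```python
import numpy as np

def project_to_image(T_i, ref_point_3d):
    """T_i: (3, 4) projection matrix; ref_point_3d: (x', y', z'_j) in ego coordinates."""
    homo = np.append(ref_point_3d, 1.0)          # [x', y', z'_j, 1]
    x, y, z = T_i @ homo                         # z_ij * [x_ij, y_ij, 1]
    if z <= 0:                                   # behind the camera -> not a "hit" view
        return None
    return np.array([x / z, y / z])              # (x_ij, y_ij)

T_i = np.array([[1000.0, 0.0, 640.0, 0.0],
                [0.0, 1000.0, 360.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
print(project_to_image(T_i, np.array([2.0, 1.0, 10.0])))   # e.g. [840. 460.]
```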
11. BEVFormer model architecture: Temporal Self-Attention
   - Extracts features by capturing relations along both the BEV-spatial and the temporal direction (at the first timestep, only the BEV-spatial direction is available)
   - Deformable Attention is used as the self-attention (sketch below)

     TSA(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} \mathrm{DeformAttn}(Q_p, p, V)

     - Q: the BEV queries at the current time t
     - B'_{t-1}: the BEV features from the previous time t-1
     - Q_p: the BEV query at position p = (x, y)
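A sketch of the TSA equation above, with ordinary scaled dot-product attention standing in for DeformAttn (the real module only samples a few learned offsets around p instead of attending densely). Shapes are illustrative assumptions.

```python
import torch

def attn_stand_in(q, kv):
    # q: (B, N, C) BEV queries; kv: (B, N, C) value source (Q itself or the aligned B'_{t-1})
    w = torch.softmax(q @ kv.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return w @ kv

def temporal_self_attention(bev_q, bev_prev):
    # TSA(Q_p, {Q, B'_{t-1}}) = sum over V in {Q, B'_{t-1}} of DeformAttn(Q_p, p, V)
    return sum(attn_stand_in(bev_q, v) for v in (bev_q, bev_prev))

out = temporal_self_attention(torch.randn(1, 2500, 256), torch.randn(1, 2500, 256))
print(out.shape)  # torch.Size([1, 2500, 256])
```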
12. Effectiveness of BEVFormer
   - Performance on the nuScenes test set is compared with point-cloud-based (L) and image-based (C) methods
   - Table 1: 3D detection results on the nuScenes test set (* = VoVNet-99 (V2-99) pre-trained on depth estimation with extra data; BEVFormer-S does not use temporal information in the BEV encoder)

     Method            | Modality | Backbone | NDS↑  | mAP↑  | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓
     SSN               | L        | -        | 0.569 | 0.463 | -     | -     | -     | -     | -
     CenterPoint-Voxel | L        | -        | 0.655 | 0.580 | -     | -     | -     | -     | -
     PointPainting     | L&C      | -        | 0.581 | 0.464 | 0.388 | 0.271 | 0.496 | 0.247 | 0.111
     FCOS3D            | C        | R101     | 0.428 | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124
     PGD               | C        | R101     | 0.448 | 0.386 | 0.626 | 0.245 | 0.451 | 1.509 | 0.127
     BEVFormer-S       | C        | R101     | 0.462 | 0.409 | 0.650 | 0.261 | 0.439 | 0.925 | 0.147
     BEVFormer         | C        | R101     | 0.535 | 0.445 | 0.631 | 0.257 | 0.405 | 0.435 | 0.143
     DD3D              | C        | V2-99*   | 0.477 | 0.418 | 0.572 | 0.249 | 0.368 | 1.014 | 0.124
     DETR3D            | C        | V2-99*   | 0.479 | 0.412 | 0.641 | 0.255 | 0.394 | 0.845 | 0.133
     BEVFormer-S       | C        | V2-99*   | 0.495 | 0.435 | 0.589 | 0.254 | 0.402 | 0.842 | 0.131
     BEVFormer         | C        | V2-99*   | 0.569 | 0.481 | 0.582 | 0.256 | 0.375 | 0.378 | 0.126

   -> BEVFormer achieves higher recognition performance than conventional image-based methods
13. Effectiveness of BEVFormer
   - Performance on the nuScenes test set is compared with point-cloud-based (L) and image-based (C) methods (same table as slide 12)
   -> BEVFormer achieves recognition performance comparable to SSN, which takes point clouds as input
14. Unified Autonomous Driving (UniAD) [Hu+, CVPR2023]
   - Systems built by chaining conventional single-task models, or as multi-task models, carry risks
     - Single-task models: an error in an upstream sub-task propagates to the downstream sub-tasks
     - Multi-task models: accuracy drops because the sub-tasks are not well coordinated
   - UniAD optimizes the combination of sub-tasks for the planning of the ego vehicle
   (Table/figure: comparison of autonomous-driving frameworks and which sub-tasks — detection, tracking, mapping, motion forecasting, occupancy, planning — each method covers; UniAD covers all of them)
15. Effect of coordinating multiple tasks: quantitative evaluation
   - Evaluates how each module affects recognition performance
   - Recognition performance is measured on the nuScenes validation set
   (Table: ablation over which modules — tracking, mapping, motion forecasting, occupancy prediction, planning — are enabled. The two perception sub-tasks clearly help motion forecasting, prediction benefits from unifying the two prediction modules, and with all prior representations planning improves markedly; UniAD outperforms the naive multi-task baseline (ID-0, separate heads per task) on prediction and planning without a substantial drop in perception.)
16. Effect of coordinating multiple tasks: quantitative evaluation
   - Same ablation on the nuScenes validation set as slide 15
   -> Training with appropriate connections that take all of the tasks into account is important
17. PARA-Drive [Weng+, CVPR2024]
   - A new architecture designed for real-time autonomous driving
   - A query-based, multi-task end-to-end model
   (Figure: comparison with recent end-to-end modular AV architectures — UniAD (CVPR '23), VAD (ICCV '23), OccNet (ICCV '23) — which differ in (1) the set of modules included, (2) module placement due to different inter-module connections, and (3) whether high-dimensional latent features or compact outputs flow between modules)
   -> The pipelines are getting more complex! Keep It Simple, Stupid (KISS)
18. PARA-Drive [Weng+, CVPR2024]
   - The perception, prediction, and planning modules are trained simultaneously (multi-task)
   - Information is exchanged between modules implicitly through the tokenized BEV queries
   - At test time, some modules can be deactivated to speed up inference (sketch below)
   (Figure: PARA-Drive architecture — mapping, occupancy prediction, motion prediction, and planning heads all cross-attend to the shared BEV features/queries in parallel, with no inter-module dependency, so heads can be deactivated at runtime)
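A sketch of the idea described above: several heads read the shared BEV queries in parallel during co-training, and any subset can simply be skipped at test time. The head definitions are placeholders, not the paper's modules.

```python
import torch
import torch.nn as nn

class ParallelDrivingSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            "map": nn.Linear(dim, 32),
            "motion": nn.Linear(dim, 32),
            "occupancy": nn.Linear(dim, 32),
            "plan": nn.Linear(dim, 12),          # e.g. 6 waypoints x (x, y), assumed
        })

    def forward(self, bev_queries, active=("map", "motion", "occupancy", "plan")):
        # Each head reads the same BEV queries and no head depends on another, so
        # deactivating perception/prediction heads at runtime still yields a plan.
        return {name: head(bev_queries) for name, head in self.heads.items() if name in active}

model = ParallelDrivingSketch()
bev = torch.randn(1, 2500, 256)
outputs_train = model(bev)                      # all heads co-trained
outputs_test = model(bev, active=("plan",))     # only the planning head at inference
```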
19. PARA-Drive [Weng+, CVPR2024]: evaluation
   (Table: comparison with state-of-the-art approaches under a standardized evaluation methodology on the nuScenes val set — collision rates, trajectory L2, and map compliance errors against UniAD, VAD, and AD-MLP; plus perception/prediction metrics — mAP, NDS, AMOTA, minADE, IoU-lane, IoU-road, IoU-n, VPQ-n — against UniAD; results under the older evaluation protocols are also reported)
   -> PARA-Drive plans more accurately than the previous methods
   -> Its performance on the individual tasks is also on par with or better than theirs
20. TOKEN [Tian+, CoRL2024]
   - Conventional end-to-end models struggle with long-tail events (events that occur rarely but are highly important)
   - LLMs are expected to help, but LLM performance depends on the quality and detail of the scene description in the input text
     -> Represent the scene in detail with tokens extracted from the sensor inputs
     -> Using PARA-Drive's output tokens makes the description of long-tail events more accurate
   (Paper: "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving", Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, Marco Pavone — NVIDIA, UC Berkeley, USC, Stanford. Figure: object-centric and scene-centric tokenization feeding a LLaMA-based multi-modal LLM that answers scene-understanding, reasoning, and hierarchical-planning questions)
21. TOKEN model architecture
   - Composed of four modules (sketch below):
     - Object-centric Tokenizer: extracts object-level information from the sensor inputs (PARA-Drive)
     - Scene-centric Tokenizer: extracts scene-level information from the sensor inputs
     - Adapter: aligns each tokenizer's outputs with the text embedding space
     - LLM: answers questions using the LLM's strong understanding and generalization ability (LLaMA, tuned with LoRA)
   (Figure: tokens are extracted with the pre-trained object-centric querying transformers of the driving model — track, map, prediction, and occupancy queries over the BEV features — i.e., the output tokens of each PARA-Drive module are reused)
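A sketch of the pipeline described above: object-centric tokens taken from a pre-trained driving model (PARA-Drive-style query outputs) are mapped by an adapter into the LLM's embedding space and prepended to the question tokens. The LLM call itself is omitted; the dimensions, token counts, and adapter design are assumptions.

```python
import torch
import torch.nn as nn

llm_dim = 4096                                   # e.g. a LLaMA hidden size (assumed)
adapter = nn.Sequential(nn.Linear(256, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

track_q = torch.randn(1, 30, 256)                # object-centric tokens from the frozen
map_q = torch.randn(1, 20, 256)                  # driving model's querying transformers
motion_q = torch.randn(1, 30, 256)               # (track / map / motion / occupancy)
occ_q = torch.randn(1, 30, 256)

driving_tokens = adapter(torch.cat([track_q, map_q, motion_q, occ_q], dim=1))
question_embeds = torch.randn(1, 40, llm_dim)    # embedded question text (placeholder)
llm_inputs = torch.cat([driving_tokens, question_embeds], dim=1)
# llm_inputs would then be fed to the (LoRA-tuned) LLM to generate the answer text.
print(llm_inputs.shape)  # torch.Size([1, 150, 4096])
```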
22. How TOKEN is trained
   - TOKEN is trained in three steps:
     - Pre-training: train the Adapter with Scene-Understanding QA
     - Reasoning fine-tuning: train the Adapter and LoRA with Reasoning QA and Planning QA
     - Planning fine-tuning: train the Adapter and LoRA with Planning QA
   (Figure: TOKEN architecture with the Adapter and LoRA highlighted as the trainable parts)
23. How TOKEN is trained
   - Same three training steps (pre-training, reasoning fine-tuning, planning fine-tuning); this slide illustrates Scene-Understanding QA
   - Scene-Understanding QA: the task of generating answers about scene understanding, e.g.
     Q: What lane is the vehicle at (11.9, 12.1) driving in relative to the autonomous vehicle?
     A: Vehicle at (11.9, 12.1) is merging into ego vehicle's lane.
     Q: What is the observed status of the object at (-3.8, 6.0)?  A: Stationary.
     Q: What is the type of the object at (-3.8, 6.0)?  A: Pedestrian.
24. How TOKEN is trained
   - Same three training steps; this slide illustrates Reasoning QA
   - Reasoning QA: the task of generating an explanation of the reasoning behind the answer, e.g.
     Q: Should the autonomous vehicle pay attention to the object located at (-3.2, 6.7)? Why?
     A: The object is a car and is blocking the autonomous vehicle's lane. The answer is: Yes.
25. How TOKEN is trained
   - Same three training steps; this slide illustrates Planning QA (used in both fine-tuning stages)
   - Planning QA: the task of generating a plan of action, e.g.
     Q: The autonomous vehicle needs to keep forward along the road. ... [omitted] Please plan its 3-second motion plan.
     A: There is 1 important object: car at (8.6, 3.9). The autonomous vehicle should yield to the car at (8.6, 3.9). It should remain roughly static, and ... [omitted]
   (The resulting parameter-freezing schedule across the three stages is sketched below.)
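A sketch of the three-stage recipe above expressed as a parameter-freezing schedule: the adapter receives gradients in every stage, while the LoRA weights join only in the two fine-tuning stages. The module objects here are placeholders, not the authors' training code.

```python
import torch
import torch.nn as nn

adapter = nn.Linear(256, 4096)                           # placeholder adapter
lora_A = nn.Parameter(torch.zeros(8, 4096))              # placeholder LoRA factors
lora_B = nn.Parameter(torch.zeros(4096, 8))

def configure_stage(stage):
    assert stage in {"pretraining", "reasoning_finetuning", "planning_finetuning"}
    for p in adapter.parameters():
        p.requires_grad = True                           # adapter is trained in every stage
    for p in (lora_A, lora_B):                           # LoRA only in the fine-tuning stages
        p.requires_grad = stage != "pretraining"

configure_stage("pretraining")           # trained on scene-understanding QA
configure_stage("reasoning_finetuning")  # trained on reasoning (+ planning) QA
configure_stage("planning_finetuning")   # trained on planning QA
```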
26. Effect of leveraging the LLM: quantitative evaluation
   - Evaluates scene understanding, critical-object grounding, and planning performance
     - Video-LLaMA, VILA: conventional multi-modal LLMs
     - BEV-TOKEN: TOKEN with the BEV features used as the Adapter's input
   - Table 1 (scene understanding / grounding metrics, higher is better: object classification, lane-object association accuracy, grounding accuracy, precision, recall, importance classification; trajectory L2 in meters, lower is better: 1s, 2s, 3s, Ave123s, Aveall):

     Method      | Obj. cls.↑ | Lane-obj. asso.↑ | Acc.↑ | Prec.↑ | Recall↑ | Import. cls.↑ | L2 1s↓ | 2s↓  | 3s↓  | Ave123s↓ | Aveall↓
     Video-LLaMA | 0.28       | 0.39             | 0.38  | 0.22   | 0.27    | 0.58          | 0.27   | 1.72 | 6.34 | 3.01     | 2.39
     VILA-1.5    | 0.37       | 0.22             | 0.42  | 0.19   | 0.16    | 0.55          | 0.28   | 1.56 | 4.41 | 2.09     | 1.66
     BEV-TOKEN   | 0.68       | 0.64             | 0.61  | 0.58   | 0.62    | 0.76          | 0.39   | 1.01 | 2.02 | 1.14     | 0.96
     TOKEN       | 0.92       | 0.68             | 0.76  | 0.87   | 0.76    | 0.92          | 0.26   | 0.71 | 1.47 | 0.81     | 0.68
     PARA-Drive  | -          | -                | -     | -      | -       | -             | 0.23   | 0.68 | 1.50 | 0.80     | 0.66

   -> Thanks to the driving-specialized encoder and object-centric tokenization, TOKEN outperforms conventional multi-modal LLMs
27. Effect of leveraging the LLM: quantitative evaluation
   - Planning performance is evaluated on four kinds of long-tail events
   (Table: trajectory L2, heading L2, longitudinally weighted L2, and collision rate for PARA-Drive vs. TOKEN on 3-point turns, resuming motion from a full stop, overtaking, and construction zones — TOKEN significantly outperforms PARA-Drive in these long-tail scenarios)
   -> TOKEN consistently surpasses PARA-Drive on all of the long-tail events
28. Effect of leveraging the LLM: qualitative evaluation
   - Overtaking an obstacle -> the plan avoids the obstacle and then returns to the original lane
   - Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan ... [omitted]
   - TOKEN: There is 1 important object: car at (-1.9, 8.1). The autonomous vehicle should proceed with caution and prepare to overtake the car at (-1.9, 8.1) from the autonomous vehicle's left side. It should use the adjacent right lane to continue driving, decelerate and steer slightly to right, ... [omitted]
   (Figure: predicted trajectories for PARA-Drive, TOKEN, and the ground truth)
29. Effect of leveraging the LLM: qualitative evaluation
   - Avoiding traffic cones -> the plan bypasses the cones by driving roughly along the center of the adjacent lane
   - Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan ... [omitted]
   - TOKEN: There are 4 important objects: traffic cone at (2.2, 13.4), traffic cone at (2.1, 11.7), traffic cone at (1.7, 16.8), traffic cone at (1.6, 19.8). These traffic cones are blocking the ego vehicle's lane. The autonomous vehicle should bypass the traffic cone at (2.2, 13.4) from the autonomous vehicle's right side, bypass the traffic cone at (2.1, 11.7) from the right side, ... [omitted]
   (Figure: predicted trajectories for PARA-Drive, TOKEN, and the ground truth)
30. ViDAR [Yang+, CVPR2024]
   - Proposes self-supervised learning (a pre-training method) for query-based BEV perception methods
   - Pre-trains by predicting future point clouds from the multi-view images observed from the past up to the present (sketch below)
     - Predicting future point clouds requires capturing semantics, 3D structure, and temporal dynamics
     -> Geometric and temporal information becomes embedded in the BEV features, which benefits multiple sub-tasks with different objectives
   (Figure: visual point cloud forecasting — history visual inputs to future point cloud predictions — and ViDAR's gains over the previous SOTA on point cloud forecasting, 3D object detection, multi-object tracking, semantic and future occupancy prediction, map segmentation, motion forecasting, and planning)
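A sketch of the pre-training objective described above: BEV features encoded from past multi-view images are decoded into future point clouds and scored against a future LiDAR sweep, here with a Chamfer-style distance standing in for ViDAR's actual latent-rendering formulation. The decoder and all shapes are placeholders.

```python
import torch
import torch.nn as nn

class FuturePointHead(nn.Module):
    """Decodes a BEV feature map into N future 3D points (placeholder decoder)."""
    def __init__(self, dim=256, n_points=1024):
        super().__init__()
        self.proj = nn.Linear(dim, 3)
        self.n_points = n_points

    def forward(self, bev_feats):                        # (B, H*W, C)
        idx = torch.randint(0, bev_feats.size(1), (self.n_points,))
        return self.proj(bev_feats[:, idx])              # (B, N, 3) predicted future points

def chamfer(pred, gt):
    d = torch.cdist(pred, gt)                            # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

bev_feats = torch.randn(2, 2500, 256)                    # from history multi-view images
future_gt = torch.randn(2, 2048, 3)                      # future LiDAR sweep (self-supervision)
loss = chamfer(FuturePointHead()(bev_feats), future_gt)
loss.backward()                                          # gradients pre-train the BEV encoder
```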
31. Effect of ViDAR pre-training
   - Compares the recognition performance of UniAD with and without ViDAR pre-training
     - In pre-training, the past frames' multi-view images are input and the future frames' point clouds are predicted
     - Recognition performance is evaluated on the nuScenes validation set
   - Table 10 (joint perception, prediction, and planning):

     Method | NDS↑  | mAP↑  | AMOTA↑ | AMOTP↓ | IDS↓ | IoU-lane↑ | IoU-road↑ | minADE↓ | minFDE↓ | MR↓   | IoU-n↑ | IoU-f↑ | VPQ-n↑ | VPQ-f↑ | avg.L2↓ | avg.Col↓
     UniAD  | 49.36 | 37.96 | 38.3   | 1.32   | 1054 | 31.3      | 69.1      | 0.75    | 1.08    | 0.158 | 62.8   | 40.1   | 54.6   | 33.9   | 1.12    | 0.27
     ViDAR  | 52.57 | 42.33 | 42.0   | 1.25   | 991  | 33.2      | 71.4      | 0.67    | 0.99    | 0.149 | 65.4   | 42.1   | 57.3   | 36.4   | 0.91    | 0.23

   (Also shown: an ablation on the number of Latent Rendering groups and data-efficiency curves against ImageNet-classification pre-training.)
   -> Introducing the pre-training improves recognition performance on every task
32. Summary: BEV Perception with Transformers
   - 3D recognition: BEVFormer [Li+, ECCV2022]
     - A BEV perception method that uses cross-attention to project images into BEV space and to capture temporal relations
   - End-to-end autonomous driving: UniAD [Hu+, CVPR2023]
     - Trains multiple sub-tasks end-to-end, in stages, within a framework that connects all of the sub-tasks through BEV queries
   - End-to-end autonomous driving: PARA-Drive [Weng+, CVPR2024]
     - Trains the perception, prediction, and planning modules simultaneously in a multi-task, end-to-end manner
     - Deactivates some modules at test time to improve execution speed
   - End-to-end + LLM autonomous driving: TOKEN [Tian+, CoRL2024]
     - Uses PARA-Drive's output tokens and an LLM to improve the description of, and planning for, long-tail events
   - Self-supervised learning: ViDAR [Yang+, CVPR2024]
     - Self-supervised pre-training for BEV perception that is highly effective across a variety of sub-tasks