Image Encoder: ViT

[Figure 1 from the ViT paper ("Preprint. Under review."): model overview. An image is split into fixed-size patches, each patch is flattened and linearly embedded ("Linear Projection of Flattened Patches"), position embeddings are added, and the resulting sequence, together with an extra learnable [class] embedding, is fed to a standard Transformer encoder (L stacked blocks of Norm, Multi-Head Attention, Norm, and MLP, each with a residual connection). An MLP head on top predicts the class (Bird, Ball, Car, ...).]

Text Decoder: LLM

[Figure: text decoder. Input tokens: the text input (the question). Output tokens: the text output (the answer). The decoder applies Self-Attention over its text tokens and Cross-Attention from the text tokens to the features of the input image produced by the encoder.]
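To make the two attention patterns concrete, here is a minimal NumPy sketch (not the papers' implementation; every size, weight matrix, and helper name is invented for illustration) of ViT's patch embedding and of a single cross-attention step in which text tokens query image tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=4):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    rows = [image[i:i + patch, j:j + patch].reshape(-1)
            for i in range(0, H, patch)
            for j in range(0, W, patch)]
    return np.stack(rows)              # (num_patches, patch*patch*C)

def cross_attention(text_q, image_kv, d=16):
    """Single-head cross-attention: text tokens query image tokens."""
    Wq = rng.normal(size=(text_q.shape[-1], d))
    Wk = rng.normal(size=(image_kv.shape[-1], d))
    Wv = rng.normal(size=(image_kv.shape[-1], d))
    Q, K, V = text_q @ Wq, image_kv @ Wk, image_kv @ Wv
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V                    # (num_text_tokens, d)

img = rng.normal(size=(16, 16, 3))
patches = patchify(img)                # 16 patches, each of dim 4*4*3 = 48
W_embed = rng.normal(size=(48, 32))
image_tokens = patches @ W_embed       # linear projection of flattened patches
text_tokens = rng.normal(size=(5, 32))
out = cross_attention(text_tokens, image_tokens)
print(out.shape)                       # (5, 16)
```

In a real model the projections are learned, attention is multi-head, and position embeddings are added to the image tokens; the sketch only shows the data flow.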
BEVFormer [Li+, ECCV 2022]

Temporal Attention: extracts features by capturing the relation between the BEV queries at timestamp t and the BEV features at timestamp t-1.

[Figure: BEVFormer overview. Multi-camera images at timestamp t are fused into the BEV at timestamp t through Spatial Attention over the BEV queries, while Temporal Attention looks up and aggregates the ego-aligned BEV at timestamp t-1 ("What's in here at timestamp t?" → "There is a car!").]
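A rough sketch of the idea behind temporal attention (a toy stand-in, not BEVFormer's deformable temporal self-attention; the grid size, feature dimension, and gated fusion are invented for illustration): the t-1 BEV grid is first aligned to the current ego pose, then each BEV query is fused with the feature at the same world location:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 6x6 BEV grid with 8-dim features at time t-1 (toy numbers).
bev_prev = rng.normal(size=(6, 6, 8))

def align_prev_bev(bev_prev, ego_dx, ego_dy):
    """Shift the t-1 BEV grid by the ego displacement (in grid cells) so each
    cell refers to the same world location as the time-t BEV queries."""
    return np.roll(np.roll(bev_prev, -ego_dx, axis=0), -ego_dy, axis=1)

def temporal_lookup(bev_queries, aligned_prev):
    """Per-cell gated fusion of the current query with the aligned t-1 feature
    (a simple stand-in for attention-based lookup and aggregation)."""
    gate = 1.0 / (1.0 + np.exp(-(bev_queries * aligned_prev).sum(-1, keepdims=True)))
    return gate * aligned_prev + (1.0 - gate) * bev_queries

bev_q = rng.normal(size=(6, 6, 8))
aligned = align_prev_bev(bev_prev, ego_dx=1, ego_dy=0)  # ego moved one cell forward
fused = temporal_lookup(bev_q, aligned)
print(fused.shape)  # (6, 6, 8)
```

BEVFormer instead samples the aligned previous BEV with learned deformable offsets, but the alignment-then-lookup structure is the same.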
BEVFormer model architecture: Spatial Cross-Attention

Cross-Attention on the images is applied by projecting the 3D reference points of each BEV query into the images, which defines the reference points for attention:

    SCA(Q_p, F_t) = (1 / |V_hit|) * sum_{i in V_hit} sum_{j=1}^{N_ref} DeformAttn(Q_p, P(p, i, j), F_t^i)

- F_t^i: feature map of the i-th camera image at timestamp t.
- V_hit: the set of camera views hit by the projected reference points.
- P(p, i, j): the image position (x_ij, y_ij) of the j-th 3D reference point (x', y', z'_j) of the BEV query at position p, obtained with the camera matrix T_i:

    z_ij * [x_ij, y_ij, 1]^T = T_i * [x', y', z'_j, 1]^T

Key points:
- The camera matrix T_i projects the 3D reference points into image space.
- The height direction of BEV space is taken into account by sampling several heights z'_j above each BEV position (x', y').
- Multi-view input is taken into account by averaging over the hit views V_hit.

[Figure: (b) Spatial Cross-Attention. The point (x', y') of a BEV query at position p is lifted to 3D points (x', y', z'_j) and projected into the hit views V_hit; the resulting BEV features feed the Det & Seg heads.]
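The projection step can be sketched as follows (a toy NumPy example; the camera matrices, image size, and reference points are invented, and DeformAttn itself is omitted):

```python
import numpy as np

def project_to_view(T_i, p3d):
    """Project a 3D reference point (x', y', z'_j) with camera matrix T_i (3x4):
    z * [x, y, 1]^T = T_i @ [x', y', z', 1]^T."""
    hom = T_i @ np.append(p3d, 1.0)        # homogeneous image coordinates (3,)
    z = hom[2]
    if z <= 1e-6:                          # point behind the camera: no hit
        return None
    return hom[:2] / z                     # pixel coordinates (x_ij, y_ij)

def hit_views(cam_matrices, ref_points, img_w, img_h):
    """Return the set V_hit of views that at least one reference point lands in."""
    hits = set()
    for i, T_i in enumerate(cam_matrices):
        for p in ref_points:
            uv = project_to_view(T_i, p)
            if uv is not None and 0 <= uv[0] < img_w and 0 <= uv[1] < img_h:
                hits.add(i)
                break
    return hits

# Toy pinhole cameras (hypothetical intrinsics/extrinsics): one looks along +z,
# the other along -z.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
T_front = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                 # 3x4
T_back = K @ np.hstack([np.diag([1.0, 1.0, -1.0]), np.zeros((3, 1))])  # 3x4

# Pillar of reference points above one BEV cell: same (x', y'), several heights z'_j.
refs = [np.array([0.2, 0.1, z]) for z in (1.0, 2.0, 3.0)]
print(hit_views([T_front, T_back], refs, 128, 128))  # {0}: only the front view is hit
```

BEVFormer then runs deformable attention around each projected point in each hit view and averages the outputs over |V_hit|, as in the SCA equation above.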
PARA-Drive [Weng+, CVPR 2024]

[Figure 2: Visual comparison between PARA-Drive and recent state-of-the-art architectures for end-to-end modular AV: UniAD [12] (CVPR '23), VAD [13] (ICCV '23), OccNet [25] (ICCV '23). Edge legend: latent features / compact outputs / latent + compact / test time only. Three major design variations are highlighted: (1) the inclusion of a different set of modules, (2) distinct module placements due to different inter-module connections, and (3) information flow with different representations, such as high-dimensional latent features and compact outputs from upstream modules. Differing from prior work, PARA-Drive is designed to be completely parallelized for runtime efficiency, while achieving state-of-the-art driving performance and maintaining safety and interpretability.]

Table 4. Ablation on information flow when the BEV feature maps are not used in planning. High-dimensional latent queries tend to carry more information for planning performance in this case.

                                          Col. Rate (%) ↓               L2 (m) ↓
Methods                                   1.0s  2.0s  3.0s  Ave_all     1.0s  2.0s  3.0s  Ave_all
Baseline                                  0.00  0.07  0.51  0.13        0.24  0.55  1.07  0.53
Baseline - BEV [6]                        3.41  8.09  7.91  5.88        2.83  5.37  7.61  4.66
(4) Baseline - BEV + "Mot-Plan" bbox      0.00  0.29  3.94  0.97        0.34  1.15  2.53  1.10
(4) Baseline - BEV + "Mot-Plan" query     0.00  0.10  0.46  0.14        0.24  0.58  1.09  0.54
(5) Baseline - BEV + "Map-Plan" BEV       0.12  1.07  3.44  1.22        0.95  1.82  2.63  1.59
(5) Baseline - BEV + "Map-Plan" query     0.02  0.20  0.65  0.20        0.28  0.62  1.15  0.58
(6) Baseline - BEV + "Occ-Plan" BEV       0.48  1.75  4.84  1.85        1.96  3.75  5.41  3.26
(6) Baseline - BEV + "Occ-Plan" query     0.14  0.27  1.00  0.38        0.42  0.82  1.47  0.78

[6] This model also removes the edge (4), which we omit for simplicity; i.e., the only input to planning is the high-level command. We use this model as the basis to validate which added information helps planning the most when the BEV features are not used.
[Figure 4: All inter-module connectivities that build a conditional dependency between modules (Mapping, Occupancy Prediction, Motion Prediction, Planning, all fed by BEV features via edge (0)). For example, edge (1) refers to using outputs from the mapping module for motion prediction. The improved baseline removes edges (1)(2); alternative designs (3)-(7) vary the inter-module connectivity.]

Table 5. Ablation experiments on the necessity of modules.

Scenarios  Methods                          Col. ↓    L2 ↓     Map Comp. (%) ↓
                                            Ave_all   Ave_all  Offroad  OffLane
val        Baseline - edge (4) [7]          0.20      0.5734   0.32     1.20
           Baseline - Map.                  0.16      0.5332   0.71     3.03
           Baseline - Occ.                  0.64      0.8174   4.19     4.43
           Baseline - Mot.                  0.60      0.8561   4.05     4.49
val        Baseline - Occ. + "Mot-Plan"     0.14      0.5483   0.38     0.82
           Baseline - Mot. + "Occ-Plan"     0.25      0.5953   0.80     1.33

[7] Similar to Table 4 row 2, we eliminate the edge (4) as the base model to fairly compare the impact of removing each upstream module on planning.

In rows 8-9, for edge (5), we pass either the compact semantic BEV map denoting the road and lane geometry information or the latent queries of the map elements to downstream planning. Similarly, in rows 10-11 for edge (6), we compare the use of compact BEV occupancy maps or object queries for planning. Interestingly, although passing the compact BEV outputs leads to slightly better performance than passing the high-dimensional query features, we find that it is not necessary to pass either the compact outputs or the per-module latent queries, due to the potential redundancy with information already flowing from the BEV feature maps into the planning module via the edge (0).

→ The design space gets complicated! Keep It Simple, Stupid (KISS):

[Figure 5: PARA-Drive architecture. Perception, prediction, and planning modules are co-trained in parallel: each head (Map queries, Motion queries, Occupancy queries, Plan queries) cross-attends (C) to the shared BEV queries, and the plan queries additionally take the ego vehicle's information. No dependency is introduced between the heads, while the shared BEV queries ensure that relevant information still reaches planning. Because the planning head does not consume the outputs of the perception and prediction heads, those heads can be deactivated at runtime or run at a reduced frame rate (e.g., only for visualization or safety checks), which significantly boosts the frame rate of the motion-planning module.]
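The runtime consequence of the parallel design can be illustrated with a toy sketch (hypothetical names and shapes; each head is a linear stand-in for a cross-attention module over the shared BEV queries):

```python
import numpy as np

rng = np.random.default_rng(2)

class Head:
    """A toy per-task head reading the shared BEV queries."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(size=(d_in, d_out))
    def __call__(self, bev):
        return bev @ self.W

# Shared BEV queries feed every head; there are no edges between the heads.
bev = rng.normal(size=(100, 32))
heads = {"map": Head(32, 8), "motion": Head(32, 8),
         "occ": Head(32, 8), "plan": Head(32, 2)}

def run(heads, bev, active):
    """Because the heads are parallel, any subset can be deactivated at
    runtime; planning still works without the others."""
    return {name: h(bev) for name, h in heads.items() if name in active}

out = run(heads, bev, active={"plan"})  # perception/prediction heads skipped
print(sorted(out), out["plan"].shape)   # ['plan'] (100, 2)
```

In a sequential design such as UniAD, dropping an upstream module would break the planning input; here it only removes an optional output.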
TOKEN: "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving"
Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, Marco Pavone (NVIDIA, UC Berkeley, University of Southern California, Stanford University). arXiv v1, [cs.AI], 1 Jul 2024.

[Figure: TOKEN, a multi-modal LLM (LLaMA) with object-centric tokenization for autonomous driving. Inputs <multi-view video, HD map, ..., symbolic representation> pass through both scene-centric and object-centric tokenization into the LLM, which answers driving QA of three types:
- Scene Understanding. Q: "What lane is the vehicle at (11.9, 12.1) driving in relative to the ego vehicle?" A: "Vehicle at (11.9, 12.1) is merging into ego vehicle's lane."
- Reasoning. Q: "Should the autonomous vehicle pay attention to the object located at (-3.2, 6.7)? Why?" A: "The object is a car and is blocking the autonomous vehicle's lane. The answer is: Yes."
- Hierarchical Planning (1) critical objects grounding; 2) interaction planning; 3) lane-level decision-making; 4) motion plan generation). Q: "The autonomous vehicle needs to keep forward along the road... [omitted]. Please plan its 3-second motion plan." A: "There is 1 important object: car at (8.6, 3.9). The autonomous vehicle should yield to the car at (8.6, 3.9). It should remain roughly static, and..."]
TOKEN training procedure

Planning fine-tuning: the Adapter and LoRA weights are trained with Planning QA.

[Figure: TOKEN training setup. Tokens are extracted from a Transformer-based driving model; scene-centric and object-centric tokens from <multi-view video, HD map, ..., symbolic representation> pass through the Adapter into the LLM (LLaMA), which is fine-tuned with LoRA on QA pairs.]
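The two trainable pieces can be sketched in NumPy (a minimal illustration, not TOKEN's code; the sizes, init scales, and adapter shape are invented): LoRA adds a low-rank update A·B to a frozen weight, and the adapter maps driving-model tokens into the LLM embedding space:

```python
import numpy as np

rng = np.random.default_rng(3)

d, r = 64, 4                      # hidden size and LoRA rank (hypothetical)
W = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, d))              # B starts at zero, so LoRA is a no-op before training
alpha = 8.0

def lora_linear(x):
    """y = x W + (alpha / r) * x A B; only A and B receive gradients."""
    return x @ W + (alpha / r) * (x @ A) @ B

class Adapter:
    """Bottleneck adapter mapping driving-model tokens into the LLM embedding space."""
    def __init__(self, d_in, d_mid, d_out):
        self.W1 = rng.normal(size=(d_in, d_mid)) * 0.1
        self.W2 = rng.normal(size=(d_mid, d_out)) * 0.1
    def __call__(self, tokens):
        return np.maximum(0.0, tokens @ self.W1) @ self.W2

x = rng.normal(size=(5, d))
assert np.allclose(lora_linear(x), x @ W)   # identical to the frozen layer at init
obj_tokens = rng.normal(size=(7, 16))       # object-centric tokens from the driving model
print(Adapter(16, 8, d)(obj_tokens).shape)  # (7, 64)
```

Freezing W keeps the LLM's general knowledge intact while the small A, B, and adapter weights specialize it to the planning QA data.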
Scene Understanding QA: the task of generating answers about scene understanding (object types, states, and lane relations).

Examples:
- Q: What lane is the vehicle at (11.9, 12.1) driving in relative to the autonomous vehicle?
  A: Vehicle at (11.9, 12.1) is merging into ego vehicle's lane.
- Q: What is the observed status of the object at (-3.8, 6.0)?
  A: Stationary.
- Q: What is the type of the object at (-3.8, 6.0)?
  A: Pedestrian.
Reasoning QA: the task of generating an explanation of the reason behind the answer to a question.

Examples:
- Q: Should the autonomous vehicle pay attention to the object located at (-3.2, 6.7)? Why?
  A: The object is a car and is blocking the autonomous vehicle's lane. The answer is: Yes.
- Q: Should the autonomous vehicle pay attention to the object located at (3.1, 8.1)? Why?
  A: The object is a pedestrian with velocity (-1.6, 0.2) m/s. Thus, it is crossing in front of the autonomous vehicle. The answer is: Yes.
Planning QA: the task of generating an action plan.

Examples:
- Q: The autonomous vehicle needs to keep forward along the road... [omitted]
  A: There is 1 important object: car at (8.6, 3.9). The autonomous vehicle should yield to the car at (8.6, 3.9). It should remain roughly static, and...
- Q: The autonomous vehicle needs to keep forward along the road... [omitted]
  A: There are 4 important objects: traffic cone at (2.2, 13.4), traffic cone at (2.1, 11.7), traffic cone at (1.7, 16.8), traffic cone at (1.6, 19.8). These traffic cones are blocking the ego vehicle's lane. The autonomous vehicle should bypass the traffic cone at (2.2, 13.4) from the autonomous vehicle's right side...
By using an encoder specialized for autonomous driving and object-centric tokenization, TOKEN outperforms conventional multimodal LLMs.

Evaluation protocol (excerpt from the paper): "... categorical questions in the DriveLM dataset [7]. To evaluate the model's ability to localize and reason about critical objects, we use precision and recall to measure its grounding ability (Hungarian matching is used to match the predictions with the ground truth), and accuracy to measure its ability to identify whether an object is critical given the object's center location in the ego frame. We consider three variants of trajectory L2 error: the overall, turning, and progress errors, which are calculated from the original L2 distance, heading difference, and longitudinally-weighted L2 distance between the prediction and the GT. We use the average collision rate over the entire horizon to measure the safety of a motion plan. More details about our evaluation protocol can be found in App. B."

Table 1. Quantitative evaluation of the scene understanding, critical object grounding, and planning tasks. TOKEN significantly outperforms baseline VLMs due to its use of driving-task pre-trained features and object-centric tokenization. Bold numbers denote the best results in each column, and numbers shaded in green indicate significant improvements; PARA-Drive's planning performance is shown as a reference (shaded in grey).

             Scene understanding ↑                Critical object grounding ↑           Traj L2 (m) ↓
Method       Obj. class.  Lane-obj. asso.  Acc.   Precision  Recall  Import. class.     1s    2s    3s    Ave_123s  Ave_all
Video-LLaMA  0.28         0.39             0.38   0.22       0.27    0.58               0.27  1.72  6.34  3.01      2.39
VILA-1.5     0.37         0.22             0.42   0.19       0.16    0.55               0.28  1.56  4.41  2.09      1.66
BEV-TOKEN    0.68         0.64             0.61   0.58       0.62    0.76               0.39  1.01  2.02  1.14      0.96
TOKEN        0.92         0.68             0.76   0.87       0.76    0.92               0.26  0.71  1.47  0.81      0.68
PARA-Drive   NA           NA               NA     NA         NA      NA                 0.23  0.68  1.50  0.80      0.66

Results: Table 1 presents the quantitative evaluation of each model's scene understanding, object grounding, and planning performance.
Qualitative example: overtaking an obstacle → TOKEN plans to return to the original lane after avoiding the obstacle.

[Figure: planned trajectories of PARA-Drive, TOKEN, and GT.]

Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan... [omitted]

TOKEN: There is 1 important object: car at (-1.9, 8.1). The autonomous vehicle should proceed with caution and prepare to overtake the car at (-1.9, 8.1) from the autonomous vehicle's left side. It should use the adjacent right lane to continue driving, decelerate and steer slightly to right, ... [omitted]
Qualitative example: traffic cones blocking the lane → TOKEN plans to bypass each cone on the right side.

[Figure: planned trajectories of PARA-Drive, TOKEN, and GT.]

Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan... [omitted]

TOKEN: There are 4 important objects: traffic cone at (2.2, 13.4), traffic cone at (2.1, 11.7), traffic cone at (1.7, 16.8), traffic cone at (1.6, 19.8). These traffic cones are blocking the ego vehicle's lane. The autonomous vehicle should bypass the traffic cone at (2.2, 13.4) from the autonomous vehicle's right side, bypass the traffic cone at (2.1, 11.7) from the autonomous vehicle's right side, bypass the traffic cone at (1.7, 16.8) from the autonomous vehicle's right side, bypass the traffic cone at (1.6, 19.8) from the autonomous vehicle's right side ... [omitted]