Image Encoder: ViT

[Figure 1 from the ViT paper ("Preprint. Under review."): model overview. An image is split into fixed-size patches, each patch is flattened and linearly embedded ("Linear Projection of Flattened Patches"), position embeddings are added, and the resulting sequence, together with an extra learnable [class] embedding, is fed to a standard Transformer encoder (L stacked blocks of Norm, Multi-Head Attention, Norm, and MLP, each with a residual connection). An MLP head on top predicts the class (Bird, Ball, Car, ...).]

Text Decoder: LLM

[Figure: text decoder. Input tokens: the text input (the question). Output tokens: the text output (the answer). The decoder applies Self-Attention over its text tokens and Cross-Attention from the text tokens to the features of the input image produced by the encoder.]
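To make the two attention patterns concrete, here is a minimal NumPy sketch (not the papers' implementation; every size, weight matrix, and helper name is invented for illustration) of ViT's patch embedding and of a single cross-attention step in which text tokens query image tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=4):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    rows = [image[i:i + patch, j:j + patch].reshape(-1)
            for i in range(0, H, patch)
            for j in range(0, W, patch)]
    return np.stack(rows)              # (num_patches, patch*patch*C)

def cross_attention(text_q, image_kv, d=16):
    """Single-head cross-attention: text tokens query image tokens."""
    Wq = rng.normal(size=(text_q.shape[-1], d))
    Wk = rng.normal(size=(image_kv.shape[-1], d))
    Wv = rng.normal(size=(image_kv.shape[-1], d))
    Q, K, V = text_q @ Wq, image_kv @ Wk, image_kv @ Wv
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V                    # (num_text_tokens, d)

img = rng.normal(size=(16, 16, 3))
patches = patchify(img)                # 16 patches, each of dim 4*4*3 = 48
W_embed = rng.normal(size=(48, 32))
image_tokens = patches @ W_embed       # linear projection of flattened patches
text_tokens = rng.normal(size=(5, 32))
out = cross_attention(text_tokens, image_tokens)
print(out.shape)                       # (5, 16)
```

In a real model the projections are learned, attention is multi-head, and position embeddings are added to the image tokens; the sketch only shows the data flow.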
BEVFormer [Li+, ECCV 2022]

Temporal Attention: extracts features by capturing the relation between the BEV queries at timestamp t and the BEV features at timestamp t-1.

[Figure: BEVFormer overview. Multi-camera images at timestamp t are fused into the BEV at timestamp t through Spatial Attention over the BEV queries, while Temporal Attention looks up and aggregates the ego-aligned BEV at timestamp t-1 ("What's in here at timestamp t?" → "There is a car!").]
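A rough sketch of the idea behind temporal attention (a toy stand-in, not BEVFormer's deformable temporal self-attention; the grid size, feature dimension, and gated fusion are invented for illustration): the t-1 BEV grid is first aligned to the current ego pose, then each BEV query is fused with the feature at the same world location:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 6x6 BEV grid with 8-dim features at time t-1 (toy numbers).
bev_prev = rng.normal(size=(6, 6, 8))

def align_prev_bev(bev_prev, ego_dx, ego_dy):
    """Shift the t-1 BEV grid by the ego displacement (in grid cells) so each
    cell refers to the same world location as the time-t BEV queries."""
    return np.roll(np.roll(bev_prev, -ego_dx, axis=0), -ego_dy, axis=1)

def temporal_lookup(bev_queries, aligned_prev):
    """Per-cell gated fusion of the current query with the aligned t-1 feature
    (a simple stand-in for attention-based lookup and aggregation)."""
    gate = 1.0 / (1.0 + np.exp(-(bev_queries * aligned_prev).sum(-1, keepdims=True)))
    return gate * aligned_prev + (1.0 - gate) * bev_queries

bev_q = rng.normal(size=(6, 6, 8))
aligned = align_prev_bev(bev_prev, ego_dx=1, ego_dy=0)  # ego moved one cell forward
fused = temporal_lookup(bev_q, aligned)
print(fused.shape)  # (6, 6, 8)
```

BEVFormer instead samples the aligned previous BEV with learned deformable offsets, but the alignment-then-lookup structure is the same.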
BEVFormer model architecture: Spatial Cross-Attention

Cross-Attention on the images is applied by projecting the 3D reference points of each BEV query into the images, which defines the reference points for attention:

    SCA(Q_p, F_t) = (1 / |V_hit|) * sum_{i in V_hit} sum_{j=1}^{N_ref} DeformAttn(Q_p, P(p, i, j), F_t^i)

- F_t^i: feature map of the i-th camera image at timestamp t.
- V_hit: the set of camera views hit by the projected reference points.
- P(p, i, j): the image position (x_ij, y_ij) of the j-th 3D reference point (x', y', z'_j) of the BEV query at position p, obtained with the camera matrix T_i:

    z_ij * [x_ij, y_ij, 1]^T = T_i * [x', y', z'_j, 1]^T

Key points:
- The camera matrix T_i projects the 3D reference points into image space.
- The height direction of BEV space is taken into account by sampling several heights z'_j above each BEV position (x', y').
- Multi-view input is taken into account by averaging over the hit views V_hit.

[Figure: (b) Spatial Cross-Attention. The point (x', y') of a BEV query at position p is lifted to 3D points (x', y', z'_j) and projected into the hit views V_hit; the resulting BEV features feed the Det & Seg heads.]
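The projection step can be sketched as follows (a toy NumPy example; the camera matrices, image size, and reference points are invented, and DeformAttn itself is omitted):

```python
import numpy as np

def project_to_view(T_i, p3d):
    """Project a 3D reference point (x', y', z'_j) with camera matrix T_i (3x4):
    z * [x, y, 1]^T = T_i @ [x', y', z', 1]^T."""
    hom = T_i @ np.append(p3d, 1.0)        # homogeneous image coordinates (3,)
    z = hom[2]
    if z <= 1e-6:                          # point behind the camera: no hit
        return None
    return hom[:2] / z                     # pixel coordinates (x_ij, y_ij)

def hit_views(cam_matrices, ref_points, img_w, img_h):
    """Return the set V_hit of views that at least one reference point lands in."""
    hits = set()
    for i, T_i in enumerate(cam_matrices):
        for p in ref_points:
            uv = project_to_view(T_i, p)
            if uv is not None and 0 <= uv[0] < img_w and 0 <= uv[1] < img_h:
                hits.add(i)
                break
    return hits

# Toy pinhole cameras (hypothetical intrinsics/extrinsics): one looks along +z,
# the other along -z.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
T_front = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                 # 3x4
T_back = K @ np.hstack([np.diag([1.0, 1.0, -1.0]), np.zeros((3, 1))])  # 3x4

# Pillar of reference points above one BEV cell: same (x', y'), several heights z'_j.
refs = [np.array([0.2, 0.1, z]) for z in (1.0, 2.0, 3.0)]
print(hit_views([T_front, T_back], refs, 128, 128))  # {0}: only the front view is hit
```

BEVFormer then runs deformable attention around each projected point in each hit view and averages the outputs over |V_hit|, as in the SCA equation above.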
PARA-Drive [Weng+, CVPR 2024]

[Figure 2: Visual comparison between PARA-Drive and recent state-of-the-art architectures for end-to-end modular AV: UniAD [12] (CVPR '23), VAD [13] (ICCV '23), OccNet [25] (ICCV '23). Edge legend: latent features / compact outputs / latent + compact / test time only. Three major design variations are highlighted: (1) the inclusion of a different set of modules, (2) distinct module placements due to different inter-module connections, and (3) information flow with different representations, such as high-dimensional latent features and compact outputs from upstream modules. Differing from prior work, PARA-Drive is designed to be completely parallelized for runtime efficiency, while achieving state-of-the-art driving performance and maintaining safety and interpretability.]

Table 4. Ablation on information flow when the BEV feature maps are not used in planning. High-dimensional latent queries tend to carry more information for planning performance in this case.

                                          Col. Rate (%) ↓               L2 (m) ↓
Methods                                   1.0s  2.0s  3.0s  Ave_all     1.0s  2.0s  3.0s  Ave_all
Baseline                                  0.00  0.07  0.51  0.13        0.24  0.55  1.07  0.53
Baseline - BEV [6]                        3.41  8.09  7.91  5.88        2.83  5.37  7.61  4.66
(4) Baseline - BEV + "Mot-Plan" bbox      0.00  0.29  3.94  0.97        0.34  1.15  2.53  1.10
(4) Baseline - BEV + "Mot-Plan" query     0.00  0.10  0.46  0.14        0.24  0.58  1.09  0.54
(5) Baseline - BEV + "Map-Plan" BEV       0.12  1.07  3.44  1.22        0.95  1.82  2.63  1.59
(5) Baseline - BEV + "Map-Plan" query     0.02  0.20  0.65  0.20        0.28  0.62  1.15  0.58
(6) Baseline - BEV + "Occ-Plan" BEV       0.48  1.75  4.84  1.85        1.96  3.75  5.41  3.26
(6) Baseline - BEV + "Occ-Plan" query     0.14  0.27  1.00  0.38        0.42  0.82  1.47  0.78

[6] This model also removes the edge (4), which we omit for simplicity; i.e., the only input to planning is the high-level command. We use this model as the basis to validate which added information helps planning the most when the BEV features are not used.
[Figure 4: All inter-module connectivities that build a conditional dependency between modules (Mapping, Occupancy Prediction, Motion Prediction, Planning, all fed by BEV features via edge (0)). For example, edge (1) refers to using outputs from the mapping module for motion prediction. The improved baseline removes edges (1)(2); alternative designs (3)-(7) vary the inter-module connectivity.]

Table 5. Ablation experiments on the necessity of modules.

Scenarios  Methods                          Col. ↓    L2 ↓     Map Comp. (%) ↓
                                            Ave_all   Ave_all  Offroad  OffLane
val        Baseline - edge (4) [7]          0.20      0.5734   0.32     1.20
           Baseline - Map.                  0.16      0.5332   0.71     3.03
           Baseline - Occ.                  0.64      0.8174   4.19     4.43
           Baseline - Mot.                  0.60      0.8561   4.05     4.49
val        Baseline - Occ. + "Mot-Plan"     0.14      0.5483   0.38     0.82
           Baseline - Mot. + "Occ-Plan"     0.25      0.5953   0.80     1.33

[7] Similar to Table 4 row 2, we eliminate the edge (4) as the base model to fairly compare the impact of removing each upstream module on planning.

In rows 8-9, for edge (5), we pass either the compact semantic BEV map denoting the road and lane geometry information or the latent queries of the map elements to downstream planning. Similarly, in rows 10-11 for edge (6), we compare the use of compact BEV occupancy maps or object queries for planning. Interestingly, although passing the compact BEV outputs leads to slightly better performance than passing the high-dimensional query features, we find that it is not necessary to pass either the compact outputs or the per-module latent queries, due to the potential redundancy with information already flowing from the BEV feature maps into the planning module via the edge (0).

→ The design space gets complicated! Keep It Simple, Stupid (KISS):

[Figure 5: PARA-Drive architecture. Perception, prediction, and planning modules are co-trained in parallel: each head (Map queries, Motion queries, Occupancy queries, Plan queries) cross-attends (C) to the shared BEV queries, and the plan queries additionally take the ego vehicle's information. No dependency is introduced between the heads, while the shared BEV queries ensure that relevant information still reaches planning. Because the planning head does not consume the outputs of the perception and prediction heads, those heads can be deactivated at runtime or run at a reduced frame rate (e.g., only for visualization or safety checks), which significantly boosts the frame rate of the motion-planning module.]
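The runtime consequence of the parallel design can be illustrated with a toy sketch (hypothetical names and shapes; each head is a linear stand-in for a cross-attention module over the shared BEV queries):

```python
import numpy as np

rng = np.random.default_rng(2)

class Head:
    """A toy per-task head reading the shared BEV queries."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(size=(d_in, d_out))
    def __call__(self, bev):
        return bev @ self.W

# Shared BEV queries feed every head; there are no edges between the heads.
bev = rng.normal(size=(100, 32))
heads = {"map": Head(32, 8), "motion": Head(32, 8),
         "occ": Head(32, 8), "plan": Head(32, 2)}

def run(heads, bev, active):
    """Because the heads are parallel, any subset can be deactivated at
    runtime; planning still works without the others."""
    return {name: h(bev) for name, h in heads.items() if name in active}

out = run(heads, bev, active={"plan"})  # perception/prediction heads skipped
print(sorted(out), out["plan"].shape)   # ['plan'] (100, 2)
```

In a sequential design such as UniAD, dropping an upstream module would break the planning input; here it only removes an optional output.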
TOKEN: "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving"
Ran Tian, Boyi Li, Xinshuo Weng, Yuxiao Chen, Edward Schmerling, Yue Wang, Boris Ivanovic, Marco Pavone (NVIDIA, UC Berkeley, University of Southern California, Stanford University). arXiv v1, [cs.AI], 1 Jul 2024.

[Figure: TOKEN, a multi-modal LLM (LLaMA) with object-centric tokenization for autonomous driving. Inputs <multi-view video, HD map, ..., symbolic representation> pass through both scene-centric and object-centric tokenization into the LLM, which answers driving QA of three types:
- Scene Understanding. Q: "What lane is the vehicle at (11.9, 12.1) driving in relative to the ego vehicle?" A: "Vehicle at (11.9, 12.1) is merging into ego vehicle's lane."
- Reasoning. Q: "Should the autonomous vehicle pay attention to the object located at (-3.2, 6.7)? Why?" A: "The object is a car and is blocking the autonomous vehicle's lane. The answer is: Yes."
- Hierarchical Planning (1) critical objects grounding; 2) interaction planning; 3) lane-level decision-making; 4) motion plan generation). Q: "The autonomous vehicle needs to keep forward along the road... [omitted]. Please plan its 3-second motion plan." A: "There is 1 important object: car at (8.6, 3.9). The autonomous vehicle should yield to the car at (8.6, 3.9). It should remain roughly static, and..."]
TOKEN training procedure

Planning fine-tuning: the Adapter and LoRA weights are trained with Planning QA.

[Figure: TOKEN training setup. Tokens are extracted from a Transformer-based driving model; scene-centric and object-centric tokens from <multi-view video, HD map, ..., symbolic representation> pass through the Adapter into the LLM (LLaMA), which is fine-tuned with LoRA on QA pairs.]
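The two trainable pieces can be sketched in NumPy (a minimal illustration, not TOKEN's code; the sizes, init scales, and adapter shape are invented): LoRA adds a low-rank update A·B to a frozen weight, and the adapter maps driving-model tokens into the LLM embedding space:

```python
import numpy as np

rng = np.random.default_rng(3)

d, r = 64, 4                      # hidden size and LoRA rank (hypothetical)
W = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, d))              # B starts at zero, so LoRA is a no-op before training
alpha = 8.0

def lora_linear(x):
    """y = x W + (alpha / r) * x A B; only A and B receive gradients."""
    return x @ W + (alpha / r) * (x @ A) @ B

class Adapter:
    """Bottleneck adapter mapping driving-model tokens into the LLM embedding space."""
    def __init__(self, d_in, d_mid, d_out):
        self.W1 = rng.normal(size=(d_in, d_mid)) * 0.1
        self.W2 = rng.normal(size=(d_mid, d_out)) * 0.1
    def __call__(self, tokens):
        return np.maximum(0.0, tokens @ self.W1) @ self.W2

x = rng.normal(size=(5, d))
assert np.allclose(lora_linear(x), x @ W)   # identical to the frozen layer at init
obj_tokens = rng.normal(size=(7, 16))       # object-centric tokens from the driving model
print(Adapter(16, 8, d)(obj_tokens).shape)  # (7, 64)
```

Freezing W keeps the LLM's general knowledge intact while the small A, B, and adapter weights specialize it to the planning QA data.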
Scene Understanding QA: the task of generating answers about scene understanding (object types, states, and lane relations).

Examples:
- Q: What lane is the vehicle at (11.9, 12.1) driving in relative to the autonomous vehicle?
  A: Vehicle at (11.9, 12.1) is merging into ego vehicle's lane.
- Q: What is the observed status of the object at (-3.8, 6.0)?
  A: Stationary.
- Q: What is the type of the object at (-3.8, 6.0)?
  A: Pedestrian.
Reasoning QA: the task of generating an explanation of the reason behind the answer to a question.

Examples:
- Q: Should the autonomous vehicle pay attention to the object located at (-3.2, 6.7)? Why?
  A: The object is a car and is blocking the autonomous vehicle's lane. The answer is: Yes.
- Q: Should the autonomous vehicle pay attention to the object located at (3.1, 8.1)? Why?
  A: The object is a pedestrian with velocity (-1.6, 0.2) m/s. Thus, it is crossing in front of the autonomous vehicle. The answer is: Yes.
Planning QA: the task of generating an action plan.

Examples:
- Q: The autonomous vehicle needs to keep forward along the road... [omitted]
  A: There is 1 important object: car at (8.6, 3.9). The autonomous vehicle should yield to the car at (8.6, 3.9). It should remain roughly static, and...
- Q: The autonomous vehicle needs to keep forward along the road... [omitted]
  A: There are 4 important objects: traffic cone at (2.2, 13.4), traffic cone at (2.1, 11.7), traffic cone at (1.7, 16.8), traffic cone at (1.6, 19.8). These traffic cones are blocking the ego vehicle's lane. The autonomous vehicle should bypass the traffic cone at (2.2, 13.4) from the autonomous vehicle's right side...
By using an encoder specialized for autonomous driving and object-centric tokenization, TOKEN outperforms conventional multimodal LLMs.

Evaluation protocol (excerpt from the paper): "... categorical questions in the DriveLM dataset [7]. To evaluate the model's ability to localize and reason about critical objects, we use precision and recall to measure its grounding ability (Hungarian matching is used to match the predictions with the ground truth), and accuracy to measure its ability to identify whether an object is critical given the object's center location in the ego frame. We consider three variants of trajectory L2 error: the overall, turning, and progress errors, which are calculated from the original L2 distance, heading difference, and longitudinally-weighted L2 distance between the prediction and the GT. We use the average collision rate over the entire horizon to measure the safety of a motion plan. More details about our evaluation protocol can be found in App. B."

Table 1. Quantitative evaluation of the scene understanding, critical object grounding, and planning tasks. TOKEN significantly outperforms baseline VLMs due to its use of driving-task pre-trained features and object-centric tokenization. Bold numbers denote the best results in each column, and numbers shaded in green indicate significant improvements; PARA-Drive's planning performance is shown as a reference (shaded in grey).

             Scene understanding ↑                Critical object grounding ↑           Traj L2 (m) ↓
Method       Obj. class.  Lane-obj. asso.  Acc.   Precision  Recall  Import. class.     1s    2s    3s    Ave_123s  Ave_all
Video-LLaMA  0.28         0.39             0.38   0.22       0.27    0.58               0.27  1.72  6.34  3.01      2.39
VILA-1.5     0.37         0.22             0.42   0.19       0.16    0.55               0.28  1.56  4.41  2.09      1.66
BEV-TOKEN    0.68         0.64             0.61   0.58       0.62    0.76               0.39  1.01  2.02  1.14      0.96
TOKEN        0.92         0.68             0.76   0.87       0.76    0.92               0.26  0.71  1.47  0.81      0.68
PARA-Drive   NA           NA               NA     NA         NA      NA                 0.23  0.68  1.50  0.80      0.66

Results: Table 1 presents the quantitative evaluation of each model's scene understanding, object grounding, and planning performance.
Qualitative example: overtaking an obstacle → TOKEN plans to return to the original lane after avoiding the obstacle.

[Figure: planned trajectories of PARA-Drive, TOKEN, and GT.]

Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan... [omitted]

TOKEN: There is 1 important object: car at (-1.9, 8.1). The autonomous vehicle should proceed with caution and prepare to overtake the car at (-1.9, 8.1) from the autonomous vehicle's left side. It should use the adjacent right lane to continue driving, decelerate and steer slightly to right, ... [omitted]
Qualitative example: traffic cones blocking the lane → TOKEN plans to bypass each cone on the right side.

[Figure: planned trajectories of PARA-Drive, TOKEN, and GT.]

Prompt: The autonomous vehicle needs to keep forward along the road. What objects are important for the autonomous vehicle's planning? What are these objects, their (x, y) locations, and how should the autonomous vehicle interact with them? Please plan... [omitted]

TOKEN: There are 4 important objects: traffic cone at (2.2, 13.4), traffic cone at (2.1, 11.7), traffic cone at (1.7, 16.8), traffic cone at (1.6, 19.8). These traffic cones are blocking the ego vehicle's lane. The autonomous vehicle should bypass the traffic cone at (2.2, 13.4) from the autonomous vehicle's right side, bypass the traffic cone at (2.1, 11.7) from the autonomous vehicle's right side, bypass the traffic cone at (1.7, 16.8) from the autonomous vehicle's right side, bypass the traffic cone at (1.6, 19.8) from the autonomous vehicle's right side ... [omitted]