Goal: predicting the paths of moving objects ahead of the ego-vehicle
Datasets captured from an on-board (vehicle-mounted) camera viewpoint
• Samples: 1.8K
• Scenes: 53
• Object categories
  - pedestrian
• Additional annotations
  - vehicle information, infrastructure
A dataset recorded on public roads
Apolloscape Dataset
Figure 3. Example scenarios of the TITAN Dataset: pedestrian bounding boxes with tracking IDs, vehicle bounding boxes with IDs, and future locations are each shown in a distinct color. Action labels are shown in different colors following Figure 2.
…ego-centric views captured from a mobile platform.
In the TITAN dataset, every participant (individuals, vehicles, cyclists, etc.) in each frame is localized using a bounding box. We annotated 3 labels (person, 4-wheeled vehicle, 2-wheeled vehicle), 3 age groups for person (child, adult, senior), 3 motion-status labels for both 2- and 4-wheeled vehicles, and door/trunk status labels for 4-wheeled vehicles. For action labels, we created 5 mutually exclusive person action sets organized hierarchically (Figure 2). In the first action set in the hierarchy, the annotator is instructed to assign exactly one class label among 9 atomic whole body actions/postures that describe primitive action poses such as sitting, standing, bending, etc. The second action set includes 13 actions that involve single atomic actions with simple scene context such as jaywalking, waiting to cross, etc. The third action set includes 7 complex contextual actions that involve a sequence of atomic actions with higher contextual understanding, such as getting in/out of a 4-wheel vehicle, loading/unloading, etc. The fourth action set includes 4 transportive actions that describe the act of manually transporting an object.
The bounding box of agent $i$ at each past time step $t$ from 1 to $T_{obs}$ is denoted $x^i_t = \{c_u, c_v, l_u, l_v\}$, where $(c_u, c_v)$ and $(l_u, l_v)$ represent the center and the dimensions of the bounding box, respectively. The proposed TITAN framework requires three inputs as follows: $I^i_{t=1:T_{obs}}$ for the action detector, $x^i_t$ for both the interaction encoder and the past object location encoder, and $e_t = \{\alpha_t, \omega_t\}$ for the ego-motion encoder, where $\alpha_t$ and $\omega_t$ correspond to the acceleration and yaw rate of the ego-vehicle at time $t$, respectively.
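As a concrete illustration of this input format, the following minimal sketch (not the authors' code) packages the three inputs for a single agent; the `Observation` container and its field names are assumptions made for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Observation:
    """Per-agent inputs over the observation horizon T_obs (hypothetical container)."""
    images: np.ndarray        # I^i_{1:T_obs}: frames, shape (T_obs, H, W, 3)
    boxes: np.ndarray         # x^i_t = (c_u, c_v, l_u, l_v) per step, shape (T_obs, 4)
    ego_motion: np.ndarray    # e_t = (alpha_t, omega_t) per step, shape (T_obs, 2)


def make_observation(frames: List[np.ndarray],
                     boxes: List[Tuple[float, float, float, float]],
                     accel: List[float],
                     yaw_rate: List[float]) -> Observation:
    """Stack per-frame annotations into the three encoder inputs."""
    assert len(frames) == len(boxes) == len(accel) == len(yaw_rate)
    return Observation(
        images=np.stack(frames),                      # action detector input
        boxes=np.asarray(boxes, dtype=np.float32),    # interaction / location encoders
        ego_motion=np.stack([accel, yaw_rate], axis=1).astype(np.float32),
    )
```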
During inference, the multiple modes of future bounding box locations are sampled from a bi-variate Gaussian generated by the noise parameters, and the future ego-motions $\hat{e}_t$ are accordingly predicted, considering the multi-modal nature of the future prediction problem.
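A minimal sketch of this kind of multi-modal sampling, assuming the decoder outputs the usual five parameters of a bivariate Gaussian per future step (means, standard deviations, correlation); this is illustrative, not the TITAN implementation.

```python
import numpy as np


def sample_future_centers(mu_u, mu_v, sigma_u, sigma_v, rho, num_modes=20, rng=None):
    """Draw candidate future box centers from a bivariate Gaussian.

    mu_*, sigma_*, rho: arrays of shape (T_pred,) predicted for each future step.
    Returns an array of shape (num_modes, T_pred, 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(mu_u)
    samples = np.empty((num_modes, T, 2))
    for t in range(T):
        mean = np.array([mu_u[t], mu_v[t]])
        cov = np.array([
            [sigma_u[t] ** 2,                  rho[t] * sigma_u[t] * sigma_v[t]],
            [rho[t] * sigma_u[t] * sigma_v[t], sigma_v[t] ** 2],
        ])
        samples[:, t, :] = rng.multivariate_normal(mean, cov, size=num_modes)
    return samples
```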
Henceforth, the notation of the feature embedding function using a multi-layer perceptron (MLP) is as follows: $\Phi$ is without any activation, and $\Phi_r$, $\Phi_t$, and $\Phi_s$ are associated with ReLU, tanh, and a sigmoid function, respectively.
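For readers who prefer code, a small sketch of how these embedding functions could be realized in PyTorch; the layer sizes are placeholders and this is only an interpretation of the notation, not the authors' implementation.

```python
import torch.nn as nn


def make_phi(in_dim: int, out_dim: int, activation: str = "none") -> nn.Module:
    """Single-layer MLP embedding: Phi (no activation), Phi_r (ReLU),
    Phi_t (tanh), Phi_s (sigmoid)."""
    layers = [nn.Linear(in_dim, out_dim)]
    act = {"none": None, "relu": nn.ReLU(), "tanh": nn.Tanh(),
           "sigmoid": nn.Sigmoid()}[activation]
    if act is not None:
        layers.append(act)
    return nn.Sequential(*layers)


# Example: embed a bounding box x_t = (c_u, c_v, l_u, l_v) with Phi_r.
phi_r = make_phi(4, 64, activation="relu")
```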
4.1. Action Recognition
We use the existing state-of-the-art method as backbone
TITAN Dataset
Table 3: Location (bounding box) prediction errors over varying future time steps. MSE in pixels is calculated over all predicted time steps; CMSE and CFMSE are the MSEs calculated over the center of the bounding boxes for the entire predicted sequence and only the last time step, respectively.

Method     | MSE 0.5s | MSE 1s | MSE 1.5s | CMSE | CFMSE | MSE 0.5s | MSE 1s | MSE 1.5s | CMSE | CFMSE
Linear     |      123 |    477 |     1365 |  950 |  3983 |      223 |    857 |     2303 | 1565 |  6111
LSTM       |      172 |    330 |      911 |  837 |  3352 |      289 |    569 |     1558 | 1473 |  5766
B-LSTM [5] |      101 |    296 |      855 |  811 |  3259 |      159 |    539 |     1535 | 1447 |  5615
PIEtraj    |       58 |    200 |      636 |  596 |  2477 |      110 |    399 |     1248 | 1183 |  4780
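A small sketch of how MSE, CMSE, and CFMSE could be computed from predicted and ground-truth bounding boxes, following the definitions in the caption; the (x1, y1, x2, y2) array layout is an assumption.

```python
import numpy as np


def box_prediction_errors(pred, gt):
    """pred, gt: arrays of shape (N, T, 4) with boxes as (x1, y1, x2, y2) in pixels.

    Returns (mse, cmse, cfmse):
      mse   - mean squared error over all box coordinates and predicted steps
      cmse  - MSE over box centers for the entire predicted sequence
      cfmse - MSE over box centers at the final time step only
    """
    mse = np.mean((pred - gt) ** 2)

    def centers(b):
        return np.stack([(b[..., 0] + b[..., 2]) / 2,
                         (b[..., 1] + b[..., 3]) / 2], axis=-1)

    c_pred, c_gt = centers(pred), centers(gt)
    cmse = np.mean((c_pred - c_gt) ** 2)
    cfmse = np.mean((c_pred[:, -1] - c_gt[:, -1]) ** 2)
    return mse, cmse, cfmse
```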
Table 4: Speed prediction errors (MSE) over varying time steps on the PIE dataset. "Last" stands for the last time step. The results are reported in km/h.

Method   | 0.5s | 1s   | 1.5s | Last
Linear   | 0.87 | 2.28 | 4.27 | 10.76
LSTM     | 1.50 | 1.91 | 3.00 |  6.89
PIEspeed | 0.63 | 1.44 | 2.65 |  6.77
is generally better on bounding box centers due to the fewer
degrees of freedom.
Context in trajectory prediction. We first evaluate the proposed speed prediction stream, PIEspeed, by comparing this model with two baseline models, a linear Kalman filter and a vanilla LSTM model. We use the MSE metric and report the results in km/h. Table 4 shows the results of our experiments. The linear model achieves reasonable perfor-
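For illustration, a minimal constant-acceleration Kalman filter of the kind that could serve as a linear baseline for speed prediction; the state layout, noise values, time step, and horizon are assumptions, not the PIE authors' settings.

```python
import numpy as np


def kalman_speed_forecast(observed_speeds, dt=0.1, horizon=15,
                          process_var=1e-2, meas_var=1e-1):
    """Fit a constant-acceleration Kalman filter to observed ego speeds (km/h)
    and roll it forward `horizon` steps without measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])      # state: [speed, speed change rate]
    H = np.array([[1.0, 0.0]])                 # we only observe the speed
    Q = process_var * np.eye(2)
    R = np.array([[meas_var]])

    x = np.array([observed_speeds[0], 0.0])
    P = np.eye(2)
    for z in observed_speeds:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the observed speed
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P

    # pure prediction over the future horizon
    future = []
    for _ in range(horizon):
        x = F @ x
        future.append(x[0])
    return np.array(future)
```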
PIE Dataset
Figure 5: Illustration of our TrafficPredict (TP) method on camera-based images. There are six scenarios with different road
conditions and traffic situations. We only show the trajectories of several instances in each scenario. The ground truth (GT) is
drawn in green and the prediction results of other methods (ED,SL,SA) are shown with different dashed lines. The prediction
trajectories of our TP algorithm (pink lines) are the closest to ground truth in most of the cases.
…instance layer to capture the trajectories and interactions for instances and use a category layer to summarize the simi-
A dataset recorded on public roads
• Samples: 81K
• Scenes: 100,000
• Object categories
  - pedestrian, car, cyclist
A dataset recorded on public roads
• Samples: 645K
• Scenes: 700
• Object categories
  - pedestrian, car, cyclist
• Additional annotations
  - action labels, pedestrian age
Y. Ma, et al., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” AAAI, 2019.
A. Rasouli, et al., “PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction,” ICCV, 2019.
S. Malla, et al., “TITAN: Future Forecast using Action Priors,” CVPR, 2020.
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu,
Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom
nuTonomy: an APTIV company
Abstract

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.
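As a pointer for readers, a short sketch of loading the dataset with the publicly available nuscenes-devkit, assuming the devkit is installed (pip install nuscenes-devkit) and the mini split has been downloaded; the dataroot path is a placeholder.

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split (placeholder dataroot; adjust to your local copy).
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

scene = nusc.scene[0]                                   # one 20 s scene record
print(scene['description'])                             # human-written scene description
first_sample = nusc.get('sample', scene['first_sample_token'])
print(len(first_sample['anns']), 'annotated 3D boxes in this keyframe')
```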
1. Introduction
Figure 1. An example from the nuScenes dataset. We see 6 different camera views, lidar and radar data, as well as the human annotated semantic map. At the bottom we show the human written scene description.
Multimodal datasets are of particular importance as no single type of sensor is sufficient and the sensor types are complementary. Cameras allow accurate measurements of edges, color and lighting enabling classification and localization on the image plane. However, 3D localization from images is challenging [13, 12, 57, 80, 69, 66, 73]. Lidar pointclouds, on the other hand, contain less semantic infor-
nuScenes Dataset
A dataset recorded on public roads
• Samples: 1.4M (camera images)
• Scenes: 1,000
• Object categories
  - truck, bicycle, car, etc.
• Additional annotations
  - sensor data, map data, point clouds, ego-motion
H. Caesar, et al., “nuScenes: A multimodal dataset for autonomous driving,” CVPR, 2020.