Research Trends in Trajectory Prediction Using Deep Learning

himidev
October 09, 2020


A survey of trajectory prediction using Deep Learning


Transcript

  1. Survey paper: Research Trends in Trajectory Prediction Using Deep Learning. Hiroaki Minoura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi. PRMU workshop, October 2020. Machine Perception and Robotics Group, Chubu University.

  2. Trajectory prediction is the technology of predicting the future path of a target. Applications include autonomous driving (accident prevention, autonomous driving) and robots (navigation).

  3. Categories of trajectory prediction (a minimal sketch of the Bayesian-based predict/update loop follows below):
     • Bayesian-based (ex. model: Kalman filter) — updates the future internal state from noise-corrupted observed states and sequentially estimates the prediction.
     • Deep Learning-based (ex. models: LSTM, CNN) — learns future behavior from the target's past trajectory (input: past, output: future, via a prediction model). This is the scope of this survey.
     • Planning-based (ex. models: IRL, RRT*) — optimizes a reward value from the start to the goal.
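
To make the Bayesian-based category concrete, here is a minimal constant-velocity Kalman-filter sketch (not from the slides): the internal state is updated from noisy position observations and then rolled forward to forecast future positions. All noise settings below are arbitrary assumptions.

```python
# Minimal constant-velocity Kalman filter: predict from an internal state,
# then update it with a noisy observation. Noise levels are assumptions.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],   # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # we only observe the position (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)           # process noise
R = 0.10 * np.eye(2)           # observation noise

x = np.zeros(4)                # internal state estimate
P = np.eye(4)                  # state covariance

observations = [np.array([0.0, 0.0]), np.array([1.0, 0.1]), np.array([2.1, 0.2])]
for z in observations:
    # predict step
    x = F @ x
    P = F @ P @ F.T + Q
    # update step with the noisy observation
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P

# roll the state forward to forecast the next 3 positions
pred = []
for _ in range(3):
    x = F @ x
    pred.append(x[:2].copy())
print(np.round(np.array(pred), 2))
```
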
  4. Elements needed for trajectory prediction with Deep Learning:
     • View point: first-person view, on-board (vehicle-mounted) camera view, bird's-eye view.
     • Model: Long Short-Term Memory, Convolutional Neural Network, Gated Recurrent Unit, Temporal Convolutional Network.
     • Context: target class, interaction between targets, static environment information.
     [Slide figures: example viewpoints, an LSTM cell diagram, and prediction-error tables (MSE / CMSE / CFMSE on PIE and JAAD; single-/multi-future errors) excerpted from the cited papers.]
  5. Elements needed for trajectory prediction with Deep Learning (continued): the same three axes — view point (first-person, on-board camera, bird's-eye), context (target class, interaction between targets, static environment information), and model (LSTM, CNN, GRU, TCN).
     [Slide figures: same illustrative images and excerpted error tables as the previous slide.]
  6. What "interaction" means in trajectory prediction: predicting paths in which moving targets avoid colliding with one another; the interaction information is obtained from the distances and directions between moving targets.

  7. Trends and taxonomy of prediction methods using Deep Learning: a 2016-2020 timeline that separates interaction-aware methods from other methods. Methods shown: Social-LSTM [A. Alahi+, CVPR, 2016], DESIRE [N. Lee+, CVPR, 2017], Conv.Social-Pooling [N. Deo+, CVPRW, 2018], SoPhie [A. Sadeghian+, CVPR, 2019], Social-BiGAT [V. Kosaraju+, NeurIPS, 2019], Social-STGCNN [A. Mohamed+, CVPR, 2020], Social-GAN [A. Gupta+, CVPR, 2018], Next [J. Liang+, CVPR, 2019], STGAT [Y. Huang+, ICCV, 2019], Trajectron [B. Ivanovic+, ICCV, 2019], Social-Attention [A. Vemula+, ICRA, 2018], Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019], MX-LSTM [I. Hasan+, CVPR, 2018], CIDNN [Y. Xu+, CVPR, 2018], SR-LSTM [P. Zhang+, CVPR, 2019], Group-LSTM [N. Bisagno+, ECCVW, 2018], Reciprocal Network [S. Hao+, CVPR, 2020], PECNet [K. Mangalam+, ECCV, 2020], RSBG [J. Sun+, CVPR, 2020], STAR [C. Yu+, ECCV, 2020], Behavior CNN [S. Yi+, ECCV, 2016], Future localization in first-person videos [T. Yagi+, CVPR, 2018], Fast and Furious [W. Luo+, CVPR, 2018], OPPU [A. Bhattacharyya+, CVPR, 2018], Object Attributes and Semantic Segmentation [H. Minoura+, VISAPP, 2019], Rules of the Road [J. Hong+, CVPR, 2019], Multiverse [J. Liang+, CVPR, 2020], Trajectron++ [T. Salzmann+, ECCV, 2020]. The interaction-aware methods are further labelled as Pooling models or Attention models; in recent years, Attention models and methods that predict multiple (multimodal) paths have become mainstream.
  8. Interaction-aware trajectory prediction models:
     • Pooling models: by pooling the positions of the prediction target together with those of the other agents, collision-avoiding trajectory prediction becomes possible.
     • Attention models: by computing attention over the other agents, it becomes possible to visually capture whom, and how strongly, the target attended to while moving.

  9. Purpose of this survey: a survey of trends in trajectory prediction methods using Deep Learning.
     • Summarize the characteristics of the prediction methods in each category: (1) with interaction — Pooling models and Attention models; (2) without interaction (Other).
     • Also introduce the datasets and evaluation metrics used for quantitative evaluation.
     • Discuss the accuracy and prediction results of each model using representative models.
  10. Purpose of this survey (section divider, repeating the agenda): method characteristics per category (with interaction: Pooling / Attention models; without interaction: Other), datasets and evaluation metrics, and a discussion of the accuracy and predictions of representative models.
  11. Social-LSTM [A. Alahi+, CVPR, 2016]: predicts the trajectories of multiple pedestrians simultaneously.
     • Proposes the Social Pooling layer (S-Pooling) to avoid collisions between pedestrians: (1) the positions and hidden-layer outputs of the other agents around the target are taken as input; (2) the spatial relations between pedestrians are kept in the LSTM hidden state at the next time step.
     • As a result, collision-avoiding trajectory prediction becomes possible (a minimal pooling-grid sketch follows below).
     [Slide figure: qualitative comparison of Linear, Social-LSTM, SF [73] and ground truth (GT) trajectories.]
     A. Alahi, et al., “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” CVPR, 2016.
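
As a concrete illustration of the S-Pooling idea, below is a minimal sketch (not the authors' implementation) that sums neighbours' hidden states into a grid centred on the target pedestrian; grid size and cell size are assumptions.

```python
# Hedged sketch of an S-Pooling-style grid.
import numpy as np

def social_pooling(pos, hidden, target_idx, grid_size=4, cell=2.0):
    """Pool neighbours' hidden states into a grid centred on the target.

    pos:    (N, 2) array of agent positions at the current time step
    hidden: (N, D) array of the agents' LSTM hidden states
    Returns a (grid_size, grid_size, D) tensor for the target agent.
    """
    N, D = hidden.shape
    H = np.zeros((grid_size, grid_size, D))
    half = grid_size * cell / 2.0
    for j in range(N):
        if j == target_idx:
            continue
        dx, dy = pos[j] - pos[target_idx]
        if abs(dx) >= half or abs(dy) >= half:
            continue  # neighbour is outside the pooling window
        gx = int((dx + half) // cell)
        gy = int((dy + half) // cell)
        H[gy, gx] += hidden[j]  # sum hidden states falling in the same cell
    return H

# toy usage: 3 agents, 8-dim hidden states
pos = np.array([[0.0, 0.0], [1.0, 0.5], [10.0, 10.0]])
hid = np.random.randn(3, 8)
print(social_pooling(pos, hid, target_idx=0).shape)  # (4, 4, 8)
```
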
  12. DESIRE [N. Lee+, CVPR, 2017]: considers the surrounding environment information in addition to the interaction between agents.
     • Realizes trajectory prediction that avoids obstacle regions such as intersections and road edges.
     • Encoding with a CVAE allows multiple paths to be predicted; a Ranking & Refinement Module ranks the predicted paths.
     • Prediction accuracy is improved by iteratively refining the paths.
     [Slide figures: DESIRE framework overview (a CVAE-based RNN encoder-decoder sample generation module followed by an IOC-style ranking & refinement module) and KITTI qualitative results comparing ground truth with Top-1/Top-10 predictions, with and without interaction.]
     N. Lee, et al., “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” CVPR, 2017.
  13. Convolutional Social Pooling [N. Deo+, CVPRW, 2018]: prediction that considers the interaction between adjacent vehicles on a highway.
     • Proposes Convolutional Social Pooling, which gives the interaction information spatial meaning: (1) trajectory features obtained with an LSTM encoder are placed into a fixed-size social tensor; (2) a CNN computes interaction features from the tensor; (3) these are concatenated with the predicted vehicle's features and an LSTM decoder predicts the path (a small sketch follows below).
     [Slide figure: proposed model — an LSTM encoder with shared weights, convolutional social pooling layers, and a maneuver-based decoder that outputs a multi-modal predictive distribution.]
     N. Deo, et al., “Convolutional Social Pooling for Vehicle Trajectory Prediction,” CVPRW, 2018.
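
The following is a hedged sketch of the convolutional-social-pooling block described above: LSTM-encoded neighbour features are placed on a fixed-size spatial grid and a small CNN extracts interaction features. The grid shape and channel counts are illustrative, not the paper's exact configuration.

```python
# Hedged sketch of a convolutional-social-pooling block (sizes are assumptions).
import torch
import torch.nn as nn

class ConvSocialPool(nn.Module):
    def __init__(self, enc_dim=64, grid_h=13, grid_w=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(enc_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d((2, 1)),  # pool the spatial grid down
        )

    def forward(self, social_tensor):
        # social_tensor: (B, enc_dim, grid_h, grid_w); each occupied cell
        # holds the LSTM-encoded track feature of one neighbouring vehicle.
        return self.conv(social_tensor).flatten(1)  # (B, 16*2*1)

B, enc_dim = 4, 64
social_tensor = torch.zeros(B, enc_dim, 13, 3)
social_tensor[:, :, 6, 1] = torch.randn(B, enc_dim)  # one neighbour mid-grid
print(ConvSocialPool()(social_tensor).shape)  # torch.Size([4, 32])
```
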
  14. MX-LSTM [I. Hasan+, CVPR, 2018]: trajectory prediction that exploits pedestrians' gaze information.
     • Pools only the other agents inside the view frustum centred on the head: (1) the agents to pool are selected from the target's head orientation and the distances to the other agents.
     • Trajectories (tracklets), head orientation (vislets) and interaction information are fed to the LSTM: (1) realizes prediction that avoids collisions with agents inside the view frustum; (2) by changing the gaze information arbitrarily, paths heading in an arbitrary direction can be predicted.
     [Slide figure/excerpt: MX-LSTM tracklets and vislets, the VFOA social pooling, and qualitative results.]
     I. Hasan, et al., “MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses,” CVPR, 2018.
  15. Group-LSTM [N. Bisagno+, ECCVW, 2018]: trajectory prediction that considers group-related interaction.
     • Pedestrians with similar motion tendencies are regarded as a group (identified with the coherent filtering algorithm).
     • Only individuals outside the target's own group are pooled, so the network predicts paths that avoid collisions with different groups.
     [Slide figures: illustration of the social hidden-state tensor and ETH/UCY qualitative comparisons against Social-LSTM (ground truth in green, predictions in blue).]
     N. Bisagno, et al., “Group LSTM: Group Trajectory Prediction in Crowded Scenarios,” ECCVW, 2018.
  16. Social GAN [A. Gupta+, CVPR, 2018]: predicts multiple paths with a GAN.
     • Generator: samples multiple predicted paths — (1) a Pooling Module outputs interaction information from the LSTM encoder features; (2) each output is concatenated with a noise vector and an LSTM decoder outputs multiple future paths.
     • Discriminator: distinguishes predicted paths from real paths.
     • Adversarial training is expected to make the generator produce predicted paths that pass as real ones (a small generator-style sketch follows below).
     [Slide figures: system overview (Generator, Pooling Module, Discriminator), qualitative collision-avoidance comparisons with and without pooling, and the predicted distribution.]
     A. Gupta, et al., “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” CVPR, 2018.
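
A rough sketch of the generator side described above, assuming illustrative dimensions: the encoder state, a pooled interaction vector and a noise vector are concatenated to initialise an LSTM decoder, and sampling different noise vectors yields different futures. This is not the authors' code.

```python
# Hedged sketch of a Social-GAN-style generator decoder.
import torch
import torch.nn as nn

class TrajGenerator(nn.Module):
    def __init__(self, enc_dim=32, pool_dim=32, z_dim=8, pred_len=12):
        super().__init__()
        self.z_dim, self.pred_len = z_dim, pred_len
        state = enc_dim + pool_dim + z_dim
        self.cell = nn.LSTMCell(2, state)
        self.to_xy = nn.Linear(state, 2)

    def forward(self, h_enc, pooled, last_pos):
        z = torch.randn(h_enc.size(0), self.z_dim)   # noise gives multimodality
        h = torch.cat([h_enc, pooled, z], dim=1)     # initial decoder state
        c = torch.zeros_like(h)
        pos, outputs = last_pos, []
        for _ in range(self.pred_len):
            h, c = self.cell(pos, (h, c))
            pos = pos + self.to_xy(h)                # predict a displacement
            outputs.append(pos)
        return torch.stack(outputs, dim=1)           # (B, pred_len, 2)

gen = TrajGenerator()
h_enc, pooled, last = torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 2)
samples = [gen(h_enc, pooled, last) for _ in range(3)]  # 3 different futures
print(samples[0].shape)  # torch.Size([4, 12, 2])
```
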
  17. Multi-Agent Tensor Fusion [T. Zhao+, CVPR, 2019]: prediction that considers interaction between heterogeneous moving agents such as pedestrians and cars.
     • Jointly models scene context in addition to interaction, so paths that avoid both dynamic and static objects can be predicted.
     • Multi-Agent Tensor Fusion: (1) a CNN extracts scene context information; (2) each agent's LSTM output is placed on a spatial grid according to its position; (3) the context information and the spatial grid are concatenated channel-wise and fused with a CNN; (4) an LSTM decoder predicts paths from the fused features.
     [Slide figure: ablative results on the Stanford Drone dataset (input, ground truth, predictions).]
     T. Zhao, et al., “Multi-Agent Tensor Fusion for Contextual Trajectory Prediction,” CVPR, 2019.
  18. Reciprocal Network [S. Hao+, CVPR, 2020]: trajectory prediction by reciprocal learning with two coupled networks.
     • Forward prediction network: the usual direction (observation → prediction).
     • Backward prediction network: the reverse of the usual direction (prediction → observation).
     • Based on the reciprocal constraint, a model built on the idea of adversarial attacks is constructed: the input trajectory is modified iteratively so that it becomes consistent with the model output, a scheme the authors call a reciprocal attack.
     [Slide figures/excerpt: the idea of reciprocal learning, the conditional-GAN-style generator/discriminator, and the proposed attack method.]
     S. Hao, et al., “Reciprocal Learning Networks for Human Trajectory Prediction,” CVPR, 2020.
  19. PECNet (Predicted Endpoint Conditioned Network) [K. Mangalam+, ECCV, 2020]: trajectory prediction whose training emphasizes the predicted final position (endpoint).
     (1) The endpoint is predicted (with the latent endpoint sub-network, D_latent in the figure) and concatenated with the past-encoding output (concat encoding).
     (2) The parameter features inside the social pooling block are obtained from the concatenated features.
     (3) A pedestrian-by-pedestrian social mask captures the interaction between pedestrians.
     (4) The path is predicted from the concat encoding and the interaction information (P_future in the figure). A small endpoint-conditioning sketch follows below.
     [Slide excerpt: discussion of the effect of the number of samples K on ADE/FDE and of the VAE design choice, plus multimodal visualizations (input, ground truth, predictions).]
     K. Mangalam, et al., “It is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction,” ECCV, 2020.
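
Below is a small sketch of the endpoint-conditioning idea (sample an endpoint from a VAE-style latent, then regress the remaining waypoints conditioned on it); layer names and sizes are assumptions, not PECNet's actual architecture.

```python
# Hedged sketch of endpoint-conditioned prediction. Sizes are assumptions.
import torch
import torch.nn as nn

class EndpointConditionedPredictor(nn.Module):
    def __init__(self, past_dim=32, lat_dim=16, pred_len=12):
        super().__init__()
        self.to_lat = nn.Linear(past_dim, 2 * lat_dim)      # mu, log-variance
        self.lat_to_end = nn.Linear(past_dim + lat_dim, 2)
        self.to_traj = nn.Linear(past_dim + 2, pred_len * 2)
        self.pred_len = pred_len

    def forward(self, past_enc):
        mu, logvar = self.to_lat(past_enc).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        endpoint = self.lat_to_end(torch.cat([past_enc, z], dim=1))
        waypoints = self.to_traj(torch.cat([past_enc, endpoint], dim=1))
        return endpoint, waypoints.view(-1, self.pred_len, 2)

model = EndpointConditionedPredictor()
past_enc = torch.randn(4, 32)
end, traj = model(past_enc)   # sampling again yields a different endpoint
print(end.shape, traj.shape)  # torch.Size([4, 2]) torch.Size([4, 12, 2])
```
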
  20. Purpose of this survey (section divider, repeating the agenda before the Attention-model section): method characteristics per category (with interaction: Pooling / Attention models; without interaction: Other), datasets and evaluation metrics, and a discussion of the accuracy and predictions of representative models.
  21. Social Attention [A. Vemula+, ICRA, 2018]: trajectory prediction that extends a graph structure in the spatio-temporal direction.
     • Node: the position of each agent. Edge: spatial information between agents, and each agent's own information propagated along the time axis.
     • Attention computed from nodes and edges identifies the agents to attend to: (1) paths that avoid the attended agents can be predicted; (2) computing attention also makes a visual explanation possible.
     [Slide figure: architecture of the EdgeRNN, attention module and NodeRNN.]
     A. Vemula, et al., “Social Attention: Modeling Attention in Human Crowds,” ICRA, 2018.
  22. CIDNN [Y. Xu+, CVPR, 2018]: estimates how much other agents' motion matters with attention and weights the motion features accordingly.
     • A Motion Encoder Module encodes each agent's motion, and a Location Encoder Module encodes each agent's position: (1) the inner product between the target and every other agent is taken, and a softmax weights the other agents' features.
     • The two module outputs are combined to predict the path from the next time step onward (a small dot-product-attention sketch follows below).
     [Slide figure: the CIDNN architecture and qualitative results (history in red, ground truth in blue, predictions).]
     Y. Xu, et al., “Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction,” CVPR, 2018.
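
The dot-product attention step of CIDNN can be sketched as follows (dimensions are illustrative; this is not the authors' code): location embeddings give similarity scores, a softmax converts them to weights, and the weighted sum of the other agents' motion features becomes the interaction feature.

```python
# Hedged sketch of CIDNN-style crowd interaction weighting.
import torch
import torch.nn.functional as F

def crowd_interaction(loc_emb, motion_feat, target_idx):
    """loc_emb: (N, D) location embeddings, motion_feat: (N, M) motion features."""
    scores = loc_emb @ loc_emb[target_idx]      # (N,) similarity to the target
    scores[target_idx] = float('-inf')          # exclude the target itself
    weights = F.softmax(scores, dim=0)          # attention over the other agents
    return weights @ motion_feat                # (M,) weighted interaction feature

loc_emb = torch.randn(5, 16)
motion_feat = torch.randn(5, 32)
print(crowd_interaction(loc_emb, motion_feat, target_idx=0).shape)  # torch.Size([32])
```
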
  23. SR-LSTM [P. Zhang+, CVPR, 2019]: refines the target's future predicted path using the interaction information at the current time step.
     • Two mechanisms inside the States Refinement module enable high-accuracy prediction: (1) Pedestrian-aware attention (PA), which prevents collisions with other agents; (2) a Motion gate (MG), with which the target selects its path from the other agents' motion.
     • The MG selects a path based on the motion of agents that are likely to cause a collision; the PA focuses on the other agents near the target.
     [Slide figures/excerpt: the states-refinement architecture, motion-gate feature patterns, and pedestrian-wise attention visualizations (dominant attention on close neighbours; the second refinement layer strengthens farther neighbours with group behaviour).]
     P. Zhang, et al., “SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction,” CVPR, 2019.
  24. Next [J. Liang+, CVPR, 2019]: proposes a model that predicts the future path and future activity simultaneously.
     • Person Behavior Module: encodes the pedestrian's appearance and skeleton information.
     • Person Interaction Module: encodes the surrounding static environment and object information such as cars (person-scene and person-object relations).
     • Visual Feature Tensor Q: encodes the two features above together with the past trajectory.
     • Trajectory Generator: predicts the future path; Activity Prediction: predicts the activity at the final prediction time.
     [Slide figures: model overview and qualitative comparison (observed trajectory in yellow, ground truth in green, predictions as blue heatmaps), plus ETH/HOTEL error excerpts.]
     J. Liang, et al., “Peeking into the Future: Predicting Future Person Activities and Locations in Videos,” CVPR, 2019.
  25. SoPhie [A. Sadeghian+, CVPR, 2019]: considers static environment information in addition to pedestrian-pedestrian interaction.
     • Physical Attention: estimates attention over the static environment. Social Attention: estimates attention over dynamic agents.
     • The future path is predicted from both attentions and the LSTM encoder outputs (with an LSTM-based GAN module).
     [Slide figures: SoPhie architecture (CNN feature extractor, attention modules, LSTM-based GAN) and a qualitative comparison against ground truth, Social LSTM and Social GAN.]
     A. Sadeghian, et al., “SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints,” CVPR, 2019.
  26. STGAT [Y. Huang+, ICCV, 2019]: propagates interaction along the time direction.
     • Applies a Graph Attention Network (GAT) to account for interaction; a GAT is an attention-based graph convolutional network built on a graph structure: (1) the importance of the relations to the other agents in the whole scene is learned with the attention mechanism.
     • Propagating the GAT features along the time direction captures spatio-temporal interaction, so information about agents that may cause a collision can be derived from past trajectories (a small GAT-layer sketch follows below).
     [Slide figures: the STGAT architecture (M-LSTM/G-LSTM encoders, GAT, decoder) and an illustration of the graph attention layer; pedestrians are graph nodes at every time step and edges represent human-human interactions.]
     Y. Huang, et al., “STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction,” ICCV, 2019.
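
For reference, a single graph-attention layer over the agents of one frame can be sketched as below (a simplified stand-in for the GAT block used here; sizes are assumptions).

```python
# Hedged sketch of one graph-attention layer over N agents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim=32, out_dim=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h):
        # h: (N, in_dim) hidden states of the N agents in the scene
        Wh = self.W(h)                                   # (N, out_dim)
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # (N, N) attention logits
        alpha = F.softmax(e, dim=1)                      # weights over neighbours
        return torch.relu(alpha @ Wh)                    # aggregated features

gat = GraphAttentionLayer()
h = torch.randn(6, 32)       # 6 agents
print(gat(h).shape)          # torch.Size([6, 32])
```
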
  27. Trajectron [B. Ivanovic+, ICCV, 2019]: efficiently models multiple agents with a dynamic graph structure.
     • NHE (Node History Encoder): feeds each node's observed features to an LSTM. NFE (Node Future Encoder): applies a bi-directional LSTM to encode the node's true future trajectory during training. EE (Edge Encoder): computes attention over all agents within a certain range, so (1) the most important edge information is obtained and (2) the edge information changes at every time step.
     • A decoder predicts paths from these features: (1) an internal CVAE produces multimodal paths; (2) a Gaussian Mixture Model refines the predicted paths.
     • Trajectron++ [T. Salzmann+, ECCV, 2020], which adds environment information, has also been proposed.
     [Slide figure: an example scene graph with four nodes and the encoder-decoder architecture (NHE, NFE, EE, CVAE, GMM).]
     T. Salzmann, et al., “Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data,” ECCV, 2020.
     B. Ivanovic, et al., “The Trajectron: Probabilistic Multi-Agent Trajectory Modeling with Dynamic Spatiotemporal Graphs,” ICCV, 2019.
  28. Social-BiGAT [V. Kosaraju+, NeurIPS, 2019]: simply adding a noise vector yields predicted paths with high variance, i.e. existing work does not truly learn a multimodal distribution.
     • Learns a latent representation between predicted paths and noise vectors: a path generated from a noise vector is fed to an LSTM encoder and mapped back so that it matches the original noise vector (Bicycle-GAN style), which makes it possible to generate genuinely multimodal paths.
     [Slide figures: Social-BiGAT architecture (a single generator and two discriminators) and generated trajectories compared with S-GAN-P and SoPhie across four scenes (observations as solid lines, ground truth as dashed lines, samples as contour maps).]
     V. Kosaraju, et al., “Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks,” NeurIPS, 2019.
  29. Social-STGCNN [A. Mohamed+, CVPR, 2020]: models the scene with a spatial-temporal graph.
     • A Graph Convolutional Network (GCN) extracts interaction features: (1) the interaction information is obtained from the adjacency matrix.
     • A Temporal Convolutional Network (TCN) outputs the predictive distribution from the GCN features: (1) an LSTM outputs the predicted path step by step, whereas the TCN outputs the whole predicted path in parallel, which greatly improves inference speed (a small GCN + temporal-convolution sketch follows below).
     [Slide figure: the Social-STGCNN model — the spatio-temporal graph G = (V, A) is passed through ST-GCNNs and TXP-CNNs to predict future trajectories.]
     A. Mohamed, et al., “Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction,” CVPR, 2020.
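
The two ingredients above can be sketched as follows (an illustrative, simplified stand-in, not Social-STGCNN itself): a graph convolution over agents with a distance-based adjacency matrix, followed by a temporal convolution that emits all future steps in parallel.

```python
# Hedged sketch of a graph convolution + parallel temporal convolution.
import torch
import torch.nn as nn

def normalized_adjacency(pos):
    # pos: (N, 2) agent positions; weight = inverse distance, then row-normalize
    d = torch.cdist(pos, pos) + torch.eye(len(pos))
    A = 1.0 / d
    A.fill_diagonal_(1.0)
    return A / A.sum(dim=1, keepdim=True)

class GCNTemporal(nn.Module):
    def __init__(self, in_dim=2, hid=16, pred_len=12, obs_len=8):
        super().__init__()
        self.gcn_w = nn.Linear(in_dim, hid)
        self.tcn = nn.Conv1d(hid, 2 * pred_len, kernel_size=obs_len)

    def forward(self, X, A):
        # X: (T_obs, N, 2) observed positions, A: (N, N) adjacency
        H = torch.relu(A @ self.gcn_w(X))        # graph conv at every time step
        H = H.permute(1, 2, 0)                   # (N, hid, T_obs)
        out = self.tcn(H).squeeze(-1)            # (N, 2 * pred_len) in one shot
        return out.view(len(A), -1, 2)           # (N, pred_len, 2)

pos = torch.randn(5, 2)
X = torch.randn(8, 5, 2)
model = GCNTemporal()
print(model(X, normalized_adjacency(pos)).shape)  # torch.Size([5, 12, 2])
```
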
  30. RSBG (Recursive Social Behavior Graph) [J. Sun+, CVPR, 2020]: models group-based interaction by examining the relations between pedestrians.
     • Groups: pedestrians heading to the same destination, pedestrians far apart but moving in the same direction, etc.; human annotators label group membership to supervise group discrimination.
     • The group interaction is obtained from the Relational Social Representation, concatenated with the surrounding static environment information and the past trajectories, and the path is predicted.
     [Slide figure: pipeline — BiLSTM/CNN features, RSBG generator, GCN, and an LSTM decoder.]
     J. Sun, et al., “Recursive Social Behavior Graph for Trajectory Prediction,” CVPR, 2020.
  31. STAR [C. Yu+, ECCV, 2020]: the LSTMs used in trajectory prediction have two problems — LSTMs struggle to model complex temporal dependencies, and attention-based prediction methods cannot fully model interaction.
     • Extends the Transformer to spatio-temporal attention and applies it to the trajectory prediction task: a Temporal Transformer encodes trajectory features, and a Spatial Transformer extracts the interaction independently at each time step.
     • Using the two Transformers greatly improves prediction accuracy over LSTM-based methods (a small sketch of the two-Transformer idea follows below).
     [Slide figure: STAR's two main components, the Temporal Transformer (per pedestrian) and the Spatial Transformer, with encoder/decoder and graph memory.]
     C. Yu, et al., “Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction,” ECCV, 2020.
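
A minimal sketch of the temporal-then-spatial attention idea, using standard Transformer encoder layers (hyper-parameters are assumptions; STAR's actual blocks differ):

```python
# Hedged sketch: one Transformer attends over each pedestrian's time steps,
# another attends over the pedestrians present at each time step.
import torch
import torch.nn as nn

d_model, T_obs, N = 32, 8, 5
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64),
    num_layers=1)
spatial = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64),
    num_layers=1)

emb = torch.randn(T_obs, N, d_model)   # embedded trajectories, (time, agent, dim)

# Temporal attention: sequence axis = time, batch axis = agents.
h_temporal = temporal(emb)             # (T_obs, N, d_model)

# Spatial attention: sequence axis = agents, batch axis = time steps.
h_spatial = spatial(h_temporal.permute(1, 0, 2)).permute(1, 0, 2)
print(h_spatial.shape)                 # torch.Size([8, 5, 32])
```
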
  32. Purpose of this survey (section divider, repeating the agenda before the non-interaction (Other) section): method characteristics per category (with interaction: Pooling / Attention models; without interaction: Other), datasets and evaluation metrics, and a discussion of the accuracy and predictions of representative models.
  33. Behavior CNN [S. Yi+, ECCV, 2016]: a prediction method using a CNN.
     • Past trajectory information is encoded and placed into a sparse displacement volume; several convolution and max-pooling steps followed by deconvolution output the predicted paths.
     • A location bias map combines scene-specific information channel-wise with the latent feature representation, so pedestrian behaviour that changes with the specific scene is taken into account (a small displacement-volume sketch follows below).
     [Slide figure/excerpt: system flowchart (walking paths → displacement volume → Behavior-CNN → predicted displacement volume → predicted paths) and the layer configuration with a learnable location bias map.]
     S. Yi, et al., “Pedestrian Behavior Understanding and Prediction with Deep Neural Networks,” ECCV, 2016.
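
A small sketch of the displacement-volume encoding plus a conv/deconv mapping (sizes are illustrative, not the paper's configuration):

```python
# Hedged sketch: write past displacements into a sparse spatial volume at each
# pedestrian's current cell, then map it with a small conv/deconv network.
import torch
import torch.nn as nn

def encode_displacement_volume(tracks, H=32, W=32, M=4):
    # tracks: (N, M+1, 2) coordinates over the last M+1 frames, values in [0, 1)
    vol = torch.zeros(2 * M, H, W)
    for track in tracks:
        x, y = int(track[-1, 0] * W), int(track[-1, 1] * H)
        disp = (track[1:] - track[:-1]).flatten()      # 2M past displacements
        vol[:, y, x] = disp
    return vol.unsqueeze(0)                            # (1, 2M, H, W)

net = nn.Sequential(
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 8, 2, stride=2),            # back to input resolution
)

tracks = torch.rand(3, 5, 2)                           # 3 pedestrians, M = 4
print(net(encode_displacement_volume(tracks)).shape)   # torch.Size([1, 8, 32, 32])
```
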
  34. Future localization in first-person videos [T. Yagi+, CVPR, 2018]: localizes the future position of a facing pedestrian in first-person view.
     • Uses cues specific to first-person video: (1) the camera wearer's ego-motion, which affects where the facing pedestrian appears; (2) the scale of the facing pedestrian; (3) the pose of the facing pedestrian.
     • A multi-stream model (location-scale / ego-motion / pose streams) predicts the future position from these three cues.
     [Slide figures/excerpt: problem setting (observe T_prev frames, predict T_future frames), the multi-stream conv/deconv architecture, and qualitative comparisons against Social LSTM.]
     T. Yagi, et al., “Future Person Localization in First-Person Videos,” CVPR, 2018.
  35. OPPU [A. Bhattacharyya+, CVPR, 2018]: predicts the future position of pedestrians seen in on-board camera footage.
     • Inputs: the pedestrian's bounding box, the ego-vehicle's motion, and the on-board camera image; output: the pedestrian's future bounding boxes (a two-stream model that also predicts uncertainty).
     [Slide figure: the two-stream architecture, qualitative point estimates (Kalman filter, one-stream, two-stream) and predictive-distribution heat maps.]
     A. Bhattacharyya, et al., “Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty,” CVPR, 2018.
  36. Fast and Furious [W. Luo+, CVPR, 2018]: proposes a model that jointly infers vehicle detection, tracking, and trajectory prediction.
     • Uses 3D point-cloud data as input: (1) a sparse feature representation in 3D (bird's-eye-view) space; (2) keeps the network's computational cost low; (3) all three tasks can be computed simultaneously in real time.
     [Slide excerpt: paper title and abstract fragment (a single convolutional net operating on LiDAR data).]
     W. Luo, et al., “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net,” CVPR, 2018.
  37. Object Attributes and Semantic Environment [H. Minoura+, VISAPP, 2019]: treats different kinds of moving objects as attributes and predicts the characteristic path of each attribute.
     • Considers the latent characteristics of the different moving objects: (1) pedestrians walk on sidewalks or roads; (2) cars drive on roads.
     • Because the attribute is expressed as a one-hot vector, there is no need to build a separate model per attribute, which keeps the computational cost down.
     • Scene labels are used to predict the characteristic path of each attribute.
     H. Minoura, et al., “Path predictions using object attributes and semantic environment,” VISAPP, 2019.
  38. Rules of the Road [J. Hong+, CVPR, 2019]: trajectory prediction for vehicles on public roads.
     • Proposes a CNN + GRU model that accounts for different scene contexts: (1) the predicted vehicle's position, the other vehicles' positions, the context, and the road information are concatenated channel-wise; (2) the concatenated tensor is encoded with a CNN at each time step; (3) the encoded features are propagated through a GRU along the time direction and the future path is predicted (a small rasterize-encode-propagate sketch follows below).
     [Slide figures: entity and world context representation (example scene, predicted-vehicle position, other vehicles, context, road information) and Gaussian-regression / GMM-CVAE prediction examples with uncertainty ellipses.]
     J. Hong, et al., “Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions,” CVPR, 2019.
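
The rasterize-encode-propagate pattern can be sketched as follows (channel layout and sizes are illustrative assumptions, not the paper's setup):

```python
# Hedged sketch of "rasterize per time step, encode with a CNN, propagate with a GRU".
import torch
import torch.nn as nn

class RasterEncoderGRU(nn.Module):
    def __init__(self, in_ch=4, feat=64, pred_len=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat))
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.head = nn.Linear(feat, pred_len * 2)

    def forward(self, rasters):
        # rasters: (B, T_obs, C, H, W); channels could be the ego mask, other
        # agents, context, and road layers rendered per time step.
        B, T = rasters.shape[:2]
        feats = self.cnn(rasters.flatten(0, 1)).view(B, T, -1)
        _, h = self.gru(feats)                  # propagate along time
        return self.head(h[-1]).view(B, -1, 2)  # future (x, y) offsets

model = RasterEncoderGRU()
print(model(torch.randn(2, 5, 4, 64, 64)).shape)  # torch.Size([2, 10, 2])
```
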
  39. Multiverse [J. Liang+, CVPR, 2020]: proposes Multiverse, which predicts multiple plausible paths.
     • Past semantic labels on a fixed grid are fed to the History Encoder (HE): (1) a convolutional recurrent neural network encodes spatio-temporal features; (2) using semantic labels makes the model robust against domain shift.
     • The HE output and the last observed semantic labels go to the Coarse Location Decoder (CLD): (1) a GAT weights each grid cell and generates a weight map.
     • The Fine Location Decoder (FLD) stores a distance vector in each grid cell and, like the CLD, weights each cell with a GAT; multiple paths are predicted from the FLD and CLD outputs.
     [Slide figure: model overview — semantic segmentation of the input frames, the History Encoder (convolutional RNN), and the coarse/fine location decoders.]
     J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.
  40. Purpose of this survey (section divider, repeating the agenda before the datasets and evaluation section): method characteristics per category (with interaction: Pooling / Attention models; without interaction: Other), datasets and evaluation metrics, and a discussion of the accuracy and predictions of representative models.
  41. The datasets most commonly used for trajectory prediction (a short sketch of the ADE/FDE metrics usually reported on them follows below):
     • ETH Dataset: pedestrians filmed in an urban area; samples: 786; scenes: 2; target type: pedestrian.
     • UCY Dataset: pedestrians filmed in an urban area; samples: 750; scenes: 3; target type: pedestrian.
     • Stanford Drone Dataset: filmed on the Stanford University campus; samples: 10,300; scenes: 8; target types: pedestrian, car, cyclist, bus, skater, cart.
     R. Alexandre, et al., “Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes,” ECCV, 2016.
     A. Lerner, et al., “Crowds by Example,” CGF, 2007.
     S. Pellegrini, et al., “You’ll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking,” ICCV, 2009.
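
The survey's agenda mentions evaluation metrics; the metrics usually reported on these datasets are ADE and FDE (average / final displacement error), which also appear in the tables excerpted earlier in the deck. A minimal sketch, not taken from the slides:

```python
# Minimal sketch of the displacement-error metrics (ADE: average displacement
# error over all predicted steps; FDE: error at the final step).
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T_pred, 2) arrays of predicted / ground-truth positions."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]

pred = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.3]])
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
ade, fde = ade_fde(pred, gt)
print(round(ade, 3), round(fde, 3))  # 0.133 0.3

# With K sampled futures (multimodal prediction), the best-of-K variant keeps
# the sample with the lowest error, e.g. the minimum over k of ade_fde(pred_k, gt).
```
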
  42. Bird's-eye-view datasets, recorded with fixed cameras or drones; because abundant data can be collected this way, these datasets are very large scale.
     • Argoverse Dataset: recorded on public roads; samples: 300K; scenes: 113; target type: car; additional information: lane information, map data, sensor data.
     • inD Dataset: recorded at intersections (drone footage of German intersections); samples: 13K; scenes: 4; target types: pedestrian, car, cyclist.
     • Lyft Level 5 Dataset: recorded on public roads; samples: 3B; scenes: 170,000; target types: pedestrian, car, cyclist; additional information: aerial information, semantic labels.
     • The Forking Paths Dataset: created with a simulator; samples: 0.7K; scenes: 7; target type: pedestrian; additional information: multiple-future path annotations, semantic labels.
     [Slide figures: example outputs and abstract excerpts from the Argoverse, inD, Forking Paths, and Lyft Level 5 papers.]
     J. Houston, et al., “One Thousand and One Hours: Self-driving Motion Prediction Dataset,” CoRR, 2020.
     J. Bock, et al., “The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections,” CoRR, 2019.
     M.F. Chang, et al., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” CVPR, 2019.
     J. Liang, et al., “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction,” CVPR, 2020.
  43. On-board camera datasets: aimed at predicting the paths of road users in front of the ego-vehicle

    [Figure: example annotated frames from each dataset]
    PIE Dataset (public-road footage) • Samples: 1.8K • Scenes: 53 • Target classes: pedestrian • Additional annotations: ego-vehicle information, infrastructure
    Apolloscape Dataset (public-road footage) • Samples: 81K • Scenes: 100,000 • Target classes: pedestrian, car, cyclist
    TITAN Dataset (public-road footage) • Samples: 645K • Scenes: 700 • Target classes: pedestrian, car, cyclist • Additional annotations: action labels, pedestrian age
    Y. Ma, et al., "TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents," AAAI, 2019.
    A. Rasouli, et al., "PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction," ICCV, 2019.
    S. Malla, et al., "TITAN: Future Forecast using Action Priors," CVPR, 2020.
  44. First-person-view datasets: aimed at predicting the paths of pedestrians ahead ● Subjects wear a wearable camera

    First-Person Locomotion Dataset (sidewalk footage) • Samples: 5K • Scenes: 87 • Target classes: pedestrian • Additional annotations: pose information, ego-motion
    T. Yagi, et al., "Future Person Localization in First-Person Videos," CVPR, 2018.
  45. Evaluation metrics: Displacement Error, Negative log-likelihood, Mean Square Error, and Collision rate

    Displacement Error: Euclidean distance between the ground-truth and predicted positions ● Average Displacement Error (ADE): error averaged over all predicted timesteps ● Final Displacement Error (FDE): error at the final predicted timestep
    Negative log-likelihood: expected log-likelihood of the ground truth under the estimated distribution ● Scoring multiple predicted paths with ADE/FDE ignores their multimodality ● NLL is therefore used as the evaluation metric for multi-path prediction (see the sketch below)
    Mean Square Error: computed on the centre coordinates of the predicted and ground-truth bounding boxes ● Used for path prediction in on-board camera footage ● Can also be evaluated with an F-score derived from the bounding-box overlap
    Collision rate: Displacement Error is averaged over all samples, so it cannot show for which predicted paths interaction information is actually effective ● Instead, check whether each prediction collides with surrounding objects: dynamic objects are the other agents in the video, static objects are obstacles such as buildings and trees
    [Figure: conventional evaluation uses a single Displacement Error; the proposed evaluation adds Displacement Error on non-linear paths and collision rates against the two object types]
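To make the Displacement Error and NLL definitions above concrete, here is a minimal numpy/scipy sketch; the array shapes, the use of scipy's gaussian_kde, and the toy numbers are illustrative assumptions rather than the survey's reference implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def ade_fde(pred, gt):
    """ADE/FDE for one agent.
    pred, gt: (T_pred, 2) arrays of future (x, y) positions."""
    err = np.linalg.norm(pred - gt, axis=-1)  # Euclidean error per timestep
    return err.mean(), err[-1]                # ADE = mean, FDE = final step

def kde_nll(samples, gt):
    """Negative log-likelihood of the ground truth under a KDE fitted to
    sampled predictions, averaged over timesteps (one common recipe for
    scoring multimodal predictors).
    samples: (K, T_pred, 2) futures sampled from the model
    gt:      (T_pred, 2) ground-truth future."""
    nll = []
    for t in range(gt.shape[0]):
        kde = gaussian_kde(samples[:, t, :].T)      # 2-D KDE over the K samples at time t
        nll.append(-np.log(kde(gt[t])[0] + 1e-12))  # -log p(ground truth at time t)
    return float(np.mean(nll))

# toy usage with made-up trajectories
pred = np.cumsum(np.full((12, 2), 0.4), axis=0)
gt = np.cumsum(np.full((12, 2), 0.5), axis=0)
print(ade_fde(pred, gt))
samples = pred[None] + np.random.default_rng(0).normal(scale=0.3, size=(20, 12, 2))
print(kde_nll(samples, gt))
```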
  46. Aim of this survey: investigate the trends in path prediction methods that use Deep Learning

    ● Summarise the characteristics of the prediction methods in each category: (1) with interaction (Pooling models, Attention models), (2) without interaction (Other)
    ● Introduce the datasets and evaluation metrics used for quantitative evaluation
    ● Use representative models to discuss the accuracy and prediction results of each model
  47. Evaluation experiments: accuracy is verified with representative models

    Models used for the accuracy comparison (Pooling and Attention models; the columns indicate whether a model uses interaction, Deep Learning, or environment information, and which datasets it is evaluated on):
    | Model         | Interaction | Deep Learning | Environment | Dataset      |
    | LSTM          |             | ✔             |             | ETH/UCY, SDD |
    | RED           |             | ✔             |             | ETH/UCY, SDD |
    | ConstVel      |             |               |             | ETH/UCY, SDD |
    | Social-LSTM   | ✔           | ✔             |             | ETH/UCY, SDD |
    | Social-GAN    | ✔           | ✔             |             | ETH/UCY, SDD |
    | STGAT         | ✔           | ✔             |             | ETH/UCY, SDD |
    | Trajectron    | ✔           | ✔             |             | ETH/UCY, SDD |
    | Env-LSTM      |             | ✔             | ✔           | SDD          |
    | Social-STGCNN | ✔           | ✔             |             | ETH/UCY      |
    | PECNet        | ✔           | ✔             |             | ETH/UCY      |
  48. Experimental conditions

    ● Datasets: ETH/UCY and SDD (pedestrians only)
    ● Training settings: number of epochs, batch size, learning rate, optimiser (Adam)
    ● Fixed observation and prediction horizons (in seconds)
    ● Evaluation metrics: Displacement Error and Collision rate
    ● For methods that sample multiple predicted paths, the best of the sampled paths is used for scoring (see the sketch below)
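The last bullet is the usual best-of-N protocol: N futures are sampled and the one closest to the ground truth is scored. Below is a minimal sketch assuming the model is wrapped as a sampling function (sample_fn is a hypothetical interface, and the toy model is just noisy linear extrapolation).

```python
import numpy as np

def best_of_n_ade_fde(sample_fn, obs, gt, n_samples=20):
    """Draw n_samples futures and report ADE/FDE of the sample with the lowest ADE.
    sample_fn(obs) -> (T_pred, 2) predicted future positions (hypothetical interface)
    obs: (T_obs, 2) observed track, gt: (T_pred, 2) ground-truth future."""
    best_ade, best_fde = np.inf, np.inf
    for _ in range(n_samples):
        pred = sample_fn(obs)
        err = np.linalg.norm(pred - gt, axis=-1)
        if err.mean() < best_ade:                # keep the sample closest to the ground truth
            best_ade, best_fde = err.mean(), err[-1]
    return best_ade, best_fde

# toy "model": linear extrapolation plus Gaussian noise
rng = np.random.default_rng(0)
def noisy_linear(obs, horizon=12):
    v = obs[-1] - obs[-2]                        # last observed velocity
    future = obs[-1] + v * np.arange(1, horizon + 1)[:, None]
    return future + rng.normal(scale=0.2, size=future.shape)

obs = np.stack([np.arange(8) * 0.4, np.zeros(8)], axis=1)       # straight walk along x
gt = obs[-1] + (obs[-1] - obs[-2]) * np.arange(1, 13)[:, None]  # continues straight
print(best_of_n_ade_fde(noisy_linear, obs, gt))
```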
  49. Displacement Error on ETH/UCY (ADE/FDE [m] per scene: CETH, ETH, HOTEL, UCY, ZARA1, ZARA2, and the average)

    [Table: ADE/FDE values for LSTM, RED, ConstVel, Social-LSTM, Social-GAN, PECNet, STGAT, Trajectron, and Social-STGCNN, grouped into Single Model and 20 Outputs]
    Single Model: RED, which does not consider interaction, reduces the error the most
    20 Outputs: STGAT reduces the error the most on ADE, PECNet on FDE
  50. Displacement Error on ETH/UCY (continued)

    [Table: same ADE/FDE [m] values as the previous slide]
    Compared with Pooling models, path prediction methods based on Attention models are more effective
  51. Collision rate and example predictions on ETH/UCY

    [Table: Collision rate [%] against dynamic and static objects for ConstVel, LSTM, PECNet, RED, Social-GAN, Social-LSTM, STGAT, Social-STGCNN, and Trajectron]
    [Qualitative prediction examples; legend: input, ground truth, prediction]
  52. Collision rate and example predictions on ETH/UCY (continued)

    [Table and qualitative examples as on the previous slide; paths judged as colliding / not colliding by the dynamic-object collision rate are highlighted]
    A method with a low prediction error ≠ a method with a low collision rate
    - When a prediction does not resemble the ground truth, the displacement error increases, yet the collision rate can still decrease (a sketch of the collision-rate computation follows)
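As a rough illustration of the dynamic-object collision rate used above, the sketch below counts a predicted trajectory as colliding when it comes within a distance threshold of any neighbouring agent at the same timestep; the 0.1 m threshold and the array layout are assumptions for illustration, not the survey's exact procedure.

```python
import numpy as np

def dynamic_collision_rate(preds, neighbours, threshold=0.1):
    """Fraction of predicted trajectories that pass within `threshold` metres
    of another agent at the same timestep.
    preds:      (N, T, 2) predicted futures for N evaluated agents
    neighbours: (N, M, T, 2) ground-truth futures of M surrounding agents per sample."""
    collided = 0
    for pred, neigh in zip(preds, neighbours):
        dist = np.linalg.norm(neigh - pred[None], axis=-1)  # (M, T) distances
        if (dist < threshold).any():
            collided += 1
    return collided / len(preds)

# toy usage: agent 0 stands next to its neighbour, agent 1 walks well away from it
preds = np.zeros((2, 12, 2))
preds[1, :, 0] = np.linspace(1.0, 3.0, 12)
neighbours = np.full((2, 1, 12, 2), 0.05)         # one neighbour near the origin
print(dynamic_collision_rate(preds, neighbours))  # -> 0.5
```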
  53. Displacement Error on SDD (ADE/FDE [pixel] per scene: bookstore, coupa, deathCircle, gates, hyang, little, nexus, quad, and the average)

    [Table: ADE/FDE values for LSTM, RED, ConstVel, Social-LSTM, Env-LSTM, Social-GAN, STGAT, and Trajectron, grouped into Single Model and 20 Outputs]
    Single Model: ConstVel, which does not use Deep Learning, reduces the error the most
    20 Outputs: Trajectron reduces the error the most on ADE, STGAT on FDE
  54. What makes ConstVel's prediction error so low on SDD? The results depend on where the footage is captured

    ● ETH/UCY is filmed from a low altitude → pedestrian trajectories appear sensitive (fine-grained motion is visible)
    ● SDD is filmed from a high altitude → pedestrian trajectories appear insensitive
    Pedestrian motion in SDD therefore looks nearly linear, which lowers the prediction error of ConstVel, a linear extrapolation model (see the sketch below)
    [Figure: example views from ETH/UCY and SDD]
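ConstVel simply continues the most recent observed velocity over the prediction horizon; a minimal sketch follows (whether the last-step velocity or an average over the observation window is used is an implementation choice, assumed last-step here).

```python
import numpy as np

def constant_velocity_predict(obs, horizon=12):
    """Linear extrapolation of the last observed velocity.
    obs: (T_obs, 2) observed positions; returns (horizon, 2) future positions."""
    v = obs[-1] - obs[-2]                        # last observed per-step displacement
    steps = np.arange(1, horizon + 1)[:, None]   # 1, 2, ..., horizon
    return obs[-1] + steps * v

obs = np.array([[0.0, 0.0], [0.4, 0.1], [0.8, 0.2], [1.2, 0.3]])
print(constant_velocity_predict(obs, horizon=4))
```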
  55. Collision rate and example predictions on SDD

    [Table: Collision rate [%] against dynamic and static objects for ConstVel, LSTM, RED, Social-GAN, Social-LSTM, STGAT, Trajectron, and Env-LSTM]
    [Qualitative prediction examples; legend: input, ground truth, prediction]
  56. Collision rate and example predictions on SDD (continued)

    [Table and qualitative examples as on the previous slide; legend: input, ground truth, prediction]
    Introducing environment information makes it possible to predict paths that avoid contact with obstacles (see the sketch below)
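One simple way to realise the static-object check behind this observation is to rasterise obstacles into a binary occupancy grid and test every predicted position against it; the grid resolution, origin convention, and toy map below are illustrative assumptions.

```python
import numpy as np

def static_collision_rate(preds, occupancy, resolution=0.1, origin=(0.0, 0.0)):
    """Fraction of predicted trajectories that enter an occupied cell.
    preds:      (N, T, 2) predicted futures in metres
    occupancy:  (H, W) binary grid, 1 = obstacle (buildings, trees, ...)
    resolution: metres per cell; origin: world coordinates of cell (0, 0)."""
    H, W = occupancy.shape
    collided = 0
    for pred in preds:
        cells = np.floor((pred - np.asarray(origin)) / resolution).astype(int)
        rows, cols = cells[:, 1], cells[:, 0]           # y -> row, x -> column
        inside = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
        if occupancy[rows[inside], cols[inside]].any():
            collided += 1
    return collided / len(preds)

# toy usage: 10 m x 10 m map with an obstacle block covering x, y in [4 m, 6 m]
grid = np.zeros((100, 100), dtype=np.uint8)
grid[40:60, 40:60] = 1
straight = np.stack([np.linspace(0, 9, 12), np.full(12, 5.0)], axis=1)  # cuts through the block
detour = np.stack([np.linspace(0, 9, 12), np.full(12, 1.0)], axis=1)    # stays clear of it
print(static_collision_rate(np.stack([straight, detour]), grid))        # -> 0.5
```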
  57. What comes next for path prediction?

    Datasets keep growing larger as Deep Learning develops
    Rethinking the evaluation metrics for path prediction ● Good prediction accuracy ≠ the best prediction method ● The community as a whole needs to reconsider the metrics
    More approaches that predict multiple paths ● Led by Multiverse, will methods that emphasise multiple future paths keep increasing?
    [Timeline chart of the methods from 2016 to 2020, as on the earlier classification slide, annotated: Multimodal paths + (interaction) ・・・]
  58. Summary: a survey of the trends in path prediction methods using Deep Learning

    ● Surveyed the characteristics of the prediction methods in each category: (1) with interaction (Pooling models, Attention models), (2) without interaction (Other)
    ● Introduced the datasets and evaluation metrics for quantitative evaluation - Large-scale datasets are increasing with the development of Deep Learning
    ● Used representative models to discuss the accuracy and prediction results of each model - Attention models achieve lower prediction error than Pooling models - The model with the lowest collision rate ≠ the model with the lowest prediction error - On sensitive datasets, prediction methods based on Deep Learning are effective