PRML2023 S9-5 EriKuroda

Eri KURODA
August 06, 2023

This is the presentation material for PRML 2023 session 9-5 on August 6, 2023.
Title: Extraction of Motion Change Points based on the Physical Characteristics of Objects
Speaker: Eri Kuroda (Ochanomizu University, Japan Society for the Promotion of Science)

Transcript

  1. Extraction of Motion Change Points based on the Physical Characteristics of Objects

    Eri Kuroda (1, 2), Ichiro Kobayashi (1)
    1: Ochanomizu University
    2: Japan Society for the Promotion of Science
    PRML2023 Material S9-5 AU0004
  2. Background and Purpose

    Real-world cognition of humans
    • World model
      Ø Learns a model of what happens after events in the real world
      Ø Models the observed environment in the human brain
      Ø Learns how the world works, along with background knowledge, from a few interactions and observations
    • Recognition
      Ø Uses models that represent observations in the brain to understand the existence and physical properties of objects

    BUT…
    • Machine learning for real-world recognition
      Ø The input (observation) is an image, which is equivalent to human vision
      Ø Predictions of image features are treated as real-world predictions
    • ML does not make predictions based on the physical properties of objects or on physical laws, as humans do

    Purpose
    • Recognize "what" objects are seen and "what kind of motion" they make, as humans do
    • Propose a real-world recognition method that takes into account the relationships between objects (e.g., positions and physical properties)
  3. Motivation

    Variational Temporal Abstraction (VTA) [Kim+, 2019]
    • Extracts the latent structure of the environment from visual information and extracts the timing of environmental changes
    • Focuses on pixel changes and does not take into account the physical motion characteristics of objects

    Proposal
    • Graph-based representation of relationships between objects
    • Extraction of environmental change points based on graph changes
  4. Overview

    Conventional method
    • 3D maze
    • Image features only; does not understand the real world
    • Change point extraction model: VTA
  5. Overview

    Conventional method
    • 3D maze
    • Image features only; does not understand the real world
    • Change point extraction model: VTA

    Proposed method
    • CLEVRER
    • Graph structure
    • Object detection, speed, acceleration, image features, etc.
    • Flag extraction of change points
    • Change point extraction model: VTA
  6. Variational Temporal Abstraction [Kim+, 19]

    • Problem: it is difficult to decide when to transition 𝑍
    • Human: easy ↔ Model: difficult
    • Figure labels: observation (input), observation abstraction, temporal abstraction
  7. Variational Temporal Abstraction [Kim+, 19]

    • Introduced flags
    • The flag 𝑚 (0 or 1) is determined by the magnitude of the change in the latent state compared with the previous observation
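The flag idea on this slide can be illustrated with a minimal sketch. The slide does not state the decision rule, so a fixed threshold on the Euclidean distance between consecutive latent states is assumed here purely for illustration. The names `change_flags` and `threshold` are hypothetical, and setting the first flag to 0 is also an assumption.

```python
import math


def change_flags(latents, threshold):
    """Return one 0/1 flag per time step.

    A step is flagged 1 when its latent state differs from the previous
    one by more than `threshold` (Euclidean distance). The fixed threshold
    is an assumption made for this sketch, not the rule used by VTA.
    """
    flags = [0]  # assumption: the first step has no predecessor, so flag 0
    for prev, cur in zip(latents, latents[1:]):
        flags.append(1 if math.dist(prev, cur) > threshold else 0)
    return flags


if __name__ == "__main__":
    # Toy latent states: a large jump occurs between steps 1 and 2.
    toy = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (2.0, 0.1)]
    print(change_flags(toy, threshold=0.5))
```

With the toy sequence above, only the step following the large jump is flagged.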
  8. Method: Process of change point extraction

    Training data
    • Object recognition: YOLOv3, YOLACT
    • Object position
    • Velocity, acceleration
    • Position direction flags between objects
    • Graph structure
    • Embedding vector: node2vec, graph2vec
    • Combination
    VTA mechanism
    • Change-point extraction
  9. Dataset: CLEVRER [Yi+, 2020]

    • CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    | Item | Value |
    |---|---|
    | Number of videos | 20,000 (train : val : test = 2 : 1 : 1) |
    | Video length | 5 sec |
    | Number of frames | 128 |
    | Shape | cube, sphere, cylinder |
    | Material | metal, rubber |
    | Color | gray, red, blue, green, brown, cyan, purple, yellow |
    | Event | appear, disappear, collide |
    | Annotation | object id, position, speed, acceleration |
  10. Training data

    • Dataset created from physical characteristics of the environment
    Figure: pipeline diagram with the stages object recognition (YOLOv3, YOLACT), object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector (node2vec, graph2vec), combination, VTA mechanism, and change-point extraction.
  12. Object recognition

    YOLOv3 [Redmon+, 18]
    • Recognizes objects in the image by shape only
      Ø objects' position
      Ø shape
    • Familiar examples
      Ø face recognition
      Ø autonomous driving

    YOLACT [Bolya+, 19]
    • Recognizes objects in the image by shape, color (and material)
      Ø objects' position
      Ø shape
      Ø color
      Ø material

    Figure labels: YOLOv3; {shape, color}; {shape, color, material}; YOLACT
  13. Training data

    • Dataset created from physical characteristics of the environment
    Figure: pipeline diagram with the stages object recognition (YOLOv3, YOLACT), object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector (node2vec, graph2vec), combination, VTA mechanism, and change-point extraction.
  14. Training data: Velocity and acceleration

    Calculate location information
    • Calculate the coordinates of the object center $c$ from the acquired bounding box coordinates $(x_1, y_1)$ and $(x_2, y_2)$:
      $c = (x, y) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$

    Velocity
      $v_{x_n} = (x_n - x_{n-1}) / et_{frame}$
      $v_{y_n} = (y_n - y_{n-1}) / et_{frame}$

    Acceleration
      $a_{x_n} = (v_{x_n} - v_{x_{n-1}}) / (et_{frame} \times t)$
      $a_{y_n} = (v_{y_n} - v_{y_{n-1}}) / (et_{frame} \times t)$

    ※ $et_{frame} = 5/128$: time elapsed between frames
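The center, velocity, and acceleration calculations on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, velocities are taken between consecutive frames, and the factor `t` in the acceleration formula, which the slide does not define, is assumed to be a frame count that defaults to 1.

```python
ET_FRAME = 5 / 128  # seconds between frames: 5-second video, 128 frames


def bbox_center(x1, y1, x2, y2):
    """Center of a bounding box given two opposite corner coordinates."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def velocities(centers, et_frame=ET_FRAME):
    """Per-frame velocity (vx, vy) from consecutive object centers."""
    return [
        ((x1 - x0) / et_frame, (y1 - y0) / et_frame)
        for (x0, y0), (x1, y1) in zip(centers, centers[1:])
    ]


def accelerations(vels, et_frame=ET_FRAME, t=1):
    """Acceleration (ax, ay) from consecutive velocities.

    `t` is assumed to be the number of frames spanned; the slide only
    shows it as a factor in the denominator.
    """
    return [
        ((vx1 - vx0) / (et_frame * t), (vy1 - vy0) / (et_frame * t))
        for (vx0, vy0), (vx1, vy1) in zip(vels, vels[1:])
    ]


if __name__ == "__main__":
    # Hypothetical bounding boxes for one object over three frames.
    boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (3, 0, 5, 2)]
    centers = [bbox_center(*b) for b in boxes]
    vels = velocities(centers)
    print(centers, vels, accelerations(vels))
```

Each derived list is one element shorter than its input, because a difference needs two consecutive samples.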
  15. Training data: graph structure

    Position direction flags between objects
    • main object $= (x_{main}, y_{main})$
    • others $= (x_{other}, y_{other})$
    • $x_{diff} = x_{other} - x_{main}$
    • $y_{diff} = y_{other} - y_{main}$
    • The sign combination (+ or −) of $x_{diff}$ and $y_{diff}$ selects one of four flags: "5", "1", "-1", "-5"
    Figure: x and y axes centered on the main object, with the surrounding regions labeled flag "5", flag "1", flag "-1", and flag "-5".

    • Node information
      Ø shape, color, material
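A minimal sketch of the position direction flag between a main object and another object is shown below. The transcript does not preserve which sign combination maps to which flag, so the mapping in `SIGN_TO_FLAG` is an assumption chosen for illustration, as is treating a zero difference as positive. The function name `direction_flag` is hypothetical.

```python
# ASSUMED mapping from (x_diff >= 0, y_diff >= 0) to a flag value.
# The slide lists the flags 5, 1, -1, -5 but the exact assignment
# to sign combinations is not recoverable from the transcript.
SIGN_TO_FLAG = {
    (True, True): 5,
    (True, False): 1,
    (False, True): -1,
    (False, False): -5,
}


def direction_flag(main, other):
    """Flag describing where `other` lies relative to `main`.

    Both arguments are (x, y) centers. A zero difference is treated
    as positive, which is an assumption of this sketch.
    """
    x_diff = other[0] - main[0]
    y_diff = other[1] - main[1]
    return SIGN_TO_FLAG[(x_diff >= 0, y_diff >= 0)]


if __name__ == "__main__":
    main_obj = (10.0, 10.0)
    for other_obj in [(12.0, 15.0), (12.0, 5.0), (3.0, 15.0), (3.0, 5.0)]:
        print(other_obj, direction_flag(main_obj, other_obj))
```

Such a flag could serve as an edge attribute in the graph, while shape, color, and material are node attributes as listed on the slide.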
  16. Training data: embedding vector

    • node2vec [Grover+, 2016]
      Ø inspired by word2vec's Skip-gram
    • graph2vec [Narayanan+, 2017]
      Ø inspired by doc2vec's PV-DBOW

    Example of embedding vectors:
    [[0.54, 0.29, 0.61, …], [0.82, 0.91, 0.15, …], …, [0.14, 0.35, 0.69, …]]
  17. Experiment: Process of change point extraction

    Figure: pipeline diagram with the stages object recognition (YOLOv3, YOLACT), object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector (node2vec, graph2vec), combination, VTA mechanism, and change-point extraction.
  18. Experiment: Accuracy calculation method

    • Examine the accuracy (%) of the flag timing against the collision information in the annotations

    Example
    • Collision: frame 19 in the annotation, frame 21 by eye
    • The correct-answer range was set to frames 19-21
    • Flags at frames 18, 19, 20, 22 → accuracy: 2/4 × 100 = 50 (%)
    Figure: frames 19, 20, and 21
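The accuracy calculation in the example above can be sketched as follows. The function name `flag_accuracy` is hypothetical, and returning `None` when no flag is raised is an assumption that mirrors the "-" (not flagged) entries in the result tables.

```python
def flag_accuracy(flag_frames, range_start, range_end):
    """Percentage of flagged frames that fall inside the correct range.

    `range_start` and `range_end` are inclusive frame numbers.
    Returns None when nothing was flagged (assumption of this sketch).
    """
    if not flag_frames:
        return None
    hits = sum(1 for f in flag_frames if range_start <= f <= range_end)
    return hits / len(flag_frames) * 100


if __name__ == "__main__":
    # Slide example: correct range 19-21, flags at frames 18, 19, 20, 22.
    print(flag_accuracy([18, 19, 20, 22], 19, 21))  # 2 of 4 flags are in range
```

Note that this measure is a precision-style ratio: it divides by the number of flags raised, not by the number of collisions.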
  19. Experiment: Settings

    • Number of training data: 600,000
    • Number of training iterations: 500,000
    • Batch size: 100
    • Output: 80
    • Optimization: Adam
    • Error function: KL divergence
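  20. Result: node2vec

    ※ Accuracy is shown in %; "-" means not flagged.
    Dataset feature columns on the slide: shape, color, material, graph, velocity, acceleration, flag, image. Accuracy columns: i to vi.

    | Recognition | No. | Dataset features (✔) | i | ii | iii | iv | v | vi |
    |---|---|---|---|---|---|---|---|---|
    | YOLOv3 | ① | ✔ ✔ | 50 | 100 | - | - | - | - |
    | YOLOv3 | ② | ✔ ✔ ✔ | 14.3 | 25 | 9.1 | 37.5 | 14.3 | 28.6 |
    | YOLACT | ③ | ✔ ✔ ✔ | 50 | 0 | 50 | 25 | - | - |
    | YOLACT | ④ | ✔ ✔ ✔ ✔ | 22.2 | 22.2 | 20 | 22.2 | 10 | 10 |
    | YOLACT | ⑤ | ✔ ✔ ✔ ✔ ✔ ✔ | 100 | 50 | 25 | 33.3 | 25 | 50 |
    | YOLACT | ⑥ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 0 | - | 11.1 | 10 | 0 | - |
    | YOLACT | ⑦ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 75 | 50 | 33.3 | 50 | 40 | 50 |
    | YOLACT | ⑧ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 0 | - | 10 | 11.1 | - | 10 |
    | annotation | ⑨ | ✔ ✔ ✔ ✔ ✔ ✔ | 20 | 100 | 20 | 100 | 50 | 33.3 |
    | annotation | ⑩ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 22.2 | 22.2 | 20 | 50 | 12.5 | 25 |
    | annotation | ⑪ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 100 | 100 | 33.3 | 66.7 | 25 | 100 |
    | annotation | ⑫ | ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ | 11.1 | 22.2 | 10 | 37.5 | 11.1 | 43.9 |
    | VTA | ⑲ | ✔ | - | - | - | - | - | - |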
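  27. Result: graph2vec

    ※ Accuracy is shown in %; "-" means not flagged.
    Dataset feature columns on the slide: shape, color, material, graph, velocity, acceleration, flag, image. Accuracy columns: i to vi.

    | Recognition | No. | Dataset features (✔) | i | ii | iii | iv | v | vi |
    |---|---|---|---|---|---|---|---|---|
    | YOLOv3 | ⑬ | ✔ ✔ | - | - | - | - | - | - |
    | YOLOv3 | ⑭ | ✔ ✔ ✔ | 0 | 20 | 0 | 33.3 | 20 | 0 |
    | YOLACT | ⑮ | ✔ ✔ ✔ ✔ | - | - | - | - | - | - |
    | YOLACT | ⑯ | ✔ ✔ ✔ ✔ ✔ | 0 | 25 | 10 | 20 | 20 | 0 |
    | annotation | ⑰ | ✔ ✔ ✔ ✔ | - | - | - | - | - | - |
    | annotation | ⑱ | ✔ ✔ ✔ ✔ ✔ | 0 | 20 | 20 | 50 | 0 | 0 |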
  28. Conclusion

    • Research focused on real-world recognition, including world models
      Ø VTA
    • Training data
      Ø Conventional: image features only
      Ø Proposed: graphs representing object relationships
    • Focus on individual objects, not just visual information about the environment
    • Recognize the real world in detail