Slide 1

Slide 1 text

Extraction of Motion Change Points based on the Physical Characteristics of Objects
Eri Kuroda (1, 2), Ichiro Kobayashi (1)
1: Ochanomizu University
2: Japan Society for the Promotion of Science
PRML2023 Material S9-5 AU0004

Slide 2

Slide 2 text

Background・Purpose

Real World Cognition of Humans
• World Model
  Ø Learns models of what happens after events in the real world
  Ø Models the observed environment in the human brain
  Ø Learns how the world works, and background knowledge, from a few interactions and observations
• Recognition
  Ø Uses models that represent observations in the brain to understand the existence and physical properties of objects

BUT…
• Machine learning for real-world recognition
  Ø Input (observation) is an image → equivalent to human vision
  Ø Predictions of image features are treated as real-world predictions
• ML does not make predictions based on the physical properties of objects or on physical laws, as humans do

Purpose
• Recognize "what" objects are seen and "what kind of motion" they make, as humans do
• Propose a real-world recognition method that takes into account the relationships between objects (e.g., positions and physical properties)

Slide 3

Slide 3 text

Motivation

Variational Temporal Abstraction (VTA) [Kim+, 2019]
• Extracts the latent structure of the environment from visual information, together with the timing of environmental changes
• Focuses on pixel changes and does not take into account the physical motion characteristics of objects

Proposal
• Graph-based representation of relationships between objects
• Extraction of environmental change points based on graph changes

Slide 4

Slide 4 text

Overview

Conventional Methods
• Data: 3D maze
• Image features only; does not understand the real world
• Change Point Extraction Model: VTA

Slide 5

Slide 5 text

Overview

Conventional Methods
• Data: 3D maze
• Image features only; does not understand the real world

Proposed Method
• Data: CLEVRER
• Graph structure: object detection, speed, acceleration, image features, etc.
• Flag extraction of change points

Change Point Extraction Model: VTA

Slide 6

Slide 6 text

Variational Temporal Abstraction [Kim+, 19]
• Observation (input) → observation abstraction → temporal abstraction
• Problem: it is difficult to decide when to transition 𝑍
• Human: easy ↔ Model: difficult

Slide 7

Slide 7 text

Variational Temporal Abstraction [Kim+, 19]
• Introduced flags
  Ø The flag 𝑚 (0 or 1) is determined by the magnitude of the change in the latent state compared to the previous observation
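The idea of flagging on the magnitude of latent change can be illustrated with a deliberately simplified sketch. This is not VTA's learned mechanism: it replaces it with a fixed Euclidean-distance threshold, and the function name `change_flags`, the `threshold` parameter, and the choice to give the first observation a flag of 0 are all assumptions made for illustration.

```python
import math


def change_flags(latents, threshold):
    """Return one 0/1 flag per latent vector.

    Simplified stand-in (assumption): a flag is 1 when the Euclidean
    distance to the previous latent vector exceeds `threshold`.
    The first observation has no predecessor and is given flag 0.
    """
    flags = [0] if latents else []
    for prev, curr in zip(latents, latents[1:]):
        # math.dist computes the Euclidean distance between two points
        flags.append(1 if math.dist(prev, curr) > threshold else 0)
    return flags


# Toy latent sequence: a large jump occurs at the third observation
example = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.0, 0.1]]
print(change_flags(example, threshold=1.0))
```

In this toy sequence only the third observation moves by more than the threshold, so it is the only one flagged.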

Slide 8

Slide 8 text

Method: Process of change point extraction
• object recognition (YOLO v3, YOLACT) → object position → velocity, acceleration
• position direction flags between objects → graph structure → embedding vector (node2vec, graph2vec)
• combination → training data → VTA mechanism → change-point extraction

Slide 9

Slide 9 text

Dataset: CLEVRER [Yi+, 2020]
• CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Number of videos    20,000 (train:val:test = 2:1:1)
Video length        5 sec
Number of frames    128 frames
Shape               cube, sphere, cylinder
Material            metal, rubber
Color               gray, red, blue, green, brown, cyan, purple, yellow
Event               appear, disappear, collide
Annotation          object id, position, speed, acceleration

Slide 10

Slide 10 text

Training data
• Dataset created from physical characteristics of the environment
(Pipeline diagram: object recognition (YOLO v3, YOLACT) → object position → velocity, acceleration; position direction flags between objects → graph structure → embedding vector (node2vec, graph2vec); combination → training data → VTA mechanism → change-point extraction)

Slide 11

Slide 11 text

Training data
• Dataset created from physical characteristics of the environment
(Pipeline diagram: object recognition (YOLO v3, YOLACT) → object position → velocity, acceleration; position direction flags between objects → graph structure → embedding vector (node2vec, graph2vec); combination → training data → VTA mechanism → change-point extraction)

Slide 12

Slide 12 text

Object recognition

YOLOv3 [Redmon+, 18]
• Recognizes objects in the image by shape only
  Ø objects' position
  Ø shape
• Familiar examples
  Ø face recognition
  Ø automatic driving

YOLACT [Bolya+, 19]
• Recognizes objects in the image by shape, color (, material)
  Ø objects' position
  Ø shape
  Ø color
  Ø material

(Figure labels: YOLOv3 {shape, color}; YOLACT {shape, color, material})

Slide 13

Slide 13 text

Training data
• Dataset created from physical characteristics of the environment
(Pipeline diagram: object recognition (YOLO v3, YOLACT) → object position → velocity, acceleration; position direction flags between objects → graph structure → embedding vector (node2vec, graph2vec); combination → training data → VTA mechanism → change-point extraction)

Slide 14

Slide 14 text

Training data: Velocity・Acceleration

Calculate location information
• Calculate the coordinates of the object center c from the acquired bounding box coordinates (x_1, y_1) and (x_2, y_2)
  c = (x, y) = ((x_1 + x_2)/2, (y_1 + y_2)/2)

velocity
  v_{x_t} = (x_t - x_{t-1}) / et_{frame}
  v_{y_t} = (y_t - y_{t-1}) / et_{frame}

acceleration
  a_{x_t} = (v_{x_t} - v_{x_0}) / (et_{frame} \times t)
  a_{y_t} = (v_{y_t} - v_{y_0}) / (et_{frame} \times t)

※ et_{frame} = 5/128: time elapsed between frames
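The center, velocity, and acceleration calculations on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the acceleration function assumes the reference velocity is the velocity at frame 0 and that t counts elapsed frames.

```python
FRAME_TIME = 5 / 128  # et_frame: seconds between frames (5 s video, 128 frames)


def box_center(x1, y1, x2, y2):
    """Center of a bounding box given two opposite corners."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def velocity(curr, prev, frame_time=FRAME_TIME):
    """Per-axis velocity from the centers at frames t and t-1."""
    return ((curr[0] - prev[0]) / frame_time,
            (curr[1] - prev[1]) / frame_time)


def acceleration(v_t, v_0, t, frame_time=FRAME_TIME):
    """Per-axis acceleration over t elapsed frames.

    Assumption: v_0 is the velocity at frame 0 and t >= 1.
    """
    elapsed = frame_time * t
    return ((v_t[0] - v_0[0]) / elapsed,
            (v_t[1] - v_0[1]) / elapsed)


# Usage: an object whose center moves one pixel to the right in one frame
c_prev = box_center(0, 0, 4, 2)   # (2.0, 1.0)
c_curr = box_center(1, 0, 5, 2)   # (3.0, 1.0)
v = velocity(c_curr, c_prev)
a = acceleration(v, (0.0, 0.0), t=2)
print(c_prev, c_curr, v, a)
```

Velocities come out in pixels per second because positions are bounding-box pixel coordinates and the frame interval is in seconds.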

Slide 15

Slide 15 text

Training data: graph structure

Position direction flags between objects
• main object = (x_main, y_main), others = (x_other, y_other)
  x_diff = x_other - x_main
  y_diff = y_other - y_main
• The signs (+ / −) of x_diff and y_diff select a direction flag: "5", "1", "-1", or "-5"
(Figure: x and y axes around the main object, with the other objects labeled flag "5", flag "-5", flag "1", and flag "-1")

Node information
Ø shape, color, material
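A graph with direction flags on its edges could be built along the following lines. The sketch makes two assumptions that the slide text does not settle: the sign of x_diff maps to flags 5 / -5 and the sign of y_diff to flags 1 / -1, and a difference of exactly zero yields 0. The function names and the dictionary layout are hypothetical.

```python
def direction_flags(main, other):
    """Return (x_flag, y_flag) for `other` relative to `main`.

    Assumed mapping: x_diff > 0 -> 5, x_diff < 0 -> -5,
                     y_diff > 0 -> 1, y_diff < 0 -> -1, zero -> 0.
    """
    x_diff = other[0] - main[0]
    y_diff = other[1] - main[1]
    x_flag = 5 if x_diff > 0 else -5 if x_diff < 0 else 0
    y_flag = 1 if y_diff > 0 else -1 if y_diff < 0 else 0
    return x_flag, y_flag


def build_graph(objects):
    """Build node attributes and directed, flag-labeled edges.

    `objects` maps an object id to a dict with the keys
    "shape", "color", "material", and "pos" (an (x, y) center).
    """
    nodes = {
        oid: {key: obj[key] for key in ("shape", "color", "material")}
        for oid, obj in objects.items()
    }
    edges = {}
    for main_id, main_obj in objects.items():
        for other_id, other_obj in objects.items():
            if main_id != other_id:
                # Edge from the main object to every other object
                edges[(main_id, other_id)] = direction_flags(
                    main_obj["pos"], other_obj["pos"])
    return nodes, edges


objects = {
    "a": {"shape": "cube", "color": "red", "material": "metal", "pos": (0, 0)},
    "b": {"shape": "sphere", "color": "blue", "material": "rubber", "pos": (3, -2)},
}
nodes, edges = build_graph(objects)
print(edges)
```

Each object takes a turn as the main object, so every pair of objects gets two edges with opposite flags.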

Slide 16

Slide 16 text

Training data: embedding vector
• node2vec [Grover+, 2016]
  Ø inspired by word2vec's Skip-gram
• graph2vec [Narayanan+, 2017]
  Ø inspired by doc2vec's PV-DBOW

Example of embedding vector
[[0.54, 0.29, 0.61, …], [0.82, 0.91, 0.15, …], …, [0.14, 0.35, 0.69, …]]
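node2vec feeds random walks over the graph into a Skip-gram model, treating each walk like a sentence. The walk-generation step can be sketched with the standard library alone, under these assumptions: the graph is an undirected adjacency dictionary, and the function name and toy graph are hypothetical. The Skip-gram training step is not shown. The parameters p and q are node2vec's return and in-out parameters.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased random walk of up to `length` nodes.

    `adj` maps each node to the set of its neighbors.
    Unnormalized transition weights from current node v, given previous node t:
      1/p if the candidate is t, 1 if it neighbors t, 1/q otherwise.
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for cand in neighbors:
            if cand == prev:
                weights.append(1.0 / p)
            elif cand in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy undirected graph: a triangle (0, 1, 2) with a tail node 3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
walk = node2vec_walk(adj, start=0, length=6, p=0.5, q=2.0)
print(walk)
```

A small p makes the walk more likely to step back to the previous node, while a large q keeps it near that node's neighborhood.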

Slide 17

Slide 17 text

Experiment: Process of change point extraction
• object recognition (YOLO v3, YOLACT) → object position → velocity, acceleration
• position direction flags between objects → graph structure → embedding vector (node2vec, graph2vec)
• combination → training data → VTA mechanism → change-point extraction

Slide 18

Slide 18 text

Experiment : Accuracy Calculation Method
• Examine the accuracy (%) of the flag timing against the annotated collision information

Example
• Collision annotation → frame 19; by eye → frame 21
• The correct answer range was set to frames 19-21
• Flags: 18, 19, 20, 22 → accuracy: 2/4 × 100 = 50 (%)

(Figure: frame 19, frame 20, frame 21)
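The accuracy calculation in the example can be written as a short function. The function name is hypothetical, and returning None when no flag is raised is an assumption, chosen to mirror the "-" entries in the result tables.

```python
def flag_accuracy(flag_frames, first_correct, last_correct):
    """Percentage of flagged frames that fall inside the correct range.

    Returns None when no flag was raised (assumption: corresponds to "-").
    """
    if not flag_frames:
        return None
    hits = sum(first_correct <= frame <= last_correct for frame in flag_frames)
    return 100 * hits / len(flag_frames)


# Example from the slide: correct range is frames 19-21
print(flag_accuracy([18, 19, 20, 22], 19, 21))  # 50.0
```

Because the denominator is the number of flags raised rather than the number of collisions, this measure behaves like precision: it penalizes extra flags but not missed ones.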

Slide 19

Slide 19 text

Experiment : settings
• Number of training data : 600,000
• Number of training iterations : 500,000
• Batch size : 100
• Output : 80
• Optimization : Adam
• Error function : KL divergence

Slide 20

Slide 20 text

Result : node2vec
※ Accuracy is shown in %; "-" means no flag was raised.
Dataset columns (✔ = used): shape, color, material, graph, velocity, acceleration, flag, image

Recognition  No.  Dataset (✔)        i      ii     iii    iv     v      vi
YOLO v3      ①    ✔ ✔                50     100    -      -      -      -
YOLO v3      ②    ✔ ✔ ✔              14.3   25     9.1    37.5   14.3   28.6
YOLACT       ③    ✔ ✔ ✔              50     0      50     25     -      -
YOLACT       ④    ✔ ✔ ✔ ✔            22.2   22.2   20     22.2   10     10
YOLACT       ⑤    ✔ ✔ ✔ ✔ ✔ ✔        100    50     25     33.3   25     50
YOLACT       ⑥    ✔ ✔ ✔ ✔ ✔ ✔ ✔      0      -      11.1   10     0      -
YOLACT       ⑦    ✔ ✔ ✔ ✔ ✔ ✔ ✔      75     50     33.3   50     40     50
YOLACT       ⑧    ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔    0      -      10     11.1   -      10
annotation   ⑨    ✔ ✔ ✔ ✔ ✔ ✔        20     100    20     100    50     33.3
annotation   ⑩    ✔ ✔ ✔ ✔ ✔ ✔ ✔      22.2   22.2   20     50     12.5   25
annotation   ⑪    ✔ ✔ ✔ ✔ ✔ ✔ ✔      100    100    33.3   66.7   25     100
annotation   ⑫    ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔    11.1   22.2   10     37.5   11.1   43.9
VTA          ⑲    ✔                  -      -      -      -      -      -

Slide 27

Slide 27 text

Result : graph2vec
※ Accuracy is shown in %; "-" means no flag was raised.
Dataset columns (✔ = used): shape, color, material, graph, velocity, acceleration, flag, image

Recognition  No.  Dataset (✔)    i      ii     iii    iv     v      vi
YOLO v3      ⑬    ✔ ✔            -      -      -      -      -      -
YOLO v3      ⑭    ✔ ✔ ✔          0      20     0      33.3   20     0
YOLACT       ⑮    ✔ ✔ ✔ ✔        -      -      -      -      -      -
YOLACT       ⑯    ✔ ✔ ✔ ✔ ✔      0      25     10     20     20     0
annotation   ⑰    ✔ ✔ ✔ ✔        -      -      -      -      -      -
annotation   ⑱    ✔ ✔ ✔ ✔ ✔      0      20     20     50     0      0

Slide 28

Slide 28 text

Conclusion
• Research focused on real-world recognition, including world models
  Ø VTA
• Training Data
  Ø Conventional : only image features
  Ø Proposed : graphs representing object relationships
• Focus on individual objects, not just visual information about the environment
• Recognize the real world in detail