PRML2023 S9-5 EriKuroda

Extraction of Motion Change Points based on the Physical Characteristics of Objects
Eri Kuroda (1, 2), Ichiro Kobayashi (1)
1: Ochanomizu University
2: Japan Society for the Promotion of Science
PRML2023 Material S9-5 AU0004

Background・Purpose

Real-World Cognition of Humans
• World model
  - Learns models of what happens after events in the real world
  - Models the observed environment in the human brain
  - Learns how the world works, along with background knowledge, from a few interactions and observations
• Recognition
  - Uses models that represent observations in the brain to understand the existence and physical properties of objects

BUT...
• Machine learning for real-world recognition
  - The input (observation) is an image, which is equivalent to human vision
  - Predictions of image features are treated as predictions of the real world
• ML does not make predictions based on the physical properties of objects or on physical laws, as humans do

Purpose
• Recognize "what" objects are seen and "what kind of motion" they make, as humans do
• Propose a real-world recognition method that takes into account the relationships between objects (e.g., positions and physical properties)

Motivation

Variational Temporal Abstraction (VTA) [Kim+, 2019]
• Extracts the latent structure of the environment from visual information and extracts the timing of environmental changes
• Focuses on pixel changes and does not take into account the physical motion characteristics of objects

Proposal
• Graph-based representation of the relationships between objects
• Extraction of environmental change points based on changes in the graph

Overview

Conventional method
• 3D maze
• Image features only: does not understand the real world
• Change point extraction model: VTA

Proposed method
• CLEVRER
• Graph structure: object detection, speed, acceleration, image features, etc.
• Change point extraction model: VTA
• Flag extraction of change points

Variational Temporal Abstraction [Kim+, 19]

• Observation (input), observation abstraction, temporal abstraction
• Problem: it is difficult to decide when to transition Z
  - Human: easy ↔ Model: difficult

Variational Temporal Abstraction [Kim+, 19]

• Flags are introduced
  - The flag m (0 or 1) is determined by the magnitude of the change in the latent state compared with the previous observation
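As a rough illustration of the flag idea on this slide, the sketch below raises flag 1 when the latent state moves more than a threshold away from the previous one. The helper name `boundary_flags`, the Euclidean distance, and the fixed `threshold` are assumptions made for illustration only. VTA infers the flag with a learned model [Kim+, 2019], so this is a simplified stand-in, not the VTA mechanism.

```python
import math

def boundary_flags(latents, threshold):
    """Return one 0/1 flag per latent state (hypothetical helper)."""
    flags = [0]  # the first state has no predecessor, so nothing is flagged
    for prev, curr in zip(latents, latents[1:]):
        change = math.dist(prev, curr)  # Euclidean distance (assumed measure)
        flags.append(1 if change > threshold else 0)
    return flags

# Toy 2-D latent states: a large jump occurs at index 3.
latents = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (2.0, 2.0), (2.1, 2.0)]
flags = boundary_flags(latents, threshold=1.0)
print(flags)
```

Only the jump between the third and fourth states exceeds the threshold, so only that frame is flagged.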

Method: Process of change point extraction

Training data
• Object recognition (YOLO v3, YOLACT)
• Object position
• Velocity, acceleration
• Position direction flags between objects
• Graph structure
• Embedding vector (node2vec, graph2vec)
• Combination

VTA mechanism
• Change-point extraction

Dataset: CLEVRER [Yi+, 2020]

• CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Number of videos    20,000 (train:val:test = 2:1:1)
Video length        5 sec
Number of frames    128 frames
Shape               cube, sphere, cylinder
Material            metal, rubber
Color               gray, red, blue, green, brown, cyan, purple, yellow
Event               appear, disappear, collide
Annotation          object id, position, speed, acceleration

Training data

• Dataset created from the physical characteristics of the environment
• Pipeline: object recognition (YOLO v3, YOLACT), object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector (node2vec, graph2vec), combination; the VTA mechanism then performs change-point extraction

Training data: Object recognition

YOLOv3 [Redmon+, 18]
• Recognizes objects in the image by shape only
  - object position
  - shape
• Familiar examples
  - face recognition
  - automated driving

YOLACT [Bolya+, 19]
• Recognizes objects in the image by shape, color (and material)
  - object position
  - shape
  - color
  - material

Figure labels
• YOLOv3: {shape, color}
• YOLACT: {shape, color, material}

Training data: Velocity・Acceleration

Calculate location information
• Calculate the coordinates of the object center c from the acquired bounding box coordinates (x_1, y_1) and (x_2, y_2)

  $c = (x, y) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$

Velocity

  $v_{x_t} = (x_t - x_{t-1}) / et_{frame}$
  $v_{y_t} = (y_t - y_{t-1}) / et_{frame}$

Acceleration

  $a_{x_t} = (v_{x_t} - v_{x_{t-1}}) / (et_{frame} \times t)$
  $a_{y_t} = (v_{y_t} - v_{y_{t-1}}) / (et_{frame} \times t)$

※ $et_{frame} = 5/128$: time elapsed between frames
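The center, velocity, and acceleration calculations on this slide can be sketched as follows. The function names and the toy bounding boxes are illustrative assumptions. The slide divides the velocity difference by et_frame × t without defining t, so the sketch exposes `t` as a parameter and defaults it to 1 (one frame step), which is also an assumption.

```python
ET_FRAME = 5 / 128  # seconds between frames (5 s video, 128 frames), from the slide

def center(x1, y1, x2, y2):
    """Center of a bounding box given two opposite corners."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def velocity(curr, prev, et_frame=ET_FRAME):
    """Per-axis velocity from two consecutive center positions."""
    return ((curr[0] - prev[0]) / et_frame, (curr[1] - prev[1]) / et_frame)

def acceleration(v_curr, v_prev, t=1, et_frame=ET_FRAME):
    """Per-axis acceleration from two consecutive velocities."""
    # The denominator et_frame * t follows the slide; t=1 is an assumption.
    dt = et_frame * t
    return ((v_curr[0] - v_prev[0]) / dt, (v_curr[1] - v_prev[1]) / dt)

# Toy bounding boxes (x1, y1, x2, y2) for one object over three frames.
boxes = [(10, 20, 30, 40), (12, 20, 32, 40), (16, 20, 36, 40)]
centers = [center(*b) for b in boxes]
velocities = [velocity(c, p) for p, c in zip(centers, centers[1:])]
accelerations = [acceleration(v, u) for u, v in zip(velocities, velocities[1:])]
print(centers, velocities, accelerations)
```

Three frames yield two velocities and one acceleration, so the first frames of each video have no acceleration value under this scheme.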

Training data: Graph structure

• Node information
  - shape, color, material
• Position direction flags between objects
  - main object = (x_main, y_main)
  - others = (x_other, y_other)
  - x_diff = x_other − x_main
  - y_diff = y_other − y_main
  - Flags "5", "1", "-1", and "-5" are assigned according to the signs (+ / −) of x_diff and y_diff
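A minimal sketch of the position direction flags follows. The slide lists the flag values 5, 1, -1, and -5 and the coordinate differences, but the exact assignment is not fully recoverable from the transcript. The mapping used here (x axis: 5 / -5, y axis: 1 / -1), the value 0 for a zero difference, and the function names are all assumptions.

```python
def axis_flag(diff, positive, negative):
    """Map the sign of a coordinate difference to a flag value."""
    if diff > 0:
        return positive
    if diff < 0:
        return negative
    return 0  # assumption: no flag when the coordinates coincide on this axis

def direction_flags(main, other):
    """Return (x_flag, y_flag) for another object relative to the main object."""
    x_diff = other[0] - main[0]
    y_diff = other[1] - main[1]
    # Assumed mapping: x axis -> 5 / -5, y axis -> 1 / -1.
    return axis_flag(x_diff, 5, -5), axis_flag(y_diff, 1, -1)

main_object = (100, 80)
others = [(150, 60), (40, 120)]
flags = [direction_flags(main_object, o) for o in others]
print(flags)
```

Using different magnitudes on the two axes keeps the x and y directions distinguishable when the flags are stored as edge attributes.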

Training data: Embedding vector

• node2vec [Grover+, 2016]
  - inspired by word2vec's Skip-gram
• graph2vec [Narayanan+, 2017]
  - inspired by doc2vec's PV-DBOW

Example of embedding vector
  [[0.54, 0.29, 0.61, …], [0.82, 0.91, 0.15, …], …, [0.14, 0.35, 0.69, …]]
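To make the Skip-gram connection concrete: node2vec samples biased random walks over the graph and treats each walk like a sentence for Skip-gram training. The sketch below implements only the walk sampling in plain Python. The toy object graph, the function name `node2vec_walk`, and the parameter values are illustrative assumptions, not the settings used in this work.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk of up to `length` nodes."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy undirected object graph (hypothetical objects in one frame).
adj = {
    "cube": {"sphere", "cylinder"},
    "sphere": {"cube", "cylinder"},
    "cylinder": {"cube", "sphere", "cone"},
    "cone": {"cylinder"},
}
walk = node2vec_walk(adj, "cube", length=6, p=0.5, q=2.0)
print(walk)
```

The resulting walks would then be passed to a Skip-gram model to obtain one embedding vector per node.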

Experiment: Process of change point extraction

• The experiment follows the same pipeline as the Method: the training data (object recognition, object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector, combination) are passed to the VTA mechanism for change-point extraction

Experiment: Accuracy Calculation Method

• Examine the accuracy (%) of the flag timing against the collision information in the annotation

Example
• Collision in the annotation: frame 19; collision judged by eye: frame 21
• The correct answer range was therefore set to frames 19-21
• Flags: 18, 19, 20, 22 → accuracy: 2/4 × 100 = 50 (%)
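The accuracy calculation in the example can be written as a short function. The name `flag_accuracy` and the choice to return `None` when no flag was raised (shown as "-" in the results) are assumptions for illustration.

```python
def flag_accuracy(flag_frames, first_correct, last_correct):
    """Percentage of flagged frames that fall inside the correct frame range."""
    if not flag_frames:
        return None  # no flags were raised ("-" in the results tables)
    hits = sum(first_correct <= f <= last_correct for f in flag_frames)
    return hits / len(flag_frames) * 100

# Example from the slide: correct range 19-21, flags at frames 18, 19, 20, 22.
acc = flag_accuracy([18, 19, 20, 22], 19, 21)
print(acc)  # 50.0
```

Because the denominator is the number of flags, a model that raises many spurious flags is penalized even if it also hits the collision frames.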

Experiment: Settings

• Number of training data: 600,000
• Number of training iterations: 500,000
• Batch size: 100
• Output: 80
• Optimization: Adam
• Error function: KL divergence

Result: node2vec

※ Accuracy is shown in %; "-" means not flagged.
Dataset items: shape, color, material, graph, velocity, acceleration, flag, image
(✔ = item included; the number of ✔ marks per row is preserved, their column positions are not.)

Recognition  No.  Dataset (✔)        i      ii     iii    iv     v      vi
YOLO v3      ①    ✔ ✔                50     100    -      -      -      -
YOLO v3      ②    ✔ ✔ ✔              14.3   25     9.1    37.5   14.3   28.6
YOLACT       ③    ✔ ✔ ✔              50     0      50     25     -      -
YOLACT       ④    ✔ ✔ ✔ ✔            22.2   22.2   20     22.2   10     10
YOLACT       ⑤    ✔ ✔ ✔ ✔ ✔ ✔        100    50     25     33.3   25     50
YOLACT       ⑥    ✔ ✔ ✔ ✔ ✔ ✔ ✔      0      -      11.1   10     0      -
YOLACT       ⑦    ✔ ✔ ✔ ✔ ✔ ✔ ✔      75     50     33.3   50     40     50
YOLACT       ⑧    ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔    0      -      10     11.1   -      10
annotation   ⑨    ✔ ✔ ✔ ✔ ✔ ✔        20     100    20     100    50     33.3
annotation   ⑩    ✔ ✔ ✔ ✔ ✔ ✔ ✔      22.2   22.2   20     50     12.5   25
annotation   ⑪    ✔ ✔ ✔ ✔ ✔ ✔ ✔      100    100    33.3   66.7   25     100
annotation   ⑫    ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔    11.1   22.2   10     37.5   11.1   43.9
VTA          ⑲    ✔                  -      -      -      -      -      -

Result: graph2vec

※ Accuracy is shown in %; "-" means not flagged.
Dataset items: shape, color, material, graph, velocity, acceleration, flag, image
(✔ = item included; the number of ✔ marks per row is preserved, their column positions are not.)

Recognition  No.  Dataset (✔)        i      ii     iii    iv     v      vi
YOLO v3      ⑬    ✔ ✔                -      -      -      -      -      -
YOLO v3      ⑭    ✔ ✔ ✔              0      20     0      33.3   20     0
YOLACT       ⑮    ✔ ✔ ✔ ✔            -      -      -      -      -      -
YOLACT       ⑯    ✔ ✔ ✔ ✔ ✔          0      25     10     20     20     0
annotation   ⑰    ✔ ✔ ✔ ✔            -      -      -      -      -      -
annotation   ⑱    ✔ ✔ ✔ ✔ ✔          0      20     20     50     0      0

Conclusion

• Research focused on real-world recognition, including world models
  - VTA
• Training data
  - Conventional: image features only
  - Proposed: graphs representing the relationships between objects
• Focuses on individual objects, not just visual information about the environment
• Recognizes the real world in more detail
