DS2023_erikuroda

Predictive Inference Model of the Physical Environment that Emulates Predictive Coding
Eri Kuroda*1,2 & Ichiro Kobayashi*1
1: Ochanomizu University 2: Japan Society for the Promotion of Science
DS2023 Material

2 Background・Purpose
Real-World Cognition of Humans
• Recognition and Prediction
Ø predict what the subject will do next and take action accordingly
Ø learn how the world works and acquire background knowledge from a few interactions and observations
• change points, common sense
• Understanding the real world through language
Ø have linguistic information such as common sense and knowledge
Ø gain a deeper understanding of the real world by connecting language to the real world
BUT…
• Machine learning for real-world recognition and prediction
Ø input (observation) is an image → equivalent to human vision
Ø predictions of image features are treated as real-world predictions
• ML does not make predictions based on the physical properties of objects or on physical laws, as humans do
• Real-world recognition and understanding the real world through language have not yet been linked
Purpose
• Propose a predictive inference model that can detect and predict physical change points based on the physical laws of real-world objects
• To connect the real world and language, the inference is expressed in language

3 Overview
• Dataset: CLEVRER
• Image: the visual view of the real world
• Graph structure: representation of a set of physical properties
Ø Object detection
Ø speed
Ø acceleration
Ø image features, etc.
• Proposed Model: PredNet VTA, graph VTA
• Experiment 1: whether the timing of the change point of the next step can be displayed correctly
• Experiment 2: generate the inference content as language

4 PredNet [Lotter+, 2016]

5 PredNet [Lotter+, 2016] Hierarchical Model

6 PredNet [Lotter+, 2016]
• Hierarchical Model
• The process of predictive coding

7 Variational Temporal Abstraction [Kim+, 2019]
• Figure: walking on the blue road vs. walking on the red road; for each road, the panel shows all events and the change points

8 Variational Temporal Abstraction [Kim+, 2019]
• Levels (figure): observation (input), observation abstraction, temporal abstraction
• Problem: it is difficult to decide when to transition 𝑍
Ø Human: easy ↔ Model: difficult

9 Variational Temporal Abstraction [Kim+, 2019]
• Introduced flags
Ø The flag 𝑚 (0 or 1) is determined by the magnitude of the change in the latent state compared to the previous observation

10 Proposed Model
• Figure: two PredNet-style hierarchies unrolled over time t, one for the image and one for the physical training data, each built from Input (A), Prediction (Â), Error (E), and Representation (R) units at layers ℓ and ℓ+1
• Input: image (img), physical training data
• Output: image prediction, graph structure prediction based on physical properties, flag m
• Difference: diff = (diff of the image stream) + (diff of the physical-data stream)
• The flag m is output according to whether diff > α (α: threshold)
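The thresholding step on this slide can be sketched in a few lines. This is only an illustration under assumptions: the slide does not say how each stream's difference is computed, so the sketch takes the two differences as already-computed non-negative numbers, and the names `change_flag`, `flag_sequence`, `diff_image`, and `diff_physical` are hypothetical.

```python
from typing import List, Sequence


def change_flag(diff_image: float, diff_physical: float, alpha: float) -> int:
    """Return 1 (change point) when the summed difference exceeds alpha, else 0.

    Assumption: both differences are precomputed scalars; the slide only
    states that diff is their sum and that it is compared with alpha.
    """
    diff = diff_image + diff_physical
    return 1 if diff > alpha else 0


def flag_sequence(image_diffs: Sequence[float],
                  physical_diffs: Sequence[float],
                  alpha: float) -> List[int]:
    """Apply the threshold frame by frame to get one flag m per time step."""
    return [change_flag(i, p, alpha) for i, p in zip(image_diffs, physical_diffs)]


if __name__ == "__main__":
    # Hypothetical per-frame differences; alpha is an arbitrary example value.
    print(flag_sequence([0.1, 0.6, 0.2], [0.1, 0.2, 0.1], alpha=0.5))
```

A single shared threshold keeps the flag decision in one place; whether the original model weights the two streams equally is not stated beyond the plain sum.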

11 Dataset: CLEVRER [Yi+, 2020]
• CLEVRER [Yi+, 2020]
Ø CoLlision Events for Video REpresentation and Reasoning
Number of videos: 20,000 (train:val:test = 2:1:1)
Video length: 5 sec
Number of frames: 128
Shape: cube, sphere, cylinder
Material: metal, rubber
Color: gray, red, blue, green, brown, cyan, purple, yellow
Event: appear, disappear, collide
Annotation: object id, position, speed, acceleration

12 Dataset: physical training dataset
• Dataset created from physical characteristics of the environment
• Diagram components: object recognition, object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector; their combination forms the physical training dataset

13 Dataset: physical training dataset
• Dataset created from physical characteristics of the environment
• Diagram components: object recognition, object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector

14 Dataset: physical training dataset
object recognition
• YOLACT [Bolya+, 2019]
Ø A type of instance segmentation
Ø Detects the {shape, color, material} of an object
• Figure: before detecting / after detecting

15 Dataset: physical training dataset
object recognition
• YOLACT [Bolya+, 2019]
Ø A type of instance segmentation
Ø Detects the {shape, color, material} of an object
Calculate location information
• Calculate the coordinates of the object center c from the two corner points $(x_1, y_1)$ and $(x_2, y_2)$ of the acquired bounding box
$c = (x, y) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$
• Figure: before detecting / after detecting
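The center calculation is simple enough to show directly. A minimal sketch, assuming the bounding box is given as two opposite corner points; the function name `bbox_center` and the corner naming are illustrative and not taken from the deck or from YOLACT.

```python
from typing import Tuple


def bbox_center(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float]:
    """Return the midpoint of the box spanned by corners (x1, y1) and (x2, y2)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    # Hypothetical box in pixel coordinates.
    print(bbox_center(10, 20, 30, 60))
```

Because the midpoint is symmetric in the two corners, it does not matter which corner is passed first.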

16 Dataset: physical training dataset
• Dataset created from physical characteristics of the environment
• Diagram components: object recognition, object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector

17 Dataset: physical training dataset
Velocity・Acceleration
• velocity
$v_{x_t} = (x_t - x_{t-1}) / et_{frame}$
$v_{y_t} = (y_t - y_{t-1}) / et_{frame}$
• acceleration
$a_{x_t} = (v_{x_t} - v_{x_0}) / (et_{frame} \times t)$
$a_{y_t} = (v_{y_t} - v_{y_0}) / (et_{frame} \times t)$
※ $et_{frame} = 5/128$: time elapsed between frames
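The velocity and acceleration definitions can be sketched per axis as follows. The frame interval 5/128 s comes from the slide (5-second videos, 128 frames). Reading the acceleration as the change from the initial velocity divided by the elapsed time `et_frame * t` is an interpretation of the slide's formula, and the function names are hypothetical.

```python
ET_FRAME = 5 / 128  # seconds elapsed between consecutive frames (from the slide)


def velocity(pos_t: float, pos_prev: float, et_frame: float = ET_FRAME) -> float:
    """Velocity along one axis from the positions in two consecutive frames."""
    return (pos_t - pos_prev) / et_frame


def acceleration(v_t: float, v_0: float, t: int, et_frame: float = ET_FRAME) -> float:
    """Acceleration along one axis at frame t.

    Assumption: v_0 is the velocity at the reference frame and t counts the
    frames elapsed since then, so et_frame * t is the elapsed time.
    """
    if t <= 0:
        raise ValueError("t must be a positive number of frames")
    return (v_t - v_0) / (et_frame * t)


if __name__ == "__main__":
    vx = velocity(1.0, 0.0)           # object moved 1.0 unit in one frame
    ax = acceleration(2.0, 1.0, t=2)  # velocity rose by 1.0 over two frames
    print(vx, ax)
```

The same two functions apply unchanged to the y coordinate.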

18 Dataset: physical training dataset
Velocity・Acceleration
• velocity
$v_{x_t} = (x_t - x_{t-1}) / et_{frame}$
$v_{y_t} = (y_t - y_{t-1}) / et_{frame}$
• acceleration
$a_{x_t} = (v_{x_t} - v_{x_0}) / (et_{frame} \times t)$
$a_{y_t} = (v_{y_t} - v_{y_0}) / (et_{frame} \times t)$
※ $et_{frame} = 5/128$: time elapsed between frames
Position direction flags between objects
• main object $= (x_{main}, y_{main})$, others $= (x_{other}, y_{other})$
$x_{diff} = x_{other} - x_{main}$
$y_{diff} = y_{other} - y_{main}$
• The signs (+ / −) of $x_{diff}$ and $y_{diff}$ determine the quadrant (1st to 4th) in which the other object lies relative to the main object
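The direction flag can be sketched as a sign test on the two offsets. The exact sign-to-quadrant table on the slide did not survive extraction, so this sketch assumes the standard mathematical convention (1st quadrant: both offsets positive, then counterclockwise); the deck may use a different orientation, for example image coordinates with y pointing down. The names `relative_offset` and `quadrant` are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def relative_offset(main: Point, other: Point) -> Tuple[float, float]:
    """Return (x_diff, y_diff): the other object's position relative to the main object."""
    return (other[0] - main[0], other[1] - main[1])


def quadrant(x_diff: float, y_diff: float) -> int:
    """Return 1-4 for the quadrant of (x_diff, y_diff), or 0 when it lies on an axis.

    Assumption: standard counterclockwise numbering with y pointing up.
    """
    if x_diff > 0 and y_diff > 0:
        return 1
    if x_diff < 0 and y_diff > 0:
        return 2
    if x_diff < 0 and y_diff < 0:
        return 3
    if x_diff > 0 and y_diff < 0:
        return 4
    return 0


if __name__ == "__main__":
    main_obj, other_obj = (5.0, 5.0), (2.0, 8.0)  # hypothetical object centers
    print(quadrant(*relative_offset(main_obj, other_obj)))
```

Returning 0 for points on an axis is a choice made for this sketch; the slide does not say how that boundary case is handled.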

19 Dataset: physical training dataset
graph structure
• Node information
Ø shape, color, material
embedding vector
• node2vec [Grover+, 2016]
• Example of embedding vector
[[0.54, 0.29, 0.61, …], [0.82, 0.91, 0.15, …], …, [0.14, 0.35, 0.69, …]]
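As background on the embedding step: node2vec learns node vectors by sampling random walks over the graph and training a skip-gram model on them. The sketch below shows only the walk-sampling half, with uniform transitions (the special case of node2vec where the return and in-out parameters p and q are both 1). It is not the deck's pipeline: the node labels, the edges, and the function names are hypothetical, and no embedding is trained here.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency map from a list of edges."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def random_walk(graph: Graph, start: Hashable, length: int,
                rng: random.Random) -> List[Hashable]:
    """Sample a uniform random walk with `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]], key=repr)  # sorted for reproducibility
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


if __name__ == "__main__":
    # Hypothetical nodes labeled by (color, material, shape).
    a = ("blue", "rubber", "sphere")
    b = ("gray", "metal", "sphere")
    c = ("red", "metal", "cylinder")
    g = build_graph([(a, b), (b, c)])
    print(random_walk(g, a, length=5, rng=random.Random(0)))
```

In a full pipeline, many such walks per node would be fed to a skip-gram model to obtain one vector per node.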

20 Dataset: physical training dataset
• Dataset created from physical characteristics of the environment
• Diagram components: object recognition, object position, velocity, acceleration, position direction flags between objects, graph structure, embedding vector; their combination forms the physical training dataset

21 Experiment Summary
• Ex 1: Extracting Predicted Change Points
• Ex 2: Text Generation

22 Experiment Summary
Ex 1: Extracting Predicted Change Points
Purpose
• Test whether the predicted change point of an event can be extracted correctly
Setting
• Dataset
Ø CLEVRER
Ø Physical training data
• Scope of coverage: 6 patterns × 10 frames
Ø Situations in which physical changes of objects occur, such as collision, disappearance, and appearance

23 Ex1: Accuracy Calculation Method
• Examine the F-measure (%) between the collision information in the annotation and the timing of the flag
• Correct answer range
Ø collision in the annotation → frame 19; collision judged by eye → frame 21
Ø The correct answer range was therefore set to frames 19-21
• Figure: frame 19, frame 20, frame 21
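One way to score flags against a correct answer range is a frame-level F-measure. This is a sketch under assumptions: the deck does not spell out how true and false positives are counted, so here every frame in the evaluated window is treated as one binary decision, and the function name and the example flags are hypothetical.

```python
from typing import Iterable, Sequence


def f_measure(predicted_flags: Sequence[int],
              correct_frames: Iterable[int],
              first_frame: int = 0) -> float:
    """Frame-level F-measure (%) of flags m against a set of correct frames.

    predicted_flags[i] is the flag for frame first_frame + i.
    Assumption: correct frames outside the evaluated window are ignored.
    """
    correct = set(correct_frames)
    tp = fp = fn = 0
    for offset, flag in enumerate(predicted_flags):
        in_range = (first_frame + offset) in correct
        if flag == 1 and in_range:
            tp += 1
        elif flag == 1:
            fp += 1
        elif in_range:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical flags for the 10 frames 15..24; correct range is frames 19-21.
    flags = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
    print(round(f_measure(flags, range(19, 22), first_frame=15), 1))
```

Using a range rather than a single frame means a flag raised one or two frames after the annotated collision still counts as correct.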

24 Ex1: Setting
• Training data: 600,000
• Test data: 80,000
• Epochs: 500,000
• Batch size: 100
• Optimization: Adam
• Error function: KL divergence

25 Ex1: Result
F-measure (%)
Physical training data | i | ii | iii | iv | v | vi
Created based on 2D coordinates obtained from object recognition | 40.0 | 50.0 | 50.0 | 40.0 | 57.1 | 50.0
Created from accurate 3D information (annotation) | 57.1 | 50.0 | 57.1 | 44.4 | 50.0 | 50.0
• Figure (result of range i): original images and predicted images from t=1 to t=12, each frame labeled with its flag (m=0 or m=1), with the collision marked

26 Ex1: Result
• Predictions made with the 2D-based data reach accuracy equivalent to those made with the 3D-based data (annotation)

27 Experiment Summary
Ex 1: Extracting Predicted Change Points
Purpose
• Test whether the predicted change point of an event can be extracted correctly
Setting
• Dataset
Ø CLEVRER
Ø Physical training data
• Scope of coverage: 6 patterns × 10 frames
Ø Situations in which physical changes of objects occur, such as collision, disappearance, and appearance
Ex 2: Text Generation
Purpose
• Express reasoning as language to connect the real world and language
Setting
• Dataset
Ø Paired data of graph embedding vectors and language data
• Collision situations only

28 Ex2: Creation of Templates
• Nine templates
Ø 3 (before・collision・after) × 3 (sentence type)
• Object type
Ø "color" "shape"
Ø ex) blue sphere, gray cylinder, etc.
• Templates (A・B: objects)
Ø before: A and B approach / A approaches B / B approaches A
Ø collision: A and B collide / A is repulsed by B / B is repulsed by A
Ø after: A and B move apart / A moves away from B / B moves away from A
• Timeline in the figure: before, collision, after, with two spans labeled 5 frames
Example of text templates: colliding objects "blue sphere" and "gray sphere"
• before collision
「青色の球と灰色の球が近づく」 “Blue sphere and gray sphere approach.”
「青色の球が灰色の球に近づく」 “Blue sphere approaches gray sphere.”
「灰色の球が青色の球に近づく」 “Gray sphere approaches blue sphere.”
• collision
「青色の球と灰色の球がぶつかる」 “Blue sphere and gray sphere collide.”
「青色の球が灰色の球にはじかれる」 “Blue sphere is repulsed by gray sphere.”
「灰色の球が青色の球にはじかれる」 “Gray sphere is repulsed by blue sphere.”
• after collision
「青色の球と灰色の球が離れる」 “Blue sphere and gray sphere move apart.”
「青色の球から灰色の球が離れる」 “Gray sphere moves away from blue sphere.”
「灰色の球から青色の球が離れる」 “Blue sphere moves away from gray sphere.”
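Filling the nine templates for a pair of objects can be sketched as plain string formatting. This sketch works on the English glosses only (the deck's actual templates are Japanese sentences), and the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List

# English glosses of the nine templates: 3 phases x 3 sentence types.
TEMPLATES: Dict[str, List[str]] = {
    "before": ["{a} and {b} approach.", "{a} approaches {b}.", "{b} approaches {a}."],
    "collision": ["{a} and {b} collide.", "{a} is repulsed by {b}.", "{b} is repulsed by {a}."],
    "after": ["{a} and {b} move apart.", "{a} moves away from {b}.", "{b} moves away from {a}."],
}


def describe(color: str, shape: str) -> str:
    """Object type as used in the templates: "color" followed by "shape"."""
    return f"{color} {shape}"


def fill_templates(phase: str, a: str, b: str) -> List[str]:
    """Return the three sentences for one phase, with the first letter capitalized."""
    sentences = [t.format(a=a, b=b) for t in TEMPLATES[phase]]
    return [s[0].upper() + s[1:] for s in sentences]


if __name__ == "__main__":
    a, b = describe("blue", "sphere"), describe("gray", "sphere")
    for phase in TEMPLATES:
        print(phase, fill_templates(phase, a, b))
```

Keeping three sentence types per phase gives each frame several acceptable reference sentences, which matters when generated text is later compared against the correct text.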

29 Ex2: Text generating model
• Train: the decoder model is trained on text pair data (graph embedding → Linear → Decoder → Softmax; input <bos> w1 w2 … wt, output w1 w2 … wt <eos>)
• Test: the predicted graph embedding (Â) is input to the trained decoder model, which generates text indicating the predicted content
• Data sizes shown on the slide: 219,303 pieces, 10,965 pieces

30 Ex2: Result
Results are shown for Range i, Range ii, Range iv, and Range vi. Each example shows the original image, the predicted image, the correct text, and the generated text.
• Example: green sphere and red cylinder
Ø correct text
「緑色の球と赤色の円柱がぶつかる」 “Green sphere and red cylinder collide.”
「緑色の球が赤色の円柱にはじかれる」 “Green sphere is repulsed by red cylinder.”
「赤色の円柱が緑色の球にはじかれる」 “Red cylinder is repulsed by green sphere.”
Ø generated text
「緑色の円柱が赤色の円柱にはじかれる」 “Green cylinder is repulsed by red cylinder.”
Ø object’s color ✔, shape ✘
• Example: gray sphere and blue cylinder
Ø correct text
「灰色の球と青色の円柱がぶつかる」 “Gray sphere and blue cylinder collide.”
「灰色の球が青色の円柱にはじかれる」 “Gray sphere is repulsed by blue cylinder.”
「青色の円柱が灰色の球にはじかれる」 “Blue cylinder is repulsed by gray sphere.”
Ø generated text
「灰色の球が青色の立方体にはじかれる」 “Gray sphere is repulsed by blue cube.”
Ø object’s color ✔, shape ✘
• Example: cyan cube and cyan cylinder
Ø correct text
「水色の立方体と水色の円柱がぶつかる」 “Cyan cube and cyan cylinder collide.”
「水色の立方体が水色の円柱にはじかれる」 “Cyan cube is repulsed by cyan cylinder.”
「水色の円柱が水色の立方体にはじかれる」 “Cyan cylinder is repulsed by cyan cube.”
Ø generated text
「水色の立方体が青色の球にぶつかる」 “Cyan cube collides with blue sphere.”
Ø object’s color ✘, shape ✘
• Example: green cylinder and brown cube
Ø correct text
「緑色の円柱と茶色の立方体がぶつかる」 “Green cylinder and brown cube collide.”
「緑色の円柱が茶色の立方体にはじかれる」 “Green cylinder is repulsed by brown cube.”
「茶色の立方体が緑色の円柱にはじかれる」 “Brown cube is repulsed by green cylinder.”
Ø generated text
「緑色の円柱が茶色の立方体にぶつかる」 “Green cylinder collides with brown cube.”
Ø object’s color ✔, shape ✔

31 Ex2: Discussion of the results of range vi
• Figure: frames at 25, 20, 15, 10, and 5 frames before the collision, and at the collision
• Generated text: 「水色の立方体が青色の球にぶつかる」 “Cyan cube collides with blue sphere.” (object’s color ✘, shape ✘)
• Reason why both the color and the shape of the object are incorrect
Ø Possibly the "cyan cube" and the "blue sphere" were judged to have collided

32 Ex2: BLEU score
BLEU@2 | BLEU@3 | BLEU@4
79.7 | 74.5 | 68.8
• Since the average is taken, it is possible that the score is a little low
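For readers unfamiliar with the metric, BLEU@n combines the clipped n-gram precisions for orders 1 to n with a brevity penalty. The sketch below is a plain sentence-level version with uniform weights and no smoothing. The deck does not state which BLEU implementation, tokenizer, or averaging was used, so this is an illustration only: tokens are taken as whitespace-separated English words (Japanese text would need a tokenizer), and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], references: List[List[str]], max_n: int) -> float:
    """Sentence-level BLEU@max_n in percent (uniform weights, no smoothing)."""
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions += math.log(clipped / total)
    # Brevity penalty against the reference whose length is closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * penalty * math.exp(log_precisions / max_n)


if __name__ == "__main__":
    refs = [
        "green sphere and red cylinder collide".split(),
        "green sphere is repulsed by red cylinder".split(),
        "red cylinder is repulsed by green sphere".split(),
    ]
    cand = "green cylinder is repulsed by red cylinder".split()
    print(round(bleu(cand, refs, max_n=2), 1))
```

Because a single wrong word breaks every n-gram that contains it, higher orders drop faster, which is consistent with the decreasing scores from BLEU@2 to BLEU@4 reported on the slide.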

33 Conclusion
• Constructed a predictive inference model that mimics the hierarchical structure of the human brain
Ø added a flag “m” representing change points to the hierarchical structure of PredNet
Ø the experimental results show that the timing of change points can also be obtained for the predicted content
• Generated the inference as language to connect real-world events and objects with language
Ø the experimental results show that language can be generated for the content of the inferences

34 Future Tasks
• Modify the model to increase accuracy
• Use data closer to the real world
• In a real-life environment, extract easy-to-understand change points and predict what actions will be necessary
