Slide 1

Slide 1 text

Verbal Representation of Object Collision Prediction Based on Physical CommonSense Knowledge
Eri Kuroda & Ichiro Kobayashi, Ochanomizu University

Slide 2

Slide 2 text

Background, Purpose

Real-world Understanding and Prediction of Human
• Predict what an object's next move will be and determine the action to take
• Learn background and other information from interactions and observations → identifying the important aspects of events is essential
• Connections between the real world and language

Machine learning for real-world recognition and prediction
• Input (observation) is an image → equivalent to human vision
• Predictions of image features are treated as real-world predictions

BUT… ML does not make predictions based on the physical properties of objects or on physical laws, as humans do.

Purpose
1. To construct four different change point prediction models.
2. To connect language with the real world.
3. To express predictions in more detailed sentences based on the characteristics of the environment.

Slide 3

Slide 3 text

Overview

1. Change Point Prediction Model
• Input: CLEVRER [Yi+, 19] and physical training data
  • Graph embedding vector
  • Velocity of each object
  • Acceleration of each object
  • Positional relationship between objects
• Based models: PredNet [Lotter+, 16], PredRNN [Wang+, 17], PredRNN v2 [Wang+, 21], PreCNet [Straka+, 23]
• Output: predicted image, prediction of graph structure

2. Language Model
• Input: prediction of graph structure
• Generated text: "Red cylinder is repulsed by green sphere" (object's color ✔, shape ✔)

3. Text generation model including physical common sense
• Adds a condition of physical common sense to the generated text
• (Ex) Condition 29: A's mass is large. B's mass is small. A is slow. B is fast. Floor is rough.
• Object A = red cylinder, Object B = green sphere
• Regenerated text: "Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance."

Slide 4

Slide 4 text

Based Models

PredNet [Lotter+, 16]
• Mimics predictive coding processing in the cerebral cortex
• Infers errors hierarchically

PreCNet [Straka+, 23]
• Improved PredNet
• Infers the entire input information each time

[Figure: block diagrams of PredNet and PreCNet, showing Input, Prediction, Target, Error, and Representation units built from conv, pool, upsample, conv LSTM, subtract, and ReLU operations.]

Slide 5

Slide 5 text

Based Models: Internal structure of PredRNN, PredRNN v2

PredRNN [Wang+, 2017]
• Prediction model with a hierarchical structure using ConvLSTM
• H (hidden layer) is used as input in both space and time
• Adds a spatio-temporal memory mechanism to the original ConvLSTM

PredRNN v2 [Wang+, 2022]
• Prediction model that improves on PredRNN
• The number of gates that take H as input has been increased

Equations shown on the slide:
L = L_1 + L_2 = \frac{1}{n} \sum_{i=1}^{n} |\hat{x}_i - x_i| + \frac{1}{n} \sum_{i=1}^{n} (\hat{x}_i - x_i)^2
L_{decouple} = \cos(\Delta C_t^{\ell}, \Delta M_t^{\ell})

[Figure: cell diagrams of PredRNN and PredRNN v2, showing the Input Gate, Output Gate, Input Modulation Gate, Forget Gate, Standard Temporal Memory, and Spatiotemporal Memory.]
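The two loss terms on this slide can be illustrated with a minimal sketch. It assumes flattened lists of numbers in place of image tensors, and the function names `l1_plus_l2` and `decouple_penalty` are hypothetical. Taking the absolute value of the cosine is also an assumption; the slide shows only the cosine between the updates of the temporal and spatiotemporal memories.

```python
import math

def l1_plus_l2(pred, target):
    """Mean absolute error plus mean squared error over flattened values."""
    n = len(pred)
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / n
    l2 = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    return l1 + l2

def decouple_penalty(delta_c, delta_m, eps=1e-8):
    """Absolute cosine similarity between two memory updates (assumed form).

    delta_c: update of the standard temporal memory (flattened)
    delta_m: update of the spatiotemporal memory (flattened)
    """
    dot = sum(c * m for c, m in zip(delta_c, delta_m))
    norm_c = math.sqrt(sum(c * c for c in delta_c))
    norm_m = math.sqrt(sum(m * m for m in delta_m))
    return abs(dot) / (norm_c * norm_m + eps)
```

A penalty near 0 means the two memory updates are close to orthogonal, which is the situation the decoupling term rewards; a penalty near 1 means both memories are changing in the same direction.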

Slide 6

Slide 6 text

Variational Temporal Abstraction [Kim+, 19]

[Figure: two walking routes (blue path and red path); for each path, the sequence of all events and the change points extracted from it.]

Slide 7

Slide 7 text

Variational Temporal Abstraction [Kim+, 19]
• Levels shown: observation (input), observation abstraction, temporal abstraction
• Problem: it is difficult to decide when to transition Z (human: easy, model: difficult)

Slide 8

Slide 8 text

Variational Temporal Abstraction [Kim+, 19]
• Determines the flag m (0 or 1) from the magnitude of the change in the latent state compared to the previous observation

Slide 9

Slide 9 text

PredNet-based proposed Model
• Inputs at time t: image data and physical training data
• Two PredNet streams, one for image data (img) and one for physical data (phy), each with Input, Prediction, Error, and Representation units at layers ℓ and ℓ+1
• Outputs: img output and flag output m_t

diff_t = diff_{img} + diff_{phy}
m_t = 0 if diff_t < \alpha, m_t = 1 if diff_t > \alpha (\alpha: threshold)

Slide 10

Slide 10 text

PredRNN, PredRNN v2-based proposed Model
• Inputs at time t: image data and physical training data
• Two stacks of four ST-LSTM layers (ℓ = 1, …, 4): one for image data (img) and one for physical data (phy), each passing its hidden state and memories between layers and time steps
• Outputs: img output and flag m_t

diff_t = diff_{t\_img} + diff_{t\_phy}
m_t = 0 if diff_t < \alpha, m_t = 1 if diff_t > \alpha

Slide 11

Slide 11 text

PreCNet-based proposed Model
• Inputs at time t: image data x_{t\_img} and physical data x_{t\_phy}
• Two PreCNet streams (img, phy), each with Error (E), Representation (R), and Prediction units at layers ℓ and ℓ+1, connected by upsampling
• Outputs: img output and flag m_t

diff_t = diff_{t\_img} + diff_{t\_phy}
m_t = 0 if diff_t < \alpha, m_t = 1 if diff_t > \alpha
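The thresholding rule shared by the three proposed models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the per-stream differences are taken as given numbers, and the slides do not say what happens when the difference equals the threshold, so returning 0 in that case is an assumption.

```python
def change_point_flag(diff_img, diff_phy, alpha):
    """Return the flag m_t for one time step.

    diff_img: difference measured on the image stream
    diff_phy: difference measured on the physical-data stream
    alpha:    threshold
    """
    diff = diff_img + diff_phy
    # The slides define only diff < alpha (0) and diff > alpha (1);
    # treating diff == alpha as 0 is an assumption of this sketch.
    return 1 if diff > alpha else 0

def flag_sequence(diffs_img, diffs_phy, alpha):
    """Apply the rule to every time step of a sequence."""
    return [change_point_flag(a, b, alpha) for a, b in zip(diffs_img, diffs_phy)]
```

For example, `flag_sequence([0.1, 0.6, 0.2], [0.1, 0.5, 0.1], alpha=0.5)` marks only the middle step as a change point, because only there does the summed difference exceed the threshold.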

Slide 12

Slide 12 text

Dataset: CLEVRER [Yi+, 2020]
• CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Number of videos: 20,000 (train:val:test = 2:1:1)
Video length: 5 sec
Number of frames: 128 frames
Shape: cube, sphere, cylinder
Material: metal, rubber
Color: gray, red, blue, green, brown, cyan, purple, yellow
Event: appear, disappear, collide
Annotation: object id, position, speed, acceleration

Slide 13

Slide 13 text

Dataset: physical training dataset
• Dataset created from the physical characteristics of the environment
• Features obtained through object recognition: object position, velocity, acceleration, positional direction flags between objects
• The features are combined into a graph structure, which is converted into an embedding vector
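As a rough illustration of the per-object features listed here, the sketch below derives velocity and acceleration from positions by finite differences and computes pairwise direction flags. Everything in it is an assumption made for illustration: CLEVRER annotations already provide these quantities, the frame interval is simply 5 s divided by 128 frames, and the meaning of the "direction flags" (here: whether B lies to the right of or ahead of A) is a guess, as are all function and key names.

```python
FRAME_DT = 5.0 / 128  # seconds per frame, assuming 5 s videos with 128 frames

def finite_difference(series, dt=FRAME_DT):
    """Forward differences of a list of (x, y) tuples, divided by dt."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(series, series[1:])]

def direction_flags(pos_a, pos_b):
    """Hypothetical flags describing where object B lies relative to A."""
    return {
        "b_right_of_a": int(pos_b[0] > pos_a[0]),
        "b_ahead_of_a": int(pos_b[1] > pos_a[1]),
    }

# Example: one object moving along x with increasing speed.
positions = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0)]
velocity = finite_difference(positions)      # one entry fewer than positions
acceleration = finite_difference(velocity)   # one entry fewer than velocity
```

Each differencing step shortens the series by one frame, so in practice the first frames of a video would need padding or trimming before the features are combined into one graph per frame.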

Slide 14

Slide 14 text

Overview (same pipeline diagram as Slide 3: 1. Change Point Prediction Model, 2. Language Model, 3. Text generation model including physical common sense)

Slide 15

Slide 15 text

Ex1: Creation of Templates
• Nine templates: 3 (before, collision, after) × 3 (sentence type)
• Object type: color, shape (ex: blue sphere, gray cylinder, etc.)

Templates (※ A, B: objects)
• before: "A and B approach", "A approaches B", "B approaches A"
• collision: "A and B collide", "A is repulsed by B", "B is repulsed by A"
• after: "A and B leave", "A away from B", "B away from A"

Example of text templates: colliding objects "blue sphere", "gray sphere"

before collision
「青色の球と灰色の球が近づく」 "Blue sphere and gray sphere approach."
「青色の球が灰色の球に近づく」 "Blue sphere approaches gray sphere."
「灰色の球が青色の球に近づく」 "Gray sphere approaches blue sphere."

collision
「青色の球と灰色の球がぶつかる」 "Blue sphere and gray sphere collide."
「青色の球が灰色の球にはじかれる」 "Blue sphere is repulsed by gray sphere."
「灰色の球が青色の球にはじかれる」 "Gray sphere is repulsed by blue sphere."

after collision
「青色の球と灰色の球が離れる」 "Blue sphere and gray sphere leave."
「青色の球から灰色の球が離れる」 "Gray sphere away from blue sphere."
「灰色の球から青色の球が離れる」 "Blue sphere away from gray sphere."

[Figure labels: 5 frames, 5 frames]
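The nine templates can be filled in mechanically for any pair of objects. The sketch below is one possible way to do this with the English template wording from the slide; the names `TEMPLATES` and `fill_templates` are hypothetical, and the Japanese versions are not covered.

```python
# English templates from the slide; A and B are placeholders for objects.
TEMPLATES = {
    "before": ["{A} and {B} approach.", "{A} approaches {B}.", "{B} approaches {A}."],
    "collision": ["{A} and {B} collide.", "{A} is repulsed by {B}.", "{B} is repulsed by {A}."],
    "after": ["{A} and {B} leave.", "{A} away from {B}.", "{B} away from {A}."],
}

def fill_templates(obj_a, obj_b):
    """Return the nine sentences for one object pair, grouped by phase."""
    def sentence(template):
        text = template.format(A=obj_a, B=obj_b)
        return text[0].upper() + text[1:]  # capitalize the first letter only
    return {phase: [sentence(t) for t in templates]
            for phase, templates in TEMPLATES.items()}

sentences = fill_templates("blue sphere", "gray sphere")
```

Here `sentences["collision"]` holds the three collision sentences for the blue and gray spheres, matching the example on the slide.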

Slide 16

Slide 16 text

Ex1: text generating model
• Train (Decoder model): text pair data; graph embedding → Linear → Decoder → Softmax → w1, w2, …, wt
• Test: predicted graph embedding (input) → Trained Decoder Model → generated text indicating predicted content
• Data sizes shown on the slide: 219,303 pieces and 10,965 pieces

Slide 17

Slide 17 text

Ex1: result – range i

Correct:
• Green sphere and red cylinder collide.
• Green sphere is repulsed by red cylinder.
• Red cylinder is repulsed by green sphere.

Based model        Generated text                                 color  shape
PredNet-based      Green cylinder is repulsed by red cylinder.    ✔      ✘
PredRNN-based      Green cylinder and red cylinder collide.       ✔      ✘
PredRNN v2-based   Red cylinder is repulsed by green sphere.      ✔      ✔
PreCNet-based      Red cylinder is repulsed by green sphere.      ✔      ✔

Slide 18

Slide 18 text

Ex1: result – range vi

Correct:
• Cyan cube and cyan cylinder collide.
• Cyan cube is repulsed by cyan cylinder.
• Cyan cylinder is repulsed by cyan cube.

Based model        Generated text                             color  shape
PredNet-based      Cyan cube is repulsed by blue sphere.      ✘      ✘
PredRNN-based      Cyan cube is repulsed by blue sphere.      ✘      ✘
PredRNN v2-based   Cyan cube is repulsed by cyan sphere.      ✔      ✘
PreCNet-based      Cyan cube is repulsed by cyan cylinder.    ✔      ✔
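The color and shape marks in these result slides can be reproduced with a simple word-level check. The sketch below is an assumption about how such a check could work, not the authors' evaluation code: it collects the color words and shape words in a generated sentence and compares them, as sorted lists, with those of a reference sentence. The vocabularies are the CLEVRER colors and shapes; the function names are hypothetical.

```python
COLORS = {"gray", "red", "blue", "green", "brown", "cyan", "purple", "yellow"}
SHAPES = {"cube", "sphere", "cylinder"}

def mentioned(text, vocabulary):
    """Sorted list of vocabulary words in the text (duplicates kept)."""
    words = text.lower().replace(".", " ").split()
    return sorted(w for w in words if w in vocabulary)

def check_objects(generated, reference):
    """Compare the colors and shapes named in two sentences."""
    return {
        "color": mentioned(generated, COLORS) == mentioned(reference, COLORS),
        "shape": mentioned(generated, SHAPES) == mentioned(reference, SHAPES),
    }

result = check_objects("Cyan cube is repulsed by cyan sphere.",
                       "Cyan cube and cyan cylinder collide.")
```

Under this check the PredRNN v2-based sentence for range vi gets the colors right but not the shapes, in line with the marks on the slide. The check ignores which color belongs to which shape, so a sentence that swaps attributes between the two objects would still pass.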

Slide 19

Slide 19 text

Ex1: result – Comparison of accuracy using evaluation metrics

Based model        language  BLEU@2↑  BLEU@3↑  BLEU@4↑  METEOR↑  CIDEr↑
PredNet-based      en        80.3     63.0     56.3     68.8     72.9
                   ja        79.7     74.5     68.8     70.2     72.4
PredRNN-based      en        84.3     66.8     59.1     72.6     74.6
                   ja        82.5     76.1     73.4     73.5     75.1
PredRNN v2-based   en        86.2     72.4     62.7     75.9     78.3
                   ja        85.9     78.9     75.7     77.6     78.2
PreCNet-based      en        90.6     77.1     67.9     78.1     80.3
                   ja        88.3     80.6     79.2     80.4     81.2

Slide 20

Slide 20 text

Overview (same pipeline diagram as Slide 3: 1. Change Point Prediction Model, 2. Language Model, 3. Text generation model including physical common sense)

Slide 21

Slide 21 text

Ex2: Text generation model for collision situation based on physical commonsense knowledge

List of environmental conditions (state of the environment / objects)
1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
2. Floor is rough. The mass of object A is large. The mass of object B is small. …
…
45. …

Select from conditions No.1 – No.45
• Example: condition No.19
  • Floor is slippery.
  • The mass of object A is large. The mass of object B is small.
  • Object A is slow. Object B is fast.

Components shown: T5, crowdsourcing

Output example: "Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance."

Slide 22

Slide 22 text

Ex2: Add common sense externally
Assignment of conditions: 45 types

Environment
1. Floor is slippery.
2. Floor is rough.

State of an object (mass)
1. Object A and object B have equal mass.
2. The mass of object A is large. The mass of object B is small.
3. The mass of object A is small. The mass of object B is large.

State of an object (speed)
1. Object A and object B have equal velocities.
2. Object A is fast. Object B is slow.
3. Object A is slow. Object B is fast.

Patterns
• Environment pattern: 1, 2, none
• Mass pattern: 1, 2, 3, none
• Speed pattern: 1, 2, 3, none
• In all: 3 × (4 × 4 − 1) = 45 (excluding the case where both mass and speed are none)
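The count of 45 conditions can be checked by enumerating the combinations directly. In the sketch below, `None` stands for "none", and the short labels such as "A > B" are shorthand introduced here for the mass and speed statements. The enumeration order is arbitrary and is not claimed to match the slide's condition numbering (for example No.19 or Condition 29).

```python
from itertools import product

FLOOR = ["slippery", "rough", None]        # environment pattern: 1, 2, none
MASS = ["A = B", "A > B", "A < B", None]   # mass pattern: 1, 2, 3, none
SPEED = ["A = B", "A > B", "A < B", None]  # speed pattern: 1, 2, 3, none

def enumerate_conditions():
    """All (floor, mass, speed) triples except those with no mass and no speed."""
    return [(floor, mass, speed)
            for floor, mass, speed in product(FLOOR, MASS, SPEED)
            if not (mass is None and speed is None)]

conditions = enumerate_conditions()
```

The full product has 3 × 4 × 4 = 48 triples; removing the three triples in which both mass and speed are unspecified leaves 3 × (4 × 4 − 1) = 45.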

Slide 23

Slide 23 text

Ex2: result - range i

Generated statement by model: A red cylinder is repelled by a green sphere.
Object A: red cylinder, Object B: green sphere

Floor: slippery, mass: A = B, speed: A = B
• Generated text based on physical common sense: Red cylinder and green sphere collide and both are bounced off in opposite directions.
• Example of correct answer by human operators: Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance.

Floor: slippery, mass: −, speed: A < B
• Generated text based on physical common sense: Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance.
• Example of correct answer by human operators: Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little.

Floor: rough, mass: A < B, speed: A > B
• Generated text based on physical common sense: Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders.
• Example of correct answer by human operators: The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little.

Floor: −, mass: A > B, speed: A < B
• Generated text based on physical common sense: The green sphere collides with the red cylinder with such force that the green sphere is bounced off.
• Example of correct answer by human operators: Green sphere strikes Red cylinder and the green sphere is bounced.

Slide 24

Slide 24 text

Ex2: result - range vi

Generated statement by model: A cyan cube is repelled by a cyan cylinder.
Object A: cyan cube, Object B: cyan cylinder

Floor: −, mass: A = B, speed: −
• Generated text based on physical common sense: The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions.
• Example of correct answer by human operators: The light blue cylinder collides with the light blue cube, and the light blue cube is sent flying

Floor: slippery, mass: A > B, speed: A < B
• Generated text based on physical common sense: The cyan cube collides with the cyan cylinder with great force, and the cyan cube is sent flying away.
• Example of correct answer by human operators: The cyan cube is violently hit by the cyan cylinder, and the cyan cube is slightly knocked away

Floor: rough, mass: −, speed: A = B
• Generated text based on physical common sense: The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions.
• Example of correct answer by human operators: The cyan cylinder and the cyan cube collide, and the cyan cylinder and the cyan cube bounce back to the same extent.

Floor: −, mass: A = B, speed: A > B
• Generated text based on physical common sense: The cyan cube collides with the cyan cylinder with great force, sending the cyan cylinder flying away.
• Example of correct answer by human operators: The cyan cube is slowly sent in one direction, while the cyan cylinder is sent in the opposite direction at high speed.

Slide 25

Slide 25 text

Ex2: result – Comparison of accuracy using evaluation metrics

Based model        BLEU@4  BERTScore  BLEURT  ROUGE  Implication Full  Implication Division  G-EVAL-4o
PredNet-based      37.3    68.2       30.3    34.4   43.2              64.5                  80.3
PredRNN-based      43.6    74.7       36.1    41.8   48.7              69.7                  83.1
PredRNN v2-based   46.5    79.5       45.6    49.6   56.1              75.3                  88.5
PreCNet-based      55.8    82.2       49.7    56.4   67.9              80.2                  92.4

Slide 26

Slide 26 text

Conclusion & Future works

Conclusion
• Predictive inference model to extract change points
  • Models that can visually and physically predict change points in the observed environment
  • Inference content is expressed in language to link the real world to language
• Linguistic generation of inferences based on experimental results
  • Add conditions about the environment and objects
  • Regenerate sentences that include human experience and commonsense knowledge

Future works
• The dataset is simple
• Replacing inference and language generation with an LLM