ICMLC2025_erikuroda

Verbal Representation of Object Collision Prediction Based on Physical CommonSense Knowledge
Eri Kuroda & Ichiro Kobayashi, Ochanomizu University

Background, Purpose

Real-world understanding and prediction by humans
• Predict what an object's next move will be and determine the action to take
• Learn background and other information from interactions and observations → identifying the important aspects of events matters
• Connections between the real world and language

Machine learning for real-world recognition and prediction
• Input (observation) is an image → equivalent to human vision
• Predictions of image features are treated as real-world predictions

BUT…
• ML doesn't make predictions based on physical properties of objects or physical laws, as humans do

Purpose
1. To construct four different change point prediction models.
2. To connect language with the real world.
3. To express predictions in more detailed sentences based on the characteristics of the environment.

Overview

1. Change Point Prediction Model
• Input: CLEVRER [Yi+, 19] and physical training data
  • Graph embedding vector
  • Velocity of each object
  • Acceleration of each object
  • Positional relationship between objects
• Based models: PredNet [Lotter+, 16], PredRNN [Wang+, 17], PredRNN v2 [Wang+, 21], PreCNet [Straka+, 23]
• Output: predicted image, prediction of graph structure

2. Language Model
• Input: prediction of graph structure
• Generated text: "Red cylinder is repulsed by green sphere" (object's color ✔, shape ✔)

3. Text generation model including physical common sense
• Add a condition of physical common sense to the generated text
• (Ex) Condition 29, with Object A = red cylinder and Object B = green sphere
  • A's mass is large. B's mass is small.
  • A is slow. B is fast.
  • Floor is rough.
• Regenerated text: "Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance."

Based Models

PredNet [Lotter+, 16]
• mimics predictive coding processing in the cerebral cortex
• infers errors hierarchically

PreCNet [Straka+, 23]
• improves on PredNet
• infers the entire input information each time

[Figure: PredNet and PreCNet architectures, with input, representation (conv LSTM), prediction (conv), target, and error (subtract, +/- ReLU) units, connected across layers by pool and upsample]

Based Models

PredRNN [Wang+, 2017]
• prediction model with a hierarchical structure using ConvLSTM
• H (hidden layer) is input for both space and time
• adds a spatio-temporal memory mechanism to the original ConvLSTM

PredRNN v2 [Wang+, 2022]
• prediction model that improves on PredRNN
• the number of gates that take H as input has been increased

Loss
$$L = L_1 + L_2 = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{x}_i - x_i\right| + \frac{1}{n}\sum_{i=1}^{n} \left(\hat{x}_i - x_i\right)^2$$
$$L_{\mathrm{decouple}}^{\ell,t} = \cos\left(\Delta C_t^{\ell}, \Delta M_t^{\ell}\right)$$

[Figure: internal structure of PredRNN and PredRNN v2, showing the input gate, output gate, input modulation gate, and forget gate, with the standard temporal memory and the spatiotemporal memory]
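The two loss terms on this slide can be illustrated with a small sketch. It assumes flattened lists of numbers in place of image tensors, and it assumes the first term of the reconstruction loss is a mean absolute error; the function names `l1_plus_l2` and `cosine_similarity` are hypothetical and are not taken from the PredRNN code.

```python
import math
from typing import Sequence


def l1_plus_l2(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error plus mean squared error over n elements."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    n = len(pred)
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / n
    l2 = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    return l1 + l2


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two increment vectors (e.g. delta C, delta M)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero increment as uncorrelated
    return dot / (norm_u * norm_v)
```

A cosine near 0 means the two memory increments point in unrelated directions, which is the state the decoupling term rewards.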

Variational Temporal Abstraction [Kim+, 19]

[Figure: for each of two routes (when walking on the blue path, when walking on the red path), the sequence of all events and the change points within it]

Variational Temporal Abstraction [Kim+, 19]
• Layers: observation (input) → observation abstraction → temporal abstraction
• Problem: it is difficult to decide when to transition 𝑍
  • Human: easy, Model: difficult

Variational Temporal Abstraction [Kim+, 19]
• Determines the flag 𝑚 (0 or 1) by the magnitude of the change in latent state compared to the previous observation

PredNet-based proposed Model
• Input at time t: image data and physical data (from the physical training data)
• Each input has its own PredNet stack (input, representation, prediction, error), which yields a difference: $\mathit{diff}_{t,\mathrm{img}}$ and $\mathit{diff}_{t,\mathrm{phy}}$
• Output: img output and flag output $m_t$

$$\mathit{diff}_t = \mathit{diff}_{t,\mathrm{img}} + \mathit{diff}_{t,\mathrm{phy}}$$
$$m_t = \begin{cases} 0 & : \mathit{diff}_t < \alpha \\ 1 & : \mathit{diff}_t > \alpha \end{cases} \qquad (\alpha: \text{threshold})$$

PredRNN, PredRNN v2-based proposed Model
• Input at time t: image data and physical data (from the physical training data)
• Each input has its own stack of four ST-LSTM layers (ℓ = 1 to 4), which yields a difference: $\mathit{diff}_{t,\mathrm{img}}$ and $\mathit{diff}_{t,\mathrm{phy}}$
• Output: img output and flag $m_t$

$$\mathit{diff}_t = \mathit{diff}_{t,\mathrm{img}} + \mathit{diff}_{t,\mathrm{phy}}$$
$$m_t = \begin{cases} 0 & : \mathit{diff}_t < \alpha \\ 1 & : \mathit{diff}_t > \alpha \end{cases}$$

PreCNet-based proposed Model
• Input at time t: image data $x_{t,\mathrm{img}}$ and physical data $x_{t,\mathrm{phy}}$
• Each input has its own PreCNet stack (input, representation, prediction, error, with upsampling between layers), which yields a difference: $\mathit{diff}_{t,\mathrm{img}}$ and $\mathit{diff}_{t,\mathrm{phy}}$
• Output: img output and flag $m_t$

$$\mathit{diff}_t = \mathit{diff}_{t,\mathrm{img}} + \mathit{diff}_{t,\mathrm{phy}}$$
$$m_t = \begin{cases} 0 & : \mathit{diff}_t < \alpha \\ 1 & : \mathit{diff}_t > \alpha \end{cases}$$
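The flag rule shared by the three proposed models can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how each difference is measured, so the mean absolute difference in `mean_abs_diff` is an assumption, as are the function names and the choice to return 0 when the difference equals the threshold exactly.

```python
from typing import Sequence


def mean_abs_diff(current: Sequence[float], previous: Sequence[float]) -> float:
    """Assumed difference measure: mean absolute change between two states."""
    if len(current) != len(previous) or not current:
        raise ValueError("states must be non-empty and equally long")
    return sum(abs(c - p) for c, p in zip(current, previous)) / len(current)


def change_point_flag(diff_img: float, diff_phy: float, alpha: float) -> int:
    """Return m_t: 1 if diff_img + diff_phy exceeds the threshold alpha, else 0.

    The slides leave diff == alpha undefined; this sketch returns 0 there.
    """
    diff = diff_img + diff_phy
    return 1 if diff > alpha else 0


# Usage: a large change in either stream can push the sum over the threshold.
flag_quiet = change_point_flag(diff_img=0.2, diff_phy=0.1, alpha=0.5)
flag_event = change_point_flag(diff_img=0.4, diff_phy=0.3, alpha=0.5)
```

Summing the two differences before thresholding means a frame can be flagged by a visual change, a physical change, or a moderate change in both.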

Dataset: CLEVRER [Yi+, 2020]
• CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Number of videos    20,000 (train:val:test = 2:1:1)
Video length        5 sec
Number of frames    128 frames
Shape               cube, sphere, cylinder
Material            metal, rubber
Color               gray, red, blue, green, brown, cyan, purple, yellow
Event               appear, disappear, collide
Annotation          object id, position, speed, acceleration

Combination dataset: physical training dataset
• Dataset created from physical characteristics of the environment
• Object recognition → object position, velocity, acceleration, and position direction flags between objects → graph structure → embedding vector
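The physical features listed on this slide can be illustrated with a small sketch. It assumes 2D positions per frame and derives velocity and acceleration by finite differences, although CLEVRER already annotates them; the encoding of the position direction flags as a sign per axis is also an assumption, and all names are hypothetical.

```python
from typing import List, Tuple

Vec = Tuple[float, float]

# CLEVRER videos are 5 seconds long with 128 frames.
FRAME_DT = 5.0 / 128


def finite_difference(series: List[Vec], dt: float) -> List[Vec]:
    """Per-step rate of change of a 2D series (positions -> velocities, etc.)."""
    return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt)
            for a, b in zip(series, series[1:])]


def direction_flags(pos_a: Vec, pos_b: Vec) -> Tuple[int, int]:
    """Assumed flag encoding: sign (-1, 0, 1) of B's offset from A on each axis."""
    def sign(value: float) -> int:
        return (value > 0) - (value < 0)
    return (sign(pos_b[0] - pos_a[0]), sign(pos_b[1] - pos_a[1]))


# Usage with a toy trajectory and dt = 1 for readability.
positions = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
velocities = finite_difference(positions, dt=1.0)
accelerations = finite_difference(velocities, dt=1.0)
```

With real data, `FRAME_DT` would replace the toy `dt=1.0`, and the resulting features would be attached to the graph before it is embedded.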


Ex1: Creation of Templates
• nine templates: 3 (before / collision / after) × 3 (sentence type)
• Object type: color, shape
  • ex) blue sphere, gray cylinder, etc.

Template (※ A, B: objects)
• before
  • A and B approach
  • A approaches B
  • B approaches A
• collision
  • A and B collide
  • A is repulsed by B
  • B is repulsed by A
• after
  • A and B leave
  • A away from B
  • B away from A

Example of text templates: colliding objects “blue sphere”, “gray sphere”
• before collision (5 frames)
  • 「青色の球と灰色の球が近づく」 “Blue sphere and gray sphere approach.”
  • 「青色の球が灰色の球に近づく」 “Blue sphere approaches gray sphere.”
  • 「灰色の球が青色の球に近づく」 “Gray sphere approaches blue sphere.”
• collision
  • 「青色の球と灰色の球がぶつかる」 “Blue sphere and gray sphere collide.”
  • 「青色の球が灰色の球にはじかれる」 “Blue sphere is repulsed by gray sphere.”
  • 「灰色の球が青色の球にはじかれる」 “Gray sphere is repulsed by blue sphere.”
• after collision (5 frames)
  • 「青色の球と灰色の球が離れる」 “Blue sphere and gray sphere leave.”
  • 「青色の球から灰色の球が離れる」 “Gray sphere away from blue sphere.”
  • 「灰色の球から青色の球が離れる」 “Blue sphere away from gray sphere.”
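Filling the nine English templates can be sketched as below. The template strings follow the slide; the dictionary layout, the function name `fill_templates`, and the capitalization step are assumptions made for illustration.

```python
from typing import Dict, List

# 3 phases x 3 sentence types = 9 templates, as listed on the slide.
TEMPLATES: Dict[str, List[str]] = {
    "before": ["{A} and {B} approach.", "{A} approaches {B}.", "{B} approaches {A}."],
    "collision": ["{A} and {B} collide.", "{A} is repulsed by {B}.", "{B} is repulsed by {A}."],
    "after": ["{A} and {B} leave.", "{A} away from {B}.", "{B} away from {A}."],
}


def fill_templates(obj_a: str, obj_b: str) -> Dict[str, List[str]]:
    """Instantiate every template with two objects given as 'color shape'."""
    filled: Dict[str, List[str]] = {}
    for phase, patterns in TEMPLATES.items():
        sentences = []
        for pattern in patterns:
            text = pattern.format(A=obj_a, B=obj_b)
            sentences.append(text[0].upper() + text[1:])  # capitalize first letter
        filled[phase] = sentences
    return filled


sentences = fill_templates("blue sphere", "gray sphere")
```

Pairing each filled sentence with the graph embedding of the matching frames would give text pair data of the kind used to train the decoder.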

Ex1: text generating model
• Train: the decoder model is trained on text pair data (graph embedding and text), 219,303 pieces
  • graph embedding → Linear → Decoder → Softmax
  • decoder input: <bos> w1 w2 … wt
  • decoder output: w1 w2 … wt <eos>
• Test: the predicted graph embedding is input to the trained decoder model, which generates text indicating the predicted content, 10,965 pieces

Ex1: result – range i

Correct
• Green sphere and red cylinder collide.
• Green sphere is repulsed by red cylinder.
• Red cylinder is repulsed by green sphere.

Model               Generated text                                  color  shape
PredNet-based       Green cylinder is repulsed by red cylinder.     ✔      ✘
PredRNN-based       Green cylinder and red cylinder collide.        ✔      ✘
PredRNN v2-based    Red cylinder is repulsed by green sphere.       ✔      ✔
PreCNet-based       Red cylinder is repulsed by green sphere.       ✔      ✔

Ex1: result – range vi

Correct
• Cyan cube and cyan cylinder collide.
• Cyan cube is repulsed by cyan cylinder.
• Cyan cylinder is repulsed by cyan cube.

Model               Generated text                                  color  shape
PredNet-based       Cyan cube is repulsed by blue sphere.           ✘      ✘
PredRNN-based       Cyan cube is repulsed by blue sphere.           ✘      ✘
PredRNN v2-based    Cyan cube is repulsed by cyan sphere.           ✔      ✘
PreCNet-based       Cyan cube is repulsed by cyan cylinder.         ✔      ✔
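The color and shape marks in these result tables can be reproduced with a simple check. The slides do not state the judging rule, so this sketch assumes a mark is ✔ when the set of colors (or shapes) mentioned in the generated text matches the set in a correct sentence; the function names are hypothetical, and the color and shape vocabularies come from the CLEVRER slide.

```python
import re
from collections import Counter
from typing import List, Tuple

COLORS = {"gray", "red", "blue", "green", "brown", "cyan", "purple", "yellow"}
SHAPES = {"cube", "sphere", "cylinder"}


def objects_in(text: str) -> List[Tuple[str, str]]:
    """Extract (color, shape) pairs from adjacent words in a sentence."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(c, s) for c, s in zip(words, words[1:])
            if c in COLORS and s in SHAPES]


def check_color_shape(generated: str, correct: str) -> Tuple[bool, bool]:
    """Return (color_ok, shape_ok) by comparing multisets of colors and shapes."""
    gen, ref = objects_in(generated), objects_in(correct)
    color_ok = Counter(c for c, _ in gen) == Counter(c for c, _ in ref)
    shape_ok = Counter(s for _, s in gen) == Counter(s for _, s in ref)
    return color_ok, shape_ok


# Usage with the PredRNN v2-based output for range vi.
marks = check_color_shape("Cyan cube is repulsed by cyan sphere.",
                          "Cyan cube and cyan cylinder collide.")
```

Under this assumed rule, the check agrees with the marks shown for the eight generated sentences in the two tables.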

Ex1: result – Comparison of accuracy using evaluation metrics

Based model         language  BLEU@2↑  BLEU@3↑  BLEU@4↑  METEOR↑  CIDEr↑
PredNet-based       en        80.3     63.0     56.3     68.8     72.9
                    ja        79.7     74.5     68.8     70.2     72.4
PredRNN-based       en        84.3     66.8     59.1     72.6     74.6
                    ja        82.5     76.1     73.4     73.5     75.1
PredRNN v2-based    en        86.2     72.4     62.7     75.9     78.3
                    ja        85.9     78.9     75.7     77.6     78.2
PreCNet-based       en        90.6     77.1     67.9     78.1     80.3
                    ja        88.3     80.6     79.2     80.4     81.2


Ex2: Text generation model for collision situation based on physical commonsense knowledge

List of environmental conditions (state of the environment / objects)
1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
2. Floor is rough. The mass of object A is large. The mass of object B is small. …
…
45. …

Select from conditions No.1 – No.45
• condition: No.19
  • Floor is slippery.
  • The mass of object A is large. The mass of object B is small.
  • Object A is slow. Object B is fast.

Model and data
• T5
• crowdsourcing

Output
• Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance.

Ex2: Add common sense externally
Assignment of conditions: 45 types

Environment
1. Floor is slippery.
2. Floor is rough.

State of an object (mass)
1. Object A and object B have equal mass.
2. The mass of object A is large. The mass of object B is small.
3. The mass of object A is small. The mass of object B is large.

State of an object (speed)
1. Object A and object B have equal velocities.
2. Object A is fast. Object B is slow.
3. Object A is slow. Object B is fast.

Combinations
• Environment pattern: 1, 2, none
• mass pattern: 1, 2, 3, none
• speed pattern: 1, 2, 3, none
• in all 3 × (4 × 4 − 1) = 45*
* Excluding none for both mass and speed
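The count of 45 conditions can be checked by enumerating the combinations. This is a sketch under the slide's stated rule (drop the case where both mass and speed are "none"); the ordering produced here is an assumption and does not reproduce the authors' condition numbers such as No.19 or Condition 29.

```python
from itertools import product
from typing import List, Optional

FLOOR: List[Optional[str]] = ["Floor is slippery.", "Floor is rough.", None]
MASS: List[Optional[str]] = [
    "Object A and object B have equal mass.",
    "The mass of object A is large. The mass of object B is small.",
    "The mass of object A is small. The mass of object B is large.",
    None,
]
SPEED: List[Optional[str]] = [
    "Object A and object B have equal velocities.",
    "Object A is fast. Object B is slow.",
    "Object A is slow. Object B is fast.",
    None,
]


def enumerate_conditions() -> List[List[str]]:
    """All floor/mass/speed combinations, excluding 'none' for both mass and speed."""
    conditions = []
    for floor, mass, speed in product(FLOOR, MASS, SPEED):
        if mass is None and speed is None:
            continue  # excluded by the slide's footnote
        conditions.append([s for s in (floor, mass, speed) if s is not None])
    return conditions


conditions = enumerate_conditions()
```

Each entry is the list of condition sentences that would accompany a generated collision sentence as input to the regeneration model.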

Ex2: result - range i
Generated statement by model: A red cylinder is repelled by a green sphere.
Object A: red cylinder, Object B: green sphere

• Floor: slippery, mass: A = B, speed: A = B
  • Generated text based on physical common sense: Red cylinder and green sphere collide and both are bounced off in opposite directions.
  • Example of correct answer by human operators: Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance.
• Floor: slippery, mass: −, speed: A < B
  • Generated text based on physical common sense: Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance.
  • Example of correct answer by human operators: Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little.
• Floor: rough, mass: A < B, speed: A > B
  • Generated text based on physical common sense: Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders.
  • Example of correct answer by human operators: The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little.
• Floor: −, mass: A > B, speed: A < B
  • Generated text based on physical common sense: The green sphere collides with the red cylinder with such force that the green sphere is bounced off.
  • Example of correct answer by human operators: Green sphere strikes Red cylinder and the green sphere is bounced.

Ex2: result - range vi
Generated statement by model: A cyan cube is repelled by a cyan cylinder.
Object A: cyan cube, Object B: cyan cylinder

• Floor: −, mass: A = B, speed: −
  • Generated text based on physical common sense: The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions.
  • Example of correct answer by human operators: The light blue cylinder collides with the light blue cube, and the light blue cube is sent flying.
• Floor: slippery, mass: A > B, speed: A < B
  • Generated text based on physical common sense: The cyan cube collides with the cyan cylinder with great force, and the cyan cube is sent flying away.
  • Example of correct answer by human operators: The cyan cube is violently hit by the cyan cylinder, and the cyan cube is slightly knocked away.
• Floor: rough, mass: −, speed: A = B
  • Generated text based on physical common sense: The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions.
  • Example of correct answer by human operators: The cyan cylinder and the cyan cube collide, and the cyan cylinder and the cyan cube bounce back to the same extent.
• Floor: −, mass: A = B, speed: A > B
  • Generated text based on physical common sense: The cyan cube collides with the cyan cylinder with great force, sending the cyan cylinder flying away.
  • Example of correct answer by human operators: The cyan cube is slowly sent in one direction, while the cyan cylinder is sent in the opposite direction at high speed.

Ex2: result – Comparison of accuracy using evaluation metrics

Based models        BLEU@4  BERTScore  BLEURT  ROUGE  Implication Full  Implication Division  G-EVAL-4o
PredNet-based       37.3    68.2       30.3    34.4   43.2              64.5                  80.3
PredRNN-based       43.6    74.7       36.1    41.8   48.7              69.7                  83.1
PredRNN v2-based    46.5    79.5       45.6    49.6   56.1              75.3                  88.5
PreCNet-based       55.8    82.2       49.7    56.4   67.9              80.2                  92.4

Conclusion & Future works

Conclusion
• Predictive inference model to extract change points
  • Models that can visually and physically predict change points in the observed environment
  • Inference content is expressed in language to link the real world to language
• Linguistic generation of inferences based on experimental results
  • Add conditions about environment and objects
  • Regenerate sentences including human experience and commonsense knowledge

Future works
• Dataset is simple
• Replacement of inference and language generation with an LLM
