ICMLC2025_erikuroda

Eri KURODA
February 16, 2025

ICMLC2025, Feb. 17, 2025
LC046-Eri Kuroda

Transcript

  1. Verbal Representation of Object Collision Prediction Based on Physical CommonSense Knowledge
     Eri Kuroda & Ichiro Kobayashi, Ochanomizu University
  2. Background, Purpose
     Real-world understanding and prediction by humans
     • Predict what an object's next move will be and determine the action to take
     • Learn background and other information from interactions and observations → identifying the important aspects of events is essential
     • Connections between the real world and language
     Machine learning for real-world recognition and prediction
     • Input (observation) is an image → equivalent to human vision
     • Predictions of image features are considered real-world predictions
     BUT… ML does not make predictions based on the physical properties of objects or physical laws, as humans do
     Purpose
     1. To construct four different change point prediction models.
     2. To connect language with the real world.
     3. To express predictions in more detailed sentences based on the characteristics of the environment.
  3. Overview
     1. Change Point Prediction Model
        • Input: CLEVRER [Yi+, 19] and physical training data
          (graph embedding vector, velocity of each object, acceleration of each object, positional relationship between objects)
        • Based models: PredNet [Lotter+, 16], PredRNN [Wang+, 17], PredRNN v2 [Wang+, 21], PreCNet [Straka+, 23]
        • Output: predicted image, prediction of graph structure
     2. Language Model
        • Generated text: "Red cylinder is repulsed by green sphere" (object's color ✔, shape ✔)
     3. Text generation model including physical common sense
        • Added: condition of physical common sense, (Ex) Condition 29: A's mass is large. B's mass is small. A is slow. B is fast. Floor is rough.
        • Object A = red cylinder, Object B = green sphere
        • Regenerated text: "Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance."
  4. Based Models: PredNet, PreCNet
     PredNet [Lotter+, 16]
     • mimics predictive coding processing in the cerebral cortex
     • infers errors hierarchically
     PreCNet [Straka+, 23]
     • improved PredNet
     • infers the entire input information each time
     [Figure: PredNet and PreCNet block diagrams. Labels: Input, Target, Representation (conv LSTM), Prediction (conv), Error (subtract, +/- ReLU), pool, upsample]
  5. Based Models: PredRNN, PredRNN v2
     PredRNN [Wang+, 2017]
     • prediction model with a hierarchical structure using ConvLSTM
     • H (hidden layer) is input for both space and time
     PredRNN v2 [Wang+, 2022]
     • prediction model that improves on PredRNN
     • the number of gates that take H as input has been increased
     Internal structure of PredRNN, PredRNN v2: adding a spatio-temporal memory mechanism to the original ConvLSTM
     [Figure: PredRNN and PredRNN v2 cell diagrams. Labels: Input Gate, Output Gate, Input Modulation Gate, Forget Gate, Standard Temporal Memory, Spatiotemporal Memory]
     Loss terms shown on the slide:
     $L = L_1 + L_2 = \frac{1}{n}\sum_{i=1}^{n} |\hat{x}_i - x_i| + \frac{1}{n}\sum_{i=1}^{n} (\hat{x}_i - x_i)^2$
     $L_{\mathrm{decouple}}^{\ell,t} = \cos(\Delta C_t^{\ell}, \Delta M_t^{\ell})$
  6. Variational Temporal Abstraction [Kim+, 19]
     [Figure: two example routes (walking on the blue path, walking on the red path), each shown as "all events" versus "change points"]
  7. Variational Temporal Abstraction [Kim+, 19]
     • Problem: it is difficult to decide when to transition $Z$ (Human: easy, Model: difficult)
     [Figure: layers labeled Observation (Input), Observation abstraction, Temporal abstraction]
  8. Variational Temporal Abstraction [Kim+, 19]
     • Determines the flag $m$ (0 or 1) by the magnitude of the change in the latent state compared to the previous observation
  9. PredNet-based proposed Model
     [Figure: two PredNet streams at time t, one taking image data and one taking physical data (physical training data) as input. Labels: Input, Error, Representation, Prediction, Difference, img output, flag output]
     $\mathit{diff} = \mathit{diff}_{img} + \mathit{diff}_{phy}$
     $m_t = 0$ if $\mathit{diff} < \alpha$; $m_t = 1$ if $\mathit{diff} > \alpha$ ($\alpha$: threshold)
  10. PredRNN, PredRNN v2-based proposed Model
     [Figure: two stacks of ST-LSTM layers ($\ell = 1, \dots, 4$) at time t, one for image data and one for physical data (physical training data). Labels: img output]
     $\mathit{diff}_t = \mathit{diff}_{t\_img} + \mathit{diff}_{t\_phy}$
     $m_t = 0$ if $\mathit{diff}_t < \alpha$; $m_t = 1$ if $\mathit{diff}_t > \alpha$
  11. PreCNet-based proposed Model
     [Figure: two PreCNet streams at time t, one with input $x_{t\_img}$ (image data) and one with input $x_{t\_phy}$ (physical data). Each stream has Error ($E$), Representation ($R$) and Prediction ($\hat{A}$) units at layers $\ell$ and $\ell+1$, connected by upsampling. Labels: img Output]
     $\mathit{diff}_t = \mathit{diff}_{t\_img} + \mathit{diff}_{t\_phy}$
     $m_t = 0$ if $\mathit{diff}_t < \alpha$; $m_t = 1$ if $\mathit{diff}_t > \alpha$
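The flag rule shared by the three proposed models (slides 9 to 11) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `change_point_flag` and the example values are hypothetical, and the slides do not say how the per-stream differences are computed or what happens when the sum equals the threshold (treated as "no change point" here).

```python
from typing import List


def change_point_flag(diff_img: float, diff_phy: float, alpha: float) -> int:
    """Return m_t: 1 if the summed difference exceeds the threshold, else 0.

    diff_img / diff_phy are the differences from the image stream and the
    physical-data stream; how they are measured is not given on the slides.
    """
    diff = diff_img + diff_phy
    # Assumption: diff == alpha is treated as "no change point" (flag 0).
    return 1 if diff > alpha else 0


def flag_sequence(diffs_img: List[float], diffs_phy: List[float], alpha: float) -> List[int]:
    """Apply the rule at every time step of two aligned difference sequences."""
    return [change_point_flag(i, p, alpha) for i, p in zip(diffs_img, diffs_phy)]


if __name__ == "__main__":
    # Hypothetical values: only the third step crosses the threshold.
    print(flag_sequence([0.1, 0.2, 0.6], [0.1, 0.1, 0.5], alpha=0.8))
```

Summing the two streams before thresholding means a change point can be flagged by a large jump in either the visual or the physical signal, or by moderate jumps in both.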
  12. Dataset: CLEVRER [Yi+, 2020]
     • CLEVRER: CoLlision Events for Video REpresentation and Reasoning
       Number of videos    20,000 (train:val:test = 2:1:1)
       Video length        5 sec
       Number of frames    128 frames
       Shape               cube, sphere, cylinder
       Material            metal, rubber
       Color               gray, red, blue, green, brown, cyan, purple, yellow
       Event               appear, disappear, collide
       Annotation          object id, position, speed, acceleration
  13. Combination dataset: physical training dataset
     • Dataset created from physical characteristics of the environment
     [Figure: object recognition; object position, velocity, acceleration; position direction flags between objects; graph structure; embedding vector]
  14. Overview (repeat of the slide 3 pipeline)
     1. Change Point Prediction Model → 2. Language Model → 3. Text generation model including physical common sense
  15. Ex1: Creation of Templates
     • Nine templates: 3 (before, collision, after) × 3 (sentence type)
     • Object type: color, shape, ex) blue sphere, gray cylinder, etc.
     Templates (A, B: objects)
     • before: "A and B approach", "A approaches B", "B approaches A"
     • collision: "A and B collide", "A is repulsed by B", "B is repulsed by A"
     • after: "A and B leave", "A away from B", "B away from A"
     Example of text templates: colliding objects "blue sphere", "gray sphere"
     • before collision
       「青色の球と灰色の球が近づく」 "Blue sphere and gray sphere approach."
       「青色の球が灰色の球に近づく」 "Blue sphere approaches gray sphere."
       「灰色の球が青色の球に近づく」 "Gray sphere approaches blue sphere."
     • collision
       「青色の球と灰色の球がぶつかる」 "Blue sphere and gray sphere collide."
       「青色の球が灰色の球にはじかれる」 "Blue sphere is repulsed by gray sphere."
       「灰色の球が青色の球にはじかれる」 "Gray sphere is repulsed by blue sphere."
     • after collision
       「青色の球と灰色の球が離れる」 "Blue sphere and gray sphere leave."
       「青色の球から灰色の球が離れる」 "Gray sphere away from blue sphere."
       「灰色の球から青色の球が離れる」 "Blue sphere away from gray sphere."
     [Figure: timeline with before collision, collision, after collision; two windows labeled "5 frames"]
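The 3 × 3 template grid above can be filled mechanically for any object pair. The following is a small sketch of that idea using the English templates from the slide; the names `TEMPLATES` and `fill_templates` are illustrative and not taken from the authors' code.

```python
from typing import Dict, List

# English templates copied from the slide; A and B are object descriptions
# of the form "<color> <shape>".
TEMPLATES: Dict[str, List[str]] = {
    "before": ["{A} and {B} approach.", "{A} approaches {B}.", "{B} approaches {A}."],
    "collision": ["{A} and {B} collide.", "{A} is repulsed by {B}.", "{B} is repulsed by {A}."],
    "after": ["{A} and {B} leave.", "{A} away from {B}.", "{B} away from {A}."],
}


def fill_templates(a: str, b: str) -> Dict[str, List[str]]:
    """Instantiate all nine templates for objects a and b."""
    filled: Dict[str, List[str]] = {}
    for phase, patterns in TEMPLATES.items():
        sentences = [p.format(A=a, B=b) for p in patterns]
        # Capitalize only the first character so the rest of the sentence is unchanged.
        filled[phase] = [s[0].upper() + s[1:] for s in sentences]
    return filled


if __name__ == "__main__":
    for phase, sentences in fill_templates("blue sphere", "gray sphere").items():
        print(phase, sentences)
```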
  16. Ex1: text generating model
     [Figure: Decoder model. Train: text pair data, graph embedding, Linear, Decoder, Softmax; decoder input <bos> w1 w2 … wt, decoder output w1 w2 … wt <eos>. Test: pred graph embedding is input to the Trained Decoder Model, which outputs generated text indicating predicted content. Data sizes shown: 219,303 pieces and 10,965 pieces]
  17. Ex1: result – range i
     Correct: "Green sphere and red cylinder collide." / "Green sphere is repulsed by red cylinder." / "Red cylinder is repulsed by green sphere."
       Based model         Generated text                                color   shape
       PredNet-based       Green cylinder is repulsed by red cylinder.   ✔       ✘
       PredRNN-based       Green cylinder and red cylinder collide.      ✔       ✘
       PredRNN v2-based    Red cylinder is repulsed by green sphere.     ✔       ✔
       PreCNet-based       Red cylinder is repulsed by green sphere.     ✔       ✔
  18. Ex1: result – range vi
     Correct: "Cyan cube and cyan cylinder collide." / "Cyan cube is repulsed by cyan cylinder." / "Cyan cylinder is repulsed by cyan cube."
       Based model         Generated text                            color   shape
       PredNet-based       Cyan cube is repulsed by blue sphere.     ✘       ✘
       PredRNN-based       Cyan cube is repulsed by blue sphere.     ✘       ✘
       PredRNN v2-based    Cyan cube is repulsed by cyan sphere.     ✔       ✘
       PreCNet-based       Cyan cube is repulsed by cyan cylinder.   ✔       ✔
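One way to reproduce the color and shape marks in the two result slides is to extract the (color, shape) pairs from a generated sentence and compare them with the correct objects. The slides do not state how the marks were assigned, so the multiset comparison below is an assumption, and `objects_in` and `check_color_shape` are hypothetical helper names. The color and shape vocabularies come from the CLEVRER slide.

```python
import re
from collections import Counter
from typing import List, Tuple

COLORS = ["gray", "red", "blue", "green", "brown", "cyan", "purple", "yellow"]
SHAPES = ["cube", "sphere", "cylinder"]
OBJECT_RE = re.compile(r"\b(%s) (%s)\b" % ("|".join(COLORS), "|".join(SHAPES)))


def objects_in(sentence: str) -> List[Tuple[str, str]]:
    """Return every (color, shape) pair mentioned in the sentence."""
    return OBJECT_RE.findall(sentence.lower())


def check_color_shape(generated: str, correct: List[Tuple[str, str]]) -> Tuple[bool, bool]:
    """Compare colors and shapes as multisets (assumed criterion, order-independent)."""
    found = objects_in(generated)
    color_ok = Counter(c for c, _ in found) == Counter(c for c, _ in correct)
    shape_ok = Counter(s for _, s in found) == Counter(s for _, s in correct)
    return color_ok, shape_ok


if __name__ == "__main__":
    correct_i = [("green", "sphere"), ("red", "cylinder")]
    print(check_color_shape("Green cylinder is repulsed by red cylinder.", correct_i))
    print(check_color_shape("Red cylinder is repulsed by green sphere.", correct_i))
```

Under this criterion the sketch agrees with every row of the range i and range vi tables, which suggests the marks were assigned per attribute rather than per object.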
  19. Ex1: result – Comparison of accuracy using evaluation metrics (↑: higher is better)
       Based model         language   BLEU@2↑   BLEU@3↑   BLEU@4↑   METEOR↑   CIDEr↑
       PredNet-based       en         80.3      63.0      56.3      68.8      72.9
                           ja         79.7      74.5      68.8      70.2      72.4
       PredRNN-based       en         84.3      66.8      59.1      72.6      74.6
                           ja         82.5      76.1      73.4      73.5      75.1
       PredRNN v2-based    en         86.2      72.4      62.7      75.9      78.3
                           ja         85.9      78.9      75.7      77.6      78.2
       PreCNet-based       en         90.6      77.1      67.9      78.1      80.3
                           ja         88.3      80.6      79.2      80.4      81.2
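For readers unfamiliar with the BLEU@n columns, the sketch below computes a sentence-level BLEU score with clipped n-gram precision and a brevity penalty against multiple references. It is only an illustration of the metric: the tokenization (lowercasing, whitespace split, trailing period removed), the lack of smoothing, and the function name `bleu` are assumptions, and the slide does not say which implementation or corpus-level aggregation produced the reported numbers (which appear to be on a 0 to 100 scale).

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, references: List[str], max_n: int = 2) -> float:
    """Sentence-level BLEU@max_n in [0, 1], uniform weights, no smoothing."""
    cand = candidate.lower().rstrip(".").split()
    refs = [r.lower().rstrip(".").split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    references = [
        "Green sphere and red cylinder collide.",
        "Green sphere is repulsed by red cylinder.",
        "Red cylinder is repulsed by green sphere.",
    ]
    print(bleu("Green cylinder is repulsed by red cylinder.", references, max_n=2))
```

A generated sentence that swaps one shape still shares most unigrams and bigrams with the references, which is why a template-style output with a wrong object can score high on BLEU@2 and drop at BLEU@4.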
  20. Overview (repeat of the slide 3 pipeline)
     1. Change Point Prediction Model → 2. Language Model → 3. Text generation model including physical common sense
  21. Ex2: Text generation model for collision situation based on physical commonsense knowledge
     List of environmental conditions (state of the environment / objects)
     1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
     2. Floor is rough. The mass of object A is large. The mass of object B is small. …
     …
     45. …
     Select from conditions No.1 – No.45, e.g. condition No.19:
     • Floor is slippery.
     • The mass of object A is large. The mass of object B is small.
     • Object A is slow. Object B is fast.
     Model: T5; reference sentences collected by crowdsourcing
     Output example: "Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance."
  22. Ex2: Add common sense externally
     Assignment of conditions: 45 types
     State of an object (mass)
     1. Object A and object B have equal mass.
     2. The mass of object A is large. The mass of object B is small.
     3. The mass of object A is small. The mass of object B is large.
     State of an object (speed)
     1. Object A and object B have equal velocities.
     2. Object A is fast. Object B is slow.
     3. Object A is slow. Object B is fast.
     Environment
     1. Floor is slippery.
     2. Floor is rough.
     Patterns: environment 1, 2, none; mass 1, 2, 3, none; speed 1, 2, 3, none
     In all: 3 × (4 × 4 − 1) = 45 (excluding the case where both mass and speed are none)
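The count 3 × (4 × 4 − 1) = 45 can be checked by enumerating the combinations directly. This is a small sketch; the enumeration order is arbitrary and does not correspond to the condition numbers (No.1 to No.45) used on the slides, and `enumerate_conditions` is a hypothetical name.

```python
from itertools import product
from typing import List, Optional

# None stands for the "none" pattern (the attribute is not specified).
FLOOR: List[Optional[str]] = [None, "Floor is slippery.", "Floor is rough."]
MASS: List[Optional[str]] = [
    None,
    "Object A and object B have equal mass.",
    "The mass of object A is large. The mass of object B is small.",
    "The mass of object A is small. The mass of object B is large.",
]
SPEED: List[Optional[str]] = [
    None,
    "Object A and object B have equal velocities.",
    "Object A is fast. Object B is slow.",
    "Object A is slow. Object B is fast.",
]


def enumerate_conditions() -> List[List[str]]:
    """List every allowed (floor, mass, speed) combination as sentences."""
    conditions = []
    for floor, mass, speed in product(FLOOR, MASS, SPEED):
        # Excluded by the slide: neither mass nor speed is specified.
        if mass is None and speed is None:
            continue
        conditions.append([s for s in (floor, mass, speed) if s is not None])
    return conditions


if __name__ == "__main__":
    conditions = enumerate_conditions()
    print(len(conditions))
    print(conditions[0])
```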
  23. Ex2: result - range i
     Generated statement by model: "A red cylinder is repelled by a green sphere."
     Object A: red cylinder, Object B: green sphere (−: none)
     • Floor: slippery, mass: A = B, speed: A = B
       Generated text based on physical common sense: "Red cylinder and green sphere collide and both are bounced off in opposite directions."
       Example of correct answer by human operators: "Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance."
     • Floor: slippery, mass: −, speed: A < B
       Generated text based on physical common sense: "Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance."
       Example of correct answer by human operators: "Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little."
     • Floor: rough, mass: A < B, speed: A > B
       Generated text based on physical common sense: "Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders."
       Example of correct answer by human operators: "The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little."
     • Floor: −, mass: A > B, speed: A < B
       Generated text based on physical common sense: "The green sphere collides with the red cylinder with such force that the green sphere is bounced off."
       Example of correct answer by human operators: "Green sphere strikes Red cylinder and the green sphere is bounced."
  24. Ex2: result - range vi
     Generated statement by model: "A cyan cube is repelled by a cyan cylinder."
     Object A: cyan cube, Object B: cyan cylinder (−: none)
     • Floor: −, mass: A = B, speed: −
       Generated text based on physical common sense: "The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions."
       Example of correct answer by human operators: "The light blue cylinder collides with the light blue cube, and the light blue cube is sent flying"
     • Floor: slippery, mass: A > B, speed: A < B
       Generated text based on physical common sense: "The cyan cube collides with the cyan cylinder with great force, and the cyan cube is sent flying away."
       Example of correct answer by human operators: "The cyan cube is violently hit by the cyan cylinder, and the cyan cube is slightly knocked away"
     • Floor: rough, mass: −, speed: A = B
       Generated text based on physical common sense: "The cyan cube and the cyan cylinder collide, and both are sent flying in opposite directions."
       Example of correct answer by human operators: "The cyan cylinder and the cyan cube collide, and the cyan cylinder and the cyan cube bounce back to the same extent."
     • Floor: −, mass: A = B, speed: A > B
       Generated text based on physical common sense: "The cyan cube collides with the cyan cylinder with great force, sending the cyan cylinder flying away."
       Example of correct answer by human operators: "The cyan cube is slowly sent in one direction, while the cyan cylinder is sent in the opposite direction at high speed."
  25. Ex2: result – Comparison of accuracy using evaluation metrics
       Based model         BLEU@4   BERTScore   BLEURT   ROUGE   Implication Full   Implication Division   G-EVAL-4o
       PredNet-based       37.3     68.2        30.3     34.4    43.2               64.5                   80.3
       PredRNN-based       43.6     74.7        36.1     41.8    48.7               69.7                   83.1
       PredRNN v2-based    46.5     79.5        45.6     49.6    56.1               75.3                   88.5
       PreCNet-based       55.8     82.2        49.7     56.4    67.9               80.2                   92.4
  26. Conclusion & Future works
     Conclusion
     • Predictive inference model to extract change points
     • Models that can visually and physically predict change points in the observed environment
     • Inference content is expressed in language to link the real world to language
     • Linguistic generation of inferences based on experimental results
     • Add conditions about the environment and objects
     • Regenerate sentences that include human experience and commonsense knowledge
     Future works
     • The dataset is simple
     • Replacing inference and language generation with an LLM