Slide 1

Slide 1 text

Verbal Description Focusing on Physical Properties of Real-World Environments
◯ Eri Kuroda, Yuki Taya & Ichiro Kobayashi
Ochanomizu University

Slide 2

Slide 2 text

Background, Purpose

Real-world Understanding and Prediction of Human
• Predict what the object's next move will be and determine the action to take
• Learn background and other information from interactions and observations → capturing the key aspects of events is important
• Connections between the real world and language

Machine learning for real-world recognition and prediction
• input (observation) is an image → equivalent to human vision
• predictions of image features are considered real-world predictions

BUT…
• ML doesn't make predictions based on physical properties of objects or physical laws, as humans do

Purpose
1. To construct a change point prediction model.
2. To connect language with the real world.
3. To generate more detailed sentences based on the characteristics of the environment.

Slide 3

Slide 3 text

Overview

[Figure: pipeline from CLEVRER [Yi+, 19] video to regenerated text; the stages are numbered 1, 2, 3 in the figure]

Change Point Prediction Model
• input: image, physical training data
  • Graph embedding vector
  • Velocity of each object
  • Acceleration of each object
  • Positional relationship between objects
• output: predicted image, prediction of graph structure representing physical properties

Language Model
• input: prediction of graph structure, object's color and shape
• generated text: Red cylinder is repulsed by green sphere

Text generation model including physical common sense
• add: Condition of physical common sense
  • (Ex) Condition 29
    • A's mass is large. B's mass is small.
    • A is slow. B is fast.
    • Floor is rough.
  • Object A = red cylinder, Object B = green sphere
• regenerated text: Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance.

Slide 4

Slide 4 text

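PreCNet [Straka+, 23]

[Figure: PredNet [Lotter+, 16] and PreCNet architectures. Labels: Input $x_t$; Error $E_t^{\ell}$, $E_t^{\ell+1}$; Representation $R_t^{\ell}$, $R_t^{\ell+1}$; Prediction $\hat{A}_t^{\ell}$, $\hat{A}_t^{\ell+1}$; ⊝ (subtraction); upsample]

• Inference of the entire input information every time
• Hierarchically infer errors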

Slide 5

Slide 5 text

Variational Temporal Abstraction [Kim+, 19]

[Figure: two routes, "when walking on the blue road" and "when walking on the red road", each shown as all events and as change points]

Slide 6

Slide 6 text

Variational Temporal Abstraction [Kim+, 19]

[Figure: Observation (Input), Observation abstraction, Temporal abstraction 𝑍]

Problem: difficult to decide when to transition 𝑍
• Human: easy
• Model: difficult

Slide 7

Slide 7 text

Variational Temporal Abstraction [Kim+, 19]

Introduced flags
• Determines the flag 𝑚 (0 or 1) by the magnitude of the change in latent state compared to the previous observation

Slide 8

Slide 8 text

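PreCNet-based proposed Model

[Figure: two PreCNet streams at time t, one for image data and one for physical data. Labels per stream: Input $x_{t,\mathrm{img}}$ / $x_{t,\mathrm{phy}}$; Error $E_{t,\mathrm{img}}^{\ell}$, $E_{t,\mathrm{img}}^{\ell+1}$ / $E_{t,\mathrm{phy}}^{\ell}$, $E_{t,\mathrm{phy}}^{\ell+1}$; Representation $R_{t,\mathrm{img}}^{\ell}$, $R_{t,\mathrm{img}}^{\ell+1}$ / $R_{t,\mathrm{phy}}^{\ell}$, $R_{t,\mathrm{phy}}^{\ell+1}$; Prediction $\hat{A}_{t,\mathrm{img}}^{\ell}$, $\hat{A}_{t,\mathrm{img}}^{\ell+1}$ / $\hat{A}_{t,\mathrm{phy}}^{\ell}$, $\hat{A}_{t,\mathrm{phy}}^{\ell+1}$; previous representations $R_{t-1,\mathrm{img}}^{\ell}$, $R_{t-1,\mathrm{phy}}^{\ell}$; ⊝ (subtraction); upsample. Output: img]

$$\mathit{diff}_t = \mathit{diff}_{t,\mathrm{img}} + \mathit{diff}_{t,\mathrm{phy}}$$

$$m_t = \begin{cases} 0 & : \mathit{diff}_t < \alpha \\ 1 & : \mathit{diff}_t > \alpha \end{cases}$$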
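The flag rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the slides do not say how the two diff terms are measured, so the Euclidean distance between the current and previous representation is an assumption, as are the function names. The slide also leaves the case diff_t = α undefined; the sketch treats it as "no change point".

```python
import math


def latent_change(prev, curr):
    """Euclidean distance between two latent vectors.

    Assumption: the slides only say "magnitude of the change in latent
    state"; the distance measure is a choice made for this sketch.
    """
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def change_point_flag(r_img_prev, r_img, r_phy_prev, r_phy, alpha):
    """Return m_t: 1 if the summed change exceeds alpha, otherwise 0."""
    diff_img = latent_change(r_img_prev, r_img)
    diff_phy = latent_change(r_phy_prev, r_phy)
    diff = diff_img + diff_phy  # diff_t = diff_t_img + diff_t_phy
    # diff == alpha is not defined on the slide; treated as 0 here (assumption).
    return 1 if diff > alpha else 0
```

Summing the two terms means a large change in either stream (image or physical) can trigger the flag on its own, which matches the additive definition on the slide.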

Slide 9

Slide 9 text

Dataset: CLEVRER [Yi+, 2020]
• CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Number of videos: 20,000 (train:val:test = 2:1:1)
Video length: 5 sec
Number of frames: 128 frames
Shape: cube, sphere, cylinder
Material: metal, rubber
Color: gray, red, blue, green, brown, cyan, purple, yellow
Event: appear, disappear, collide
Annotation: object id, position, speed, acceleration

Slide 10

Slide 10 text

Dataset: physical training dataset
• Dataset created from physical characteristics of the environment

[Figure labels: object recognition; object position; velocity; acceleration; position direction flags between objects; combination; graph structure; embedding vector]
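To make the listed quantities concrete, the sketch below derives velocity and acceleration from per-frame positions by finite differences and computes simple direction flags between two objects. Everything here is an assumption for illustration: CLEVRER annotations already include speed and acceleration, the slides do not define the direction flags, and the function names and the two-flag encoding are hypothetical.

```python
def finite_difference(series, dt):
    """First-order difference of a sequence of coordinate tuples."""
    return [
        tuple((b - a) / dt for a, b in zip(p0, p1))
        for p0, p1 in zip(series, series[1:])
    ]


def direction_flags(pos_a, pos_b):
    """Hypothetical encoding: 1 if B has the larger coordinate on each axis."""
    return tuple(1 if b > a else 0 for a, b in zip(pos_a, pos_b))


# Toy 2-D positions of one object over three frames (made-up values).
positions = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
velocity = finite_difference(positions, dt=1.0)
acceleration = finite_difference(velocity, dt=1.0)
```

For CLEVRER-like video, `dt` would be roughly the video length divided by the number of frames (5 sec over 128 frames).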

Slide 11

Slide 11 text

Overview

[Figure: pipeline diagram repeated from Slide 3]

Slide 12

Slide 12 text

Ex1: Creation of Templates
• nine templates
  • 3 (before・collision・after) × 3 (sentence type)
• Object type
  • color, shape
  • ex) blue sphere, gray cylinder, etc.

template (※ A, B: objects)
• before
  • A and B approach
  • A approaches B
  • B approaches A
• collision
  • A and B collide
  • A is repulsed by B
  • B is repulsed by A
• after
  • A and B leave
  • A away from B
  • B away from A
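The nine templates can be filled mechanically once objects A and B are named by color and shape. The following is a small sketch of that step; the template strings are copied from the slide, while the dictionary layout and the function name `fill_templates` are assumptions made for illustration.

```python
# Template strings as listed on the slide; {A} and {B} are object names.
TEMPLATES = {
    "before": ["{A} and {B} approach", "{A} approaches {B}", "{B} approaches {A}"],
    "collision": ["{A} and {B} collide", "{A} is repulsed by {B}", "{B} is repulsed by {A}"],
    "after": ["{A} and {B} leave", "{A} away from {B}", "{B} away from {A}"],
}


def fill_templates(obj_a, obj_b):
    """Return the 3 x 3 sentences for one object pair, keyed by phase."""
    return {
        phase: [t.format(A=obj_a, B=obj_b) for t in templates]
        for phase, templates in TEMPLATES.items()
    }


sentences = fill_templates("red cylinder", "green sphere")
```

Each object pair therefore yields three candidate sentences per phase, which is why a single scene can have several equally correct descriptions.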

Slide 13

Slide 13 text

Ex1: text generating model

[Figure: train: text pair data (graph embedding, w1 w2 … wt) → Linear → Decoder → Softmax → w1 w2 … wt. test: pred graph embedding (input) → Trained Decoder Model → generated text indicating predicted content]

Data sizes shown on the slide: 219,303 pieces and 10,965 pieces

Slide 14

Slide 14 text

Result - Generation example -

Range i [image]
• Correct:
  • Green sphere and red cylinder collide.
  • Green sphere is repulsed by red cylinder.
  • Red cylinder is repulsed by green sphere.
• Ours: Red cylinder is repulsed by green sphere.

Range ii [image]
• Correct:
  • Brown cube and green cylinder collide.
  • Brown cube is repulsed by green cylinder.
  • Green cylinder is repulsed by brown cube.
• Ours: Brown cube is repulsed by green cylinder.

Slide 15

Slide 15 text

Evaluation of text generation

            BLEU@2   BLEU@3   BLEU@4   METEOR   CIDEr
Score-en    90.6     77.1     67.9     78.1     80.3
Score-ja    88.3     80.6     79.2     80.4     81.2

Discussion
• Sentences describing the environment could be generated with high accuracy
• BLEU@4 evaluation showed lower accuracy
  • The subject of the sentences describing the collision depends on the language generation model.
  • BLEU scores were calculated for each of the three correct sentences and averaged, resulting in lower accuracy.
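The last discussion point can be illustrated with a toy calculation: when a generated sentence matches one of the three correct sentences exactly, averaging the score over all three references still pulls the result down. The sketch below uses clipped bigram precision only, a simplification of BLEU (no brevity penalty, no combination of n-gram orders), and is not the evaluation script used for the table. The sentences are taken from the Range i example; the helper names are hypothetical.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase, drop periods, split on whitespace."""
    return sentence.lower().replace(".", "").split()


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=2):
    """Clipped n-gram precision of candidate against one reference."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


references = [
    "Green sphere and red cylinder collide.",
    "Green sphere is repulsed by red cylinder.",
    "Red cylinder is repulsed by green sphere.",
]
candidate = "Red cylinder is repulsed by green sphere."

scores = [clipped_precision(candidate, ref) for ref in references]
averaged = sum(scores) / len(scores)  # per-reference scores, then averaged
best = max(scores)                    # best-matching reference only
```

Here the candidate is identical to the third reference, yet the averaged score is lower than the best-reference score because the other two references use a different subject or verb.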

Slide 16

Slide 16 text

Overview

[Figure: pipeline diagram repeated from Slide 3]

Slide 17

Slide 17 text

Text generation model for collision situation based on physical commonsense knowledge

List of environmental conditions (state of the environment / objects)
1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
2. Floor is rough. The mass of object A is large. The mass of object B is small. …
…
45. …

Select from conditions No.1 – No.45
condition: No.19
• Floor is slippery.
• The mass of object A is large. The mass of object B is small.
• Object A is slow. Object B is fast.

[Figure labels: T5; crowdsourcing]

Output: Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance.

Slide 18

Slide 18 text

Add common sense externally
Assignment of conditions: 45 types

State of an object (mass)
1. Object A and object B have equal mass.
2. The mass of object A is large. The mass of object B is small.
3. The mass of object A is small. The mass of object B is large.

State of an object (speed)
1. Object A and object B have equal velocities.
2. Object A is fast. Object B is slow.
3. Object A is slow. Object B is fast.

Environment
1. Floor is slippery.
2. Floor is rough.

Environment pattern: 1, 2, none
mass pattern: 1, 2, 3, none
speed pattern: 1, 2, 3, none
in all 3 × (4 × 4 − 1) = 45*
* Excluding none for both mass and speed
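The count of 45 can be checked by enumerating the combinations directly. The sketch below is illustrative only: the sentence strings are taken from the slide, but the enumeration order is arbitrary and does not reproduce the slide's condition numbering (for example No.19), which is not specified.

```python
from itertools import product

# None stands for the "none" pattern (no statement for that attribute).
ENVIRONMENT = ["Floor is slippery.", "Floor is rough.", None]
MASS = [
    "Object A and object B have equal mass.",
    "The mass of object A is large. The mass of object B is small.",
    "The mass of object A is small. The mass of object B is large.",
    None,
]
SPEED = [
    "Object A and object B have equal velocities.",
    "Object A is fast. Object B is slow.",
    "Object A is slow. Object B is fast.",
    None,
]


def enumerate_conditions():
    """All environment/mass/speed combinations, skipping mass = speed = none."""
    conditions = []
    for env, mass, speed in product(ENVIRONMENT, MASS, SPEED):
        if mass is None and speed is None:
            continue  # excluded on the slide: none for both mass and speed
        conditions.append([s for s in (env, mass, speed) if s is not None])
    return conditions


conditions = enumerate_conditions()
```

The mass and speed patterns give 4 × 4 − 1 = 15 valid pairs, and each pair is combined with 3 environment patterns, giving 45.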

Slide 19

Slide 19 text

Language generating model
• T5 (Text-To-Text Transfer Transformer) [Raffel+, 2020]
  • Transformer-based model structure
  • Used for various tasks such as translation, question answering, classification, summarization, etc.
  • All tasks output text for input text
• Three models used in the experiment (pre-trained in Japanese)
  • sonoisa/t5-base-japanese
  • megagonlabs/t5-base-japanese-web
  • nlp-waseda/comet-t5-base-japanese

Slide 20

Slide 20 text

T5 training settings
• Using data collected through crowdsourcing
  • train: 1,500
  • validation: 250
  • test: 250

Input data
Input statement:
- A cube and a cylinder collide.
Condition:
- The floor is smooth.
- The mass of the cube is small.
- The mass of the cylinder is large.
- The speed of the cube is slow.
- The speed of the cylinder is fast.

Output data
1. A cylinder collides with a cube with such force that the cube is thrown far away.
2. The cube is thrown far away when the cylinder collides with the cube.
3. Cylinders collide with cubes with great force, and cubes are thrown away with great force.
4. Cylinder collides with cube and cube falls down.
5. The cylinder collides with the cube with great force, and the cube is thrown far away.

One of the five correct answers is randomly selected during training.
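The random selection of one reference per training step can be sketched as follows. This is an assumption-laden illustration: the slide shows the content of the input but not how it is serialized into a single string for T5, so the layout produced by `build_input` and both function names are hypothetical.

```python
import random


def build_input(statement, conditions):
    """Serialize the statement and conditions into one source string.

    Assumption: the exact text format fed to T5 is not given on the slide.
    """
    lines = ["Input statement:", f"- {statement}", "Condition:"]
    lines += [f"- {c}" for c in conditions]
    return "\n".join(lines)


def sample_training_pair(statement, conditions, references, rng=random):
    """Return (source, target) with the target drawn at random from the references."""
    return build_input(statement, conditions), rng.choice(references)


references = [
    "A cylinder collides with a cube with such force that the cube is thrown far away.",
    "The cube is thrown far away when the cylinder collides with the cube.",
    "Cylinders collide with cubes with great force, and cubes are thrown away with great force.",
    "Cylinder collides with cube and cube falls down.",
    "The cylinder collides with the cube with great force, and the cube is thrown far away.",
]
conditions = [
    "The floor is smooth.",
    "The mass of the cube is small.",
    "The mass of the cylinder is large.",
    "The speed of the cube is slow.",
    "The speed of the cylinder is fast.",
]
source, target = sample_training_pair(
    "A cube and a cylinder collide.", conditions, references, rng=random.Random(0)
)
```

Drawing a fresh target each time the example is visited lets the model see all five phrasings over the course of training without enlarging the dataset.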

Slide 21

Slide 21 text

T5 training settings

Learning rate: $5 \times 10^{-5}$
Batch size: 32
Epoch: 100
Optimization: AdamW [Loshchilov+, 17]
Loss function: cross entropy

Slide 22

Slide 22 text

Results using T5
Results with test data

Model                                Epoch   BLEU↑   ROUGE-2↑   ROUGE-L↑
sonoisa/t5-base-japanese             81      95.2    64.2       74.6
megagonlabs/t5-base-japanese-web     93      81.6    56.6       67.7
nlp-waseda/comet-t5-base-japanese    98      80.9    56.2       67.4

Slide 23

Slide 23 text

Result - Range i -

Generated statement by model: A red cylinder is repelled by a green sphere.
Object A: red cylinder, Object B: green sphere

Floor: slippery, mass: A = B, speed: A = B
• Generation text based on physical common sense: Red cylinder and green sphere collide and both are bounced off in opposite directions.
• Example of correct answer by human operators: Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance.

Floor: slippery, mass: −, speed: A < B
• Generation text based on physical common sense: Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance.
• Example of correct answer by human operators: Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little.

Floor: rough, mass: A < B, speed: A > B
• Generation text based on physical common sense: Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders.
• Example of correct answer by human operators: The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little.

Floor: −, mass: A > B, speed: A < B
• Generation text based on physical common sense: The green sphere collides with the red cylinder with such force that the green sphere is bounced off.
• Example of correct answer by human operators: Green sphere strikes Red cylinder and the green sphere is bounced.

Slide 24

Slide 24 text

Result - Accuracy -

BLEU@4: 0.55
BERTScore: 0.82
BLEURT: 0.49
ROUGE: 0.56
Implication Full: 0.67
Implication Division: 0.80
G-EVAL-4o: 0.92

• G-EVAL-4o
  • the highest accuracy
  • generated sentence: "the scene of a collision between objects" and "the movement of objects after the collision."

Slide 25

Slide 25 text

Result - Accuracy -

[Score table and G-EVAL-4o notes repeated from Slide 24]

• Implication Division
  1. whether the fact of collision and the color and shape of the objects involved in the collision were correct.
  2. whether the content of the movement after the collision was implied in the correct sentences.

Slide 26

Slide 26 text

Result - Accuracy -

[Score table repeated from Slide 24]

BLEU, BLEURT, and ROUGE
• these indicators are evaluated based on the degree of word agreement between the correct sentence and the generated sentence after the addition of common sense.

Slide 27

Slide 27 text

Conclusion & Future works

Conclusion
• Predictive inference model to extract change points
  • Models that can visually and physically predict change points in the observed environment
  • Inference content is expressed as a language to link the real world to the language
• Linguistic generation of inferences based on experimental results
  • Add conditions about environment and objects
  • Regenerate sentences including human experience and commonsense knowledge

Future works
• Dataset is simple
• Replacement of inference and language generation in LLM