
230530 AIRC Eri Kuroda


The slides for a presentation at the AIRC on May 30, 2023.

Eri KURODA

May 30, 2023

Transcript

  1. Predictive Inference Model of the
    Physical Environment that mimics
    Predictive Coding
    Eri Kuroda
    Ochanomizu University
    2023/05/30 AIST AIRC
    Material My HP


  2. 7
    Scenario 1: crossing in a hurry before the car arrives.
    Scenario 2: crossing the street after the car has passed.


  3. 8
    Where does our judgment come from?
    We estimate distance and speed:
    • How fast I can walk/run
    • Whether the car will turn or not
    • Whether the car will accelerate quickly
    Whether or not I'll have an accident is judged
    from past experience and common sense.


  4. 9
    Background・Purpose
    Real-World Cognition of Humans
    • Recognition and prediction
    Ø predict what a subject will do next and act accordingly
    Ø learn how the world works and background knowledge
    from only a few interactions and observations
    • change points, common sense
    • Understanding the real world through language
    Ø have linguistic information such as common sense and knowledge
    Ø gain a deeper understanding of the real world by
    connecting language to the real world
    BUT…
    • Machine learning for real-world recognition and prediction
    Ø the input (observation) is an image
    → equivalent to human vision
    Ø predictions of image features are treated as real-world predictions
    • ML does not make predictions based on the physical
    properties of objects or on physical laws, as humans do
    • the real world and understanding of the real world through
    language have not yet been linked
    Purpose
    • Propose a predictive inference model that can detect and predict physical
    change points based on the physical laws of real-world objects.
    • To connect the real world and language, express the inference as language.


  5. 10
    Overview
    • Input: CLEVRER images — when looking at the real world through vision
    • Proposed Model: PredNet + VTA / graph VTA, with a graph-structure
    representation of a set of physical properties
    (object detection, speed, acceleration, image features, etc.)
    • Experiment 1: whether the timing of the change point of the
    next step can be displayed correctly
    • Experiment 2: generate the inference content as language

  6. 11
    PredNet [Lotter+, 2016]


  7. 12
    PredNet [Lotter+, 2016]
    Hierarchical Model


  8. 13
    PredNet [Lotter+, 2016]
    the process of
    predictive coding
    Hierarchical Model


  9. 14
    PredNet [Lotter+, 2016]
    Propagation of
    predictions
    Updating the
    prediction model
    Error generation
    Error propagation
    input
    (observation)
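As a reading aid for the PredNet slides above, here is a minimal NumPy sketch (not the authors' implementation) of a PredNet-style error unit at the lowest layer: the prediction produced top-down from the representation is compared with the observed frame, and the rectified positive and negative differences form the error that is propagated upward. The ConvLSTM representation update and the upper layers are omitted.

```python
import numpy as np

def prednet_error_unit(target, prediction):
    """PredNet-style error unit [Lotter+, 2016]: the error is the concatenation
    of the rectified under-prediction (target - prediction) and the rectified
    over-prediction (prediction - target)."""
    return np.concatenate([np.maximum(target - prediction, 0.0),
                           np.maximum(prediction - target, 0.0)], axis=0)

# Lowest layer at one time step: the input (observation) is the video frame;
# the resulting error is propagated upward, while the prediction comes
# top-down from the representation (ConvLSTM state, omitted in this sketch).
frame = np.random.rand(3, 64, 64)            # observed frame x_t
predicted_frame = np.random.rand(3, 64, 64)  # prediction A_hat from R
error = prednet_error_unit(frame, predicted_frame)
print(error.shape)  # (6, 64, 64)
```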


  10. 15
    Variational Temporal Abstraction [Kim+, 19]
    when walking on the blue road: all events → change points
    when walking on the red road: all events → change points


  11. 16
    Variational Temporal Abstraction [Kim+, 19]
    Problem: it is difficult to decide when the temporal abstraction 𝑍 should transition
    (easy for humans ↔ difficult for the model)
    Observation (input) → observation abstraction → temporal abstraction


  12. 17
    Variational Temporal Abstraction [Kim+, 19]
    Introduces a flag 𝑚 (0 or 1), determined by the magnitude of the change in the
    latent state compared to the previous observation
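As an illustration of this flag, a hedged sketch (not the actual VTA inference, which samples 𝑚 from a learned Bernoulli variable):

```python
import numpy as np

def boundary_flag(z_prev, z_curr, threshold=1.0):
    """Illustrative version of the VTA boundary flag m [Kim+, 2019]:
    m = 1 when the observation-level latent state changes strongly compared
    with the previous step, i.e. when the temporal abstraction Z should
    transition. The hard threshold is an assumption for illustration only;
    in the actual model m is sampled from a learned Bernoulli distribution."""
    return int(np.linalg.norm(np.asarray(z_curr) - np.asarray(z_prev)) > threshold)
```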


  13. 18
    Proposed Model
    [Architecture diagram: a two-stream PredNet-style hierarchy at time t. One stream
    takes the input image x_t (A, E, R units for the image at layers ℓ and ℓ+1) and
    outputs the predicted image; the other stream takes the physical training data as a
    graph structure (A, E, R units for the graph at layers ℓ and ℓ+1) and outputs a
    prediction based on physical properties. Inputs, errors, representations, and
    predictions are propagated between layers as in PredNet.]
    Difference: diff = diff_img + diff_gr
    If diff > α (α: threshold), the change-point flag m_t is output.
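A minimal sketch of the change-point decision at the top of the proposed model, assuming diff_img and diff_gr are the scalar prediction differences of the image stream and the graph (physical-property) stream:

```python
def change_point_flag(diff_img, diff_gr, alpha):
    """Combine the image-prediction difference and the graph-structure
    (physical-property) prediction difference; raise the change-point flag
    m_t when the total exceeds the threshold alpha."""
    diff = diff_img + diff_gr          # diff = diff_img + diff_gr
    return 1 if diff > alpha else 0    # m_t
```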


  14. Dataset︓CLEVRER [Yi+,2020]
    • CLEVRER [Yi+, 2020]
    ØCoLlision Events for Video REpresentation and Reasoning
    19
    Number of videos: 20,000 (train:val:test = 2:1:1)
    Video length: 5 sec
    Number of frames: 128 frames
    Shapes: cube, sphere, cylinder
    Materials: metal, rubber
    Colors: gray, red, blue, green, brown, cyan, purple, yellow
    Events: appear, disappear, collide
    Annotations: object id, position, speed, acceleration


  15. Dataset physical training dataset 20
    • Dataset created from the physical characteristics of the environment
    Pipeline: object recognition → object position → velocity・acceleration →
    position-direction flags between objects → graph structure → embedding vector → combination


  16. Dataset physical training dataset 21
    • Dataset created from the physical characteristics of the environment
    Pipeline: object recognition → object position → velocity・acceleration →
    position-direction flags between objects → graph structure → embedding vector → combination


  17. object recognition
    • YOLACT [Bolya+, 2019]
    ØA type of instance segmentation
    Ørecognizes the {shape, color, material} of an object
    Dataset physical training dataset 22
    Before detecting
    After detecting


  18. object recognition
    • YOLACT [Bolya+, 2019]
    ØA type of instance segmentation
    Ørecognizes the {shape, color, material} of an object
    Calculate location information
    • Calculate the coordinates of the object center c from the acquired
    bounding-box corner coordinates (x1, y1) and (x2, y2):
    c = (x, y) = ((x1 + x2)/2, (y1 + y2)/2)
    Dataset physical training dataset 23
    Before detecting
    After detecting
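The center computation is straightforward; a small sketch (the bounding-box format is assumed here to be the two corner points obtained from the detector):

```python
def bbox_center(x1, y1, x2, y2):
    """Object position: center c of the bounding box obtained from the
    instance-segmentation result, c = ((x1 + x2) / 2, (y1 + y2) / 2)."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

print(bbox_center(10, 20, 30, 60))  # (20.0, 40.0)
```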


  19. Dataset physical training dataset 24
    • Dataset created from the physical characteristics of the environment
    Pipeline: object recognition → object position → velocity・acceleration →
    position-direction flags between objects → graph structure → embedding vector → combination


  20. Velocity・Acceleration
    Dataset physical training dataset 25
    velocity:
    v_{x,t} = (x_t − x_{t−1}) / et_frame
    v_{y,t} = (y_t − y_{t−1}) / et_frame
    acceleration:
    a_{x,t} = (v_{x,t} − v_{x,1}) / (et_frame × t)
    a_{y,t} = (v_{y,t} − v_{y,1}) / (et_frame × t)
    ※ et_frame = 5/128: time elapsed between frames


  21. Velocity・Acceleration Position direction flags between objects
    Dataset physical training dataset 26
    velocity:
    v_{x,t} = (x_t − x_{t−1}) / et_frame
    v_{y,t} = (y_t − y_{t−1}) / et_frame
    acceleration:
    a_{x,t} = (v_{x,t} − v_{x,1}) / (et_frame × t)
    a_{y,t} = (v_{y,t} − v_{y,1}) / (et_frame × t)
    ※ et_frame = 5/128: time elapsed between frames
    Position-direction flags between objects:
    main object = (x_main, y_main), other = (x_other, y_other)
    x_diff = x_other − x_main, y_diff = y_other − y_main
    A flag (5, 1, −1, or −5) is assigned according to the quadrant in which
    (x_diff, y_diff) falls, i.e. the direction of the other object as seen from the main object
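The physical features above can be sketched as follows (the formulas follow the slide; the exact quadrant-to-flag mapping is an assumption for illustration):

```python
ET_FRAME = 5 / 128  # time elapsed between frames (5 s video, 128 frames)

def velocity(p_t, p_prev):
    """Per-axis velocity from consecutive object-center coordinates:
    v_t = (p_t - p_{t-1}) / et_frame."""
    return (p_t - p_prev) / ET_FRAME

def acceleration(v_t, v_1, t):
    """Per-axis acceleration relative to the initial velocity:
    a_t = (v_t - v_1) / (et_frame * t)."""
    return (v_t - v_1) / (ET_FRAME * t)

def direction_flag(main_xy, other_xy):
    """Position-direction flag between objects: the sign pattern of
    (x_diff, y_diff) = other - main selects one of {5, 1, -1, -5}.
    The mapping of quadrants to flag values is assumed for illustration."""
    x_diff = other_xy[0] - main_xy[0]
    y_diff = other_xy[1] - main_xy[1]
    if x_diff >= 0 and y_diff >= 0:
        return 5
    if x_diff < 0 and y_diff >= 0:
        return 1
    if x_diff < 0 and y_diff < 0:
        return -1
    return -5
```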


  22. graph structure
    • Node information
    Øshape, color, material
    embedding vector
    • node2vec [Grover+, 2016]
    Dataset physical training dataset 27
    Example of embedding vectors:
    [[0.54, 0.29, 0.61, …],
     [0.82, 0.91, 0.15, …],
     …,
     [0.14, 0.35, 0.69, …]]
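A hedged sketch of how a per-frame object graph could be embedded with the node2vec package; the graph construction, edge choices, and hyperparameters are assumptions, only the use of node2vec [Grover+, 2016] itself is from the slide:

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# One node per detected object, carrying the recognised node information.
G = nx.Graph()
G.add_node("blue_sphere", shape="sphere", color="blue", material="rubber")
G.add_node("gray_cylinder", shape="cylinder", color="gray", material="metal")
G.add_edge("blue_sphere", "gray_cylinder")  # edge construction is assumed here

# node2vec turns each node into an embedding vector via biased random walks.
n2v = Node2Vec(G, dimensions=16, walk_length=10, num_walks=50, workers=1)
model = n2v.fit(window=5, min_count=1)
embedding = model.wv["blue_sphere"]  # e.g. [0.54, 0.29, 0.61, ...]
```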


  23. Dataset physical training dataset 28
    • Dataset created from the physical characteristics of the environment
    Pipeline: object recognition → object position → velocity・acceleration →
    position-direction flags between objects → graph structure → embedding vector → combination


  24. Ex 1: Extracting Predicted Change Points Ex 2: Text Generation
    Experiment Summary 29


  25. Ex 1: Extracting Predicted Change Points
    Purpose
    • whether the predicted change point of an
    event can be extracted correctly
    Setting
    • Dataset
    ØCLEVRER
    ØPhysical training data
    • Scope of coverage: 6 patterns × 10 frames
    (situations in which physical changes of objects occur,
    such as collision, disappearance, and appearance)
    Experiment Summary 30


  26. Ex1︓ Accuracy Calculation Method
    • Examine the accuracy (%) of the flag timing against the annotated collision information
    Example
    • collision annotation → frame 19; by eye → frame 21
    • the correct-answer range was therefore set to frames 19–21
    • flags: 18, 19, 20, 22 → accuracy: 2/4 × 100 = 50 (%)
    31
    [Frames 19, 20, and 21 of the example video]
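The accuracy computation from the example can be written directly:

```python
def flag_accuracy(flag_frames, correct_range):
    """Share of raised flags whose frame index falls inside the
    correct-answer range, as a percentage."""
    lo, hi = correct_range
    hits = sum(1 for f in flag_frames if lo <= f <= hi)
    return 100.0 * hits / len(flag_frames)

print(flag_accuracy([18, 19, 20, 22], (19, 21)))  # 50.0
```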


  27. 32
    Ex1︓result
    Accuracy (%)
    Range           i     ii    iii   iv    v     vi
    Physical data   33.3  50    50    33.3  66.7  50
    Annotation      66.7  50    66.7  40    50    50
    Result of range i: original and predicted images for t = 1 … 12 with the
    per-frame flags m; collision accuracy: 2/6 × 100 = 33.3 %


  28. 33
    Ex1︓result
    Accuracy (%)
    Range           i     ii    iii   iv    v     vi
    Physical data   33.3  50    50    33.3  66.7  50
    Annotation      66.7  50    66.7  40    50    50
    Result of range i: original and predicted images for t = 1 … 12 with the
    per-frame flags m; collision accuracy: 2/6 × 100 = 33.3 %
    → With the physical training data, predictions are obtained with accuracy
    equivalent to that of the annotation data.


  29. Ex 1: Extracting Predicted Change Points
    Purpose
    • whether the predicted change point of an
    event can be extracted correctly
    Setting
    • Dataset
    ØCLEVRER
    ØPhysical training data
    • Scope of coverage: 6 patterns × 10 frames
    (situations in which physical changes of objects occur,
    such as collision, disappearance, and appearance)
    Ex 2: Text Generation
    Purpose
    • Express reasoning as language to connect
    the real world and language
    Setting
    • Dataset
    ØPaired data of graph embedding vectors
    and language data
    • Collision situations only
    Experiment Summary 34


  30. Ex2︓ Creation of Templates
    • Nine templates
    Ø3 (before・collision・after) × 3 (sentence types)
    • Object type
    Ø“color” + “shape”
    Øex) blue sphere, gray cylinder, etc.
    35
    Template (※ A・B: objects; the 5 frames before and the 5 frames after the
    collision are used for the before/after templates):
    before collision: A and B approach / A approaches B / B approaches A
    collision: A and B collide / A is repulsed by B / B is repulsed by A
    after collision: A and B move apart / A moves away from B / B moves away from A
    Example of text templates (colliding objects “blue sphere”, “gray sphere”):
    before collision:
    「⻘⾊の球と灰⾊の球が近づく」 “Blue sphere and gray sphere approach.”
    「⻘⾊の球が灰⾊の球に近づく」 “Blue sphere approaches gray sphere.”
    「灰⾊の球が⻘⾊の球に近づく」 “Gray sphere approaches blue sphere.”
    collision:
    「⻘⾊の球と灰⾊の球がぶつかる」 “Blue sphere and gray sphere collide.”
    「⻘⾊の球が灰⾊の球にはじかれる」 “Blue sphere is repulsed by gray sphere.”
    「灰⾊の球が⻘⾊の球にはじかれる」 “Gray sphere is repulsed by blue sphere.”
    after collision:
    「⻘⾊の球と灰⾊の球が離れる」 “Blue sphere and gray sphere move apart.”
    「⻘⾊の球から灰⾊の球が離れる」 “Gray sphere moves away from blue sphere.”
    「灰⾊の球から⻘⾊の球が離れる」 “Blue sphere moves away from gray sphere.”
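The nine templates can be generated programmatically; a small sketch using the English glosses (the actual templates are Japanese):

```python
# 3 phases x 3 sentence types; A and B are "color shape" object descriptions.
TEMPLATES = {
    "before":    ["{A} and {B} approach.", "{A} approaches {B}.", "{B} approaches {A}."],
    "collision": ["{A} and {B} collide.", "{A} is repulsed by {B}.", "{B} is repulsed by {A}."],
    "after":     ["{A} and {B} move apart.", "{A} moves away from {B}.", "{B} moves away from {A}."],
}

def fill_templates(phase, obj_a, obj_b):
    """obj_a / obj_b are 'color shape' strings, e.g. 'blue sphere'."""
    return [t.format(A=obj_a.capitalize(), B=obj_b) for t in TEMPLATES[phase]]

print(fill_templates("collision", "blue sphere", "gray sphere"))
# ['Blue sphere and gray sphere collide.',
#  'Blue sphere is repulsed by gray sphere.',
#  'Gray sphere is repulsed by blue sphere.']
```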


  31. 36
    Ex2︓ text generating model
    Training: decoder train model
    • graph embedding → Linear → Decoder → Softmax → words w1, w2, …, wt
    • trained on paired data of graph embeddings and text: 219,303 pairs
    Test: trained decoder model
    • input: predicted graph embedding Â_gr_ℓ
    • output: generated text indicating the predicted content
    • test data: 10,965 pairs
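A minimal PyTorch sketch of such a text-generating model; the layer sizes, the GRU choice, and the use of teacher forcing are assumptions, while the graph embedding → Linear → Decoder → Softmax structure follows the slide:

```python
import torch
import torch.nn as nn

class GraphToTextDecoder(nn.Module):
    """Graph embedding -> Linear -> recurrent Decoder -> Softmax over words."""
    def __init__(self, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.linear = nn.Linear(embed_dim, hidden_dim)
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, graph_embedding, word_ids):
        # The (predicted) graph embedding initialises the decoder state.
        h0 = torch.tanh(self.linear(graph_embedding)).unsqueeze(0)  # (1, B, H)
        outputs, _ = self.decoder(self.word_embed(word_ids), h0)    # (B, T, H)
        return torch.log_softmax(self.out(outputs), dim=-1)         # word distributions w1..wt
```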


  32. 37
    Ex2︓result
    Result of range i (original vs. predicted images)
    correct text:
    「緑⾊の球と⾚⾊の円柱がぶつかる」 “Green sphere and red cylinder collide.”
    「緑⾊の球が⾚⾊の円柱にはじかれる」 “Green sphere is repulsed by red cylinder.”
    「⾚⾊の円柱が緑⾊の球にはじかれる」 “Red cylinder is repulsed by green sphere.”
    generated text:
    「緑⾊の円柱が⾚⾊の円柱にはじかれる」 “Green cylinder is repulsed by red cylinder.”
    → object's color ✔, shape ✘
    Result of range ii (original vs. predicted images)
    correct text:
    「灰⾊の球と⻘⾊の円柱がぶつかる」 “Gray sphere and blue cylinder collide.”
    「灰⾊の球が⻘⾊の円柱にはじかれる」 “Gray sphere is repulsed by blue cylinder.”
    「⻘⾊の円柱が灰⾊の球にはじかれる」 “Blue cylinder is repulsed by gray sphere.”
    generated text:
    「灰⾊の球が⻘⾊の⽴⽅体にはじかれる」 “Gray sphere is repulsed by blue cube.”
    → object's color ✔, shape ✘
    Result of range iv (original vs. predicted images)
    correct text:
    「緑⾊の円柱と茶⾊の⽴⽅体がぶつかる」 “Green cylinder and brown cube collide.”
    「緑⾊の円柱が茶⾊の⽴⽅体にはじかれる」 “Green cylinder is repulsed by brown cube.”
    「茶⾊の⽴⽅体が緑⾊の円柱にはじかれる」 “Brown cube is repulsed by green cylinder.”
    generated text:
    「緑⾊の円柱が茶⾊の⽴⽅体にぶつかる」 “Green cylinder collides with brown cube.”
    → object's color ✔, shape ✔
    Result of range vi (original vs. predicted images)
    correct text:
    「⽔⾊の⽴⽅体と⽔⾊の円柱がぶつかる」 “Cyan cube and cyan cylinder collide.”
    「⽔⾊の⽴⽅体が⽔⾊の円柱にはじかれる」 “Cyan cube is repulsed by cyan cylinder.”
    「⽔⾊の円柱が⽔⾊の⽴⽅体にはじかれる」 “Cyan cylinder is repulsed by cyan cube.”
    generated text:
    「⽔⾊の⽴⽅体が⻘⾊の球にぶつかる」 “Cyan cube collides with blue sphere.”
    → object's color ✘, shape ✘


  33. Ex2: discussion of the results of range vi 38
    [Predicted frames 25, 20, 15, 10, and 5 frames before the collision]
    Reason both the color and the shape of the object are incorrect:
    possibly the “cyan cube” and the “blue sphere” were judged to have collided.
    Range vi
    correct text:
    「⽔⾊の⽴⽅体と⽔⾊の円柱がぶつかる」 “Cyan cube and cyan cylinder collide.”
    「⽔⾊の⽴⽅体が⽔⾊の円柱にはじかれる」 “Cyan cube is repulsed by cyan cylinder.”
    「⽔⾊の円柱が⽔⾊の⽴⽅体にはじかれる」 “Cyan cylinder is repulsed by cyan cube.”
    generated text:
    「⽔⾊の⽴⽅体が⻘⾊の球にぶつかる」 “Cyan cube collides with blue sphere.”
    → object's color ✘, shape ✘


  34. Ex2: BLEU 39
    BLEU@2  BLEU@3  BLEU@4
    score   79.7    74.5    68.8
    Since the average is taken over all generated texts, the score may come out a little low.
    [Result examples for ranges i, ii, iv, and vi repeated from the previous result slide.]
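BLEU@2 to BLEU@4 between the generated and correct texts can be computed, for example, with NLTK (the tokenisation and smoothing choices here are assumptions):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["green", "cylinder", "is", "repulsed", "by", "brown", "cube"]]]
hypotheses = [["green", "cylinder", "collides", "with", "brown", "cube"]]
smooth = SmoothingFunction().method1

bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25,) * 4, smoothing_function=smooth)
print(round(100 * bleu2, 1), round(100 * bleu4, 1))
```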


  35. Conclusion
    • Constructed a predictive inference model that mimics the hierarchical structure
    of the human brain
    Øadded a flag “m” representing change points to the hierarchical structure of PredNet
    Øbased on the experimental results, the timing of change points can also be obtained
    for the predicted content
    • Generated language describing the inferences, to connect real-world events and
    objects with language
    Øon the basis of the experimental results, it was possible to generate language for
    the content of the inferences
    40


  36. Future Tasks
    • Use of more real-world-like data
    • Example: when cooking
    Øgo to the kitchen → prepare the cutting board → cut the ingredients → fry
    • In a real-life environment, extract easy-to-understand change points
    and predict what actions will be necessary
    41
