
A-Exam Slides

Ozan Sener
September 10, 2015


  1. Unsupervised Discovery of Structure in
    Human Videos
    Ozan Sener
    Joint work with Ashutosh Saxena, Silvio Savarese, Ashesh Jain and Amir Zamir
    Committee: Ashutosh Saxena, David Mimno, Emin Gun Sirer


  2. We envision robots
    doing human-like activities while working with humans
    courtesy of Sung et al.; courtesy of Koppula et al.

  3. It requires
    understanding humans, their environments, objects and
    activities

  4. What is the next step?
    What is he doing?
    How can I perform X activity?


  5. Understanding Videos
    Image Centric:
    Video is a trivial extension of images
    Video Centric:
    We need video-specific features/models
    [Kantorov CVPR14, Hou et al. ECCV14, THUMOS15, Schmid CVPR15]

  6. Understanding Videos
    Image Centric:
    - Rich models like CRFs
    - Easy to model context
    - Super-linear in # of frames
    - Hard to obtain supervision (~10s of activities, ~10s of objects)
    Video Centric:
    - Scales linearly in # of frames
    - Requires only frame labels
    - Does not model the context
    - Inefficient (~30 sec for ~1 sec of video)
    - Exclusively supervised; hard to scale in # of videos (~100s of videos)

  7. What we have?
    - ~30 sec to process 1 sec of video
    - support for ~10 activities
    - learning from ~100 videos
    - covering only indoor/sport environments
    What we need?
    - real-time
    - any activity
    - learn from all available information
    - any environment

  8. Discover, understand and share the
    underlying semantic structure of
    the videos.


  9. Outline
    - Structured understanding of a single video
    - Large-scale understanding of video collections
    - Sharing knowledge to other domains and modalities

  10. Outline
    - Structured understanding of a single video
    - Large-scale understanding of video collections
    - Sharing knowledge to other domains and modalities

  11. Revisit the Image-Based Approach
    The context of humans (H) and objects (O1, O2) is successfully
    modeled as a CRF:

    $$P\left(O_1^{1,\dots,T}, O_2^{1,\dots,T}, H^{1,\dots,T} \,\middle|\, \Phi_{O_1}^{1,\dots,T}, \Phi_{O_2}^{1,\dots,T}, \Phi_{H}^{1,\dots,T}\right) \propto \exp\left(\sum_{v \in V} E(y_v) + \sum_{(v,w) \in E} E(y_v, y_w)\right)$$

  12. How to Find MAP [Koppula RSS 2013]
    Compute features
    Define the energy function
    Solve the combinatorial optimization
    Activity/object labels for the past and future
    Sample possible futures

  13. Shortcomings [Koppula RSS 2013]
    Compute features
    Define the energy function
    Solve the combinatorial optimization
    Activity/object labels for the past and future
    Sample possible futures
    We also need probabilities in addition to the MAP solution
    (the future is unknown)
    Dimension $\sim 10^{6 \cdot T} \sim 10^{3600}$
    $\left(\#\text{ObjLabels}^{\#\text{Objects}} \times \#\text{ActLabels}\right)^{\text{Time}}$
    Complexity: $O\!\left((T \, N_O L_O L_A)^3\right)$

  14. Structured Diversity
    Although the state dimensionality is high, probability concentrates on a few modes


  15. Structured Diversity
    Modes are likely and structurally diverse:

    $$y^{t,i} = \arg\max_{y} \; \mathrm{bel}_t(y) \quad \text{s.t.} \quad \Delta\!\left(y, y^{t,j}\right) \ge k \quad \forall j < i$$

  16. HMM – Recursive Belief Estimation
    HMM derivation [Rabiner]:

    $$\mathrm{bel}_t(y) \propto \underbrace{p(y_t = y \mid x_1, \dots, x_t)}_{\alpha_t(y)} \;\; \underbrace{p(x_{t+1}, \dots, x_T \mid y_t = y)}_{\beta_t(y)}$$

    $$\alpha_t(y_t) = p(x_t \mid y_t) \sum_{y_{t-1}} \alpha_{t-1}(y_{t-1}) \, p(y_t \mid y_{t-1})$$

    $$\beta_t(y_t) = \sum_{y_{t+1}} p(x_{t+1} \mid y_{t+1}) \, \beta_{t+1}(y_{t+1}) \, p(y_{t+1} \mid y_t)$$
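The two recursions above can be sketched in a few lines of NumPy. This is a generic log-space forward-backward pass for a plain HMM, not the rCRF implementation from the talk; the uniform initial state and the array layout are assumptions for illustration.

```python
import numpy as np

def _logsumexp_cols(m):
    # log-sum-exp over rows (axis 0), returning one value per column
    mx = m.max(axis=0)
    return mx + np.log(np.exp(m - mx).sum(axis=0))

def _logsumexp_rows(m):
    # log-sum-exp over columns (axis 1), returning one value per row
    mx = m.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(m - mx).sum(axis=1, keepdims=True))).ravel()

def forward_backward(log_emit, log_trans):
    """Rabiner-style alpha/beta recursions in log space.

    log_emit:  (T, S) array, log p(x_t | y_t = s)
    log_trans: (S, S) array, log p(y_t = j | y_{t-1} = i)
    Returns per-frame beliefs bel_t(y) ∝ alpha_t(y) * beta_t(y),
    assuming a uniform initial state distribution.
    """
    T, S = log_emit.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))

    # Forward: alpha_t(y) = p(x_t | y) * sum_{y'} alpha_{t-1}(y') p(y | y')
    alpha[0] = log_emit[0]
    for t in range(1, T):
        prev = alpha[t - 1][:, None] + log_trans          # (S, S): y' -> y
        alpha[t] = log_emit[t] + _logsumexp_cols(prev)

    # Backward: beta_t(y) = sum_{y'} p(x_{t+1} | y') beta_{t+1}(y') p(y' | y)
    for t in range(T - 2, -1, -1):
        nxt = log_trans + (log_emit[t + 1] + beta[t + 1])[None, :]
        beta[t] = _logsumexp_rows(nxt)

    log_bel = alpha + beta
    log_bel -= log_bel.max(axis=1, keepdims=True)         # numerical stability
    bel = np.exp(log_bel)
    return bel / bel.sum(axis=1, keepdims=True)
```

With uninformative emissions and a uniform transition matrix, the beliefs stay uniform, which is a quick sanity check on the recursions.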

  17. rCRF: Structured Diversity meets HMM
    Proposition: A belief over an rCRF is a CRF.

    $$\mathrm{bel}(y^t) \propto \exp\Bigg[ \sum_{(v,w) \in E_t} \Big( E_b(y_v, y_w) - \tilde{E}_b(y_v, y_w) \Big) - \sum_{v \in V_t} \Big( E_u(y_v) - \tilde{E}_u(y_v) + \sum_{y^{t-1}} \alpha_{t-1}(y^{t-1}) \log p\big(y^t_v \mid y^{t-1}_v\big) - \sum_{y^{t+1}} \beta_{t+1}(y^{t+1}) \, \mathrm{bel}(y^{t+1}) \log p\big(y^{t+1}_v \mid y^t_v\big) \Big) \Bigg]$$

    (first sum: binary term; second sum: unary term)

  18. rCRF: Algorithm
    Compute energy function for each frame-wise CRF
    Forward-Backward loop for message passing
    Compute the energy function of rCRF
    Sample by using Lagrangian relaxation [Batra et al]


  19. rCRF: Algorithm
    Compute the energy function for each frame-wise CRF
    Forward-backward loop for message passing
    Compute the energy function of the rCRF
    Sample by using Lagrangian relaxation [Batra et al]

    $$O\!\left((T \, N_O L_O L_A)^3\right) \;\longrightarrow\; O\!\left(T \, (N_O L_O L_A)^3\right)$$

    Computes probabilities for past/present/future states as part of
    the formulation, with no random sampling

  20. Resulting Belief


  21. Efficiency and Accuracy Improvement
    rCRF is 30x faster than the state-of-the-art algorithms and runs in real time
    Accurate handling of uncertainty also increases accuracy

  22. Efficiency and Accuracy Improvement
    The resulting belief also stays informative through time

  23. Outline
    - Structured understanding of a single video
    - Large-scale understanding of video collections
    - Sharing knowledge to other domains and modalities

  24. Is Unsupervised Learning Possible?
    Is there an underlying structure in the YouTube “How-To” videos?

  25. Is there an underlying structure in the YouTube “How-To” videos?
    1st Result


  26. Is there an underlying structure in the YouTube “How-To” videos?
    2nd Result


  27. Summary of the Approach
    We automatically download and filter a large multi-modal activity
    corpus from YouTube
    We discover activities by using an NP-Bayes approach
    We learn a multi-modal dictionary

  28. Dictionary Learning (Language)
    We use the tf-idf metric by considering each video as a document.
    We choose the K words with maximum tf-idf.
    Dictionary for category “Hard Boil an Egg” with K=50:
    sort, place, water, egg, bottom, fresh, pot, crack, cold, cover, time, overcooking,
    hot, shell, stove, turn, cook, boil, break, pinch, salt, peel, lid, point, haigh, rules,
    perfectly, hard, smell, fast, soft, chill, ice, bowl, remove, aside, store, set,
    temperature, coagulates, yolk, drain, swirl, shake, white, roll, handle, surface, flat
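The selection step above can be sketched as follows. This is a minimal stand-alone tf-idf ranker that treats each video's transcript as one document; the function name and the exact scoring variant (raw term frequency, natural-log idf, max over documents) are assumptions, not the talk's implementation.

```python
import math
from collections import Counter

def topk_tfidf(docs, k):
    """Pick the k words with the highest tf-idf over a video collection.

    docs: list of token lists, one per video (each video = one document).
    Returns the k top-scoring words, best first.
    """
    n = len(docs)

    # Document frequency: in how many videos does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Keep each word's best tf-idf score across all videos.
    best = {}
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        for w, c in tf.items():
            score = (c / total) * math.log(n / df[w])
            if score > best.get(w, -1.0):
                best[w] = score

    return [w for w, _ in sorted(best.items(), key=lambda p: -p[1])[:k]]
```

Words that appear in every video get an idf of zero and sink to the bottom, which is the desired behavior for filler vocabulary.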

  29. Dictionary Learning (Visual)
    Multi Video Edges
    Proposal Graph for Video 1 Proposal Graph for Video 2 Proposal Graph for Video 3


  30. Dictionary Learning (Visual)
    Multi-Video Edges
    Proposal Graph for Video 1 / Video 2 / Video 3

    $$\arg\max_{x} \; \sum_{i \in V} \frac{x^{(i)T} A^{(i)} x^{(i)}}{x^{(i)T} x^{(i)}} \; + \; \sum_{i \in V} \sum_{j \in N(i)} \frac{x^{(i)T} A^{(i,j)} x^{(j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}$$

    where $V$ is the set of videos, $N(i)$ is the set of neighbor videos
    of $i$, and the $A$ are similarity matrices.

    This function is quasi-convex and can be optimized via SGD with the
    gradient

    $$\nabla_{x^{(i)}} = \frac{2 A^{(i)} x^{(i)} - 2\, r^{(i)} x^{(i)}}{x^{(i)T} x^{(i)}} \; + \; \sum_{j \in N(i)} \frac{A^{(i,j)} x^{(j)} - r^{(i,j)} \big(x^{(j)T}\mathbf{1}\big)\, \mathbf{1}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}$$

    where $r^{(i)}$ and $r^{(i,j)}$ denote the current values of the
    corresponding ratio terms.
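One ascent step of the objective above can be sketched with dense NumPy gradients. This is an illustrative simplification: the `ascent_step` helper, its dict-based calling convention, and the positivity clipping are assumptions, and a true SGD would subsample the cross-video terms rather than sum them all.

```python
import numpy as np

def ascent_step(xs, A_self, A_cross, neighbors, lr=0.05):
    """One gradient-ascent step on the relevance objective.

    xs:        dict video_id -> relevance vector x^(i) over its proposals
    A_self:    dict video_id -> within-video similarity matrix A^(i)
    A_cross:   dict (i, j)   -> cross-video similarity matrix A^(i,j)
    neighbors: dict video_id -> list of neighbor video ids
    """
    new_xs = {}
    for i, x in xs.items():
        # Rayleigh-quotient term: grad = 2 (A x - r x) / (x^T x)
        r = (x @ A_self[i] @ x) / (x @ x)
        g = 2.0 * (A_self[i] @ x - r * x) / (x @ x)

        # Cross-video terms: grad of x^T A x_j / (x^T 1 1^T x_j)
        for j in neighbors.get(i, []):
            xj = xs[j]
            denom = x.sum() * xj.sum()
            r_ij = (x @ A_cross[(i, j)] @ xj) / denom
            g += (A_cross[(i, j)] @ xj - r_ij * xj.sum() * np.ones_like(x)) / denom

        # Keep relevance scores positive after the update
        new_xs[i] = np.clip(x + lr * g, 1e-8, None)
    return new_xs
```

With no neighbors the update is plain ascent on a Rayleigh quotient, so the score vector drifts toward the dominant eigenvector of the within-video similarity matrix.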

  31. Learned Dictionaries
    Semantically Correct

  32. Learned Dictionaries
    Semantically Correct

  33. Learned Dictionaries
    Accuracy vs Semantic Meaning

  34. Representing Each Frame

  35. Unsupervised Discovery via NP-Bayes
    [Plate diagram: activity steps $\theta_k$, $w_k$ ($k = 1, \dots, \infty$)
    drawn from base measure $B_0(\cdot)$; per-video variables $f_i$, $\pi_i$,
    $\eta_i$ ($i = 1, \dots, N$); latent states $z_1, z_2, \dots$ emitting
    frames $y_1, y_2, \dots$]
    We jointly model activities and videos

  36. Unsupervised Discovery via NP-Bayes
    [Same plate diagram]
    Each activity $\theta_k$ is modeled as the likelihood of seeing each
    dictionary item, e.g. the probability of having the word “egg” and
    having object 5.

  37. Unsupervised Discovery via NP-Bayes
    [Same plate diagram]
    Each video chooses a subset of activities $f_i$ via the Indian Buffet
    Process.

  38. Unsupervised Discovery via NP-Bayes
    [Same plate diagram]
    ... and transition probabilities $\eta_i$ between activities.

  39. Unsupervised Discovery via NP-Bayes
    [Same plate diagram]
    Given the activities and transition probabilities, it is an HMM.

  40. Unsupervised Discovery via NP-Bayes
    [Same plate diagram]
    We learn by Gibbs sampling.
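For intuition, the Indian Buffet Process prior that picks which activity steps each video uses can be simulated with the standard "customers and dishes" scheme. This is a generic IBP sketch, not the model's full Gibbs sampler; the function name and the default seed are assumptions.

```python
import numpy as np

def sample_ibp(num_videos, alpha, rng=None):
    """Draw a binary video-by-activity matrix from the Indian Buffet Process.

    Row i says which latent activity steps video i contains (f_i above).
    alpha controls how many brand-new activities each video introduces.
    """
    rng = rng or np.random.default_rng(0)
    dish_counts = []          # how many videos used each activity so far
    rows = []
    for i in range(1, num_videos + 1):
        # Reuse an existing activity k with probability count_k / i
        row = [rng.random() < c / i for c in dish_counts]
        for k, used in enumerate(row):
            if used:
                dish_counts[k] += 1
        # Introduce Poisson(alpha / i) previously unseen activities
        new = rng.poisson(alpha / i)
        dish_counts.extend([1] * new)
        row.extend([True] * new)
        rows.append(row)

    K = len(dish_counts)
    F = np.zeros((num_videos, K), dtype=bool)
    for i, row in enumerate(rows):
        F[i, :len(row)] = row
    return F
```

Popular activity steps get reused by later videos while the Poisson term keeps adding fresh ones at a decaying rate, which is exactly the "unbounded but shared" behavior the model needs.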

  41. Discovered Activities

  42. [figure]

  43. Evaluation
    Both modalities are complementary and joint modeling is necessary!
    Multi-video mid-level descriptions are critical for the accuracy


  44. Outline
    - Structured understanding of a single video
    - Large-scale unsupervised understanding of human activities
    - Sharing knowledge to other domains and modalities

  45. Graph Perspective of Large-Scale Activities
    How to make pancakes

  46. Graph Perspective of Large-Scale Activities
    How to make pancakes

  47. Graph Perspective of Large-Scale Activities
    How to make pancakes
    [Graph nodes: Egg, Heat, Pan, Flip, Beat, Flour]

  48. Graph Perspective of Large-Scale Activities
    How to make pancakes
    [Graph nodes: Egg, Heat, Pan, Flip, Beat, Flour]

  49. Can we go further? RoboBrain
    Snapshot of the RoboBrain graph:
    45,000 concepts (nodes)
    98,000 relations (edges)
    Connecting knowledge from Internet sources and many projects

  50. How to scale the knowledge
    [Graph: concepts such as cup, water, pour, liquid, bottle, fridge,
    linked to CV models via relations such as kept, appearance, has_grasp]
    Input is in the form of “feeds”
    A feed is a collection of binary relations:
    Concept —Relation— Concept
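A feed of binary relations merging into a shared graph can be sketched as follows. This is an illustrative toy, not the actual RoboBrain API; the class name, method names, and the sample relation strings are all hypothetical.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Minimal feed-based knowledge graph: a feed is a list of
    (concept, relation, concept) triples merged into one shared graph."""

    def __init__(self):
        # concept -> set of (relation, concept) edges
        self.edges = defaultdict(set)

    def push_feed(self, feed):
        """Merge one feed (a collection of binary relations) into the graph."""
        for head, relation, tail in feed:
            self.edges[head].add((relation, tail))

    def related(self, concept, relation):
        """All concepts reachable from `concept` via `relation`, sorted."""
        return sorted(t for r, t in self.edges[concept] if r == relation)

# Hypothetical feed mirroring the slide's examples
kg = KnowledgeGraph()
kg.push_feed([
    ("bottle", "has_appearance", "cv_model_1"),
    ("bottle", "kept", "fridge"),
    ("cup", "has_grasp", "cv_model_2"),
])
```

Because feeds only add edges, independently produced feeds from different projects can be pushed into the same graph in any order, which is what lets the knowledge scale.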

  51. How to scale the knowledge
    [Same graph]
    Bottle has appearance
    Table has appearance

  52. How to scale the knowledge
    [Same graph, with a table node added]
    Bottle has appearance
    Table has appearance

  53. How to scale the knowledge
    [Same graph]
    Cup spatially distributed
    Table spatially distributed

  54. System Architecture


  55. How we processed 25k videos in a day


  56. Robotics-as-a-Service
    We shared a reliable interface with multiple universities.
    The Humans to Robots Lab @ Brown University successfully used
    RoboBrain-as-a-Service.

  57. What we have?
    - ~30 sec to process 1 sec of video
    - support for ~10 activities
    - learning from ~100 videos
    - covering only indoor/sport environments
    What we need?
    - real-time
    - any activity
    - learn from all available information
    - any environment humans go to
    How we get there:
    - rCRF with structured diversity
    - unsupervised learning w/ NP-Bayes
    - large-scale learning on YouTube
    - using multiple domains via RoboBrain

  58. How can we scale further?
    Linking more and more modalities and
    domains with efficient and theoretically
    sound models


  59. Cross-Domain Information
    Videos with no structure
    Images and words with structure
    How to transfer knowledge?

  60. Transductive Approach (ongoing/future work)
    How to handle domain shift?
    Domain adaptation vs. domain invariance
    - Domain-invariant feature learning followed by induction
    - Induction followed by transformation

  61. Transductive Approach (ongoing/future work)
    - Domain-invariant feature learning followed by induction
    - Induction followed by transformation
    It is generally hard, sometimes impossible [Vapnik].
    What if such a feature does not exist?

  62. Domain Transduction [ongoing work]
    It might be possible to solve domain transfer with no induction.
    We are developing a max-margin framework based on coordinate ascent
    over transduction and domain adaptation.

  63. Adaptive Transduction [future work]
    Some domains are intractably large (e.g., YouTube).
    Can we solve the problem adaptively?

  64. Adaptive Transduction – Robot in the Loop
    [Loop: sampled videos → domain transduction → adaptive sampling → …]

  65. rCRF: Recursive Belief Estimation over CRFs in RGB-D Activity Videos
    Ozan Sener, Ashutosh Saxena. In RSS 2015.
    Unsupervised Semantic Parsing of Video Collections
    Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena. In ICCV 2015.
    RoboBrain: Large-Scale Knowledge Engine for Robots
    Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra,
    Hema S. Koppula. In ISRR 2015.
    3D Semantic Parsing of Large-Scale Buildings
    Iro Armeni, Ozan Sener et al. (In submission to CVPR 2016.)
    Unsupervised Discovery of Spatio-Temporal Semantic Descriptors
    Under preparation for TPAMI submission.

  66. Joint Work With
    Ashutosh Saxena, Silvio Savarese, Ozan Sener, Ashesh Jain, Deedy Das,
    Amir R. Zamir, Aditya Jami, Dipendra K. Misra, Jay Hack, Hema Koppula

  67. Thank You