A-Exam Slides

Ozan Sener
September 10, 2015


Transcript

  1. Unsupervised Discovery of Structure in Human Videos Ozan Sener Joint

    work with Ashutosh Saxena, Silvio Savarese, Ashesh Jain and Amir Zamir Committee: Ashutosh Saxena, David Mimno, Emin Gun Sirer
  2. We envision robots doing human-like activities while working with
     humans (courtesy of Sung et al. and Koppula et al.)
  3. It requires understanding humans, their environments, objects and activities

  4. What is the next step? What is he doing? How

    can I perform X activity?
  5. Understanding Videos Image Centric: video is a trivial extension of
     images. Video Centric: we need video-specific features/models
     [Kantorov CVPR14, Hou et al. ECCV14, THUMOS15, Schmid CVPR15]
  6. Understanding Videos Image Centric: rich models like CRFs; easy to
     model context; super-linear in # of frames; hard to obtain supervision
     (~10s of activities, ~10s of objects). Video Centric: scales linearly
     in # of frames; requires only frame labels; does not model the
     context; inefficient (~30 sec for ~1 sec of video); exclusively
     supervised; hard to scale in # of videos (~100s of videos)
  7. What we have? ~30 sec to process 1 sec of video; support for ~10
     activities; learning from ~100 videos; covering only indoor/sport
     environments. What we need? real-time; any activity; learning from
     all available information; any environment
  8. Discover, understand and share the underlying semantic structure of the

    videos.
  9. Structured understanding of a single video Large-scale understanding of video

    collections Sharing knowledge to other domains and modalities Outline
  10. Structured understanding of a single video Large-scale understanding
      of video collections Sharing knowledge to other domains and
      modalities Outline
  11. Revisit the Image-Based Approach [figure: nodes O1, O2, H] Context
      of humans and objects is successfully modeled as CRFs:
      P(O^1_{1,\dots,T}, O^2_{1,\dots,T}, H_{1,\dots,T} \mid
        x^{O_1}_{1,\dots,T}, x^{O_2}_{1,\dots,T}, x^{H}_{1,\dots,T})
      \sim \exp\Big( \sum_{v \in V} E(v) + \sum_{(v,w) \in E} E(v, w) \Big)
  12. How to Find MAP [Koppula RSS 2013]: compute features; define the
      energy function; solve the combinatorial optimization; obtain
      activity/object labels for the past and future; sample possible
      futures
  13. Shortcomings [Koppula RSS 2013]: compute features; define the energy
      function; solve the combinatorial optimization; obtain
      activity/object labels for the past and future; sample possible
      futures. We also need probabilities in addition to the MAP solution
      (the future is unknown).
      Dimension ~ (#ObjLabels^{#Objects} x #ActLabels)^T ~ (10^6)^T ~ 10^{3600};
      Time: O((T N_O L_O L_A)^3)
  14. Structured Diversity Although the state dimensionality is high, probability concentrates

    on a few modes
  15. Structured Diversity Modes are likely and structurally diverse:
      y_{t,i} = \arg\max_y \; bel_t(y) \quad \text{s.t.} \quad
      \Delta(y, y_{t,j}) \ge \epsilon \;\; \forall j < i
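The greedy diverse-mode selection above can be sketched directly: pick the most likely state, then repeatedly pick the most likely state that is sufficiently different from all earlier picks. This is a minimal sketch over a toy enumerated state space with a hypothetical Hamming dissimilarity; real rCRF state spaces are far too large to enumerate and require the relaxation discussed later in the talk.

```python
# Greedy diverse-mode selection: y_i = argmax_y bel(y)
# s.t. dissim(y, y_j) >= eps for all j < i.
def diverse_modes(states, belief, dissim, eps, m):
    chosen = []
    for _ in range(m):
        # keep only candidates sufficiently different from earlier picks
        feasible = [y for y in states
                    if all(dissim(y, yj) >= eps for yj in chosen)]
        if not feasible:
            break
        chosen.append(max(feasible, key=belief))
    return chosen

# Toy belief over 2-node labelings; Hamming distance as dissimilarity
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
bel = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.15, (1, 1): 0.05}
hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
modes = diverse_modes(states, bel.get, hamming, eps=2, m=2)
```

With eps=2 the second pick must flip both labels, so the two modes are the most likely state and its most likely fully-different alternative.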
  16. HMM – Recursive Belief Estimation HMM derivation [Rabiner]:
      bel_t(y) \propto \underbrace{p(y_t = y \mid x_1, \dots, x_t)}_{\alpha_t(y)}
      \; \underbrace{p(x_{t+1}, \dots, x_T \mid y_t = y)}_{\beta_t(y)}
      \alpha_t(y_t) = p(x_t \mid y_t) \sum_{y_{t-1}} \alpha_{t-1}(y_{t-1}) \, p(y_t \mid y_{t-1})
      \beta_t(y_t) = \sum_{y_{t+1}} p(x_{t+1} \mid y_{t+1}) \, \beta_{t+1}(y_{t+1}) \, p(y_{t+1} \mid y_t)
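The forward-backward recursions above can be sketched in a few lines; this is a minimal sketch, and the toy state names and probability tables are illustrative, not the activity labels or learned parameters from the talk.

```python
# Forward-backward recursion for the smoothed HMM belief
# bel_t(y) ∝ alpha_t(y) * beta_t(y).
def forward_backward(obs, states, p_init, p_trans, p_emit):
    T = len(obs)
    # forward: alpha_t(y) = p(x_t|y) * sum_{y'} alpha_{t-1}(y') p(y|y')
    alpha = [{y: p_init[y] * p_emit[y][obs[0]] for y in states}]
    for t in range(1, T):
        alpha.append({y: p_emit[y][obs[t]] *
                         sum(alpha[t - 1][yp] * p_trans[yp][y] for yp in states)
                      for y in states})
    # backward: beta_t(y) = sum_{y'} p(x_{t+1}|y') beta_{t+1}(y') p(y'|y)
    beta = [dict.fromkeys(states, 1.0) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {y: sum(p_emit[yn][obs[t + 1]] * beta[t + 1][yn] * p_trans[y][yn]
                          for yn in states)
                   for y in states}
    # normalize alpha_t * beta_t into a belief per frame
    beliefs = []
    for t in range(T):
        raw = {y: alpha[t][y] * beta[t][y] for y in states}
        z = sum(raw.values())
        beliefs.append({y: v / z for y, v in raw.items()})
    return beliefs

# toy two-state example (hypothetical labels and tables)
states = ("reach", "pour")
beliefs = forward_backward(
    obs=[0, 1, 1], states=states,
    p_init={"reach": 0.6, "pour": 0.4},
    p_trans={"reach": {"reach": 0.7, "pour": 0.3},
             "pour": {"reach": 0.2, "pour": 0.8}},
    p_emit={"reach": {0: 0.9, 1: 0.1},
            "pour": {0: 0.2, 1: 0.8}})
```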
  17. rCRF: Structured Diversity meets HMM Proposition: the belief over an
      rCRF is a CRF:
      bel(y_t) \propto \exp\Big[ -\sum_{(v,w) \in E_t} \big( E_b(v,w) + \tilde{E}_b(v,w) \big)
      - \sum_{v \in V_t} \Big( E_u(v) + \tilde{E}_u(v)
      - \sum_{y_{t-1}} \alpha_{t-1}(y_{t-1}) \log p(y^v_t \mid y^v_{t-1})
      - \sum_{y_{t+1}} \beta_{t+1}(y_{t+1}) \, bel(y_{t+1}) \log p(y^v_{t+1} \mid y^v_t) \Big) \Big]
      (binary term and unary term)
  18. rCRF: Algorithm Compute the energy function for each frame-wise CRF;
      forward-backward loop for message passing; compute the energy
      function of the rCRF; sample by using Lagrangian relaxation
      [Batra et al.]
  19. rCRF: Algorithm Compute the energy function for each frame-wise CRF;
      forward-backward loop for message passing; compute the energy
      function of the rCRF; sample by using Lagrangian relaxation
      [Batra et al.]. Complexity improves from O((T N_O L_O L_A)^3) to
      O(T (N_O L_O L_A)^3). Computes probabilities for past/present/future
      states as part of the formulation, with no random sampling
  20. Resulting Belief

  21. Efficiency and Accuracy Improvement rCRF is 30x faster than the

    state-of-the-art algorithms and runs in real-time Accurate handling of uncertainty also increases accuracy
  22. Efficiency and Accuracy Improvement Resulting belief also stays
      informative through time
  23. Structured understanding of a single video Large-scale understanding
      of video collections Sharing knowledge to other domains and
      modalities Outline
  24. Is Unsupervised Learning Possible? Is there an underlying structure
      in the YouTube “How-To” videos?
  25. Is there an underlying structure in the YouTube “How-To” videos?

    1st Result
  26. Is there an underlying structure in the YouTube “How-To” videos?

    2nd Result
  27. Summary of the Approach We automatically download and filter a large
      multi-modal activity corpus from YouTube; we learn a multi-modal
      dictionary; we discover activities by using an NP-Bayes approach
  28. Dictionary Learning (Language) We use the tf-idf metric, considering
      each video as a document, and choose the K words with the highest
      tf-idf scores. Dictionary for category “Hard Boil an Egg” with K=50:
      sort, place, water, egg, bottom, fresh, pot, crack, cold, cover,
      time, overcooking, hot, shell, stove, turn, cook, boil, break,
      pinch, salt, peel, lid, point, haigh, rules, perfectly, hard, smell,
      fast, soft, chill, ice, bowl, remove, aside, store, set,
      temperature, coagulates, yolk, drain, swirl, shake, white, roll,
      handle, surface, flat
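The tf-idf selection can be sketched as follows. This is a minimal sketch with one common tf-idf weighting; the three toy "transcripts" and the K value are illustrative, not the actual YouTube data.

```python
# Build a small dictionary by tf-idf, treating each video's transcript
# as a document; keep the k terms with the highest score.
import math
from collections import Counter

def tfidf_dictionary(docs, k):
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    best = {}
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # term frequency * inverse document frequency
            score = (count / len(doc)) * math.log(n / df[term])
            best[term] = max(best.get(term, 0.0), score)
    return [t for t, _ in sorted(best.items(), key=lambda kv: -kv[1])[:k]]

videos = [
    "crack the egg into boiling water".split(),
    "boil the water then peel the egg shell".split(),
    "heat the pan and flip the pancake".split(),
]
top = tfidf_dictionary(videos, 5)
```

Stopwords like "the" appear in every document, so their idf is log(1) = 0 and they drop out of the dictionary without an explicit stopword list.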
  29. Dictionary Learning (Visual) Multi-Video Edges [figure: proposal
      graphs for Videos 1, 2 and 3 connected by multi-video edges]
  30. Dictionary Learning (Visual) Multi-Video Edges [figure: proposal
      graphs for Videos 1, 2 and 3 connected by multi-video edges]
      \arg\max \sum_{i \in V} \frac{x^{(i)\top} A^{(i)} x^{(i)}}{x^{(i)\top} x^{(i)}}
      + \sum_{i \in V} \sum_{j \in N(i)} \frac{x^{(i)\top} A^{(i,j)} x^{(j)}}{x^{(i)\top} \mathbf{1}\mathbf{1}^\top x^{(j)}}
      where V is the set of videos, N(i) is the set of neighbour videos of
      i, and A are similarity matrices. This function is quasi-convex and
      can be optimized via SGD using
      \nabla_{x^{(i)}} = \frac{2 A^{(i)} x^{(i)} - 2 x^{(i)} r^{(i)}}{x^{(i)\top} x^{(i)}}
      + \sum_{j \in N(i)} \frac{A^{(i,j)} x^{(j)} - \mathbf{1} (x^{(j)\top} \mathbf{1}) \, r^{(i,j)}}{x^{(i)\top} \mathbf{1}\mathbf{1}^\top x^{(j)}}
      where r^{(i)} and r^{(i,j)} denote the current values of the
      respective ratio terms
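The per-video term of this objective, x^T A x / x^T x, is a Rayleigh quotient, and the ratio gradient above can be sketched on it in isolation. This is a minimal sketch with a 2x2 toy similarity matrix (the full objective adds the cross-video ratio terms, and real A matrices come from region-proposal similarities).

```python
# Gradient ascent on the single-video Rayleigh-quotient term
# r(x) = x^T A x / x^T x, using grad = (2 A x - 2 r x) / (x^T x),
# which converges to the top eigenvector of a symmetric A.
def rayleigh_ascent(A, x, lr=0.1, steps=200):
    n = len(x)
    ratio = 0.0
    for _ in range(steps):
        Ax = [sum(A[row][col] * x[col] for col in range(n)) for row in range(n)]
        xtx = sum(v * v for v in x)
        ratio = sum(x[k] * Ax[k] for k in range(n)) / xtx  # current quotient
        grad = [(2 * Ax[k] - 2 * ratio * x[k]) / xtx for k in range(n)]
        x = [x[k] + lr * grad[k] for k in range(n)]
    return x, ratio

A = [[2.0, 1.0], [1.0, 2.0]]          # symmetric toy similarity matrix
x, ratio = rayleigh_ascent(A, [1.0, 0.0])
```

For this A the top eigenvalue is 3 with eigenvector direction (1, 1), so the quotient climbs to 3 and the two coordinates of x equalize.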
  31. Learned Dictionaries Semantically Correct

  32. Learned Dictionaries Semantically Correct

  33. Learned Dictionaries Accuracy vs Semantic Meaning

  34. Representing Each Frame

  35. Unsupervised Discovery via NP-Bayes [plate diagram: activity steps
      k = 1, ..., ∞ with parameters θ_k, w_k; videos i = 1, ..., N with
      f_i, π_i, η_i, latent states z and observations y; base measure
      B₀(·)] We jointly model activities and videos
  36. Unsupervised Discovery via NP-Bayes Each activity is modeled as a
      likelihood of seeing each dictionary item (θ_k), e.g. the
      probability of having the word “egg” and having object 5
  37. Unsupervised Discovery via NP-Bayes Each video chooses a subset of
      activities via the Indian Buffet Process (f_i)
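A draw of the binary activity-selection vectors from the Indian Buffet Process can be sketched as follows; this is a minimal sketch of the standard IBP generative story, and alpha and the number of videos are illustrative, not the talk's settings.

```python
# One draw from the Indian Buffet Process: each "customer" (video)
# selects a binary subset of "dishes" (activities).
import math
import random

def sample_poisson(lam, rng):
    """Knuth's inversion sampler for Poisson(lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def indian_buffet(n_videos, alpha, seed=0):
    rng = random.Random(seed)
    dish_counts = []           # how many earlier videos chose each dish
    selections = []
    for i in range(1, n_videos + 1):
        chosen = set()
        # pick existing dish k with probability count_k / i
        for k, c in enumerate(dish_counts):
            if rng.random() < c / i:
                chosen.add(k)
        # then try Poisson(alpha / i) brand-new dishes
        for _ in range(sample_poisson(alpha / i, rng)):
            dish_counts.append(0)
            chosen.add(len(dish_counts) - 1)
        for k in chosen:
            dish_counts[k] += 1
        selections.append(chosen)
    return selections

draws = indian_buffet(10, alpha=2.0)
```

The number of activities is unbounded a priori but grows only logarithmically with the number of videos, which is what lets the model discover how many activity steps exist rather than fixing it in advance.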
  38. Unsupervised Discovery via NP-Bayes ...and transition probabilities
      between activities (η_i)
  39. Unsupervised Discovery via NP-Bayes Given the activities and
      transition probabilities, it is an HMM
  40. Unsupervised Discovery via NP-Bayes We learn by Gibbs sampling
  41. Discovered Activities

  42. None
  43. Evaluation Both modalities are complementary and joint modeling is
      necessary! Multi-video mid-level descriptions are critical for
      accuracy
  44. Structured understanding of a single video Large-scale unsupervised
      understanding of human activities Sharing knowledge to other domains
      and modalities Outline
  45. Graph Perspective of Large-Scale Activities How to make pancakes

  46. Graph Perspective of Large-Scale Activities How to make pancakes

  47. Graph Perspective of Large-Scale Activities How to make pancakes Egg

    Heat Pan Flip Beat Flour
  48. Graph Perspective of Large-Scale Activities How to make pancakes Egg

    Heat Pan Flip Beat Flour
  49. Can we go further? RoboBrain Snapshot of the RoboBrain graph: 45,000
      concepts (nodes), 98,000 relations (edges), connecting knowledge
      from Internet sources and many projects
  50. How to scale the knowledge [graph fragment: cup, water, bottle and
      fridge nodes with pour, liquid, kept, appearance and has_grasp
      edges] Input is in the form of “feeds”; a feed is a collection of
      binary relations: (Concept, Relation, Concept), (Concept, Relation,
      Concept), ...
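The feed format can be sketched as a tiny graph-ingestion routine. This is a minimal illustration of the idea of binary-relation feeds; `ConceptGraph` and the triples below are hypothetical, not the actual RoboBrain API or data.

```python
# Ingest a "feed" of (concept, relation, concept) triples into a
# simple adjacency-set concept graph.
from collections import defaultdict

class ConceptGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # concept -> {(relation, concept)}

    def add_feed(self, feed):
        """A feed is a collection of binary relations."""
        for src, rel, dst in feed:
            self.edges[src].add((rel, dst))

    def neighbors(self, concept):
        return sorted(self.edges[concept])

g = ConceptGraph()
g.add_feed([("cup", "pour", "water"),
            ("water", "is_a", "liquid"),
            ("bottle", "kept", "fridge")])
```

Because each feed is just a set of edges, feeds from many independent projects can be merged into one graph by repeated `add_feed` calls, which is what makes the format easy to scale.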
  51. How to scale the knowledge Incoming feed: “Bottle has appearance;
      Table has appearance”
  52. How to scale the knowledge Incoming feed: “Bottle has appearance;
      Table has appearance” adds a new node (table) to the graph
  53. How to scale the knowledge Incoming feed: “Cup spatially
      distributed; Table spatially distributed”
  54. System Architecture

  55. How we processed 25k videos in a day

  56. Robotics-as-a-Service We shared a reliable interface with multiple
      universities. The Humans to Robots Lab @ Brown University
      successfully used RoboBrain-as-a-Service
  57. What we have? ~30 sec to process 1 sec of video; support for ~10
      activities; learning from ~100 videos; covering only indoor/sport
      environments. What we need? real-time (rCRF with structured
      diversity); any activity (unsupervised learning w/ NP-Bayes);
      learning from all available information (large-scale learning on
      YouTube); any environment humans go to (using multiple domains via
      RoboBrain)
  58. How can we scale further? Linking more and more modalities

    and domains with efficient and theoretically sound models
  59. Cross-Domain Information Videos with no Structure Images and Words with

    Structure How to transfer knowledge?
  60. Transductive Approach (ongoing/future work) How to handle domain
      shift? Domain adaptation vs. domain invariance: domain-invariant
      feature learning followed by induction, or induction followed by
      transformation
  61. Transductive Approach (ongoing/future work) Domain-invariant feature
      learning followed by induction, or induction followed by
      transformation: it is generally hard, sometimes impossible [Vapnik].
      What if such a feature does not exist?
  62. Domain Transduction [Ongoing work] It might be possible to solve
      domain transfer with no induction. We are developing a max-margin
      framework based on coordinate ascent over transduction and domain
      adaptation
  63. Adaptive Transduction [Future Work] Some domains are intractably
      large, like YouTube. Can we solve the problem adaptively?
  64. Adaptive Transduction – Robot in the Loop Sampled Videos Domain

    Transduction Adaptive Sampling
  65. rCRF: Recursive Belief Estimation over CRFs in RGB-D Activity
      Videos. Ozan Sener, Ashutosh Saxena. In RSS 2015.
      Unsupervised Semantic Parsing of Video Collections. Ozan Sener,
      Amir Zamir, Silvio Savarese, Ashutosh Saxena. In ICCV 2015.
      RoboBrain: Large-Scale Knowledge Engine for Robots. Ashutosh Saxena,
      Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S.
      Koppula. In ISRR 2015.
      3D Semantic Parsing of Large-Scale Buildings. Iro Armeni, Ozan Sener
      et al. (In submission to CVPR 2016.)
      Unsupervised Discovery of Spatio-Temporal Semantic Descriptors.
      Under preparation for TPAMI submission.
  66. Joint Work With Ashutosh Saxena Silvio Savarese Ozan Sener Ashesh

    Jain Deedy Das Amir R. Zamir Aditya Jami Dipendra K. Misra Jay Hack Hema Koppula
  67. Thank You