A-Exam Slides

Ozan Sener
September 10, 2015

Transcript

  1. Unsupervised Discovery of Structure in Human Videos

    Ozan Sener. Joint work with Ashutosh Saxena, Silvio Savarese, Ashesh Jain and Amir Zamir. Committee: Ashutosh Saxena, David Mimno, Emin Gun Sirer
  2. We envision robots doing human-like activities while working with

    humans (courtesy of Sung et al. and Koppula et al.)
  3. Understanding Videos

    Image Centric: video is a trivial extension of images. Video Centric: we need video-specific features/models [Kantorov CVPR14, Hou et al ECCV14, THUMOS15, Schmid CVPR15]
  4. Understanding Videos

    Image Centric: rich models like CRF; easy to model context; super-linear in #frames; hard to obtain supervision (~10s of activities, ~10s of objects). Video Centric: scales linearly in #frames; requires only frame labels; does not model the context. Inefficient (~30 sec for ~1 sec of video). Exclusively supervised. Hard to scale in #videos (~100s of videos)
  5. What we have?

    ~30 sec to process 1 sec of video; support for ~10 activities; learning from ~100 videos; covering only indoor/sport environments. What we need? Real-time; any activity; learn from all available information; any environment
  6. Outline

    Structured understanding of a single video; large-scale understanding of video collections; sharing knowledge to other domains and modalities
  7. Outline

    Structured understanding of a single video; large-scale understanding of video collections; sharing knowledge to other domains and modalities
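  8. Revisit the Image Based Approach

    Context of humans and objects is successfully modeled as CRFs over object nodes O1, O2 and a human node H: $P(O_1^{1,\dots,T}, O_2^{1,\dots,T}, H^{1,\dots,T} \mid \phi_{O_1}^{1,\dots,T}, \phi_{O_2}^{1,\dots,T}, \phi_H^{1,\dots,T}) \sim \exp\left( \sum_{v \in V} E(\phi_v) + \sum_{(v,w) \in E} E(\phi_v, \phi_w) \right)$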
  9. How to Find MAP [Koppula RSS 2013]

    Compute features; define the energy function; solve the combinatorial optimization; activity/object labels for the past and future; sample possible futures
  10. Shortcomings [Koppula RSS 2013]

    Compute features; define the energy function; solve the combinatorial optimization; activity/object labels for the past and future; sample possible futures. We also need probabilities in addition to the MAP solution (the future is unknown). Dimension ~ $10^{6 \times T}$ ~ $10^{3600}$, i.e. $(\#ObjLabels^{\#Objects} \times \#ActLabels)^{Time}$. Complexity: $O\left((T N_O L_O L_A)^3\right)$
  11. Structured Diversity

    Modes are likely and structurally diverse: $y^{t,i} = \arg\max_y bel^t(y)$ s.t. $\Delta(y, y^{t,j}) \geq k \;\; \forall j < i$
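
The greedy selection of likely yet diverse modes can be illustrated with a minimal sketch. It assumes a label space small enough to enumerate, Hamming distance as the dissimilarity measure, and a fixed minimum distance; the function names and the toy belief are hypothetical. The deck itself samples by Lagrangian relaxation [Batra et al], so the brute-force search below is only an illustration of the constraint, not the method used.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two labelings differ."""
    return sum(x != y for x, y in zip(a, b))

def diverse_modes(belief, num_modes, min_dist):
    """Greedily pick high-belief labelings that are at least min_dist apart."""
    modes = []
    for _ in range(num_modes):
        # Keep only labelings far enough from every mode chosen so far.
        candidates = [y for y in belief
                      if all(hamming(y, m) >= min_dist for m in modes)]
        if not candidates:
            break  # the diversity constraint cannot be met any more
        modes.append(max(candidates, key=belief.get))
    return modes

# Toy (unnormalized) belief over three binary labels; values are made up.
labels = list(product([0, 1], repeat=3))
belief = dict(zip(labels, [8, 7, 1, 1, 1, 1, 5, 6]))
modes = diverse_modes(belief, num_modes=3, min_dist=2)
```

With a minimum distance of 2 the second-best labeling (0, 0, 1) is skipped because it is too close to the first mode, which is the behaviour the diversity constraint is meant to enforce.
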
  12. HMM – Recursive Belief Estimation

    HMM derivation [Rabiner]: $bel^t(y) \propto \underbrace{p(y^t = y \mid x^1, \dots, x^t)}_{\alpha_t(y)} \, \underbrace{p(x^{t+1}, \dots, x^T \mid y^t = y)}_{\beta_t(y)}$, with $\alpha_t(y^t) = p(x^t \mid y^t) \sum_{y^{t-1}} \alpha_{t-1}(y^{t-1}) \, p(y^t \mid y^{t-1})$ and $\beta_t(y^t) = \sum_{y^{t+1}} p(x^{t+1} \mid y^{t+1}) \, \beta_{t+1}(y^{t+1}) \, p(y^{t+1} \mid y^t)$
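
The forward and backward recursions can be sketched for a small discrete HMM as follows. This is a plain-Python illustration under stated assumptions: the two-state model, its probabilities and the function name are made up, and no rescaling is done, so long sequences would underflow.

```python
def forward_backward(prior, trans, emit, obs):
    """Return normalized beliefs bel_t(y) for a discrete HMM.

    prior[y]     : p(y_1 = y)
    trans[y][y2] : p(y_{t+1} = y2 | y_t = y)
    emit[y][x]   : p(x_t = x | y_t = y)
    obs          : observed symbols x_1, ..., x_T
    """
    n, T = len(prior), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for y in range(n):
        alpha[0][y] = emit[y][obs[0]] * prior[y]
    # Forward pass: alpha_t(y) = p(x_t | y) * sum_y' alpha_{t-1}(y') p(y | y')
    for t in range(1, T):
        for y in range(n):
            alpha[t][y] = emit[y][obs[t]] * sum(
                alpha[t - 1][yp] * trans[yp][y] for yp in range(n))
    # Backward pass: beta_t(y) = sum_y' p(x_{t+1} | y') beta_{t+1}(y') p(y' | y)
    for t in range(T - 2, -1, -1):
        for y in range(n):
            beta[t][y] = sum(
                emit[yn][obs[t + 1]] * beta[t + 1][yn] * trans[y][yn]
                for yn in range(n))
    beliefs = []
    for t in range(T):
        unnorm = [alpha[t][y] * beta[t][y] for y in range(n)]
        z = sum(unnorm)
        beliefs.append([u / z for u in unnorm])
    return beliefs

# Hypothetical two-state model with two observation symbols.
prior = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
beliefs = forward_backward(prior, trans, emit, [0, 0, 1])
```

Each entry of `beliefs` is the normalized product of the forward and backward messages for one time step.
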
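  13. rCRF: Structured Diversity meets HMM

    Proposition: a belief over the rCRF is itself a CRF. $bel(y^t)$ is proportional to the exponential of a binary term and a unary term. The binary term is $\sum_{(v,w) \in E_t} \left( E_b(v,w) - \tilde{E}_b(v,w) \right)$. The unary term is a sum over $v \in V_t$ built from $E_u(v) - \tilde{E}_u(v)$, the forward message $\sum_{y^{t-1}} \alpha_{t-1}(y^{t-1}) \log p(y_v^t \mid y_v^{t-1})$, and a backward message summed over $y^{t+1}$ that involves $\beta_{t+1}(y^{t+1})$, $bel(y^{t+1})$ and $\log p(y_v^{t+1} \mid y_v^t)$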
  14. rCRF: Algorithm

    Compute the energy function for each frame-wise CRF; run a forward-backward loop for message passing; compute the energy function of the rCRF; sample by using Lagrangian relaxation [Batra et al]
  15. rCRF: Algorithm

    Compute the energy function for each frame-wise CRF; run a forward-backward loop for message passing; compute the energy function of the rCRF; sample by using Lagrangian relaxation [Batra et al]. $O\left((T N_O L_O L_A)^3\right) \to O\left(T (N_O L_O L_A)^3\right)$. Computes probabilities for past/present/future states as a part of the formulation, with no random sampling
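
As a rough worked example of what this change in complexity means, the ratio of the two bounds can be written out. The value T = 600 is an assumption inferred from the state-space estimate earlier in the deck (10 to the power 6 x T is about 10 to the power 3600), and constant factors are ignored, so this is an asymptotic operation-count ratio rather than a measured speedup.

```latex
\frac{(T \, N_O L_O L_A)^3}{T \, (N_O L_O L_A)^3}
  = \frac{T^3 (N_O L_O L_A)^3}{T \, (N_O L_O L_A)^3}
  = T^2,
\qquad T = 600 \;\Rightarrow\; T^2 = 3.6 \times 10^{5}.
```
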
  16. Efficiency and Accuracy Improvement

    rCRF is 30x faster than the state-of-the-art algorithms and runs in real-time. Accurate handling of uncertainty also increases accuracy
  17. Outline

    Structured understanding of a single video; large-scale understanding of video collections; sharing knowledge to other domains and modalities
  18. Summary of the Approach

    We automatically download and filter a large multi-modal activity corpus from YouTube. We learn a multi-modal dictionary. We discover activities by using an NP-Bayes approach
  19. Dictionary Learning (Language)

    We use the tf-idf metric by considering each video as a document. We choose the K most frequent words with max tf-idf. Dictionary for category “Hard Boil an Egg” with K=50: sort, place, water, egg, bottom, fresh, pot, crack, cold, cover, time, overcooking, hot, shell, stove, turn, cook, boil, break, pinch, salt, peel, lid, point, haigh, rules, perfectly, hard, smell, fast, soft, chill, ice, bowl, remove, aside, store, set, temperature, coagulates, yolk, drain, swirl, shake, white, roll, handle, surface, flat
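
A minimal sketch of this word selection, treating each video's text as one document, might look as follows. The exact tf-idf variant is not given in the slides, so the sketch assumes term frequency normalized by document length, a logarithmic inverse document frequency, and ranking by each word's maximum tf-idf over all documents; the function name and the toy subtitles are hypothetical.

```python
import math
from collections import Counter

def top_k_tfidf(documents, k):
    """Rank words by their maximum tf-idf over all documents; return the top k."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    best = {}
    for doc in documents:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            best[word] = max(best.get(word, 0.0), tf * idf)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:k]

# Made-up subtitle snippets standing in for three videos.
videos = [
    "boil the egg in the pot".split(),
    "peel the egg and crack the shell".split(),
    "tie the knot around the collar".split(),
]
top = top_k_tfidf(videos, 3)
```

A word such as "the" that occurs in every document gets an inverse document frequency of zero and therefore never enters the dictionary.
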
  20. Dictionary Learning (Visual)

    [Figure: proposal graphs for Video 1, Video 2 and Video 3, linked by multi-video edges]
  21. Dictionary Learning (Visual)

    [Figure: proposal graphs for Video 1, Video 2 and Video 3, linked by multi-video edges] $\arg\max \sum_{i \in V} \frac{x^{(i)T} A^{(i)} x^{(i)}}{x^{(i)T} x^{(i)}} + \sum_{i \in V} \sum_{j \in N(i)} \frac{x^{(i)T} A^{(i,j)} x^{(j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^T x^{(j)}}$, where $V$ is the set of videos, $N(i)$ is the set of neighbour videos of $i$, and $A$ are similarity matrices. This function is quasi-convex and can be optimized via SGD as $\nabla_{x^{(i)}} = \frac{2 A^{(i)} x^{(i)} - 2 x^{(i)} r^{(i)}}{x^{(i)T} x^{(i)}} + \sum_{j \in N(i)} \frac{A^{(i,j)} x^{(j)} - x^{(j)T} \mathbf{1} r^{(i,j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^T x^{(j)}}$
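
The single-video part of this objective is a Rayleigh quotient, and its gradient step can be sketched in isolation. The code below is an illustration under assumptions: it uses deterministic full-gradient ascent rather than SGD, leaves out the cross-video term, takes r to be the current value of the quotient, and runs on a made-up 3x3 similarity matrix; all function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def rayleigh(A, x):
    """r = x^T A x / x^T x, the single-video term of the objective."""
    return dot(x, matvec(A, x)) / dot(x, x)

def ascend(A, x, step=0.1, iters=500):
    """Gradient ascent using grad = (2 A x - 2 x r) / (x^T x)."""
    for _ in range(iters):
        r = rayleigh(A, x)
        Ax = matvec(A, x)
        xx = dot(x, x)
        grad = [(2 * ax - 2 * xi * r) / xx for ax, xi in zip(Ax, x)]
        x = [xi + step * g for xi, g in zip(x, grad)]
        norm = dot(x, x) ** 0.5
        x = [xi / norm for xi in x]  # rescale; the quotient is scale-invariant
    return x

# Made-up symmetric similarity matrix with largest eigenvalue 3.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.5]]
x = ascend(A, [1.0, 0.5, 0.2])
score = rayleigh(A, x)
```

On this toy matrix the quotient climbs to the largest eigenvalue, which is the maximum a Rayleigh quotient can reach for a symmetric matrix.
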
  22. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model with activity steps $k = 1, \dots, \infty$ and videos $i = 1, \dots, N$; variables $\theta_k$, $w_k$, $f_i$, $\eta_i$, $\pi_i$, $B_0(\cdot)$, $y_1, \dots, y_4$, $z_1, \dots, z_4$] We jointly model activities and videos
  23. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model, $\theta_k$ highlighted] Each activity is modeled as the likelihood of seeing each dictionary item, e.g. the probability of having the word “egg” and having object 5
  24. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model, $f_i$ highlighted] Each video chooses a subset of activities via an Indian Buffet Process
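
How a video can pick a subset of a potentially unbounded set of activities is easiest to see in the standard sequential construction of the Indian Buffet Process. The sketch below draws from the prior only: the concentration parameter `alpha`, the function names and the seed are assumptions for illustration, and it is not the inference procedure of the deck.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's method for drawing a Poisson(lam) variate."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def sample_ibp(num_videos, alpha, seed=0):
    """Draw a binary video-by-activity matrix from an Indian Buffet Process."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of videos using activity k so far
    rows = []
    for i in range(1, num_videos + 1):
        row = []
        # Reuse an existing activity k with probability counts[k] / i.
        for k in range(len(counts)):
            take = 1 if rng.random() < counts[k] / i else 0
            row.append(take)
            counts[k] += take
        # Introduce Poisson(alpha / i) brand-new activities.
        new = sample_poisson(rng, alpha / i)
        row.extend([1] * new)
        counts.extend([1] * new)
        rows.append(row)
    width = len(counts)
    return [row + [0] * (width - len(row)) for row in rows]

F = sample_ibp(num_videos=5, alpha=2.0, seed=1)
```

Row i of `F` marks which activities video i uses; popular activities are reused more often, while the number of columns is not fixed in advance.
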
  25. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model, $\eta_i$ highlighted] And transition probabilities between activities
  26. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model] Given the activities and transition probabilities, it is an HMM
  27. Unsupervised Discovery via NP-Bayes

    [Figure: graphical model] We learn by Gibbs sampling
  28. Evaluation

    Both modalities are complementary and joint modeling is necessary! Multi-video mid-level descriptions are critical for the accuracy
  29. Outline

    Structured understanding of a single video; large-scale unsupervised understanding of human activities; sharing knowledge to other domains and modalities
  30. Can we go further? RoboBrain

    Snapshot of the RoboBrain graph: 45,000 concepts (nodes), 98,000 relations (edges). Connecting knowledge from Internet sources and many projects
  31. How to scale the knowledge

    [Figure: knowledge graph with nodes cup, water, pour, liquid, bottle, kept, fridge, appearance, has_grasp] Input is in the form of “feeds”. A feed is a collection of binary relations, each of the form (Concept, Relation, Concept) …
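
One way to picture a feed is as a list of (concept, relation, concept) triples that get merged into a graph. The sketch below is only an assumption about the data structure, not RoboBrain's actual interface: `merge_feed` is a hypothetical helper, and the triples reuse node names from the slides with illustrative relations.

```python
def merge_feed(graph, feed):
    """Add each (concept, relation, concept) triple of a feed to the graph."""
    for source, relation, target in feed:
        graph.setdefault(source, set()).add((relation, target))
        graph.setdefault(target, set())  # make sure the target node exists
    return graph

graph = {}
# First feed: relations among concepts that are already known.
merge_feed(graph, [
    ("bottle", "has", "appearance"),
    ("bottle", "kept", "fridge"),
    ("water", "is_a", "liquid"),
])
# Second feed: mentions a concept not seen before, so a node is created.
merge_feed(graph, [("table", "has", "appearance")])
```

Because nodes are created on demand, a later feed that mentions a new concept such as "table" extends the graph without touching existing edges.
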
  32. How to scale the knowledge

    [Figure: knowledge graph with nodes cup, water, pour, liquid, bottle, kept, fridge, appearance, has_grasp] Bottle has appearance. Table has appearance
  33. How to scale the knowledge

    [Figure: knowledge graph with nodes cup, water, pour, liquid, bottle, kept, fridge, appearance, has_grasp, and a new node, table] Bottle has appearance. Table has appearance
  34. How to scale the knowledge

    [Figure: knowledge graph with nodes cup, water, pour, liquid, bottle, kept, fridge, appearance, has_grasp, and a new node, table] Cup spatially distributed. Table spatially distributed
  35. Robotics-as-a-Service

    We shared a reliable interface with multiple universities. The Humans to Robots Lab @ Brown University successfully used RoboBrain-as-a-Service
  36. What we have?

    ~30 sec to process 1 sec of video; support for ~10 activities; learning from ~100 videos; covering only indoor/sport environments. What we need? Real-time; any activity; learn from all available information; any environment. rCRF with structured diversity; unsupervised learning w/ NP-Bayes; large-scale learning on YouTube; using multiple domains via RoboBrain (RB)
  37. How can we scale further? Linking more and more modalities

    and domains with efficient and theoretically sound models
  38. Transductive Approach (ongoing/future work)

    How to handle domain shift? Domain adaptation vs. domain invariance. Domain-invariant feature learning followed by induction; induction followed by transformation
  39. Transductive Approach (ongoing/future work)

    Domain-invariant feature learning followed by induction; induction followed by transformation. It is generally hard, sometimes impossible [Vapnik]. What if such a feature does not exist?
  40. Domain Transduction [Ongoing work]

    It might be possible to solve domain transfer with no induction. We are developing a max-margin framework based on coordinate ascent over transduction and domain adaptation
  41. Adaptive Transduction [Future Work]

    Some domains, like YouTube, are intractably large. Can we solve the problem adaptively?
  42. rCRF: Recursive Belief Estimation over CRFs in RGB-D Activity Videos.

    Ozan Sener, Ashutosh Saxena. In RSS 2015. Unsupervised Semantic Parsing of Video Collections. Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena. In ICCV 2015. RoboBrain: Large-Scale Knowledge Engine for Robots. Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S. Koppula. In ISRR 2015. 3D Semantic Parsing of Large-Scale Buildings. Iro Armeni, Ozan Sener et al. (in submission to CVPR 2016). Unsupervised Discovery of Spatio-Temporal Semantic Descriptors. Under preparation for TPAMI submission
  43. Joint Work With

    Ashutosh Saxena, Silvio Savarese, Ozan Sener, Ashesh Jain, Deedy Das, Amir R. Zamir, Aditya Jami, Dipendra K. Misra, Jay Hack, Hema Koppula