Slide 1

Slide 1 text

Unsupervised Discovery of Structure in Human Videos
Ozan Sener
Joint work with Ashutosh Saxena, Silvio Savarese, Ashesh Jain, and Amir Zamir
Committee: Ashutosh Saxena, David Mimno, Emin Gun Sirer

Slide 2

Slide 2 text

We envision robots performing human-like activities while working with humans. (Images courtesy of Sung et al. and Koppula et al.)

Slide 3

Slide 3 text

It requires understanding humans, their environments, objects, and activities.

Slide 4

Slide 4 text

What is the next step? What is he doing? How can I perform activity X?

Slide 5

Slide 5 text

Understanding Videos
Image Centric: video is a trivial extension of images.
Video Centric: we need video-specific features/models [Kantorov CVPR14, Hou et al. ECCV14, THUMOS15, Schmid CVPR15].

Slide 6

Slide 6 text

Understanding Videos
Image Centric:
- Rich models like CRFs
- Easy to model context
- Super-linear in # of frames
- Hard to obtain supervision (~10s of activities, ~10s of objects)
Video Centric:
- Scales linearly in # of frames
- Requires only frame labels
- Does not model the context
- Inefficient (~30 sec for ~1 sec of video)
- Exclusively supervised
- Hard to scale in # of videos (~100s of videos)

Slide 7

Slide 7 text

What do we have?
- ~30 sec to process 1 sec of video
- Support for ~10 activities
- Learning from ~100 videos
- Covering only indoor/sports environments
What do we need?
- Real-time processing
- Any activity
- Learning from all available information
- Any environment

Slide 8

Slide 8 text

Discover, understand, and share the underlying semantic structure of videos.

Slide 9

Slide 9 text

Outline
- Structured understanding of a single video
- Large-scale understanding of video collections
- Sharing knowledge to other domains and modalities

Slide 10

Slide 10 text

Outline
- Structured understanding of a single video
- Large-scale understanding of video collections
- Sharing knowledge to other domains and modalities

Slide 11

Slide 11 text

Revisit the Image-Based Approach
The context of humans ($H$) and objects ($O_1$, $O_2$) is successfully modeled as a CRF:

$$P\left(O_1^{1,\dots,T}, O_2^{1,\dots,T}, H^{1,\dots,T} \,\middle|\, \phi_{O_1}^{1,\dots,T}, \phi_{O_2}^{1,\dots,T}, \phi_{H}^{1,\dots,T}\right) \sim \exp\left(\sum_{v \in V} E(v) + \sum_{(v,w) \in E} E(v, w)\right)$$
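
To make the CRF concrete, here is a minimal, self-contained sketch: a toy three-node graph (two objects, one human) with made-up unary and pairwise energy tables, scored and normalized by brute force. The tables, label count, and node names are illustrative stand-ins, not learned values from the paper.

```python
import itertools
import numpy as np

# Toy frame-wise CRF: three nodes (two objects, one human) with hypothetical
# unary and pairwise energy tables. Real energies come from learned features;
# these random numbers are purely for illustration.
L = 3                                       # labels per node
rng = np.random.default_rng(0)
unary = {v: rng.random(L) for v in ("O1", "O2", "H")}
edges = [("O1", "H"), ("O2", "H")]
pairwise = {e: rng.random((L, L)) for e in edges}

def score(assign):
    """Unnormalized log-probability: sum of unary and pairwise energies."""
    s = sum(unary[v][assign[v]] for v in unary)
    s += sum(pairwise[e][assign[e[0]], assign[e[1]]] for e in edges)
    return s

# Brute-force partition function and MAP (only feasible at toy sizes).
states = [dict(zip(unary, ys)) for ys in itertools.product(range(L), repeat=3)]
Z = sum(np.exp(score(a)) for a in states)
best = max(states, key=score)
print("MAP assignment:", best, "P(MAP) =", np.exp(score(best)) / Z)
```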

Slide 12

Slide 12 text

How to Find the MAP [Koppula RSS 2013]
Compute features → Define the energy function → Solve the combinatorial optimization → Activity/object labels for the past and future → Sample possible futures

Slide 13

Slide 13 text

Shortcomings [Koppula RSS 2013]
Compute features → Define the energy function → Solve the combinatorial optimization → Activity/object labels for the past and future → Sample possible futures
We also need probabilities in addition to the MAP solution (the future is unknown).
State-space dimension $\sim 10^{6T}$ (about $10^{3600}$ for $T \approx 600$ frames), i.e. $\left(\#\text{ObjLabels}^{\#\text{Objects}} \times \#\text{ActLabels}\right)^{T}$; complexity $O\left((T\,N_O\,L_O\,L_A)^3\right)$.
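
As a worked version of that count (the per-frame label sizes below, $L_O = 12$ affordances, $N_O = 5$ objects, $L_A = 10$ sub-activities, are illustrative assumptions, not the paper's exact numbers):

```latex
% Per-frame joint label space, then compounded over T = 600 frames:
\[
  L_O^{N_O} \cdot L_A \;\approx\; 12^{5} \cdot 10 \;\approx\; 2.5 \times 10^{6} \;\sim\; 10^{6},
  \qquad
  \left(10^{6}\right)^{600} \;=\; 10^{3600}.
\]
```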

Slide 14

Slide 14 text

Structured Diversity
Although the state dimensionality is high, the probability mass concentrates on a few modes.

Slide 15

Slide 15 text

Structured Diversity
Modes are likely and structurally diverse:

$$y_{t,i} = \arg\max_{y} \ \mathrm{bel}_t(y) \quad \text{s.t.} \quad \Delta\left(y, y_{t,j}\right) \geq k \quad \forall j < i$$
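
A minimal sketch of the greedy diverse-M-best idea behind this constraint, assuming a Hamming-distance diversity function $\Delta$ and a brute-force enumerable state space (both simplifications; [Batra et al.] handle this with Lagrangian relaxation over real CRFs):

```python
import itertools
import numpy as np

def diverse_modes(belief, m, k):
    """Pick the most likely labeling, then repeatedly pick the best labeling
    whose Hamming distance to all previous picks is at least k. Brute force
    over the state space, so only suitable for toy problems."""
    states = sorted(belief, key=belief.get, reverse=True)
    picks = [states[0]]
    for y in states[1:]:
        if len(picks) == m:
            break
        if all(sum(a != b for a, b in zip(y, p)) >= k for p in picks):
            picks.append(y)
    return picks

# Toy belief over length-4 binary labelings.
rng = np.random.default_rng(0)
space = list(itertools.product((0, 1), repeat=4))
belief = {y: float(rng.random()) for y in space}
print(diverse_modes(belief, m=3, k=2))
```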

Slide 16

Slide 16 text

HMM – Recursive Belief Estimation
HMM derivation [Rabiner]:

$$\mathrm{bel}_t(y) \propto \underbrace{p(y_t = y \mid x_1, \dots, x_t)}_{\alpha_t(y)} \; \underbrace{p(x_{t+1}, \dots, x_T \mid y_t = y)}_{\beta_t(y)}$$

$$\alpha_t(y_t) = p(x_t \mid y_t) \sum_{y_{t-1}} \alpha_{t-1}(y_{t-1}) \, p(y_t \mid y_{t-1})$$

$$\beta_t(y_t) = \sum_{y_{t+1}} p(x_{t+1} \mid y_{t+1}) \, \beta_{t+1}(y_{t+1}) \, p(y_{t+1} \mid y_t)$$
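
These recursions translate directly to code. A minimal numpy forward-backward pass for a discrete HMM (toy parameters, not the activity model itself):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Smoothed beliefs bel_t(y) ∝ α_t(y) β_t(y) for a discrete HMM.
    pi: initial distribution (K,), A: transition matrix (K, K),
    B: emission likelihoods (K, num_symbols), obs: observed symbol ids."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()                   # normalize for stability
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    bel = alpha * beta
    return bel / bel.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(forward_backward(pi, A, B, obs=[0, 0, 1, 1]))
```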

Slide 17

Slide 17 text

rCRF: Structured Diversity meets HMM
Proposition: the belief over an rCRF is a CRF.

$$\mathrm{bel}(y^t) \propto \exp\Bigg[\underbrace{\sum_{(v,w) \in E_t} \Big( E_b(v,w) - \tilde{E}_b(v,w) \Big)}_{\text{binary term}} + \underbrace{\sum_{v \in V_t} \Big( E_u(v) - \tilde{E}_u(v) + \sum_{y^{t-1}} \alpha_{t-1}(y^{t-1}) \log p\big(y^t_v \mid y^{t-1}_v\big) + \sum_{y^{t+1}} \beta_{t+1}(y^{t+1}) \, \mathrm{bel}(y^{t+1}) \log p\big(y^{t+1}_v \mid y^t_v\big) \Big)}_{\text{unary term}}\Bigg]$$

Slide 18

Slide 18 text

rCRF: Algorithm
1. Compute the energy function for each frame-wise CRF
2. Forward-backward loop for message passing
3. Compute the energy function of the rCRF
4. Sample by using Lagrangian relaxation [Batra et al.]
A toy end-to-end sketch of this loop follows.
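
In the sketch below, the frame-wise CRFs are collapsed into per-frame score tables so the code stays runnable, and the Lagrangian-relaxation sampling step is replaced by a naive top-mode readout; names and numbers are illustrative, not the rCRF implementation.

```python
import numpy as np

# Hedged sketch of the rCRF loop above, at toy scale.
rng = np.random.default_rng(1)
T, K = 5, 3                                  # frames, joint labelings per frame
A = rng.dirichlet(np.ones(K), size=K)        # temporal transition model

scores = rng.random((T, K))                  # step 1: frame-wise CRF beliefs

alpha = np.zeros((T, K))                     # step 2: forward-backward loop
beta = np.ones((T, K))
alpha[0] = scores[0] / scores[0].sum()
for t in range(1, T):
    alpha[t] = scores[t] * (alpha[t - 1] @ A)
    alpha[t] /= alpha[t].sum()
for t in range(T - 2, -1, -1):
    beta[t] = A @ (scores[t + 1] * beta[t + 1])
    beta[t] /= beta[t].sum()

bel = alpha * beta                           # step 3: rCRF belief per frame
bel /= bel.sum(axis=1, keepdims=True)

for t in range(T):                           # step 4 stand-in: report top modes
    top = np.argsort(-bel[t])[:2]
    print(f"frame {t}: modes {top}, belief {bel[t][top].round(3)}")
```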

Slide 19

Slide 19 text

rCRF: Algorithm
1. Compute the energy function for each frame-wise CRF
2. Forward-backward loop for message passing
3. Compute the energy function of the rCRF
4. Sample by using Lagrangian relaxation [Batra et al.]
Complexity: $O\left((T\,N_O\,L_O\,L_A)^3\right) \rightarrow O\left(T\,(N_O\,L_O\,L_A)^3\right)$
Computes probabilities for past/present/future states as part of the formulation, with no random sampling.

Slide 20

Slide 20 text

Resulting Belief

Slide 21

Slide 21 text

Efficiency and Accuracy Improvement
rCRF is 30x faster than state-of-the-art algorithms and runs in real time. Accurate handling of uncertainty also increases accuracy.

Slide 22

Slide 22 text

Efficiency and Accuracy Improvement
The resulting belief also stays informative through time.

Slide 23

Slide 23 text

Outline
- Structured understanding of a single video
- Large-scale understanding of video collections
- Sharing knowledge to other domains and modalities

Slide 24

Slide 24 text

Is Unsupervised Learning Possible?
Is there an underlying structure in YouTube "How-To" videos?

Slide 25

Slide 25 text

Is there an underlying structure in YouTube "How-To" videos? 1st Result

Slide 26

Slide 26 text

Is there an underlying structure in YouTube "How-To" videos? 2nd Result

Slide 27

Slide 27 text

Summary of the Approach
- We automatically download and filter a large multi-modal activity corpus from YouTube
- We learn a multi-modal dictionary
- We discover activities using an NP-Bayes approach

Slide 28

Slide 28 text

Dictionary Learning (Language)
We use the tf-idf metric, treating each video as a document, and choose the K words with the highest tf-idf scores.
Dictionary for category "Hard Boil an Egg" with K=50: sort, place, water, egg, bottom, fresh, pot, crack, cold, cover, time, overcooking, hot, shell, stove, turn, cook, boil, break, pinch, salt, peel, lid, point, haigh, rules, perfectly, hard, smell, fast, soft, chill, ice, bowl, remove, aside, store, set, temperature, coagulates, yolk, drain, swirl, shake, white, roll, handle, surface, flat
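
A minimal sketch of this selection step with scikit-learn, on toy transcripts (the real corpus is per-video narration text; "the K words with max tf-idf" is implemented here as the top-K terms by maximum tf-idf across videos, one reading of the slide):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per video transcript; the strings are toy stand-ins.
transcripts = [
    "place the eggs in a pot of cold water and bring to a boil",
    "boil the water then cover the pot and remove from heat",
    "peel the shell off and add a pinch of salt",
]
K = 10
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(transcripts)             # (videos x vocabulary) tf-idf
max_tfidf = X.max(axis=0).toarray().ravel()    # best score per term
terms = vec.get_feature_names_out()
dictionary = [terms[i] for i in max_tfidf.argsort()[::-1][:K]]
print(dictionary)
```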

Slide 29

Slide 29 text

Dictionary Learning (Visual)
[Figure: proposal graphs for Videos 1-3, connected by multi-video edges]

Slide 30

Slide 30 text

Dictionary Learning (Visual)
[Figure: proposal graphs for Videos 1-3, connected by multi-video edges]

$$\arg\max \; \sum_{i \in V} \frac{x^{(i)T} A^{(i)} x^{(i)}}{x^{(i)T} x^{(i)}} + \sum_{i \in V} \sum_{j \in N(i)} \frac{x^{(i)T} A^{(i,j)} x^{(j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}$$

where $V$ is the set of videos, $N(i)$ is the set of neighbour videos of $i$, and the $A$ are similarity matrices. This function is quasi-convex and can be optimized via SGD using the gradient

$$\nabla_{x^{(i)}} = \frac{2 A^{(i)} x^{(i)} - 2 x^{(i)} r^{(i)}}{x^{(i)T} x^{(i)}} + \sum_{j \in N(i)} \frac{A^{(i,j)} x^{(j)} - \left(x^{(j)T}\mathbf{1}\right) r^{(i,j)} \mathbf{1}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}$$

where $r^{(i)}$ and $r^{(i,j)}$ denote the current values of the corresponding ratio terms.
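
A runnable sketch of this ascent for two videos, with random similarity matrices standing in for the real proposal features; the update is simultaneous projected gradient ascent, not the exact SGD schedule of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                          # proposals per video
A_ii = {i: (lambda M: (M + M.T) / 2)(rng.random((n, n))) for i in (0, 1)}
A_ij = rng.random((n, n))                       # cross-video similarities
x = {i: rng.random(n) + 0.1 for i in (0, 1)}
ones = np.ones(n)

def grad(i, j):
    """Gradient of the two ratio terms with respect to x^(i), as above."""
    xi, xj = x[i], x[j]
    r_i = xi @ A_ii[i] @ xi / (xi @ xi)         # within-video Rayleigh quotient
    g = (2 * A_ii[i] @ xi - 2 * xi * r_i) / (xi @ xi)
    Aij = A_ij if i == 0 else A_ij.T
    r_ij = xi @ Aij @ xj / ((xi @ ones) * (ones @ xj))
    g += (Aij @ xj - (xj @ ones) * r_ij * ones) / ((xi @ ones) * (ones @ xj))
    return g

for step in range(200):
    for i, j in ((0, 1), (1, 0)):
        x[i] = np.clip(x[i] + 0.01 * grad(i, j), 1e-6, None)   # stay positive
print("top proposals (video 0):", np.argsort(-x[0])[:5])
```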

Slide 31

Slide 31 text

Learned Dictionaries are Semantically Correct

Slide 32

Slide 32 text

Learned Dictionaries are Semantically Correct

Slide 33

Slide 33 text

Learned Dictionaries: Accuracy vs. Semantic Meaning

Slide 34

Slide 34 text

Representing Each Frame

Slide 35

Slide 35 text

Unsupervised Discovery via NP-Bayes
[Plate diagram: activity steps with parameters θ_k, w_k for k = 1, …, ∞ drawn from base measure B_0(·); videos i = 1, …, N with activity subset f_i, transitions η_i, weights π_i, hidden states z_1, …, z_4 and observations y_1, …, y_4]
We jointly model activities and videos.

Slide 36

Slide 36 text

Unsupervised Discovery via NP-Bayes
[Plate diagram as above, highlighting θ_k]
Each activity step θ_k is modeled as a likelihood of seeing each dictionary item, e.g., the probability of observing the word "egg" and object 5.

Slide 37

Slide 37 text

Unsupervised Discovery via NP-Bayes
[Plate diagram as above, highlighting f_i]
Each video chooses a subset of activity steps f_i via the Indian Buffet Process, as sketched below.
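
A minimal sample from the Indian Buffet Process prior, showing how each new video reuses popular activity steps (probability m_k / i) and adds a Poisson(α / i) number of new ones; the sizes and α are illustrative:

```python
import numpy as np

def sample_ibp(num_videos, alpha, rng):
    """Return a binary matrix F where F[i, k] = 1 iff video i uses step k."""
    counts = []                              # m_k: how many videos use step k
    rows = []
    for i in range(1, num_videos + 1):
        row = [rng.random() < m / i for m in counts]   # reuse existing steps
        new = rng.poisson(alpha / i)                   # open brand-new steps
        counts = [m + int(t) for m, t in zip(counts, row)] + [1] * new
        rows.append(row + [True] * new)
    width = len(counts)
    return np.array([r + [False] * (width - len(r)) for r in rows])

rng = np.random.default_rng(0)
F = sample_ibp(num_videos=6, alpha=2.0, rng=rng)
print(F.astype(int))
```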

Slide 38

Slide 38 text

Unsupervised Discovery via NP-Bayes
[Plate diagram as above, highlighting η_i]
…and transition probabilities η_i between activity steps.

Slide 39

Slide 39 text

Unsupervised Discovery via NP-Bayes
[Plate diagram as above]
Given the activity steps and transition probabilities, it is an HMM.

Slide 40

Slide 40 text

Unsupervised Discovery via NP-Bayes
[Plate diagram as above]
We learn by Gibbs sampling; a toy version of the state-resampling step follows.
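
A toy version of one Gibbs sweep, resampling the hidden states of a small HMM given its parameters (the full sampler also resamples θ, η, and f; this isolates just the state update, with made-up parameters):

```python
import numpy as np

def gibbs_sweep(z, obs, pi, A, B, rng):
    """Resample each z_t from p(z_t | z_{t-1}, z_{t+1}, y_t)
    ∝ p(z_t | z_{t-1}) · p(z_{t+1} | z_t) · p(y_t | z_t)."""
    T = len(z)
    for t in range(T):
        left = pi if t == 0 else A[z[t - 1]]
        right = np.ones(len(pi)) if t == T - 1 else A[:, z[t + 1]]
        p = left * right * B[:, obs[t]]
        z[t] = rng.choice(len(pi), p=p / p.sum())
    return z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1, 1, 1]
z = rng.integers(0, 2, size=len(obs))
for _ in range(50):                       # burn-in sweeps
    z = gibbs_sweep(z, obs, pi, A, B, rng)
print(z)
```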

Slide 41

Slide 41 text

Discovered Activities

Slide 43

Slide 43 text

Evaluation
Both modalities are complementary, and joint modeling is necessary. Multi-video mid-level descriptions are critical for accuracy.

Slide 44

Slide 44 text

Outline
- Structured understanding of a single video
- Large-scale unsupervised understanding of human activities
- Sharing knowledge to other domains and modalities

Slide 45

Slide 45 text

Graph Perspective of Large-Scale Activities: How to make pancakes

Slide 46

Slide 46 text

Graph Perspective of Large-Scale Activities: How to make pancakes

Slide 47

Slide 47 text

Graph Perspective of Large-Scale Activities: How to make pancakes
Egg, Heat Pan, Flip, Beat, Flour

Slide 48

Slide 48 text

Graph Perspective of Large-Scale Activities: How to make pancakes
Egg, Heat Pan, Flip, Beat, Flour

Slide 49

Slide 49 text

Can we go further? RoboBrain
Snapshot of the RoboBrain graph: 45,000 concepts (nodes), 98,000 relations (edges)
Connecting knowledge from Internet sources and many projects

Slide 50

Slide 50 text

How to scale the knowledge
[RoboBrain graph snippet: concepts such as cup, water, liquid, bottle, fridge, linked by relations such as pour, kept, appearance, has_grasp, with CV feature nodes attached]
The input is in the form of "feeds". A feed is a collection of binary relations:
(Concept, Relation, Concept), (Concept, Relation, Concept), …
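
A sketch of what ingesting such a feed could look like, using networkx as a stand-in graph store; the triples are illustrative, patterned on the slide, not an actual RoboBrain feed:

```python
import networkx as nx

# Illustrative feed of binary relations (Concept, Relation, Concept).
feed = [
    ("cup", "has_appearance", "cv_features_1"),
    ("bottle", "has_appearance", "cv_features_2"),
    ("cup", "pour", "water"),
    ("water", "is_a", "liquid"),
    ("bottle", "kept_in", "fridge"),
]

G = nx.MultiDiGraph()                  # multigraph: parallel relations allowed
for head, relation, tail in feed:
    G.add_edge(head, tail, relation=relation)   # nodes merge by name

print(G.number_of_nodes(), "concepts,", G.number_of_edges(), "relations")
print("relations out of 'cup':",
      [(v, d["relation"]) for _, v, d in G.out_edges("cup", data=True)])
```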

Slide 51

Slide 51 text

How to scale the knowledge
[Graph snippet as above]
Feed: "Bottle has appearance", "Table has appearance"

Slide 52

Slide 52 text

How to scale the knowledge
[Graph snippet as above, now including a table node]
Feed: "Bottle has appearance", "Table has appearance"

Slide 53

Slide 53 text

How to scale the knowledge
[Graph snippet as above]
Feed: "Cup spatially distributed", "Table spatially distributed"

Slide 54

Slide 54 text

System Architecture

Slide 55

Slide 55 text

How we processed 25k videos in a day

Slide 56

Slide 56 text

Robotics-as-a-Service
We provided a reliable interface to multiple universities. The Humans to Robots Lab @ Brown University successfully used RoboBrain-as-a-Service.

Slide 57

Slide 57 text

What do we have?
- ~30 sec to process 1 sec of video
- Support for ~10 activities
- Learning from ~100 videos
- Covering only indoor/sports environments
What do we need?
- Real-time processing → rCRF with structured diversity
- Any activity → unsupervised learning w/ NP-Bayes
- Learning from all available information → large-scale learning on YouTube
- Any environment humans go to → using multiple domains via RoboBrain

Slide 58

Slide 58 text

How can we scale further? Linking more and more modalities and domains with efficient and theoretically sound models

Slide 59

Slide 59 text

Cross-Domain Information
Videos with no structure ↔ images and words with structure
How to transfer knowledge?

Slide 60

Slide 60 text

Transductive Approach (ongoing/future work)
How to handle domain shift? Domain adaptation vs. domain invariance:
- Domain-invariant feature learning followed by induction
- Induction followed by transformation

Slide 61

Slide 61 text

Transductive Approach (ongoing/future work)
- Domain-invariant feature learning followed by induction: generally hard, sometimes impossible [Vapnik]. What if such a feature does not exist?
- Induction followed by transformation

Slide 62

Slide 62 text

Domain Transduction [ongoing work]
It might be possible to solve domain transfer with no induction. We are developing a max-margin framework based on coordinate ascent between transduction and domain adaptation.

Slide 63

Slide 63 text

Adaptive Transduction [Future Work]
Some domains, such as YouTube, are intractably large. Can we solve the problem adaptively?

Slide 64

Slide 64 text

Adaptive Transduction – Robot in the Loop
Sampled Videos → Domain Transduction → Adaptive Sampling → …

Slide 65

Slide 65 text

rCRF: Recursive Belief Estimation over CRFs in RGB-D Activity Videos. Ozan Sener, Ashutosh Saxena. RSS 2015.
Unsupervised Semantic Parsing of Video Collections. Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena. ICCV 2015.
RoboBrain: Large-Scale Knowledge Engine for Robots. Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S. Koppula. ISRR 2015.
3D Semantic Parsing of Large-Scale Buildings. Iro Armeni, Ozan Sener, et al. In submission to CVPR 2016.
Unsupervised Discovery of Spatio-Temporal Semantic Descriptors. In preparation for TPAMI submission.

Slide 66

Slide 66 text

Joint Work With Ashutosh Saxena Silvio Savarese Ozan Sener Ashesh Jain Deedy Das Amir R. Zamir Aditya Jami Dipendra K. Misra Jay Hack Hema Koppula

Slide 67

Slide 67 text

Thank You