Context-modeling approaches: model the context, but are super-linear in # of frames, and supervision is hard to obtain (~10s of activities, ~10s of objects). Video-centric approaches: scale linearly in # of frames and require only frame labels, but do not model the context. Overall: inefficient (~30 sec of computation for ~1 sec of video), exclusively supervised, and hard to scale in # of videos (~100s of videos).
Current methods handle ~10 activities, learning from ~100 videos, covering only indoor/sport environments. What do we need? Real-time operation, any activity, learning from all available information, any environment.
Humans and objects are successfully modeled as CRFs:
\[
P\!\left(O_1^{1,\dots,T},\, O_2^{1,\dots,T},\, H^{1,\dots,T} \,\middle|\, x_{O_1}^{1,\dots,T},\, x_{O_2}^{1,\dots,T},\, x_{H}^{1,\dots,T}\right)
\;\sim\; \exp\!\left(\sum_{v \in \mathcal{V}} E(v) \;+\; \sum_{(v,w) \in \mathcal{E}} E(v, w)\right)
\]
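For concreteness, one plausible instantiation of the graph above (this specific node/edge layout is an assumption for illustration, not spelled out on the slide) has one node per object and per human in every frame, with spatial edges inside a frame and temporal edges between consecutive frames:
\[
\mathcal{V} = \{O_1^t, O_2^t, H^t\}_{t=1}^{T}, \qquad
\mathcal{E} = \underbrace{\{(O_1^t, O_2^t),\, (O_1^t, H^t),\, (O_2^t, H^t)\}_{t}}_{\text{spatial edges}}
\;\cup\; \underbrace{\{(v^t, v^{t+1}) : v \in \{O_1, O_2, H\}\}_{t}}_{\text{temporal edges}}
\]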
The combinatorial optimization: infer activity/object labels for the past and the future, and sample possible futures. We also need probabilities in addition to the MAP solution, since the future is unknown. The dimension of the joint label space is $\left(\#\text{ObjLabels}^{\#\text{Objects}} \times \#\text{ActLabels}\right)^{\text{Time}} \sim 10^{6 \times T} \sim 10^{3600}$, and the complexity is $O\!\left((T\, N_O\, L_O\, L_A)^3\right)$.
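As a rough sanity check of those numbers (the specific counts below are illustrative assumptions, not taken from the slide): with about 10 object labels, 5 objects, 10 activity labels, and $T \approx 600$ frames,
\[
\left(\#\text{ObjLabels}^{\#\text{Objects}} \times \#\text{ActLabels}\right)^{T}
\;\approx\; \left(10^{5} \times 10\right)^{600}
\;=\; \left(10^{6}\right)^{600}
\;=\; 10^{3600},
\]
far too many joint configurations to enumerate, hence the need for the factorized message passing that follows.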
\[
p(y_t = y \mid x_1, \dots, x_T) \;\propto\; \underbrace{p(y_t = y \mid x_1, \dots, x_t)}_{\alpha_t(y)} \;\; \underbrace{p(x_{t+1}, \dots, x_T \mid y_t = y)}_{\beta_t(y)}
\]
\[
\alpha_t(y_t) = p(x_t \mid y_t) \sum_{y_{t-1}} \alpha_{t-1}(y_{t-1})\, p(y_t \mid y_{t-1}),
\qquad
\beta_t(y_t) = \sum_{y_{t+1}} p(x_{t+1} \mid y_{t+1})\, \beta_{t+1}(y_{t+1})\, p(y_{t+1} \mid y_t)
\]
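A minimal numpy sketch of these forward-backward recursions, assuming a discrete label set; the array layout (trans[i, j] = p(y_{t+1}=j | y_t=i), emit[t, i] = p(x_t | y_t=i)) and the function name are illustrative, not the slide's implementation.

```python
import numpy as np

def forward_backward(trans, emit, prior):
    """Forward-backward recursions for a discrete chain.

    trans[i, j] = p(y_{t+1}=j | y_t=i), emit[t, i] = p(x_t | y_t=i),
    prior[i] = p(y_1=i).  Returns smoothed marginals p(y_t | x_1..x_T).
    """
    T, L = emit.shape
    alpha = np.zeros((T, L))
    beta = np.ones((T, L))

    # Forward pass: alpha_t(y) = p(x_t|y) * sum_{y'} alpha_{t-1}(y') p(y|y')
    alpha[0] = prior * emit[0]
    for t in range(1, T):
        alpha[t] = emit[t] * (alpha[t - 1] @ trans)

    # Backward pass: beta_t(y) = sum_{y'} p(x_{t+1}|y') beta_{t+1}(y') p(y'|y)
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])

    # Smoothed posterior p(y_t = y | x_1..x_T) is proportional to alpha_t(y) * beta_t(y)
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```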
Message-passing loop: compute the energy function of the rCRF, then sample by using Lagrangian relaxation [Batra et al.]. This reduces the complexity from $O\!\left((T\, N_O\, L_O\, L_A)^3\right)$ to $O\!\left(T\, (N_O\, L_O\, L_A)^3\right)$, and computes probabilities for past/present/future states as part of the formulation, with no random sampling.
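One way to read "sample by using Lagrangian relaxation [Batra et al.]" is the diverse M-best scheme of Batra et al., in which each additional sample is a MAP problem whose diversity constraint against earlier solutions is dualized into the objective; the multiplier $\lambda$ and dissimilarity $\Delta$ below follow that reading and are not defined on the slide:
\[
y^{(m)} \;=\; \arg\max_{y}\; \Big[\, -E(y) \;+\; \lambda \sum_{m' < m} \Delta\!\big(y,\, y^{(m')}\big) \Big],
\]
so each new structured sample trades off low rCRF energy against being different from the samples already drawn.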
Each video is treated as a document. We choose the K words with maximum tf-idf. Dictionary for the category "Hard Boil an Egg" with K=50: sort, place, water, egg, bottom, fresh, pot, crack, cold, cover, time, overcooking, hot, shell, stove, turn, cook, boil, break, pinch, salt, peel, lid, point, haigh, rules, perfectly, hard, smell, fast, soft, chill, ice, bowl, remove, aside, store, set, temperature, coagulates, yolk, drain, swirl, shake, white, roll, handle, surface, flat.
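A minimal scikit-learn sketch of building such a per-category dictionary; treating each transcript as one document is from the slide, while the function name, the stop-word choice, and the default K are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_dictionary(transcripts, K=50):
    """Pick the K words with the highest tf-idf over one category's videos.

    transcripts: list of strings, one transcript ("document") per video.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(transcripts)      # shape: (n_videos, n_words)
    scores = tfidf.max(axis=0).toarray().ravel()       # best tf-idf per word
    words = vectorizer.get_feature_names_out()
    top = scores.argsort()[::-1][:K]
    return [words[i] for i in top]
```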
[Proposal graphs for Video 1, Video 2, and Video 3, linked across videos.] The joint objective is
\[
\arg\max_{x}\; \sum_{i \in V} \frac{x^{(i)T} A^{(i)} x^{(i)}}{x^{(i)T} x^{(i)}}
\;+\; \sum_{i \in V} \sum_{j \in N(i)} \frac{x^{(i)T} A^{(i,j)} x^{(j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}
\]
where V is the set of videos, N(i) is the set of neighbour videos of i, and the A are similarity matrices. This function is quasi-convex and can be optimized via SGD, with gradient
\[
\nabla_{x^{(i)}} \;=\; \frac{2 A^{(i)} x^{(i)} - 2\, x^{(i)}\, r^{(i)}}{x^{(i)T} x^{(i)}}
\;+\; \sum_{j \in N(i)} \frac{A^{(i,j)} x^{(j)} - \mathbf{1}\,\big(x^{(j)T}\mathbf{1}\big)\, r^{(i,j)}}{x^{(i)T} \mathbf{1}\mathbf{1}^{T} x^{(j)}}
\]
where $r^{(i)}$ and $r^{(i,j)}$ denote the current values of the corresponding ratio terms.
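A minimal numpy sketch of one gradient-ascent step on this objective for a single video's vector $x^{(i)}$, holding the neighbours fixed; relaxing x to a real-valued vector, the learning rate, and the function name are illustrative assumptions.

```python
import numpy as np

def grad_ascent_step(x, A_self, neighbours, lr=0.1):
    """One gradient-ascent step on video i's (relaxed) proposal-selection vector.

    A_self: within-video similarity matrix A^(i).
    neighbours: list of (A_ij, x_j) pairs for the neighbour videos j in N(i).
    Implements the gradient written above, with x relaxed to a real vector.
    """
    ones = np.ones_like(x)
    # Within-video term: gradient of the Rayleigh quotient x^T A x / x^T x.
    denom = x @ x
    r_i = (x @ A_self @ x) / denom
    grad = (2 * A_self @ x - 2 * x * r_i) / denom
    # Cross-video terms: gradient of x^T A_ij x_j / (x^T 1 1^T x_j).
    for A_ij, x_j in neighbours:
        denom_ij = (x @ ones) * (ones @ x_j)
        r_ij = (x @ A_ij @ x_j) / denom_ij
        grad += (A_ij @ x_j - ones * (x_j @ ones) * r_ij) / denom_ij
    return x + lr * grad
```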
[Plate diagram of the nonparametric model: k-th activity step ($w_k$, $\theta_k$); i-th video ($\pi_i$, $\eta_i$, $f_i$); base measure $B_0(\cdot)$; latents $z_1,\dots,z_4$ with observations $y_1,\dots,y_4$.]
Each activity is modeled as a likelihood of seeing each dictionary item, e.g. the probability of having the word "egg" and having object 5 ($\theta_k$).
Each video chooses a subset of activity steps via the Indian Buffet Process ($f_i$).
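A minimal sketch of drawing the video-to-activity-step assignment matrix from an Indian Buffet Process prior, the construction implied by "each video chooses a subset via the IBP"; the concentration parameter alpha and the function name are illustrative assumptions.

```python
import numpy as np

def sample_ibp(num_videos, alpha=2.0, rng=None):
    """Draw a binary video-by-activity-step matrix F from an IBP prior.

    F[i, k] = 1 means video i uses activity step k.  Existing steps are
    reused with probability (#videos already using them) / (i + 1), and
    each video also introduces Poisson(alpha / (i + 1)) new steps.
    """
    rng = np.random.default_rng(rng)
    dishes = []                       # dishes[k] = #videos using step k so far
    rows = []
    for i in range(num_videos):
        row = [1 if rng.random() < c / (i + 1) else 0 for c in dishes]
        new = rng.poisson(alpha / (i + 1))
        row += [1] * new
        dishes = [c + r for c, r in zip(dishes, row)] + [1] * new
        rows.append(row)
    F = np.zeros((num_videos, len(dishes)), dtype=int)
    for i, row in enumerate(rows):
        F[i, :len(row)] = row
    return F
```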
[Example knowledge-graph snippet: concept nodes (cup, water, pour, liquid, bottle, fridge, CV features) connected by relations such as kept, appearance, and has_grasp.]
Input is in the form of "feeds". A feed is a collection of binary relations: (Concept, Relation, Concept), (Concept, Relation, Concept), ...
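A minimal sketch of representing one such feed as (concept, relation, concept) triples and merging it into an adjacency map; the specific triples are made up from the node names above for illustration, and this is a stand-in data structure, not the actual RoboBrain backend or API.

```python
from collections import defaultdict

# One feed: a collection of binary relations over concepts.
# These particular triples are illustrative, built from the example nodes above.
feed = [
    ("cup", "kept", "fridge"),
    ("bottle", "has_grasp", "grasp_pose"),     # "grasp_pose" is a hypothetical node
    ("water", "appearance", "cv_features"),    # "cv_features" is a hypothetical node
]

# Merge the feed into a simple knowledge graph: concept -> [(relation, concept), ...]
graph = defaultdict(list)
for head, relation, tail in feed:
    graph[head].append((relation, tail))

print(dict(graph))
```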
Recap: existing methods handle ~10 activities, learning from ~100 videos, covering only indoor/sport environments. What do we need? Real-time operation, any activity, learning from all available information, any environment humans go to. Our contributions: rCRF with structured diversity; unsupervised learning with NP-Bayes; large-scale learning on YouTube; using multiple domains via RoboBrain (RB).
Domain transfer with no induction: we are developing a max-margin framework based on coordinate ascent over transduction and domain adaptation [ongoing work].
rCRF: Recursive Belief Estimation over CRFs in RGB-D Activity Videos. Ozan Sener, Ashutosh Saxena. In RSS 2015.
Unsupervised Semantic Parsing of Video Collections. Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena. In ICCV 2015.
RoboBrain: Large-Scale Knowledge Engine for Robots. Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S. Koppula. In ISRR 2015.
3D Semantic Parsing of Large-Scale Buildings. Iro Armeni, Ozan Sener, et al. In submission to CVPR 2016.
Unsupervised Discovery of Spatio-Temporal Semantic Descriptors. Under preparation for TPAMI submission.