Slide 1

Slide 1 text

Action Recognition
September 3, 2018
Katsunori Ohnishi
DeNA Co., Ltd.
Copyright (C) 2018 DeNA Co.,Ltd. All Rights Reserved.

Slide 2

Slide 2 text

Outline
• Self-introduction
• Action recognition
• Datasets
• Competitions
• Methods
  • Before Deep
  • Deep
  • Temporal Aggregation
• Tips
• Video generation
• Summary

Slide 3

Slide 3 text

Self-introduction
• Katsunori Ohnishi
  • Twitter: @ohnishi_ka
• Background
  • Apr 2014 to Sep 2017: B4 to M2.5, Computer Vision research
    • Web: http://katsunoriohnishi.github.io/
  • CVPR2016 (spotlight oral, acceptance rate=9.7%): egocentric vision (wrist-mounted camera)
  • ACMMM2016 (poster, acceptance rate=30%): action recognition (state-of-the-art)
  • AAAI2018 (oral, acceptance rate=10.9%): video generation (FTGAN)
  • Oct 2017 to present: DeNA AI
    • → https://www.wantedly.com/projects/209980

Slide 4

Slide 4 text

Action Recognition
• Image classification for video
• action recognition = human action recognition
  • Also: fine-grained, egocentric

Figure labels: Fine-grained, egocentric, Dog-centric, Action recognition, RGBD

• [C. Wolf+, CVIU14] Evaluation of video activity localizations integrating quality and quantity measurements
• [K. Ohnishi+, CVPR16] Recognizing Activities of Daily Living with a Wrist-mounted Camera
• [M. Rohrbach+, CVPR12] A Database for Fine Grained Activity Detection of Cooking Activities
• [Y. Iwashita+, ICPR14] First-Person Animal Activity Recognition from Egocentric Videos
• [C. Schuldt+, ICPR04] Recognizing Human Actions: A Local SVM Approach
• [H. Kuehne+, ICCV11] HMDB: A Large Video Database for Human Motion Recognition
• [K. Soomro+, arXiv2012] UCF101: A dataset of 101 human actions classes from videos in the wild

Slide 5

Slide 5 text

Datasets
• KTH, UCF101, HMDB51
  • UCF101: 101 classes, 13320 videos…
• ActivityNet, Kinetics, YouTube-8M
• AVA, Moments in Time, SLAC

Figure: UCF101

Slide 6

Slide 6 text

Competitions
• YouTube-8M Video Understanding Challenge
  • https://www.kaggle.com/c/youtube8m
  • Held as a workshop at CVPR17 and ECCV18; hosted on Kaggle
  • Frame-level features are provided
    • Kaggle, action recognition
• ActivityNet Challenge
  • http://activity-net.org/challenges/2018/
  • ActivityNet tasks
    • Temporal Proposal (T)
    • Temporal localization (T)
    • Video Captioning
  • Tasks on other datasets
    • Kinetics: classification (human action)
    • AVA: Spatio-temporal localization (XYT)
    • Moments-in-time: classification (event)

Slide 7

Slide 7 text

Before CNN
• 2000s
  • SIFT
  • local descriptor → coding → global feature → classifier
• Methods
  • STIP [I. Laptev, IJCV04]
  • Dense Trajectory [H. Wang+, ICCV11]
  • Improved Dense Trajectory [H. Wang+, ICCV13]

Slide 8

Slide 8 text

Before CNN
• Improved Dense Trajectories (iDT) [H. Wang+, ICCV13]
  • Dense Trajectories [H. Wang+, ICCV11]

Figure labels: optical flow; foreground optical flow; improved dense trajectories (green), background dense trajectories (white)

Slide 9

Slide 9 text

Before CNN
• SIFT
• Fisher Vector
  • http://www.isi.imi.i.u-tokyo.ac.jp/~harada/pdf/SSII_harada20120608.pdf
  • https://www.slideshare.net/takao-y/fisher-vector

Pipeline: input → Local descriptor (iDT) → Video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → Classifier (SVM [F. Pedregosa+, JMLR11])

• [F. Perronnin, CVPR07] Fisher kernels on visual vocabularies for image categorization

Slide 10

Slide 10 text

CNN-based action recognition
• CNN
• Two-stream
  • Hand-crafted feature
• 3D Convolution
  • C3D
  • C3D, Two-stream
  • 3D conv
• Optical flow

Slide 11

Slide 11 text

CNN-based action recognition: CNN
• Spatio-temporal ConvNet [A. Karpathy+, CVPR 14]
  • CNN

Slide 12

Slide 12 text

CNN-based action recognition: Two-stream
• Two-stream [K. Simonyan+, NIPS14]
  • 2D CNN
    • Spatial-stream: RGB (input: RGB)
    • Temporal-stream: Optical flow (input: optical flow 10 frames)
    • Frame-by-frame
  • Hand-crafted feature, CNN

                          UCF101   HMDB51
iDT                       85.9%    57.2%
Spatio-temporal ConvNet   65.4%    -
RGB-stream                73.0%    40.5%
Flow-stream               83.7%    54.6%
Two-stream                88.0%    59.4%

• 2D CNN: ImageNet pre-trained
• [K. Simonyan+, NIPS14] Two-stream convolutional networks for action recognition in videos
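The spatial and temporal streams each produce their own class scores, which then have to be combined into one prediction. The following is a minimal sketch of late fusion by averaging softmax probabilities, assuming equal stream weights. The function names, the `flow_weight` parameter, and the 3-class scores are hypothetical and are not taken from the slides or the paper.

```python
import math

def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion: weighted average of the per-class probabilities of both streams."""
    rgb = softmax(rgb_logits)
    flow = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f for r, f in zip(rgb, flow)]

def predict(rgb_logits, flow_logits, flow_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = fuse_two_stream(rgb_logits, flow_logits, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)

# Hypothetical scores for 3 classes: RGB prefers class 0, flow prefers class 1.
rgb_logits = [2.0, 1.0, 0.1]
flow_logits = [0.5, 3.0, 0.2]
fused = fuse_two_stream(rgb_logits, flow_logits)
label = predict(rgb_logits, flow_logits)
```

In this toy case the flow stream is more confident than the RGB stream, so the fused prediction follows the flow stream.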

Slide 13

Slide 13 text

CNN-based action recognition: 3D convolution
• C3D [D. Tran+, ICCV15]
  • CNN with 3D convolution on 16-frame inputs
    • XYT 3D convolution
  • UCF101, pre-training
  • ICCV15, arXiv, reject

              UCF101   HMDB51
iDT           85.9%    57.2%
Two-stream    88.0%    59.4%
C3D (1 net)   82.3%    -

• 3D conv
• [D. Tran+, ICCV15] Learning Spatiotemporal Features with 3D Convolutional Networks
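To make the XYT 3D convolution concrete, here is a naive single-channel sketch in plain Python. The 16-frame clip length comes from the slide. The 8x8 frame size, the 3x3x3 kernel of ones, and the function name are illustrative assumptions; practical implementations are multi-channel and run on GPUs.

```python
def conv3d_valid(clip, kernel):
    """Single-channel 3D convolution (cross-correlation, no padding, stride 1).

    clip:   nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time as well as over space.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy clip: 16 frames of 8x8 pixels, all ones.
clip = [[[1.0] * 8 for _ in range(8)] for _ in range(16)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
out = conv3d_valid(clip, kernel)
shape = (len(out), len(out[0]), len(out[0][0]))
```

Without padding, the output shrinks along the time axis exactly as it does along the spatial axes.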

Slide 14

Slide 14 text

CNN-based action recognition: 3D convolution
• P3D [Z. Qiu+, ICCV17]
  • C3D
  • 3D conv → 2D conv (XY) + 1D conv (T)
  • pre-training

                       UCF101   HMDB51
iDT                    85.9%    57.2%
Two-stream (AlexNet)   88.0%    59.4%
P3D (ResNet)           88.6%    -

Figure labels: Spatial 2D conv, Temporal 1D conv

• [Z. Qiu+, ICCV17] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
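One way to see the effect of splitting a 3D conv into a 2D conv (XY) plus a 1D conv (T) is to count weights. The sketch below assumes a serial arrangement (spatial conv followed by temporal conv), 3x3x3 kernels, and 64 input and output channels. These numbers and function names are illustrative assumptions, and P3D itself describes several block variants.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weights in a full 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw

def factorized_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weights when the 3D conv is split into a 2D spatial conv (1 x kh x kw)
    followed by a 1D temporal conv (kt x 1 x 1), bias ignored."""
    spatial = c_in * c_out * kh * kw
    temporal = c_out * c_out * kt
    return spatial + temporal

full = conv3d_params(64, 64)        # 64 * 64 * 27
split = factorized_params(64, 64)   # 64 * 64 * 9 + 64 * 64 * 3
ratio = split / full
```

Under these assumptions the factorized block needs fewer than half the weights of the full 3D convolution.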

Slide 15

Slide 15 text

CNN-based action recognition: 3D convolution
• P3D [Z. Qiu+, ICCV17]
  • C3D
  • 3D conv → 2D conv (XY) + 1D conv (T)
  • pre-training

                         UCF101   HMDB51
iDT                      85.9%    57.2%
Two-stream (AlexNet)     88.0%    59.4%
P3D (ResNet)             88.6%    -
Two-stream (ResNet152)   91.8%

Figure labels: Spatial 2D conv, Temporal 1D conv

• 3D conv, again
• [Z. Qiu+, ICCV17] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Slide 16

Slide 16 text

CNN-based action recognition: 3D convolution
• C3D, P3D: 3D conv
• 3D conv [K. Hara+, CVPR18]

Figure labels: 2012, 2011, 2015, 2017

• [K. Hara+, CVPR18] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Slide 17

Slide 17 text

CNN-based action recognition: 3D convolution
• C3D, P3D: 3D conv
• 3D conv [K. Hara+, CVPR18]

Figure labels: 2012, 2011, 2015, 2017; 2017: Kinetics!

• [K. Hara+, CVPR18] Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Slide 18

Slide 18 text

CNN-based action recognition: 3D convolution
• Kinetics

Slide 19

Slide 19 text

CNN-based action recognition: 3D convolution
• I3D [J. Carreira+, ICCV17]
  • From DeepMind, with the Kinetics dataset
  • Inception with 3D conv
  • 64 GPUs for training, 16 GPUs for prediction
  • state-of-the-art
    • RGB
    • Two-stream: optical flow, score

                 UCF101   HMDB51
RGB-I3D          95.6%    74.8%
Flow-I3D         96.7%    77.1%
Two-stream I3D   98.0%    80.7%

• [J. Carreira+, ICCV17] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Slide 20

Slide 20 text

CNN-based action recognition: 3D convolution
• I3D [J. Carreira+, ICCV17]
  • From DeepMind, with the Kinetics dataset
  • Inception with 3D conv
  • 64 GPUs for training, 16 GPUs for prediction
  • state-of-the-art
    • RGB
    • Two-stream: optical flow, score

                 UCF101   HMDB51
RGB-I3D          95.6%    74.8%
Flow-I3D         96.7%    77.1%
Two-stream I3D   98.0%    80.7%

… ?

• [J. Carreira+, ICCV17] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Slide 21

Slide 21 text

CNN-based action recognition: 3D convolution
• I3D, Two-stream
  • XY, T
  • 3D conv

Figure label: time

Slide 22

Slide 22 text

CNN-based action recognition: 3D convolution
• 3D convolution [D.A. Huang+, CVPR18]
  • 3D CNN
  • Two-stream I3D: Optical flow, 3D conv
• [D.A. Huang+, CVPR18] What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

Slide 23

Slide 23 text

CNN-based action recognition: 3D convolution
• 3D conv
  • CVPR18
  • CVPR/ICCV/ECCV
  • 3D conv
    • GPU

Slide 24

Slide 24 text

CNN-based action recognition: Optical flow
• Optical flow [L. Sevilla-Lara+, CVPR18]
  • Optical flow
  • Optical flow accuracy (EPE) and action recognition accuracy
  • flow accuracy and action recognition accuracy
  • Optical flow and appearance
    • Optical flow accuracy
• [L. Sevilla-Lara+, CVPR18] On the Integration of Optical Flow and Action Recognition
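EPE (end-point error) is the usual accuracy measure for optical flow: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch follows; the function name and the 2x2 toy flow fields are hypothetical.

```python
import math

def end_point_error(pred, gt):
    """Average end-point error between two flow fields.

    pred, gt: lists of rows, each row a list of (u, v) flow vectors.
    """
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for (pu, pv), (gu, gv) in zip(pred_row, gt_row):
            # Distance between the end points of the two flow vectors.
            total += math.hypot(pu - gu, pv - gv)
            count += 1
    return total / count

# Toy 2x2 flow fields: only one pixel is wrong, by the vector (3, 4).
gt = [[(1.0, 0.0), (0.0, 0.0)],
      [(0.0, 2.0), (0.0, 0.0)]]
pred = [[(1.0, 0.0), (3.0, 4.0)],
        [(0.0, 2.0), (0.0, 0.0)]]
epe = end_point_error(pred, gt)  # (0 + 5 + 0 + 0) / 4
```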

Slide 25

Slide 25 text

• AVA [C. Gu+, CVPR18]: XYZT bounding box, human action localization
• Moments-in-time [M. Monfort+, arXiv2018]: 3-second videos
• Kinetics-600 [W. Kay+, arXiv2017]: Kinetics, 400 → 600 classes

Slide 26

Slide 26 text

Temporal Aggregation
• 2D conv: frame-by-frame; 3D conv
  • (100 frames, 232 frames, 50 frames)

Slide 27

Slide 27 text

Temporal Aggregation
• Score aggregation
• LSTM
  • FC
  • e.g. fencing → fencing → …

Diagram: (CNN → LSTM → FC) for each time step, chained over time

• CVPR, ACMMM, AAAI
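A simple way to aggregate scores over time is to average the per-frame class scores, or take their per-class maximum, which works for any number of frames. This is a minimal sketch under that assumption; the function name, the `mode` parameter, and the toy scores are hypothetical.

```python
def aggregate_scores(frame_scores, mode="mean"):
    """Collapse per-frame class scores (any number of frames) into one vector."""
    if not frame_scores:
        raise ValueError("need at least one frame")
    num_classes = len(frame_scores[0])
    # Regroup from [frame][class] to [class][frame].
    per_class = [[s[c] for s in frame_scores] for c in range(num_classes)]
    if mode == "mean":
        return [sum(v) / len(v) for v in per_class]
    if mode == "max":
        return [max(v) for v in per_class]
    raise ValueError("mode must be 'mean' or 'max'")

# Toy example: 3 frames, 2 classes.
frame_scores = [[0.1, 0.9], [0.3, 0.7], [0.8, 0.2]]
video_mean = aggregate_scores(frame_scores, "mean")
video_max = aggregate_scores(frame_scores, "max")
```

Both variants ignore frame order, which is the limitation that order-aware models such as an LSTM are meant to address.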

Slide 28

Slide 28 text

Temporal Aggregation

Pipeline: input → Local descriptor (iDT) → Video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → Classifier (SVM [F. Pedregosa+, JMLR11])

• Fisher Vector
  • CNN, SIFT, GMM
  • FV, VLAD [H. Jegou+, CVPR10]
• [H. Jegou+, CVPR10] Aggregating local descriptors into a compact image representation

Slide 29

Slide 29 text

Temporal Aggregation
• LCD [Z. Xu+, CVPR15]
  • VGG16 pool5: each XY location gives a 512dim feature
    • 224x224 input: 7x7=49 features
    • VLAD over these gives the global feature

Pipeline: input → CNN → Pool5 (e.g. 2x2x512) → Local descriptors → VLAD → global feature → SVM

• [Z. Xu+, CVPR15] A discriminative CNN video representation for event detection
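VLAD turns a variable number of local descriptors into one fixed-length global feature: each descriptor is assigned to its nearest cluster center, the residuals are summed per center, and the result is concatenated and L2-normalized. The sketch below uses toy 2-D descriptors and two fixed centers as assumptions; in the LCD setting the descriptors would be the 512-dim pool5 features, and the centers would be learned beforehand (for example with k-means, which is not shown).

```python
import math

def vlad(descriptors, centers):
    """VLAD: sum residuals to the nearest center, concatenate, L2-normalize."""
    dim = len(centers[0])
    acc = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Hard-assign the descriptor to its nearest center (squared Euclidean distance).
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in centers]
        k = dists.index(min(dists))
        for i in range(dim):
            acc[k][i] += d[i] - centers[k][i]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy example: 2-D descriptors and 2 centers (hypothetical values).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
video_feature = vlad(descriptors, centers)
```

The output length is (number of centers) x (descriptor dimension), independent of how many frames or locations contributed descriptors.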

Slide 30

Slide 30 text

Temporal Aggregation
• ActionVLAD [R. Girdhar+, CVPR17]
  • Based on NetVLAD [R. Arandjelović+, CVPR16]
    • NetVLAD: VLAD as an NN layer; cluster assignment is made soft with softmax
    • VLAD, LCD
  • VLAD
  • End2end (CNN)
• [R. Girdhar+, CVPR17] ActionVLAD: Learning spatio-temporal aggregation for action classification
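The softmax-based cluster assignment can be sketched as follows: instead of picking the single nearest center, each descriptor gets a weight for every center, which keeps the aggregation differentiable. This sketch assumes weights derived from negative squared distances scaled by a hypothetical `alpha`; the learnable parameters and normalization steps of the actual layers are omitted.

```python
import math

def soft_assign(descriptor, centers, alpha=1.0):
    """Softmax over negative squared distances: one weight per cluster, summing to 1."""
    logits = [-alpha * sum((a - b) ** 2 for a, b in zip(descriptor, c)) for c in centers]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vlad(descriptors, centers, alpha=1.0):
    """VLAD with soft assignment: every descriptor adds a weighted residual to every cluster."""
    dim = len(centers[0])
    acc = [[0.0] * dim for _ in centers]
    for d in descriptors:
        weights = soft_assign(d, centers, alpha)
        for k, (w, c) in enumerate(zip(weights, centers)):
            for i in range(dim):
                acc[k][i] += w * (d[i] - c[i])
    return acc

# Toy example with 2 centers (hypothetical values).
centers = [[0.0, 0.0], [2.0, 0.0]]
weights = soft_assign([1.0, 0.0], centers)             # equidistant from both centers
sharp = soft_assign([0.0, 0.0], centers, alpha=50.0)   # large alpha approaches hard assignment
residuals = soft_vlad([[1.0, 0.0]], centers)
```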

Slide 31

Slide 31 text

Temporal Aggregation
• TLE [A. Diba+, CVPR17]
  • VLAD, Compact Bilinear Pooling [Y. Gao+, CVPR16]
  • Temporal Aggregation
  • VLAD
    • SVM, VLAD, NN
• [A. Diba+, CVPR17] Deep Temporal Linear Encoding Networks

Slide 32

Slide 32 text

Tips
• Two-stream (ResNet): 2D conv, Optical flow
• Single model, State-of-the-art
  • RGB-stream
  • Optical flow

Slide 33

Slide 33 text

Tips
• CNN, TLE coding
  • TLE, ActionVLAD
• iDT
  • CNN
  • FisherVector, iDT

Slide 34

Slide 34 text

Temporal Aggregation
• Score aggregation
• LSTM
  • e.g. fencing → fencing → …

Diagram: (CNN → LSTM → FC) for each time step, chained over time

• CVPR, ACMMM, AAAI
• input
  ↓
  Two-stream

Slide 35

Slide 35 text

• LSTM
• 3D conv
• Optical flow
  • [L. Sevilla-Lara+, CVPR18]

Diagram: (CNN → LSTM → FC) for each time step, chained over time

Slide 36

Slide 36 text

Approaches: 2D conv + LSTM; 3D conv; 3D conv; Two-stream; Optical flow
• Video generation: MoCoGAN [S. Tulyakov+, CVPR18], VGAN [C. Vondrick+, NIPS16], TGAN [M. Saito+, ICCV17], FTGAN [K. Ohnishi+, AAAI18]
• Action recognition: LRCN [J. Donahue+, CVPR15], C3D [D. Tran+, ICCV15], P3D [Z. Qiu+, ICCV17], Two-stream [K. Simonyan+, NIPS14], I3D [J. Carreira+, ICCV17]
• VGAN

Slide 37

Slide 37 text

Approaches: 2D conv + LSTM; 3D conv; 3D conv; Two-stream; Optical flow
• Video generation: MoCoGAN [S. Tulyakov+, CVPR18], VGAN [C. Vondrick+, NIPS16], TGAN [M. Saito+, ICCV17], FTGAN [K. Ohnishi+, AAAI18]
• Action recognition: LRCN [J. Donahue+, CVPR15], C3D [D. Tran+, ICCV15], P3D [Z. Qiu+, ICCV17], Two-stream [K. Simonyan+, NIPS14], I3D [J. Carreira+, ICCV17]
• VGAN

Slide 38

Slide 38 text

• Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture
  • K. Ohnishi+, AAAI 2018 (oral presentation)
  • https://arxiv.org/abs/1711.09618

Figure label: Optical flow

Slide 39

Slide 39 text

• Action classification
  • Temporal action localization, Spatio-temporal localization
• 3D conv
• Augmentation
• Pose
  • pose
  • data distillation
• Tips
  • optical flow
  • Kinetics, YouTube

Slide 40

Slide 40 text

• XY → XYT: O(n^2) → O(n^3)