The History and Latest Trends of Action Recognition

https://twitter.com/0hnishi

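- Self-introduction
- What is action recognition?
- Datasets
- Example competitions
- Representative methods
- Before deep learning
- After deep learning
- Temporal Aggregation
- Tips for practical use and competitions
- Video generation
- Summary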

Katsunori Ohnishi

September 03, 2018

Transcript

  1. The History and Latest Trends of Action Recognition

    September 3, 2018
    Katsunori Ohnishi, DeNA Co., Ltd.
    Copyright (C) 2018 DeNA Co.,Ltd. All Rights Reserved.
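  2. Outline

    - Self-introduction
    - What is action recognition?
    - Datasets
    - Example competitions
    - Representative methods
      - Before deep learning
      - After deep learning
      - Temporal Aggregation
    - Tips for practical use and competitions
    - Video generation
    - Summary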
  3. Self-introduction

    - Katsunori Ohnishi (Twitter: @ohnishi_ka)
    - 2014-4 to 2017-9: B4 to M2.5, Computer Vision
      - http://katsunoriohnishi.github.io/
    - CVPR2016 (spotlight oral, acceptance rate=9.7%): egocentric vision (wrist-mounted camera)
    - ACMMM2016 (poster, acceptance rate=30%): action recognition (state-of-the-art)
    - AAAI2018 (oral, acceptance rate=10.9%): video generation (FTGAN)
    - From 2017-10: DeNA AI
      - https://www.wantedly.com/projects/209980
  4. What is Action Recognition?

    - Image classification
    - action recognition = human action recognition
    - Variants shown: Fine-grained, Egocentric, Dog-centric, Action recognition, RGBD

    References:
    - Evaluation of video activity localizations integrating quality and quantity measurements [C. Wolf+, CVIU14]
    - Recognizing Activities of Daily Living with a Wrist-mounted Camera [K. Ohnishi+, CVPR16]
    - A Database for Fine Grained Activity Detection of Cooking Activities [M. Rohrbach+, CVPR12]
    - First-Person Animal Activity Recognition from Egocentric Videos [Y. Iwashita+, ICPR14]
    - Recognizing Human Actions: A Local SVM Approach [C. Schuldt+, ICPR04]
    - HMDB: A Large Video Database for Human Motion Recognition [H. Kuehne+, ICCV11]
    - UCF101: A dataset of 101 human actions classes from videos in the wild [K. Soomro+, arXiv2012]
  5. Datasets

    - KTH, UCF101, HMDB51
      - UCF101: 101 classes, 13320 videos
    - ActivityNet, Kinetics, YouTube-8M
    - AVA, Moments in Time, SLAC
  6. Example competitions

    - YouTube-8M Video Understanding Challenge
      - https://www.kaggle.com/c/youtube8m
      - CVPR17 / ECCV18 workshop, Kaggle
      - frame-level, test
      - kaggle, action recognition
    - ActivityNet Challenge
      - http://activity-net.org/challenges/2018/
      - ActivityNet tasks:
        - Temporal Proposal (T)
        - Temporal localization (T)
        - Video Captioning
      - Tasks on other datasets:
        - Kinetics: classification (human action)
        - AVA: Spatio-temporal localization (XYT)
        - Moments-in-time: classification (event)
  7. Before CNNs

    - 2000-: SIFT
    - Pipeline: local descriptor → coding → global feature
    - Methods:
      - STIP [I. Laptev, IJCV04]
      - Dense Trajectory [H. Wang+, ICCV11]
      - Improved Dense Trajectory [H. Wang+, ICCV13]
    - Further reading:
      - http://hirokatsukataoka.net/temp/presen/170121STAIRLab_slideshare.pdf
      - https://arxiv.org/pdf/1605.04988.pdf

    References:
    - On space-time interest points [I. Laptev, IJCV04]
    - Action Recognition by Dense Trajectories [H. Wang+, ICCV11]
    - Action Recognition with Improved Trajectories [H. Wang+, ICCV13]
  8. Before CNNs

    - Improved Dense Trajectories (iDT) [H. Wang+, ICCV13]
    - Dense Trajectories [H. Wang+, ICCV11]
    - Figure labels: optical flow; foreground optical flow; improved dense trajectories (green); background dense trajectories (white)
  9. Before CNNs

    - SIFT, Fisher Vector
    - Fisher vector references:
      - http://www.isi.imi.i.u-tokyo.ac.jp/~harada/pdf/SSII_harada20120608.pdf
      - https://www.slideshare.net/takao-y/fisher-vector
    - Pipeline: input → local descriptor (iDT) → video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → classifier (SVM [F. Pedregosa+, JMLR11])

    References:
    - Fisher kernels on visual vocabularies for image categorization [F. Perronnin+, CVPR07]
  10. Action recognition after CNNs

    - CNN
    - Two-stream
      - Hand-crafted feature
    - 3D Convolution
      - C3D
      - C3D, Two-stream
      - 3D conv
    - Optical flow
  11. Action recognition after CNNs: CNN

    - Spatio-temporal ConvNet [A. Karpathy+, CVPR14]
      - CNN
      - AlexNet (RGB ch → 10 frames ch (gray))
      - multi scale Fusion
      - Sports1M pre-training
      - UCF101: 65.4% (iDT: 85.9%)
    - Notes: 10 frames, conv1 ch; RGB, gray; frame-by-frame, score

    References:
    - Large-scale video classification with convolutional neural networks [A. Karpathy+, CVPR14]
  12. Action recognition after CNNs: Two-stream

    - Two-stream [K. Simonyan+, NIPS14]
      - 2D CNN
      - Spatial-stream (input: RGB)
      - Temporal-stream (input: optical flow, 10 frames)
      - Frame-by-frame
      - Hand-crafted feature, CNN
    - 2D CNN, ImageNet pre-trained

    | Method | UCF101 | HMDB51 |
    | --- | --- | --- |
    | iDT | 85.9% | 57.2% |
    | Spatio-temporal ConvNet | 65.4% | - |
    | RGB-stream | 73.0% | 40.5% |
    | Flow-stream | 83.7% | 54.6% |
    | Two-stream | 88.0% | 59.4% |

    References:
    - Two-stream convolutional networks for action recognition in videos [K. Simonyan+, NIPS14]
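Combining the RGB and optical-flow streams at the score level can be sketched as follows. This is a minimal sketch in plain Python, assuming each stream outputs per-class logits for a clip and that fusion is a weighted average of softmax scores. The function names, the `flow_weight` parameter, and the toy logits are hypothetical and are not taken from the slides.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of per-class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion: weighted average of the two streams' softmax scores.

    flow_weight is an assumed, illustrative parameter.
    """
    rgb = softmax(rgb_logits)
    flow = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f for r, f in zip(rgb, flow)]

def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy logits for 3 classes (made up for illustration).
rgb_logits = [2.0, 1.0, 0.1]    # spatial stream prefers class 0
flow_logits = [0.5, 3.0, 0.2]   # temporal stream prefers class 1
fused = fuse_two_stream(rgb_logits, flow_logits)
print(predict(fused))
```

In this toy case the flow stream is more confident than the RGB stream, so the fused prediction follows the flow stream.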
  13. Action recognition after CNNs: 3D convolution

    - C3D [D. Tran+, ICCV15]
      - 16 frames, 3D convolution CNN
      - XYT 3D convolution
      - UCF101, pre-training
      - ICCV15, arXiv
    - 3D conv

    | Method | UCF101 | HMDB51 |
    | --- | --- | --- |
    | iDT | 85.9% | 57.2% |
    | Two-stream | 88.0% | 59.4% |
    | C3D (1 net) | 82.3% | - |

    References:
    - Learning Spatiotemporal Features with 3D Convolutional Networks [D. Tran+, ICCV15]
  14. Action recognition after CNNs: 3D convolution

    - P3D [Z. Qiu+, ICCV17]
      - C3D
      - 3D conv → 2D conv (XY) + 1D conv (T)
      - pre-training
    - Figure labels: Spatial 2D conv, Temporal 1D conv

    | Method | UCF101 | HMDB51 |
    | --- | --- | --- |
    | iDT | 85.9% | 57.2% |
    | Two-stream (AlexNet) | 88.0% | 59.4% |
    | P3D (ResNet) | 88.6% | - |

    References:
    - Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks [Z. Qiu+, ICCV17]
  15. Action recognition after CNNs: 3D convolution

    - P3D [Z. Qiu+, ICCV17]
      - C3D
      - 3D conv → 2D conv (XY) + 1D conv (T)
      - pre-training
    - Figure labels: Spatial 2D conv, Temporal 1D conv
    - 3D conv, again

    | Method | UCF101 | HMDB51 |
    | --- | --- | --- |
    | iDT | 85.9% | 57.2% |
    | Two-stream (AlexNet) | 88.0% | 59.4% |
    | P3D (ResNet) | 88.6% | - |
    | Two-stream (ResNet152) | 91.8% | |

    References:
    - Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks [Z. Qiu+, ICCV17]
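The decomposition "3D conv → 2D conv (XY) + 1D conv (T)" can be illustrated by counting kernel weights. This is a rough sketch under stated assumptions: bias terms are ignored, the kernel is 3x3x3, and the intermediate channel count between the spatial and temporal convolutions is taken to equal the output channel count. The function names and channel sizes are hypothetical, chosen only for illustration.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Number of weights in a full kt x kh x kw 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw

def factorized_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weights in a spatial 1 x kh x kw conv followed by a temporal kt x 1 x 1 conv.

    Assumes the intermediate channel count equals c_out (illustrative choice).
    """
    spatial = c_in * c_out * kh * kw   # 2D conv over XY
    temporal = c_out * c_out * kt      # 1D conv over T
    return spatial + temporal

full = conv3d_params(64, 64)
factorized = factorized_params(64, 64)
print(full, factorized)
```

Under these assumptions the factorized form needs fewer than half the weights of the full 3D kernel, and its spatial part has the same shape as an ordinary 2D convolution.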
  17. Action recognition after CNNs: 3D convolution

    - C3D, P3D, 3D conv
    - 3D conv [K. Hara+, CVPR18]
    - Figure labels: 2012, 2011, 2015, 2017; 2017 Kinetics!

    References:
    - Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? [K. Hara+, CVPR18]
  18. Action recognition after CNNs: 3D convolution

    - Kinetics
      - human action dataset
      - 3D conv
      - Pre-train, UCF101
    - YouTube-8M

    References:
    - The Kinetics human action video dataset [W. Kay+, arXiv17]
  19. Action recognition after CNNs: 3D convolution

    - I3D [J. Carreira+, ICCV17]
      - Kinetics dataset, DeepMind
      - 3D conv, Inception
      - 64 GPUs for training, 16 GPUs for predict
      - state-of-the-art
      - RGB
      - Two-stream: optical flow, score

    | Method | UCF101 | HMDB51 |
    | --- | --- | --- |
    | RGB-I3D | 95.6% | 74.8% |
    | Flow-I3D | 96.7% | 77.1% |
    | Two-stream I3D | 98.0% | 80.7% |

    References:
    - Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J. Carreira+, ICCV17]
  21. Action recognition after CNNs: 3D convolution

    - I3D, Two-stream, 3D convolution
    - 3D conv, XYT
      - XY, T
    - 3D conv
    - Figure label: time
  22. Action recognition after CNNs: 3D convolution

    - 3D convolution [D.A. Huang+, CVPR18]
      - 3D CNN
      - Two-stream I3D, optical flow, 3D conv

    References:
    - What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets [D.A. Huang+, CVPR18]
  23. Action recognition after CNNs: 3D convolution

    - 3D conv
    - CVPR18
    - CVPR/ICCV/ECCV
    - 3D conv
      - GPU
  24. Action recognition after CNNs: Optical flow

    - Optical flow [L. Sevilla-Lara+, CVPR18]
      - Optical flow
      - Optical flow (EPE), action recognition
      - flow, action recognition
      - Optical flow, appearance

    References:
    - On the Integration of Optical Flow and Action Recognition [L. Sevilla-Lara+, CVPR18]
  25. AVA, Moments-in-time, Kinetics-600

    - AVA [C. Gu+, CVPR18]: XYZT bounding box, human action localization
    - Moments-in-time [M. Monfort+, arXiv2018]
    - Kinetics-600 [W. Kay+, arXiv2017]: Kinetics 400, 600
  26. Temporal Aggregation

    - 2D conv: frame-by-frame; 3D conv
    - (100 frames, 232 frames, 50 frames)
  27. Temporal Aggregation

    - Score
    - LSTM
      - fencing → fencing → …
    - Figure: CNN → LSTM → FC at each time step
    - CVPR, ACMMM, AAAI
  28. Temporal Aggregation

    - Pipeline: input → local descriptor (iDT) → video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → classifier (SVM [F. Pedregosa+, JMLR11])
    - Fisher Vector
      - CNN, SIFT, GMM
      - FV, VLAD [H. Jegou+, CVPR10]

    References:
    - Aggregating local descriptors into a compact image representation [H. Jegou+, CVPR10]
  29. Temporal Aggregation

    - LCD [Z. Xu+, CVPR15]
      - VGG16 pool5, 512dim feature
        - 224x224 input, 7x7=49 features
      - VLAD, global feature
    - Pipeline: input → CNN → Pool5 (e.g. 2x2x512) → local descriptors → VLAD → global feature → SVM

    References:
    - A discriminative CNN video representation for event detection [Z. Xu+, CVPR15]
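The step "local descriptors → VLAD → global feature" can be sketched in plain Python. This is a minimal, hedged illustration of standard hard-assignment VLAD with a final L2 normalization. It assumes the cluster centers are already given (for example from k-means); the function names and the toy 2-dimensional data are hypothetical.

```python
import math

def nearest_center(x, centers):
    """Index of the center closest to descriptor x (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for k, c in enumerate(centers):
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        if dist < best_dist:
            best, best_dist = k, dist
    return best

def vlad(descriptors, centers):
    """Sum residuals (x - center) per nearest center, flatten, then L2-normalize."""
    num_centers, dim = len(centers), len(centers[0])
    residuals = [[0.0] * dim for _ in range(num_centers)]
    for x in descriptors:
        k = nearest_center(x, centers)
        for j in range(dim):
            residuals[k][j] += x[j] - centers[k][j]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data: 2 centers, 3 local descriptors (made up for illustration).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
global_feature = vlad(descriptors, centers)
print(len(global_feature))  # K * D values
```

The output length is the number of centers times the descriptor dimension, independent of how many local descriptors (and therefore how many frames) went in, which is what makes it usable as a fixed-size video feature.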
  30. Temporal Aggregation

    - ActionVLAD [R. Girdhar+, CVPR17]
      - NetVLAD [R. Arandjelović+, CVPR16]
        - NetVLAD: VLAD as NN; cluster assign, softmax assign
        - VLAD, LCD
      - VLAD
      - End2end (CNN)

    References:
    - ActionVLAD: Learning spatio-temporal aggregation for action classification [R. Girdhar+, CVPR17]
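Replacing the hard cluster assignment of VLAD with a softmax assignment can be sketched as below. This is an assumption-laden illustration, not the published implementation: the soft weights are a softmax over negative scaled squared distances to fixed centers, whereas in a trained network the centers and assignment parameters would be learned. The names `soft_assign`, `soft_vlad`, and `alpha` are hypothetical.

```python
import math

def soft_assign(x, centers, alpha=1.0):
    """Softmax weights over centers, from negative scaled squared distances."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vlad(descriptors, centers, alpha=1.0):
    """Accumulate residuals to every center, weighted by the soft assignment."""
    num_centers, dim = len(centers), len(centers[0])
    out = [[0.0] * dim for _ in range(num_centers)]
    for x in descriptors:
        weights = soft_assign(x, centers, alpha)
        for k in range(num_centers):
            for j in range(dim):
                out[k][j] += weights[k] * (x[j] - centers[k][j])
    return out

# Toy data (made up): a descriptor close to the first center.
centers = [[0.0, 0.0], [10.0, 10.0]]
weights = soft_assign([1.0, 0.0], centers)
aggregated = soft_vlad([[1.0, 0.0]], centers)
print(weights, aggregated)
```

Because every operation here is differentiable with respect to the descriptors, this form can sit on top of a CNN and be trained end to end, unlike the hard nearest-center lookup.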
  31. Temporal Aggregation

    - TLE [A. Diba+, CVPR17]
      - VLAD, Compact Bilinear Pooling [Y. Gao+, CVPR16]
      - Temporal Aggregation
      - VLAD
        - SVM, VLAD, NN

    References:
    - Deep Temporal Linear Encoding Networks [A. Diba+, CVPR17]
  32. Tips

    - Two-stream (ResNet): 2D conv, optical flow
    - Single model, state-of-the-art
      - I3D + TLE
      - 64 GPUs
    - Two-stream: optical flow, GPU
      - optical flow stream
      - RGB-stream
    - Optical flow
  33. Tips

    - CNN, TLE coding
      - TLE, ActionVLAD
    - iDT
      - CNN
      - Fisher Vector, iDT
    - Tips: PCA (dim=64), K=256, FV power norm
      - CPU
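The "FV power norm" tip can be sketched as follows. This is a minimal illustration assuming the common recipe of a signed power normalization (exponent 0.5) followed by L2 normalization; the function name and the toy vector are hypothetical. The dimension line assumes a Fisher Vector built from gradients with respect to GMM means and variances only, which gives 2 * K * D values for the settings named on the slide (K=256, PCA dim=64).

```python
import math

def power_l2_normalize(vector, alpha=0.5):
    """Signed power normalization sign(x) * |x|**alpha, then L2 normalization."""
    powered = [math.copysign(abs(x) ** alpha, x) for x in vector]
    norm = math.sqrt(sum(p * p for p in powered))
    return [p / norm for p in powered] if norm > 0 else powered

# Assumed Fisher Vector size: mean and variance gradients for each GMM component.
num_gaussians = 256   # K from the slide
pca_dim = 64          # descriptor dimension after PCA, from the slide
fv_dim = 2 * num_gaussians * pca_dim

# Toy vector (made up) standing in for a raw Fisher Vector.
normalized = power_l2_normalize([4.0, -9.0, 0.0])
print(fv_dim, normalized)
```

The power step damps a few very large components so they do not dominate the linear SVM's dot products, and the L2 step puts all videos on the same scale.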
  34. Temporal Aggregation

    - Score
    - LSTM
      - fencing → fencing → …
    - Figure: CNN → LSTM → FC at each time step
    - CVPR, ACMMM, AAAI
    - input
      ↓
      Two-stream
  35. LSTM, 3D conv, Optical flow

    - LSTM
    - 3D conv
    - Optical flow [L. Sevilla-Lara+, CVPR18]
    - Figure: CNN → LSTM → FC at each time step
  36. Video generation and action recognition

    - Architectures: 2D conv + LSTM; 3D conv; 3D conv, Two-stream; Optical flow
    - Video generation: MoCoGAN [S. Tulyakov+, CVPR18], VGAN [C. Vondrick+, NIPS16], TGAN [M. Saito+, ICCV17], FTGAN [K. Ohnishi+, AAAI18]
    - Action recognition: LRCN [J. Donahue+, CVPR15], C3D [D. Tran+, ICCV15], P3D [Z. Qiu+, ICCV17], Two-stream [K. Simonyan+, NIPS14], I3D [J. Carreira+, ICCV17]
    - VGAN
  38. Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture

    - K. Ohnishi+, AAAI 2018 (oral presentation)
    - https://arxiv.org/abs/1711.09618
    - Optical flow
  39. Summary

    - Action classification
      - Temporal action localization, Spatio-temporal localization
    - 3D conv
    - Augmentation
    - Pose
      - Pose
        - pose
        - data distillation
    - Tips
      - optical flow
      - Kinetics, YouTube
  40. Summary

    - XY → XYT: O(n^2) → O(n^3)