Action Recognition: History and Latest Trends

Copyright (C) 2018 DeNA Co.,Ltd. All Rights Reserved.

Outline
• Action recognition
• …
  • Before deep learning
  • After deep learning
  • Temporal Aggregation
• Tips

Katsunori Ohnishi
• Twitter: @ohnishi_ka
• April 2014 to September 2017: B4 to M2.5, Computer Vision …
  • …: http://katsunoriohnishi.github.io/
  • CVPR2016 (spotlight oral, acceptance rate=9.7%): egocentric vision (wrist-mounted camera)
  • ACMMM2016 (poster, acceptance rate=30%): action recognition (… state-of-the-art)
  • AAAI2018 (oral, acceptance rate=10.9%): video generation (FTGAN)
• October 2017 to present: DeNA AI …
  • → https://www.wantedly.com/projects/209980

Action Recognition
• … Image classification … action recognition = … human action recognition
  • … fine-grained, egocentric …
Figure labels: Fine-grained, egocentric, Dog-centric, Action recognition, RGBD
• Evaluation of video activity localizations integrating quality and quantity measurements [C. Wolf+, CVIU14]
• Recognizing Activities of Daily Living with a Wrist-mounted Camera [K. Ohnishi+, CVPR16]
• A Database for Fine Grained Activity Detection of Cooking Activities [M. Rohrbach+, CVPR12]
• First-Person Animal Activity Recognition from Egocentric Videos [Y. Iwashita+, ICPR14]
• Recognizing Human Actions: A Local SVM Approach [C. Schuldt+, ICPR04]
• HMDB: A Large Video Database for Human Motion Recognition [H. Kuehne+, ICCV11]
• UCF101: A dataset of 101 human actions classes from videos in the wild [K. Soomro+, arXiv2012]

• …: KTH, UCF101, HMDB51
  • UCF101: 101 classes, 13320 videos …
• …: ActivityNet, Kinetics, YouTube-8M
• …: AVA, Moments in Time, SLAC
Figure: UCF101 …

YouTube-8M Video Understanding Challenge
• https://www.kaggle.com/c/youtube8m
• CVPR17 / ECCV18 workshop …, Kaggle …
• frame-level … test …
  • kaggle, action recognition …
ActivityNet Challenge
• http://activity-net.org/challenges/2018/
• ActivityNet … tasks
  • Temporal Proposal (T)
  • Temporal localization (T)
  • Video Captioning
• … tasks
  • Kinetics: classification (human action)
  • AVA: Spatio-temporal localization (XYT)
  • Moments-in-time: classification (event)

Before CNN
• 2000s …
  • … SIFT …
  • … local descriptor → coding → global feature → …
• STIP [I. Laptev, IJCV04]
• Dense Trajectory [H. Wang+, ICCV11]
• Improved Dense Trajectory [H. Wang+, ICCV13]
• …: http://hirokatsukataoka.net/temp/presen/170121STAIRLab_slideshare.pdf
• …: https://arxiv.org/pdf/1605.04988.pdf
On space-time interest points [I. Laptev, IJCV04]
Action Recognition by Dense Trajectories [H. Wang+, ICCV11]
Action Recognition with Improved Trajectories [H. Wang+, ICCV13]

Improved Dense Trajectories (iDT) [H. Wang+, ICCV13]
Dense Trajectories [H. Wang+, ICCV11]
Figure labels: optical flow; foreground optical flow; improved dense trajectories (green); background dense trajectories (white)

• … SIFT … Fisher Vector …
• Fisher vector …:
  • http://www.isi.imi.i.u-tokyo.ac.jp/~harada/pdf/SSII_harada20120608.pdf
  • https://www.slideshare.net/takao-y/fisher-vector
Pipeline: input → Local descriptor (iDT) → Video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → Classifier (SVM [F. Pedregosa+, JMLR11])
Fisher kernels on visual vocabularies for image categorization [F. Perronnin+, CVPR07]

Action recognition with CNNs
• … CNN …
• Two-stream
  • Hand-crafted feature …
• 3D Convolution
  • C3D …
  • C3D … Two-stream …
  • … 3D conv …
• Optical flow …

Action recognition with CNNs: … CNN
• Spatio-temporal ConvNet [A. Karpathy+, CVPR14]
  • … CNN …
  • AlexNet … (RGB ch → 10 frames ch (gray))
  • … multi scale Fusion …
  • Sports1M … pre-training
  • UCF101: 65.4% (iDT: 85.9%)
• … 10 frames … conv1 …
• RGB … gray …, frame-by-frame … score …
Large-scale video classification with convolutional neural networks [A. Karpathy+, CVPR14]

Two-stream
• Two-stream [K. Simonyan+, NIPS14]
  • 2D CNN …
    • Spatial-stream: RGB … (input: RGB)
    • Temporal-stream: Optical flow … (input: optical flow 10 frames)
    • Frame-by-frame …
  • The two-stream CNN outperforms the hand-crafted feature (iDT): 88.0% vs. 85.9% on UCF101

                           UCF101   HMDB51
  iDT                      85.9%    57.2%
  Spatio-temporal ConvNet  65.4%    -
  RGB-stream               73.0%    40.5%
  Flow-stream              83.7%    54.6%
  Two-stream               88.0%    59.4%

• …
• 2D CNN … ImageNet pre-trained
Two-stream convolutional networks for action recognition in videos [K. Simonyan+, NIPS14]
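The combination of the RGB stream and the flow stream in the table above can be pictured as late fusion of class scores. The following is only a minimal sketch under stated assumptions: each stream is assumed to output per-frame class scores, these are averaged over frames, and the two streams are then averaged. The function names, the `flow_weight` parameter, and the toy scores are hypothetical and are not taken from the slides or the paper.

```python
from typing import List, Sequence


def average_scores(frame_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame class scores into one video-level score vector."""
    if not frame_scores:
        raise ValueError("need at least one frame")
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    return [
        sum(frame[c] for frame in frame_scores) / num_frames
        for c in range(num_classes)
    ]


def fuse_two_streams(rgb_frame_scores, flow_frame_scores, flow_weight=0.5):
    """Late fusion: weighted average of the video-level scores of both streams.

    flow_weight is a hypothetical knob; 0.5 means a plain average.
    """
    rgb = average_scores(rgb_frame_scores)
    flow = average_scores(flow_frame_scores)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb, flow)
    ]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Toy example with 3 classes: two RGB frames and two optical-flow stacks.
rgb_scores = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
flow_scores = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]
fused = fuse_two_streams(rgb_scores, flow_scores)
```

In this toy case the RGB stream alone prefers class 0, while the fused scores prefer class 1, which illustrates how the flow stream can change the final decision.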

3D convolution
• C3D [D. Tran+, ICCV15]
  • 16 frame … 3D convolution … CNN …
    • XYT 3D convolution …
  • UCF101 … pre-training …
  • …
  • ICCV15 …, arXiv … reject …

                UCF101   HMDB51
  iDT           85.9%    57.2%
  Two-stream    88.0%    59.4%
  C3D (1net)    82.3%    -

• 3D conv …
Learning Spatiotemporal Features with 3D Convolutional Networks [D. Tran+, ICCV15]

3D convolution
• P3D [Z. Qiu+, ICCV17]
  • … C3D …
  • … 3D conv → 2D conv (XY) + 1D conv (T) …
  • … pre-training …

                            UCF101   HMDB51
  iDT                       85.9%    57.2%
  Two-stream (AlexNet)      88.0%    59.4%
  P3D (ResNet)              88.6%    -
  Two-stream (ResNet152)    91.8%

Figure labels: Spatial 2D conv, Temporal 1D conv
• 3D conv … again …
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks [Z. Qiu+, ICCV17]
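To make the factorization "3D conv → 2D conv (XY) + 1D conv (T)" concrete, the sketch below counts learnable parameters for a full 3x3x3 convolution versus a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution. It is an illustration only: the channel width of 64, the assumption that the intermediate channel count equals the output channel count, and all function names are hypothetical, and the sketch ignores how the paper arranges these convolutions inside its residual blocks.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=True):
    """Number of learnable parameters of one 3D convolution layer."""
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def full_3d(c_in, c_out, k=3):
    """A single k x k x k spatio-temporal convolution."""
    return conv3d_params(c_in, c_out, k, k, k)


def factorized_3d(c_in, c_out, k=3):
    """Spatial 1 x k x k convolution followed by temporal k x 1 x 1 convolution.

    Assumes the intermediate channel count equals c_out (a simplification).
    """
    spatial = conv3d_params(c_in, c_out, 1, k, k)
    temporal = conv3d_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


channels = 64  # hypothetical layer width
full = full_3d(channels, channels)
factored = factorized_3d(channels, channels)
print(full, factored, round(factored / full, 3))
```

Under these assumptions the factorized pair needs well under half the parameters of the full 3D kernel, and its spatial part has the same shape as an ordinary 2D convolution.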

3D convolution
• C3D, P3D … 3D conv …
• … 3D conv … [K. Hara+, CVPR18]
Figure labels: 2012, 2011, 2015, 2017, 2017 Kinetics!
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? [K. Hara+, CVPR18]

3D convolution
• Kinetics …
  • … human action dataset!
  • 3D conv …
    • Pre-train … UCF101 …
• YouTube-8M …
• …
The Kinetics human action video dataset [W. Kay+, arXiv17]

3D convolution
• I3D [J. Carreira+, ICCV17]
  • Kinetics dataset … DeepMind …
  • 3D conv … Inception
  • 64 GPUs for training, 16 GPUs for predict
  • … state-of-the-art
    • RGB …
    • Two-stream … optical flow … score …

                    UCF101   HMDB51
  RGB-I3D           95.6%    74.8%
  Flow-I3D          96.7%    77.1%
  Two-stream I3D    98.0%    80.7%

• … ?
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J. Carreira+, ICCV17]

3D convolution
• I3D Two-stream … 3D convolution …
• …
  • 3D conv … XYT …
    • XY … T …
  • … 3D conv …
Figure label: time

3D convolution
• 3D convolution … [D.A. Huang+, CVPR18]
  • …
    • 3D CNN …
  • …
    • Two-stream I3D … Optical flow … 3D conv …
What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets [D.A. Huang+, CVPR18]

3D convolution
• 3D conv …
  • CVPR18 …
  • … CVPR/ICCV/ECCV …
  • … 3D conv …
    • GPU …

Optical flow
• Optical flow … [L. Sevilla-Lara+, CVPR18]
  • …
    • Optical flow …
  • …
    • Optical flow accuracy (EPE) … action recognition accuracy
  • …
    • … flow accuracy … action recognition accuracy
  • …
    • … Optical flow … appearance …
    • Optical flow accuracy …
On the Integration of Optical Flow and Action Recognition [L. Sevilla-Lara+, CVPR18]

• AVA [C. Gu+, CVPR18]: XYZT bounding box human action localization
• Moments-in-time [M. Monfort+, arXiv2018]: 3 …
• Kinetics-600 [W. Kay+, arXiv2017]: Kinetics extended from 400 to 600 classes

Temporal Aggregation
• …
  • 2D conv: frame-by-frame …; 3D conv: …
  • … (100 frames, 232 frames, 50 frames) …

• …
  • Score … → …
  • LSTM … → …
    • … FC …
    • … fencing … → fencing … → …
Diagram: CNN → LSTM → FC, repeated three times
CVPR, ACMMM, AAAI …

Pipeline: input → Local descriptor (iDT) → Video descriptor (Fisher Vector [F. Perronnin+, CVPR07]) → Classifier (SVM [F. Pedregosa+, JMLR11])
Temporal Aggregation
• …
  • → …
  • Fisher Vector …
    • CNN … SIFT … GMM …
    • FV … VLAD [H. Jegou+, CVPR10] …
Aggregating local descriptors into a compact image representation [H. Jegou+, CVPR10]

• LCD [Z. Xu+, CVPR15]
  • Treats each spatial position of VGG16 pool5 as a 512-dim local feature
    • A 224x224 input gives 7x7=49 features
    • These are aggregated with VLAD into a global feature
Diagram: input … → CNN → Pool5 (e.g. 2x2x512) → Local descriptors → VLAD → global feature → SVM
A discriminative CNN video representation for event detection [Z. Xu+, CVPR15]
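The VLAD step in the LCD pipeline (local descriptors → VLAD → global feature) can be sketched as below. This is a minimal illustration, assuming hard assignment to the nearest codebook center and a final L2 normalization; the toy 2-D descriptors stand in for the 512-dim pool5 features, the two centers stand in for a learned k-means codebook, and all names are hypothetical rather than taken from the paper.

```python
import math


def nearest_center(x, centers):
    """Index of the codebook center closest to descriptor x (squared L2)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[k])),
    )


def vlad(descriptors, centers):
    """Aggregate local descriptors into one VLAD vector of length K * D."""
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for x in descriptors:
        k = nearest_center(x, centers)
        for d in range(dim):
            # Accumulate the residual between the descriptor and its center.
            residual_sums[k][d] += x[d] - centers[k][d]
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy data: 2-D descriptors and a 2-center codebook (both hypothetical).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
video_feature = vlad(descriptors, centers)
```

The output length is the number of centers times the descriptor dimension, regardless of how many local descriptors (frames times spatial positions) go in, which is what makes it usable as a fixed-size video feature for an SVM.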

• ActionVLAD [R. Girdhar+, CVPR17]
  • NetVLAD [R. Arandjelović+, CVPR16] …
    • NetVLAD: VLAD … NN …; cluster assignment … softmax … assign …
    • VLAD … LCD
  • … VLAD …
  • End2end … (CNN …)
ActionVLAD: Learning spatio-temporal aggregation for action classification [R. Girdhar+, CVPR17]

• TLE [A. Diba+, CVPR17]
  • VLAD … Compact Bilinear Pooling [Y. Gao+, CVPR16]
  • Temporal Aggregation …
  • VLAD …
    • SVM … VLAD … NN …
Deep Temporal Linear Encoding Networks [A. Diba+, CVPR17]

Tips
• …
  • Two-stream (ResNet) … 2D conv … Optical flow …
• Single model State-of-the-art …
  • … I3D + TLE …
  • … 64 GPUs …
• …
  • Two-stream optical flow … GPU …
    • … optical flow stream …
    • … RGB-stream …
  • Optical flow …
  • …

Tips
• …
  • CNN … TLE coding
    • TLE, ActionVLAD …
  • iDT …
    • CNN …
    • FisherVector … iDT …
  • Tips: PCA (dim=64), K=256, FV power norm
    • CPU …
  • …
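The settings quoted in the tip (PCA to dim=64, K=256, power normalization of the Fisher Vector) can be made concrete with a small sketch. It assumes the common Fisher Vector layout that keeps gradients with respect to GMM means and variances (length 2 x K x D), a power-normalization exponent of 0.5, and a following L2 normalization; the exponent, the L2 step, and all function names are assumptions for illustration, not details stated on the slide.

```python
import math


def fisher_vector_dim(num_gaussians, descriptor_dim):
    """Length of a Fisher Vector that keeps mean and variance gradients."""
    return 2 * num_gaussians * descriptor_dim


def power_normalize(vec, alpha=0.5):
    """Signed power normalization: sign(z) * |z| ** alpha."""
    return [math.copysign(abs(z) ** alpha, z) for z in vec]


def l2_normalize(vec):
    """Scale the vector to unit Euclidean length (no-op for a zero vector)."""
    norm = math.sqrt(sum(z * z for z in vec))
    return [z / norm for z in vec] if norm > 0 else list(vec)


# Settings quoted on the slide: PCA to 64 dims, K = 256 Gaussians.
dim = fisher_vector_dim(num_gaussians=256, descriptor_dim=64)

# Toy vector standing in for an unnormalized Fisher Vector.
raw = [4.0, -9.0, 0.0, 0.25]
normalized = l2_normalize(power_normalize(raw))
```

With these settings the video descriptor has 32768 dimensions per descriptor type, which gives a sense of why the normalization and the linear SVM that follow are kept cheap.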

• …
  • Score … → …
  • LSTM … → …
    • … FC …
    • … fencing … → fencing … → …
Diagram: CNN → LSTM → FC, repeated three times
CVPR, ACMMM, AAAI …
… input …
↓
Two-stream

• …
  • LSTM …
  • 3D conv …
  • Optical flow …
    • … [L. Sevilla-Lara+, CVPR18]
Diagram: CNN → LSTM → FC, repeated three times

Headings: 2D conv + LSTM / 3D conv / 3D conv … / Two-stream / Optical flow
• MoCoGAN [S. Tulyakov+, CVPR18]; VGAN [C. Vondrick+, NIPS16]; TGAN [M. Saito+, ICCV17]; FTGAN [K. Ohnishi+, AAAI18]
• LRCN [J. Donahue+, CVPR15]; C3D [D. Tran+, ICCV15]; P3D [Z. Qiu+, ICCV17]; Two-stream [K. Simonyan+, NIPS14]; I3D [J. Carreira+, ICCV17]
• (…) VGAN

Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture
K. Ohnishi+, AAAI 2018 (oral presentation)
https://arxiv.org/abs/1711.09618
Figure label: Optical flow

• …
  • Action classification …
    • … Temporal action localization … Spatio-temporal localization …
  • 3D conv …
  • Augmentation …
• Pose …
  • Pose …
    • … pose …
    • … data distillation …
• Tips
  • … optical flow …
  • Kinetics … YouTube …

• …
  • …
  • XY → XYT …: O(n^2) → O(n^3) …
• …
