ACML 2019 Tutorial 1: "Deep Learning for Natural Language Processing and Computer Vision"

http://www.acml-conf.org/2019/tutorials/ushiku-tsuruoka/
This is the latter part of the first tutorial in ACML 2019.

Yoshitaka Ushiku

November 10, 2019

Transcript

  1. ACML 2019 Tutorial 1: Deep Learning for Natural Language Processing

    and Computer Vision Computer Vision + beyond OMRON SINIC X / Ridge-i Yoshitaka Ushiku losnuevetoros
  2. 2011 2012 2014

  3. 2011 2012 2014 Speech recognition error 30% → less than

    20% [Seide+, InterSpeech 2011]
  4. 2011 2012 2014 Speech recognition error 30% → less than

    20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012]
  5. 2011 2012 2014 Speech recognition error 30% → less than

    20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012] Machine translation system Complicated → simple [Sutskever+, NIPS 2014]
  6. 2012: Impact of Deep Learning Academic AI startup A famous

    company Many slides refer to the first use of CNN (AlexNet) on ImageNet
  7. 2012: Impact of Deep Learning Academic AI startup A famous

    company Many slides refer to the first use of CNN (AlexNet) on ImageNet
  8. 2012: Impact of Deep Learning Academic AI startup A famous

    company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet
  9. 2012: Impact of Deep Learning Academic AI startup A famous

    company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet
  10. 2012: Impact of Deep Learning Academic AI startup A famous

    company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet
  11. 2012: Impact of Deep Learning According to the official site…

    1st team w/ DL Error rate: 15% [http://image-net.org/challenges/LSVRC/2012/results.html]
  12. 2012: Impact of Deep Learning According to the official site…

    1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html]
  13. 2012: Impact of Deep Learning According to the official site…

    1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html]
  14. 2012: Impact of Deep Learning According to the official site…

    1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html] It’s me!!
  15. Yoshitaka Ushiku Ph.D. 2013.5~2013.8 Research Intern, Microsoft Research 2014.4 Ph.D.

    (The University of Tokyo) 2014.4~2016.3 Research Scientist, NTT CS Lab. 2016.4~ Lecturer, The University of Tokyo 2018.4~ Principal Investigator, OMRON SINIC X Corp. 2019.1~ Chief Research Officer, Ridge-i Co., Ltd. [Ushiku+, ACMMM 2012] [Ushiku+, ICCV 2015] Image Captioning Image Captioning with Sentiment Terms Cross-modal Retrieval with Videos and Texts [Yamaguchi+, ICCV 2017] A guy is skiing with no shirt on and yellow snow pants. A zebra standing in a field with a tree in the dirty background. [Shin+, BMVC 2016] A yellow train on the tracks near a train station.
  16. Today’s tutorial • Computer Vision: Short History – Detection, segmentation,

    and 3D rendering • Computer Vision from the Point of View of Machine Learning – Domain adaptation • Computer Vision and Natural Language Processing – Vision & Language
  17. Computer Vision: Short History

  18. 2011 2012 2014 Speech recognition error 30% → less than

    20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012] Machine translation system Complicated → simple [Sutskever+, NIPS 2014]
  19. 2011 2012 2014

  20. 2011 2012 2014 DNN

  21. 2011 2012 2014 DNN CNN

  22. 2011 2012 2014 DNN CNN RNN

  23. Convolution • The filter values are multiplied pixel by pixel – with

    the area at each position of the left image – Their sum is stored at the corresponding position of the right image • Multiple 2D filters → 3D array [Dumoulin+Visin, 2016]
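The convolution described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `conv2d_valid` and the toy `image` and `kernel` values are made up for the example and do not come from the slides.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the filter with the image patch pixel by pixel,
            # then store the sum at the corresponding output position.
            output[i][j] = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
    return output


# Toy 4x4 image and 2x2 filter (hypothetical values).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map = conv2d_valid(image, kernel)

# Applying several 2D filters and stacking the results gives a 3D array.
filters = [kernel, [[0, 1], [1, 0]]]
feature_maps = [conv2d_valid(image, k) for k in filters]
```

Here `feature_maps` has shape (number of filters) x 3 x 3, which is the "multiple 2D filters → 3D array" point of the slide.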
  24. AlexNet • AlexNet – 2 GPUs – 1 week for

    training • 5 conv. layers + 3 fully connected layers [Krizhevsky+, NIPS 2012]
  25. VGGNet and Inception • VGGNet [Simonyan+Zisserman, ICLR 2015] – CNN designed

    by Oxford's Visual Geometry Group – “Depth” is highlighted • Inception [Szegedy+, CVPR 2015] – CNN by Google – The Inception block (bottom right) is applied repeatedly. – After reducing the number of channels by 1x1 convolution, 3x3 or 5x5 convolution is applied → high expressiveness with fewer parameters
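The saving from the 1x1 convolution can be checked with simple arithmetic. The sketch below counts weights only (biases ignored); the channel sizes 256, 64, and 128 are hypothetical values chosen for illustration, not numbers from the Inception paper.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a square convolution layer, ignoring biases."""
    return in_channels * out_channels * kernel_size * kernel_size


# Hypothetical channel sizes for illustration.
in_ch, mid_ch, out_ch = 256, 64, 128

# A 5x5 convolution applied directly to all input channels.
direct = conv_params(in_ch, out_ch, 5)

# A 1x1 convolution that first reduces the channels, followed by the 5x5.
bottleneck = conv_params(in_ch, mid_ch, 1) + conv_params(mid_ch, out_ch, 5)

print(direct, bottleneck)
```

With these assumed sizes, the reduced path needs roughly a quarter of the weights of the direct 5x5 convolution.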
  26. ResNet • VGGNet has 16 / 19 layers – Further

    depth does not improve accuracy – Gradients may vanish during backpropagation • ResNet: skip connections between layers – Gradients are kept through identity mapping – ResNet = ensemble of multiple CNNs [Veit+, NIPS 2016] – Neural Ordinary Differential Equations [Chen+, NeurIPS 2018] [He+, CVPR 2016]
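Why the identity mapping keeps gradients alive can be seen with a scalar toy model. This is a sketch under strong assumptions: every layer is the hypothetical branch f(x) = w * x, so a plain layer has derivative w and a residual layer y = x + f(x) has derivative 1 + w. Real networks are not scalar, but the chain-rule product behaves in the same way.

```python
def chain_gradient(weights, use_skip):
    """d(output)/d(input) through a stack of toy scalar layers (chain rule)."""
    grad = 1.0
    for w in weights:
        # Plain layer: y = w * x, derivative w.
        # Residual layer: y = x + w * x, derivative 1 + w (identity term).
        grad *= (1.0 + w) if use_skip else w
    return grad


weights = [0.1] * 20  # 20 layers with small weights (hypothetical)
plain_grad = chain_gradient(weights, use_skip=False)
skip_grad = chain_gradient(weights, use_skip=True)
print(plain_grad, skip_grad)
```

Without the skip connection the product shrinks towards zero, while the identity term keeps each factor close to 1.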
  27. Object Detection • RCNN (Region CNN) [Girshick+, CVPR 2014] –

    region proposal from an image – CNN over each region • Faster RCNN [Ren+, NIPS 2015] – RCNN has to run the CNN many times for a single image – Faster RCNN: Apply the CNN only once to the whole image and estimate candidate regions at the same time → High speed and precision
  28. Semantic Segmentation • U-Net [Ronneberger+, MICCAI 2015] – Autoencoder +

    skip connection – The finer parts of each region can be segmented precisely • DeepLab v3+ [Chen+, ECCV 2018] – Feature extraction from multiple resolutions – Skip connection
  29. From 2D to 3D: PointNet [Qi+, CVPR 2017]

  30. Neural 3D Mesh Renderer [Kato+, CVPR 2018]

  31. Neural 3D Mesh Renderer Single 2D image 3D model [Kato+,

    CVPR 2018]
  32. Neural 3D Mesh Renderer 3D mesh rendering engine that is

    made differentiable for neural networks 3D model inference Rendering Error between estimation and reference 2D image 3D model Estimated 2D image (Silhouette) Reference silhouette 3D model estimator and rendering engine are updated with backpropagation Originally differentiable Differentiable
  33. Applications 3D meshing of images style transfer from 2D to

    3D 3D Deep Dream
  34. Computer Vision from the Point of View of Machine Learning

  35. Unsupervised Domain Adaptation (UDA) • Source Domain: Data are associated

    with ground truth, but recognizing them is not the goal of the application. • Target Domain: We want to recognize them, but there are no data associated with ground truth. • Semi-supervised Domain Adaptation: There are some target samples with ground truth. Video Game Real World
  36. UDA by Pseudo-Labeling [Saito+, ICML 2017]

  37. UDA by Pseudo-Labeling 1st: Training on MNIST → Add pseudo

    labels for easy samples 2nd~: Training on MNIST+α → Add more pseudo labels eight nine Asymmetric Tri-training for Domain Adaptation [Saito+, ICML 2017]
  38. p1 p2 pt S+Tl Tl S : source samples Tl

    : pseudo-labeled target samples Input X F1 F2 Ft ŷ : Pseudo-label for target sample y : Label for source sample F S+Tl F1 ,F2 : Labeling networks Ft : Target specific network F : Shared network Proposed Architecture
  39. p1 p2 pt S+Tl Tl S : source samples Tl

    : pseudo-labeled target samples Input X F1 F2 Ft ŷ : Pseudo-label for target sample y : Label for source sample F S+Tl F is updated using gradients from F1 ,F2 ,Ft Proposed Architecture
  40. p1 p2 pt S S S : source samples Tl

    : pseudo-labeled target samples Input X F1 F2 Ft ŷ : Pseudo-label for target sample y : Label for source sample F S All networks are trained using only source samples. 1. Initial training
  41. p1 p2 T Input X F1 F2 F T If

    F1 and F2 agree on their predictions, and either of their probabilities is larger than a threshold value, the corresponding label is given to the target sample. T: Target samples 2. Labeling target samples
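The labeling rule of step 2 can be written as a small function. This is a minimal sketch: `pseudo_label` is a hypothetical helper name, `p1` and `p2` stand for the class-probability outputs of F1 and F2 for one target sample, and the default threshold of 0.9 is an assumed value, since the slide does not state one.

```python
def pseudo_label(p1, p2, threshold=0.9):
    """Return the agreed class index, or None if the sample is rejected.

    p1, p2: class probabilities from the two labeling networks.
    threshold: assumed confidence threshold (not given on the slide).
    """
    label1 = max(range(len(p1)), key=p1.__getitem__)
    label2 = max(range(len(p2)), key=p2.__getitem__)
    if label1 != label2:
        return None  # the two networks disagree
    if max(p1[label1], p2[label2]) <= threshold:
        return None  # neither network is confident enough
    return label1


print(pseudo_label([0.05, 0.95], [0.40, 0.60]))  # agree, one is confident
print(pseudo_label([0.05, 0.95], [0.60, 0.40]))  # disagree
print(pseudo_label([0.40, 0.60], [0.45, 0.55]))  # agree, but not confident
```

Only the first call returns a label; the other two target samples stay unlabeled in this round.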
  42. F1 , F2 : source and pseudo-labeled samples Ft :

    pseudo-labeled ones F : learn from all gradients p1 p2 pt S+Tl Tl S : source samples Tl : pseudo-labeled target samples Input X F1 F2 Ft ŷ : Pseudo-label for target sample y : Label for source sample F S+Tl 3. Retraining network using pseudo-labeled target samples
  43. p1 p2 pt S+Tl Tl S : source samples Tl

    : pseudo-labeled target samples Input X F1 F2 Ft ŷ : Pseudo-label for target sample y : Label for source sample F S+Tl Repeat the 2nd step and 3rd step until convergence! 3. Retraining network using pseudo-labeled target samples
  44. Overall objective W1 W2 p1 p2 pt S+Tl

    F1 F2 Ft F S+Tl Tl Cross-entropy losses train F1, F2, and Ft; a constraint on the weights W1 and W2 forces F1 and F2 to learn from different features.
  45. Experiments • Four adaptation scenarios between digits datasets – MNIST,

    SVHN, SYN DIGIT (synthesized digits) • One adaptation scenario between traffic signs datasets – GTSRB (real traffic signs), SYN SIGN (synthesized signs) GTSRB SYN SIGNS SYN DIGITS SVHN MNIST MNIST-M
  46. Accuracy on Target Domain • Our method outperformed other methods.

    – The effect of BN is obvious in some settings. – The effect of weight constraint is not obvious.
    Source                                   MNIST  MNIST  SVHN   SYNDIG  SYN SIGNS
    Target                                   MN-M   SVHN   MNIST  SVHN    GTSRB
    Source Only (w/o BN)                     59.1   37.2   68.1   84.1    79.2
    Source Only (with BN)                    57.1   34.9   70.1   85.5    75.7
    DANN [Ganin et al., 2014]                81.5   35.7   71.1   90.3    88.7
    MMD [Long et al., 2015 ICML]             76.9   -      71.1   88.0    91.1
    DSN [Bousmalis et al., 2016 NIPS]        83.2   -      82.7   91.2    93.1
    K-NN Labeling [Sener et al., 2016 NIPS]  86.7   40.3   78.8   -       -
    Ours (w/o BN)                            85.3   39.8   79.8   93.1    96.2
    Ours (w/o weight constraint)             94.2   49.7   86.0   92.4    94.0
    Ours                                     94.0   52.8   86.8   92.9    96.2
  47. Another approach: Generative models • Generator = Feature extractor •

    Minimizing domain shifts – Backgrounds – Postures – Lighting conditions – … Desirable features for domain adaptation are those from which we • can recognize the objects • cannot recognize the domain
  48. Deep Domain Confusion (DDC) Simultaneous maximization of: • Classification accuracy

    on source domain • Overlap between source & target domains [Tzeng+, arXiv 2014]
  49. Network Architecture of DDC • Optimization of Classification Loss+Domain Loss

    • Domain Loss: – Maximum Mean Discrepancy (MMD): distance between the averaged feature on the source domain and the averaged feature on the target domain (a smaller MMD means more overlap) – Objective function: Linear combination of MMD and classification loss
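The domain loss on this slide compares the averaged source feature with the averaged target feature. The sketch below implements that linear form of MMD (squared Euclidean distance between the two means) and the linear combination with the classification loss. The function names and the weight `lam=0.25` are assumptions made for illustration; the slide does not give a value.

```python
def mean_feature(features):
    """Average a list of equally sized feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def linear_mmd2(source, target):
    """Squared distance between the averaged source and target features."""
    mu_s, mu_t = mean_feature(source), mean_feature(target)
    return sum((s - t) ** 2 for s, t in zip(mu_s, mu_t))


def total_loss(classification_loss, source, target, lam=0.25):
    """Linear combination of classification loss and MMD (lam is assumed)."""
    return classification_loss + lam * linear_mmd2(source, target)


# Toy 2-D features (hypothetical values).
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 3.0], [4.0, 5.0]]
print(linear_mmd2(source, target))
print(total_loss(0.5, source, target))
```

Driving the MMD term towards zero pulls the two feature means together, which is what "overlap between source and target domains" refers to.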
  50. Experiments using Office Dataset • Three datasets: – for the

    same objects – under different environments • Proposed method: – Minimizing the maximum mean discrepancy leads to the best accuracy.
  51. Qualitative Discussion: t-SNE Plot Before adaptation: Different distributions for the

    same “Monitors” on source (blue) and target (green) domains
  52. Qualitative Discussion: t-SNE Plot After adaptation: Each distribution overlaps the

    other
  53. Deep Adaptation Networks (DAN) • Multiple Kernels for MMD (ML-MMD)

    • In comparison to DDC: – MMD calculation among multiple layers – Nonlinear distance with multiple kernels • Experimental Results on Office Dataset [Long+, ICML 2015]
  54. Domain Adversarial Neural Networks (DANN) • The original name didn’t

    include “adversarial” – The name “Domain Adversarial Neural Networks” appears in the journal version [Ganin+, JMLR 2016] – Maybe confused with Deep Adaptation Networks (DAN) • Similar motivation to GANs: Adversarial learning to generate (extract) domain-invariant feature vectors – GAN: generated data vs. real data – DANN: feature vectors on a source domain vs. feature vectors on a target domain [Ganin+Lempitsky, ICML 2015]
  55. Network Architecture of DANN • Feature extractor: tries to extract domain-invariant features

    • Label predictor: classifies source data • Domain classifier: aims to distinguish the two domains
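  56. Adversarial Learning • Domain classification loss – The domain classifier attempts to minimize it

    – The feature extractor attempts to maximize it • Problem: with respect to the gradient of this loss – Gradient descent for the domain classifier – Gradient ascent for the feature extractor How can the direction be reversed between the two?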
  57. Gradient Reversal Layer (GRL) The GRL introduces a “function” that • does nothing

    during the forward pass • reverses the sign of the gradient during backpropagation → Simultaneous gradient descent + ascent
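The behaviour of the gradient reversal layer can be sketched without any deep learning framework. The class below is a hypothetical stand-in with hand-written `forward` and `backward` methods; in practice the same idea would be registered as a custom autograd operation. The `scale` factor is an assumption of this sketch.

```python
class GradientReversal:
    """Identity in the forward pass, sign flip in the backward pass."""

    def __init__(self, scale=1.0):
        # Assumed scaling factor applied to the reversed gradient.
        self.scale = scale

    def forward(self, x):
        # Does nothing: the features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Reverses the sign of the gradient coming from the domain classifier.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal()
features = layer.forward([1.0, -2.0])
grad_to_extractor = layer.backward([0.5, -0.25])
print(features, grad_to_extractor)
```

Because the domain classifier sees the original gradient and the feature extractor sees the reversed one, a single gradient descent step performs descent for the former and ascent for the latter.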
  58. Experimental Results • Office Dataset • Digit Dataset Feature distributions

    SYN NUMBERS (red) →SVHN (blue) Adapt
  59. Adversarial Discriminative Domain Adaptation Adversarial learning similar to DANN [Tzeng+,

    CVPR 2017]
  60. Disadvantages of DANN • DANN uses a single feature extractor

    for both domains ✓The number of parameters can be reduced ×It might be impossible to extract features of different domains using the same extractor. • Gradient Reversal Layer ✓It is faithful to the objective function of GANs ×Gradients from the discriminator may vanish early in training
  61. ADDA • Features are extracted by a separate CNN

    for each domain – The CNN for the source domain is pre-trained • Uses the inverted-label loss common in GANs instead of gradient reversal (trained networks: target feature extractor and domain discriminator)
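The inverted-label loss mentioned here can be illustrated with scalar discriminator outputs. This is a sketch under assumptions: `d_source` and `d_target` are hypothetical names for the discriminator's predicted probability that a feature comes from the source domain, and the labeling convention (source = 1, target = 0) is chosen for the example.

```python
import math


def discriminator_loss(d_source, d_target):
    """Binary cross-entropy with source features labeled 1 and target 0."""
    return -math.log(d_source) - math.log(1.0 - d_target)


def target_encoder_loss(d_target):
    """Inverted label: the target encoder is trained as if its features
    were labeled 1 (source), instead of negating the discriminator loss."""
    return -math.log(d_target)


# When the discriminator easily spots target features (d_target is small),
# the inverted-label loss is large, so the target encoder still receives
# a strong training signal early on.
print(target_encoder_loss(0.1), target_encoder_loss(0.9))
print(discriminator_loss(0.5, 0.5))
```

This is the usual reason given for preferring inverted labels over gradient reversal: the encoder's loss does not flatten out while the discriminator is still winning.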
  62. Experimental Results State-of-the-art on Office and digit datasets

  63. Maximum Classifier Discrepancy (MCD) [Saito+, CVPR 2018]

  64. Maximum Classifier Discrepancy (MCD) So far, we've tried to match

    domains, but • Even if the distributions of the two domains overlap, the distributions of each class may not agree. [Saito+, CVPR 2018]
  65. Maximum Classifier Discrepancy (MCD) So far, we've tried to match

    domains, but • Even if the distributions of the two domains overlap, the distributions of each class may not agree. • We should match classifiers instead of domains. [Saito+, CVPR 2018]
  66. Maximum Classifier Discrepancy (MCD) 1. Preparing two classifiers for each

    class Classifier trained on source domain ・avoid the dotted line area ・may cross the solid line area This diagonal line area (Discrepancy Region) should be eliminated.
  67. Maximum Classifier Discrepancy (MCD) 2. Finding as many discrepancies as

    possible Only classifiers are updated
  68. Maximum Classifier Discrepancy (MCD) 3. Learning extractors to reduce discrepancy

    Only generator is updated
  69. Maximum Classifier Discrepancy (MCD) Repeat steps 2 and 3 until

    convergence
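Steps 2 and 3 can be summarized by the quantity each step optimizes. The sketch below assumes that the discrepancy between the two classifiers is the mean absolute difference of their class probabilities on a target sample, and that the classifiers also keep minimizing their source cross-entropy in step 2. Both are assumptions of this sketch (the slides do not spell out the formulas), and all function names are hypothetical.

```python
def discrepancy(p1, p2):
    """Mean absolute difference between two class-probability vectors."""
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)


def classifier_step_loss(source_ce, p1_target, p2_target):
    """Step 2 (only classifiers are updated): stay accurate on the source
    while making the target discrepancy as large as possible."""
    return source_ce - discrepancy(p1_target, p2_target)


def generator_step_loss(p1_target, p2_target):
    """Step 3 (only the generator is updated): reduce the discrepancy."""
    return discrepancy(p1_target, p2_target)


# Two classifiers that fully disagree on a target sample.
p1, p2 = [1.0, 0.0], [0.0, 1.0]
print(discrepancy(p1, p2))
print(classifier_step_loss(0.5, p1, p2))
print(generator_step_loss([0.5, 0.5], [0.5, 0.5]))
```

Alternating the two losses gives the min-max game of the slides: classifiers expose target samples near the decision boundary, and the feature generator moves those samples away from it.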
  70. Experimental Results State-of-the-art on digit dataset

  71. Experimental Results Semantic Segmentation using synthesized data and real data

  72. Adversarial Dropout Regularization (ADR) [Saito+, ICLR 2018]

  73. Adversarial Dropout Regularization (ADR) So far, we've tried to match

    domains, but • Even if the distributions of the two domains overlap, the distributions of each class may not agree. • We should match classifiers instead of domains. [Saito+, ICLR 2018] ... oh, I think I heard it a while ago.
  74. ADR ≃ MCD by dropout These two classifiers are •

    Trained directly by MCD • Generated by dropout (proposed)
  75. Training is similar to MCD

  76. Experimental Results State-of-the-art on digit datasets

  77. Experimental Results Semantic Segmentation using synthesized data and real data

  78. Open Set Domain Adaptation (OSDA) [Saito+, ECCV 2018]

  79. Source Target Closed Domain Adaptation Open Set Domain Adaptation Source

    Target Unknown ・ Closed set: source and target share all classes. ・ Target samples are unlabeled. ・ Open set: the target contains unknown classes. cf. Reversed setting = Partial Domain Adaptation [Cao+, ECCV 2018] OSDA by Backpropagation [Saito+, ECCV 2018]
  80. OSDA by Backpropagation [Saito+, ECCV 2018]

  81. Domain Adaptation for Object Detection [Saito+, CVPR 2019]

  82. Strong-Weak Distribution Alignment [Saito+, CVPR 2019]

  83. Computer Vision and Natural Language Processing

  84. 2014: Another impact of Deep Learning • Deep learning appears

    in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing gradient problem in RNNs →Dealing with relations between distant words in a sentence – Four-layer LSTM is trained in an end-to-end manner →comparable to state-of-the-art (English to French) • Emergence of common techs such as CNN/RNN Lower barriers to entering CV+NLP Input Output
  85. Growth of user generated contents Especially in content posting/sharing services

    • Facebook: 300 million photos per day • YouTube: 400 hours of video per minute Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees look spectacular. Pairs of a sentence + a video / photo →Collectable in large quantities
  86. Exploratory research on Vision and Language Captioning an image associated

    with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for image • Dataset: Sets of article + image + caption × 3361 King Tupou IV died at the age of 88 last week.
  87. Exploratory research on Vision and Language Captioning an image associated

    with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for image • Dataset: Sets of article + image + caption × 3361 King Tupou IV died at the age of 88 last week. As a result of these backgrounds: Various research topics such as …
  88. Image Captioning [Ushiku+, ACM Multimedia 2012]

  89. Image Captioning [Ushiku+, ACM Multimedia 2012]

  90. Google NIC Concatenation of Google’s methods • GoogLeNet [Szegedy+, CVPR

    2015] • MT with LSTM [Sutskever+, NIPS 2014] Caption (word seq.) $S_0 \dots S_N$ for image $I$ – $S_0$: beginning of the sentence – $S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$ – $S_t = \mathrm{LSTM}(S_{t-1}),\ t = 2 \dots N-1$ – $S_N$: end of the sentence [Vinyals+, CVPR 2015]
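The word-by-word generation on this slide can be sketched as a decoding loop: start from the beginning-of-sentence token, feed the previous word back in, and stop at the end-of-sentence token. The sketch uses greedy decoding for brevity (a simplification, not necessarily what the authors used), and `toy_next_word` with its lookup table is a made-up stand-in for the CNN-conditioned LSTM.

```python
START, END = "<s>", "</s>"


def greedy_decode(next_word, image_feature, max_len=20):
    """Generate words one at a time until END or max_len is reached.

    next_word(prev_word, state) -> (word, new_state) stands in for one
    LSTM step; the image feature initializes the state.
    """
    words = []
    state, word = image_feature, START
    for _ in range(max_len):
        word, state = next_word(word, state)
        if word == END:
            break
        words.append(word)
    return words


# Toy stand-in for the trained model (hypothetical transitions).
TOY_TABLE = {"<s>": "a", "a": "yellow", "yellow": "train", "train": "</s>"}


def toy_next_word(prev_word, state):
    return TOY_TABLE[prev_word], state


caption = greedy_decode(toy_next_word, image_feature=None)
print(" ".join(caption))
```

Replacing the single best word per step with the k best partial sentences would turn this loop into beam search.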
  91. Video Captioning A man is holding a box of doughnuts.

    Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]
  92. Multilingual Captioning Transfer learning among languages [Miyazaki+Shimizu, ACL 2016] •

    Vision-Language grounding $W_{im}$ is transferred • Efficient learning using a small amount of captions an elephant is an elephant 一匹の 象が 土の 一匹の 象が
  93. Image Caption Translation Ein Masten mit zwei Ampeln für Autofahrer.

    (German) A pole with two lights for drivers. (English) [Hitschler+, ACL 2016]
  94. Visual Question Answering [Fukui+, EMNLP 2016]

  95. VQA = Multiclass Classification The integrated feature is fed to a standard classifier

    Question What objects are found on the bed? Answer bed sheets, pillow Image Image feature Question feature Integrated feature
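Treating VQA as multiclass classification means that each candidate answer is one class. The sketch below integrates an image feature and a question feature and scores a fixed answer list with a linear classifier. Plain concatenation as the integration step, the toy weights, and the answer list are all assumptions of this sketch; the cited paper uses a more elaborate fusion.

```python
def fuse(image_feature, question_feature):
    """Integrated feature; plain concatenation is an assumption here."""
    return image_feature + question_feature  # list concatenation


def classify(feature, weights, answers):
    """Linear classifier: one weight row per candidate answer."""
    scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best]


# Toy features, weights, and answer vocabulary (hypothetical values).
image_feature = [1.0, 0.0]
question_feature = [0.0, 1.0]
answers = ["pillow", "dog"]
weights = [
    [1.0, 0.0, 0.0, 1.0],  # row for "pillow"
    [0.0, 1.0, 1.0, 0.0],  # row for "dog"
]

answer = classify(fuse(image_feature, question_feature), weights, answers)
print(answer)
```

The answer space is closed in this formulation: the model can only return answers that appear in its answer list.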
  96. Image Generation from Captions This bird is blue with white

    and has a very short beak. This flower is white and yellow in color, with petals that are wavy and smooth. [Zhang+, 2016]
  97. Towards more realistic image generation StackGAN [Zhang+, 2016] Two-step GANs

    • First GAN generates small and fuzzy image • Second GAN enlarges and refines it
  98. Visual Dialog (VisDial) A sequence of visual questions and answers Questioner

    Answerer A couple of people in the snow on skis. [Das+, CVPR 2017]
  99. Visual Dialog (VisDial) Questioner Answerer A couple of people in

    the snow on skis. What are their genders? Are they both adults? Do they wear goggles? Do they have hats on? Are there any other people? What color is man’s hat? Is it snowing now? What is woman wearing? Are they smiling? Do you see trees? 1 man 1 woman Yes Looks like sunglasses Man does No Black No Blue jacket and black pants Yes Yes [Das+, CVPR 2017]
  100. Vision-and-Language Navigation (VLN) [Anderson+, ICCV 2017]

  101. Summary • Computer Vision: Short History • Computer Vision from

    Machine Learning • Introduction of Vision and Language • Contributions of Deep Learning – Most research themes existed before Deep Learning – Commodity techs for processing images, videos and natural languages – Evolution of recognition and generation Towards a new stage of vision and language!
  103. Supplementary material: Details about Visual Captioning

  104. Every picture tells a story Dataset: Images + <object, action,

    scene> + Captions 1. Predict <object, action, scene> for an input image using MRF 2. Search for the existing caption associated with similar <object, action, scene> <Horse, Ride, Field> [Farhadi+, ECCV 2010]
  105. Every picture tells a story <pet, sleep, ground> See something

    unexpected. <transportation, move, track> A man stands next to a train on a cloudy day. [Farhadi+, ECCV 2010]
  106. Retrieve? Generate? • Retrieve • Generate – Template-based e.g. generating

    a Subject+Verb sentence – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset
  107. Retrieve? Generate? • Retrieve – A small gray dog on

    a leash. • Generate – Template-based e.g. generating a Subject+Verb sentence – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset
  108. Retrieve? Generate? • Retrieve – A small gray dog on

    a leash. • Generate – Template-based dog+stand ⇒ A dog stands. – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset
  109. Retrieve? Generate? • Retrieve – A small gray dog on

    a leash. • Generate – Template-based dog+stand ⇒ A dog stands. – Template-free A small white dog standing on a leash. A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset
  110. Captioning with multi-keyphrases [Ushiku+, ACM MM 2012]

  111. End of sentence [Ushiku+, ACM MM 2012]

  112. Benefits of Deep Learning • Refinement of image recognition [Krizhevsky+,

    NIPS 2012] • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing gradient problem in RNNs →Dealing with relations between distant words in a sentence – Four-layer LSTM is trained in an end-to-end manner →comparable to state-of-the-art (English to French) Emergence of common techs such as CNN/RNN Lower barriers to entering CV+NLP Input Output
  113. Google NIC Concatenation of Google’s methods • GoogLeNet [Szegedy+, CVPR

    2015] • MT with LSTM [Sutskever+, NIPS 2014] Caption (word seq.) $S_0 \dots S_N$ for image $I$ – $S_0$: beginning of the sentence – $S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$ – $S_t = \mathrm{LSTM}(S_{t-1}),\ t = 2 \dots N-1$ – $S_N$: end of the sentence [Vinyals+, CVPR 2015]
  114. Examples of generated captions [https://github.com/tensorflow/models/tree/master/im2txt] [Vinyals+, CVPR 2015]

  115. Comparison to [Ushiku+, ACM MM 2012] Input image

    • Estimation of important words – [Ushiku+, ACM MM 2012]: conventional object recognition (Fisher Vector + linear classifier) – Neural image captioning: Convolutional Neural Network • Connecting the words with a grammar model – [Ushiku+, ACM MM 2012]: conventional machine translation (log-linear model + beam search) – Neural image captioning: Recurrent Neural Network + beam search • Trained using only images and captions • Approaches are similar to each other
  116. Current development: Accuracy • Attention-based captioning [Xu+, ICML 2015] –

    Focus on some areas for predicting each word! – Both attention and caption models are trained using pairs of an image & caption
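"Focusing on some areas" can be made concrete with soft attention: each image region gets a score, a softmax turns the scores into weights, and the weighted average of the region features is used to predict the next word. This is a minimal sketch; how the scores are computed from the decoder state is omitted, and the function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return attention weights and the weighted average of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Two image regions with equal scores (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(regions, [0.0, 0.0])
print(weights, context)

# A much higher score for the first region moves the focus there.
focused_weights, _ = attend(regions, [5.0, 0.0])
print(focused_weights)
```

The weights are recomputed for every generated word, so the model can look at a different area for each word.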
  117. Current development: Problem setting Dense captioning [Lin+, BMVC 2015] [Johnson+,

    CVPR 2016]
  118. Current development: Problem setting Generating captions for a photo sequence

    [Park+Kim, NIPS 2015][Huang+, NAACL 2016] The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.
  119. Current development: Problem setting Captioning using sentiment terms [Mathews+, AAAI

    2016][Shin+, BMVC 2016] Neutral caption Positive caption
  120. Before Deep Learning • Grounding of languages and objects in

    videos [Yu+Siskind, ACL 2013] – Learning from only videos and their captions – Experiments in a small setting with few objects – Controlled and small dataset • Deep Learning should suit this problem – Image Captioning: single image → word sequence – Video Captioning: image sequence → word sequence
  121. End-to-end learning by Deep Learning • LRCN [Donahue+, CVPR 2015]

    – CNN+RNN for • Action recognition • Image / Video Captioning • Video to Text [Venugopalan+, ICCV 2015] – CNNs to recognize • Objects from RGB frames • Actions from flow images – RNN for captioning
  122. Video Captioning A man is holding a box of doughnuts.

    Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]
  123. Video Captioning A boat is floating on the water near

    a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water. [Shin+, ICIP 2016]
  124. Video Retrieval from Caption • Input: Captions • Output: A

    video related to the caption 10 sec video clip from 40 min database! • Video captioning is also addressed A woman in blue is playing ping pong in a room. A guy is skiing with no shirt on and yellow snow pants. A man is water skiing while attached to a long rope. [Yamaguchi+, ICCV 2017]