Slide 1

Slide 1 text

ACML 2019 Tutorial 1: Deep Learning for Natural Language Processing and Computer Vision. Computer Vision + beyond. OMRON SINIC X / Ridge-i. Yoshitaka Ushiku (losnuevetoros)

Slide 2

Slide 2 text

2011 2012 2014

Slide 3

Slide 3 text

2011 2012 2014 Speech recognition error 30% → less than 20% [Seide+, InterSpeech 2011]

Slide 4

Slide 4 text

2011 2012 2014 Speech recognition error 30% → less than 20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012]

Slide 5

Slide 5 text

2011 2012 2014 Speech recognition error 30% → less than 20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012] Machine translation system Complicated → simple [Sutskever+, NIPS 2014]

Slide 6

Slide 6 text

2012: Impact of Deep Learning Academic AI startup A famous company Many slides refer to the first use of CNN (AlexNet) on ImageNet

Slide 7

Slide 7 text

2012: Impact of Deep Learning Academic AI startup A famous company Many slides refer to the first use of CNN (AlexNet) on ImageNet

Slide 8

Slide 8 text

2012: Impact of Deep Learning Academic AI startup A famous company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet

Slide 9

Slide 9 text

2012: Impact of Deep Learning Academic AI startup A famous company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet

Slide 10

Slide 10 text

2012: Impact of Deep Learning Academic AI startup A famous company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet

Slide 11

Slide 11 text

2012: Impact of Deep Learning According to the official site… 1st team w/ DL Error rate: 15% [http://image-net.org/challenges/LSVRC/2012/results.html]

Slide 12

Slide 12 text

2012: Impact of Deep Learning According to the official site… 1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html]

Slide 13

Slide 13 text

2012: Impact of Deep Learning According to the official site… 1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html]

Slide 14

Slide 14 text

2012: Impact of Deep Learning According to the official site… 1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html] It’s me!!

Slide 15

Slide 15 text

Yoshitaka Ushiku Ph.D. 2013.5~2013.8 Research Intern, Microsoft Research 2014.4 Ph.D. (The University of Tokyo) 2014.4~2016.3 Research Scientist, NTT CS Lab. 2016.4~ Lecturer, The University of Tokyo 2018.4~ Principal Investigator, OMRON SINIC X Corp. 2019.1~ Chief Research Officer, Ridge-i Co., Ltd. [Ushiku+, ACMMM 2012] [Ushiku+, ICCV 2015] Image Captioning Image Captioning with Sentiment Terms Cross-modal Retrieval with Videos and Texts [Yamaguchi+, ICCV 2017] A guy is skiing with no shirt on and yellow snow pants. A zebra standing in a field with a tree in the dirty background. [Shin+, BMVC 2016] A yellow train on the tracks near a train station.

Slide 16

Slide 16 text

Today’s tutorial • Computer Vision: Short History – Detection, segmentation, and 3D rendering • Computer Vision from the Point of View of Machine Learning – Domain adaptation • Computer Vision and Natural Language Processing – Vision & Language

Slide 17

Slide 17 text

Computer Vision: Short History

Slide 18

Slide 18 text

2011 2012 2014 Speech recognition error 30% → less than 20% [Seide+, InterSpeech 2011] Image classification error 25% →15% [Krizhevsky+, NIPS 2012] Machine translation system Complicated → simple [Sutskever+, NIPS 2014]

Slide 19

Slide 19 text

2011 2012 2014

Slide 20

Slide 20 text

2011 2012 2014 DNN

Slide 21

Slide 21 text

2011 2012 2014 DNN CNN

Slide 22

Slide 22 text

2011 2012 2014 DNN CNN RNN

Slide 23

Slide 23 text

Convolution • The filter values are multiplied pixel-by-pixel with the area at position (i, j) of the left (input) image – their sum is stored at the corresponding position of the right (output) image • Multiple 2D filters → 3D array [Dumoulin+Visin, 2016]
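As an illustration (not from the slides), here is a minimal NumPy sketch of the operation just described: each filter is multiplied element-wise with the patch at position (i, j) of the input, the sum is written to the output, and stacking several 2D filters yields a 3D array.

```python
import numpy as np

def conv2d(image: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """image: (H, W), filters: (K, h, w) -> output: (K, H-h+1, W-w+1)."""
    K, h, w = filters.shape
    H, W = image.shape
    out = np.zeros((K, H - h + 1, W - w + 1))
    for k in range(K):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                # element-wise product of filter k and the (i, j) patch, then sum
                out[k, i, j] = np.sum(image[i:i + h, j:j + w] * filters[k])
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
filters = np.stack([np.ones((3, 3)), np.eye(3)])   # two 2D filters
print(conv2d(image, filters).shape)                # (2, 3, 3): a 3D array
```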

Slide 24

Slide 24 text

AlexNet • AlexNet – 2 GPUs – 1 week for training • 5 conv. layers + 3 fully connected layers [Krizhevsky+, NIPS 2012]

Slide 25

Slide 25 text

VGGNet and Inception • VGGNet [Simonyan+Zisserman, ICLR 2015] – CNN designed by Oxford's Visual Geometry Group – "Depth" is highlighted • Inception [Szegedy+, CVPR 2015] – CNN by Google – The Inception Block (bottom right) is applied repeatedly – After reducing the number of channels by 1x1 convolution, 3x3 or 5x5 convolution is applied → high expressiveness with fewer parameters

Slide 26

Slide 26 text

ResNet • VGGNet has 16 / 19 layers – Further depth does not improve accuracy – Gradients may vanish during backpropagation • ResNet: skip connections among layers – Gradients are kept through the identity mapping – ResNet = ensemble of multiple CNNs [Veit+, NIPS 2016] – Neural Ordinary Differential Equations [Chen+, NeurIPS 2018] [He+, CVPR 2016]
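A minimal PyTorch sketch (an illustrative block, not the exact ResNet architecture) of the skip connection: the identity shortcut lets gradients pass through unchanged during backpropagation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity path keeps gradients from vanishing."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: identity mapping added back

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```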

Slide 27

Slide 27 text

Object Detection • RCNN (Region CNN) [Girshick+, CVPR 2014] – region proposals from an image – a CNN over each region • Faster RCNN [Ren+, NIPS 2015] – RCNN requires running the CNN many times for a single image – Faster RCNN: apply the CNN only once to the whole image and estimate candidate regions at the same time → high speed and precision

Slide 28

Slide 28 text

Semantic Segmentation • U-Net [Ronneberger+, MICCAI 2015] – Autoencoder + skip connections – The finer parts of each region can be segmented precisely • DeepLab v3 [Chen+, ECCV 2018] – Feature extraction at multiple resolutions – Skip connections

Slide 29

Slide 29 text

From 2D to 3D: PointNet [Qi+, CVPR 2017]

Slide 30

Slide 30 text

Neural 3D Mesh Renderer [Kato+, CVPR 2018]

Slide 31

Slide 31 text

Neural 3D Mesh Renderer Single 2D image 3D model [Kato+, CVPR 2018]

Slide 32

Slide 32 text

Neural 3D Mesh Renderer: a 3D mesh rendering engine made differentiable for neural networks. Pipeline: 2D image → 3D model inference → 3D model → rendering → estimated 2D image (silhouette) → error between the estimation and the reference silhouette. The 3D model estimator (originally differentiable) and the rendering engine (made differentiable) are updated with backpropagation.
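A conceptual training-step sketch of the pipeline above; `estimator` and `render_silhouette` are hypothetical stand-ins (an estimator network and a differentiable silhouette renderer), not the actual API of the paper's code.

```python
import torch
import torch.nn.functional as F

def train_step(estimator, render_silhouette, optimizer, image, ref_silhouette):
    """One update: infer a mesh from a single 2D image, render its silhouette
    differentiably, and backpropagate the silhouette error into the estimator."""
    vertices, faces = estimator(image)                    # 3D model inference
    est_silhouette = render_silhouette(vertices, faces)   # differentiable rendering
    loss = F.mse_loss(est_silhouette, ref_silhouette)     # error vs. reference silhouette
    optimizer.zero_grad()
    loss.backward()    # gradients flow through the renderer into the estimator
    optimizer.step()
    return loss.item()
```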

Slide 33

Slide 33 text

Applications: 3D meshing of images, style transfer from 2D to 3D, 3D Deep Dream

Slide 34

Slide 34 text

Computer Vision from the Point of View of Machine Learning

Slide 35

Slide 35 text

Unsupervised Domain Adaptation (UDA) • Source Domain: data come with ground truth, but they are not the data we ultimately want to recognize in the application. • Target Domain: the data we want to recognize, but no ground truth is available. • Semi-supervised Domain Adaptation: some target samples come with ground truth. Example: video game (source) vs. real world (target)

Slide 36

Slide 36 text

UDA by Pseudo-Labeling [Saito+, ICML 2017]

Slide 37

Slide 37 text

UDA by Pseudo-Labeling 1st round: training on MNIST → add pseudo-labels for easy target samples 2nd round onward: training on MNIST + pseudo-labeled samples → add more pseudo-labels Asymmetric Tri-training for Domain Adaptation [Saito+, ICML 2017]

Slide 38

Slide 38 text

Proposed Architecture. F: shared network; F1, F2: labeling networks; Ft: target-specific network. Input X; S: source samples; Tl: pseudo-labeled target samples; y: label for a source sample; ŷ: pseudo-label for a target sample. F1 and F2 are trained on S+Tl; Ft is trained on Tl.

Slide 39

Slide 39 text

Proposed Architecture. F is updated using gradients from F1, F2, and Ft.

Slide 40

Slide 40 text

1. Initial training: all networks are trained using only source samples (S).

Slide 41

Slide 41 text

2. Labeling target samples: if F1 and F2 agree on their predictions and either of their predicted probabilities is larger than a threshold value, the corresponding label is given to the target sample. (T: target samples)

Slide 42

Slide 42 text

3. Retraining the networks using pseudo-labeled target samples: F1 and F2 learn from source and pseudo-labeled samples (S+Tl); Ft from pseudo-labeled samples only (Tl); F learns from all gradients.

Slide 43

Slide 43 text

3. Retraining the networks using pseudo-labeled target samples: repeat the 2nd and 3rd steps until convergence.

Slide 44

Slide 44 text

Overall Objective: cross-entropy losses for F1 and F2 (trained on S+Tl) and for Ft (trained on Tl), plus a constraint on the classifier weights W1 and W2 to force F1 and F2 to learn from different features.
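A minimal PyTorch sketch of this objective, with hypothetical network sizes and assuming the weight constraint takes the form |W1ᵀW2| on the classifier weights (as suggested by the W1/W2 term on the slide).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())      # F
f1, f2, ft = nn.Linear(128, 10), nn.Linear(128, 10), nn.Linear(128, 10)   # F1, F2, Ft

def overall_loss(x_st, y_st, x_tl, y_tl, lam=0.01):
    """x_st, y_st: source + pseudo-labeled target (S + Tl);
    x_tl, y_tl: pseudo-labeled target only (Tl)."""
    h_st, h_tl = shared(x_st), shared(x_tl)
    ce = (F.cross_entropy(f1(h_st), y_st)
          + F.cross_entropy(f2(h_st), y_st)
          + F.cross_entropy(ft(h_tl), y_tl))
    # weight constraint |W1^T W2|: pushes F1 and F2 to rely on different features
    constraint = torch.mm(f1.weight, f2.weight.t()).abs().sum()
    return ce + lam * constraint

x = torch.randn(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))
print(overall_loss(x, y, x, y).item())
```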

Slide 45

Slide 45 text

Experiments • Four adaptation scenarios between digits datasets – MNIST, SVHN, SYN DIGIT (synthesized digits) • One adaptation scenario between traffic signs datasets – GTSRB (real traffic signs), SYN SIGN (synthesized signs) GTSRB SYN SIGNS SYN DIGITS SVHN MNIST MNIST-M

Slide 46

Slide 46 text

Accuracy on Target Domain • Our method outperformed other methods. – The effect of BN is obvious in some settings. – The effect of the weight constraint is not obvious.

Method                                  | MNIST→MN-M | MNIST→SVHN | SVHN→MNIST | SYNDIG→SVHN | SYN NUM→GTSRB
Source Only (w/o BN)                    | 59.1 | 37.2 | 68.1 | 84.1 | 79.2
Source Only (with BN)                   | 57.1 | 34.9 | 70.1 | 85.5 | 75.7
DANN [Ganin et al., 2014]               | 81.5 | 35.7 | 71.1 | 90.3 | 88.7
MMD [Long et al., ICML 2015]            | 76.9 | -    | 71.1 | 88.0 | 91.1
DSN [Bousmalis et al., NIPS 2016]       | 83.2 | -    | 82.7 | 91.2 | 93.1
K-NN Labeling [Sener et al., NIPS 2016] | 86.7 | 40.3 | 78.8 | -    | -
Ours (w/o BN)                           | 85.3 | 39.8 | 79.8 | 93.1 | 96.2
Ours (w/o weight constraint)            | 94.2 | 49.7 | 86.0 | 92.4 | 94.0
Ours                                    | 94.0 | 52.8 | 86.8 | 92.9 | 96.2

Slide 47

Slide 47 text

Another approach: generative models • Generator = feature extractor • Minimizing domain shifts – backgrounds – postures – lighting conditions – … Features from which we can recognize the objects but cannot recognize the domains are desirable for domain adaptation.

Slide 48

Slide 48 text

Deep Domain Confusion (DDC) Simultaneous maximization of: • Classification accuracy on source domain • Overlap between source & target domains [Tzeng+, arXiv 2014]

Slide 49

Slide 49 text

Network Architecture of DDC • Optimization of classification loss + domain loss • Domain loss: – Maximum Mean Discrepancy (MMD): overlap between source and target domains, computed between the averaged feature on the source domain and the averaged feature on the target domain – Objective function: linear combination of MMD and the classification loss
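A sketch of this domain loss, assuming the linear-kernel form of MMD (the squared distance between the averaged source feature and the averaged target feature mentioned above); `lam` is a hypothetical trade-off weight.

```python
import torch

def mmd_linear(src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
    """MMD with a linear kernel: squared distance between the mean source
    feature and the mean target feature."""
    return (src_feats.mean(dim=0) - tgt_feats.mean(dim=0)).pow(2).sum()

def ddc_objective(cls_loss: torch.Tensor, src_feats, tgt_feats, lam: float = 0.25):
    """Linear combination of the classification loss and the MMD domain loss."""
    return cls_loss + lam * mmd_linear(src_feats, tgt_feats)

src = torch.randn(32, 256)
tgt = torch.randn(32, 256) + 1.0   # a shifted (different) domain
print(mmd_linear(src, tgt).item())
```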

Slide 50

Slide 50 text

Experiments using the Office Dataset • Three datasets: – for the same objects – under different environments • Proposed method: adding the maximum-mean-discrepancy domain loss leads to the best accuracy.

Slide 51

Slide 51 text

Qualitative Discussion: t-SNE Plot Before adaptation: Different distributions for the same “Monitors” on source (blue) and target (green) domains

Slide 52

Slide 52 text

Qualitative Discussion: t-SNE Plot After adaptation: Each distribution overlaps the other

Slide 53

Slide 53 text

Deep Adaptation Networks (DAN) • Multi-kernel MMD (MK-MMD) • In comparison to DDC: – MMD is computed over multiple layers – Nonlinear distance with multiple kernels • Experimental results on the Office Dataset [Long+, ICML 2015]

Slide 54

Slide 54 text

Domain Adversarial Neural Networks (DANN) • The original name did not include "adversarial" – The name "Domain Adversarial Neural Networks" appears in the journal version [Ganin+, JMLR 2016] – possibly to avoid confusion with Deep Adaptation Networks (DAN) • Similar motivation to GANs: adversarial learning to generate (extract) domain-invariant feature vectors – GAN: generated data vs. real data – DANN: feature vectors on the source domain vs. feature vectors on the target domain [Ganin+Lempitsky, ICML 2015]

Slide 55

Slide 55 text

Network Architecture of DANN • The feature extractor tries to extract domain-invariant features • The label predictor classifies source data • The domain classifier aims to distinguish the two domains

Slide 56

Slide 56 text

Adversarial Learning • Domain classification loss – the domain classifier attempts to minimize it – the feature extractor attempts to maximize it • Problem: with respect to the gradients of this loss – gradient descent is required for the domain classifier – gradient ascent is required for the feature extractor How can the directions of the two updates be reversed?

Slide 57

Slide 57 text

Gradient Reversal Layer (GRL) A "function" that • does nothing during the forward pass • reverses the sign of the gradient during backpropagation is introduced as the GRL → simultaneous gradient descent + ascent
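A minimal PyTorch sketch of such a gradient reversal layer (an illustrative implementation, not the authors' code): the forward pass is the identity and the backward pass multiplies incoming gradients by -λ.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed gradient; no grad for lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# The domain classifier takes grad_reverse(features): it still does gradient
# descent on the domain loss, while the feature extractor behind the layer
# effectively does gradient ascent on the same loss.
x = torch.randn(4, 10, requires_grad=True)
grad_reverse(x).sum().backward()
print(x.grad[0, 0].item())   # -1.0: the sign of the gradient is reversed
```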

Slide 58

Slide 58 text

Experimental Results • Office Dataset • Digit Dataset (Feature distributions before and after adaptation: SYN NUMBERS (red) → SVHN (blue))

Slide 59

Slide 59 text

Adversarial Discriminative Domain Adaptation (ADDA) Adversarial learning similar to DANN [Tzeng+, CVPR 2017]

Slide 60

Slide 60 text

Disadvantages of DANN • DANN uses a single feature extractor for both domains ✓ The number of parameters can be reduced × It might be impossible to extract features of different domains using the same extractor • Gradient Reversal Layer ✓ It is faithful to the objective function of GANs × Gradients from the discriminator may vanish early in training

Slide 61

Slide 61 text

ADDA • Features are extracted by a separate CNN for each domain; the CNN for the source domain is pre-trained • The inverted-label loss common in GANs is used instead of gradient reversal (target feature extractor vs. domain discriminator)
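A sketch of the inverted-label losses (binary cross-entropy on discriminator logits; the names are illustrative): the discriminator separates the two domains, while the target feature extractor is trained so that its features get labeled as "source".

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_src_logits, d_tgt_logits):
    """Train D to output 1 for source features and 0 for target features."""
    return (F.binary_cross_entropy_with_logits(d_src_logits, torch.ones_like(d_src_logits))
            + F.binary_cross_entropy_with_logits(d_tgt_logits, torch.zeros_like(d_tgt_logits)))

def target_encoder_loss(d_tgt_logits):
    """Inverted label: train the target feature extractor so that D says 'source' (1),
    instead of reversing gradients as in DANN."""
    return F.binary_cross_entropy_with_logits(d_tgt_logits, torch.ones_like(d_tgt_logits))
```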

Slide 62

Slide 62 text

Experimental Results State-of-the-art on Office and digit datasets

Slide 63

Slide 63 text

Maximum Classifier Discrepancy (MCD) [Saito+, CVPR 2018]

Slide 64

Slide 64 text

Maximum Classifier Discrepancy (MCD) So far, we have tried to match domains, but • even if the distributions of the two domains overlap, the distributions of each class may not agree. [Saito+, CVPR 2018]

Slide 65

Slide 65 text

Maximum Classifier Discrepancy (MCD) So far, we have tried to match domains, but • even if the distributions of the two domains overlap, the distributions of each class may not agree. • We should match classifiers instead of domains. [Saito+, CVPR 2018]

Slide 66

Slide 66 text

Maximum Classifier Discrepancy (MCD) 1. Prepare two classifiers. Classifiers trained on the source domain • avoid the dotted-line area • may cross the solid-line area The diagonal area between them (the discrepancy region) should be eliminated.

Slide 67

Slide 67 text

Maximum Classifier Discrepancy (MCD) 2. Maximize the discrepancy between the two classifiers (only the classifiers are updated)

Slide 68

Slide 68 text

Maximum Classifier Discrepancy (MCD) 3. Train the feature extractor (generator) to reduce the discrepancy (only the generator is updated)

Slide 69

Slide 69 text

Maximum Classifier Discrepancy (MCD) Repeat steps 2 and 3 until convergence.
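A sketch of the discrepancy used in steps 2 and 3: MCD measures it as the L1 distance between the two classifiers' softmax outputs on target samples. The two alternating training steps are indicated in comments (G: generator, F1/F2: classifiers; names follow the slides).

```python
import torch
import torch.nn.functional as F

def discrepancy(logits1: torch.Tensor, logits2: torch.Tensor) -> torch.Tensor:
    """L1 distance between the softmax outputs of the two classifiers."""
    return (F.softmax(logits1, dim=1) - F.softmax(logits2, dim=1)).abs().mean()

# Step 2 (classifiers only): keep source accuracy and MAXIMIZE the discrepancy
#   loss = ce(F1(G(x_s)), y_s) + ce(F2(G(x_s)), y_s) - discrepancy(F1(G(x_t)), F2(G(x_t)))
# Step 3 (generator only): MINIMIZE the discrepancy on target samples
#   loss = discrepancy(F1(G(x_t)), F2(G(x_t)))

p1, p2 = torch.randn(16, 10), torch.randn(16, 10)
print(discrepancy(p1, p2).item())
```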

Slide 70

Slide 70 text

Experimental Results State-of-the-art on digit dataset

Slide 71

Slide 71 text

Experimental Results Semantic Segmentation using synthesized data and real data

Slide 72

Slide 72 text

Adversarial Dropout Regularization (ADR) [Saito+, ICLR 2018]

Slide 73

Slide 73 text

Adversarial Dropout Regularization (ADR) So far, we have tried to match domains, but • even if the distributions of the two domains overlap, the distributions of each class may not agree. • We should match classifiers instead of domains. [Saito+, ICLR 2018] ... which we just heard a moment ago.

Slide 74

Slide 74 text

ADR ≃ MCD by dropout: the two classifiers are • trained explicitly in MCD • generated by dropout in ADR (proposed)
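An illustrative sketch of this idea (an assumed form, reusing MCD's L1 discrepancy for simplicity): two forward passes of the same classifier with dropout active act as two different classifiers, and the discrepancy of their outputs is the signal that is maximized/minimized as in MCD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                           nn.Dropout(p=0.5), nn.Linear(256, 10))

def dropout_discrepancy(features: torch.Tensor) -> torch.Tensor:
    """Two dropout masks on the same classifier play the role of F1 and F2."""
    classifier.train()   # keep dropout active so the two passes differ
    p1 = F.softmax(classifier(features), dim=1)
    p2 = F.softmax(classifier(features), dim=1)
    return (p1 - p2).abs().mean()

print(dropout_discrepancy(torch.randn(16, 128)).item())
```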

Slide 75

Slide 75 text

Training is similar to MCD

Slide 76

Slide 76 text

Experimental Results State-of-the-art on digit datasets

Slide 77

Slide 77 text

Experimental Results Semantic Segmentation using synthesized data and real data

Slide 78

Slide 78 text

Open Set Domain Adaptation (OSDA) [Saito+, ECCV 2018]

Slide 79

Slide 79 text

Closed vs. Open Set Domain Adaptation • In (closed-set) domain adaptation, source and target completely share the same classes. • Target samples are unlabeled. • Open set: the target contains unknown classes. cf. the reversed setting = Partial Domain Adaptation [Cao+, ECCV 2018] OSDA by Backpropagation [Saito+, ECCV 2018]

Slide 80

Slide 80 text

OSDA by Backpropagation [Saito+, ECCV 2018]

Slide 81

Slide 81 text

Domain Adaptation for Object Detection [Saito+, CVPR 2019]

Slide 82

Slide 82 text

Strong-Weak Distribution Alignment [Saito+, CVPR 2019]

Slide 83

Slide 83 text

Computer Vision and Natural Language Processing

Slide 84

Slide 84 text

2014: Another impact of Deep Learning • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient vanishing problem in RNNs → dealing with relations between distant words in a sentence – A four-layer LSTM is trained in an end-to-end manner → comparable to the state of the art (English to French) • Emergence of common technologies such as CNNs/RNNs → lower barriers to entry into CV+NLP

Slide 85

Slide 85 text

Growth of user-generated content Especially on content posting/sharing services • Facebook: 300 million photos per day • YouTube: 400 hours of video per minute Example post: "Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees look spectacular." Pairs of a sentence + a video / photo → collectable in large quantities

Slide 86

Slide 86 text

Exploratory research on Vision and Language Captioning an image associated with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for the image • Dataset: sets of article + image + caption × 3361 Example caption: "King Toupu IV died at the age of 88 last week."

Slide 87

Slide 87 text

Exploratory research on Vision and Language Captioning an image associated with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for the image • Dataset: sets of article + image + caption × 3361 Example caption: "King Toupu IV died at the age of 88 last week." Against this background, various research topics have emerged, such as …

Slide 88

Slide 88 text

Image Captioning [Ushiku+, ACM Multimedia 2012]

Slide 89

Slide 89 text

Image Captioning [Ushiku+, ACM Multimedia 2012]

Slide 90

Slide 90 text

Google NIC Concatenation of Google's methods • GoogLeNet [Szegedy+, CVPR 2015] • MT with LSTM [Sutskever+, NIPS 2014]
Caption (word seq.) S_0 … S_N for image I
S_0: beginning of the sentence
h_1 = LSTM(CNN(I))
h_t = LSTM(S_{t-1}, h_{t-1}), t = 2 … N-1
S_N: end of the sentence
[Vinyals+, CVPR 2015]
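A minimal sketch of this CNN→LSTM captioner (hypothetical sizes; the CNN is replaced by a projection of a precomputed feature): the image feature is fed at the first step, then each previous word's embedding, and the next word is predicted at every step (teacher forcing).

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=512, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.img_proj = nn.Linear(feat_dim, hid)         # stands in for CNN(I)
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, img_feat, captions):
        img = self.img_proj(img_feat).unsqueeze(1)        # image fed at the first step
        words = self.embed(captions[:, :-1])              # S_0 ... S_{N-1}
        h, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(h[:, 1:])                         # predicts S_1 ... S_N

logits = TinyCaptioner()(torch.randn(2, 512), torch.randint(0, 1000, (2, 12)))
print(logits.shape)   # torch.Size([2, 11, 1000])
```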

Slide 91

Slide 91 text

Video Captioning A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]

Slide 92

Slide 92 text

Multilingual Captioning Transfer learning among languages [Miyazaki+Shimizu, ACL 2016] • The vision-language grounding (weight matrix W_im) is transferred • Efficient learning using a small amount of captions Example partial captions: "an elephant is …" (English), "一匹の象が土の…" (Japanese: "an elephant … the dirt …")

Slide 93

Slide 93 text

Image Caption Translation "Ein Masten mit zwei Ampeln für Autofahrer." (German) "A pole with two lights for drivers." (English) [Hitschler+, ACL 2016]

Slide 94

Slide 94 text

Visual Question Answering [Fukui+, EMNLP 2016]

Slide 95

Slide 95 text

VQA = Multiclass Classification The integrated feature (image feature + question feature) is fed to an ordinary classifier. Example Question: What objects are found on the bed? Answer: bed sheets, pillow
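A minimal sketch of this formulation (hypothetical dimensions; the feature integration here is a simple element-wise product rather than the compact bilinear pooling of [Fukui+, EMNLP 2016]): the fused image+question feature is classified over a fixed set of answer candidates.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, hid=512, n_answers=3000):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hid)
        self.q_fc = nn.Linear(q_dim, hid)
        self.classifier = nn.Linear(hid, n_answers)      # answers as classes

    def forward(self, img_feat, q_feat):
        fused = torch.tanh(self.img_fc(img_feat)) * torch.tanh(self.q_fc(q_feat))
        return self.classifier(fused)                    # logits over answer candidates

logits = SimpleVQA()(torch.randn(4, 2048), torch.randn(4, 512))
print(logits.shape)   # torch.Size([4, 3000])
```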

Slide 96

Slide 96 text

Image Generation from Captions This bird is blue with white and has a very short beak. This flower is white and yellow in color, with petals that are wavy and smooth. [Zhang+, 2016]

Slide 97

Slide 97 text

Towards more realistic image generation StackGAN [Zhang+, 2016] Two-step GANs • The first GAN generates a small and fuzzy image • The second GAN enlarges and refines it

Slide 98

Slide 98 text

Visual Dialog (VisDial) Continual visual question answering between a Questioner and an Answerer Image caption: "A couple of people in the snow on skis." [Das+, CVPR 2017]

Slide 99

Slide 99 text

Visual Dialog (VisDial) Questioner Answerer A couple of people in the snow on skis. What are their genders? Are they both adults? Do they wear goggles? Do they have hats on? Are there any other people? What color is man’s hat? Is it snowing now? What is woman wearing? Are they smiling? Do you see trees? 1 man 1 woman Yes Looks like sunglasses Man does No Black No Blue jacket and black pants Yes Yes [Das+, CVPR 2017]

Slide 100

Slide 100 text

Vision-and-Language Navigation (VLN) [Anderson+, ICCV 2017]

Slide 101

Slide 101 text

Summary • Computer Vision: Short History • Computer Vision from the Point of View of Machine Learning • Introduction to Vision and Language • Contributions of Deep Learning – Most research themes existed before Deep Learning – Commodity technologies for processing images, videos, and natural language – Evolution of recognition and generation Towards a new stage of vision and language!

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

Supplementary Material: Details about Visual Captioning

Slide 104

Slide 104 text

Every picture tells a story Dataset: images + ⟨object, action, scene⟩ triplets + captions 1. Predict the triplet for an input image using an MRF 2. Search for an existing caption associated with a similar triplet [Farhadi+, ECCV 2010]

Slide 105

Slide 105 text

Every picture tells a story See something unexpected. A man stands next to a train on a cloudy day. [Farhadi+, ECCV 2010]

Slide 106

Slide 106 text

Retrieve? Generate? • Retrieve • Generate – Template-based e.g. generating a Subject+Verb sentence – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset

Slide 107

Slide 107 text

Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based e.g. generating a Subject+Verb sentence – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset

Slide 108

Slide 108 text

Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based dog+stand ⇒ A dog stands. – Template-free A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset

Slide 109

Slide 109 text

Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based dog+stand ⇒ A dog stands. – Template-free A small white dog standing on a leash. A small gray dog on a leash. A black dog standing in grassy area. A small white dog wearing a flannel warmer. Input Dataset

Slide 110

Slide 110 text

Captioning with multi-keyphrases [Ushiku+, ACM MM 2012]

Slide 111

Slide 111 text

End of sentence [Ushiku+, ACM MM 2012]

Slide 112

Slide 112 text

Benefits of Deep Learning • Refinement of image recognition [Krizhevsky+, NIPS 2012] • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient vanishing problem in RNNs → dealing with relations between distant words in a sentence – A four-layer LSTM is trained in an end-to-end manner → comparable to the state of the art (English to French) Emergence of common technologies such as CNNs/RNNs → lower barriers to entry into CV+NLP

Slide 113

Slide 113 text

Google NIC Concatenation of Google's methods • GoogLeNet [Szegedy+, CVPR 2015] • MT with LSTM [Sutskever+, NIPS 2014]
Caption (word seq.) S_0 … S_N for image I
S_0: beginning of the sentence
h_1 = LSTM(CNN(I))
h_t = LSTM(S_{t-1}, h_{t-1}), t = 2 … N-1
S_N: end of the sentence
[Vinyals+, CVPR 2015]

Slide 114

Slide 114 text

Examples of generated captions [https://github.com/tensorflow/models/tree/master/im2txt] [Vinyals+, CVPR 2015]

Slide 115

Slide 115 text

Comparison to [Ushiku+, ACM MM 2012] [Ushiku+, ACM MM 2012]: conventional object recognition (Fisher Vector + linear classifier) to estimate important words + conventional machine translation (log-linear model + beam search) to connect the words with a grammar model Neural image captioning: conventional object recognition (convolutional neural network) + conventional machine translation (recurrent neural network + beam search) • Both are trained using only images and captions • The approaches are similar to each other

Slide 116

Slide 116 text

Current development: Accuracy • Attention-based captioning [Xu+, ICML 2015] – Focus on some areas for predicting each word! – Both attention and caption models are trained using pairs of an image & caption

Slide 117

Slide 117 text

Current development: Problem setting Dense captioning [Lin+, BMVC 2015] [Johnson+, CVPR 2016]

Slide 118

Slide 118 text

Current development: Problem setting Generating captions for a photo sequence [Park+Kim, NIPS 2015][Huang+, NAACL 2016] The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.

Slide 119

Slide 119 text

Current development: Problem setting Captioning using sentiment terms [Mathews+, AAAI 2016][Shin+, BMVC 2016] Neutral caption Positive caption

Slide 120

Slide 120 text

Before Deep Learning • Grounding of language and objects in videos [Yu+Siskind, ACL 2013] – Learning from only videos and their captions – Experiments in a controlled setting with a small number of objects – Controlled and small dataset • Deep Learning should suit this problem – Image captioning: single image → word sequence – Video captioning: image sequence → word sequence

Slide 121

Slide 121 text

End-to-end learning by Deep Learning • LRCN [Donahue+, CVPR 2015] – CNN+RNN for • Action recognition • Image / Video Captioning • Video to Text [Venugopalan+, ICCV 2015] – CNNs to recognize • Objects from RGB frames • Actions from flow images – RNN for captioning

Slide 122

Slide 122 text

Video Captioning A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]

Slide 123

Slide 123 text

Video Captioning A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water. [Shin+, ICIP 2016]

Slide 124

Slide 124 text

Video Retrieval from Caption • Input: Captions • Output: A video related to the caption 10 sec video clip from 40 min database! • Video captioning is also addressed A woman in blue is playing ping pong in a room. A guy is skiing with no shirt on and yellow snow pants. A man is water skiing while attached to a long rope. [Yamaguchi+, ICCV 2017]