Large gap of error rates on ImageNet – 1st team (AlexNet): 15.3%; 2nd team: 26.2%. Many slides refer to the first use of a CNN (AlexNet) on ImageNet.
Self-introduction (The University of Tokyo) – 2014.4~2016.3 Research Scientist, NTT CS Lab. – 2016.4~ Lecturer, The University of Tokyo – 2018.4~ Principal Investigator, OMRON SINIC X Corp. – 2019.1~ Chief Research Officer, Ridge-i Co., Ltd. Research topics: Image Captioning [Ushiku+, ACMMM 2012][Ushiku+, ICCV 2015], Image Captioning with Sentiment Terms [Shin+, BMVC 2016], Cross-modal Retrieval with Videos and Texts [Yamaguchi+, ICCV 2017]. Example outputs: "A guy is skiing with no shirt on and yellow snow pants." / "A zebra standing in a field with a tree in the dirty background." / "A yellow train on the tracks near a train station."
… and 3D rendering • Computer Vision from the Point of View of Machine Learning – Domain adaptation • Computer Vision and Natural Language Processing – Vision & Language
Elementwise products are taken over the area at (i, j) of the left image – their sum is stored at the corresponding position of the right image • Multiple 2D filters → a 3D array of feature maps [Dumoulin+Visin, 2016] (a minimal sketch follows below)
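A minimal NumPy sketch of the 2D convolution described above; function and variable names are illustrative, not from the slides.

import numpy as np

def conv2d_single(image, kernel):
    # image: (H, W), kernel: (kH, kW); "valid" convolution with stride 1
    kH, kW = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product over the area at (i, j), then sum,
            # stored at the corresponding position of the output
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def conv2d_multi(image, kernels):
    # multiple 2D filters -> 3D array (one output channel per filter)
    return np.stack([conv2d_single(image, k) for k in kernels], axis=0)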
• VGG – CNN by Oxford's Visual Geometry Group – "Depth" is highlighted • Inception [Szegedy+, CVPR 2015] – CNN by Google – The Inception Block is applied repeatedly – After reducing the number of channels by 1x1 convolution, 3x3 or 5x5 convolution is applied → high expressiveness with fewer parameters (a sketch of the block follows below)
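A PyTorch sketch of an Inception-style block, assuming illustrative channel sizes (not the exact GoogLeNet configuration): 1x1 convolutions reduce channels before the larger convolutions, and all branch outputs are concatenated.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=1),   # 1x1 reduces channels first
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # all branch outputs are concatenated along the channel axis
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)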
region proposals are extracted from an image – a CNN is run over each region • Faster RCNN [Ren+, NIPS 2015] – RCNN requires multiple CNN computations for a single image – Faster RCNN applies the CNN only once to the whole image and estimates candidate regions at the same time → high speed and precision (a usage sketch follows below)
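A usage sketch with torchvision's Faster R-CNN implementation. The `pretrained` flag is the older torchvision API; recent versions take `weights="DEFAULT"` instead.

import torch
import torchvision

# pre-trained Faster R-CNN; detection comes from a single forward pass
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    # the CNN runs once over the whole image; region proposals and
    # per-region classification come out of the same pass
    predictions = model([image])
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])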
skip connection – the finer parts of each region can be segmented precisely • DeepLab v3 [Chen+, ECCV 2018] – Feature extraction at multiple resolutions – Skip connections (a minimal sketch follows below)
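A minimal PyTorch sketch of a skip connection in an encoder-decoder segmentation network (U-Net style, not the DeepLab architecture itself; layer sizes are illustrative).

import torch
import torch.nn as nn

class SkipSegNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        # after concatenation with the skipped feature: 32 + 32 channels
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        skip = self.enc(x)      # fine, high-resolution feature
        deep = self.down(skip)  # coarse, semantic feature
        up = self.up(deep)
        # skip connection: reuse the fine feature so boundaries stay sharp
        return self.head(torch.cat([up, skip], dim=1))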
Rendering is made differentiable for neural networks: a 2D image goes through 3D model inference to a 3D model, which is rendered into an estimated 2D image (silhouette) and compared with the reference silhouette. The 3D model estimator is originally differentiable; the rendering engine is made differentiable, so both are updated with backpropagation on the error between estimation and reference (a training-loop sketch follows below).
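A sketch of one training step under the pipeline above. `estimator` and `diff_render` are hypothetical stand-ins: the first maps a 2D image to 3D model parameters, the second is a differentiable silhouette renderer.

import torch

def train_step(estimator, diff_render, optimizer, image, ref_silhouette):
    mesh = estimator(image)             # 2D image -> 3D model
    est_silhouette = diff_render(mesh)  # 3D model -> estimated 2D silhouette
    loss = torch.nn.functional.mse_loss(est_silhouette, ref_silhouette)
    optimizer.zero_grad()
    loss.backward()  # gradients flow through the renderer into the estimator
    optimizer.step()
    return loss.item()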
• Source Domain: There are data associated with ground truth, but we do not want to recognize them as an application • Target Domain: We want to recognize them, but there are no data associated with ground truth • Semi-supervised Domain Adaptation: There are some target samples with ground truth (e.g., source: video-game images, target: real-world images)
1st round: training on the source (MNIST) → pseudo labels for easy target samples (an "eight", a "nine") 2nd round onward: training on MNIST+α (source + pseudo-labeled targets) → add more pseudo labels. Asymmetric Tri-training for Domain Adaptation [Saito+, ICML 2017]
Proposed Architecture – Input X feeds a shared network F, whose features go to three branches: – F1, F2: labeling networks – Ft: target-specific network – F: shared network. Notation: S: source samples; Tl: pseudo-labeled target samples; y: label for a source sample; ŷ: pseudo-label for a target sample.
Proposed Architecture – F is updated using gradients from F1, F2, and Ft.
1. Initial training: all networks are trained using only the source samples S.
2. Labeling target samples: when F1 and F2 agree on their predictions and either of their probabilities is larger than a threshold value, the corresponding label is given to the target sample (T: target samples). A sketch of this rule follows below.
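A PyTorch sketch of the labeling rule, assuming F, F1, F2 are the networks named on the slides; the threshold value and batch handling are illustrative.

import torch

def pseudo_label(F, F1, F2, target_x, threshold=0.9):
    # assign a pseudo-label only when F1 and F2 agree on the prediction
    # and either of their probabilities exceeds the threshold
    with torch.no_grad():
        feat = F(target_x)
        p1 = torch.softmax(F1(feat), dim=1)
        p2 = torch.softmax(F2(feat), dim=1)
    y1, y2 = p1.argmax(1), p2.argmax(1)
    agree = y1 == y2
    confident = (p1.max(1).values > threshold) | (p2.max(1).values > threshold)
    mask = agree & confident
    # the selected samples with labels y1 form the pseudo-labeled set Tl
    return target_x[mask], y1[mask]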
3. Retraining the networks using pseudo-labeled target samples: F1 and F2 learn from S + Tl; Ft learns from the pseudo-labeled samples Tl; F learns from all gradients (p1, p2, pt). (S: source samples; Tl: pseudo-labeled target samples; y: label for a source sample; ŷ: pseudo-label for a target sample.)
3. Retraining (continued): repeat the 2nd step and the 3rd step until convergence!
Adaptation scenarios between digit datasets – MNIST, MNIST-M, SVHN, SYN DIGITS (synthesized digits) • One adaptation scenario between traffic-sign datasets – GTSRB (real traffic signs), SYN SIGNS (synthesized signs). (Figure: sample images from GTSRB, SYN SIGNS, SYN DIGITS, SVHN, MNIST, and MNIST-M.)
Minimizing domain shifts – backgrounds – postures – lighting conditions – … Features that are discriminative in terms of the objects yet indistinguishable in terms of the domains are desirable to realize domain adaptation.
• Domain loss: Maximum Mean Discrepancy (MMD), measuring the gap between the source and target domains: MMD = || μ_s − μ_t ||, where μ_s is the averaged feature on the source domain and μ_t is the averaged feature on the target domain • Objective function: linear combination of MMD and the classification loss (a sketch follows below)
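A sketch of the simplest (linear-kernel) MMD between batches of source and target features, combined with a classification loss; the trade-off weight `lam` is illustrative.

import torch

def linear_mmd(feat_src, feat_tgt):
    # squared distance between the averaged feature on the source domain
    # and the averaged feature on the target domain
    return (feat_src.mean(dim=0) - feat_tgt.mean(dim=0)).pow(2).sum()

# objective: linear combination of classification loss and MMD, e.g.
#   loss = ce_loss(logits_src, y_src) + lam * linear_mmd(f_src, f_tgt)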
The title of the original paper does not include "adversarial" – The name "Domain-Adversarial Neural Networks" appears in the journal version [Ganin+, JMLR 2016] – It may be confused with Deep Adaptation Networks (DAN) • Similar motivation to GANs: adversarial learning to generate (extract) domain-invariant feature vectors – GAN: generated data vs. real data – DANN: feature vectors on the source domain vs. feature vectors on the target domain [Ganin+Lempitsky, ICML 2015]
for both domains ✓ The number of parameters can be reduced × It might be impossible to extract features of different domains using the same extractor • Gradient Reversal Layer ✓ It is faithful to the objective function of GANs × Gradients from the discriminator may vanish early in training (a sketch of the layer follows below)
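A PyTorch sketch of the Gradient Reversal Layer: identity in the forward pass, gradients multiplied by −λ in the backward pass, so the feature extractor is trained to fool the domain discriminator.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)  # identity forward

    @staticmethod
    def backward(ctx, grad_output):
        # reverse (and scale) the gradient flowing into the feature extractor
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: domain_logits = discriminator(grad_reverse(features))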
in each domain – the CNN for the source domain is pre-trained • Use the inverted-label losses common in GANs instead of Gradient Reversal (with a target feature extractor and a domain discriminator)
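A sketch of the inverted-label losses mentioned above, assuming a domain discriminator D that outputs a logit (1 = source, 0 = target); function names are illustrative.

import torch
import torch.nn.functional as F_nn

def discriminator_loss(D, f_src, f_tgt):
    # D is trained to call source features 1 and target features 0
    logits_src, logits_tgt = D(f_src), D(f_tgt.detach())
    return (F_nn.binary_cross_entropy_with_logits(logits_src, torch.ones_like(logits_src))
            + F_nn.binary_cross_entropy_with_logits(logits_tgt, torch.zeros_like(logits_tgt)))

def target_extractor_loss(D, f_tgt):
    # inverted label: the target extractor is trained so that D calls its
    # features "source" (label 1), instead of reversing gradients
    logits_tgt = D(f_tgt)
    return F_nn.binary_cross_entropy_with_logits(logits_tgt, torch.ones_like(logits_tgt))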
Previous methods align the distributions of the two domains, but • even if the distributions of the two domains overlap, the distribution of each class may not agree • We should match classifiers instead of domains [Saito+, CVPR 2018]
(Figure: samples of each class and a classifier trained on the source domain – it avoids the dotted-line area but may cross the solid-line area; this diagonal area, the Discrepancy Region, should be eliminated. A sketch of the discrepancy measure follows below.)
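A sketch of the classifier discrepancy used in Maximum Classifier Discrepancy [Saito+, CVPR 2018]: the mean L1 distance between two classifiers' softmax outputs on target samples; the two classifiers maximize it while the feature extractor minimizes it.

import torch

def discrepancy(logits1, logits2):
    # p1, p2: class-probability outputs of the two classifiers
    p1 = torch.softmax(logits1, dim=1)
    p2 = torch.softmax(logits2, dim=1)
    # disagreement on target samples; shrinking it eliminates the
    # discrepancy region between the two decision boundaries
    return (p1 - p2).abs().mean()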
Previous methods align the distributions of the two domains, but • even if the distributions of the two domains overlap, the distribution of each class may not agree • We should match classifiers instead of domains [Saito+, ICLR 2018] … oh, I think I heard it a while ago.
Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient-vanishing problem in RNNs → dealing with relations between distant words in a sentence – A four-layer LSTM is trained in an end-to-end manner → comparable to the state of the art (English to French) • Emergence of common techniques such as CNN/RNN → reduction of barriers to get into CV+NLP (a minimal encoder-decoder sketch follows below)
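A minimal PyTorch encoder-decoder sketch in the spirit of [Sutskever+, NIPS 2014], far shallower than the four-layer model; vocabulary sizes and dimensions are illustrative.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # encode the source sentence into the LSTM state
        _, state = self.encoder(self.src_emb(src_ids))
        # decode the target sentence conditioned on that state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # next-word logits at each step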
• Facebook: 300 million photos per day • YouTube: 400 hours of video per minute. Example post: "Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees looks spectacular." Pairs of a sentence + a video/photo → collectable in large quantities
Captioning an image with its article [Feng+Lapata, ACL 2010] • Input: article + image; Output: caption for the image • Dataset: sets of article + image + caption × 3,361. Example caption: "King Tupou IV died at the age of 88 last week."
As a result of this background: various research topics such as …
the snow on skis. Visual Dialog example (question → answer): – What are their genders? → 1 man, 1 woman – Are they both adults? → Yes – Do they wear goggles? → Looks like sunglasses – Do they have hats on? → Man does – Are there any other people? → No – What color is man's hat? → Black – Is it snowing now? → No – What is woman wearing? → Blue jacket and black pants – Are they smiling? → Yes – Do you see trees? → Yes [Das+, CVPR 2017]
Machine Learning • Introduction of Vision and Language • Contributions of Deep Learning – Most research themes existed before deep learning – Commodity techniques for processing images, videos, and natural language – Evolution of recognition and generation. Towards a new stage of vision and language!
<object, action, scene> + Captions 1. Predict <object, action, scene> for an input image using an MRF 2. Search for the existing caption associated with the most similar <object, action, scene>, e.g., <Horse, Ride, Field> [Farhadi+, ECCV 2010]
"A small gray dog on a leash." • Generate – Template-based: dog+stand ⇒ "A dog stands." – Template-free: "A small white dog standing on a leash." Input: an image; Dataset captions: "A small gray dog on a leash." / "A black dog standing in grassy area." / "A small white dog wearing a flannel warmer."
[Krizhevsky+, NIPS 2012] • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient-vanishing problem in RNNs → dealing with relations between distant words in a sentence – A four-layer LSTM is trained in an end-to-end manner → comparable to the state of the art (English to French). Emergence of common techniques such as CNN/RNN → reduction of barriers to get into CV+NLP
[Ushiku+, ACM MM 2012] – Conventional object recognition: Fisher Vector + linear classifier – Conventional machine translation: log-linear model + beam search – Estimation of important words, then connecting the words with a grammar model. Neural image captioning – Conventional object recognition → Convolutional Neural Network – Conventional machine translation → Recurrent Neural Network + beam search • Both are trained using only images and captions • The approaches are similar to each other (a sketch of the neural recipe follows below)
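A PyTorch sketch of the CNN+RNN captioning recipe above: a CNN encodes the image, an LSTM decodes the word sequence (beam search would be used at test time). The backbone choice and dimensions are illustrative; the `pretrained` flag is the older torchvision API.

import torch
import torch.nn as nn
import torchvision

class NeuralCaptioner(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        cnn = torchvision.models.resnet18(pretrained=True)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(512, dim)
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, images, captions):
        # the image feature starts the sequence, words follow
        feat = self.img_proj(self.cnn(images).flatten(1)).unsqueeze(1)
        x = torch.cat([feat, self.emb(captions)], dim=1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-step word logits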
Visual storytelling [Park+Kim, NIPS 2015][Huang+, NAACL 2016]: generating a story for a photo stream, e.g., "The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water."
videos [Yu+Siskind, ACL 2013] – Learning from only videos and their captions – Experiments on a controlled, small dataset with few objects • Deep learning should suit this problem – Image captioning: single image → word sequence – Video captioning: image sequence → word sequence
– CNN+RNN for • action recognition • image/video captioning • Video to Text [Venugopalan+, ICCV 2015] – CNNs to recognize • objects from RGB frames • actions from optical-flow images – RNN for captioning (a sketch follows below)
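A sketch of the image-sequence → word-sequence setup: per-frame CNN features are encoded by one LSTM and decoded into words by another, in the spirit of sequence-to-sequence video captioning (not the exact [Venugopalan+, ICCV 2015] model; dimensions are illustrative).

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, dim=512):
        super().__init__()
        self.frame_enc = nn.LSTM(feat_dim, dim, batch_first=True)
        self.emb = nn.Embedding(vocab_size, dim)
        self.dec = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) per-frame CNN features
        _, state = self.frame_enc(frame_feats)      # encode the frame sequence
        h, _ = self.dec(self.emb(captions), state)  # decode words from the video state
        return self.out(h)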
Retrieving the 10-sec video clip related to a caption from a 40-min database! • Video captioning is also addressed. Example captions: "A woman in blue is playing ping pong in a room." / "A guy is skiing with no shirt on and yellow snow pants." / "A man is water skiing while attached to a long rope." [Yamaguchi+, ICCV 2017]