ACML 2019 Tutorial 1: "Deep Learning for Natural Language Processing and Computer Vision"

http://www.acml-conf.org/2019/tutorials/ushiku-tsuruoka/
This is the latter part of the first tutorial at ACML 2019.

Yoshitaka Ushiku

November 10, 2019

Transcript

  1. ACML 2019 Tutorial 1:
    Deep Learning for Natural Language Processing and Computer Vision
    Computer Vision + beyond
    OMRON SINIC X / Ridge-i
    Yoshitaka Ushiku
    losnuevetoros

  2. 2011
    2012
    2014

  3. 2011
    2012
    2014
    Speech recognition error
    30% → less than 20%
    [Seide+, InterSpeech 2011]

  4. 2011
    2012
    2014
    Speech recognition error
    30% → less than 20%
    [Seide+, InterSpeech 2011]
    Image classification error
    25% →15%
    [Krizhevsky+, NIPS 2012]

  5. 2011
    2012
    2014
    Speech recognition error
    30% → less than 20%
    [Seide+, InterSpeech 2011]
    Image classification error
    25% →15%
    [Krizhevsky+, NIPS 2012]
    Machine translation system
    Complicated → simple
    [Sutskever+, NIPS 2014]

  6. 2012: Impact of Deep Learning
    Academic AI startup A famous company
    Many slides refer to the first use of CNN (AlexNet) on ImageNet

  7. 2012: Impact of Deep Learning
    Academic AI startup A famous company
    Many slides refer to the first use of CNN (AlexNet) on ImageNet

  8. 2012: Impact of Deep Learning
    Academic AI startup A famous company
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Many slides refer to the first use of CNN (AlexNet) on ImageNet

  9. 2012: Impact of Deep Learning
    Academic AI startup A famous company
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Many slides refer to the first use of CNN (AlexNet) on ImageNet

  10. 2012: Impact of Deep Learning
    Academic AI startup A famous company
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Large gap of error rates
    on ImageNet
    1st team: 15.3%
    2nd team: 26.2%
    Many slides refer to the first use of CNN (AlexNet) on ImageNet

  11. 2012: Impact of Deep Learning
    According to the official site…
    1st team w/ DL
    Error rate: 15%
    [http://image-net.org/challenges/LSVRC/2012/results.html]

  12. 2012: Impact of Deep Learning
    According to the official site…
    1st team w/ DL
    Error rate: 15%
    2nd team w/o DL
    Error rate: 26%
    [http://image-net.org/challenges/LSVRC/2012/results.html]

  13. 2012: Impact of Deep Learning
    According to the official site…
    1st team w/ DL
    Error rate: 15%
    2nd team w/o DL
    Error rate: 26%
    [http://image-net.org/challenges/LSVRC/2012/results.html]

  14. 2012: Impact of Deep Learning
    According to the official site…
    1st team w/ DL
    Error rate: 15%
    2nd team w/o DL
    Error rate: 26%
    [http://image-net.org/challenges/LSVRC/2012/results.html]
    It’s me!!

  15. Yoshitaka Ushiku Ph.D.
    2013.5~2013.8 Research Intern, Microsoft Research
    2014.4 Ph.D. (The University of Tokyo)
    2014.4~2016.3 Research Scientist, NTT CS Lab.
    2016.4~ Lecturer, The University of Tokyo
    2018.4~ Principal Investigator, OMRON SINIC X Corp.
    2019.1~ Chief Research Officer, Ridge-i Co., Ltd.
    Image Captioning [Ushiku+, ACMMM 2012] [Ushiku+, ICCV 2015]
    “A zebra standing in a field with a tree in the dirty background.”
    “A yellow train on the tracks near a train station.”
    Image Captioning with Sentiment Terms [Shin+, BMVC 2016]
    Cross-modal Retrieval with Videos and Texts [Yamaguchi+, ICCV 2017]
    “A guy is skiing with no shirt on and yellow snow pants.”

  16. Today’s tutorial
    • Computer Vision: Short History
    – Detection, segmentation, and 3D rendering
    • Computer Vision from the Point of View
    of Machine Learning
    – Domain adaptation
    • Computer Vision and Natural Language
    Processing
    – Vision & Language

  17. Computer Vision: Short History

  18. 2011
    2012
    2014
    Speech recognition error
    30% → less than 20%
    [Seide+, InterSpeech 2011]
    Image classification error
    25% →15%
    [Krizhevsky+, NIPS 2012]
    Machine translation system
    Complicated → simple
    [Sutskever+, NIPS 2014]

  19. 2011
    2012
    2014

  20. 2011
    2012
    2014
    DNN

  21. 2011
    2012
    2014
    DNN
    CNN

  22. 2011
    2012
    2014
    DNN
    CNN
    RNN

  23. Convolution
    • The filter values are multiplied pixel-by-pixel
    – with the area at position (x, y) of the left (input) image
    – their sum is stored at the corresponding position of the right (output) image
    • Multiple 2D filters → 3D array
    [Dumoulin+Visin, 2016]
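
To make the multiply-and-sum concrete, here is a minimal NumPy sketch of a single-filter 2D convolution (illustration only; the array names are placeholders, not code from the tutorial):

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Plain 2D cross-correlation ("valid" padding): at each position (y, x),
    the filter is multiplied element-wise with the image patch and summed."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Stacking the outputs of multiple 2D filters gives a 3D array (H x W x num_filters).
```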

  24. AlexNet
    • AlexNet
    – 2 GPUs
    – 1 week for
    training
    • 5 conv. layers + 3 fully connected layers
    [Krizhevsky+, NIPS 2012]

  25. VGGNet and Inception
    • VGGNet [Simonyan+Zisserman, ICLR 2015]
    – CNN designed by Oxford's Visual Geometry Group
    – “Depth” is highlighted
    • Inception [Szegedy+, CVPR 2015]
    – CNN by Google
    – The Inception block (bottom right) is applied repeatedly.
    – After reducing the number of channels by 1x1 convolution, 3x3 or 5x5 convolution is applied
    → high expressiveness with fewer parameters

  26. ResNet
    • VGGNet has 16 / 19 layers
    – Further depth does not improve accuracy
    – Gradients may vanish during backpropagation
    • ResNet: skip connections between layers
    – Gradients are preserved through the identity mapping
    – ResNet = ensemble of multiple CNNs [Veit+, NIPS 2016]
    – Neural Ordinary Differential Equations [Chen+, NeurIPS 2018]
    [He+, CVPR 2016]
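
As a rough illustration of the skip connection (a generic PyTorch-style sketch, not code from the slides):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x (skip connection)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # the identity path keeps gradients flowing
```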

  27. Object Detection
    • RCNN (Region CNN) [Girshick+, CVPR 2014]
    – region proposal
    from an image
    – CNN over each
    region
    • Faster RCNN [Ren+, NIPS 2015]
    – RCNN requires running the CNN many times for a single image
    – Faster RCNN: apply the CNN only once to the whole image and estimate candidate regions at the same time
    → high speed and precision

  28. Semantic Segmentation
    • U-Net [Ronneberger+, MICCAI 2015]
    – Autoencoder + skip connection
    – The finer parts of each region
    can be segmented precisely
    • DeepLab v3[Chen+, ECCV 2018]
    – Feature extraction at multiple resolutions
    – Skip connection

  29. From 2D to 3D: PointNet
    [Qi+, CVPR 2017]

  30. Neural 3D Mesh Renderer
    [Kato+, CVPR 2018]

  31. Neural 3D Mesh Renderer
    Single 2D image
    3D model
    [Kato+, CVPR 2018]

  32. Neural 3D Mesh Renderer
    A 3D mesh rendering engine that is made differentiable for neural networks.
    Pipeline (figure): 2D image → 3D model inference → 3D model → rendering → estimated 2D silhouette,
    compared with the reference silhouette to compute the error.
    The 3D model estimator (originally differentiable) and the rendering engine (made differentiable)
    are updated with backpropagation.
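
A heavily simplified sketch of how such a differentiable-rendering pipeline could be trained end-to-end; `estimator` and `differentiable_render` are hypothetical stand-ins, not the actual implementation of [Kato+, CVPR 2018]:

```python
import torch

def training_step(estimator, differentiable_render, image, reference_silhouette, optimizer):
    """estimator: image -> mesh vertices; differentiable_render: mesh -> silhouette image.
    Both are assumed to be differentiable modules/functions (hypothetical names)."""
    vertices = estimator(image)                        # infer a 3D model from a single 2D image
    silhouette = differentiable_render(vertices)       # render the estimated silhouette
    loss = torch.mean((silhouette - reference_silhouette) ** 2)  # error vs. reference
    optimizer.zero_grad()
    loss.backward()                                    # gradients flow through the renderer
    optimizer.step()
    return loss.item()
```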

  33. Applications
    3D meshing of images style transfer from 2D to 3D 3D Deep Dream

  34. Computer Vision from the Point of
    View of Machine Learning

  35. Unsupervised Domain Adaptation (UDA)
    • Source domain:
    data with ground-truth labels, but not the data we ultimately want to recognize in the application.
    • Target domain:
    the data we want to recognize, but no ground-truth labels are available.
    • Semi-supervised domain adaptation:
    some target samples do have ground-truth labels.
    (Figure: source = video-game imagery, target = real-world imagery)

  36. UDA by Pseudo-Labeling
    [Saito+, ICML 2017]

  37. UDA by Pseudo-Labeling
    1st round: train on MNIST → add pseudo-labels (e.g. “eight”, “nine”) to easy target samples
    From the 2nd round: train on MNIST plus the pseudo-labeled samples → add more pseudo-labels
    Asymmetric Tri-training for Domain Adaptation
    [Saito+, ICML 2017]

  38. Proposed Architecture
    – A shared network F extracts features from the input X.
    – Two labeling networks F1, F2 and a target-specific network Ft output predictions p1, p2, pt.
    – F1 and F2 are trained on S + Tl; Ft is trained on Tl.
    (S: source samples, Tl: pseudo-labeled target samples,
    y: label for a source sample, ŷ: pseudo-label for a target sample)

  39. Proposed Architecture
    – F is updated using the gradients from F1, F2, and Ft.
    (S: source samples, Tl: pseudo-labeled target samples,
    y: label for a source sample, ŷ: pseudo-label for a target sample)

  40. 1. Initial training
    – All networks (F, F1, F2, Ft) are trained using only source samples S.

  41. 2. Labeling target samples
    – If F1 and F2 agree on their predictions for a target sample, and either predicted
    probability is larger than a threshold value, the corresponding label is assigned
    to that target sample as a pseudo-label. (T: target samples)
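
A minimal sketch of this labeling rule, assuming NumPy arrays of class probabilities from F1 and F2; the threshold value here is an arbitrary placeholder, not necessarily the one used in the paper:

```python
import numpy as np

def assign_pseudo_labels(p1: np.ndarray, p2: np.ndarray, threshold: float = 0.9):
    """p1, p2: (num_target_samples, num_classes) class probabilities from F1 and F2.
    Returns indices of accepted target samples and their pseudo-labels."""
    pred1, pred2 = p1.argmax(axis=1), p2.argmax(axis=1)
    agree = pred1 == pred2                                               # F1 and F2 agree
    confident = np.maximum(p1.max(axis=1), p2.max(axis=1)) > threshold   # either is confident
    accepted = np.where(agree & confident)[0]
    return accepted, pred1[accepted]
```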

  42. 3. Retraining the network using pseudo-labeled target samples
    – F1, F2: trained on source and pseudo-labeled samples (S + Tl)
    – Ft: trained on pseudo-labeled samples only (Tl)
    – F: learns from all gradients

  43. 3. Retraining the network using pseudo-labeled target samples
    – Repeat the 2nd step and the 3rd step until convergence!

  44. Overall Objective
    – Cross-entropy losses for F1 and F2 on S + Tl and for Ft on Tl,
    plus a constraint on the weights W1, W2 of F1 and F2
    to force F1 and F2 to learn from different features.
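
A rough PyTorch-style sketch of such an objective; the |W1ᵀW2| penalty is one reading of the weight constraint in [Saito+, ICML 2017], and the exact formulation and weighting may differ:

```python
import torch
import torch.nn.functional as F_nn

def tri_training_loss(logits1, logits2, logits_t, labels_st, labels_t, W1, W2, lam=0.01):
    """logits1/logits2: F1/F2 outputs on S+Tl; logits_t: Ft outputs on Tl.
    W1, W2: weight matrices of F1 and F2; lam: weight of the constraint term."""
    ce = (F_nn.cross_entropy(logits1, labels_st)
          + F_nn.cross_entropy(logits2, labels_st)
          + F_nn.cross_entropy(logits_t, labels_t))
    # Encourage F1 and F2 to rely on different features: penalize |W1^T W2|
    weight_constraint = torch.sum(torch.abs(W1.t() @ W2))
    return ce + lam * weight_constraint
```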

  45. Experiments
    • Four adaptation scenarios between digits
    datasets
    – MNIST, SVHN, SYN DIGIT (synthesized digits)
    • One adaptation scenario between traffic
    signs datasets
    – GTSRB (real traffic signs), SYN SIGN (synthesized
    signs)
    (Datasets shown: MNIST, MNIST-M, SVHN, SYN DIGITS, SYN SIGNS, GTSRB)

  46. Accuracy on Target Domain
    • Our method outperformed other methods.
    – The effect of BN is obvious in some settings.
    – The effect of weight constraint is not obvious.
    Method                                     MNIST→MNIST-M  MNIST→SVHN  SVHN→MNIST  SYN DIGITS→SVHN  SYN NUM→GTSRB
    Source Only (w/o BN)                            59.1          37.2        68.1          84.1             79.2
    Source Only (with BN)                           57.1          34.9        70.1          85.5             75.7
    DANN [Ganin et al., 2014]                       81.5          35.7        71.1          90.3             88.7
    MMD [Long et al., ICML 2015]                    76.9           -          71.1          88.0             91.1
    DSN [Bousmalis et al., NIPS 2016]               83.2           -          82.7          91.2             93.1
    K-NN Labeling [Sener et al., NIPS 2016]         86.7          40.3        78.8           -                -
    Ours (w/o BN)                                   85.3          39.8        79.8          93.1             96.2
    Ours (w/o weight constraint)                    94.2          49.7        86.0          92.4             94.0
    Ours                                            94.0          52.8        86.8          92.9             96.2

  47. Another approach: Generative models
    • Generator = Feature extractor
    • Minimizing domain shifts
    – Backgrounds
    – Postures
    – Lighting conditions
    – …
    Features that are
    • discriminative with respect to the object classes
    • indistinguishable with respect to the domains
    are desirable for domain adaptation.

  48. Deep Domain Confusion (DDC)
    Simultaneous maximization of:
    • Classification accuracy on source domain
    • Overlap between source & target domains
    [Tzeng+, arXiv 2014]

  49. Network Architecture of DDC
    • Optimization of classification loss + domain loss
    • Domain loss:
    – Maximum Mean Discrepancy (MMD): the distance between the averaged feature
    on the source domain and the averaged feature on the target domain
    (minimizing it increases the overlap between the two domains)
    – Objective function: a linear combination of the MMD and the classification loss
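
A minimal sketch of a linear MMD term and a DDC-style combined objective, assuming PyTorch tensors; this simplifies the formulation in [Tzeng+, arXiv 2014]:

```python
import torch
import torch.nn.functional as F_nn

def linear_mmd(source_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean source feature and the mean target feature."""
    return torch.sum((source_feat.mean(dim=0) - target_feat.mean(dim=0)) ** 2)

def ddc_loss(logits_src, labels_src, feat_src, feat_tgt, lam: float = 0.25):
    """Classification loss on the source domain + lambda * MMD domain loss."""
    return F_nn.cross_entropy(logits_src, labels_src) + lam * linear_mmd(feat_src, feat_tgt)
```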

  50. Experiments using Office Dataset
    • Three datasets:
    – the same objects
    – captured under different environments
    • Proposed method:
    – The MMD-based domain loss leads to the best accuracy.

  51. Qualitative Discussion: t-SNE Plot
    Before adaptation:
    Different distributions for
    the same “Monitors”
    on source (blue) and
    target (green) domains

  52. Qualitative Discussion: t-SNE Plot
    After adaptation:
    Each distribution overlaps the other

  53. Deep Adaptation Networks (DAN)
    • Multiple kernels for MMD (MK-MMD)
    • In comparison to DDC:
    – MMD is calculated across multiple layers
    – Nonlinear distance with multiple kernels
    • Experimental Results on Office Dataset
    [Long+, ICML 2015]

  54. Domain Adversarial Neural Networks (DANN)
    • The original name didn’t include “adversarial”
    – The name “Domain-Adversarial Neural Networks”
    appears in the journal version [Ganin+, JMLR 2016]
    – perhaps because the original name could be confused with Deep Adaptation Networks (DAN)
    • Similar motivation to GANs:
    Adversarial learning to generate (extract)
    domain-invariant feature vectors
    – GAN: generated data vs. real data
    – DANN: feature vectors on a source domain vs. feature
    vectors on a target domain
    [Ganin+Lempitsky, ICML 2015]

  55. Network Architecture of DANN
    – The feature extractor tries to extract domain-invariant features
    – The label predictor classifies source data
    – The domain classifier aims to distinguish the two domains

  56. Adversarial Learning
    • Domain classification loss
    – the domain classifier attempts to minimize it
    – the feature extractor attempts to maximize it
    • Problem: with respect to the gradients of this loss,
    – gradient descent is required for the domain classifier
    – gradient ascent is required for the feature extractor
    How can the gradient directions be reversed between the two?

  57. Gradient Reversal Layer (GRL)
    A “function” that
    • does nothing during the forward pass
    • reverses the sign of the gradient during backpropagation
    is introduced as the GRL
    → simultaneous gradient descent + ascent
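
A common way to implement such a layer in PyTorch looks roughly like this (a generic sketch, not necessarily the authors' code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# Usage: domain_logits = domain_classifier(grad_reverse(features))
# The domain classifier descends on the domain loss while the feature extractor ascends.
```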

  58. Experimental Results
    • Office Dataset
    • Digit Dataset
    Feature distributions: SYN NUMBERS (red) → SVHN (blue), before and after adaptation

  59. Adversarial Discriminative Domain Adaptation
    Adversarial learning similar to DANN
    [Tzeng+, CVPR 2017]

  60. Disadvantages of DANN
    • DANN uses a single
    feature extractor for
    both domains
    ✓ The number of parameters can be reduced
    × It might be impossible to extract features of different domains using the same extractor.
    • Gradient Reversal Layer
    ✓ It is faithful to the objective function of GANs
    × Gradients from the discriminator may vanish early in training

  61. ADDA
    • Features are extracted by a CNN that is different in each domain;
    the CNN for the source domain is pre-trained
    • The inverted-label loss common in GANs is used instead of gradient reversal
    (applied to the target feature extractor and the domain discriminator)
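
A rough sketch of this inverted-label adversarial update, assuming PyTorch; `M_t` (target encoder) and `D` (domain discriminator) are generic stand-ins, not the authors' code:

```python
import torch
import torch.nn.functional as F_nn

def adda_step(M_t, D, feat_src, x_tgt, opt_d, opt_mt):
    """One adversarial step: D separates domains; M_t is trained with inverted labels."""
    feat_tgt = M_t(x_tgt)

    # 1) Discriminator: source features -> label 1, target features -> label 0
    d_src, d_tgt = D(feat_src.detach()), D(feat_tgt.detach())
    loss_d = (F_nn.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F_nn.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Target encoder: inverted label (pretend target features are "source")
    d_out = D(M_t(x_tgt))
    loss_mt = F_nn.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    opt_mt.zero_grad(); loss_mt.backward(); opt_mt.step()
```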

  62. Experimental Results
    State-of-the-art on Office and digit datasets

  63. Maximum Classifier Discrepancy (MCD)
    [Saito+, CVPR 2018]

  64. Maximum Classifier Discrepancy (MCD)
    So far, we've tried to match domains, but
    • Even if the distributions of the two domains overlap,
    the distribution of each class may not agree.
    [Saito+, CVPR 2018]

  65. Maximum Classifier Discrepancy (MCD)
    So far, we've tried to match domains, but
    • Even if the distributions of the two domains overlap,
    the distribution of each class may not agree.
    • We should match classifiers instead of domains.
    [Saito+, CVPR 2018]

  66. Maximum Classifier Discrepancy (MCD)
    1. Prepare two classifiers
    For each class, a classifier trained on the source domain
    ・avoids the dotted-line area
    ・may cross the solid-line area
    The diagonal area between the two boundaries
    (the discrepancy region) should be eliminated.

  67. Maximum Classifier Discrepancy (MCD)
    2. Finding as many discrepancies as possible
    Only classifiers are updated

  68. Maximum Classifier Discrepancy (MCD)
    3. Learning extractors to reduce discrepancy
    Only generator is updated

  69. Maximum Classifier Discrepancy (MCD)
    Repeat steps 2 and 3 until convergence
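
A compact sketch of the alternating updates, assuming PyTorch and using the mean absolute difference between the two classifiers' softmax outputs as the discrepancy; the details of [Saito+, CVPR 2018] are simplified:

```python
import torch
import torch.nn.functional as F_nn

def discrepancy(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between two softmax outputs."""
    return torch.mean(torch.abs(F_nn.softmax(p1, dim=1) - F_nn.softmax(p2, dim=1)))

def mcd_step(G, C1, C2, x_src, y_src, x_tgt, opt_g, opt_c):
    # Step 2: maximize discrepancy on the target domain (update the classifiers only)
    f_src, f_tgt = G(x_src).detach(), G(x_tgt).detach()
    loss_c = (F_nn.cross_entropy(C1(f_src), y_src) + F_nn.cross_entropy(C2(f_src), y_src)
              - discrepancy(C1(f_tgt), C2(f_tgt)))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Step 3: minimize discrepancy on the target domain (update the generator only)
    f_tgt = G(x_tgt)
    loss_g = discrepancy(C1(f_tgt), C2(f_tgt))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```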

  70. Experimental Results
    State-of-the-art on digit dataset

  71. Experimental Results
    Semantic Segmentation
    using synthesized data and real data

  72. Adversarial Dropout Regularization (ADR)
    [Saito+, ICLR 2018]

  73. Adversarial Dropout Regularization (ADR)
    So far, we've tried to match domains, but
    • Even if the distributions of the two domains overlap,
    the distribution of each class may not agree.
    • We should match classifiers instead of domains.
    [Saito+, ICLR 2018]
    ... oh, I think I heard it a while ago.

  74. ADR ≃ MCD by dropout
    The two classifiers are
    • trained explicitly in MCD
    • generated by dropout (proposed in ADR)
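
A small sketch of the idea, assuming PyTorch: sampling two independent dropout masks yields two "virtual" classifiers whose disagreement can be measured as in MCD. Dropout is applied to the classifier input here for simplicity, whereas the paper applies it inside the classifier:

```python
import torch
import torch.nn.functional as F_nn

def dropout_discrepancy(classifier, features: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Two forward passes with independent dropout masks act as two classifiers."""
    out1 = classifier(F_nn.dropout(features, p=p, training=True))
    out2 = classifier(F_nn.dropout(features, p=p, training=True))
    return torch.mean(torch.abs(F_nn.softmax(out1, dim=1) - F_nn.softmax(out2, dim=1)))
```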

  75. Training is similar to MCD

  76. Experimental Results
    State-of-the-art on digit datasets

  77. Experimental Results
    Semantic Segmentation
    using synthesized data and real data

  78. Open Set Domain Adaptation (OSDA)
    [Saito+, ECCV 2018]

  79. OSDA by Backpropagation
    (Figure: closed domain adaptation vs. open set domain adaptation — source and target class sets;
    in the open set case the target contains an “Unknown” class)
    ・ In closed domain adaptation, source and target completely share their classes.
    ・ Target samples are unlabeled.
    ・ Open set: the target contains unknown classes.
    cf. The reversed setting = Partial Domain Adaptation [Cao+, ECCV 2018]
    [Saito+, ECCV 2018]

  80. OSDA by Backpropagation
    [Saito+, ECCV 2018]

  81. Domain Adaptation for Object Detection
    [Saito+, CVPR 2019]

  82. Strong-Weak Distribution Alignment
    [Saito+, CVPR 2019]

  83. Computer Vision and
    Natural Language Processing

  84. 2014: Another impact of Deep Learning
    • Deep learning appears in machine translation
    [Sutskever+, NIPS 2014]
    – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient vanishing
    problem in RNN
    →Dealing with relations between distant words in a sentence
    – Four-layer LSTM is trained in an end-to-end manner
    →comparable to state-of-the-art (English to French)
    • Emergence of common techniques such as CNN/RNN
    → reduction of barriers to entry into CV+NLP

  85. Growth of user generated contents
    Especially in content posting/sharing services
    • Facebook: 300 million photos per day
    • YouTube: 400 hours of video per minute
    Pōhutukawa blooms this
    time of the year in New
    Zealand. As the flowers
    fall, the ground
    underneath the trees look
    spectacular.
    Pairs of a sentence
    + a video / photo
    →Collectable in
    large quantities

  86. Exploratory research on Vision and Language
    Captioning an image associated with its article
    [Feng+Lapata, ACL 2010]
    • Input: article + image Output: caption for image
    • Dataset: Sets of article + image + caption
    × 3361
    King Toupu IV died at the
    age of 88 last week.

  87. Exploratory research on Vision and Language
    Captioning an image associated with its article
    [Feng+Lapata, ACL 2010]
    • Input: article + image Output: caption for image
    • Dataset: Sets of article + image + caption
    × 3361
    King Toupu IV died at the
    age of 88 last week.
    Against this background, various research topics emerged, such as …

  88. Image Captioning
    [Ushiku+, ACM Multimedia 2012]

  89. Image Captioning
    [Ushiku+, ACM Multimedia 2012]

  90. Google NIC
    Concatenation of Google’s methods
    • GoogLeNet [Szegedy+, CVPR 2015]
    • MT with LSTM
    [Sutskever+, NIPS 2014]
    Caption (word sequence) S_0 ⋯ S_N for image I
    S_0: beginning of the sentence
    p_1 = LSTM(CNN(I))
    p_t = LSTM(S_{t−1}), t = 2 … N−1
    S_N: end of the sentence
    [Vinyals+, CVPR 2015]
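
A highly simplified PyTorch-style sketch of this CNN-to-LSTM decoding scheme; the dimensions, embedding, and `cnn` backbone are assumptions for illustration, not the exact NIC architecture:

```python
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    """A CNN feature conditions an LSTM that predicts the caption word by word."""
    def __init__(self, cnn: nn.Module, vocab_size: int, embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.cnn = cnn                                    # assumed to output (batch, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image: torch.Tensor, caption_tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.cnn(image).unsqueeze(1)           # image fed as the first "word"
        words = self.embed(caption_tokens)                # (batch, T, embed_dim)
        inputs = torch.cat([img_feat, words], dim=1)
        hidden_states, _ = self.lstm(inputs)
        return self.out(hidden_states)                    # next-word logits at every step
```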

  91. Video Captioning
    A man is holding a box of doughnuts.
    Then he and a woman are standing next each other.
    Then she is holding a plate of food.
    [Shin+, ICIP 2016]

  92. Multilingual Captioning
    Transfer learning among languages
    [Miyazaki+Shimizu, ACL 2016]
    • The vision-language grounding weights W_im are transferred
    • Efficient learning using a small amount of captions
    Example partial captions: “an elephant” → “an elephant is …” (English);
    “一匹の 象が” → “一匹の 象が 土の” (Japanese: “an elephant” → “an elephant … the dirt”)

  93. Image Caption Translation
    Ein Masten mit zwei Ampeln
    für Autofahrer. (German)
    A pole with two lights
    for drivers. (English)
    [Hitschler+, ACL 2016]

  94. Visual Question Answering
    [Fukui+, EMNLP 2016]

  95. VQA=Multiclass Classification
    An integrated feature (image feature + question feature) is fed to a usual classifier.
    Question: What objects are found on the bed?
    Answer: bed sheets, pillow
    (Figure: image → image feature; question → question feature; combined into the integrated feature)
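
A minimal sketch of this "VQA as multiclass classification" view, assuming PyTorch and plain feature concatenation as the fusion; [Fukui+, EMNLP 2016] actually uses multimodal compact bilinear pooling, which is more elaborate:

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Concatenate image and question features, then classify over the answer vocabulary."""
    def __init__(self, img_dim: int, q_dim: int, num_answers: int, hidden: int = 1024):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),   # one class per candidate answer
        )

    def forward(self, img_feat: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([img_feat, q_feat], dim=1))
```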

  96. Image Generation from Captions
    This bird is blue with white
    and has a very short beak.
    This flower is white and
    yellow in color, with petals
    that are wavy and smooth.
    [Zhang+, 2016]

  97. Towards more realistic image generation
    StackGAN [Zhang+, 2016]
    Two-step GANs
    • First GAN generates small and fuzzy image
    • Second GAN enlarges and refines it

  98. Visual Dialog (VisDial)
    Continuous Visual Question and its Answering
    Questioner Answerer
    A couple of people
    in the snow on skis.
    [Das+, CVPR 2017]

  99. Visual Dialog (VisDial)
    Questioner / Answerer, given the caption “A couple of people in the snow on skis.”
    Q: What are their genders?        A: 1 man, 1 woman
    Q: Are they both adults?          A: Yes
    Q: Do they wear goggles?          A: Looks like sunglasses
    Q: Do they have hats on?          A: Man does
    Q: Are there any other people?    A: No
    Q: What color is man’s hat?       A: Black
    Q: Is it snowing now?             A: No
    Q: What is woman wearing?         A: Blue jacket and black pants
    Q: Are they smiling?              A: Yes
    Q: Do you see trees?              A: Yes
    [Das+, CVPR 2017]

  100. Vision-and-Language Navigation (VLN)
    [Anderson+, ICCV 2017]

  101. Summary
    • Computer Vision: Short History
    • Computer Vision from Machine Learning
    • Introduction of Vision and Language
    • Contributions of Deep Learning
    – Most research themes existed before deep learning
    – Commodity techniques for processing images, videos, and natural language
    – Evolution of recognition and generation
    Towards a new stage of vision and language!

  103. supplementary material
    Details about Visual Captioning

  104. Every picture tells a story
    Dataset: Images + ⟨object, action, scene⟩ triplets + Captions
    1. Predict the triplet for an input image using an MRF
    2. Search for an existing caption associated with a similar triplet

    [Farhadi+, ECCV 2010]

  105. Every picture tells a story

    See something unexpected.

    A man stands next to a train
    on a cloudy day.
    [Farhadi+, ECCV 2010]

  106. Retrieve? Generate?
    • Retrieve
    • Generate
    – Template-based
    e.g. generating a Subject+Verb sentence
    – Template-free
    A small gray dog
    on a leash.
    A black dog
    standing in
    grassy area.
    A small white dog
    wearing a flannel
    warmer.
    Input Dataset

  107. Retrieve? Generate?
    • Retrieve
    – A small gray dog on a leash.
    • Generate
    – Template-based
    e.g. generating a Subject+Verb sentence
    – Template-free
    A small gray dog
    on a leash.
    A black dog
    standing in
    grassy area.
    A small white dog
    wearing a flannel
    warmer.
    Input Dataset

  108. Retrieve? Generate?
    • Retrieve
    – A small gray dog on a leash.
    • Generate
    – Template-based
    dog+stand ⇒ A dog stands.
    – Template-free
    A small gray dog
    on a leash.
    A black dog
    standing in
    grassy area.
    A small white dog
    wearing a flannel
    warmer.
    Input Dataset

  109. Retrieve? Generate?
    • Retrieve
    – A small gray dog on a leash.
    • Generate
    – Template-based
    dog+stand ⇒ A dog stands.
    – Template-free
    A small white dog standing on a leash.
    A small gray dog
    on a leash.
    A black dog
    standing in
    grassy area.
    A small white dog
    wearing a flannel
    warmer.
    Input Dataset

  110. Captioning with multi-keyphrases
    [Ushiku+, ACM MM 2012]

  111. End of sentence
    [Ushiku+, ACM MM 2012]

  112. Benefits of Deep Learning
    • Refinement of image recognition [Krizhevsky+, NIPS 2012]
    • Deep learning appears in machine translation
    [Sutskever+, NIPS 2014]
    – LSTM [Hochreiter+Schmidhuber, 1997] solves the gradient vanishing
    problem in RNN
    →Dealing with relations between distant words in a sentence
    – Four-layer LSTM is trained in an end-to-end manner
    →comparable to state-of-the-art (English to French)
    Emergence of common techniques such as CNN/RNN
    → reduction of barriers to entry into CV+NLP

  113. Google NIC
    Concatenation of Google’s methods
    • GoogLeNet [Szegedy+, CVPR 2015]
    • MT with LSTM
    [Sutskever+, NIPS 2014]
    Caption (word sequence) S_0 ⋯ S_N for image I
    S_0: beginning of the sentence
    p_1 = LSTM(CNN(I))
    p_t = LSTM(S_{t−1}), t = 2 … N−1
    S_N: end of the sentence
    [Vinyals+, CVPR 2015]

  114. Examples of generated captions
    [https://github.com/tensorflow/models/tree/master/im2txt]
    [Vinyals+, CVPR 2015]

  115. Comparison to [Ushiku+, ACM MM 2012]
    Input image → estimation of important words → connect the words with a grammar model
    • Estimation of important words
    – [Ushiku+, ACM MM 2012]: conventional object recognition (Fisher Vector + linear classifier)
    – Neural image captioning: Convolutional Neural Network
    • Connecting the words with a grammar model
    – [Ushiku+, ACM MM 2012]: conventional machine translation (log-linear model + beam search)
    – Neural image captioning: Recurrent Neural Network + beam search
    • Trained using only images and captions
    • The approaches are similar to each other

  116. Current development: Accuracy
    • Attention-based captioning [Xu+, ICML 2015]
    – Focus on some areas for predicting each word!
    – Both attention and caption models are trained
    using pairs of an image & caption

  117. Current development: Problem setting
    Dense captioning
    [Lin+, BMVC 2015] [Johnson+, CVPR 2016]

  118. Current development: Problem setting
    Generating captions for a photo sequence
    [Park+Kim, NIPS 2015][Huang+, NAACL 2016]
    The family
    got
    together for
    a cookout.
    They had a
    lot of
    delicious
    food.
    The dog
    was happy
    to be there.
    They had a
    great time
    on the
    beach.
    They even
    had a swim
    in the water.

  119. Current development: Problem setting
    Captioning using sentiment terms
    [Mathews+, AAAI 2016][Shin+, BMVC 2016]
    Neutral caption
    Positive caption

  120. Before Deep Learning
    • Grounding of languages and objects in videos
    [Yu+Siskind, ACL 2013]
    – Learning from only videos and their captions
    – Experiments on a controlled, small dataset with few objects
    • Deep learning should suit this problem
    – Image captioning: single image → word sequence
    – Video captioning: image sequence → word sequence

  121. End-to-end learning by Deep Learning
    • LRCN
    [Donahue+, CVPR 2015]
    – CNN+RNN for
    • Action recognition
    • Image / Video
    Captioning
    • Video to Text
    [Venugopalan+, ICCV 2015]
    – CNNs to recognize
    • Objects from RGB frames
    • Actions from flow images
    – RNN for captioning

  122. Video Captioning
    A man is holding a box of doughnuts.
    Then he and a woman are standing next each other.
    Then she is holding a plate of food.
    [Shin+, ICIP 2016]

  123. Video Captioning
    A boat is floating on the water near a mountain.
    And a man riding a wave on top of a surfboard.
    Then he on the surfboard in the water.
    [Shin+, ICIP 2016]

  124. Video Retrieval from Caption
    • Input: Captions
    • Output: A video related to the caption
    10 sec video clip from 40 min database!
    • Video captioning is also addressed
    A woman in blue is
    playing ping pong in a
    room.
    A guy is skiing with no
    shirt on and yellow
    snow pants.
    A man is water skiing
    while attached to a
    long rope.
    [Yamaguchi+, ICCV 2017]
