
How Deep Learning Changes Natural Language Processing


In this talk, I briefly review the advantages of deep learning over conventional machine learning methods, e.g., automatic feature extraction, generic gradient-based learning, end-to-end learning, and versatile software frameworks. I then explain the key ideas of deep learning that have been widely adopted in NLP: distributed representations of words/phrases/sentences, encoder-decoder models, attention mechanisms, etc. Deep learning has not only provided an alternative approach to statistical NLP, but has also bridged NLP to other research areas and increased the ‘bravery’ of NLP research. I will explain recent trends in NLP research, including multi-modal processing and context modeling, and conclude the talk by summarizing the future prospects of NLP.

Naoaki Okazaki

September 18, 2018
Transcript

  1. How Deep Learning Changes Natural Language Processing Naoaki Okazaki http://www.chokkan.org/

    http://www.nlp.c.titech.ac.jp/ School of Computing, Tokyo Institute of Technology
  2. Natural Language Processing and Deep Learning • Deep Learning (DL)

    made a breakthrough in computer vision • Reduced the error rate of image recognition by more than 10 percentage points (ILSVRC 2012) • At first, DL had limited impact on Natural Language Processing (NLP) • Natural languages have symbols that represent semantic information • Recently, most NLP studies use DL • DL-based methods achieve state-of-the-art performance on most tasks • Commercial machine translation systems replaced SMT with NMT • The essence of DL is representation learning • [Figure: the phrase "very good movie" translated into とても よい 映画 ("very good movie"), illustrating word embeddings (representing a word as a vector), semantic composition (computing the vector of a phrase from its constituent words), and the encoder-decoder model (generating a sequence of words from the composed vector)]
  3. Take-home message of this talk • Deep Learning (DL) is

    not just another ‘toy’ of machine learning; it also greatly changes the approach of NLP studies • DL replaces: • Monolithic ML tools with NN programs (16 mins) • Symbols with distributed representations (8 mins) • Pipeline architectures with end-to-end architectures • New trends of NLP (with bravery from DL) • Multi-modal processing • Incorporation of knowledge • Context modeling
  4. Start with a toy example: logical NAND • Logical NAND:

    y = ¬(x1 ∧ x2) • It has functional completeness • Any Boolean function can be implemented by combining NAND gates (useful in practice!) • Truth table: (x1, x2) = (0, 0) → y = 1; (0, 1) → 1; (1, 0) → 1; (1, 1) → 0
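To make functional completeness concrete, here is a small Python sketch (my own illustration, not from the slides) that builds NOT, AND, OR, and XOR purely out of a NAND primitive:

    def nand(a, b):
        # NAND truth table: 0,0 -> 1; 0,1 -> 1; 1,0 -> 1; 1,1 -> 0
        return 1 - (a & b)

    def not_(a):
        return nand(a, a)

    def and_(a, b):
        return not_(nand(a, b))

    def or_(a, b):
        return nand(not_(a), not_(b))

    def xor(a, b):
        # The classic four-gate XOR construction from NAND gates.
        c = nand(a, b)
        return nand(nand(a, c), nand(b, c))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor(a, b))   # prints 0, 1, 1, 0 for the four inputs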
  5. Realize logical NAND as a function • Find a function

    that satisfies y = f(x1, x2) • We can manually craft a function like this: f(x1, x2) = g((1 − x1) + (1 − x2)), where g(a) = 1 if a > 0 and 0 otherwise • Check: f(0, 0) = 1, f(0, 1) = 1, f(1, 0) = 1, f(1, 1) = 0
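The hand-crafted function can be checked directly in Python (a minimal sketch of the slide's f and g, not code from the original deck):

    def g(a):
        return 1 if a > 0 else 0

    def f(x1, x2):
        return g((1 - x1) + (1 - x2))

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, f(x1, x2))   # 1, 1, 1, 0 -- the NAND truth table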
  6. Crafting a function is neither generic nor practical • Input/output

    in a real-world task is more complex • Thus, we train a function from supervision data • Examples: POS tagging of "In March 2005, the New York Times acquired About, Inc." → IN NNP CD DT NNP NNP NNP VBD NNP NNP; phrase chunking → O B-NP I-NP B-NP I-NP I-NP I-NP B-VP B-NP B-NP; translation → 2005年 3月 , ニューヨーク・タイムズ は About 社 を 買収 し た . ("In March 2005, the New York Times acquired About, Inc."); dialogue: Input: "I heard Google and Yahoo were among the other bidders." → Response: …
  7. Training a model • Finding a function from scratch is

    hard • We assume a model with parameters • Let's assume a linear model (single-layer neural network): y = g(w1 x1 + w2 x2 + b), with input x1, x2 ∈ {0, 1}, output y ∈ {0, 1}, and parameters w1, w2, b ∈ ℝ • Try: https://chokkan.github.io/deeplearning/demo-slp.html • Train the model: find the parameters that reproduce the input/output of the supervision data
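For instance, one parameter setting that reproduces NAND with this linear model is w1 = w2 = −2, b = 3; these particular values are my own illustration, not taken from the slides:

    def g(a):                       # step activation
        return 1 if a > 0 else 0

    def predict(x1, x2, w1=-2.0, w2=-2.0, b=3.0):
        return g(w1 * x1 + w2 * x2 + b)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, predict(x1, x2))   # 1, 1, 1, 0

Training finds such parameters automatically from the supervision data instead of requiring us to guess them.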
  8. Using an ML tool (liblinear) https://www.csie.ntu.edu.tw/~cjlin/liblinear/

    Training data prepared for the NAND logic:
    $ cat nand.txt
    1 1:0 2:0
    1 1:0 2:1
    1 1:1 2:0
    0 1:1 2:1
    Train a linear SVM model (with bias term, -B 1):
    $ ./train -B 1 nand.txt model.txt
    .
    optimization finished, #iter = 11
    Objective value = -2.191466
    nSV = 4
    Test the model on the training data:
    $ ./predict nand.txt model.txt result.txt
    Accuracy = 100% (4/4)
    Actual output predicted by the model (the model reproduces the NAND logic perfectly):
    $ cat result.txt
    1
    1
    1
    0
  9. Using a DL framework (PyTorch) https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for NAND.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

    # Weights of the single-layer neural network.
    w = torch.randn(2, 1, dtype=torch.float, requires_grad=True)
    b = torch.randn(1, 1, dtype=torch.float, requires_grad=True)

    eta = 1
    for t in range(1000):
        # Compute outputs on the training data.
        y_hat = x.mm(w).add(b).sigmoid()
        # Compute the loss.
        ll = y * y_hat + (1 - y) * (1 - y_hat)
        loss = -ll.log().sum()
        # Compute the gradient of the loss with respect to w and b.
        loss.backward()
        with torch.no_grad():
            # Update weights using SGD.
            w -= eta * w.grad
            b -= eta * b.grad
            # Clear the gradients for the next iteration.
            w.grad.zero_()
            b.grad.zero_()

    The code implements: X ∈ {0,1}^(4×2) with rows (0,0), (0,1), (1,0), (1,1); y = (1, 1, 1, 0); w ∈ ℝ^(2×1), b ∈ ℝ^(1×1); ŷ = σ(Xw + b); loss = −Σ log(y ŷ + (1 − y)(1 − ŷ)); updates w ← w − η ∂loss/∂w and b ← b − η ∂loss/∂b
  10. A more high-level usage of PyTorch https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for NAND.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

    # Define a neural network using high-level modules.
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 1, bias=True),  # 2 dims (with bias) -> 1 dim
    )

    # Binary cross-entropy loss after sigmoid function.
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

    # Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
    optimizer = torch.optim.SGD(model.parameters(), lr=1)

    for t in range(1000):
        y_hat = model(x)          # Make predictions.
        loss = loss_fn(y_hat, y)  # Compute the loss.
        optimizer.zero_grad()     # Zero-clear gradients.
        loss.backward()           # Compute the gradients.
        optimizer.step()          # Update the parameters using the gradients.
  11. ML tools look easier than DL frameworks! • (Conventional) ML

    tools do not require programming! ☺ • However, what if the training data is XOR (0 0 → 0, 0 1 → 1, 1 0 → 1, 1 1 → 0)? • A single-layer NN cannot realize XOR (Minsky and Papert, 1969) • Change the model into a multi-layer NN: y = g(w1^(2) h1 + w2^(2) h2 + b^(2)), h1 = g(w11^(1) x1 + w12^(1) x2 + b1^(1)), h2 = g(w21^(1) x1 + w22^(1) x2 + b2^(1)) • We must find another ML tool for multi-layer NNs! ☹
  12. Changes from NAND to XOR in PyTorch https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for XOR.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float)

    # Define a neural network using high-level modules.
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 2, bias=True),  # First layer: 2 dims (with bias) -> 2 dims
        torch.nn.Sigmoid(),                # Sigmoid function
        torch.nn.Linear(2, 1, bias=True),  # Second layer: 2 dims (with bias) -> 1 dim
    )

    # Binary cross-entropy loss after sigmoid function.
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

    # Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
    optimizer = torch.optim.SGD(model.parameters(), lr=1)

    for t in range(1000):
        y_hat = model(x)          # Make predictions.
        loss = loss_fn(y_hat, y)  # Compute the loss.
        optimizer.zero_grad()     # Zero-clear gradients.
        loss.backward()           # Compute the gradients.
        optimizer.step()          # Update the parameters using the gradients.
  13. Summary: Monolithic tools into programs • Before the DL era

    • Choose a tool for a model • liblinear (linear SVM), libsvm (kernel SVM), CRFsuite (CRF), … • After the DL era • Implement an NN model as a program • Actual data and models (e.g., sequence-to-sequence) are more complicated than those in this example • We can describe any NN model as a program • We design a model such that it can be trained by generic methods (stochastic gradient descent and backpropagation)
  14. Distributed representation (Hinton+ 1986) • Local (symbolic) representation • Assigns

    a unit (neuron, dimension, symbol) to every concept (e.g., unit #249, #809, or #18329 each standing for a single concept) • Distributed representation (vectors) • Each concept is represented by multiple units (micro-features) • Each unit commits to multiple concepts
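A toy numeric contrast between the two representations (the dense vector values below are made up purely for illustration):

    import numpy as np

    vocab = ['beer', 'cider', 'wine', 'movie']

    # Local (symbolic) representation: one dedicated unit per concept.
    one_hot_beer = np.eye(len(vocab))[vocab.index('beer')]    # [1. 0. 0. 0.]

    # Distributed representation: each concept spreads over many units,
    # and each unit takes part in many concepts.
    dense_beer  = np.array([ 0.21, -0.53,  0.08,  0.77, -0.30])
    dense_cider = np.array([ 0.25, -0.49,  0.11,  0.70, -0.28])

    # Related concepts end up close to each other (high cosine similarity),
    # which distinct one-hot vectors can never express (their similarity is always 0).
    cos = dense_beer @ dense_cider / (np.linalg.norm(dense_beer) * np.linalg.norm(dense_cider))
    print(one_hot_beer, cos)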
  15. Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)

    • Each word has a word vector v ∈ ℝ^d and a context vector ṽ ∈ ℝ^d • With context width h (here h = 2), each word vector predicts its 2h surrounding context words as positive examples (e.g., in the corpus fragment "pubs offer draught beer, cider, and wine") • Sample k words from the unigram distribution as negative examples (here k = 1) • Update rule: adjust the vectors so that each word vector predicts its positive context words and does not predict the negative words
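Training SGNS vectors yourself takes only a few lines with gensim. This sketch mirrors the slide's setting (skip-gram, context width 2, one negative sample) on a toy corpus; the corpus and hyperparameter values are my own choices, and gensim versions before 4.0 name the vector_size argument size:

    from gensim.models import Word2Vec

    # Toy corpus; in practice this would be a large tokenized text collection.
    sentences = [
        ['pubs', 'offer', 'draught', 'beer', 'cider', 'and', 'wine'],
        ['people', 'drink', 'beer', 'and', 'cider', 'in', 'pubs'],
    ]

    # sg=1 selects skip-gram, negative=1 sets one negative sample,
    # window=2 sets the context width h.
    model = Word2Vec(sentences, sg=1, negative=1, window=2,
                     vector_size=50, min_count=1, epochs=10)

    print(model.wv['beer'])               # the word vector for "beer"
    print(model.wv.most_similar('beer'))  # its nearest neighbours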
  16. Distributed representations of words (word vectors) • English: GoogleNews-vectors-negative300.bin.gz •

    Trained on the Google News dataset (100B words) • https://code.google.com/archive/p/word2vec/ • Japanese: (trained by me) • Trained on Japanese Wikipedia articles (400M words) • Use gensim to manipulate them in Python (a small sketch follows below)
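A minimal gensim sketch for loading and querying the pre-trained English vectors; the file path assumes the downloaded GoogleNews-vectors-negative300.bin.gz sits in the working directory:

    from gensim.models import KeyedVectors

    # Load the pre-trained word2vec vectors (300 dimensions, binary format).
    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    print(vectors.similarity('beer', 'wine'))    # cosine similarity between two words
    print(vectors.most_similar('beer', topn=5))  # nearest neighbours of "beer"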
  17. Word analogy

    • Paris : France = ? : UK, computed as vec(Paris) − vec(France) + vec(UK) • 安倍晋三 : 日本 = ? : ドイツ (Shinzo Abe : Japan = ? : Germany) • [Figure: word vectors plotted so that the difference vectors point in the "direction to capital city"]
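The analogy can be computed with the loaded vectors: vec(Paris) − vec(France) + vec(UK) is expected to land near vec(London), although the exact ranking depends on the model and on whether the query words are in its vocabulary:

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Paris : France = ? : UK, i.e., vec(Paris) - vec(France) + vec(UK)
    print(vectors.most_similar(positive=['Paris', 'UK'], negative=['France'], topn=3))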
  18. Distributed representations for sentences with Recurrent Neural Network (RNN) (Sutskever+

    2011) • [Figure: an RNN reading the sentence "I have a pen"] • Word embeddings: represent each word with a vector x_t ∈ ℝ^d • Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector h_(t−1) at the previous timestep, h_t = g(W_xh x_t + W_hh h_(t−1)), with h_0 = 0 • Fully-connected layer for a task: make a prediction from the final hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer and softmax ☺ • The parameters W_xh, W_hh, W_hy are shared over the entire sequence and are trained from the supervision signal for x_1, …, x_4 using backpropagation
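A compact PyTorch sketch of such an RNN classifier; the layer sizes and token ids are placeholders of my own choosing, and nn.RNN internally computes h_t = tanh(W_xh x_t + W_hh h_(t−1) + b):

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)            # word -> vector x_t
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # recurrent composition
            self.out = nn.Linear(hidden_dim, num_classes)               # fully-connected layer for the task

        def forward(self, token_ids):            # token_ids: (batch, seq_len)
            x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
            _, h_last = self.rnn(x)              # final hidden vector: (1, batch, hidden_dim)
            return self.out(h_last.squeeze(0))   # logits for the prediction

    model = RNNClassifier(vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2)
    logits = model(torch.tensor([[4, 8, 15, 16]]))   # e.g., ids for "I have a pen"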
  19. Summary: Symbolic to distributed • Before the DL era •

    Extract features from words and syntactic trees • Unigrams, bigrams, POSs, dependency paths, … • After the DL era • Represent words (or letters) with vectors • Build an NN to compose sentence vectors from word (or letter) vectors • Train the NN such that it can solve a target task • Here, we write a program to describe the NN model • Distributed representations are trained as a by-product
  20. Statistical Machine Translation

    Pipeline architecture: bilingual text → GIZA++ → translation model; monolingual text → KenLM, MITLM, … → language model; Moses (decoder) combines the translation model and the language model to translate a source sentence into a target sentence
  21. Neural Machine Translation (NMT) (Sutskever+ 2014; Cho+ 2014)

    • [Figure: the encoder reads "I have a pen"; the decoder generates "ペン を 持つ" ("have a pen") starting from BOS and ending with EOS; the vector between them is the representation of the input; the illustration omits the matrices of the RNNs] • Encode an input sentence into a vector, and generate a sentence by decoding (predicting) a word sequence from that vector • Known as the encoder-decoder model or the sequence-to-sequence model • Machine translation is realized by a single neural network, not by a combination of components!
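A minimal encoder-decoder sketch in PyTorch, using LSTMs and teacher forcing; the vocabulary sizes, dimensions, and token ids are placeholders of my own choosing, not the configuration used in the cited papers:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, embed_dim)
            self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode the whole source sentence into the final (h, c) state.
            _, state = self.encoder(self.src_embed(src_ids))
            # Decode with teacher forcing: feed the gold target prefix (BOS ...),
            # conditioned only on the fixed-size state from the encoder.
            dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
            return self.out(dec_out)              # logits over the target vocabulary

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
    src = torch.tensor([[5, 6, 7, 8]])            # e.g., ids for "I have a pen"
    tgt = torch.tensor([[1, 9, 10, 11]])          # e.g., ids for "BOS ペン を 持つ"
    logits = model(src, tgt)                      # shape: (1, 4, 1000)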
  22. Critical problem of the early NMT • The NMT model

    represents an input with a fixed-size vector • The model has no flexibility about the amount of information in an input • The model suffers when handling longer sentences
  23. The idea of the attention mechanism

    • [Figure: the encoder reads "This is a pen"; the decoder generates "これ は ペン" ("this is a pen") from BOS to EOS] • At each timestep in the decoder, predict a word using a weighted sum of all hidden vectors in the input • The attention mechanism determines the weights automatically from the decoder state • The decoder now has access to all hidden vectors in the input
  24. Attention mechanism (Bahdanau+ 2015, Luong+ 2015)

    • [Figure: decoder steps t over "BOS (t = 1) これ (t = 2) は …", encoder steps s over "This (s = 1) is (s = 2) a (s = 3) pen (s = 4)"] • Different time-step variables are used for the encoder (s) and the decoder (t) • Attention weights: a_t(s) = exp(score(h_t, h̄_s)) / Σ_s' exp(score(h_t, h̄_s')), i.e., a_t = softmax over the scores • Context vector: c_t = Σ_s a_t(s) h̄_s • Attentional hidden state: h̃_t = tanh(W_c [c_t; h_t]) • Score function (dot product): score(h_t, h̄_s) = h_t ⋅ h̄_s • Computation flow (Luong+ 2015): h_(t−1) → h_t → a_t(s) → c_t → h̃_t → y_t → h_(t+1) • score(h_t, h̄_s) expresses how much the decoder at time step t needs information from time step s in the encoder
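A sketch of Luong-style global attention with the dot-product score, following the equations above; the shapes and tensor contents are placeholders, and a real decoder would call this at every time step t:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def global_attention(dec_h, enc_hs, W_c):
        """dec_h:  (batch, hidden)          decoder state h_t
           enc_hs: (batch, src_len, hidden) encoder states h̄_s
           W_c:    nn.Linear(2 * hidden, hidden)"""
        # score(h_t, h̄_s) = h_t . h̄_s  (dot-product score)
        scores = torch.bmm(enc_hs, dec_h.unsqueeze(2)).squeeze(2)    # (batch, src_len)
        a = F.softmax(scores, dim=1)                                 # attention weights a_t(s)
        c = torch.bmm(a.unsqueeze(1), enc_hs).squeeze(1)             # context c_t = sum_s a_t(s) h̄_s
        h_tilde = torch.tanh(W_c(torch.cat([c, dec_h], dim=1)))      # h̃_t = tanh(W_c [c_t; h_t])
        return h_tilde, a

    batch, src_len, hidden = 2, 5, 8
    W_c = nn.Linear(2 * hidden, hidden)
    h_tilde, a = global_attention(torch.randn(batch, hidden),
                                  torch.randn(batch, src_len, hidden), W_c)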
  25. Attention has an advantage on longer sentences

    • [Figure from Luong+ 2015: translation quality by sentence length] • local-p: an attention mechanism that predicts the focal range of the input sequence based on the hidden state of the decoder
  26. Attention roughly represents alignments

    [Figure from Luong+ 2015: alignment heat maps for global attention, local monotonic focus, and local predictive focus, compared against the gold alignment]
  27. Summary: Pipeline to end-to-end • Before the DL era •

    Develop models/methods and combine them • For example, SMT: GIZA++ + KenLM + Moses • After the DL era • Implement the whole model as one big NN • This approach is called the end-to-end approach • Attention is also trained in the end-to-end fashion • As we will see in the next section, many studies take this approach for various tasks
  28. Show, attend and tell (Xu+ 2015)

    [Figure from Xu+ 2015: generating image captions with visual attention over image regions]
  29. Attention over video frames (Laokulrat+ 2016)

    • Base model: sequence-to-sequence model with two-layer LSTMs • Encode every eighth frame of a video into a 4096- or 2048-dimensional vector using the fc7 layer of VGG16 or ResNet • Attention over the input video sequence
  30. Visual Question Answering (VQA) (Goyal+ 17)

    [Figure: examples from the VQA 2.0 dataset]
  31. Visual Genome (Krishna+ 17)

    • A dataset describing objects and their relationships/attributes as a graph • Describe each region of the image, and convert the text into a graph • Merge the graphs of the regions to obtain a scene graph • A number of studies parse an image and a sentence into graphs on this dataset
  32. New trends of NLP (2): Context modeling
  33. Reading comprehension (Hermann+ 2015) • Convert a pair of a

    news article and its summary into a cloze-style problem • Anonymize entities after resolving coreferences (to disrupt simple baselines) • A benchmark dataset to measure the capability of accumulating context • Data size: 90k articles (CNN), 220k articles (Daily Mail) • [Figure: example from Hermann+ 2015]
  34. SQuAD (question answering) (Rajpurkar+ 2016) https://rajpurkar.github.io/SQuAD-explorer/

    • Questions and answers generated by crowd workers for Wikipedia articles • [Figure: an example from the article about Oxygen, the top-performing system as of Sep 2017, and a prediction by the baseline system]
  35. Story prediction (Mostafazadeh+ 2017) • Choose the right ending of

    a story consisting of four sentences • Use crowdsourcing to collect sentences representing the right ending and incorrect endings • Requires common-sense knowledge about discourse • Accuracy: 100% (human) vs. 75.2% (system) • [Figure: example from the dataset]
  36. Filling gaps in comics (Iyyer+ 2017)

    • Choose the panel with the right text (the other panel has jumbled text) • Choose the right panel of the scene (text information is hidden) • Accuracy on character coherence: NN model (63.2%) vs. human (88%) • Accuracy on visual cloze: NN model (70.9%) vs. human (87%)
  37. DL for advancing MT (2018-2021) Project with Tokyo Tech, U

    Tokyo, Ehime Univ, NHK, NES, and Jiji Press (funded by NICT) • [Diagram: four sub-themes integrated by deep learning: (1) Intelligent MT (context-aware), (2) MT for news articles (handling OOV), (3) Multi-modal MT, (4) Conversational MT; resources to build: a Ja-En corpus from news articles, a Ja-En conversation corpus, and Flickr30k / Visual Genome photos and captions with Ja translations; topics include coreference, consistency, domain adaptation, context/scene understanding from text and image, scene recognition, caption generation, summarization, multi-lingual MT, an MT engine for news, handling new words/topics, and verification in the media company]
  38. Conclusion • DL accelerates research cycles • Also with the

    preprint server (arXiv) • DL breaks the boundaries of research areas • Multi-modal NLP • The end-to-end approach increases the bravery • A number of new tasks and corpora have been proposed • We are unsure whether we are progressing • The fast research cycle stimulates research and follow-up verification of the progress • More details about DL: • Introduction to Deep Learning: https://chokkan.github.io/deeplearning/
  39. References
    • D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. Proc. of ICLR.
    • R Bawden, R Sennrich, A Birch, B Haddow. 2018. Evaluating discourse phenomena in neural machine translation. Proc. of NAACL.
    • K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.
    • Y Goyal, T Khot, D Summers-Stay, D Batra, D Parikh. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering. Proc. of CVPR.
    • K M Hermann, T Kočiský, E Grefenstette, L Espeholt, W Kay, M Suleyman, P Blunsom. 2015. Teaching machines to read and comprehend. Proc. of NIPS, pp. 1684–1692.
    • G Hinton, J McClelland, D Rumelhart. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume I, Chapter 3, pp. 77–109.
    • M Iyyer, V Manjunatha, A Guha, Y Vyas, J Boyd-Graber, H Daumé III, L Davis. 2017. The amazing mysteries of the gutter: drawing inferences between panels in comic book narratives. Proc. of CVPR.
    • R Krishna, Y Zhu, O Groth, J Johnson, K Hata, J Kravitz, S Chen, Y Kalantidis, L-J Li, D A Shamma, M S Bernstein, F-F Li. 2017. Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1).
    • N Laokulrat, S Phan, N Nishida, R Shu, Y Ehara, N Okazaki, Y Miyao, S Satoh, H Nakayama. 2016. Generating video description using sequence-to-sequence model with temporal attention. Proc. of Coling, pp. 44–52.
    • M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. Proc. of EMNLP, pp. 1412–1421.
    • T Mikolov, I Sutskever, K Chen, G Corrado, J Dean. 2013. Distributed representations of words and phrases and their compositionality. Proc. of NIPS, pp. 3111–3119.
    • M Minsky, S A Papert. 1969. Perceptrons: An Introduction to Computational Geometry. The MIT Press.
    • N Mostafazadeh, M Roth, A Louis, N Chambers, J Allen. 2017. LSDSem 2017 shared task: the story cloze test. 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics.
    • P Rajpurkar, J Zhang, K Lopyrev, P Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Proc. of EMNLP, pp. 2383–2392.
    • S Reddy, D Chen, C D Manning. 2018. CoQA: a conversational question answering challenge. Proc. of EMNLP.
    • I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
    • I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. Proc. of NIPS, pp. 3104–3112.
    • O Vinyals, Q V Le. 2015. A neural conversational model. Proc. of ICML Deep Learning Workshop.
    • K Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, attend and tell: neural image caption generation with visual attention. Proc. of ICML, pp. 2048–2057.