Slide 1

Slide 1 text

How Deep Learning Changes Natural Language Processing
Naoaki Okazaki
http://www.chokkan.org/
http://www.nlp.c.titech.ac.jp/
School of Computing, Tokyo Institute of Technology

Slide 2

Slide 2 text

Natural Language Processing and Deep Learning
• Deep Learning (DL) made a breakthrough in computer vision
  • Reduced the error rate of image recognition by more than 10 points (ILSVRC 2012)
• At first, DL had limited impact on Natural Language Processing (NLP)
  • Natural languages have symbols that represent semantic information
• Recently, most NLP studies use DL
  • DL-based methods achieve state-of-the-art performance on most tasks
  • Commercial machine translation systems replaced SMT with NMT
• The essence of DL is representation learning
[Figure: the phrase "very good movie" translated into "とても よい 映画" (very good movie), illustrating word embeddings (representing a word as a vector), semantic composition (computing the vector of a phrase from its constituent words), and the encoder-decoder model (generating a sequence of words from the composed vector)]

Slide 3

Slide 3 text

Take-home message of this talk
• Deep Learning (DL) is not just another machine-learning 'toy'; it greatly changes how NLP research is done
• DL replaces:
  • Monolithic ML tools with NN programs (16 mins)
  • Symbols with distributed representations (8 mins)
  • Pipeline architectures with end-to-end architectures
• New trends of NLP (emerging from DL)
  • Multi-modal processing
  • Incorporation of knowledge
  • Context modeling

Slide 4

Slide 4 text

Monolithic ML Tools to Neural Network Programs

Slide 5

Slide 5 text

Start with a toy example: logical NAND
• Logical NAND: y = ¬(x₁ ∧ x₂)
• It has functional completeness
  • Any Boolean function can be implemented by using a combination of NAND gates (useful in practice!)

Truth table:
x₁  x₂  y
0   0   1
0   1   1
1   0   1
1   1   0

Slide 6

Slide 6 text

Realize logical NAND as a function
• Find a function f that satisfies: y = f(x₁, x₂)
• We can manually craft such a function, for example:
  f(x₁, x₂) = g((1 − x₁) + (1 − x₂)),  where g(a) = 1 if a > 0 and 0 otherwise
• Checking all inputs:
  f(0, 0) = g(2) = 1,  f(0, 1) = g(1) = 1,  f(1, 0) = g(1) = 1,  f(1, 1) = g(0) = 0
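A minimal Python sketch of this hand-crafted function (the names g and nand are just for illustration):

def g(a):
    # Step function: 1 if the pre-activation is positive, 0 otherwise.
    return 1 if a > 0 else 0

def nand(x1, x2):
    # Hand-crafted NAND: fires unless both inputs are 1.
    return g((1 - x1) + (1 - x2))

# Reproduces the truth table: [1, 1, 1, 0]
print([nand(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])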

Slide 7

Slide 7 text

Crafting a function is neither generic nor practical
• Input/output in a real-world task is more complex
• Thus, we train a function from supervision data

Example input: In March 2005, the New York Times acquired About, Inc.
• POS: IN NNP CD DT NNP NNP NNP VBD NNP NNP
• Phrase: O B-NP I-NP B-NP I-NP I-NP I-NP B-VP B-NP B-NP
• Translation: 2005年 3月 , ニューヨーク・タイムズ は About 社 を 買収 し た . (In March 2005, the New York Times acquired About, Inc.)

Dialogue example:
• Input: I heard Google and Yahoo were among the other bidders.
• Response:

Slide 8

Slide 8 text

Training a model
• Finding a function from scratch is hard
• Instead, we assume a model with parameters
• Let's assume a linear model (single-layer neural network):
  y = g(w₁x₁ + w₂x₂ + b)
  • Input: x₁, x₂ ∈ {0, 1};  Output: y ∈ {0, 1};  Parameters: w₁, w₂, b ∈ ℝ
• Try: https://chokkan.github.io/deeplearning/demo-slp.html
• Training a model: find the parameters that reproduce the input/output of the supervision data

Slide 9

Slide 9 text

Using an ML tool (liblinear)
https://www.csie.ntu.edu.tw/~cjlin/liblinear/

Training data prepared for the NAND logic:
$ cat nand.txt
1 1:0 2:0
1 1:0 2:1
1 1:1 2:0
0 1:1 2:1

Train a linear SVM model (with a bias term, -B 1):
$ ./train -B 1 nand.txt model.txt
.
optimization finished, #iter = 11
Objective value = -2.191466
nSV = 4

Test the model on the training data:
$ ./predict nand.txt model.txt result.txt
Accuracy = 100% (4/4)

Actual output predicted by the model (the model reproduces the NAND logic perfectly):
$ cat result.txt
1
1
1
0

Slide 10

Slide 10 text

Using a DL framework (PyTorch)
https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

import torch

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

# Weights of the single-layer neural network.
w = torch.randn(2, 1, dtype=torch.float, requires_grad=True)
b = torch.randn(1, 1, dtype=torch.float, requires_grad=True)

eta = 1
for t in range(1000):
    # Compute outputs on the training data.
    y_hat = x.mm(w).add(b).sigmoid()
    # Compute the loss.
    ll = y * y_hat + (1 - y) * (1 - y_hat)
    loss = -ll.log().sum()
    # Compute the gradient of the loss with respect to w and b.
    loss.backward()
    with torch.no_grad():
        # Update weights using SGD.
        w -= eta * w.grad; b -= eta * b.grad
        # Clear the gradients for the next iteration.
        w.grad.zero_(); b.grad.zero_()

The model and update rule shown on the slide:
• x = [[0,0], [0,1], [1,0], [1,1]],  y = [1, 1, 1, 0]
• w ∈ ℝ^(2×1),  b ∈ ℝ^(1×1)
• ŷ = σ(xw + b)
• loss = −Σ log( y·ŷ + (1 − y)·(1 − ŷ) )
• w ← w − η·∂loss/∂w,  b ← b − η·∂loss/∂b

Slide 11

Slide 11 text

A more high-level usage of PyTorch
https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

import torch

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after the sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
optimizer = torch.optim.SGD(model.parameters(), lr=1)

for t in range(1000):
    y_hat = model(x)            # Make predictions.
    loss = loss_fn(y_hat, y)    # Compute the loss.
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

Slide 12

Slide 12 text

ML tools look easier than DL frameworks!
• (Conventional) ML tools do not require programming! ☺
• However, what if the training data is XOR?
  x₁  x₂  y
  0   0   0
  0   1   1
  1   0   1
  1   1   0
• A single-layer NN cannot realize XOR (Minsky and Papert, 1969)
• Change the model into a multi-layer NN:
  y = g(w₁⁽²⁾h₁ + w₂⁽²⁾h₂ + b⁽²⁾)
  h₁ = g(w₁₁⁽¹⁾x₁ + w₁₂⁽¹⁾x₂ + b₁⁽¹⁾)
  h₂ = g(w₂₁⁽¹⁾x₁ + w₂₂⁽¹⁾x₂ + b₂⁽¹⁾)
• We must find another ML tool for multi-layer NNs! ☹

Slide 13

Slide 13 text

Changes from NAND to XOR in PyTorch
https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

import torch

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float)

# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # First layer: 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid activation
    torch.nn.Linear(2, 1, bias=True),   # Second layer: 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after the sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
optimizer = torch.optim.SGD(model.parameters(), lr=1)

for t in range(1000):
    y_hat = model(x)            # Make predictions.
    loss = loss_fn(y_hat, y)    # Compute the loss.
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

Slide 14

Slide 14 text

Summary: Monolithic tools into programs
• Before the DL era
  • Choose a tool for a model
  • liblinear (linear SVM), libsvm (kernel SVM), CRFsuite (CRF), …
• After the DL era
  • Implement a NN model as a program
  • Actual data and models (e.g., sequence-to-sequence) are more complicated than those in this example
  • We can describe any NN model as a program:
    • We design a model such that it can be trained by generic methods (stochastic gradient descent and backpropagation)

Slide 15

Slide 15 text

Symbolic Representations to Distributed Representations

Slide 16

Slide 16 text

Distributed representation (Hinton+ 1986)
• Local (symbolic) representation
  • Assigns a unit (neuron, dimension, symbol) to every concept
• Distributed representation (vectors)
  • Each concept is represented by multiple units (micro-features)
  • Each unit commits to multiple concepts
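To make the contrast concrete, here is a toy numpy sketch (the vocabulary and the dense vectors are made up purely for illustration):

import numpy as np

vocab = ['beer', 'wine', 'cider']
# Local (symbolic) representation: one dedicated unit per concept.
one_hot = np.eye(len(vocab))                  # beer = [1,0,0], wine = [0,1,0], ...
# Distributed representation: each concept spreads over several micro-features,
# and each unit participates in several concepts.
dense = np.array([[0.8, 0.1, 0.7, 0.2],       # beer
                  [0.7, 0.2, 0.1, 0.9],       # wine
                  [0.9, 0.1, 0.6, 0.3]])      # cider

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# One-hot vectors give zero similarity between any two distinct words,
# while dense vectors can encode that beer is closer to cider than to wine.
print(cos(one_hot[0], one_hot[2]))                    # 0.0
print(cos(dense[0], dense[2]), cos(dense[0], dense[1]))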

Slide 17

Slide 17 text

Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)
• Each word has a word vector w ∈ ℝᵈ and a context vector c̃ ∈ ℝᵈ
• With context width h (h = 2 in the slide's example), each word vector predicts its 2h surrounding context words in the corpus (positive examples)
• Sample k words from the unigram distribution as negative examples (k = 1 in the example)
• Update rule: adjust the vectors so that word vectors predict the positive context words and do not predict the negative words
[Figure: a corpus snippet "… pubs offer draught beer, cider, and wine …" with positive context words and randomly sampled negative words]
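A minimal numpy sketch of one SGNS update, assuming a single positive context word and k sampled negative words (the function name and learning rate are illustrative, not the word2vec implementation):

import numpy as np

def sgns_step(w, c_pos, C_neg, eta=0.025):
    # w: word vector (d,), c_pos: positive context vector (d,),
    # C_neg: negative context vectors (k, d).
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    g_pos = sigmoid(w @ c_pos) - 1.0        # pull w and c_pos together
    g_neg = sigmoid(C_neg @ w)              # push w away from the negatives, shape (k,)
    grad_w = g_pos * c_pos + g_neg @ C_neg  # gradient of the negative log-likelihood w.r.t. w
    c_pos -= eta * g_pos * w                # update the positive context vector
    C_neg -= eta * np.outer(g_neg, w)       # update the negative context vectors
    w -= eta * grad_w                       # update the word vector
    return w, c_pos, C_neg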

Slide 18

Slide 18 text

Distributed representations of words (word vectors)
• English: GoogleNews-vectors-negative300.bin.gz
  • Trained on the Google News dataset (100B words)
  • https://code.google.com/archive/p/word2vec/
• Japanese: (trained by me)
  • Trained on Japanese Wikipedia articles (400M words)
• Use gensim for manipulating them in Python
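A sketch of loading the pre-trained English vectors with gensim (assuming the file above has been downloaded to the working directory):

from gensim.models import KeyedVectors

# Load the 300-dimensional Google News vectors (word2vec binary format).
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print(kv['computer'].shape)   # (300,)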

Slide 19

Slide 19 text

Plot word vectors
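The slide shows a 2-D scatter plot of word vectors; one way to produce such a plot is to project the vectors with PCA, as in this sketch (the word list is illustrative and kv is the KeyedVectors object loaded earlier):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ['king', 'queen', 'man', 'woman', 'Paris', 'London', 'Tokyo']
vectors = [kv[w] for w in words]                  # 300-dimensional word vectors
xy = PCA(n_components=2).fit_transform(vectors)   # project to two dimensions
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))                       # label each point with its word
plt.show()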

Slide 20

Slide 20 text

Measure the similarity of word vectors
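For example, with gensim the cosine similarity between two word vectors is one line (the word pairs are chosen only for illustration):

# Related words get a higher cosine similarity than unrelated ones.
print(kv.similarity('king', 'queen'))
print(kv.similarity('king', 'car'))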

Slide 21

Slide 21 text

Finding similar words
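A gensim sketch of retrieving the nearest neighbours of a word in the vector space (the query word is illustrative):

# Words whose vectors are closest to the vector of 'beer' (by cosine similarity).
for word, score in kv.most_similar('beer', topn=5):
    print(word, score)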

Slide 22

Slide 22 text

Word analogy
• Paris : France = ? : UK
  • Compute v(Paris) − v(France) + v(UK) and look for the nearest word vector; the difference v(Paris) − v(France) roughly points in the "direction to the capital city"
• 安倍晋三 : 日本 = ? : ドイツ (Shinzo Abe : Japan = ? : Germany)
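A sketch of the analogy query with gensim, computing v(Paris) − v(France) + v(UK) and searching for the nearest words (the exact tokens depend on the vocabulary of the pre-trained vectors):

# Paris : France = ? : UK
print(kv.most_similar(positive=['Paris', 'UK'], negative=['France'], topn=3))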

Slide 23

Slide 23 text

Distributed representations for sentences with Recurrent Neural Networks (RNNs) (Sutskever+ 2011)
Example: classifying the sentence "I have a pen"
• Word embeddings: represent each word with a vector xₜ ∈ ℝᵈ
• Recurrent computation: compose a hidden vector hₜ from the input word xₜ and the hidden vector hₜ₋₁ at the previous timestep (with h₀ = 0):
  hₜ = g(W_xh·xₜ + W_hh·hₜ₋₁)
• Fully-connected layer for a task: make a prediction from the hidden vector h₄, which is composed from all words in the sentence, by using a fully-connected layer and softmax
• The parameters W_xh, W_hh, W_hy are shared over the entire sequence; they are trained from the supervision signal for x₁, …, x₄ using backpropagation
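A minimal PyTorch sketch of this architecture: an embedding layer, a vanilla RNN cell applied step by step with shared parameters, and a fully-connected layer on the final hidden state (all sizes, class and variable names are illustrative):

import torch

class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, emb_dim)  # word embeddings x_t
        self.cell = torch.nn.RNNCell(emb_dim, hidden_dim)     # h_t = g(W_xh x_t + W_hh h_{t-1})
        self.fc = torch.nn.Linear(hidden_dim, num_classes)    # fully-connected layer for the task

    def forward(self, token_ids):                    # token_ids: (seq_len,)
        h = torch.zeros(1, self.cell.hidden_size)    # h_0 = 0
        for x in self.embed(token_ids):              # iterate over the word vectors
            h = self.cell(x.unsqueeze(0), h)         # same parameters at every timestep
        return self.fc(h)                            # logits; softmax is applied inside the loss

model = RNNClassifier(vocab_size=10000, emb_dim=100, hidden_dim=128, num_classes=2)
logits = model(torch.tensor([12, 54, 7, 981]))       # e.g. the four tokens of "I have a pen"
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))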

Slide 24

Slide 24 text

Summary: Symbolic to distributed
• Before the DL era
  • Extract features from words and syntactic trees
  • Unigrams, bigrams, POSs, dependency paths, …
• After the DL era
  • Represent words (or letters) with vectors
  • Build a NN to compose vectors for sentences from the word (or letter) vectors
  • Train the NN such that it can solve the target task
  • Here, we write a program to describe the NN model
  • Distributed representations are trained as a by-product of solving the task

Slide 25

Slide 25 text

Pipeline to End-to-End Architectures

Slide 26

Slide 26 text

Statistical Machine Translation
A pipeline architecture:
• Bilingual text → translation model (GIZA++)
• Monolingual text → language model (KenLM, MITLM, …)
• The decoder (Moses) combines the translation model and the language model to translate a source sentence into the target language

Slide 27

Slide 27 text

Neural Machine Translation (NMT) (Sutskever+ 2014; Cho+ 2014)
• Encode an input sentence into a vector (the representation of the input), and generate a sentence by decoding (predicting) a word sequence from that vector
• Known as the encoder-decoder model or the sequence-to-sequence model
• Machine translation is realized by a single neural network, not by a combination of components!
Example: the encoder reads "I have a pen"; the decoder generates "ペン を 持つ" between BOS and EOS
※ This illustration omits the matrices of the RNNs
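A heavily simplified PyTorch sketch of the encoder-decoder idea (greedy decoding, no attention, no batching; the sizes, token IDs and names are illustrative, not the exact model of Sutskever+ 2014):

import torch

class Seq2Seq(torch.nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = torch.nn.Embedding(src_vocab, dim)
        self.tgt_embed = torch.nn.Embedding(tgt_vocab, dim)
        self.encoder = torch.nn.LSTM(dim, dim)
        self.decoder = torch.nn.LSTM(dim, dim)
        self.out = torch.nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, bos_id, eos_id, max_len=20):
        # Encode the whole source sentence into a fixed-size state (h, c).
        _, state = self.encoder(self.src_embed(src_ids).unsqueeze(1))
        # Decode greedily, feeding the previous prediction back as the next input.
        token, output = torch.tensor([bos_id]), []
        for _ in range(max_len):
            y, state = self.decoder(self.tgt_embed(token).unsqueeze(1), state)
            token = self.out(y.squeeze(1)).argmax(dim=-1)
            if token.item() == eos_id:
                break
            output.append(token.item())
        return output                      # generated target token IDs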

Slide 28

Slide 28 text

Critical problem of the early NMT
• The NMT model represents an input with a single fixed-size vector
  • The model has no flexibility in the amount of information it can keep about an input
  • As a result, the model struggles with longer sentences

Slide 29

Slide 29 text

The idea of the attention mechanism
• At each timestep in the decoder, predict a word using a weighted sum of all hidden vectors of the input
• The attention mechanism determines the weights automatically from the decoder state
• The decoder now has access to all hidden vectors of the input
[Figure: translating "This is a pen" into "これ は ペン" (this is a pen), with weighted connections from every encoder state to each decoder step]

Slide 30

Slide 30 text

Attention mechanism (Bahdanau+ 2015, Luong+ 2015)
• Different time-step variables are used for the encoder (s) and the decoder (t)
• score(hₜ, h̄ₛ): how much the decoder at time step t needs information from time step s in the encoder; with dot-product attention, score(hₜ, h̄ₛ) = hₜ · h̄ₛ
• Attention weights: aₜ(s) = exp(score(hₜ, h̄ₛ)) / Σₛ' exp(score(hₜ, h̄ₛ')), i.e., aₜ = softmax over the scores
• Context vector: cₜ = Σₛ aₜ(s)·h̄ₛ
• Attentional hidden state: h̃ₜ = tanh(W_c [cₜ ; hₜ])
• Computation flow (Luong+ 2015): hₜ₋₁ → hₜ → aₜ → cₜ → h̃ₜ → yₜ → hₜ₊₁
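A sketch of dot-product (global) attention in PyTorch following these equations (the shapes are illustrative: S encoder steps, hidden size d):

import torch

def dot_attention(h_t, h_enc, W_c):
    # h_t: decoder state (1, d); h_enc: encoder states (S, d); W_c: Linear(2d, d).
    scores = h_enc @ h_t.squeeze(0)           # score(h_t, h_s) = h_t . h_s, shape (S,)
    a_t = torch.softmax(scores, dim=0)        # attention weights a_t(s)
    c_t = a_t @ h_enc                         # context vector c_t = sum_s a_t(s) h_s
    h_tilde = torch.tanh(W_c(torch.cat([c_t, h_t.squeeze(0)])))  # h~_t = tanh(W_c [c_t; h_t])
    return h_tilde, a_t

# Example: 5 encoder states of dimension 8.
h_enc, h_t = torch.randn(5, 8), torch.randn(1, 8)
W_c = torch.nn.Linear(16, 8, bias=False)
h_tilde, a_t = dot_attention(h_t, h_enc, W_c)  # a_t sums to 1 over the 5 source positions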

Slide 31

Slide 31 text

Attention has an advantage on longer sentences (Luong+ 2015)
• local-p: an attention mechanism that predicts the focal range of the input sequence based on the hidden state of the decoder

Slide 32

Slide 32 text

Attention roughly represents alignments (Luong+ 2015)
[Figure: alignment heat maps for global attention, local monotonic focus, and local predictive focus, compared with the gold alignment]

Slide 33

Slide 33 text

Summary: Pipeline to end-to-end
• Before the DL era
  • Develop a model/method for each component and combine them
  • For example, SMT: GIZA++ + KenLM + Moses
• After the DL era
  • Implement the whole model as one big NN
  • This approach is called the end-to-end approach
  • Attention is also trained in an end-to-end fashion
  • As we will see in the next sections, many studies take this approach for various tasks

Slide 34

Slide 34 text

New trends of NLP (1): Multi-modal processing

Slide 35

Slide 35 text

End-to-end caption generation (Vinyals+ 2015)

Slide 36

Slide 36 text

Show, attend and tell (Xu+ 2015)

Slide 37

Slide 37 text

Attention over video frames (Laokulrat+ 2016)
• Base model: sequence-to-sequence model with two-layer LSTMs
• Encode every eighth frame of a video into a 4096- or 2048-dimensional vector by using the fc7 layer of VGG16 or ResNet
• Attention over the input video sequence

Slide 38

Slide 38 text

Visual Question Answering (VQA) (Goyal+ 17)
[Figure: examples from the VQA 2.0 dataset]

Slide 39

Slide 39 text

Visual Genome (Krishna+ 17)
• A dataset describing objects and their relationships/attributes as a graph
  • Describe each region of the image, and convert the text into a graph
  • Merge the graphs of the regions to obtain a scene graph
• A number of studies parse an image and a sentence into a graph on this dataset

Slide 40

Slide 40 text

New trends of NLP (2): Context modeling

Slide 41

Slide 41 text

Reading comprehension (Hermann+ 2015)
• Convert a pair of a news article and its summary into a cloze-style problem
• Anonymize entities after resolving coreferences (to disrupt simple baselines)
• A benchmark dataset to measure the capability of accumulating contexts
• Data size: 90k articles (CNN), 220k articles (Daily Mail)

Slide 42

Slide 42 text

SQuAD (question answering) (Rajpurkar+ 2016)
https://rajpurkar.github.io/SQuAD-explorer/
[Figure: questions and answers generated by crowd workers for the Wikipedia article about Oxygen; the top-performing systems as of Sep 2017; a prediction by the baseline system]

Slide 43

Slide 43 text

Story prediction (Mostafazadeh+ 2017)
• Choose the right ending of a story consisting of four sentences
• Crowdsourcing was used to collect sentences representing the right ending and incorrect endings
• Requires common-sense knowledge about discourse
• Accuracy: 100% (human) vs. 75.2% (system)
[Figure: an example from the dataset]

Slide 44

Slide 44 text

Filling gaps in comics (Iyyer+ 2017)
• Character coherence: choose the panel with the right text (the other candidate has its text jumbled)
  • Accuracy: NN model (63.2%) vs. human (88%)
• Visual cloze: choose the right panel for the scene (text information is hidden)
  • Accuracy: NN model (70.9%) vs. human (87%)

Slide 45

Slide 45 text

Context-dependent MT (Bawden+ 2018)

Slide 46

Slide 46 text

Conversational QA (CoQA) (Reddy+ 2018)

Slide 47

Slide 47 text

DL for advancing MT (2018-2021)
Project with Tokyo Tech, U Tokyo, Ehime Univ, NHK, NES, and Jiji Press (funded by NICT)
Four research topics, integrated by deep learning:
1. Intelligent MT (context-aware)
2. MT for news articles (handling OOV)
3. Multi-modal MT
4. Conversational MT
Resources: Flickr30k, Visual Genome + Japanese translations, photos and captions, a Ja-En corpus built from news articles, and a Ja-En conversation corpus; verification in the media company
Keywords from the project diagram: coreference, consistency, context/scene understanding from text and image, scene recognition, caption generation, multi-lingual, summarization, MT engine for news, handling new words/topics, domain adaptation

Slide 48

Slide 48 text

Conclusion
• DL accelerates research cycles
  • Also helped by the preprint server (arXiv)
• DL breaks the boundaries of research areas
  • Multi-modal NLP
• The end-to-end approach emboldens researchers
  • A number of new tasks and corpora have been proposed
  • We are unsure whether we are really progressing
  • The fast research cycle stimulates both new research and follow-up verification of the progress
• More detail about DL:
  • Introduction to Deep Learning: https://chokkan.github.io/deeplearning/

Slide 49

Slide 49 text

References
• D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. Proc. of ICLR.
• R Bawden, R Sennrich, A Birch, B Haddow. 2018. Evaluating discourse phenomena in neural machine translation. Proc. of NAACL.
• K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.
• Y Goyal, T Khot, D Summers-Stay, D Batra, D Parikh. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering. Proc. of CVPR.
• K M Hermann, T Kočiský, E Grefenstette, L Espeholt, W Kay, M Suleyman, P Blunsom. 2015. Teaching machines to read and comprehend. Proc. of NIPS, pp. 1684–1692.
• G Hinton, J McClelland, D Rumelhart. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume I, Chapter 3, pp. 77–109.
• M Iyyer, V Manjunatha, A Guha, Y Vyas, J Boyd-Graber, H Daumé III, L Davis. 2017. The amazing mysteries of the gutter: drawing inferences between panels in comic book narratives. Proc. of CVPR.
• R Krishna, Y Zhu, O Groth, J Johnson, K Hata, J Kravitz, S Chen, Y Kalantidis, L-J Li, D A Shamma, M S Bernstein, F-F Li. 2017. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1).
• N Laokulrat, S Phan, N Nishida, R Shu, Y Ehara, N Okazaki, Y Miyao, S Satoh, H Nakayama. 2016. Generating video description using sequence-to-sequence model with temporal attention. Proc. of Coling, pp. 44–52.
• M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. Proc. of EMNLP, pp. 1412–1421.
• T Mikolov, I Sutskever, K Chen, G Corrado, J Dean. 2013. Distributed representations of words and phrases and their compositionality. Proc. of NIPS, pp. 3111–3119.
• M Minsky, S A Papert. 1969. Perceptrons: An Introduction to Computational Geometry. The MIT Press.
• N Mostafazadeh, M Roth, A Louis, N Chambers, J Allen. 2017. LSDSem 2017 shared task: the story cloze test. Proc. of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics.
• P Rajpurkar, J Zhang, K Lopyrev, P Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Proc. of EMNLP, pp. 2383–2392.
• S Reddy, D Chen, C D Manning. 2018. CoQA: a conversational question answering challenge. Proc. of EMNLP.
• I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
• I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. Proc. of NIPS, pp. 3104–3112.
• O Vinyals, Q V Le. 2015. A neural conversational model. Proc. of ICML Deep Learning Workshop.
• K Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, attend and tell: neural image caption generation with visual attention. Proc. of ICML, pp. 2048–2057.