
How Deep Learning Changes Natural Language Processing


In this talk, I briefly review the advantages of deep learning over conventional machine learning methods, e.g., automatic feature extraction, generic gradient-based learning, end-to-end learning, and versatile software frameworks. I then explain the key ideas of deep learning that have been widely adopted in NLP: distributed representations of words/phrases/sentences, encoder-decoder models, attention mechanisms, etc. Deep learning has not only provided an alternative approach to statistical NLP, but has also bridged NLP to other research areas and increased the ‘bravery’ of NLP research. I will explain recent trends in NLP research, including multi-modal processing and context modeling, and conclude the talk by summarizing the future prospects of NLP.

Naoaki Okazaki

September 18, 2018
Transcript

  1. How Deep Learning Changes Natural Language Processing Naoaki Okazaki http://www.chokkan.org/

    http://www.nlp.c.titech.ac.jp/ School of Computing, Tokyo Institute of Technology
  2. Natural Language Processing and Deep Learning • Deep Learning (DL)

    made a breakthrough in computer vision • Reduced the error rate of image recognition by more than 10 percentage points (ILSVRC 2012) • At first, DL had limited impact on Natural Language Processing (NLP) • Natural languages have symbols that represent semantic information • Recently, most NLP studies use DL • DL-based methods achieve state-of-the-art performance on most tasks • Commercial machine translation systems replaced SMT with NMT • The essence of DL is representation learning • [Figure: the phrase "very good movie" translated into とても よい 映画 ("very good movie"), illustrating word embeddings (representing a word as a vector), semantic composition (computing the vector of a phrase from its constituent words), and the encoder-decoder model (generating a sequence of words from the composed vector)]
  3. Take-home message of this talk • Deep Learning (DL) is

    not just another ‘toy’ of machine learning; it also greatly changes the approach of NLP studies • DL replaces: • Monolithic ML tools with NN programs (16 mins) • Symbols with distributed representations (8 mins) • Pipeline architectures with end-to-end architectures • New trends of NLP (with bravery from DL) • Multi-modal processing • Incorporation of knowledge • Context modeling
  4. Start with a toy example: logical NAND • Logical NAND:

    y = ¬(x1 ∧ x2) • It has functional completeness • Any Boolean function can be implemented by combining NAND gates (useful in practice!) • Truth table: (x1, x2) = (0, 0) → y = 1; (0, 1) → 1; (1, 0) → 1; (1, 1) → 0
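To make functional completeness concrete, here is a small Python sketch (my own illustration, not from the slides) that builds NOT, AND, OR, and XOR purely out of a NAND primitive:

    def nand(a, b):
        # NAND truth table: 0,0 -> 1; 0,1 -> 1; 1,0 -> 1; 1,1 -> 0
        return 1 - (a & b)

    def not_(a):
        return nand(a, a)

    def and_(a, b):
        return not_(nand(a, b))

    def or_(a, b):
        return nand(not_(a), not_(b))

    def xor(a, b):
        # The classic four-gate XOR construction from NAND gates.
        c = nand(a, b)
        return nand(nand(a, c), nand(b, c))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor(a, b))   # prints 0, 1, 1, 0 for the four inputs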
  5. Realize logical NAND as a function • Find a function

    that satisfies y = f(x1, x2) • We can manually craft a function like this: f(x1, x2) = g((1 − x1) + (1 − x2)), where g(a) = 1 if a > 0 and 0 otherwise • Check: f(0, 0) = 1, f(0, 1) = 1, f(1, 0) = 1, f(1, 1) = 0
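The hand-crafted function can be checked directly in Python (a minimal sketch of the slide's f and g, not code from the original deck):

    def g(a):
        return 1 if a > 0 else 0

    def f(x1, x2):
        return g((1 - x1) + (1 - x2))

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, f(x1, x2))   # 1, 1, 1, 0 -- the NAND truth table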
  6. Crafting a function is neither generic nor practical • Input/output

    in a real-world task is more complex • Thus, we train a function from supervision data • Examples: POS tagging of "In March 2005, the New York Times acquired About, Inc." → IN NNP CD DT NNP NNP NNP VBD NNP NNP; phrase chunking → O B-NP I-NP B-NP I-NP I-NP I-NP B-VP B-NP B-NP; translation → 2005年 3月 , ニューヨーク・タイムズ は About 社 を 買収 し た . ("In March 2005, the New York Times acquired About, Inc."); dialogue: Input: "I heard Google and Yahoo were among the other bidders." → Response: …
  7. Training a model • Finding a function from scratch is

    hard • We assume a model with parameters • Let's assume a linear model (single-layer neural network): y = g(w1 x1 + w2 x2 + b), with input x1, x2 ∈ {0, 1}, output y ∈ {0, 1}, and parameters w1, w2, b ∈ ℝ • Try: https://chokkan.github.io/deeplearning/demo-slp.html • Train the model: find the parameters that reproduce the input/output of the supervision data
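For instance, one parameter setting that reproduces NAND with this linear model is w1 = w2 = −2, b = 3; these particular values are my own illustration, not taken from the slides:

    def g(a):                       # step activation
        return 1 if a > 0 else 0

    def predict(x1, x2, w1=-2.0, w2=-2.0, b=3.0):
        return g(w1 * x1 + w2 * x2 + b)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, predict(x1, x2))   # 1, 1, 1, 0

Training finds such parameters automatically from the supervision data instead of requiring us to guess them.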
  8. Using an ML tool (liblinear) https://www.csie.ntu.edu.tw/~cjlin/liblinear/

    Training data prepared for the NAND logic:
    $ cat nand.txt
    1 1:0 2:0
    1 1:0 2:1
    1 1:1 2:0
    0 1:1 2:1
    Train a linear SVM model (with bias term, -B 1):
    $ ./train -B 1 nand.txt model.txt
    .
    optimization finished, #iter = 11
    Objective value = -2.191466
    nSV = 4
    Test the model on the training data:
    $ ./predict nand.txt model.txt result.txt
    Accuracy = 100% (4/4)
    Actual output predicted by the model (the model reproduces the NAND logic perfectly):
    $ cat result.txt
    1
    1
    1
    0
  9. Using a DL framework (PyTorch) https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for NAND.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

    # Weights of the single-layer neural network.
    w = torch.randn(2, 1, dtype=torch.float, requires_grad=True)
    b = torch.randn(1, 1, dtype=torch.float, requires_grad=True)

    eta = 1
    for t in range(1000):
        # Compute outputs on the training data.
        y_hat = x.mm(w).add(b).sigmoid()
        # Compute the loss.
        ll = y * y_hat + (1 - y) * (1 - y_hat)
        loss = -ll.log().sum()
        # Compute the gradient of the loss with respect to w and b.
        loss.backward()
        with torch.no_grad():
            # Update weights using SGD.
            w -= eta * w.grad
            b -= eta * b.grad
            # Clear the gradients for the next iteration.
            w.grad.zero_()
            b.grad.zero_()

    The code implements: X ∈ {0,1}^(4×2) with rows (0,0), (0,1), (1,0), (1,1); y = (1, 1, 1, 0); w ∈ ℝ^(2×1), b ∈ ℝ^(1×1); ŷ = σ(Xw + b); loss = −Σ log(y ŷ + (1 − y)(1 − ŷ)); updates w ← w − η ∂loss/∂w and b ← b − η ∂loss/∂b
  10. A more high-level usage of PyTorch https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for NAND.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)

    # Define a neural network using high-level modules.
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 1, bias=True),  # 2 dims (with bias) -> 1 dim
    )

    # Binary cross-entropy loss after sigmoid function.
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

    # Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
    optimizer = torch.optim.SGD(model.parameters(), lr=1)

    for t in range(1000):
        y_hat = model(x)          # Make predictions.
        loss = loss_fn(y_hat, y)  # Compute the loss.
        optimizer.zero_grad()     # Zero-clear gradients.
        loss.backward()           # Compute the gradients.
        optimizer.step()          # Update the parameters using the gradients.
  11. ML tools look easier than DL frameworks! • (Conventional) ML

    tools do not require programming! ☺ • However, what if the training data is XOR (0 0 → 0, 0 1 → 1, 1 0 → 1, 1 1 → 0)? • A single-layer NN cannot realize XOR (Minsky and Papert, 1969) • Change the model into a multi-layer NN: y = g(w1^(2) h1 + w2^(2) h2 + b^(2)), h1 = g(w11^(1) x1 + w12^(1) x2 + b1^(1)), h2 = g(w21^(1) x1 + w22^(1) x2 + b2^(1)) • We must find another ML tool for multi-layer NNs! ☹
  12. Changes from NAND to XOR in PyTorch https://colab.research.google.com/drive/1P-zNuTFsEC5n2uHK6M86NIHsI6o3H3wa

    import torch

    # Training data for XOR.
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
    y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float)

    # Define a neural network using high-level modules.
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 2, bias=True),  # First layer: 2 dims (with bias) -> 2 dims
        torch.nn.Sigmoid(),                # Sigmoid function
        torch.nn.Linear(2, 1, bias=True),  # Second layer: 2 dims (with bias) -> 1 dim
    )

    # Binary cross-entropy loss after sigmoid function.
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

    # Optimizer based on SGD (change "SGD" to "Adam" to use Adam).
    optimizer = torch.optim.SGD(model.parameters(), lr=1)

    for t in range(1000):
        y_hat = model(x)          # Make predictions.
        loss = loss_fn(y_hat, y)  # Compute the loss.
        optimizer.zero_grad()     # Zero-clear gradients.
        loss.backward()           # Compute the gradients.
        optimizer.step()          # Update the parameters using the gradients.
  13. Summary: Monolithic tools into programs • Before the DL era

    • Choose a tool for a model • liblinear (linear SVM), libsvm (kernel SVM), CRFsuite (CRF), … • After the DL era • Implement an NN model as a program • Actual data and models (e.g., sequence-to-sequence) are more complicated than those in this example • We can describe any NN model as a program • We design a model such that it can be trained by generic methods (stochastic gradient descent and backpropagation)
  14. Distributed representation (Hinton+ 1986) • Local (symbolic) representation • Assigns

    a unit (neuron, dimension, symbol) to every concept (e.g., unit #249, #809, or #18329 each standing for a single concept) • Distributed representation (vectors) • Each concept is represented by multiple units (micro-features) • Each unit commits to multiple concepts
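A toy numeric contrast between the two representations (the dense vector values below are made up purely for illustration):

    import numpy as np

    vocab = ['beer', 'cider', 'wine', 'movie']

    # Local (symbolic) representation: one dedicated unit per concept.
    one_hot_beer = np.eye(len(vocab))[vocab.index('beer')]    # [1. 0. 0. 0.]

    # Distributed representation: each concept spreads over many units,
    # and each unit takes part in many concepts.
    dense_beer  = np.array([ 0.21, -0.53,  0.08,  0.77, -0.30])
    dense_cider = np.array([ 0.25, -0.49,  0.11,  0.70, -0.28])

    # Related concepts end up close to each other (high cosine similarity),
    # which distinct one-hot vectors can never express (their similarity is always 0).
    cos = dense_beer @ dense_cider / (np.linalg.norm(dense_beer) * np.linalg.norm(dense_cider))
    print(one_hot_beer, cos)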
  15. Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)

    • Each word has a word vector v ∈ ℝ^d and a context vector ṽ ∈ ℝ^d • With context width h (here h = 2), each word vector predicts its 2h surrounding context words as positive examples (e.g., in the corpus fragment "pubs offer draught beer, cider, and wine") • Sample k words from the unigram distribution as negative examples (here k = 1) • Update rule: adjust the vectors so that each word vector predicts its positive context words and does not predict the negative words
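Training SGNS vectors yourself takes only a few lines with gensim. This sketch mirrors the slide's setting (skip-gram, context width 2, one negative sample) on a toy corpus; the corpus and hyperparameter values are my own choices, and gensim versions before 4.0 name the vector_size argument size:

    from gensim.models import Word2Vec

    # Toy corpus; in practice this would be a large tokenized text collection.
    sentences = [
        ['pubs', 'offer', 'draught', 'beer', 'cider', 'and', 'wine'],
        ['people', 'drink', 'beer', 'and', 'cider', 'in', 'pubs'],
    ]

    # sg=1 selects skip-gram, negative=1 sets one negative sample,
    # window=2 sets the context width h.
    model = Word2Vec(sentences, sg=1, negative=1, window=2,
                     vector_size=50, min_count=1, epochs=10)

    print(model.wv['beer'])               # the word vector for "beer"
    print(model.wv.most_similar('beer'))  # its nearest neighbours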
  16. Distributed representations of words (word vectors) • English: GoogleNews-vectors-negative300.bin.gz •

    Trained on the Google News dataset (100B words) • https://code.google.com/archive/p/word2vec/ • Japanese: (trained by me) • Trained on Japanese Wikipedia articles (400M words) • Use gensim to manipulate them in Python (a small sketch follows below)
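A minimal gensim sketch for loading and querying the pre-trained English vectors; the file path assumes the downloaded GoogleNews-vectors-negative300.bin.gz sits in the working directory:

    from gensim.models import KeyedVectors

    # Load the pre-trained word2vec vectors (300 dimensions, binary format).
    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    print(vectors.similarity('beer', 'wine'))    # cosine similarity between two words
    print(vectors.most_similar('beer', topn=5))  # nearest neighbours of "beer"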
  17. Word analogy

    • Paris : France = ? : UK, computed as vec(Paris) − vec(France) + vec(UK) • 安倍晋三 : 日本 = ? : ドイツ (Shinzo Abe : Japan = ? : Germany) • [Figure: word vectors plotted so that the difference vectors point in the "direction to capital city"]
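The analogy can be computed with the loaded vectors: vec(Paris) − vec(France) + vec(UK) is expected to land near vec(London), although the exact ranking depends on the model and on whether the query words are in its vocabulary:

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Paris : France = ? : UK, i.e., vec(Paris) - vec(France) + vec(UK)
    print(vectors.most_similar(positive=['Paris', 'UK'], negative=['France'], topn=3))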
  18. Distributed representations for sentences with Recurrent Neural Network (RNN) (Sutskever+

    2011) • [Figure: an RNN reading the sentence "I have a pen"] • Word embeddings: represent each word with a vector x_t ∈ ℝ^d • Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector h_(t−1) at the previous timestep, h_t = g(W_xh x_t + W_hh h_(t−1)), with h_0 = 0 • Fully-connected layer for a task: make a prediction from the final hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer and softmax ☺ • The parameters W_xh, W_hh, W_hy are shared over the entire sequence and are trained from the supervision signal for x_1, …, x_4 using backpropagation
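A compact PyTorch sketch of such an RNN classifier; the layer sizes and token ids are placeholders of my own choosing, and nn.RNN internally computes h_t = tanh(W_xh x_t + W_hh h_(t−1) + b):

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)            # word -> vector x_t
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # recurrent composition
            self.out = nn.Linear(hidden_dim, num_classes)               # fully-connected layer for the task

        def forward(self, token_ids):            # token_ids: (batch, seq_len)
            x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
            _, h_last = self.rnn(x)              # final hidden vector: (1, batch, hidden_dim)
            return self.out(h_last.squeeze(0))   # logits for the prediction

    model = RNNClassifier(vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2)
    logits = model(torch.tensor([[4, 8, 15, 16]]))   # e.g., ids for "I have a pen"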
  19. Summary: Symbolic to distributed • Before the DL era •

    Extract features from words and syntactic trees • Unigrams, bigrams, POSs, dependency paths, … • After the DL era • Represent words (or letters) with vectors • Build an NN to compose sentence vectors from word (or letter) vectors • Train the NN such that it can solve a target task • Here, we write a program to describe the NN model • Distributed representations are trained as a by-product
  20. Statistical Machine Translation

    Pipeline architecture: bilingual text → GIZA++ → translation model; monolingual text → KenLM, MITLM, … → language model; Moses (decoder) combines the translation model and the language model to translate a source sentence into a target sentence
  21. Neural Machine Translation (NMT) (Sutskever+ 2014; Cho+ 2014)

    • [Figure: the encoder reads "I have a pen"; the decoder generates "ペン を 持つ" ("have a pen") starting from BOS and ending with EOS; the vector between them is the representation of the input; the illustration omits the matrices of the RNNs] • Encode an input sentence into a vector, and generate a sentence by decoding (predicting) a word sequence from that vector • Known as the encoder-decoder model or the sequence-to-sequence model • Machine translation is realized by a single neural network, not by a combination of components!
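A minimal encoder-decoder sketch in PyTorch, using LSTMs and teacher forcing; the vocabulary sizes, dimensions, and token ids are placeholders of my own choosing, not the configuration used in the cited papers:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, embed_dim)
            self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode the whole source sentence into the final (h, c) state.
            _, state = self.encoder(self.src_embed(src_ids))
            # Decode with teacher forcing: feed the gold target prefix (BOS ...),
            # conditioned only on the fixed-size state from the encoder.
            dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
            return self.out(dec_out)              # logits over the target vocabulary

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
    src = torch.tensor([[5, 6, 7, 8]])            # e.g., ids for "I have a pen"
    tgt = torch.tensor([[1, 9, 10, 11]])          # e.g., ids for "BOS ペン を 持つ"
    logits = model(src, tgt)                      # shape: (1, 4, 1000)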
  22. Critical problem of the early NMT • The NMT model

    represents an input with a fixed-size vector • The model has no flexibility about the amount of information in an input • The model suffers when handling longer sentences
  23. The idea of the attention mechanism

    • [Figure: the encoder reads "This is a pen"; the decoder generates "これ は ペン" ("this is a pen") from BOS to EOS] • At each timestep in the decoder, predict a word using a weighted sum of all hidden vectors in the input • The attention mechanism determines the weights automatically from the decoder state • The decoder now has access to all hidden vectors in the input
  24. Attention mechanism (Bahdanau+ 2015, Luong+ 2015)

    • [Figure: decoder steps t over "BOS (t = 1) これ (t = 2) は …", encoder steps s over "This (s = 1) is (s = 2) a (s = 3) pen (s = 4)"] • Different time-step variables are used for the encoder (s) and the decoder (t) • Attention weights: a_t(s) = exp(score(h_t, h̄_s)) / Σ_s' exp(score(h_t, h̄_s')), i.e., a_t = softmax over the scores • Context vector: c_t = Σ_s a_t(s) h̄_s • Attentional hidden state: h̃_t = tanh(W_c [c_t; h_t]) • Score function (dot product): score(h_t, h̄_s) = h_t ⋅ h̄_s • Computation flow (Luong+ 2015): h_(t−1) → h_t → a_t(s) → c_t → h̃_t → y_t → h_(t+1) • score(h_t, h̄_s) expresses how much the decoder at time step t needs information from time step s in the encoder
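A sketch of Luong-style global attention with the dot-product score, following the equations above; the shapes and tensor contents are placeholders, and a real decoder would call this at every time step t:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def global_attention(dec_h, enc_hs, W_c):
        """dec_h:  (batch, hidden)          decoder state h_t
           enc_hs: (batch, src_len, hidden) encoder states h̄_s
           W_c:    nn.Linear(2 * hidden, hidden)"""
        # score(h_t, h̄_s) = h_t . h̄_s  (dot-product score)
        scores = torch.bmm(enc_hs, dec_h.unsqueeze(2)).squeeze(2)    # (batch, src_len)
        a = F.softmax(scores, dim=1)                                 # attention weights a_t(s)
        c = torch.bmm(a.unsqueeze(1), enc_hs).squeeze(1)             # context c_t = sum_s a_t(s) h̄_s
        h_tilde = torch.tanh(W_c(torch.cat([c, dec_h], dim=1)))      # h̃_t = tanh(W_c [c_t; h_t])
        return h_tilde, a

    batch, src_len, hidden = 2, 5, 8
    W_c = nn.Linear(2 * hidden, hidden)
    h_tilde, a = global_attention(torch.randn(batch, hidden),
                                  torch.randn(batch, src_len, hidden), W_c)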
  25. Attention has an advantage on longer sentences

    • [Figure from Luong+ 2015: translation quality by sentence length] • local-p: an attention mechanism that predicts the focal range of the input sequence based on the hidden state of the decoder
  26. Attention roughly represents alignments

    [Figure from Luong+ 2015: alignment heat maps for global attention, local monotonic focus, and local predictive focus, compared against the gold alignment]
  27. Summary: Pipeline to end-to-end • Before the DL era •

    Develop models/methods and combine them • For example, SMT: GIZA++ + KenLM + Moses • After the DL era • Implement the whole model as one big NN • This approach is called the end-to-end approach • Attention is also trained in the end-to-end fashion • As we will see in the next section, many studies take this approach for various tasks
  28. Show, attend and tell (Xu+ 2015)

    [Figure from Xu+ 2015: generating image captions with visual attention over image regions]
  29. Attention over video frames (Laokulrat+ 2016)

    • Base model: sequence-to-sequence model with two-layer LSTMs • Encode every eighth frame of a video into a 4096- or 2048-dimensional vector using the fc7 layer of VGG16 or ResNet • Attention over the input video sequence
  30. Visual Question Answering (VQA) (Goyal+ 17)

    [Figure: examples from the VQA 2.0 dataset]
  31. Visual Genome (Krishna+ 17)

    • A dataset describing objects and their relationships/attributes as a graph • Describe each region of the image, and convert the text into a graph • Merge the graphs of the regions to obtain a scene graph • A number of studies parse an image and a sentence into graphs on this dataset
  32. New trends of NLP (2): Context modeling
  33. Reading comprehension (Hermann+ 2015) • Convert a pair of a

    news article and its summary into a cloze-style problem • Anonymize entities after resolving coreferences (to disrupt simple baselines) • A benchmark dataset to measure the capability of accumulating context • Data size: 90k articles (CNN), 220k articles (Daily Mail) • [Figure: example from Hermann+ 2015]
  34. SQuAD (question answering) (Rajpurkar+ 2016) https://rajpurkar.github.io/SQuAD-explorer/

    • Questions and answers generated by crowd workers for Wikipedia articles • [Figure: an example from the article about Oxygen, the top-performing system as of Sep 2017, and a prediction by the baseline system]
  35. Story prediction (Mostafazadeh+ 2017) • Choose the right ending of

    a story consisting of four sentences • Use crowdsourcing to collect sentences representing the right ending and incorrect endings • Requires common-sense knowledge about discourse • Accuracy: 100% (human) vs. 75.2% (system) • [Figure: example from the dataset]
  36. Filling gaps in comics (Iyyer+ 2017)

    • Choose the panel with the right text (the other panel has jumbled text) • Choose the right panel of the scene (text information is hidden) • Accuracy on character coherence: NN model (63.2%) vs. human (88%) • Accuracy on visual cloze: NN model (70.9%) vs. human (87%)
  37. DL for advancing MT (2018-2021) Project with Tokyo Tech, U

    Tokyo, Ehime Univ, NHK, NES, and Jiji Press (funded by NICT) • [Diagram: four sub-themes integrated by deep learning: (1) Intelligent MT (context-aware), (2) MT for news articles (handling OOV), (3) Multi-modal MT, (4) Conversational MT; resources to build: a Ja-En corpus from news articles, a Ja-En conversation corpus, and Flickr30k / Visual Genome photos and captions with Ja translations; topics include coreference, consistency, domain adaptation, context/scene understanding from text and image, scene recognition, caption generation, summarization, multi-lingual MT, an MT engine for news, handling new words/topics, and verification in the media company]
  38. Conclusion • DL accelerates research cycles • Also with the

    preprint server (arXiv) • DL breaks the boundaries of research areas • Multi-modal NLP • The end-to-end approach increases the bravery • A number of new tasks and corpora have been proposed • We are unsure whether we are progressing • The fast research cycle stimulates research and follow-up verification of the progress • More details about DL: • Introduction to Deep Learning: https://chokkan.github.io/deeplearning/
  39. References
    • D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. Proc. of ICLR.
    • R Bawden, R Sennrich, A Birch, B Haddow. 2018. Evaluating discourse phenomena in neural machine translation. Proc. of NAACL.
    • K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.
    • Y Goyal, T Khot, D Summers-Stay, D Batra, D Parikh. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering. Proc. of CVPR.
    • K M Hermann, T Kočiský, E Grefenstette, L Espeholt, W Kay, M Suleyman, P Blunsom. 2015. Teaching machines to read and comprehend. Proc. of NIPS, pp. 1684–1692.
    • G Hinton, J McClelland, D Rumelhart. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume I, Chapter 3, pp. 77–109.
    • M Iyyer, V Manjunatha, A Guha, Y Vyas, J Boyd-Graber, H Daumé III, L Davis. 2017. The amazing mysteries of the gutter: drawing inferences between panels in comic book narratives. Proc. of CVPR.
    • R Krishna, Y Zhu, O Groth, J Johnson, K Hata, J Kravitz, S Chen, Y Kalantidis, L-J Li, D A Shamma, M S Bernstein, F-F Li. 2017. Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1).
    • N Laokulrat, S Phan, N Nishida, R Shu, Y Ehara, N Okazaki, Y Miyao, S Satoh, H Nakayama. 2016. Generating video description using sequence-to-sequence model with temporal attention. Proc. of Coling, pp. 44–52.
    • M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. Proc. of EMNLP, pp. 1412–1421.
    • T Mikolov, I Sutskever, K Chen, G Corrado, J Dean. 2013. Distributed representations of words and phrases and their compositionality. Proc. of NIPS, pp. 3111–3119.
    • M Minsky, S A Papert. 1969. Perceptrons: An Introduction to Computational Geometry. The MIT Press.
    • N Mostafazadeh, M Roth, A Louis, N Chambers, J Allen. 2017. LSDSem 2017 shared task: the story cloze test. 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics.
    • P Rajpurkar, J Zhang, K Lopyrev, P Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Proc. of EMNLP, pp. 2383–2392.
    • S Reddy, D Chen, C D Manning. 2018. CoQA: a conversational question answering challenge. Proc. of EMNLP.
    • I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
    • I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. Proc. of NIPS, pp. 3104–3112.
    • O Vinyals, Q V Le. 2015. A neural conversational model. Proc. of ICML Deep Learning Workshop.
    • K Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, attend and tell: neural image caption generation with visual attention. Proc. of ICML, pp. 2048–2057.