
Embed, encode, attend, predict: A four-step framework for understanding neural network approaches to Natural Language Understanding problems

While there is a large literature on developing neural networks for natural language understanding, these networks all share the same general architecture, determined by basic facts about the nature of linguistic input. In this talk I name and explain the four components (embed, encode, attend, predict), give a brief history of approaches to each subproblem, and explain two sophisticated networks in terms of this framework: one for text classification, and another for textual entailment. The talk assumes a general knowledge of neural networks and machine learning, and should be especially suitable for people who have been working on computer vision or other non-NLP problems.

Just as computer vision models are designed around the fact that images are two- or three-dimensional arrays of continuous values, NLP models are designed around the fact that text is a linear sequence of discrete symbols that form a hierarchical structure: letters are grouped into words, which are grouped into larger syntactic units (phrases, clauses, etc.), which are grouped into larger discursive structures (utterances, paragraphs, sections, etc.).

Because the input symbols are discrete (letters, words, etc.), the first step is "embed": map the discrete symbols into continuous vector representations. Because the input is a sequence, the second step is "encode": update the vector representation for each symbol given the surrounding context; you can't understand a sentence by looking up each word in the dictionary, because context matters. Because the input is hierarchical, sentences mean more than the sum of their parts. This motivates the third step, "attend": learn a further mapping from a variable-length matrix to a fixed-width vector. The fourth step, "predict", then uses that vector to compute the specific piece of information about the meaning of the text that the application needs.
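To make the four steps concrete, here is a minimal numpy sketch of how they compose. All names, sizes, the neighbour-averaging "encoder", and the random parameters are illustrative placeholders of my own choosing, not the models from the talk; real systems learn these parameters and use richer encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, CLASSES = 10_000, 64, 2          # illustrative sizes, not from the talk
embed_table = rng.normal(size=(VOCAB, DIM))  # one vector per word id (untrained here)

def embed(word_ids):
    """Step 1, embed: map discrete symbols to continuous vectors."""
    return embed_table[word_ids]              # shape (n_words, DIM)

def encode(vectors):
    """Step 2, encode: rewrite each vector in the context of its neighbours.
    A simple moving average stands in for the RNN/CNN encoders used in practice."""
    padded = np.vstack([vectors[:1], vectors, vectors[-1:]])
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def attend(matrix, query):
    """Step 3, attend: reduce a variable-length matrix to one fixed-width vector,
    weighting each row by its similarity to a query."""
    scores = matrix @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ matrix                   # shape (DIM,)

W_out = rng.normal(size=(DIM, CLASSES))

def predict(vector):
    """Step 4, predict: turn the summary vector into the value the application needs."""
    logits = vector @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

word_ids = rng.integers(0, VOCAB, size=7)     # a made-up 7-word "review"
query = rng.normal(size=DIM)                  # in practice, learned or task-derived
print(predict(attend(encode(embed(word_ids)), query)))
```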

Matthew Honnibal

April 12, 2018

Transcript

  1. Embed, Encode, Attend, Predict: a four-step framework for understanding neural network approaches to Natural Language Understanding problems. Dr. Matthew Honnibal, Explosion AI
  2. Taking a computer’s-eye view of language. Imagine you don’t speak Indonesian. You’re given: ⭐ 10,000 rated restaurant reviews, a quiet room, a pencil and paper, one week, and ☕ a lot of coffee. How could you learn to predict the ratings for new reviews?
  3. Siang ini saya mau makan di Teras dan tenyata penuh banget. Alhasil saya take away makanannya. Krn gak sabar nunggu. Benar kata orang kalau siang resto ini rame jadi menyiasatinya mungkin harus reserved dulu kali a Dateng ke sini karna ngeliat dari trip advisor... dan ternyata wow... ternyata reviewnya benar...makanannya lezat dan enak enak... variasinya banyak..dan dessertnya...es kopyor...super syegeer..saya suka Teras dharmawangsa tempatnya enak cozy place, enak untuk rame-rame, mau private juga ada, untuk makananya harganya terjangkau, terus rasanya enak, avocado coffe tidak terlalu manis
  4. Machine Learning and the reductionist’s dilemma. Machine Learning is all about generalization. What information matters in this example, and what’s irrelevant? Most sentences are unique, so we can’t process them holistically. If we can’t reduce, we can’t understand.
  5. How to understand reviews in a language you don’t understand: do the words in it usually occur in positive reviews or negative reviews? Track a positivity score for each Indonesian word. When you see a new word, assume its positivity is 0.5. Count up the average positivity score for the words in the review.
  6. Bag-of-words Text Classification. If total > 0.5 and the review is positive, or total < 0.5 and the review is negative: your theory worked! Next review. If total > 0.5 but the review is negative: your positivity scores for these words were too high! Decrease those scores slightly. If total < 0.5 but the review is positive: your positivity scores for these words were too low! Increase those scores slightly.
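A minimal sketch of the procedure on slides 5 and 6: a positivity score per word (defaulting to 0.5), the average score per review, and a nudge whenever the average disagrees with the label. The step size and the toy reviews are made up; the slides don't say how much "slightly" is.

```python
from collections import defaultdict

scores = defaultdict(lambda: 0.5)   # unseen words start at a positivity of 0.5
STEP = 0.01                         # how much to adjust "slightly" (made-up value)

def total(review):
    """Average positivity of the words in a review."""
    words = review.split()
    return sum(scores[w] for w in words) / len(words)

def update(review, is_positive):
    """Nudge the word scores whenever the average disagrees with the label."""
    if (total(review) > 0.5) == is_positive:
        return                      # your theory worked, next review
    for w in review.split():
        scores[w] += STEP if is_positive else -STEP

update("makanannya lezat dan enak", True)    # toy examples, not real training data
update("penuh banget", False)
print(total("makanannya lezat dan enak"))    # slightly above 0.5 after one update
```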
  7. What are we discarding? What’s the reduction? We’re assuming: different words are unrelated; words only have one meaning; meanings can be understood in isolation. How do we avoid assuming this? How do we learn what to learn?
  8. Think of data shapes, not application details. Integer: category label. Vector: single meaning. Sequence of vectors: multiple meanings. Matrix: meanings in context.
  9. Problem #1: all words look unique to the computer. “dog” and “puppy” are just strings of letters: P(id | "dog") may be easy to learn, but that tells us nothing about P(id | "puppy"), which we still need to predict.
  10. Solution #1: learn dense embeddings. “You shall know a word by the company it keeps.” “If it barks like a dog...” word2vec, PMI, LSI, etc.
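One concrete way to get such embeddings is to load a pretrained vector table. The spaCy model below is just one example of where such a table can come from; the slide names word2vec, PMI, and LSI rather than any particular toolkit.

```python
import spacy

# "en_core_web_md" ships with pretrained word vectors; any word2vec/GloVe-style
# table would illustrate the same point.
nlp = spacy.load("en_core_web_md")

dog, puppy, pencil = nlp("dog"), nlp("puppy"), nlp("pencil")
print(dog.similarity(puppy))    # high: different strings, but similar company kept
print(dog.similarity(pencil))   # lower: unrelated words get unrelated vectors
```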
  11. Problem #2: we’re discarding context. “I don't even like seafood, but the scallops were something else.” “You should go somewhere else. Like, literally, anywhere else.”
  12. Solution #2: learn to encode context. Take a list of word vectors; encode it into a sentence matrix.
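The slide doesn't name a particular encoder; the numpy sketch below uses a toy, randomly initialized bidirectional vanilla RNN purely to show the shape transformation: a list of word vectors in, a sentence matrix with one context-aware row per word out.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, HID = 64, 32                       # illustrative sizes
Wx_f, Wh_f = rng.normal(size=(DIM, HID)) * 0.1, rng.normal(size=(HID, HID)) * 0.1
Wx_b, Wh_b = rng.normal(size=(DIM, HID)) * 0.1, rng.normal(size=(HID, HID)) * 0.1

def rnn(vectors, Wx, Wh):
    """Run a vanilla RNN over the vectors; each state sees everything before it."""
    h, states = np.zeros(HID), []
    for v in vectors:
        h = np.tanh(v @ Wx + h @ Wh)
        states.append(h)
    return np.array(states)

def encode(word_vectors):
    """List of word vectors in, sentence matrix out: each row now reflects
    its left and right context, not just the word in isolation."""
    fwd = rnn(word_vectors, Wx_f, Wh_f)               # left-to-right pass
    bwd = rnn(word_vectors[::-1], Wx_b, Wh_b)[::-1]   # right-to-left pass
    return np.concatenate([fwd, bwd], axis=1)         # shape (n_words, 2 * HID)

sentence = rng.normal(size=(7, DIM))    # stand-in for 7 embedded words
print(encode(sentence).shape)           # (7, 64)
```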
  13. Problem #3: too much information. Okay, you’ve got a sentence matrix. Now what? The rows show the meaning of individual tokens; there is no representation of the entire sentence.
  14. Solution #3: learn what to pay attention to. Summarize the sentence with respect to a query; get a global, problem-specific representation.
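A small numpy sketch of the summarization step described here: score each row of the sentence matrix against a query vector, then take the weighted average. In practice the query is learned or comes from the task, e.g. a vector for the hypothesis when the premise of an entailment pair is being summarized; the shapes and random data below are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(sentence_matrix, query):
    """Summarize a variable-length sentence matrix with respect to a query:
    one attention weight per word, then a weighted average of the rows."""
    weights = softmax(sentence_matrix @ query)
    return weights @ sentence_matrix          # fixed-width, problem-specific vector

rng = np.random.default_rng(2)
premise = rng.normal(size=(9, 64))            # encoded 9-word sentence (toy data)
hypothesis_vec = rng.normal(size=64)          # stand-in for the query/hypothesis
print(attend(premise, hypothesis_vec).shape)  # (64,)
```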
  15. Problem #4: we need a specific value, not a generic representation. Okay, you’ve got a sentence vector. Now what? We’re still working with “representations”, but our application is looking for a value.
  16. Solution #4: learn to predict target values. Turn the generic architecture into a specific solution; provide the value to your application.
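A minimal sketch of this final step for the restaurant-review scenario, assuming a five-way star rating and randomly initialized weights; in a real system these weights are trained on the labelled reviews.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM, CLASSES = 64, 5                        # e.g. a 1-5 star rating (illustrative)
W, b = rng.normal(size=(DIM, CLASSES)) * 0.1, np.zeros(CLASSES)

def predict(sentence_vector):
    """Map the generic sentence representation to the value the application
    actually wants: here, a probability for each star rating."""
    logits = sentence_vector @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = predict(rng.normal(size=DIM))       # toy sentence vector
print("predicted rating:", int(probs.argmax()) + 1)
```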
  17. What if we don’t have 10,000 reviews? Initialize the model with as much knowledge as possible: word embeddings, context embeddings, transfer learning. Save your data for attend and predict; use general knowledge of the language for embed and encode.
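A sketch of the initialization strategy described here. The placeholder vectors stand in for a real pretrained table (word2vec, GloVe, fastText, or a pretrained language model); the vocabulary and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 64

# Placeholder for a real pretrained vector table loaded from disk.
pretrained = {w: rng.normal(size=DIM) for w in ["enak", "lezat", "penuh"]}

vocab = ["enak", "lezat", "penuh", "kopyor"]          # the task's own vocabulary
embed_table = np.stack([
    pretrained.get(w, np.zeros(DIM))                  # general knowledge where available
    for w in vocab
])

# embed (and ideally encode) start from these general-purpose parameters;
# the scarce labelled reviews are then spent on the attend and predict weights.
print(embed_table.shape)                              # (4, 64)
```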
  18. Conclusion: neural networks let us learn what to learn. Knowledge must come from somewhere, ideally unlabelled text (e.g. word embeddings). You still need labels to predict what you’re really interested in. The general shapes are now well understood, but there’s lots to mix and match.