Embed, encode, attend, predict: A four-step framework for understanding neural network approaches to Natural Language Understanding problems

While there is a large literature on developing neural networks for natural language understanding, these networks all share the same general architecture, determined by basic facts about the nature of linguistic input. In this talk I name and explain the four components (embed, encode, attend, predict), give a brief history of approaches to each subproblem, and explain two sophisticated networks in terms of this framework -- one for text classification, and another for textual entailment. The talk assumes a general knowledge of neural networks and machine learning, and should be especially suitable for people who have been working on computer vision or other non-NLP problems.

Just as computer vision models are designed around the fact that images are two- or three-dimensional arrays of continuous values, NLP models are designed around the fact that text is a linear sequence of discrete symbols that form a hierarchical structure: letters are grouped into words, which are grouped into larger syntactic units (phrases, clauses, etc.), which are grouped into larger discursive structures (utterances, paragraphs, sections, etc.).

Because the input symbols are discrete (letters, words, etc.), the first step is "embed": map each discrete symbol to a continuous vector representation. Because the input is a sequence, the second step is "encode": update the vector for each symbol given the surrounding context -- you can't understand a sentence by looking up each word in the dictionary, because context matters. Because the input is hierarchical, sentences mean more than the sum of their parts; the third step, "attend", therefore learns a further mapping from the variable-length matrix of context-aware vectors to a fixed-width vector. The fourth step, "predict", maps that vector to the specific piece of information about the meaning of the text that the application needs.
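
To make the shapes concrete, here is a minimal numpy sketch of the four steps. The weights, dimensions, and the simple feed-forward "encoder" are illustrative stand-ins, not the networks discussed in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim, hidden_dim, n_classes = 10_000, 50, 64, 2
    embeddings = rng.normal(size=(vocab_size, embed_dim))        # embedding table
    W_encode = 0.01 * rng.normal(size=(embed_dim, hidden_dim))   # toy "encoder" weights
    query = rng.normal(size=hidden_dim)                          # learned attention query
    W_predict = 0.01 * rng.normal(size=(hidden_dim, n_classes))  # output weights

    token_ids = np.array([3, 17, 256, 4095])                     # the discrete input symbols

    vectors = embeddings[token_ids]           # 1. embed: ids -> vectors,        (4, 50)
    matrix = np.tanh(vectors @ W_encode)      # 2. encode: vectors -> matrix,    (4, 64)
    weights = np.exp(matrix @ query)
    weights /= weights.sum()                  # 3. attend: softmax over tokens
    sentence = weights @ matrix               #    matrix -> fixed-width vector, (64,)
    scores = sentence @ W_predict             # 4. predict: vector -> output,    (2,)
    print(scores.argmax())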

Matthew Honnibal

April 12, 2018

Transcript

  1. Embed, Encode,
    Attend, Predict
    A four-step framework for understanding
    neural network approaches to Natural
    Language Understanding problems
    Dr. Matthew Honnibal
    Explosion AI

  2. Taking a computer’s-eye
    view of language
    Imagine you don’t speak Indonesian. You’re given:
    ⭐ 10,000 rated restaurant reviews
    a quiet room, a pencil and paper
    one week
    ☕ a lot of coffee
    How could you learn to predict the ratings for
    new reviews?

  3. Siang ini saya mau makan di Teras dan tenyata penuh banget.
    Alhasil saya take away makanannya. Krn gak sabar nunggu. Benar
    kata orang kalau siang resto ini rame jadi menyiasatinya mungkin
    harus reserved dulu kali a
    Dateng ke sini karna ngeliat dari trip advisor... dan ternyata
    wow... ternyata reviewnya benar...makanannya lezat dan enak
    enak... variasinya banyak..dan dessertnya...es kopyor...super
    syegeer..saya suka
    Teras dharmawangsa tempatnya enak cozy place, enak untuk
    rame-rame, mau private juga ada, untuk makananya harganya
    terjangkau, terus rasanya enak, avocado coffe tidak terlalu manis

  4. Machine Learning and the
    reductionist’s dilemma
    Machine Learning is all about generalization
    What information matters in this example, and
    what’s irrelevant?
    Most sentences are unique, so we can’t process
    them holistically.
    If we can’t reduce, we can’t understand.

  5. How to understand reviews in a
    language you don’t understand
    Do the words in it usually occur in positive
    reviews or negative reviews?
    Track a positivity score for each Indonesian
    word. When you see a new word, assume its
    positivity is 0.5.
    Count up the average positivity score for the
    words in the review.
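
    As a rough sketch of this recipe (the score table and the helper function below are made up for illustration; a real table would be learned from the 10,000 reviews):

    # Illustrative per-word positivity scores only.
    positivity = {"enak": 0.9, "lezat": 0.85, "penuh": 0.3}

    def review_score(words, scores, default=0.5):
        # Unknown words get the neutral default of 0.5, as on the slide.
        return sum(scores.get(w, default) for w in words) / len(words)

    print(review_score("makanannya lezat dan enak".split(), positivity))  # 0.6875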

  6. Bag-of-words Text Classification
    If total > 0.5 and review is positive,
    or total < 0.5 and review is negative:
    Your theory worked! Next review.
    If total > 0.5 but review is negative:
    Your positivity scores for these words were too high!
    Decrease those scores slightly.
    If total < 0.5 but review is positive:
    Your positivity scores for these words were too low!
    Increase those scores slightly.
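
    A minimal sketch of that update rule, assuming the same kind of made-up score table as above and a small learning rate:

    def update(words, scores, label, lr=0.05, default=0.5):
        # Average positivity of the review's words; unknown words default to 0.5.
        total = sum(scores.get(w, default) for w in words) / len(words)
        if total > 0.5 and label == "negative":
            for w in words:                        # scores were too high: decrease slightly
                scores[w] = scores.get(w, default) - lr
        elif total < 0.5 and label == "positive":
            for w in words:                        # scores were too low: increase slightly
                scores[w] = scores.get(w, default) + lr

    scores = {"penuh": 0.2}
    update("resto ini rame dan penuh".split(), scores, "positive")  # average 0.44 < 0.5, so scores go up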

  7. What are we discarding?
    What’s the reduction?
    We’re assuming:
    different words are unrelated
    words only have one meaning
    meanings can be understood in isolation
    How do we avoid assuming this?
    How do we learn what to learn?

  8. Embed. Encode.
    Attend. Predict.

  9. Think of data shapes,
    not application details.
    integer → category label
    vector → single meaning
    sequence of vectors → multiple meanings
    matrix → meanings in context
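
    One way to read this table, with numpy arrays standing in for the shapes (dimensions are arbitrary):

    import numpy as np

    category_label = 1                        # integer: the thing we ultimately predict
    single_meaning = np.zeros(300)            # vector: one word, one meaning
    multiple_meanings = [np.zeros(300)        # sequence of vectors: one per token,
                         for _ in range(6)]   # before context is taken into account
    meanings_in_context = np.zeros((6, 300))  # matrix: each token's meaning in context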

  10. All words look unique to the
    computer
    “dog” and “puppy” are just strings of letters
    easy to learn: P(id | "dog")
    need to predict: P(id | "puppy")
    Problem #1

  11. Learn dense embeddings
    “You shall know a word by the company it keeps.”
    “If it barks like a dog...”
    word2vec, PMI, LSI, etc.
    Solution #1
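
    A toy illustration of the "company it keeps" idea, counting co-occurrences and computing PMI by hand. Real dense embeddings come from training word2vec or from factorising statistics like these over large corpora, so treat this purely as a sketch on a made-up three-sentence corpus:

    import math
    from collections import Counter
    from itertools import combinations

    sentences = [["the", "dog", "barks"], ["the", "puppy", "barks"], ["the", "cat", "meows"]]

    word_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        word_counts.update(sent)
        pair_counts.update(frozenset(pair) for pair in combinations(sent, 2))

    total_words = sum(word_counts.values())

    def pmi(a, b):
        # How much more often do a and b co-occur than chance would predict?
        p_ab = pair_counts[frozenset((a, b))] / len(sentences)
        p_a, p_b = word_counts[a] / total_words, word_counts[b] / total_words
        return math.log(p_ab / (p_a * p_b))

    print(pmi("dog", "barks"), pmi("puppy", "barks"))  # both words keep the same company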

  12. We’re discarding context
    Problem #2
    “I don't even like seafood, but the
    scallops were something else.”
    “You should go somewhere else. Like,
    literally, anywhere else.”

  13. Learn to encode context
    take a list of word vectors
    encode into sentence matrix
    Solution #2
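
    A hedged sketch of the encode step: a toy bidirectional recurrence over the word vectors (a real model would use a BiLSTM, CNN, or transformer layer), producing one context-aware row per token:

    import numpy as np

    def encode(word_vectors, W_fwd, W_bwd):
        # Each output row mixes a token's own vector with a running summary of
        # its left context (forward pass) and right context (backward pass).
        n, d = word_vectors.shape
        fwd, bwd = np.zeros((n, d)), np.zeros((n, d))
        state = np.zeros(d)
        for i in range(n):                        # left-to-right
            state = np.tanh(word_vectors[i] + state @ W_fwd)
            fwd[i] = state
        state = np.zeros(d)
        for i in reversed(range(n)):              # right-to-left
            state = np.tanh(word_vectors[i] + state @ W_bwd)
            bwd[i] = state
        return np.concatenate([fwd, bwd], axis=1)  # sentence matrix: (n, 2 * d)

    rng = np.random.default_rng(0)
    word_vectors = rng.normal(size=(5, 8))         # 5 tokens, 8-dim embeddings
    matrix = encode(word_vectors, 0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8)))
    print(matrix.shape)                            # (5, 16)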

  14. Too much information
    Okay, you’ve got a sentence matrix. Now what?
    rows show meaning of individual tokens
    no representation of entire sentence
    Problem #3

  15. Learn what to pay attention to
    summarize sentence with respect to query
    get global problem-specific representation
    Solution #3
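
    A minimal attention sketch: score each row of the sentence matrix against a query vector, softmax the scores, and take the weighted sum. The query here is random for illustration; in a trained model it is learned, and it can also be derived from another input (e.g. a question):

    import numpy as np

    def attend(sentence_matrix, query):
        # Score each row against the query, softmax, return the weighted sum:
        # a variable-length (n_tokens, d) matrix becomes a single d-dim vector.
        scores = sentence_matrix @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ sentence_matrix

    rng = np.random.default_rng(0)
    matrix = rng.normal(size=(5, 16))       # output of the encode step
    summary = attend(matrix, rng.normal(size=16))
    print(summary.shape)                    # (16,)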

  16. We need a specific value,
    not a generic representation
    Okay, you’ve got a sentence vector. Now what?
    still working with “representations”
    our application is looking for a value
    Problem #4

  17. Learn to predict target values
    turn the generic architecture into a specific
    solution
    provide the value to your application
    Solution #4
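
    The predict step is usually just an output layer over the sentence vector. A sketch with a softmax over, say, five rating classes (all weights random for illustration):

    import numpy as np

    def predict(sentence_vector, W, b):
        # Map the generic representation to the value the application needs:
        # here, a probability for each of five rating classes.
        logits = sentence_vector @ W + b
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    rng = np.random.default_rng(0)
    probs = predict(rng.normal(size=16), 0.1 * rng.normal(size=(16, 5)), np.zeros(5))
    print(probs.argmax())                   # the predicted rating, 0-4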

  18. Putting it into practice

  19. A hierarchical neural network
    model for classifying text
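
    The slide itself gives no detail, but a hierarchical classifier in this framework attends over words to get one vector per sentence, then attends over those sentence vectors to get a document vector to predict from. A compressed sketch (the second encode step between sentence vectors is omitted, and all weights and inputs are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)

    def attend(matrix, query):
        weights = np.exp(matrix @ query)
        weights /= weights.sum()
        return weights @ matrix

    word_query, sentence_query = rng.normal(size=16), rng.normal(size=16)
    W_out = 0.1 * rng.normal(size=(16, 5))

    # One encoded (n_tokens, 16) matrix per sentence in the document.
    encoded_sentences = [rng.normal(size=(n, 16)) for n in (7, 4, 9)]

    sentence_vectors = np.stack([attend(m, word_query) for m in encoded_sentences])  # (3, 16)
    document_vector = attend(sentence_vectors, sentence_query)                       # (16,)
    print((document_vector @ W_out).argmax())                                        # predicted class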

  20. Predicting relationships between texts
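
    For relationships between two texts (e.g. textual entailment), each text is embedded, encoded, and attended separately, and the predict step works on a combination of the two sentence vectors. The concat / difference / product combination below is one common choice, not necessarily the network described in the talk:

    import numpy as np

    rng = np.random.default_rng(0)

    def sentence_vector(encoded_matrix, query):
        weights = np.exp(encoded_matrix @ query)
        weights /= weights.sum()
        return weights @ encoded_matrix

    query = rng.normal(size=16)
    premise = sentence_vector(rng.normal(size=(9, 16)), query)      # stand-in for an encoded premise
    hypothesis = sentence_vector(rng.normal(size=(6, 16)), query)   # stand-in for an encoded hypothesis

    features = np.concatenate([premise, hypothesis,
                               np.abs(premise - hypothesis),
                               premise * hypothesis])               # (64,)
    W_out = 0.1 * rng.normal(size=(64, 3))
    print((features @ W_out).argmax())      # 0/1/2 = entailment / contradiction / neutral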

  21. What if we don’t have
    10,000 reviews?
    initialize the model with as much knowledge as
    possible: word embeddings, context embeddings,
    transfer learning
    save your data for attend and predict
    use general knowledge of the language for
    embed and encode
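
    One concrete way to bring in that general knowledge is to start from pretrained word vectors rather than random ones; for example, with spaCy (requires downloading the en_core_web_md model first):

    import spacy

    # Requires: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")
    doc = nlp("the scallops were something else")
    print(doc[1].vector.shape)   # pretrained embedding for "scallops": (300,)
    print(doc.vector.shape)      # averaged document vector: (300,)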

  22. Conclusion
    neural networks let us learn what to learn
    knowledge must come from somewhere, ideally
    unlabelled text (e.g. word embeddings)
    you still need labels to predict what you’re really
    interested in
    the general shapes are now well-understood –
    but there’s lots to mix and match

  23. Thanks!
    Explosion AI

    explosion.ai
    Follow us on Twitter

    @honnibal

    @explosion_ai
