
Sequence Modelling with Deep Learning

nslatysheva
November 20, 2019


A great deal of data is sequential – think speech, text, DNA, stock prices, financial transactions and customer action histories. Modern methods for modelling sequence data are often deep learning-based, built from either recurrent neural networks (RNNs) or attention-based Transformers. A tremendous amount of research progress has recently been made in sequence modelling, particularly in its application to NLP problems. However, the inner workings of these sequence models can be difficult to dissect and to understand intuitively.

This presentation/tutorial will start from the basics and gradually build upon concepts in order to impart an understanding of the inner mechanics of sequence models – why do we need specific architectures for sequences at all, when you could use standard feed-forward networks? How do RNNs actually handle sequential information, and why do LSTM units help longer-term remembering of information? How can Transformers do such a good job at modelling sequences without any recurrence or convolutions?

In the practical portion of this tutorial, attendees will learn how to build their own LSTM-based language model in Keras. A few other use cases of deep learning-based sequence modelling will be discussed – including sentiment analysis (prediction of the emotional valence of a piece of text) and machine translation (automatic translation between different languages).

The goals of this presentation are to provide an overview of popular sequence-based problems, impart an intuition for how the most commonly-used sequence models work under the hood, and show that quite similar architectures are used to solve sequence-based problems across many domains.


Transcript

  1. Overview I. Introduction to sequence modelling II. Quick neural network review • Feed-forward networks III. Recurrent neural networks • From feed-forward networks to recurrence • RNNs with gating mechanisms IV. Practical: Building a language model for Game of Thrones V. Components of state-of-the-art RNN models • Encoder-decoder models • Bidirectionality • Attention VI. Transformers and self-attention
  2. Speaker Intro • Welocalize • We provide language services • Fairly large: by revenue, 8th largest globally, 4th largest in the US. 1500+ employees. • Lots of localisation (translation) • International marketing, site optimisation • NLP engineering team • 14 people, remote across the US, Ireland, UK, Germany, China • Various NLP things: machine translation, text-to-speech, NER, sentiment, topics, classification, etc.
  3. Less conventional sequence data • Activity on a website: • [click_button, move_cursor, wait, wait, click_subscribe, close_tab] • Customer history: • [inactive -> mildly_active -> payment_made -> complaint_filed -> inactive -> account_closed] • Code (a constrained language) is also sequential data – models can learn its structure
  4. Why do we need fancy methods to model sequences? • Say we are training a translation model, English->French • “The cat is black” to “Le chat est noir” • Could in theory use a feed-forward network to translate word-by-word
  5. Why do we need fancy methods? • A feed-forward network treats time steps as completely independent • Even in this simple 1-to-1 correspondence example, things break down • How you translate “black” depends on noun gender (“noir” vs. “noire”) • How you translate “The” also depends on gender (“Le” vs. “La”) • More generally, getting the translation right requires context
  6. Why do we need fancy methods? • We need a way for the network to remember information from previous time steps
  7. Recurrent neural networks • Extremely popular way of modelling sequential data • Process data one time step at a time, while updating a running internal hidden state
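To make the recurrence concrete, here is a minimal numpy sketch of a single vanilla RNN step; the weight names (W_xh, W_hh, b_h), sizes and random toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Minimal vanilla RNN step: the new hidden state mixes the current
# input with the previous hidden state (illustrative shapes/names).
input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the recurrence)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy sequence one step at a time, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # 5 time steps
    h = rnn_step(x_t, h)
```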
  8. Standard FF network to RNN • At each time step, the RNN passes on its activations from the previous time step • In theory, all the way back to the first time step
  9. Standard FF network to RNN • So you can say this is a form of memory • The cell's hidden state is transferred forward • This is the basis for RNNs remembering context
  10. Memory problems • Basic RNNs are not great at long-term dependencies, but there are plenty of ways to improve this • Information gating mechanisms • Condensing the input using encoders
  11. Gating mechanisms • Gates regulate the flow of information • Very helpful – basic RNN cells are not really used anymore; gating is responsible for the recent popularity of RNNs • Add explicit mechanisms to remember information and forget information • Why use gates? • Helps the network learn long-term dependencies • Not all time points are equally relevant – not everything has to be remembered • Speeds up training/convergence
  12. Gated recurrent units (GRUs) • GRUs were developed later than LSTMs but are simpler • The motivation is to get the main benefits of LSTMs with less computation • Reset gate: a mechanism to decide when to remember vs. forget/reset previous information (the hidden state) • Update gate: a mechanism to decide when to update the hidden state
  13. GRU mechanics • The reset gate controls how much past information we use • Rt = 0 means we are resetting our RNN, not using any previous information • Rt = 1 means we use all of the previous information (back to our normal vanilla RNN)
  14. GRU mechanics • The update gate controls whether we bother updating our hidden state using new information • Zt = 1 means you’re not updating, you’re just keeping the previous hidden state • Zt = 0 means you’re updating as much as possible
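A minimal numpy sketch of one GRU step, following the convention used on these slides (Rt = 0 discards the previous hidden state when forming the candidate, Zt = 1 keeps the old hidden state); the parameter names, shapes and random toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step, slide convention: Zt = 1 keeps the old hidden state."""
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])             # reset gate (Rt)
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])             # update gate (Zt)
    h_cand = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev) + p["b_h"])  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_cand  # interpolate between old state and candidate

# Illustrative random parameters (in practice these are learned).
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "W_xr": (d_h, d_in), "W_hr": (d_h, d_h), "b_r": (d_h,),
    "W_xz": (d_h, d_in), "W_hz": (d_h, d_h), "b_z": (d_h,),
    "W_xh": (d_h, d_in), "W_hh": (d_h, d_h), "b_h": (d_h,),
}.items()}

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # run over a toy 5-step sequence
    h = gru_step(x_t, h, p)
```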
  15. LSTM mechanics • LSTMs add a memory unit to further control the flow of information through the cell • Also, whereas GRUs have 2 gates, an LSTM cell has 3 gates: • An input gate – should I ignore or consider the input? • A forget gate – should I keep or throw away the information in memory? • An output gate – how should I use the input, hidden state and memory to output my next hidden state?
  16. GRUs vs. LSTMs • GRUs are simpler + train faster • LSTMs are more popular – they can give slightly better performance, but GRU performance is often on par • LSTMs should in theory outperform GRUs on tasks requiring very long-range modelling
  17. Notebook • ~30 mins • Jupyter notebook on building an RNN-based language model • Python 3 + Keras for neural networks • tinyurl.com/wbay5o3
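Not the tutorial notebook itself, but a rough sketch of the kind of Keras model it builds – an LSTM language model that predicts the next token; the vocabulary size, window length and layer sizes here are placeholder assumptions.

```python
# Minimal Keras LSTM language model sketch (illustrative hyperparameters,
# not the actual notebook at tinyurl.com/wbay5o3).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # assumed vocabulary size
seq_len = 40        # assumed input window length

model = Sequential([
    Embedding(vocab_size, 128, input_length=seq_len),  # token ids -> dense vectors
    LSTM(256),                                         # gated recurrent layer
    Dense(vocab_size, activation="softmax"),           # distribution over the next token
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, ...)  # X: (n_samples, seq_len) token ids, y: next-token ids
```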
  18. Encoder-Decoder architectures • Tend to work a lot better than using a single sequence-to-sequence RNN to produce an output for each input step • You often need to see the whole sequence before knowing what to output
  19. Bidirectionality in RNN encoder-decoders • For the encoder, bidirectional RNNs (BRNNs) are often used • BRNNs read the input sequence forwards and backwards
  20. The problem with RNN encoder-decoders • Serious information bottleneck • Condense the whole input sequence down to one small vector?! • Memorise a long sequence + regurgitate it • Not how humans work • Long computation paths
  21. Attention concept • Has been very influential in deep learning • Originally developed for MT (Bahdanau, 2014) • As you’re producing your output sequence, maybe not every part of your input is equally relevant • Image captioning example: Lu et al. 2017, Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
  22. Attention intuition • Attention allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector
  23. Attention intuition • Encoder: uses a BRNN to compute a rich set of features about the source words and their surrounding words • The decoder is asked to choose which hidden states to use and ignore • A weighted sum of the hidden states is used to predict the next word
  24. Attention intuition • The decoder RNN uses attention parameters to decide how much to pay attention to different parts of the input • Allows the model to amplify the signal from relevant parts of the input sequence • This improves modelling
  25. Main benefits • The encoder passes a lot more data to the decoder • Not just its last hidden state • It passes all hidden states from every time step • Computation path problem: relevant information is now closer by
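A minimal numpy sketch of this idea – scoring every encoder hidden state against the current decoder state and taking a weighted sum (the context vector). Dot-product scoring and the toy shapes are simplifying assumptions; Bahdanau-style attention uses a small feed-forward scorer instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: 6 encoder hidden states of size 16, one decoder state of size 16.
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 16))   # one vector per source time step
decoder_state = rng.normal(size=16)         # current decoder hidden state

# Score each encoder state against the decoder state (dot-product scoring here;
# Bahdanau attention would use a small feed-forward network for this step).
scores = encoder_states @ decoder_state     # (6,)
weights = softmax(scores)                   # attention weights, sum to 1
context = weights @ encoder_states          # weighted sum: the context vector
```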
  26. Summary so far • Sequence modelling • Recurrent neural networks • Some key components of SOTA RNN-based models: • Gating mechanisms (GRUs and LSTMs) • Encoder-decoders • Bidirectional encoding • Attention
  27. Transformers are taking over NLP • Translation, language models, question answering, summarisation, etc. • Some of the best word embeddings are based on Transformers • BERT, ELMo, OpenAI GPT-2 models
  28. A single Transformer encoder block • No recurrence, no convolutions • “Attention is all you need” paper • The core concept is the self-attention mechanism • Much more parallelisable than RNN-based models, which means faster training
  29. Self-attention is a sequence-to-sequence operation • At the highest level, self-attention takes t input vectors and outputs t output vectors • Take the input embedding for “the” and update it by incorporating information from its context
  30. • Each output vector is a weighted sum of the input vectors • But all of these weights are different
  31. These are not learned weights in the traditional neural network sense • The weights are calculated by taking dot products • Different functions over the input could be used
  32. Attention weight matrix • The raw dot product can be anything (negative infinity to positive infinity) • We normalise it (e.g. scale by the square root of the vector dimension) • We softmax it so that the weights are positive values summing to 1 • The attention weight matrix summarises the relationships between words • Because dot products capture similarity between vectors
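A minimal numpy sketch of this basic, projection-free self-attention: pairwise dot products, scaling, a row-wise softmax, then weighted sums. The toy shapes are illustrative assumptions.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy input: t = 4 word vectors of dimension d = 8 (no learned projections yet).
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))

raw = X @ X.T                       # all pairwise dot products, shape (4, 4)
scaled = raw / np.sqrt(X.shape[1])  # keep the values in a reasonable range
W = softmax_rows(scaled)            # attention weight matrix: rows are positive, sum to 1
Y = W @ X                           # each output vector is a weighted sum of the inputs
```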
  33. Multi-headed attention • The attention weight matrix captures relationships between words • But there are many different ways words can be related • And which ones you want to capture depends on your task • Different attention heads learn different relations between word pairs
  34. Difference to RNNs • Whereas RNNs build up context token-by-token by updating an internal hidden state, self-attention captures context by updating all word representations simultaneously • Lower computational complexity, scales better with more data • More parallelisable = faster training
  35. Connecting all these concepts • “Useful” input representations are learned • “Useful” weights for transforming input vectors are learned • These quantities should produce “useful” dot products • That lead to “useful” updated input vectors • That lead to “useful” input to the feed-forward network layer • … etc. … that eventually leads to a lower overall loss on the training set
  36. Summary I. Introduction to sequence modelling II. Quick neural network review • How a single neuron functions • Feed-forward networks III. Recurrent neural networks • From feed-forward networks to recurrence • RNNs with gating mechanisms IV. Practical: Building a language model for Game of Thrones V. Components of state-of-the-art RNN models • Encoder-decoder models • Bidirectionality • Attention VI. Transformers and self-attention
  37. Further Reading • More accessible: Andrew Ng's sequence models course on Coursera • https://www.coursera.org/learn/nlp-sequence-models • More technical: the Deep Learning book by Goodfellow et al. • https://www.deeplearningbook.org/contents/rnn.html • Also: Alex Smola's Berkeley lectures • https://www.youtube.com/user/smolix/videos
  38. Just for fun • Talk to Transformer • https://talktotransformer.com/ • Uses OpenAI’s “too dangerous to release” GPT-2 language model
  39. Sequences in natural language • Sequence modelling is very popular in NLP because language is sequential by nature • Text • Sequences of words • Sequences of characters • We process text sequentially, though in principle we could see all the words at once • Speech • A sequence of amplitudes over time • A frequency spectrogram over time • Extracted frequency features over time
  40. Sequences in biology • Genomics: DNA and RNA sequences • Proteomics: protein sequences, structural biology • Trying to represent sequences in some way, or predict some function or association of the sequence
  41. Sequences in finance • Lots of time series data • Numerical sequences (stocks, indices) • Lots of forecasting work – predicting the future (trading strategies) • Deep learning for these sequences is perhaps not as popular as you might think • There are quite well-developed methods based on classical statistics, and interpretability is important
  42. Single neuron computation • What computation is happening inside 1 neuron? • If you understand how 1 neuron computes output given input, it’s a small step to understand how an entire network computes output given input
  43. Single neuron computation • What computation is happening inside 1 neuron? • If you understand how 1 neuron computes output given input, it’s a small step to understand how an entire network computes output given input
  44. Perceptrons • Modelling a binary outcome using binary input features • Should I have a cup of tea? • 0 = no • 1 = yes • Three features with 1 weight each: • Do they have Earl Grey? • earl_grey, w1 = 3 • Have I just had a cup of tea? • already_had, w2 = -1 • Can I get it to go? • to_go, w3 = 2
  45. Perceptrons • Modelling a binary outcome using binary input features • Should I have a cup of tea? • 0 = no • 1 = yes • Three features with 1 weight each: • Do they have Earl Grey? • earl_grey, w1 = 3 • Have I just had a cup of tea? • already_had, w2 = -1 • Can I get it to go? • to_go, w3 = 2
  46. Perceptrons • Here the weights are cherry-picked, but perceptrons learn these weights automatically from training data by shifting parameters to minimise error
  47. Perceptrons • Formalising the perceptron calculation • Instead of a threshold, it is more common to see a bias term • Instead of writing out the sums using sigma notation, it is more common to see dot products • Vectorisation for efficiency • Here, I manually chose these values – but given a dataset of past inputs/outputs, you could learn the optimal parameter values
  48. Perceptrons • Formalising the perceptron calculation • Instead of a threshold, it is more common to see a bias term • Instead of writing out the sums using sigma notation, it is more common to see dot products • Vectorisation for efficiency
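A minimal sketch of the tea perceptron in this vectorised form; the bias value is an assumption (the slides do not state the threshold) and is set to -2 purely for illustration.

```python
import numpy as np

# Tea perceptron from the slides, vectorised: output = step(w . x + b).
w = np.array([3.0, -1.0, 2.0])   # weights for [earl_grey, already_had, to_go]
b = -2.0                         # bias; the value is an assumption, not from the slides

def perceptron(x, w, b):
    """Binary output: 1 = have tea, 0 = don't."""
    return int(np.dot(w, x) + b > 0)

# They have Earl Grey, I haven't just had a cup, and I can take it to go:
print(perceptron(np.array([1, 0, 1]), w, b))  # -> 1, since 3 + 0 + 2 - 2 > 0
```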
  49. Sigmoid neurons • Want to handle continuous values • Where input can be something other than just 0 or 1 • Where output can be something other than just 0 or 1 • We put the weighted sum of inputs through an activation function • Sigmoid or logistic function
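A minimal sketch of the corresponding sigmoid neuron, reusing the tea weights and the assumed bias purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: a smooth 0-to-1 squash of the weighted sum."""
    return 1.0 / (1.0 + np.exp(-z))

# Same weighted-sum-plus-bias as the perceptron, but with a smooth output
# instead of a hard 0/1 jump (weights and bias reused from the sketch above).
w = np.array([3.0, -1.0, 2.0])
b = -2.0
x = np.array([1.0, 1.0, 0.0])     # have Earl Grey, just had a cup, can't take it to go
print(sigmoid(np.dot(w, x) + b))  # 0.5: the neuron is "unsure" rather than jumping to 0 or 1
```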
  50. Sigmoid neurons • The sigmoid function is basically a smoothed-out perceptron! • The output is no longer a sudden jump • It’s the smoothness of the function that we care about
  51. Activation functions • Which activation function to use? • Heuristics based on experiments, not proof-based
  52. More layers! • Increase the number of layers to increase the capacity for abstraction and hierarchical processing of the input
  53. Training on big window sizes • How big a window? On a very long sequence, the unrolled RNN becomes a very deep network • Same problems with vanishing/exploding gradients as normal deep networks • And it takes longer to train • The normal tricks can help – good initialisation of parameters, non-saturating activation functions, gradient clipping, batch norm • Training over a limited number of steps – truncated backpropagation through time
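A minimal sketch of the simplest form of this truncation: slicing one long token sequence into fixed-length windows so that backpropagation never unrolls more than a bounded number of steps. The window length and the stand-in data are assumptions.

```python
import numpy as np

# Slice a long sequence into fixed-length windows, so gradients only ever
# flow back `window` steps during training.
def make_windows(tokens, window=40):
    """Return (inputs, targets) where each target is the token after its window."""
    X, y = [], []
    for i in range(0, len(tokens) - window):
        X.append(tokens[i:i + window])
        y.append(tokens[i + window])
    return np.array(X), np.array(y)

tokens = np.arange(200)            # stand-in for a long sequence of token ids
X, y = make_windows(tokens, window=40)
print(X.shape, y.shape)            # (160, 40) (160,)
```

Windows built this way are the kind of (X, y) pairs you would feed to a Keras language model like the one sketched for the notebook above.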
  54. LSTM mechanics • The input, forget and output gates are little neural networks within the cell • The memory is updated via the forget gate and a candidate memory • The hidden state is updated by the output gate, which weighs up all the information
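Written out, these gates correspond to the standard LSTM update equations (the usual formulation; here σ is the logistic function and ⊙ is element-wise multiplication):

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) &&\text{(input gate)}\\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) &&\text{(output gate)}\\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) &&\text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(memory update)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(new hidden state)}
\end{aligned}
```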
  55. Query, Key, and Value transformations • Notice that we are using each input vector on 3 separate occasions • E.g. vector x2: 1. To take dot products with every other input vector when calculating y2 2. In the dot products when the other output vectors (y1, y3, y4) are calculated 3. And in the weighted sum to produce the output vector y2
  56. Query, Key, and Value transformations • To model these 3 different roles for each input vector, and give the model extra expressivity and flexibility, we are going to modify the input vectors • Apply simple linear transformations
  57. Input transformation matrices • These weight matrices are learnable parameters • Gives something else to learn by gradient descent
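A minimal single-head numpy sketch tying slides 55–57 together: each input is projected into a query, key and value by separate matrices, attention weights come from scaled query–key dot products, and each output is a weighted sum of the values. The shapes and the random stand-in parameters are illustrative assumptions (in a real model the projections are learned by gradient descent).

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Single-head self-attention with query/key/value projections (random matrices
# standing in for parameters that would be learned by gradient descent).
rng = np.random.default_rng(3)
t, d = 4, 8                               # 4 input vectors of dimension 8
X = rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # query, key and value versions of each input
A = softmax_rows((Q @ K.T) / np.sqrt(d))  # attention weights from scaled dot products
Y = A @ V                                 # each output is a weighted sum of the values
```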