
Building a Neural Machine Translation System from Scratch

Human languages are complex, diverse and riddled with exceptions – translating between different languages is therefore a highly challenging technical problem. Deep learning approaches have proved powerful in modelling the intricacies of language, and have surpassed all statistics-based methods for automated translation. This session begins with an introduction to the problem of machine translation and discusses the two dominant neural architectures for solving it – recurrent neural networks and transformers. A practical overview of the workflow involved in training, optimising and adapting a competitive neural machine translation system is provided. Attendees will gain an understanding of the internal workings and capabilities of state-of-the-art systems for automatic translation, as well as an appreciation of the key challenges and open problems in the field.


nslatysheva

May 07, 2019

Transcript

1. Building a Neural Machine Translation System from Scratch. Deep Learning World 2019, Munich. Natasha Latysheva, Welocalize
2. This talk • 1. Introduction to machine translation • 2. Data for machine translation • 3. Representing words with embeddings • 4. Deep learning architectures • Recurrent neural networks • Transformers • 5. Some fun things about MT • 6. Tech stack for machine translation
3. Intro • Welocalize • Language services • 1500+ employees • 8th largest globally, 4th largest US • NLP engineering team • 13 people • Remote across US, Ireland, UK, Germany, China
4. Intro • Lots of localisation (translation) • Often for tech companies • Also do: life sciences, banking, patent/legal • International marketing, site optimisation • Various NLP things: text-to-speech, sentiment, topics, classification, NER, etc.
5. What is machine translation? • Automated translation between languages • MT challenges: • Language is very complex, flexible with lots of exceptions • Language pairs might be very different • Lots of “non-standard” usage • Not always a lot of data • But if people can do it, a model should be able to learn to do it
6. What is machine translation? • Automated translation between languages • MT challenges: • Language is very complex, flexible with lots of exceptions • Language pairs might be very different • Lots of “non-standard” usage • Not always a lot of data • But if people can do it, a model should be able to learn to do it • Why bother? • Huge industry and market demand because communication is important • Humans are expensive and slow • Research side: understanding language is probably key to intelligence
7. Rule-based MT • Very manual, laborious. Hand-crafted rules by expert linguists. • Early focus on Russian. E.g. translate English “much” or “many” into Russian (Jurafsky and Martin, Speech and Language Processing, chapter 25)
8. Data for machine translation • Parallel texts, bitexts, corpora • You need a lot of data (millions of decent-length sentence pairs) to build decent neural systems • Increasing amount of freely-available parallel data (curated or scraped or both)
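As a minimal illustration of the bitext format described on slide 8 (not from the talk; the file names are hypothetical), parallel data is typically distributed as two line-aligned files, one sentence per line, where line N of the source file is the translation of line N of the target file:

```python
# Minimal sketch (hypothetical file names): a bitext is two line-aligned files.
from itertools import islice

def load_bitext(src_path, tgt_path, limit=None):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in islice(zip(src, tgt), limit):
            yield src_line.strip(), tgt_line.strip()

pairs = list(load_bitext("train.en", "train.fr", limit=3))
# e.g. [("Hello world", "Bonjour le monde"), ...]
```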
9. Neural machine translation • Dominant architecture is an encoder-decoder • Based on recurrent neural networks (RNNs) • Or the Transformer
10. Word embeddings • ML models can’t process text strings directly in any meaningful way • Need to find a way to represent words as numbers • And hopefully the numbers are linguistically meaningful in some way
11. Simplest way to encode words • Your vocabulary is all the possible words • Each word is assigned an integer index
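A rough sketch of this integer-index encoding, assuming a toy corpus and the common (but not talk-specified) convention of reserving indices for padding and unknown words:

```python
# Build a vocabulary that maps each word to an integer index (toy corpus).
corpus = ["the cat sat on the mat", "the dog sat"]

vocab = {"<pad>": 0, "<unk>": 1}          # reserved indices, an illustrative convention
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))  # assign the next free index to unseen words

def encode(sentence):
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

print(encode("the cat sat"))   # [2, 3, 4]
```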
12. Properties of word embeddings • Similar words cluster together • The values are coordinates in a high-dimensional semantic space • Not easily interpretable
13. Where do the embeddings come from? • Which embedding? • FastText > GloVe > word2vec • This is (shallow) transfer learning in NLP (Google ML blog post)
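One way to poke at pretrained embeddings and the clustering property mentioned on slide 12 is via the gensim library and its downloadable GloVe vectors; this is an illustrative choice, not the talk's tooling:

```python
# Inspect pretrained embeddings: similar words sit near each other in the vector space.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")        # 100-dimensional GloVe vectors

print(vectors.most_similar("translation", topn=3))   # nearby words in the embedding space
print(vectors["translation"].shape)                  # (100,) coordinate vector for one word
```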
14. Calculating embeddings • word2vec skip-gram example • Train a shallow net to predict a surrounding word, given a word • Take the hidden layer weight matrix, treat it as coordinates • So the goal is actually just to learn this hidden layer weight matrix; we don’t care about the output layer (Chris McCormick blog post)
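A hedged sketch of learning skip-gram embeddings with gensim (assuming gensim 4.x, where the parameter is vector_size rather than the older size); it stands in for the shallow network described above rather than reproducing it by hand:

```python
# Train skip-gram word vectors on a toy corpus; sg=1 selects the skip-gram objective.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

# model.wv holds the learned "hidden layer" weight matrix: one vector per word.
print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))
```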
16. Recurrent Neural Networks (RNNs) • Why not feed-forward networks for translation? • Words aren’t independent • Third word really depends on first and second • Similar to how conv nets capture interdependence of neighbouring pixels
17. Pictures of RNNs • Main idea behind RNNs is that you’re allowed to reuse information from previous time steps to inform predictions of the current time step
18. Standard FF network to RNN • At each time step, RNN passes on its activations from previous time step for next time step to use • Parameters governing the connection are shared between time steps
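A minimal numpy sketch of the recurrence described on slides 17 and 18: the same weight matrices are reused at every time step, and the hidden state h carries information forward (the dimensions here are arbitrary):

```python
# Vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), with weights shared across steps.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 5

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (shared)
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))   # a toy input sequence (e.g. word embeddings)
h = np.zeros(hidden_dim)                     # initial hidden state

for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)   # activations passed on to the next step

print(h.shape)   # (16,) final hidden state summarising the sequence
```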
19. More layers! • Increase number of layers to increase capacity for abstraction, hierarchical processing of input
20. Almost there… • Bidirectionality and attention • For the encoder, bidirectional RNNs (BRNNs) often used • BRNNs read the input text forwards and backwards
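A hedged tf.keras sketch of a bidirectional recurrent encoder of the kind slide 20 mentions; the vocabulary size and layer widths are arbitrary, and this is not the talk's own model code:

```python
# Bidirectional LSTM encoder: one hidden state per source position, read in both directions.
import tensorflow as tf

vocab_size, emb_dim, hidden_dim = 10_000, 256, 512

encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),                # token ids -> embeddings
    tf.keras.layers.Bidirectional(                                 # forwards and backwards
        tf.keras.layers.LSTM(hidden_dim, return_sequences=True)),  # keep a state per time step
])

token_ids = tf.constant([[12, 7, 451, 3, 0, 0]])   # one padded toy sentence
states = encoder(token_ids)
print(states.shape)   # (1, 6, 1024): forward and backward states concatenated
```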
21. Trouble with memorising long passages • For long sentences, we’re asking the encoder-decoder to read the entire English sentence, memorise it, then write it back in French • Condense everything down to a small vector?! • The issue is that the decoder needs different information at different timesteps but all it gets is this vector • Not really how human translators work
22. The problem with RNN encoder-decoders • Serious information bottleneck • Condense all source input down to a small vector?! • Long computation paths
23. Some ways of handling long sequences • Long-range dependencies • LSTMs, GRUs • Meant for long-range memory… but it’s still very difficult (Colah’s blog, “Understanding LSTMs”)
24. Some ways of handling long sequences • Reverse source sentence (feed it in backwards) • Kind of a hack… works for English→French, what about Japanese? • Feed sentence in twice
25. Attention Idea • Has been very influential in deep learning • Originally developed for MT (Bahdanau, 2014) • As you’re producing your output sequence, maybe not every part of your input is equally relevant • Image captioning example (Lu et al. 2017, Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning)
26. Attention intuition • Attention allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector
27. Attention intuition • Encoder: use a BRNN to compute a rich set of features about source words and their surrounding words • Decoder: use another RNN to generate output as before • Decoder is asked to choose which hidden states to use and ignore • Weighted sum of hidden states used to predict the next word
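A minimal numpy sketch of the weighted sum described on slide 27; dot-product scoring is used here for simplicity, whereas Bahdanau-style attention scores each position with a small learned network:

```python
# Attention as a weighted sum of encoder hidden states, weighted by relevance to the decoder.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 512))   # one hidden state per source time step
decoder_state = rng.normal(size=512)         # current decoder hidden state

scores = encoder_states @ decoder_state      # how relevant is each source position?
weights = softmax(scores)                    # attention weights, sum to 1
context = weights @ encoder_states           # weighted sum of encoder hidden states

print(weights.round(2), context.shape)       # e.g. [0.1 0.4 ...] (512,)
```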
28. Attention intuition • Decoder RNN uses attention parameters to decide how much to pay attention to input features • Allows the model to amplify the signal from relevant parts of the source sequence • This improves translation
29. Main differences and benefits • Encoder passes a lot more data to the decoder • Not just the last hidden state • Passes all hidden states at every time step • Computation path from the relevant information is a lot shorter
30. Transformers • Paradigm shift in sequence processing • People were convinced you needed recurrence or convolutions to learn interdependence • RNNs were the best way to capture time-dependent patterns, like in language • Transformers use only attention to do the same job
31. Transformer intuition • Also have an encoder-decoder structure • In RNNs, hidden state incorporates context • In transformers, self-attention incorporates context
32. Transformer intuition • Self-attention • Instead of processing input tokens one by one, attention takes in the set of input tokens • Learns the dependencies between all of them using three learned weight matrices (key, value, query) • Makes better use of GPU resources
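A hedged numpy sketch of single-head scaled dot-product self-attention using the query/key/value matrices slide 32 names; real Transformers add multiple heads, masking and output projections on top of this:

```python
# Single-head self-attention: every token attends to every token in one matrix multiply.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 64, 64

X = rng.normal(size=(seq_len, d_model))        # embeddings for all tokens at once
W_q = rng.normal(scale=0.1, size=(d_model, d_k))
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
attn = softmax(Q @ K.T / np.sqrt(d_k))         # pairwise attention weights
out = attn @ V                                 # context-aware token representations

print(attn.shape, out.shape)                   # (5, 5) (5, 64)
```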
33. Transformers now often SOTA • You can often get a couple of points of improvement by switching to Transformers for your machine translation system
34. Tokenisation and sub-word embeddings • Maybe our embeddings should reflect that different forms of the same word are related • “walking” and “walked” • Translation works better if you can incorporate some morphological knowledge • Can be learned, or linguistic knowledge can be baked in (Cotterell and Schütze, 2018, Joint Semantic Synthesis and Morphological Analysis of the Derived Word)
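A hedged example of learned sub-word tokenisation using the sentencepiece library, one common choice that the talk does not prescribe; the corpus and model file names are hypothetical, and the exact sub-word splits depend on the training data:

```python
# Learn a BPE sub-word vocabulary, then split related word forms into shared pieces.
import sentencepiece as spm

# Train a small BPE model on a plain-text corpus, one sentence per line (hypothetical file).
spm.SentencePieceTrainer.train(
    input="corpus.en", model_prefix="bpe_en", vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
print(sp.encode("walking", out_type=str))   # e.g. ['▁walk', 'ing'] (shares a piece with "walked")
print(sp.encode("walked", out_type=str))
```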
35. Frameworks • How low-level do you go? • Implementing backprop and gradient checking yourself in numpy • … • Clicking ‘Train’ and ‘Deploy’ in a GUI • You probably want to be somewhere in between • Around a dozen open-source NMT implementations exist • Nematus, OpenNMT, tensorflow-seq2seq, Marian, fairseq, Tensor2Tensor
36. Recommendations • Python scientific stack • OpenNMT-tf (TensorFlow version) • TensorBoard monitoring is great • Good checkpointing, automatic evaluation during training • Get some GPUs if you can • 3x GeForce GTX Titan X take 2-3 days to train a decent Transformer model • AWS or GCP good but can be expensive • Docker containers
37. Statistical machine translation • Gather lots of counts, frequencies • How often n-grams in the source language map to n-grams in the target language • Bayes’ Rule to calculate probabilities • p(f) is the language model over the target language • The p(e) denominator can be ignored
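Written out, the noisy-channel decomposition the slide refers to, with e the source sentence and f the candidate translation (matching the slide's notation), is:

```latex
% Noisy-channel formulation of statistical MT: e = source sentence, f = target translation.
\hat{f} = \arg\max_{f} \; p(f \mid e)
        = \arg\max_{f} \; \frac{p(e \mid f)\, p(f)}{p(e)}
        = \arg\max_{f} \; \underbrace{p(e \mid f)}_{\text{translation model}} \,
                          \underbrace{p(f)}_{\text{language model}}
% The denominator p(e) does not depend on f, so it can be ignored in the arg max.
```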
38. Main differences and benefits • Attention weights can be used to draw alignments • In languages that are aligned, like Romance languages, attention will likely choose to align things sequentially (Bahdanau, 2014, Neural Machine Translation by Jointly Learning to Align and Translate)
39. Actually generating translations • What you want in machine translation is to generate the best (most probable) translation • Greedy: pick the most likely first word, then pick the most likely second word, etc.
40. Actually generating translations • Given that you want to maximise the probability of the whole output sequence… • Not always optimal to pick one word at a time • Can’t exhaustively search every combination of words either • Approximate search algorithm: beam search
41. Beam search: actually generating translations • Beam width of 3 • At each time step, keep the 3 best candidate translations so far
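A hedged Python sketch of beam search as described on slides 40 and 41; the next-word distribution here is a toy stand-in, whereas a real decoder would score continuations with the trained translation model:

```python
# Beam search: keep only the beam_width highest-scoring partial translations at each step.
import math

def beam_search(next_word_probs, start, end, beam_width=3, max_len=10):
    beams = [([start], 0.0)]                  # (tokens so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, prob in next_word_probs(tokens).items():
                candidates.append((tokens + [word], score + math.log(prob)))
        # keep only the best beam_width candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        finished.extend(b for b in beams if b[0][-1] == end)
        beams = [b for b in beams if b[0][-1] != end]
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])

# Toy distribution: always proposes the same three continuations.
toy = lambda tokens: {"le": 0.5, "chat": 0.3, "</s>": 0.2}
print(beam_search(toy, "<s>", "</s>", beam_width=3))
```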