

nslatysheva
February 27, 2019

Python for Language: An Introduction to NLP and the Python NLP Universe

Natural language processing (NLP) is an exciting and rapidly growing field at the intersection of language, text, and machine learning. This talk provides an introduction to some of the main areas of NLP and the Python libraries available for working on them.

The session will begin with a survey of NLP topics, including text classification, topic modelling, and chat bots. Then, there will be two deeper dives to explain the intuitions and techniques behind:
1) Word embeddings, which map words to numerically-represented meanings, and which have become a foundation for many NLP applications; and
2) Machine translation, focusing on providing a taster of modern deep learning methods for automatically translating text between languages.


Transcript

  1. § Welocalize § Language services § 1500+ employees § 8th largest globally, 4th largest in the US § NLP engineering team § 13 people § Remote across the US, Ireland, UK, Germany, China
  2. § 1. Some NLP applications § 1. Text classification § 2. Sentiment analysis § 3. Text summarization § 4. Topic modelling § 5. Chat bots § 6. Generating captions from images § 7. Generating images from captions § 8. Event2Mind § 2. Deeper dive into two NLP topics § Word embeddings – how to represent the meaning of words § Machine translation – how to translate text between languages, and the dominant architectures used in modern translation systems
  3. § Spam/non-spam § Language detection § Harassment and abuse detection § Video games, chat rooms § Classifying support tickets § Category, who you should assign it to § Customer feedback § How urgent is it, how angry is the customer [Image: example messages labelled “Fine” / “Abuse”]
  4. § Often there’s a relatively strong signal in the word usage alone § Stick a simple classifier on that from sklearn, e.g. Naive Bayes (sketch below) § More informative features: a tf-idf matrix § At the state of the art, unlike in computer vision, classification in NLP remains fairly difficult § Ambiguity; it requires quite advanced knowledge of the world § “Universal Language Model Fine-tuning for Text Classification”
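
As a concrete illustration of the word-usage approach, here is a minimal sketch using sklearn’s tf-idf vectoriser and Naive Bayes; the tiny spam/ham corpus is invented for the example.

    # Minimal word-usage classifier: tf-idf features + Naive Bayes.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = [
        "Win a free prize now, click here",
        "Meeting moved to 3pm, agenda attached",
        "Cheap meds, limited offer, buy now",
        "Can you review the quarterly report?",
    ]
    labels = ["spam", "ham", "spam", "ham"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)
    print(clf.predict(["free offer, click now"]))  # -> ['spam']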
  5. § Film reviews § Social media monitoring: analyse Twitter, FB, Reddit to find your product or company name, extract the surrounding text, get its sentiment § Market research/competitor analysis: analyse reviews of your brand and its competition § Customer feedback: detect particularly angry or frustrated customers; detect what they dislike § Often single words carry the signal § Difficult because of negation, metaphors/jokes, mixed sentiment § Examples: “I like the film” / “I do not like the film” / “The best I can say is that The Room is certainly interesting” / “Garmin smartwatches look slick and stylish, but the app is a complete mess”
  6. § If you’re happy with a word-based approach, train on an existing corpus § NLTK to clean/tokenise § VADER (Valence Aware Dictionary and sEntiment Reasoner): lexicon + rule-based (sketch below) § RNN-based sentiment analysis § Search “sentiment analysis keras” to find tutorials
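
A minimal sketch of VADER through NLTK, run on two of the example sentences above; the output is the standard neg/neu/pos breakdown plus a compound score in [-1, 1].

    # VADER sentiment scoring via NLTK's built-in implementation.
    import nltk
    nltk.download("vader_lexicon")  # one-time lexicon download
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores("I like the film"))
    print(sia.polarity_scores("I do not like the film"))  # negation handled by rules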
  7. § OpenAI. Trained an LSTM recurrent neural network to predict the next character in Amazon reviews § Digging into what individual neurons were picking up on, they realised one neuron seemed to be strongly responding to sentiment
  9. § Mostly not machine learning. Score sentences according to simple heuristics and return the top ones in chronological order (sketch below) § Works particularly well for news articles: dense, simple, and the sentences can stand alone § The autotl;dr summariser uses the SMMRY API (rule-based) § Various APIs available § There are also NN-based ways to summarise text, framed as a seq2seq problem § Examples
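
To make the heuristic idea concrete, here is a toy extractive summariser (not any particular library’s algorithm): score each sentence by summed word frequency, then return the best ones in their original order.

    # Toy extractive summariser: frequency-scored sentences,
    # returned in chronological order. Illustration only.
    from collections import Counter

    def summarise(text, n_sentences=2):
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        freq = Counter(text.lower().split())
        # Score each sentence by the summed frequency of its words
        scored = [(sum(freq[w] for w in s.lower().split()), i, s)
                  for i, s in enumerate(sentences)]
        top = sorted(scored, reverse=True)[:n_sentences]
        # Restore the original sentence order before joining
        return ". ".join(s for _, _, s in sorted(top, key=lambda t: t[1])) + "."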
  10. § You have a lot of text § You want to automatically discover the topics within it § LDA is a very common algorithm for topic modelling; Gensim library § Example corpus: “Cats and dogs are my favourite animals” / “Masala chai is black tea with spices” / “The brown dog eats some crunchy biscuits” § Find 2 topics.
  12. § LDA gives back two things (runnable sketch below): § 1) A topic mixture per document: “Cats and dogs are my favourite animals” → 100% Animals; “Masala chai is black tea with spices” → 100% Food; “The brown dog eats some crunchy biscuits” → 50% Animals, 50% Food § 2) The words defining each topic: Animal topic: cats, dogs, animals, dog; Food topic: masala, chai, tea, spices, crunchy, biscuits
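
A minimal sketch with gensim on the three toy sentences; with such a tiny corpus the discovered topics will be noisy, so real use needs far more text.

    # LDA topic modelling with gensim on the toy corpus above.
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        "cats and dogs are my favourite animals",
        "masala chai is black tea with spices",
        "the brown dog eats some crunchy biscuits",
    ]
    tokenised = [d.split() for d in docs]
    dictionary = corpora.Dictionary(tokenised)
    corpus = [dictionary.doc2bow(toks) for toks in tokenised]

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)       # top words per topic
    print(lda[corpus[2]])            # topic mixture for the third document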
  13. § Two types: § Retrieval-based methods § Generative methods § ELIZA therapy bot § Lots of platforms for building chatbots § Facebook Messenger bots § Python: ChatterBot (sketch below) § There’s even a tutorial for building one in Scratch § Interacting with NPCs in games might become more interesting
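
A minimal sketch with ChatterBot, a retrieval-style Python library: train on a small list of exchanges, then ask for a response. The training data here is invented.

    # A tiny retrieval-based bot with ChatterBot.
    from chatterbot import ChatBot
    from chatterbot.trainers import ListTrainer

    bot = ChatBot("DemoBot")
    trainer = ListTrainer(bot)
    trainer.train([
        "Hi there!",
        "Hello! How can I help?",
        "What is NLP?",
        "Natural language processing: getting computers to work with text.",
    ])
    print(bot.get_response("Hi there!"))  # picks the closest known exchange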
  14. § Understanding the content of an image, then translating that understanding into natural language § 2 main components of neural network models for captioning (schematic sketch below): § 1. Feature extraction: use a CNN to identify salient features in the image § 2. Language model: an RNN takes the extracted features (plus any words that have already been generated) and generates the caption word by word § Can be trained jointly, end-to-end, as an encoder-decoder architecture § Image: “Long-term recurrent convolutional networks for visual recognition and description”, 2015
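
A schematic (untrained) Keras sketch of that two-part design, following the common “merge” variant of the architecture rather than the exact paper; all sizes (vocabulary, caption length, feature dimension) are invented for illustration.

    # Schematic captioning model: CNN features + partial caption in,
    # next caption word out. Untrained; sizes are assumptions.
    from tensorflow.keras import layers, Model

    vocab_size, max_len, feat_dim = 5000, 20, 2048

    img_in = layers.Input(shape=(feat_dim,))           # features from a CNN encoder
    img_vec = layers.Dense(256, activation="relu")(img_in)

    cap_in = layers.Input(shape=(max_len,))            # words generated so far
    x = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
    x = layers.LSTM(256)(x)

    merged = layers.add([img_vec, x])                  # combine image and text
    next_word = layers.Dense(vocab_size, activation="softmax")(merged)

    model = Model([img_in, cap_in], next_word)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()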
  15. § Attention turns out to be important § It allows the decoder to learn where in the image to put its attention as it generates each word of the caption § Image: “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, 2015
  16. § Generate images from text demo (now not working) § runwayapp.ai § Article with examples
  20. § It’s almost impossible to talk about NLP these days without talking about word embeddings § They are the basis for basically every modern approach § How do you represent words numerically so as to capture meaning? (illustration below)
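
To make the question concrete, here is a tiny numpy illustration (the dense vectors are made up): one-hot vectors make every pair of words equally unrelated, while dense embedding vectors can encode similarity.

    # One-hot vs dense word vectors, compared with cosine similarity.
    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # One-hot: "cat", "dog", "tea" are mutually orthogonal
    cat_oh, dog_oh, tea_oh = np.eye(3)
    print(cosine(cat_oh, dog_oh))   # 0.0 for every pair: no notion of meaning

    # Dense embeddings (values invented): similar words, similar vectors
    cat = np.array([0.9, 0.1, 0.4])
    dog = np.array([0.8, 0.2, 0.5])
    tea = np.array([0.0, 0.9, 0.1])
    print(cosine(cat, dog))         # high: cats and dogs are related
    print(cosine(cat, tea))         # low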
  21. § In practice, you don’t usually train them from scratch :P § The gensim library is very popular for training § Use pre-trained word embeddings (sketch below) § A pre-trained model covers many words, each represented by a long vector of coordinates § Sort of like transfer learning § FastText > GloVe > word2vec
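
A minimal sketch of loading pre-trained vectors through gensim’s downloader module; “glove-wiki-gigaword-100” is one of the bundled model names (100-dimensional GloVe vectors, roughly a 130 MB download).

    # Load pre-trained GloVe vectors and poke around.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")   # downloads on first use
    print(vectors["king"][:5])                      # first 5 of 100 coordinates
    print(vectors.most_similar("king", topn=3))
    # The classic analogy: king - man + woman ≈ queen
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))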
  22. § Train a shallow net to predict a surrounding word, given a word § Take the hidden-layer weight matrix and treat its rows as the word coordinates § So the goal is actually just to learn this hidden-layer weight matrix… we don’t care about the output layer (sketch below)
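
Training such a (skip-gram) model from scratch with gensim looks like this; note that gensim 4.x calls the dimensionality argument vector_size (it was size in 3.x), and the two-sentence corpus is of course far too small to learn anything useful.

    # Train a tiny skip-gram word2vec model with gensim.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2,
                     sg=1,            # sg=1 selects skip-gram
                     min_count=1)
    print(model.wv["cat"].shape)      # (50,): the learned coordinates
    print(model.wv.most_similar("cat", topn=2))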
  24. § The embedding depends on how the word is used! § ELMo (2018; Embeddings from Language Models) § BERT tutorial
  25. § The embedding depends on how the word is used! § ELMo (2018) § BERT (late 2018) § Very similar to how transfer learning has been successful in computer vision § Pre-training contextual representations (sketch below)
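
A minimal sketch of pulling contextual embeddings from BERT with Hugging Face’s transformers library (which post-dates this 2019 talk, but is now the standard route): the vector for “bank” comes out different in the two sentences.

    # Contextual embeddings: one vector per token, context-dependent.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    for sentence in ["I sat on the river bank",
                     "I deposited cash at the bank"]:
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_vec = out.last_hidden_state[0, tokens.index("bank")]
        print(sentence, "->", bank_vec[:3])  # same word, different vectors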
  26. § NLP is moving very fast these days § Very fluent generated text about unicorns § CHECK IT OUT: https://blog.openai.com/better-language-models/
  27. § Very fluent generated text about unicorns § Controversy about releasing this model § An incremental advance? But remarkable for its scale. Compute cost ~$45k
  28. § Historically: § 1. Rule-based § 2. SMT – statistical machine translation § Doing lots of counts, lots of Bayes’ Rule § 3. Now it’s mostly neural network-based
  29. § Very manual, very laborious § Hand-crafted rules by expert linguists § Early focus on Russian § Even translating a single word, e.g. English “much” or “many” into Russian, needs hand-written rules: Jurafsky and Martin, chapter 25
  32. § Gather lots of counts and frequencies § Use Bayes’ Rule to calculate the direct probabilities § French -> English: what’s the English sentence that’s most probable given the French input? § French input: “Le chat est noir” § Possible English translations: “The cat is brown” / “The cat is black” / “The the the the” § Flip the probabilities around: pick the English sentence e maximising P(f|e) · P(e) (toy example below)
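
A toy numeric illustration of that noisy-channel scoring (all probabilities invented): the translation model P(f|e) happily scores degenerate word-soup candidates, and the language model P(e) is what rules them out.

    # Noisy-channel scoring: best e = argmax P(f|e) * P(e).
    candidates = {
        # English candidate: (P(f|e) translation model, P(e) language model)
        "The cat is brown": (0.2, 0.3),
        "The cat is black": (0.7, 0.3),
        "The the the the":  (0.9, 0.0001),  # matches counts, terrible English
    }
    best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
    print(best)  # -> "The cat is black"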
  33. § A seq2seq modelling problem § Characters and words form a sequence § RNNs and encoder-decoder architectures are very common (schematic sketch below)
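
A schematic (untrained) Keras encoder-decoder in the shape of the classic character-level seq2seq tutorial; the vocabulary sizes are invented.

    # Schematic encoder-decoder for translation. Untrained sketch.
    from tensorflow.keras import layers, Model

    src_vocab, tgt_vocab, latent = 70, 90, 256   # assumed sizes

    # Encoder: read the source sequence, keep only the final LSTM state
    enc_in = layers.Input(shape=(None, src_vocab))
    _, h, c = layers.LSTM(latent, return_state=True)(enc_in)

    # Decoder: generate the target sequence starting from that state
    dec_in = layers.Input(shape=(None, tgt_vocab))
    x = layers.LSTM(latent, return_sequences=True)(dec_in, initial_state=[h, c])
    dec_out = layers.Dense(tgt_vocab, activation="softmax")(x)

    model = Model([enc_in, dec_in], dec_out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.summary()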
  34. § Catastrophic forgetting, the vanishing gradient problem § GRUs – gated recurrent units § LSTMs – long short-term memory § Able to remember relevant information and forget less relevant information § Transformers – another way to summarise the contextual semantics of a sentence without using recurrence: a self-attention mechanism that directly models relationships between all words
  35. § Building your own models § There’s a long list of NMT frameworks § Personal workflow: OpenNMT-tf, a 3-card GPU cluster, Docker containers, Python plus bits of bash scripting for data processing