
Beginner's Guide to Natural Language Processing


Antje Barth

May 06, 2020

Transcript

  1. Beginner's Guide to Natural Language Processing (NLP)
     Antje Barth, Developer Advocate AI/ML, Amazon Web Services
  2. King - Man + Woman = ??
  3. Agenda
     • Introduction to NLP
     • The BERT family of models
     • NLP with Amazon Comprehend
     • Demo
  4. Problem statement
     • Natural Language Processing (NLP) is a major field in AI
     • NLP applications require a language model in order to predict the next word
     • Vocabulary size can reach hundreds of thousands of words … across millions of documents
     • Can we build a compact mathematical representation of language that helps with a variety of domain-specific NLP tasks?
  5. "You shall know a word by the company it keeps" (Firth, 1957)
     • Word vectors (also called word embeddings) are built from co-occurrence counts
     • High-dimensional: at least 50 dimensions, up to 300
     • Words with similar meanings should have similar vectors: "car" ≈ "automobile" ≈ "sedan"
     • The distances between vectors for analogous concept pairs should be similar:
       distance("Paris", "France") ≈ distance("Berlin", "Germany")
       distance("hot", "hotter") ≈ distance("cold", "colder")
     • King - Man + Woman = Queen
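To make the vector arithmetic concrete, here is a minimal sketch using numpy and cosine similarity. The four-dimensional toy vectors are invented for illustration only; a real model (Word2Vec, GloVe) would supply 50-300 dimensional values.

```python
import numpy as np

# Toy 4-dimensional embeddings, invented for illustration
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.8, 0.9, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # -> queen
```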
  6. High-level view
     1. Start from a large text corpus (hundreds of millions of words, even billions)
     2. Preprocess the corpus into tokens
        Tokenize: "hello, world!" → "<BOS>hello<SP>world<SP>!<EOS>"
        Multi-word entities: "Rio de Janeiro" → "rio_de_janeiro"
     3. Build the vocabulary from the tokens
     4. Learn vector representations for the vocabulary … or simply use pre-trained models with existing vector representations (more on this later)
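As a rough illustration of step 2, here is a minimal tokenizer sketch in Python. The <BOS>/<SP>/<EOS> markers and the multi-word entity follow the slide's example; the rest (regex, entity table) is invented for illustration.

```python
import re

# Hypothetical multi-word entities to merge before tokenizing
MULTI_WORD = {"rio de janeiro": "rio_de_janeiro"}

def tokenize(text):
    text = text.lower()
    for phrase, merged in MULTI_WORD.items():
        text = text.replace(phrase, merged)
    # Keep words and sentence-final punctuation; drop commas, as on the slide
    tokens = re.findall(r"\w+|[!?.]", text)
    return "<BOS>" + "<SP>".join(tokens) + "<EOS>"

print(tokenize("hello, world!"))
# -> <BOS>hello<SP>world<SP>!<EOS>
```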
  7. Popular NLP use cases
     • Representation learning
     • Machine translation
     • Text classification
     • Language modeling
     • Sentiment analysis
     • Named entity recognition
     • Question answering
  8. Evolution of NLP algorithms (2013-2018)
     • Word2Vec (Jan 2013): shallow neural network; continuous bag-of-words and continuous skip-gram
     • GloVe (Jan 2014): Global Vectors for Word Representation; matrix factorization
     • fastText (Jul 2016): extension of Word2Vec; each word is treated as a set of sub-words (character n-grams)
     (the timeline continues in slide 10: Jun 2017, Feb 2018, Oct 2018)
  9. Limitations of Word2Vec (and family)
     • Some words have different meanings: "Kevin, stop throwing rocks!" vs. "Machine Learning rocks"
       Word2Vec encodes the different meanings of a word as the same vector (see the sketch below)
     • Bidirectional context is not taken into account: previous words (left-to-right) and next words (right-to-left)
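A quick way to see the single-vector limitation, sketched with gensim (assuming the gensim 4.x API; the two-sentence corpus is invented and far too small for real training): whatever the context, "rocks" maps to one and the same vector.

```python
from gensim.models import Word2Vec

# Tiny invented corpus; real training needs far more text
sentences = [
    ["kevin", "stop", "throwing", "rocks"],
    ["machine", "learning", "rocks"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)

# One vector per word type, regardless of which sentence it came from:
# the verb "rocks" and the noun "rocks" are indistinguishable.
print(model.wv["rocks"].shape)  # (50,)
```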
  10. Evolution of NLP algorithms (continued)
      • ELMo (Feb 2018): "Embeddings from Language Models"; (pseudo-)bidirectional context using two unidirectional LSTMs
      • Transformer (Jun 2017): "Attention Is All You Need"; replaces LSTMs with Transformers implementing true bidirectional attention
  11. Attention on the sentence "This movie is funny, it is great"
      [Attention visualization: each token attends to related tokens across the whole sentence]
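The mechanism behind such a visualization is scaled dot-product attention, the core operation of the Transformer. A minimal numpy sketch, using random toy matrices rather than trained weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# 8 tokens ("This movie is funny , it is great"), toy dimension 4
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 4))     # self-attention: same sequence
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.shape)  # (8, 8): how much each token attends to every other token
```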
  12. BERT: Bidirectional Encoder Representations from Transformers
      https://arxiv.org/abs/1810.04805
      https://github.com/google-research/bert
      • BERT improves on ELMo
      • Replaces LSTMs with Transformers, which deal better with long-term dependencies
      • Truly bidirectional architecture: left-to-right and right-to-left contexts are learned by the same network
      • Words are randomly masked during training to improve learning
      • Sentences are randomly paired to improve Next Sentence Prediction (NSP)
      • Pre-trained models:
                      Layers   Hidden units   Parameters
        BERT Base       12         768           110M
        BERT Large      24        1024           340M
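Pre-trained BERT checkpoints are easy to load. Here is a minimal sketch using the Hugging Face transformers library with TensorFlow; the deck itself does not prescribe a library, so treat this as one possible route:

```python
from transformers import BertTokenizer, TFBertModel

# "bert-base-uncased" is the 12-layer / 768-hidden / 110M-parameter BERT Base
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie is great", return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```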
  13. BERT Pre-Training and Fine-Tuning
  14. BERT Pre-Training: Masked Language Model (MLM)
      • Randomly mask 15% of all tokens and predict the masked token
      • Example: "This movie is great" → "This [MASK] is great"
        Output: P(movie | This, [MASK], is, great) → "movie"
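You can reproduce the masked-word prediction with the Hugging Face fill-mask pipeline, a convenient stand-in for the pre-training objective (not the slide's own code):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("This [MASK] is great."):
    print(pred["token_str"], round(pred["score"], 3))
# Plausible fillers such as "movie", "place", "one" should rank highly
```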
  15. BERT Pre-Training: Next Sentence Prediction (NSP)
      • 50% of the time, replace one sentence in a sentence pair with another random sentence
      • Feed the two-sentence encoding into a dense layer to predict whether they are a pair
      • Goal: learn logical coherence
      • Example inputs (is pair?):
        <cls> this movie is great <sep> i love thrillers <sep>      → a coherent pair
        <cls> this movie is great <sep> tomorrow is saturday <sep>  → not a pair
      • Combined with MLM: [CLS] this movie [MASK] great [SEP] i [MASK] thrillers [SEP]
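A sketch of how a sentence pair is packed into one input, using the Hugging Face tokenizer; the token_type_ids mark which segment each token belongs to:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("this movie is great", "i love thrillers")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'this', 'movie', 'is', 'great', '[SEP]', 'i', 'love', 'thrillers', '[SEP]']
print(enc["token_type_ids"])  # 0s for the first sentence, 1s for the second
```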
  16. BERT Fine-Tuning: Text Classification
      • Star rating classifier (1 star = bad, 5 stars = good)
      • Input: "This product is great" → Output: 5 stars
      • Pipeline: [BOS] This product is great [EOS] → BERT → fine-tuning classifier → 5 stars
  17. Discover insights and relationships in text – no ML experience required
  18. AI services
      • Pre-trained AI services that require no ML skills or training
      • Easily add intelligence to your existing applications and workflows
      • Quality and accuracy from continuously learning APIs
      • Vision: Amazon Rekognition · Speech: Amazon Polly, Amazon Transcribe (+Medical) · Text: Amazon Comprehend (+Medical), Amazon Translate, Amazon Textract · Search: Amazon Kendra · Chatbots: Amazon Lex · Personalization: Amazon Personalize · Forecasting: Amazon Forecast · Fraud: Amazon Fraud Detector · Development: Amazon CodeGuru · Contact centers: Contact Lens for Amazon Connect
  19. Amazon Comprehend: discover insights and relationships in text
      • Entities
      • Key phrases
      • Language
      • Sentiment
      • Topic modeling
  20. Text Analysis
      Input: "Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos. Our customers love buying everything from books to blenders at great prices."
      Named entities: Amazon.com (Organization), Seattle, WA (Location), July 5th, 1994 (Date), Jeff Bezos (Person)
      Key phrases: Our customers, books, blenders, great prices
      Sentiment: Positive
      Language: English
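The same analysis can be run programmatically with the boto3 SDK. A minimal sketch, assuming AWS credentials are configured and us-east-1 as a placeholder region:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
text = ("Amazon.com, Inc. is located in Seattle, WA and was founded "
        "July 5th, 1994 by Jeff Bezos.")

# Built-in entity detection: organizations, locations, dates, people, ...
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for e in entities["Entities"]:
    print(e["Type"], e["Text"], round(e["Score"], 2))

sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])
```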
  21. Custom entities
      • Analyze documents for your business and domain terms and phrases
      • Bring your own schema to unstructured text analytics
      Example call transcript: Hello, my name is John Doe, and thank you for calling AnyCompany. I understand you are calling about part number XT2457. We have that part on back order, and we are expediting it for you. Thank you, but I was expecting it last week. At this point, I think we should go ahead and cancel the order entirely. We are sorry to hear that, sir. Would you be willing to complete the order if we offered a 10% discount? Yes, thank you.
      Detected entities: Person: John Doe · Organization: AnyCompany · Part: XT2457 · Account_Action: Cancel the order
  22. Automated custom entity recognizer training
      1. Prepare examples: terms and phrases, plus documents containing them in text (e.g. the call transcript from slide 21)
         Entity value       Entity type
         XT2457             Part
         AnyCompany         Organization
         John Doe           Person
         Cancel the order   Account_Action
      2. Train the service: automated annotation, automated algorithm selection, automated tuning and testing; SDK or code-free console UX
      3. Analyze (see the sketch below)
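The SDK route looks roughly like this with boto3; the recognizer name, bucket paths, IAM role ARN, and file layout are placeholders to substitute with your own:

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder S3 locations and IAM role
response = comprehend.create_entity_recognizer(
    RecognizerName="support-call-entities",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendAccess",
    InputDataConfig={
        "EntityTypes": [{"Type": "PART"}, {"Type": "ACCOUNT_ACTION"}],
        "EntityList": {"S3Uri": "s3://my-bucket/entity-list.csv"},
        "Documents": {"S3Uri": "s3://my-bucket/documents.txt"},
    },
)
print(response["EntityRecognizerArn"])
```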
  23. Custom classification
      • Triage support tickets
      • Moderate forums
      • Organize customer feedback
      • Organize support calls
      Example classifications: PRICING, CANCEL_ACCOUNT, LOYALTY_PROGRAM
  24. Automated custom classifier training
      1. Create a .csv file with training data:
         Text                                                     Label
         I am calling about my credit card                        LOYALTY_PROGRAM
         I really need to shut the service down                   CANCEL_ACCOUNT
         My points are not being applied correctly                LOYALTY_PROGRAM
         The service is very expensive compared to competition    PRICING
         I need a discount to subscribe                           PRICING
      2. Train the service: automated algorithm selection, automated tuning and testing; SDK or code-free console UX
      3. Classify (see the sketch below)
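A corresponding boto3 sketch for step 2; the classifier name, S3 path, and IAM role are placeholders, and the CSV is expected to hold label/text pairs as on the slide:

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder S3 path and IAM role; substitute your own
response = comprehend.create_document_classifier(
    DocumentClassifierName="support-call-topics",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendAccess",
    InputDataConfig={"S3Uri": "s3://my-bucket/training-data.csv"},
)
print(response["DocumentClassifierArn"])
```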
  25. Demo notebook:
      https://github.com/data-science-on-aws/workshop/blob/0c1b2a95f74794a756a55d1f4d7abc7ed4d76f86/02_automl/03_Train_Reviews_Comprehend.ipynb
  26. Amazon Customer Reviews Dataset
      https://registry.opendata.aws/amazon-reviews/
  27. Demo notebook:
      https://github.com/data-science-on-aws/workshop/blob/b0b716e4803ac79b25de2b8e4525af0754b9339a/06_train/03_Train_Reviews_BERT_Transformers_TensorFlow_ScriptMode.ipynb
  28. Text Classification (Star Rating)
      1. Build a dataset of labeled sentences
      2. Grab a pre-trained model (BERT) and add a classification layer
      3. Convert each sentence (Amazon review) to a list of vectors using a pre-trained tokenizer (BERT tokenizer)
      4. Train or fine-tune the model to predict the correct class (star rating) for each review (see the sketch below)
      Pipeline: [BOS] This product is great [EOS] → BERT → fine-tuning classifier → 5 stars
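A condensed fine-tuning sketch with TensorFlow and the Hugging Face transformers library (not the workshop notebook's exact code; the two-review dataset is invented and far too small for real training):

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)  # 5 classes: star ratings 1-5

# Invented toy data; the workshop uses the Amazon Customer Reviews Dataset
reviews = ["This product is great", "Broke after one day"]
labels = [4, 0]  # star rating minus one (0-4)

enc = tokenizer(reviews, padding=True, truncation=True, max_length=128,
                return_tensors="tf")
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
model.fit(dict(enc), tf.constant(labels), epochs=3, batch_size=2)
```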
  29. Get started on AWS
      https://ml.aws
      https://aws.amazon.com/marketplace/solutions/machine-learning/natural-language-processing
      https://aws.amazon.com/comprehend/
      https://github.com/data-science-on-aws/workshop
      https://www.datascienceonaws.com/
  30. Thank you!
      Antje Barth · @anbarth · data-science-on-aws/workshop · linkedin.com/in/antje-barth/
      © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.