Introduction to NLP

Ristek Data Science Internal Class

Galuh Sahid

August 04, 2021

Transcript

  1. Introduction to NLP Galuh Sahid @galuhsahid | github.com/galuhsahid

  2. https://bit.ly/2Tplxbl @galuhsahid

  3. NLP tasks @galuhsahid

  4. @galuhsahid Text Classification

  5. @galuhsahid Text Summarisation Source

  6. @galuhsahid Text Clustering Source

  7. @galuhsahid Machine Translation

  8. @galuhsahid Text Generation Source

  9. @galuhsahid Speech Recognition Try out the model! Source code

  10. @galuhsahid Image Captioning Source

  11. @galuhsahid Question Answering Source

  12. Text Classification @galuhsahid

  13. @galuhsahid Text Classification Example SMS messages and their labels:

    • "INFO RESMI Tri Care SELAMAT Nomor Anda terpilih mendapatkan Hadiah 1 Unit MOBIL Dri Tri Care Dengan PIN Pemenang: br25h99 info: www.gebeyar-3care.tk" → Fraud
    • "Da sebenernya makan sama indomie ge cukup aku mah 😂" → Normal
    • "RAWIT harganya miring, Internetan makin sering! AYO isi ulang kartu AXIS kamu sekarang, pastikan besok ikut promo Rabu RAWIT. Info838. LD328" → Promo
  14. @galuhsahid Source

  15. @galuhsahid Source

  16. Approach #1: scikit-learn @galuhsahid

  17. @galuhsahid Overview: Preprocess → Transform → Model fitting

  18. @galuhsahid Preprocessing There is no magic recipe, but generally you

    will need to do the following: • Tokenizing • Lowercasing • Lemmatizing • Stemming • Stopwords removal • … anything else?
  19. @galuhsahid Preprocessing There is no magic recipe, but generally you

    will need to do the following: • Remove HTML tags • Remove extra whitespaces • Convert accented characters to ASCII characters • Remove special characters, punctuations • Remove numbers • Convert slang words • Replace number with tag • … and many more
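    A minimal sketch of a few of these steps in Python, assuming nltk is installed and its punkt, stopwords, and wordnet resources have been downloaded (the exact steps and their order depend on your data):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())         # lowercase, drop punctuation & numbers
    tokens = nltk.word_tokenize(text)                     # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatize

print(preprocess("This movie is <b>not</b> scary and is slow!"))
# -> "movie scary slow"
```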
  20. @galuhsahid Transform Source

  21. @galuhsahid Transform Source Review 2: This movie is not scary

    and is slow Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’ Number of words in Review 2 = 8 TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/ (number of terms in review 2) = 1/8
  22. @galuhsahid Transform Source

  23. @galuhsahid Transform Source Review 2: This movie is not scary

    and is slow IDF(‘this’) = log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0
  24. @galuhsahid Transform Source

  25. @galuhsahid Transform Source Review 2: This movie is not scary

    and is slow TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
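    The arithmetic above can be reproduced in a few lines. Review 1 and Review 3 are assumed from the cited example (they match the vocabulary listed earlier), and log base 10 is used as in the slides:

```python
import math

reviews = [
    "This movie is very scary and long",    # Review 1 (assumed from the cited example)
    "This movie is not scary and is slow",  # Review 2
    "This movie is spooky and good",        # Review 3 (assumed from the cited example)
]
tokenized = [r.lower().split() for r in reviews]

word, review_2 = "this", tokenized[1]
tf = review_2.count(word) / len(review_2)              # 1/8
docs_with_word = sum(word in doc for doc in tokenized) # 3
idf = math.log10(len(reviews) / docs_with_word)        # log(3/3) = 0
print(tf, idf, tf * idf)                               # 0.125 0.0 0.0
```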
  26. @galuhsahid Transform Source

  27. @galuhsahid Model Fitting Source There are many algorithms that we

    can experiment with, e.g.: • Random Forest • SVM • … and many more
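    Putting the three steps together, a minimal scikit-learn sketch might look like the following; the tiny labeled dataset and the choice of LinearSVC are illustrative only, not the exact setup from the slides:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for illustration (in practice: your preprocessed SMS corpus)
texts = [
    "selamat anda terpilih mendapatkan hadiah mobil hubungi pin pemenang",
    "sebenernya makan indomie aja cukup buat aku",
    "ayo isi ulang kartu kamu sekarang promo internet murah",
]
labels = ["fraud", "normal", "promo"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # transform + model fitting
clf.fit(texts, labels)
print(clf.predict(["isi ulang sekarang dapatkan promo internet"]))
```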
  28. Approach #2: Transformer-based @galuhsahid

  29. @galuhsahid Ways to do training • Train everything from scratch

    • Use a pre-trained model
  30. @galuhsahid Transfer learning A deep learning model is trained on

    a large dataset, then used to perform similar tasks on another dataset (e.g. text classification)
  31. @galuhsahid BERT BERT: Bidirectional Encoder Representations from Transformers

  32. @galuhsahid “...we train a general-purpose ‘language understanding’ model on a

    large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering)” https://github.com/google-research/bert
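    As a hedged sketch of the pre-trained route with the Hugging Face transformers library (the checkpoint name and the number of labels are assumptions, and the classification head still needs fine-tuning on your own labeled data):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # assumed; any pre-trained checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

inputs = tokenizer("AYO isi ulang kartu AXIS kamu sekarang", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3]) -- untrained head; fine-tune before using
```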
  33. @galuhsahid “BERT outperforms previous methods because it is the first

    unsupervised, deeply bidirectional system for pre-training NLP.”
  34. @galuhsahid Unsupervised? BERT was trained using only a plain text

    corpus
  35. @galuhsahid Bidirectional? Pre-trained representations can also either be context-free or

    contextual, e.g. the word "bank" in "bank deposit" vs. "river bank"
  36. @galuhsahid Bidirectional? Contextual representations can further be unidirectional or bidirectional

  37. @galuhsahid BERT Training Strategies • Masked language model • Next

    sentence prediction
  38. @galuhsahid Masked language model • Input: the man went to

    the [MASK1] . he bought a [MASK2] of milk. • Labels: [MASK1] = store; [MASK2] = gallon
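    You can see the masked language model objective in action with the fill-mask pipeline from Hugging Face transformers (the checkpoint choice is an assumption; one [MASK] is filled at a time here):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The man went to the [MASK] and bought a gallon of milk."):
    print(pred["token_str"], round(pred["score"], 3))
# "store" typically ranks near the top, matching the example above
```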
  39. @galuhsahid Next sentence prediction • Sentence A: the man went

    to the store . • Sentence B: he bought a gallon of milk . • Label: IsNextSentence
  40. @galuhsahid Next sentence prediction • Sentence A: the man went

    to the store . • Sentence B: penguins are flightless . • Label: NotNextSentence
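    A similar sketch for next sentence prediction, using the pre-trained NSP head in Hugging Face transformers (checkpoint choice is again an assumption):

```python
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

encoding = tokenizer("The man went to the store.", "Penguins are flightless.",
                     return_tensors="pt")
logits = model(**encoding).logits
# index 0 = IsNextSentence, index 1 = NotNextSentence
print("IsNextSentence" if logits.argmax().item() == 0 else "NotNextSentence")
```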
  41. @galuhsahid Demo

  42. @galuhsahid Pros and cons of the two approaches

  43. @galuhsahid