Introduction to NLP

Ristek Data Science Internal Class

Galuh Sahid

August 04, 2021

Transcript

  1. @galuhsahid Text Classification
     • "OFFICIAL INFO Tri Care CONGRATULATIONS Your number has been selected to win a prize of 1 CAR from Tri Care. Winner PIN: br25h99 info: www.gebeyar-3care.tk" → Fraud
     • "Honestly, just eating with indomie is enough for me 😂" → Normal
     • "RAWIT prices are low, so browse the internet more often! Top up your AXIS card now and make sure to join tomorrow's RAWIT Wednesday promo. Info838. LD328" → Promo
  2. @galuhsahid Preprocessing
     There is no magic recipe, but generally you will need to do the following:
     • Tokenizing
     • Lowercasing
     • Lemmatizing
     • Stemming
     • Stopword removal
     • … anything else?
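A minimal sketch of these steps, assuming English text and the NLTK library (the deck does not prescribe a library; in practice you would usually pick lemmatizing or stemming, not both — both appear here only to mirror the list above):

```python
# Illustrative preprocessing sketch with NLTK. Requires: pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists
nltk.download("wordnet")    # lemmatizer data

def preprocess(text):
    # Tokenizing + lowercasing
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    # Stopword removal (also drops punctuation tokens)
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]
    # Lemmatizing (dictionary form), then stemming (crude suffix stripping)
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

print(preprocess("This movie is not scary and is slow"))
```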
  3. @galuhsahid Preprocessing
     There is no magic recipe, but generally you will need to do the following:
     • Remove HTML tags
     • Remove extra whitespace
     • Convert accented characters to ASCII characters
     • Remove special characters and punctuation
     • Remove numbers
     • Convert slang words
     • Replace numbers with a tag
     • … and many more
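Most of these cleaning steps need nothing beyond the standard library; a sketch covering several of the items above (the exact regexes and the <NUM> tag are illustrative choices, not the deck's):

```python
# Illustrative text-cleaning sketch using only the Python standard library.
import re
import unicodedata

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)                    # remove HTML tags
    text = unicodedata.normalize("NFKD", text)              # split accented chars...
    text = text.encode("ascii", "ignore").decode("ascii")   # ...and keep plain ASCII
    text = re.sub(r"\d+", "<NUM>", text)                    # replace numbers with a tag
    text = re.sub(r"[^\w\s<>]", " ", text)                  # remove special chars/punctuation
    text = re.sub(r"\s+", " ", text).strip()                # collapse extra whitespace
    return text

print(clean("Café   <b>open</b> 24 hours!!!"))  # -> "Cafe open <NUM> hours"
```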
  4. @galuhsahid Transform (Source)
     Review 2: This movie is not scary and is slow
     Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
     Number of words in Review 2 = 8
     TF for the word ‘this’ = (number of times ‘this’ appears in Review 2) / (number of terms in Review 2) = 1/8
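The term-frequency computation on this slide can be reproduced in a few lines:

```python
# Term frequency (TF) for the slide's example, computed by hand.
review_2 = "This movie is not scary and is slow".lower().split()

def tf(term, doc):
    # TF = (count of term in doc) / (total terms in doc)
    return doc.count(term) / len(doc)

print(len(review_2))         # 8 terms in Review 2
print(tf("this", review_2))  # 1/8 = 0.125
```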
  5. @galuhsahid Transform (Source)
     Review 2: This movie is not scary and is slow
     IDF(‘this’) = log(number of documents / number of documents containing the word ‘this’) = log(3/3) = log(1) = 0
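The same calculation in code, assuming the other two reviews of the three-document example (Review 1 and Review 3 below are inferred from the vocabulary on the earlier slide, not stated in this deck):

```python
# Inverse document frequency (IDF) for the slide's example.
import math

docs = [
    "This movie is very scary and long".lower().split(),    # Review 1 (assumed)
    "This movie is not scary and is slow".lower().split(),  # Review 2
    "This movie is spooky and good".lower().split(),        # Review 3 (assumed)
]

def idf(term, docs):
    # IDF = log(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

print(idf("this", docs))  # log(3/3) = 0.0
```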
  6. @galuhsahid Transform (Source)
     Review 2: This movie is not scary and is slow
     TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
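In practice you would compute TF-IDF with a library rather than by hand; a sketch with scikit-learn (note that TfidfVectorizer smooths the IDF and L2-normalizes by default, so its numbers differ slightly from the hand computation above; Reviews 1 and 3 are again assumed):

```python
# TF-IDF vectors for the three-review example with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This movie is very scary and long",    # Review 1 (assumed)
    "This movie is not scary and is slow",  # Review 2
    "This movie is spooky and good",        # Review 3 (assumed)
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())
print(X.toarray()[1])  # TF-IDF vector for Review 2
```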
  7. @galuhsahid Model Fitting (Source)
     There are many algorithms that we can experiment with, e.g.:
     • Random Forest
     • SVM
     • … and many more
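Putting the pieces together, a minimal end-to-end sketch: TF-IDF features feeding an SVM, with scikit-learn. The toy texts and labels below are illustrative stand-ins for the SMS data on the first slide, not the deck's actual dataset:

```python
# Minimal text-classification pipeline: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "CONGRATULATIONS your number won a car, send your PIN",  # fraud-like
    "honestly instant noodles are enough for me",            # normal-like
]
labels = ["fraud", "normal"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["you won a free car, claim now"]))
```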
  8. @galuhsahid Transfer Learning
     A deep learning model is trained on a large dataset, then used to perform similar tasks on another dataset (e.g. text classification).
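What this looks like in practice, as a sketch: load a model someone else pre-trained on a large corpus and reuse it directly for a downstream task (Hugging Face Transformers is one common way to do this; the deck does not name a library):

```python
# Transfer learning in one line: reuse a pre-trained model for classification.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a pre-trained model
print(classifier("This movie is not scary and is slow"))
```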
  9. @galuhsahid “...we train a general-purpose ‘language understanding’ model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering)” https://github.com/google-research/bert
  10. @galuhsahid “BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.”
  11. @galuhsahid Masked language model
      • Input: the man went to the [MASK1]. he bought a [MASK2] of milk.
      • Labels: [MASK1] = store; [MASK2] = gallon
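You can try this objective with a pre-trained BERT; a sketch using the fill-mask pipeline (the [MASK1]/[MASK2] notation on the slide is pedagogical, the actual model uses a single [MASK] token, so the sketch fills one mask per call):

```python
# Masked-language-model prediction with a pre-trained BERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("the man went to the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```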
  12. @galuhsahid Next sentence prediction
      • Sentence A: the man went to the store.
      • Sentence B: he bought a gallon of milk.
      • Label: IsNextSentence
  13. @galuhsahid Next sentence prediction
      • Sentence A: the man went to the store.
      • Sentence B: penguins are flightless.
      • Label: NotNextSentence
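The next-sentence-prediction head is also available in pre-trained BERT; a sketch reproducing the two slide examples (in the model's output, index 0 scores IsNextSentence and index 1 scores NotNextSentence):

```python
# Next-sentence prediction with a pre-trained BERT.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def is_next(sentence_a, sentence_b):
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return "IsNextSentence" if logits[0, 0] > logits[0, 1] else "NotNextSentence"

print(is_next("the man went to the store.", "he bought a gallon of milk."))  # IsNextSentence
print(is_next("the man went to the store.", "penguins are flightless."))     # NotNextSentence
```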