
Introduction to NLP

Ristek Data Science Internal Class

Galuh Sahid

August 04, 2021


Transcript

  1. @galuhsahid Text Classification Example messages and their labels: • "OFFICIAL INFO Tri Care CONGRATULATIONS Your number has been selected to receive a prize of 1 CAR from Tri Care, winner PIN: br25h99, info: www.gebeyar-3care.tk" → Fraud • "Honestly, a meal of just indomie is enough for me 😂" → Normal • "RAWIT prices are a bargain, so you can go online more often! Top up your AXIS card now and make sure you join tomorrow's RAWIT Wednesday promo. Info838. LD328" → Promo
  2. @galuhsahid Preprocessing There is no magic recipe, but generally you will need to do the following: • Tokenizing • Lowercasing • Lemmatizing • Stemming • Stopword removal • … anything else? (A sketch of these steps follows below.)
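    A minimal sketch of the steps above, assuming NLTK's English tokenizer, WordNet lemmatizer, Porter stemmer, and stopword list; the example sentence and these library choices are illustrative, not part of the original deck.

        # Assumed illustration: basic preprocessing with NLTK (not from the original deck).
        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer, WordNetLemmatizer

        # One-time downloads of the required NLTK resources.
        nltk.download("punkt")
        nltk.download("stopwords")
        nltk.download("wordnet")

        text = "This movie is not scary and is slow"

        # Tokenizing + lowercasing
        tokens = [t.lower() for t in nltk.word_tokenize(text)]

        # Stopword removal
        stop_words = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop_words]

        # Lemmatizing and stemming (in practice you usually pick one, not both)
        lemmatizer = WordNetLemmatizer()
        stemmer = PorterStemmer()
        print([lemmatizer.lemmatize(t) for t in tokens])
        print([stemmer.stem(t) for t in tokens])
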
  3. @galuhsahid Preprocessing There is no magic recipe, but generally you will need to do the following: • Remove HTML tags • Remove extra whitespace • Convert accented characters to ASCII characters • Remove special characters and punctuation • Remove numbers • Convert slang words • Replace numbers with a tag • … and many more (a cleaning sketch follows below)
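    A simple sketch of a few of these cleaning steps using only the Python standard library; the regexes, the <NUM> tag, and the example string are assumptions for illustration.

        # Assumed illustration: simple text cleaning with the standard library (not from the original deck).
        import re
        import unicodedata

        def clean(text: str) -> str:
            # Remove HTML tags (a naive regex; an HTML parser is more robust)
            text = re.sub(r"<[^>]+>", " ", text)
            # Convert accented characters to their closest ASCII equivalents
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
            # Replace numbers with a tag (alternatively, remove them entirely)
            text = re.sub(r"\d+", "<NUM>", text)
            # Remove special characters and punctuation (keep letters, tags, spaces)
            text = re.sub(r"[^a-zA-Z0-9<> ]", " ", text)
            # Remove extra whitespace
            return re.sub(r"\s+", " ", text).strip()

        print(clean("<p>CONGRATULATIONS!   You won 1 CAR, PIN: br25h99</p>"))
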
  4. @galuhsahid Transform Source Review 2: This movie is not scary and is slow. Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’. Number of words in Review 2 = 8. TF(‘this’, Review 2) = (number of times ‘this’ appears in Review 2) / (number of terms in Review 2) = 1/8
  5. @galuhsahid Transform Source Review 2: This movie is not scary and is slow. IDF(‘this’) = log(number of documents / number of documents containing the word ‘this’) = log(3/3) = log(1) = 0
  6. @galuhsahid Transform Source Review 2: This movie is not scary and is slow. TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
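    The calculation on these three slides can be reproduced in a few lines of Python. Only Review 2 is quoted on the slides; the other two documents below are reconstructed to be consistent with the slide's vocabulary and with IDF('this') = log(3/3), and the formulas follow the slides' unsmoothed definition rather than scikit-learn's default.

        # Assumed illustration: TF-IDF computed by hand, following the slides' formulas.
        import math

        docs = [
            "this movie is very scary and long",    # reconstructed Review 1
            "this movie is not scary and is slow",  # Review 2 (from the slides)
            "this movie is spooky and good",        # reconstructed Review 3
        ]
        tokenized = [d.split() for d in docs]

        def tf(term, doc_tokens):
            # term frequency: occurrences of the term / number of terms in the document
            return doc_tokens.count(term) / len(doc_tokens)

        def idf(term, all_docs):
            # inverse document frequency: log(total docs / docs containing the term)
            containing = sum(1 for d in all_docs if term in d)
            return math.log(len(all_docs) / containing)

        review_2 = tokenized[1]
        print(tf("this", review_2))                           # 1/8 = 0.125
        print(idf("this", tokenized))                         # log(3/3) = 0.0
        print(tf("this", review_2) * idf("this", tokenized))  # 0.0
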
  7. @galuhsahid Model Fitting Source There are many algorithms that we can experiment with, e.g.: • Random Forest • SVM • … and many more (a pipeline sketch follows below)
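    A minimal sketch of the transform-and-fit steps with scikit-learn, assuming an SVM on TF-IDF features; the tiny toy dataset (in the spirit of the labeled SMS examples on the first slide) and the choice of LinearSVC are illustrative only, and a real model needs far more labeled data.

        # Assumed illustration: TF-IDF features + a linear SVM with scikit-learn.
        from sklearn.pipeline import Pipeline
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        texts = [
            "congratulations your number won a car claim your prize now",
            "honestly instant noodles are enough for me",
            "top up your card today and join tomorrow's promo",
        ]
        labels = ["fraud", "normal", "promo"]

        model = Pipeline([
            ("tfidf", TfidfVectorizer()),   # transform text into TF-IDF vectors
            ("clf", LinearSVC()),           # fit a linear SVM on those vectors
        ])
        model.fit(texts, labels)

        print(model.predict(["you won a free car, send your pin to claim the prize"]))
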
  8. @galuhsahid Transfer learning A deep learning model is trained on a large dataset, then used to perform similar tasks on another dataset (e.g. text classification).
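    One common way to apply this idea is with the Hugging Face transformers library: load weights pretrained on a large corpus and attach a fresh classification head for the downstream task. The model name and the three-class setup below are assumptions for illustration, not part of the original deck.

        # Assumed illustration: reusing a pretrained BERT for text classification.
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        # A new, untrained classification head with `num_labels` outputs is added on top.
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=3
        )

        inputs = tokenizer("you won a free car, claim your prize now", return_tensors="pt")
        outputs = model(**inputs)
        print(outputs.logits.shape)  # (1, 3): one score per class, still untrained
        # From here the model would be fine-tuned on labeled downstream data.
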
  9. @galuhsahid “...we train a general-purpose ‘language understanding’ model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering)” https://github.com/google-research/bert
  10. @galuhsahid “BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.”
  11. @galuhsahid Masked language model • Input: the man went to the [MASK1] . he bought a [MASK2] of milk. • Labels: [MASK1] = store; [MASK2] = gallon
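    A quick way to poke at this objective is the fill-mask pipeline in Hugging Face transformers. The model name and the single-mask sentence below are assumptions for illustration; this pipeline fills one [MASK] token at a time rather than the two masks shown on the slide.

        # Assumed illustration: querying BERT's masked-language-model head.
        from transformers import pipeline

        fill_mask = pipeline("fill-mask", model="bert-base-uncased")

        # BERT predicts the most likely tokens for the [MASK] position.
        for prediction in fill_mask("the man went to the [MASK] and bought a gallon of milk."):
            print(prediction["token_str"], round(prediction["score"], 3))
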
  12. @galuhsahid Next sentence prediction • Sentence A: the man went to the store . • Sentence B: he bought a gallon of milk . • Label: IsNextSentence
  13. @galuhsahid Next sentence prediction • Sentence A: the man went to the store . • Sentence B: penguins are flightless . • Label: NotNextSentence
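    The two sentence pairs above can be scored with the pretrained next-sentence-prediction head in Hugging Face transformers. This sketch is an assumption for illustration; the reading of the logits (index 0 = "Sentence B follows", index 1 = "it does not") follows the library's documented convention, not code from the deck.

        # Assumed illustration: scoring sentence pairs with BERT's NSP head.
        import torch
        from transformers import BertTokenizer, BertForNextSentencePrediction

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

        sentence_a = "the man went to the store."
        for sentence_b in ["he bought a gallon of milk.", "penguins are flightless."]:
            encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")
            with torch.no_grad():
                logits = model(**encoding).logits
            # Index 0 scores "Sentence B follows Sentence A", index 1 scores "it does not".
            probs = torch.softmax(logits, dim=-1)[0]
            print(sentence_b, "-> IsNextSentence prob:", round(probs[0].item(), 3))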