
Introduction to NLP

Ristek Data Science Internal Class

Galuh Sahid

August 04, 2021

Transcript

  1. Introduction to NLP
    Galuh Sahid
    @galuhsahid | github.com/galuhsahid


  2. https://bit.ly/2Tplxbl
    @galuhsahid


  3. NLP tasks
    @galuhsahid


  4. @galuhsahid
    Text Classification


  5. @galuhsahid
    Text Summarisation
    Source


  6. @galuhsahid
    Text Clustering
    Source


  7. @galuhsahid
    Machine Translation


  8. @galuhsahid
    Text Generation
    Source


  9. @galuhsahid
    Speech Recognition
    Try out the model! Source code


  10. @galuhsahid
    Image Captioning
    Source


  11. @galuhsahid
    Question Answering
    Source


  12. Text Classification
    @galuhsahid


  13. @galuhsahid
    Text Classification
    Example SMS messages and their labels (English translations in brackets):
    • "INFO RESMI Tri Care SELAMAT Nomor Anda terpilih mendapatkan Hadiah 1 Unit MOBIL Dri Tri Care Dengan PIN Pemenang: br25h99 info: www.gebeyar-3care.tk"
    ["OFFICIAL INFO Tri Care CONGRATULATIONS Your number has been selected to receive a prize of 1 CAR from Tri Care, winner's PIN: br25h99, info: www.gebeyar-3care.tk"] → Fraud
    • "Da sebenernya makan sama indomie ge cukup aku mah 😂"
    ["Honestly, just having it with indomie is enough for me 😂"] → Normal
    • "RAWIT harganya miring, Internetan makin sering! AYO isi ulang kartu AXIS kamu sekarang, pastikan besok ikut promo Rabu RAWIT. Info838. LD328"
    ["RAWIT at a bargain price, so you can go online more often! Come top up your AXIS card now and make sure you join tomorrow's RAWIT Wednesday promo. Info838. LD328"] → Promo


  14. @galuhsahid
    Source


  15. @galuhsahid
    Source


  16. Approach #1: scikit-learn
    @galuhsahid


  17. @galuhsahid
    Overview
    Preprocess → Transform → Model fitting
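
    A minimal sketch of how these three steps map onto scikit-learn. The texts, labels and choice of classifier below are illustrative assumptions, not the deck's exact setup:

        # Sketch: preprocess -> transform -> model fitting with scikit-learn.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline

        texts = ["SELAMAT Anda mendapatkan hadiah mobil", "makan indomie aja cukup", "AYO isi ulang kartu AXIS"]
        labels = ["fraud", "normal", "promo"]

        def preprocess(text):
            # placeholder for the cleaning steps on the next slides
            return text.lower()

        model = Pipeline([
            ("tfidf", TfidfVectorizer(preprocessor=preprocess)),  # transform step
            ("clf", LogisticRegression()),                        # model fitting step
        ])
        model.fit(texts, labels)
        print(model.predict(["Nomor Anda terpilih mendapatkan hadiah"]))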


  18. @galuhsahid
    Preprocessing
    There is no magic recipe, but generally you will need to do the following (see the sketch after this list):


    • Tokenizing


    • Lowercasing


    • Lemmatizing


    • Stemming


    • Stopword removal


    • … anything else?
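
    A rough sketch of what these steps can look like; NLTK is an assumed tool choice here, and the example sentence is taken from the TF-IDF slides later in the deck:

        # Tokenizing, lowercasing, stopword removal, stemming and lemmatizing with NLTK.
        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer, WordNetLemmatizer
        from nltk.tokenize import word_tokenize

        nltk.download("punkt")
        nltk.download("stopwords")
        nltk.download("wordnet")

        text = "This movie is not scary and is slow"
        tokens = word_tokenize(text.lower())                 # tokenizing + lowercasing
        stop_words = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop_words]  # stopword removal
        stems = [PorterStemmer().stem(t) for t in tokens]    # stemming
        lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # lemmatizing
        print(stems, lemmas)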


  19. @galuhsahid
    Preprocessing
    There is no magic recipe, but generally you will also want to consider the following (see the sketch after this list):


    • Remove HTML tags


    • Remove extra whitespaces


    • Convert accented characters to ASCII characters


    • Remove special characters and punctuation


    • Remove numbers


    • Convert slang words


    • Replace numbers with a tag


    • … and many more
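
    A hedged sketch of a cleaning function covering several of these steps with the Python standard library; the slang dictionary is a toy illustration, not a real resource:

        # Remove HTML tags, accents, special characters; tag numbers; convert slang.
        import re
        import unicodedata

        SLANG = {"yg": "yang", "ga": "tidak"}  # toy slang dictionary (assumption)

        def clean(text):
            text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()  # accents -> ASCII
            text = re.sub(r"\d+", " <num> ", text)  # replace numbers with a tag
            text = re.sub(r"[^\w\s<>]", " ", text)  # remove special characters/punctuation
            words = [SLANG.get(w, w) for w in text.split()]  # convert slang
            return " ".join(words)  # joining also collapses extra whitespace

        print(clean("RAWIT harganya miring! AYO isi ulang kartu AXIS kamu sekarang. Info838"))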


  20. @galuhsahid
    Transform
    Source


  21. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,
    ‘slow’, ‘spooky’, ‘good’


    Number of words in Review 2 = 8


    TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/
    (number of terms in review 2) = 1/8



  22. @galuhsahid
    Transform
    Source


  23. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    IDF(‘this’) = log(number of documents/number of documents containing
    the word ‘this’) = log(3/3) = log(1) = 0


  24. @galuhsahid
    Transform
    Source


  25. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
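
    The arithmetic on these slides can be reproduced in a few lines. The sketch below assumes the other two reviews from the source example ("This movie is very scary and long" and "This movie is spooky and good"), which is why "this" appears in all three documents:

        # Reproduce TF, IDF and TF-IDF for the word "this" in Review 2.
        import math

        reviews = [
            "This movie is very scary and long",    # Review 1 (assumed from the source)
            "This movie is not scary and is slow",  # Review 2
            "This movie is spooky and good",        # Review 3 (assumed from the source)
        ]
        docs = [r.lower().split() for r in reviews]

        word = "this"
        tf = docs[1].count(word) / len(docs[1])                   # 1/8
        idf = math.log(len(docs) / sum(word in d for d in docs))  # log(3/3) = 0
        print(tf, idf, tf * idf)                                  # 0.125 0.0 0.0

    Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises the resulting vectors, so its numbers will differ slightly from this hand computation.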


  26. @galuhsahid
    Transform
    Source


  27. @galuhsahid
    Model Fitting
    Source
    There are many algorithms that we can experiment with (see the sketch after this list), e.g.:


    • Random Forest


    • SVM


    • … and many more
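
    A hedged sketch of trying both algorithms on TF-IDF features with cross-validation; the six texts below are toy placeholders for a real labelled dataset:

        # Compare a Random Forest and an SVM on TF-IDF features.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import Pipeline
        from sklearn.svm import LinearSVC

        texts = ["hadiah mobil menanti anda", "promo internet murah", "besok kita makan di mana",
                 "selamat anda memenangkan undian", "isi ulang sekarang dapat bonus", "aku lagi di jalan"]
        labels = ["fraud", "promo", "normal", "fraud", "promo", "normal"]

        for name, clf in [("random forest", RandomForestClassifier()), ("svm", LinearSVC())]:
            pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
            scores = cross_val_score(pipe, texts, labels, cv=2)
            print(name, scores.mean())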


  28. Approach #2: Transformer-based
    @galuhsahid


  29. @galuhsahid
    Ways to do training
    • Train everything from scratch


    • Use a pre-trained model


  30. @galuhsahid
    Transfer learning
    A deep learning model is trained on a large dataset, then used to perform
    similar tasks on another dataset (e.g. text classification)
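
    A hedged sketch of this idea with the Hugging Face transformers library: load a model pre-trained on a large corpus and fine-tune its classification head on your own labelled data. The model name, example texts and labels are assumptions for illustration:

        # Reuse a pre-trained BERT and fine-tune it for text classification.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

        batch = tokenizer(["You won a free car, click here!", "See you at lunch tomorrow"],
                          padding=True, truncation=True, return_tensors="pt")
        labels = torch.tensor([0, 1])  # e.g. 0 = fraud, 1 = normal

        outputs = model(**batch, labels=labels)  # forward pass returns loss and logits
        outputs.loss.backward()                  # backpropagate; an optimiser step would follow
        print(outputs.logits.shape)              # (2, 3): one score per class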


  31. @galuhsahid
    BERT
    BERT: Bidirectional Encoder Representations from Transformers


  32. @galuhsahid
    “...we train a general-purpose ‘language understanding’ model on
    a large text corpus (like Wikipedia), and then use that model
    for downstream NLP tasks that we care about (like question
    answering)”
    https://github.com/google-research/bert


  33. @galuhsahid
    “BERT outperforms previous methods because it
    is the first unsupervised, deeply
    bidirectional system for pre-training NLP.”


  34. @galuhsahid
    Unsupervised?
    BERT was trained using only a plain text corpus


  35. @galuhsahid
    Bidirectional?
    Pre-trained representations can also either be context-free or contextual
    bank → "bank deposit" vs. "river bank"
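
    A sketch of why this matters: with a contextual model, the vector for "bank" depends on the sentence it appears in. The model choice and sentences are illustrative assumptions:

        # Contextual embeddings: "bank" gets different vectors in different contexts.
        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")

        def bank_vector(sentence):
            enc = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]
            position = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
            return hidden[position]

        v1 = bank_vector("I opened a bank deposit account")
        v2 = bank_vector("We sat on the river bank")
        print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0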


  36. @galuhsahid
    Bidirectional?
    Contextual representations can further be unidirectional or bidirectional


  37. @galuhsahid
    BERT Training Strategies
    • Masked language model


    • Next sentence prediction


  38. @galuhsahid
    Masked language model
    • Input: the man went to the [MASK1] . he bought a [MASK2] of milk.


    • Labels: [MASK1] = store; [MASK2] = gallon
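
    You can poke at this objective directly with the transformers fill-mask pipeline (a small sketch; BERT's mask token is [MASK]):

        # Ask a pre-trained BERT to fill in a masked token.
        from transformers import pipeline

        unmasker = pipeline("fill-mask", model="bert-base-uncased")
        for prediction in unmasker("The man went to the [MASK] and bought a gallon of milk."):
            print(prediction["token_str"], prediction["score"])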


  39. @galuhsahid
    Next sentence prediction
    • Sentence A: the man went to the store .


    • Sentence B: he bought a gallon of milk .


    • Label: IsNextSentence


  40. @galuhsahid
    Next sentence prediction
    • Sentence A: the man went to the store .


    • Sentence B: penguins are flightless .


    • Label: NotNextSentence
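
    A sketch of scoring the same sentence pairs with BERT's next-sentence-prediction head; in transformers, index 0 of the logits corresponds to "B follows A":

        # Score sentence pairs with BERT's next-sentence-prediction head.
        import torch
        from transformers import BertForNextSentencePrediction, BertTokenizer

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

        def is_next(sentence_a, sentence_b):
            enc = tokenizer(sentence_a, sentence_b, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits
            return logits.argmax().item() == 0  # index 0 = IsNextSentence

        print(is_next("the man went to the store .", "he bought a gallon of milk ."))  # expect True
        print(is_next("the man went to the store .", "penguins are flightless ."))     # expect False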


  41. @galuhsahid
    Demo


  42. @galuhsahid
    Pros and cons of the two approaches


  43. @galuhsahid
