Slide 1

Introduction to NLP
Galuh Sahid (@galuhsahid | github.com/galuhsahid)

Slide 2

https://bit.ly/2Tplxbl

Slide 3

NLP tasks

Slide 4

Text Classification

Slide 5

Text Summarisation

Slide 6

Text Clustering

Slide 7

Machine Translation

Slide 8

Text Generation

Slide 9

Speech Recognition

Slide 10

Image Captioning

Slide 11

Question Answering

Slide 12

Text Classification

Slide 13

Text Classification: example SMS messages and their labels
• "INFO RESMI Tri Care SELAMAT Nomor Anda terpilih mendapatkan Hadiah 1 Unit MOBIL Dri Tri Care Dengan PIN Pemenang: br25h99 info: www.gebeyar-3care.tk" ("OFFICIAL INFO Tri Care: CONGRATULATIONS, your number has been selected to win a prize of 1 CAR from Tri Care, winner PIN: br25h99, info: www.gebeyar-3care.tk") → Fraud
• "Da sebenernya makan sama indomie ge cukup aku mah 😂" ("Honestly, just a meal with Indomie is enough for me 😂") → Normal
• "RAWIT harganya miring, Internetan makin sering! AYO isi ulang kartu AXIS kamu sekarang, pastikan besok ikut promo Rabu RAWIT. Info838. LD328" ("RAWIT at a bargain price, so you can browse the internet more often! Top up your AXIS card now and make sure you join tomorrow's RAWIT Wednesday promo. Info838. LD328") → Promo

Slide 14


Slide 15


Slide 16

Approach #1: scikit-learn

Slide 17

Overview
• Preprocess
• Transform
• Model fitting

Slide 18

Preprocessing
There is no magic recipe, but generally you will need to do the following:
• Tokenizing
• Lowercasing
• Lemmatizing
• Stemming
• Stopwords removal
• … anything else?

Slide 19

Preprocessing
There is no magic recipe, but generally you will also need some of the following (see the sketch after this list):
• Remove HTML tags
• Remove extra whitespace
• Convert accented characters to ASCII characters
• Remove special characters and punctuation
• Remove numbers
• Convert slang words
• Replace numbers with a tag
• … and many more
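
To make these steps concrete, here is a minimal preprocessing sketch in Python. The regular expressions, the <num> tag, and the tiny SLANG dictionary are illustrative assumptions rather than the exact pipeline from the talk; in practice you might reach for NLTK, spaCy, or Sastrawi for tokenizing, stemming, and stopword removal.

```python
import re
import unicodedata

# Tiny hypothetical slang dictionary; a real pipeline would use a proper lexicon.
SLANG = {"ge": "aja", "dri": "dari"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                   # remove HTML tags
    text = unicodedata.normalize("NFKD", text)             # convert accented characters
    text = text.encode("ascii", "ignore").decode("ascii")  # ... to plain ASCII
    text = text.lower()                                    # lowercasing
    text = re.sub(r"\d+", "<num>", text)                   # replace numbers with a tag
    text = re.sub(r"[^\w<>\s]", " ", text)                 # remove special characters/punctuation
    text = re.sub(r"\s+", " ", text).strip()               # remove extra whitespace
    tokens = text.split()                                  # naive whitespace tokenizing
    return [SLANG.get(token, token) for token in tokens]   # convert slang words

print(preprocess("RAWIT harganya miring, Internetan makin sering!"))
```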

Slide 20

Transform

Slide 21

Transform
Review 2: This movie is not scary and is slow
Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
Number of words in Review 2 = 8
TF for the word ‘this’ = (number of times ‘this’ appears in Review 2) / (number of terms in Review 2) = 1/8
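
A minimal sketch of this TF calculation in Python, assuming plain whitespace tokenization (the source example may tokenize differently):

```python
def term_frequency(term, document_tokens):
    # TF = (number of times the term appears in the document) / (number of terms in the document)
    return document_tokens.count(term) / len(document_tokens)

review_2 = "this movie is not scary and is slow".split()
print(len(review_2))                     # 8 terms
print(term_frequency("this", review_2))  # 1/8 = 0.125
```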

Slide 22

Transform

Slide 23

Transform
Review 2: This movie is not scary and is slow
IDF(‘this’) = log(number of documents / number of documents containing the word ‘this’) = log(3/3) = log(1) = 0
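
A matching sketch for IDF. Only Review 2 appears on the slide; Reviews 1 and 3 below are hypothetical stand-ins so that the corpus has three documents, as in the example:

```python
import math

corpus = [
    "this movie is very scary and long".split(),    # hypothetical Review 1
    "this movie is not scary and is slow".split(),  # Review 2 (from the slide)
    "this movie is spooky and good".split(),        # hypothetical Review 3
]

def inverse_document_frequency(term, corpus):
    # IDF = log(number of documents / number of documents containing the term)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

print(inverse_document_frequency("this", corpus))  # log(3/3) = 0.0
```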

Slide 24

Transform

Slide 25

Transform
Review 2: This movie is not scary and is slow
TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
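
Putting the two steps together, and comparing with scikit-learn's TfidfVectorizer (assuming scikit-learn 1.0+ for get_feature_names_out). Reviews 1 and 3 are again hypothetical stand-ins; note that TfidfVectorizer uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, and L2-normalises each row, so its values differ slightly from the hand calculation on the slide:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this movie is very scary and long",    # hypothetical Review 1
    "this movie is not scary and is slow",  # Review 2 (from the slide)
    "this movie is spooky and good",        # hypothetical Review 3
]

# Hand calculation following the slide's formulas
review_2 = corpus[1].split()
tf = review_2.count("this") / len(review_2)             # 1/8
df = sum(1 for doc in corpus if "this" in doc.split())  # 3 documents contain 'this'
idf = math.log(len(corpus) / df)                        # log(3/3) = 0
print(tf * idf)                                         # TF-IDF('this', Review 2) = 0.0

# The same corpus through scikit-learn
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```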

Slide 26

Transform

Slide 27

Model Fitting
There are many algorithms that we can experiment with (see the sketch after this list), e.g.:
• Random Forest
• SVM
• … and many more
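
A minimal end-to-end sketch of this approach with scikit-learn, combining TF-IDF features with one of the algorithms above (a linear SVM). The three messages and labels are toy stand-ins for a real labelled dataset like the SMS example earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data standing in for a real fraud/normal/promo SMS corpus
texts = [
    "SELAMAT Nomor Anda terpilih mendapatkan Hadiah",  # fraud-like
    "sebenernya makan indomie ge cukup aku mah",       # normal-like
    "AYO isi ulang kartu sekarang ikut promo",         # promo-like
]
labels = ["fraud", "normal", "promo"]

# TfidfVectorizer handles lowercasing and tokenisation; LinearSVC is one SVM variant.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
model.fit(texts, labels)
print(model.predict(["isi ulang sekarang dan dapatkan promo"]))
```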

Slide 28

Approach #2: Transformer-based

Slide 29

Ways to do training
• Train everything from scratch
• Use a pre-trained model

Slide 30

Transfer learning
A deep learning model is trained on a large dataset, then used to perform similar tasks on another dataset (e.g. text classification)
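
With the Hugging Face transformers library (an assumption; the talk does not prescribe a specific library), transfer learning for text classification looks roughly like this: load a checkpoint pre-trained on a large corpus, attach a fresh classification head, and fine-tune it on your own labelled data. The bert-base-uncased checkpoint and num_labels=3 (fraud/normal/promo) are just examples:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # fresh, untrained classification head
)

inputs = tokenizer("top up your card now and join the promo", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 3): ready to be fine-tuned on labelled data
```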

Slide 31

BERT: Bidirectional Encoder Representations from Transformers

Slide 32

“...we train a general-purpose ‘language understanding’ model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering)” https://github.com/google-research/bert

Slide 33

“BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.”

Slide 34

Unsupervised?
BERT was trained using only a plain text corpus

Slide 35

Bidirectional?
Pre-trained representations can be either context-free or contextual (e.g. the word “bank” in “bank deposit” vs. “river bank”)

Slide 36

Bidirectional?
Contextual representations can further be unidirectional or bidirectional

Slide 37

BERT Training Strategies
• Masked language model
• Next sentence prediction

Slide 38

Masked language model
• Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
• Labels: [MASK1] = store; [MASK2] = gallon
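
A quick way to see the masked-language-model objective in action, assuming the Hugging Face transformers library (BERT's actual mask token is [MASK]):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for sentence in ["the man went to the [MASK] .",
                 "he bought a [MASK] of milk ."]:
    print(sentence)
    for prediction in fill_mask(sentence)[:3]:  # top 3 guesses for the masked word
        print("  ", prediction["token_str"], round(prediction["score"], 3))
```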

Slide 39

Next sentence prediction
• Sentence A: the man went to the store .
• Sentence B: he bought a gallon of milk .
• Label: IsNextSentence

Slide 40

Next sentence prediction
• Sentence A: the man went to the store .
• Sentence B: penguins are flightless .
• Label: NotNextSentence
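
The next-sentence-prediction head can be probed directly, again assuming the Hugging Face transformers library; in its convention, class 0 means “sentence B follows sentence A” and class 1 means it does not:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

pairs = [
    ("the man went to the store .", "he bought a gallon of milk ."),  # IsNextSentence
    ("the man went to the store .", "penguins are flightless ."),     # NotNextSentence
]

for sentence_a, sentence_b in pairs:
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    probabilities = torch.softmax(model(**inputs).logits, dim=-1)
    # probabilities[0][0]: sentence B follows A; probabilities[0][1]: it does not
    print(sentence_b, "->", [round(p, 3) for p in probabilities[0].tolist()])
```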

Slide 41

Demo

Slide 42

Pros and cons of the two approaches

Slide 43

@galuhsahid