
Introduction to NLP

Ristek Data Science Internal Class

Galuh Sahid

August 04, 2021

Transcript

  1. Introduction to NLP
    Galuh Sahid
    @galuhsahid | github.com/galuhsahid


  2. https://bit.ly/2Tplxbl
    @galuhsahid


  3. NLP tasks
    @galuhsahid


  4. @galuhsahid
    Text Classification


  5. @galuhsahid
    Text Summarisation
    Source


  6. @galuhsahid
    Text Clustering
    Source


  7. @galuhsahid
    Machine Translation


  8. @galuhsahid
    Text Generation
    Source


  9. @galuhsahid
    Speech Recognition
    Try out the model! Source code


  10. @galuhsahid
    Image Captioning
    Source


  11. @galuhsahid
    Question Answering
    Source


  12. Text Classification
    @galuhsahid


  13. @galuhsahid
    Text Classification
    Example SMS messages and their labels (English translations in brackets):
    • "INFO RESMI Tri Care SELAMAT Nomor Anda terpilih mendapatkan Hadiah 1 Unit MOBIL Dri Tri Care Dengan PIN Pemenang: br25h99 info: www.gebeyar-3care.tk"
    ["OFFICIAL INFO Tri Care CONGRATULATIONS Your number has been selected to receive a prize of 1 CAR from Tri Care, winner's PIN: br25h99, info: www.gebeyar-3care.tk"] → Fraud
    • "Da sebenernya makan sama indomie ge cukup aku mah 😂"
    ["Honestly, just having it with indomie is enough for me 😂"] → Normal
    • "RAWIT harganya miring, Internetan makin sering! AYO isi ulang kartu AXIS kamu sekarang, pastikan besok ikut promo Rabu RAWIT. Info838. LD328"
    ["RAWIT at a bargain price, so you can go online more often! Come top up your AXIS card now and make sure you join tomorrow's RAWIT Wednesday promo. Info838. LD328"] → Promo


  14. @galuhsahid
    Source


  15. @galuhsahid
    Source


  16. Approach #1: scikit-learn
    @galuhsahid


  17. @galuhsahid
    Overview
    Preprocess → Transform → Model fitting
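
    A minimal sketch of how these three steps map onto scikit-learn. The texts, labels and choice of classifier below are illustrative assumptions, not the deck's exact setup:

        # Sketch: preprocess -> transform -> model fitting with scikit-learn.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline

        texts = ["SELAMAT Anda mendapatkan hadiah mobil", "makan indomie aja cukup", "AYO isi ulang kartu AXIS"]
        labels = ["fraud", "normal", "promo"]

        def preprocess(text):
            # placeholder for the cleaning steps on the next slides
            return text.lower()

        model = Pipeline([
            ("tfidf", TfidfVectorizer(preprocessor=preprocess)),  # transform step
            ("clf", LogisticRegression()),                        # model fitting step
        ])
        model.fit(texts, labels)
        print(model.predict(["Nomor Anda terpilih mendapatkan hadiah"]))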


  18. @galuhsahid
    Preprocessing
    There is no magic recipe, but generally you will need to do the following (see the sketch after this list):


    • Tokenizing


    • Lowercasing


    • Lemmatizing


    • Stemming


    • Stopword removal


    • … anything else?
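
    A rough sketch of what these steps can look like; NLTK is an assumed tool choice here, and the example sentence is taken from the TF-IDF slides later in the deck:

        # Tokenizing, lowercasing, stopword removal, stemming and lemmatizing with NLTK.
        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer, WordNetLemmatizer
        from nltk.tokenize import word_tokenize

        nltk.download("punkt")
        nltk.download("stopwords")
        nltk.download("wordnet")

        text = "This movie is not scary and is slow"
        tokens = word_tokenize(text.lower())                 # tokenizing + lowercasing
        stop_words = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop_words]  # stopword removal
        stems = [PorterStemmer().stem(t) for t in tokens]    # stemming
        lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # lemmatizing
        print(stems, lemmas)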


  19. @galuhsahid
    Preprocessing
    There is no magic recipe, but generally you will also want to consider the following (see the sketch after this list):


    • Remove HTML tags


    • Remove extra whitespaces


    • Convert accented characters to ASCII characters


    • Remove special characters and punctuation


    • Remove numbers


    • Convert slang words


    • Replace numbers with a tag


    • … and many more
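
    A hedged sketch of a cleaning function covering several of these steps with the Python standard library; the slang dictionary is a toy illustration, not a real resource:

        # Remove HTML tags, accents, special characters; tag numbers; convert slang.
        import re
        import unicodedata

        SLANG = {"yg": "yang", "ga": "tidak"}  # toy slang dictionary (assumption)

        def clean(text):
            text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()  # accents -> ASCII
            text = re.sub(r"\d+", " <num> ", text)  # replace numbers with a tag
            text = re.sub(r"[^\w\s<>]", " ", text)  # remove special characters/punctuation
            words = [SLANG.get(w, w) for w in text.split()]  # convert slang
            return " ".join(words)  # joining also collapses extra whitespace

        print(clean("RAWIT harganya miring! AYO isi ulang kartu AXIS kamu sekarang. Info838"))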


  20. @galuhsahid
    Transform
    Source


  21. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,
    ‘slow’, ‘spooky’, ‘good’


    Number of words in Review 2 = 8


    TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/
    (number of terms in review 2) = 1/8



  22. @galuhsahid
    Transform
    Source


  23. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    IDF(‘this’) = log(number of documents/number of documents containing
    the word ‘this’) = log(3/3) = log(1) = 0


  24. @galuhsahid
    Transform
    Source


  25. @galuhsahid
    Transform
    Source
    Review 2: This movie is not scary and is slow
    TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0
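
    The arithmetic on these slides can be reproduced in a few lines. The sketch below assumes the other two reviews from the source example ("This movie is very scary and long" and "This movie is spooky and good"), which is why "this" appears in all three documents:

        # Reproduce TF, IDF and TF-IDF for the word "this" in Review 2.
        import math

        reviews = [
            "This movie is very scary and long",    # Review 1 (assumed from the source)
            "This movie is not scary and is slow",  # Review 2
            "This movie is spooky and good",        # Review 3 (assumed from the source)
        ]
        docs = [r.lower().split() for r in reviews]

        word = "this"
        tf = docs[1].count(word) / len(docs[1])                   # 1/8
        idf = math.log(len(docs) / sum(word in d for d in docs))  # log(3/3) = 0
        print(tf, idf, tf * idf)                                  # 0.125 0.0 0.0

    Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises the resulting vectors, so its numbers will differ slightly from this hand computation.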


  26. @galuhsahid
    Transform
    Source


  27. @galuhsahid
    Model Fitting
    Source
    There are many algorithms that we can experiment with (see the sketch after this list), e.g.:


    • Random Forest


    • SVM


    • … and many more
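
    A hedged sketch of trying both algorithms on TF-IDF features with cross-validation; the six texts below are toy placeholders for a real labelled dataset:

        # Compare a Random Forest and an SVM on TF-IDF features.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import Pipeline
        from sklearn.svm import LinearSVC

        texts = ["hadiah mobil menanti anda", "promo internet murah", "besok kita makan di mana",
                 "selamat anda memenangkan undian", "isi ulang sekarang dapat bonus", "aku lagi di jalan"]
        labels = ["fraud", "promo", "normal", "fraud", "promo", "normal"]

        for name, clf in [("random forest", RandomForestClassifier()), ("svm", LinearSVC())]:
            pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
            scores = cross_val_score(pipe, texts, labels, cv=2)
            print(name, scores.mean())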


  28. Approach #2: Transformer-based
    @galuhsahid


  29. @galuhsahid
    Ways to do training
    • Train everything from scratch


    • Use a pre-trained model


  30. @galuhsahid
    Transfer learning
    A deep learning model is trained on a large dataset, then used to perform
    similar tasks on another dataset (e.g. text classification)
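
    A hedged sketch of this idea with the Hugging Face transformers library: load a model pre-trained on a large corpus and fine-tune its classification head on your own labelled data. The model name, example texts and labels are assumptions for illustration:

        # Reuse a pre-trained BERT and fine-tune it for text classification.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

        batch = tokenizer(["You won a free car, click here!", "See you at lunch tomorrow"],
                          padding=True, truncation=True, return_tensors="pt")
        labels = torch.tensor([0, 1])  # e.g. 0 = fraud, 1 = normal

        outputs = model(**batch, labels=labels)  # forward pass returns loss and logits
        outputs.loss.backward()                  # backpropagate; an optimiser step would follow
        print(outputs.logits.shape)              # (2, 3): one score per class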


  31. @galuhsahid
    BERT
    BERT: Bidirectional Encoder Representations from Transformers


  32. @galuhsahid
    “...we train a general-purpose ‘language understanding’ model on
    a large text corpus (like Wikipedia), and then use that model
    for downstream NLP tasks that we care about (like question
    answering)”
    https://github.com/google-research/bert


  33. @galuhsahid
    “BERT outperforms previous methods because it
    is the first unsupervised, deeply
    bidirectional system for pre-training NLP.”


  34. @galuhsahid
    Unsupervised?
    BERT was trained using only a plain text corpus


  35. @galuhsahid
    Bidirectional?
    Pre-trained representations can also either be context-free or contextual
    bank → "bank deposit" vs. "river bank"
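
    A sketch of why this matters: with a contextual model, the vector for "bank" depends on the sentence it appears in. The model choice and sentences are illustrative assumptions:

        # Contextual embeddings: "bank" gets different vectors in different contexts.
        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")

        def bank_vector(sentence):
            enc = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]
            position = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
            return hidden[position]

        v1 = bank_vector("I opened a bank deposit account")
        v2 = bank_vector("We sat on the river bank")
        print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0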


  36. @galuhsahid
    Bidirectional?
    Contextual representations can further be unidirectional or bidirectional


  37. @galuhsahid
    BERT Training Strategies
    • Masked language model


    • Next sentence prediction


  38. @galuhsahid
    Masked language model
    • Input: the man went to the [MASK1] . he bought a [MASK2] of milk.


    • Labels: [MASK1] = store; [MASK2] = gallon
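
    You can poke at this objective directly with the transformers fill-mask pipeline (a small sketch; BERT's mask token is [MASK]):

        # Ask a pre-trained BERT to fill in a masked token.
        from transformers import pipeline

        unmasker = pipeline("fill-mask", model="bert-base-uncased")
        for prediction in unmasker("The man went to the [MASK] and bought a gallon of milk."):
            print(prediction["token_str"], prediction["score"])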


  39. @galuhsahid
    Next sentence prediction
    • Sentence A: the man went to the store .


    • Sentence B: he bought a gallon of milk .


    • Label: IsNextSentence


  40. @galuhsahid
    Next sentence prediction
    • Sentence A: the man went to the store .


    • Sentence B: penguins are flightless .


    • Label: NotNextSentence
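
    A sketch of scoring the same sentence pairs with BERT's next-sentence-prediction head; in transformers, index 0 of the logits corresponds to "B follows A":

        # Score sentence pairs with BERT's next-sentence-prediction head.
        import torch
        from transformers import BertForNextSentencePrediction, BertTokenizer

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

        def is_next(sentence_a, sentence_b):
            enc = tokenizer(sentence_a, sentence_b, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits
            return logits.argmax().item() == 0  # index 0 = IsNextSentence

        print(is_next("the man went to the store .", "he bought a gallon of milk ."))  # expect True
        print(is_next("the man went to the store .", "penguins are flightless ."))     # expect False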


  41. @galuhsahid
    Demo


  42. @galuhsahid
    Pros and cons of the two approaches


  43. @galuhsahid
