Trie
HMM with states
BMES tags: Begin, Middle, End, Single
Viterbi Algorithm
Chooses the tag path with the maximum probability
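The slides only name the Viterbi step, so here is a minimal sketch, assuming BMES tags and hand-made log-probabilities; the numbers are toy values for illustration, not parameters from a real segmenter.

```python
# Minimal Viterbi sketch for BMES word segmentation (toy log-probabilities only).
STATES = ["B", "M", "E", "S"]

def viterbi(chars, start_p, trans_p, emit_p):
    """Return the most probable BMES tag sequence for `chars` (log-prob inputs)."""
    # best[i][s] = best log-prob of any path ending in state s at position i
    best = [{s: start_p[s] + emit_p[s].get(chars[0], -10.0) for s in STATES}]
    back = [{}]
    for i in range(1, len(chars)):
        best.append({})
        back.append({})
        for s in STATES:
            prev, score = max(
                ((p, best[i - 1][p] + trans_p[p].get(s, -10.0)) for p in STATES),
                key=lambda x: x[1],
            )
            best[i][s] = score + emit_p[s].get(chars[i], -10.0)
            back[i][s] = prev
    # Trace back the path with maximum probability.
    last = max(STATES, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(chars) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy (made-up) log-probabilities for the two characters of "中文":
start = {"B": -0.7, "M": -9.0, "E": -9.0, "S": -1.0}
trans = {"B": {"M": -1.0, "E": -0.5}, "M": {"M": -1.0, "E": -0.5},
         "E": {"B": -0.7, "S": -1.0}, "S": {"B": -0.7, "S": -1.0}}
emit = {s: {} for s in STATES}  # empty: fall back to the smoothing value
print(viterbi("中文", start, trans, emit))  # ['B', 'E'] -> one two-character word
```

Word boundaries then fall after every E or S tag in the decoded sequence.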
Problems of HMM
• Quality depends on word coverage, appearance frequency, preprocessing design, and collected dictionaries
• Hard to capture complex/non-linear relationships
• Computing cost increases with large datasets and complex state spaces
BiLSTM
Architecture of BiLSTM
• A pair of LSTMs (Long Short-Term Memory): one reads the sequence forward, the other backward
• Cross-BiLSTM-CNN variant
• Flow: corpora → embedding → NLP task
Peng-Hsuan Li, Tsu-Jui Fu, Wei-Yun Ma. 2019. Remedying BiLSTM-CNN Deficiency in Modeling Cross-Context for NER.
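As a rough illustration of the bidirectional architecture (not the Cross-BiLSTM-CNN model from the cited paper), here is a minimal PyTorch sketch of a BiLSTM tagger; the vocabulary size, dimensions, and tag count are placeholder assumptions.

```python
# Minimal BiLSTM sequence-tagger sketch (illustrative only).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, num_tags=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs one LSTM forward and one backward over the sequence
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Concatenated forward+backward states feed a per-token tag classifier
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(x)         # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(out)     # (batch, seq_len, num_tags)

# Example: tag a batch of two sequences of length 10 with random token ids.
model = BiLSTMTagger()
logits = model(torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 4])
```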
Problems of BiLSTM
• No pre-training on large corpora, so more task-specific data is needed
• OOV words: a fixed vocabulary constrains segmentation of new/rare words
• Limited parallelism: the sequential nature of LSTMs makes computation hard to parallelize
BERT
Architecture of BERT
• BERT Base: 12 layers (110M parameters)
• BERT Large: 24 layers (340M parameters)
Encoder only
Attention
• Multi-Head Attention
• Self-Attention
[Figure: sentence 1 and sentence 2 tokens pass through the encoder's self-attention / multi-head attention layers, producing a tag for each token]
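To ground the attention bullets, here is a minimal numpy sketch of scaled dot-product self-attention for a single head; the dimensions are arbitrary assumptions, and real BERT adds per-layer projections, multiple heads, residual connections, and layer normalization.

```python
# Scaled dot-product self-attention sketch (single head).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every token attends to every token
    return softmax(scores) @ v                # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                   # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```

Multi-head attention runs several such heads in parallel and concatenates their outputs.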
BERT
Transfer Learning
• Pre-Training
• Fine-Tuning
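A minimal sketch of the fine-tuning side of transfer learning, assuming the Hugging Face transformers library and a BMES token-classification head; the checkpoint name and label count are assumptions, not something specified on the slide.

```python
# Hedged sketch: load a pre-trained BERT and attach a token-classification head for BMES tagging.
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese",
    num_labels=4,  # B, M, E, S
)

# The encoder weights come from pre-training; fine-tuning on labeled segmentation
# data adapts the new classification layer (and, optionally, the encoder) to the task.
inputs = tokenizer("今天天氣很好", return_tensors="pt")
logits = model(**inputs).logits   # (1, seq_len, 4) per-token tag scores
```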
Problem of BERT: it still needs a significant amount of task-specific data
Possible remedies:
• Data augmentation
• Semi-supervised learning
• Active learning
• Knowledge distillation
• External knowledge
SECTION 03
Advances
Discriminative vs. Generative AI
• Discriminative AI: the model learns the relationship between input data and input tags, mapping inputs to output labels
• Generative AI: the model learns from unstructured input data and produces new data
NLP Evolution
Traditional Approach (Trie) → Neural Network (BiLSTM) → Pre-Trained Model (BERT) → Prompt Engineering / GPT Hybrid
BERT + GPT
• Model type: BERT is encoder-only; GPT is decoder-only
• Pre-training: BERT uses masked language modeling (MLM); GPT uses autoregressive (AR) language modeling
• Direction: BERT is bidirectional; GPT is unidirectional
• Fine-tuning: BERT adds a task-specific layer on top of the pre-trained model; GPT uses task-specific prompting with one-shot/few-shot adaptation
• Use case: BERT for word segmentation, classification, and NER; GPT for text generation and summarization
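To make the MLM vs. AR distinction concrete, here is a small sketch using the Hugging Face pipeline API; the model names (a Chinese BERT checkpoint and English GPT-2) are assumptions for illustration, not models named on the slides.

```python
# Hedged sketch: contrast BERT-style masked prediction with GPT-style autoregressive generation.
from transformers import pipeline

# MLM: BERT sees the whole sentence and fills in the masked token using context on both sides.
fill = pipeline("fill-mask", model="bert-base-chinese")        # assumed checkpoint
print(fill("今天天氣很[MASK]。")[0]["token_str"])

# AR: GPT conditions only on the left context and generates the continuation token by token.
generate = pipeline("text-generation", model="gpt2")           # assumed checkpoint
print(generate("Natural language processing is", max_new_tokens=10)[0]["generated_text"])
```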
Generate Dataset by GPT
• Initial data: raw data plus a small amount of existing labeled data
• Design the data format and prompting
• GPT produces synthetic data, leveraging strong contextual understanding, zero-shot/few-shot learning, and external knowledge
• Manual review: filter out low-quality data
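A minimal sketch of the prompting step, assuming the OpenAI Python client; the model name, prompt wording, and output format are all illustrative assumptions rather than details from the slides.

```python
# Hedged sketch: ask an LLM to produce synthetic word-segmentation examples from raw sentences.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a Chinese word segmentation annotator.\n"
    "Segment each sentence by inserting a single space between words.\n"
    "Return one segmented sentence per line, nothing else.\n\n"
    "{sentences}"
)

def generate_synthetic_data(raw_sentences, model="gpt-4o-mini"):  # model name is an assumption
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(sentences="\n".join(raw_sentences))}],
    )
    # Each returned line becomes one synthetic labeled example; low-quality lines
    # are still filtered out in the manual-review step.
    return response.choices[0].message.content.splitlines()

print(generate_synthetic_data(["今天天氣很好", "我想吃拉麵"]))
```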
Fine-Tune BERT
• Synthetic data with auto labeling becomes the fine-tuning data
• Split the dataset (with shuffling) into training and evaluation sets
• Fine-tune BERT on the training split
• Evaluate the fine-tuned model
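A sketch of the split/shuffle/fine-tune steps, assuming the Hugging Face datasets and transformers libraries; the toy example rows, split ratio, and hyperparameters are assumptions.

```python
# Hedged sketch: shuffle and split labeled synthetic data, then fine-tune BERT with the Trainer API.
from datasets import Dataset
from transformers import BertForTokenClassification, Trainer, TrainingArguments

# Toy stand-in for the auto-labeled synthetic data: token ids plus aligned BMES label ids
# (-100 marks positions such as [CLS]/[SEP] that the loss should ignore).
example = {"input_ids": [101, 791, 1921, 102],
           "attention_mask": [1, 1, 1, 1],
           "labels": [-100, 0, 3, -100]}
dataset = Dataset.from_list([example] * 20).shuffle(seed=42)
splits = dataset.train_test_split(test_size=0.1)   # 90% train / 10% evaluation

model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=4)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-seg", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
print(trainer.evaluate())
```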
LLM-Driven Fine-Tuning of BERT
• Pipeline: initial data (raw data) → GPT → synthetic data → fine-tuning data → BERT
• Feedback loop: optimize training parameters, generate more data, refine the prompt
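A hedged sketch of the feedback loop described above; the helper functions are hypothetical stubs standing in for the GPT generation, BERT fine-tuning, and evaluation steps sketched earlier, and the target score and round cap are arbitrary assumptions.

```python
# Hedged sketch of an LLM-driven fine-tuning loop with hypothetical stub helpers.
def generate_synthetic_data(prompt):      # stub: would call the LLM with the current prompt
    return ["今天 天氣 很 好"]

def fine_tune_bert(labeled_examples):     # stub: would run the Trainer step
    return "fine-tuned-bert"

def evaluate(model):                      # stub: would compute F1 on a held-out split
    return 0.90

def refine_prompt(prompt, score):         # stub: would adjust instructions/examples
    return prompt + "\nAdd more rare-word examples."

labeled, prompt = [], "Segment each sentence into words."
for _ in range(5):                         # arbitrary cap on feedback rounds
    labeled += generate_synthetic_data(prompt)
    model = fine_tune_bert(labeled)
    score = evaluate(model)
    if score >= 0.95:                      # arbitrary target score
        break
    prompt = refine_prompt(prompt, score)  # otherwise refine and iterate
```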
SECTION 05
Takeaway
Takeaway
• Every NLP model is designed for a particular purpose. As with most mathematical problems, there is more than one solution.
• Once we grasp the core concepts behind NLP models and their common applications, we can craft a solution that fits our goals and resources.