Slide 1

Slide 1 text

Distillation Overview and Recent Research
Yeongmin Baek (ML Engineer, Pingpong)

Slide 2

Slide 2 text

Table of Contents
1. Overview
   1. What is Distillation?
   2. Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
2. Recent Research
   1. Transformer to CNN: Label-scarce distillation for efficient text classification (Chia et al., 2018 NIPS Workshop)
   2. BAM! Born-Again Multi-Task Networks for Natural Language Understanding (Clark et al., 2019 arXiv)
   3. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation (Turc et al., 2019 arXiv)
   4. Patient Knowledge Distillation for BERT Model Compression (Sun et al., 2019 EMNLP)

Slide 3

Slide 3 text

Overview “Distillation” Overview

Slide 4

Slide 4 text

What is Distillation? Overview

Slide 5

Slide 5 text

Overview: What is Distillation?
• Teacher Model: many parameters, a large, well-trained model
• Student Model: fewer parameters, a small model
• Knowledge is transferred from the Teacher to the Student

Slide 6

Slide 6 text

Overview: One-hot vs Continuous
• Original classification
  • One-hot label: assign 1 to the correct class → "hard target"
  • Loss: Cross-Entropy
• Distillation
  • Continuous label: use the model's output as the label → "soft target"
  • Loss: Cross-Entropy, KL Divergence, MSE
(Bar charts: a soft target distribution [0.1, 0.05, 0.7, 0.15] over Classes 1-4 next to a one-hot target [0, 0, 1, 0].)
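A minimal sketch contrasting the two target types, assuming PyTorch; the logits and targets are hypothetical values matching the bars above.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-class example.
logits = torch.tensor([[1.0, 0.3, 2.5, 0.8]])            # student logits
hard_target = torch.tensor([2])                           # "hard target": index of the correct class
soft_target = torch.tensor([[0.10, 0.05, 0.70, 0.15]])    # "soft target": teacher probabilities

# Hard target: standard cross-entropy against the class index.
hard_loss = F.cross_entropy(logits, hard_target)

# Soft target: cross-entropy against the full distribution
# (equivalent to KL divergence up to a constant that depends only on the targets).
soft_loss = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(hard_loss.item(), soft_loss.item())
```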

Slide 7

Slide 7 text

Overview: Main Concept
• Dataset
  • Labelled Dataset: labelled directly by people → hard to obtain in large quantities; consists of (input, label) pairs built for supervised learning
  • Unlabelled Dataset: a dataset without labels, e.g. the Pingpong corpus; used for unsupervised learning, e.g. Word2Vec, Autoencoder, BERT pre-training; easy to collect
• Model
  • Teacher Model: the model that hands over the knowledge (the most accurate model); typically has many parameters and the best-performing architecture (ensembles are also commonly used)
  • Student Model: the model that receives the knowledge; typically one that can actually be served (memory, latency), i.e. few parameters and a structure that parallelizes well and runs fast, such as a CNN → in practice memory/latency are relative (fast on a server does not mean fast on mobile)

Slide 8

Slide 8 text

Overview: Main Concept (Machine Learning, a data-driven approach)
• Unlabelled Data: easy to collect, plentiful (e.g. Wiki, Google Image)
• Labelled Data: hard to collect, scarce (e.g. NLI, STS)

Slide 9

Slide 9 text

Overview: Main Concept
• Unlabelled Data: easy to collect, plentiful (e.g. Wiki, Google Image)
• Labelled Data: hard to collect, scarce (e.g. NLI, STS)
• Transfer Data: extracted from the unlabelled corpus by the Teacher Model; relatively plentiful; its distribution may differ from the labelled data, so ideally it should have a similar distribution

Slide 10

Slide 10 text

Overview: Process
• Teacher Training (on Labeled Data): many parameters, no concern for latency/memory, the best-performing SOTA model, trained with one-hot labels
• Make Transfer Data (Unlabeled Data → Transfer Data): generated from the Teacher model; not perfectly accurate; hard to build for tasks like NLI
• Student Training (on Transfer Data): few parameters, reasonable memory/latency, trained using the Teacher's outputs
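A minimal sketch of this three-stage pipeline, assuming PyTorch; `teacher`, `student`, the optimizers, and the data loaders are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def train_teacher(teacher, labeled_loader, optimizer):
    # 1) Teacher training: large model, one-hot labels, standard cross-entropy.
    teacher.train()
    for x, y in labeled_loader:
        loss = F.cross_entropy(teacher(x), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

def make_transfer_data(teacher, unlabeled_loader):
    # 2) Make transfer data: label the unlabeled corpus with the teacher's soft outputs.
    teacher.eval()
    transfer = []
    with torch.no_grad():
        for x in unlabeled_loader:
            transfer.append((x, F.softmax(teacher(x), dim=-1)))
    return transfer

def train_student(student, transfer_data, optimizer):
    # 3) Student training: small model, KL divergence against the teacher's distribution.
    student.train()
    for x, soft_target in transfer_data:
        log_q = F.log_softmax(student(x), dim=-1)
        loss = F.kl_div(log_q, soft_target, reduction="batchmean")
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```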

Slide 11

Slide 11 text

Distilling the Knowledge in a Neural Network Hinton et al., 2015 Overview

Slide 12

Slide 12 text

Overview: Distilling the Knowledge in a Neural Network
• First introduces distillation for neural networks
• A neural network is trained to assign the largest probability to the correct class
• It also assigns small probabilities to the incorrect classes
• Most of these are very small, but some are larger than others
• This relative probability distribution → "how the model tends to generalize"

Slide 13

Slide 13 text

Overview: Distilling the Knowledge in a Neural Network
• First introduces distillation for neural networks
• A neural network is trained to assign the largest probability to the correct class
• It also assigns small probabilities to the incorrect classes
• Most of these are very small, but some are larger than others
• This relative probability distribution → "how the model tends to generalize"
(Bar chart: a classifier's probabilities over Car, Garbage Truck, Bus, and Carrot; the incorrect classes receive small but unequal probabilities, and this relative structure is the "Knowledge!")

Slide 14

Slide 14 text

Overview: Distilling Method
• Train on the "Teacher's output" as the "label"!
• Cross-Entropy: H(p, q) = −∑_x p(x) log q(x), where p is the label (after softmax) and q is the model output (after softmax)
• Mean Squared Error: MSE(p, q) = ∑_i (p_i − q_i)², where p is the label (logits) and q is the model output (logits)
• The "Teacher's output" and the "true label" are sometimes used together
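A minimal sketch of the two loss choices and of mixing in the true label, assuming PyTorch; the logits, label, and weight `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-class problem.
teacher_logits = torch.tensor([[2.0, 0.5, 4.0, 1.0]])
student_logits = torch.tensor([[1.0, 0.3, 2.5, 0.8]])
true_label = torch.tensor([2])

# Option 1: cross-entropy between softmax outputs ("Teacher's output" used as label).
p = F.softmax(teacher_logits, dim=-1)
log_q = F.log_softmax(student_logits, dim=-1)
ce_loss = -(p * log_q).sum(dim=-1).mean()

# Option 2: mean squared error directly on the logits.
mse_loss = ((teacher_logits - student_logits) ** 2).sum(dim=-1).mean()

# Optionally combine the soft loss with the true-label loss (alpha is a hypothetical weight).
alpha = 0.5
combined = alpha * F.cross_entropy(student_logits, true_label) + (1 - alpha) * ce_loss
```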

Slide 15

Slide 15 text

Overview: Main Idea
• Temperature term
  • A well-trained model → "high confidence (≈ 1)" on the correct class
  • The soft targets can then carry almost no information
  • Smoothing before the softmax function → reasonable "soft targets"
• H(p, q) = −∑_x p(x) log q(x), where p is the label and q is the model output

Slide 16

Slide 16 text

Overview: Main Idea
• Temperature term
  • A well-trained model → "high confidence (≈ 1)" on the correct class
  • The soft targets can then carry almost no information
  • Smoothing before the softmax function → reasonable "soft targets"
• Standard softmax: q_i = exp(z_i) / ∑_j exp(z_j)
• Softmax with temperature: q_i = exp(z_i / T) / ∑_j exp(z_j / T), where T is the temperature
• H(p, q) = −∑_x p(x) log q(x), where p is the label and q is the model output
(Bar charts comparing a peaked distribution [0.01, 0.02, 0.9, 0.07] with a smoother one [0.1, 0.05, 0.7, 0.15] over Classes 1-4.)
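A minimal sketch of the temperature-scaled softmax above, assuming PyTorch; the logits and temperature values are hypothetical.

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Divide the logits by the temperature T before the softmax; T > 1 smooths the distribution.
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([2.0, 0.5, 4.0, 1.0])     # hypothetical logits
print(softmax_with_temperature(logits, T=1.0))  # peaked, close to one-hot
print(softmax_with_temperature(logits, T=4.0))  # smoother "soft target"
```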

Slide 17

Slide 17 text

Overview: Experiments (a meaningful experiment)
• MNIST: images of the digits 0-9 → predict the digit
• Experiment
  • Examples whose correct label is "3" are removed from the training set
  • "3" only ever appears inside the distillation labels of digits such as "2" or "8"
  • Test accuracy: 877/1010 = 86%

Slide 18

Slide 18 text

Recent Research Distillation on “NLP”! Recent Research

Slide 19

Slide 19 text

Transformer to CNN: Label-scarce distillation for efficient text classification Chia et al., 2018 NIPS workshop Recent Research

Slide 20

Slide 20 text

Recent Research: Main Idea
• Transformer (GPT) → Simple CNN

Slide 21

Slide 21 text

Recent Research: Experiments
• Environment
  • 3 architectures: Bi-LSTM, KimCNN (Char-CNN), BlendCNN
  • Text classification tasks: AG News (4 classes), DBpedia (10 classes), Yahoo Answers (10 classes)
• ~300x speed-up!!

Slide 22

Slide 22 text

Recent Research: Experiments, Result
• Environment
  • 3 architectures: Bi-LSTM, KimCNN (Char-CNN), BlendCNN
  • Text classification tasks: AG News (4 classes), DBpedia (10 classes), Yahoo Answers (10 classes)
• ~300x speed-up!!
(Results table comparing training on Labelled Data vs Transfer Data.)

Slide 23

Slide 23 text

Recent Research: Main Contribution
Confirmed the feasibility of BERT distillation! (though the rigor of the experiments is...)

Slide 24

Slide 24 text

BAM! Born-Again Multi-Task Networks for Natural Language Understanding Clark et al., 2019 arXiv Recent Research

Slide 25

Slide 25 text

Recent Research Main Idea • Multi-Task Learning + Distillation

Slide 26

Slide 26 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Training-phase diagram: an input batch of (task, input) pairs; each input sentence passes through the BERT Encoder and the Task 1/2/3 Classifiers to produce a loss.)

Slide 27

Slide 27 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Training-phase diagram: a Task 1 input sentence passes through the BERT Encoder and the Task 1 Classifier; accumulated loss = Task 1 loss.)

Slide 28

Slide 28 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Training-phase diagram: a Task 2 input sentence passes through the BERT Encoder and the Task 2 Classifier; accumulated loss = Task 1 loss + Task 2 loss.)

Slide 29

Slide 29 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Training-phase diagram: a Task 3 input sentence passes through the BERT Encoder and the Task 3 Classifier; accumulated loss = Task 1 loss + Task 2 loss + Task 3 loss.)

Slide 30

Slide 30 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Training-phase diagram: once the whole batch has been processed, the accumulated loss (Task 1 + Task 2 + Task 3) is optimized. A sketch of this setup follows below.)
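A minimal sketch of the shared-encoder, per-task-classifier setup from these slides, assuming PyTorch and Hugging Face Transformers; the task names, label counts, and batch layout are hypothetical.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    def __init__(self, num_labels_per_task: dict, hidden_size: int = 768):
        super().__init__()
        # Shared encoder for every task.
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        # One classification head per task.
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n) for task, n in num_labels_per_task.items()
        })

    def forward(self, task: str, input_ids, attention_mask):
        # Encode once, then route through the head for the given task.
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.heads[task](pooled)

model = MultiTaskBert({"task1": 3, "task2": 2, "task3": 5})
loss_fn = nn.CrossEntropyLoss()
# Usage idea: accumulate each task's loss over a mixed batch, then optimize once.
# batch = [(task, input_ids, attention_mask, label), ...]   # hypothetical mixed-task batch
# total_loss = sum(loss_fn(model(t, ids, mask), y) for t, ids, mask, y in batch)
# total_loss.backward(); optimizer.step()
```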

Slide 31

Slide 31 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
(Inference-phase diagram: an input sentence is encoded once by the BERT Encoder and then passed to the Task 1/2/3 Classifiers to produce the Task 1/2/3 outputs.)

Slide 32

Slide 32 text

Recent Research: Main Idea
• Multi-Task Learning (with BERT)
  • Multiple tasks with a single model
  • A shared Encoder + a Classifier per task
  • Advantages
    • A single sentence → one-time inference!
    • Suits related tasks such as Intent, DA, Sentiment
    • Training robustness
(Inference-phase diagram: an input sentence is encoded once by the BERT Encoder and then passed to the Task 1/2/3 Classifiers to produce the Task 1/2/3 outputs.)

Slide 33

Slide 33 text

Recent Research: Main Idea
• Teacher Annealing
L(θ) = ∑_{τ ∈ T} ∑_{(x_i^τ, y_i^τ) ∈ D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
where ℓ is a loss function, e.g. cross-entropy or MSE

Slide 34

Slide 34 text

Recent Research: Main Idea
• Teacher Annealing
L(θ) = ∑_{τ ∈ T} ∑_{(x_i^τ, y_i^τ) ∈ D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
where τ ranges over the tasks T and D_τ is the dataset for task τ

Slide 35

Slide 35 text

Recent Research: Main Idea
• Teacher Annealing
L(θ) = ∑_{τ ∈ T} ∑_{(x_i^τ, y_i^τ) ∈ D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
where f_τ(x_i^τ, θ_τ) is the teacher's result (used as the label), f_τ(x_i^τ, θ) is the student's result, and the two share the same architecture

Slide 36

Slide 36 text

Recent Research: Main Idea
• Teacher Annealing
L(θ) = ∑_{τ ∈ T} ∑_{(x_i^τ, y_i^τ) ∈ D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
Teacher annealing replaces the target with a mixture: ℓ(λ y_i^τ + (1 − λ) f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
• λ controls the ratio between the true label and the Teacher's output
• The true label is required! → a transfer dataset cannot be used
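A minimal sketch of the annealed target above, assuming PyTorch and a linear schedule for λ; the tensors and the schedule are hypothetical.

```python
import torch
import torch.nn.functional as F

def teacher_annealing_target(true_onehot: torch.Tensor,
                             teacher_probs: torch.Tensor,
                             step: int, total_steps: int) -> torch.Tensor:
    # lambda is annealed from 0 to 1 over training: early on the target is mostly
    # the teacher's output, later it is mostly the true label.
    lam = step / total_steps
    return lam * true_onehot + (1 - lam) * teacher_probs

# Hypothetical 4-class example.
true_onehot = torch.tensor([[0.0, 0.0, 1.0, 0.0]])
teacher_probs = torch.tensor([[0.10, 0.05, 0.70, 0.15]])
target = teacher_annealing_target(true_onehot, teacher_probs, step=100, total_steps=1000)

student_logits = torch.tensor([[1.0, 0.3, 2.5, 0.8]])
loss = -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
```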

Slide 37

Slide 37 text

Recent Research: Main Idea
• Model & training parameters
  • Large LR: 1e-4
  • Task-weighted sampling (multi-task)
  • Layer-wise LR
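A minimal sketch of layer-wise learning rates via optimizer parameter groups, assuming PyTorch and a BERT-style `model.encoder.layer` list; the base rate and decay factor are hypothetical.

```python
import torch

def layerwise_param_groups(model, base_lr: float = 1e-4, decay: float = 0.9):
    # Lower layers get smaller learning rates:
    # lr_i = base_lr * decay^(num_layers - 1 - i), so the top layer keeps base_lr.
    layers = list(model.encoder.layer)          # assumes a BERT-style module layout
    groups = []
    for i, layer in enumerate(layers):
        lr = base_lr * (decay ** (len(layers) - 1 - i))
        groups.append({"params": layer.parameters(), "lr": lr})
    return groups

# optimizer = torch.optim.AdamW(layerwise_param_groups(bert_model), lr=1e-4)
```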

Slide 38

Slide 38 text

Recent Research: Experiments
• Environment
  • GLUE: an NLU benchmark
• Experiment
  • Runs experiments over various combinations of multi-task learning and distillation
  • Demonstrates the effect of teacher annealing through an ablation study

Slide 39

Slide 39 text

Recent Research: Main Contribution
Multi-Task Learning + Teacher Annealing

Slide 40

Slide 40 text

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation Turc et al., 2019 arXiv Recent Research

Slide 41

Slide 41 text

Recent Research: Main Idea
• Distillation of BERT at various sizes (Google-scale compute..)
• BERT → BERT distillation
• Distillation over various hidden sizes and layer depths (24 models)
(Figures: memory and speed for the different model sizes.)

Slide 42

Slide 42 text

Recent Research: Main Idea
• A distillation recipe tailored to BERT
  1. Student model pre-training (initialization)
  2. Distillation using the transfer dataset
  3. Fine-tuning using the labelled dataset

Slide 43

Slide 43 text

Recent Research: Experiments
• Environment
  • Text classification tasks (single sentence and sentence pair)
  • SST, Book Review (single); MNLI, RTE (pair)
• Experiment: four methods compared
  • #1 Basic Training: train directly, without pre-training
  • #2 Distillation: train with distillation only
  • #3 Pre-training + Fine-Tuning: the standard BERT recipe
  • #4 Pre-training + Distillation + (Fine-Tuning)

Slide 44

Slide 44 text

Recent Research: Experiments
• Environment
  • Text classification tasks (single sentence and sentence pair)
  • SST, Book Review (single); MNLI, RTE (pair)
• Experiment: four methods compared
  • #1 Basic Training: train directly, without pre-training
  • #2 Distillation: train with distillation only
  • #3 Pre-training + Fine-Tuning: the standard BERT recipe
  • #4 Pre-training + Distillation + (Fine-Tuning)
• #4 gives the best performance on every task

Slide 45

Slide 45 text

Recent Research: Experiments
• Environment
  • Text classification tasks (single sentence and sentence pair)
  • SST, Book Review (single); MNLI, RTE (pair)
• Experiment: four methods compared
  • #1 Basic Training: train directly, without pre-training
  • #2 Distillation: train with distillation only
  • #3 Pre-training + Fine-Tuning: the standard BERT recipe
  • #4 Pre-training + Distillation + (Fine-Tuning)
• Results by number of layers and by hidden size
  • Up to 8 layers, the effect of hidden size is much smaller than that of the number of layers
  • Intermediate layer size = hidden size * 4; number of attention heads = hidden size / 64
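A small sketch of how the derived dimensions follow from the hidden size, using the two formulas on this slide; the particular 24-model grid of depths and widths is illustrative.

```python
# Derive per-model dimensions from the hidden size:
# intermediate size = 4 * hidden, attention heads = hidden / 64.
def student_config(num_layers: int, hidden_size: int) -> dict:
    return {
        "num_hidden_layers": num_layers,
        "hidden_size": hidden_size,
        "intermediate_size": 4 * hidden_size,
        "num_attention_heads": hidden_size // 64,
    }

# An illustrative 24-model grid of depths and widths.
configs = [student_config(l, h) for l in (2, 4, 6, 8, 10, 12) for h in (128, 256, 512, 768)]
print(configs[0])  # {'num_hidden_layers': 2, 'hidden_size': 128, 'intermediate_size': 512, 'num_attention_heads': 2}
```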

Slide 46

Slide 46 text

Recent Research: Main Contribution
BERT → BERT distillation; the relation between model size and performance

Slide 47

Slide 47 text

Patient Knowledge Distillation for BERT Model Compression Sun et al., 2019 EMNLP Recent Research

Slide 48

Slide 48 text

Recent Research: Main Idea
• Patient Distillation
  • The student follows the Teacher on both the output and the intermediate layers
  • Uses the [CLS] token of the intermediate layers
  • Assumes the [CLS] token carries much of the information the classifier uses
  • However, the Teacher's and Student's hidden sizes must match, which is a big constraint from a speed perspective
  • The student is initialized with the bottom k layers of the Teacher
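A minimal sketch of a patient-style loss over intermediate [CLS] hidden states, assuming PyTorch; the tensor shapes, layer mapping, and normalization choice are hypothetical, and in the paper this term is combined with the usual distillation and cross-entropy losses.

```python
import torch
import torch.nn.functional as F

def patient_loss(student_cls: torch.Tensor,
                 teacher_cls: torch.Tensor,
                 layer_map: list) -> torch.Tensor:
    """MSE between normalized [CLS] hidden states of matched intermediate layers.

    student_cls: (batch, num_student_layers, hidden)
    teacher_cls: (batch, num_teacher_layers, hidden)
    layer_map:   for each student layer, the index of the teacher layer to match.
    """
    loss = 0.0
    for s_idx, t_idx in enumerate(layer_map):
        s = F.normalize(student_cls[:, s_idx], dim=-1)
        t = F.normalize(teacher_cls[:, t_idx], dim=-1)
        loss = loss + ((s - t) ** 2).sum(dim=-1).mean()
    return loss / len(layer_map)

# Hypothetical shapes: a 6-layer student matched to every other layer of a 12-layer teacher.
student_cls = torch.randn(8, 6, 768)
teacher_cls = torch.randn(8, 12, 768)
loss = patient_loss(student_cls, teacher_cls, layer_map=[1, 3, 5, 7, 9, 11])
```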

Slide 49

Slide 49 text

Recent Research: Experiments
• Dataset
  • SST, MRPC, QQP, MNLI, QNLI, RTE
• Experiment
  • 2 models (3-layer, 6-layer) and 3 methods (FT, KD, PKD)
  • #1 FT (Fine-Tuning): initialize with the bottom k layers, then fine-tune
  • #2 KD (Knowledge Distillation): standard KD (uses only the output)
  • #3 PKD (Patient Knowledge Distillation): uses the intermediate layers + the output

Slide 50

Slide 50 text

Recent Research: Experiments
• Dataset
  • SST, MRPC, QQP, MNLI, QNLI, RTE
• Experiment
  • 2 models (3-layer, 6-layer) and 3 methods (FT, KD, PKD)
  • #1 FT (Fine-Tuning): initialize with the bottom k layers, then fine-tune
  • #2 KD (Knowledge Distillation): standard KD (uses only the output)
  • #3 PKD (Patient Knowledge Distillation): uses the intermediate layers + the output
• Compared with the earlier papers, the numbers are slightly lower across the board

Slide 51

Slide 51 text

Recent Research: Main Contribution
Proposes Patient Distillation, a new distillation method

Slide 52

Slide 52 text

Thank you! ✌ If you have further questions or anything you are curious about, feel free to reach out anytime via the contacts below!
Yeongmin Baek (Machine Learning Engineer, Pingpong)
Email. yeongmin.baek@scatterlab.co.kr
Facebook. bym0313