
Overview and Recent Research in Distillation

Scatter Lab Inc.
September 04, 2019

Transcript

  1. Table of Contents
     1. Overview
        1. What is Distillation?
        2. Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
     2. Recent Research
        1. Transformer to CNN: Label-scarce distillation for efficient text classification (Chia et al., 2018 NIPS Workshop)
        2. BAM! Born-Again Multi-Task Networks for Natural Language Understanding (Clark et al., 2019 arXiv)
        3. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation (Turc et al., 2019 arXiv)
        4. Patient Knowledge Distillation for BERT Model Compression (Sun et al., 2019 EMNLP)
  2. Overview: What is Distillation?
     • Teacher Model: many parameters, a large, well-trained model
     • Student Model: fewer parameters, a small model
     • Knowledge is transferred from the teacher to the student
  3. Overview: One-hot vs. Continuous
     • Original classification
       • One-hot label: assign 1 to the correct class → "hard target"
       • Loss: cross-entropy
     • Distillation
       • Continuous label: use the model's output as the label → "soft target"
       • Loss: cross-entropy, KL divergence, MSE
     [Bar charts contrasting a one-hot distribution (probability 1 on the correct class) with a soft distribution (e.g., 0.7 on the correct class and small probabilities elsewhere)]
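     A minimal sketch of the two kinds of targets, assuming PyTorch (not part of the deck): a hard target is a one-hot vector built from the gold class index, while a soft target is a trained model's softmax output.

         import torch
         import torch.nn.functional as F

         num_classes = 4
         gold = torch.tensor([2])  # index of the correct class

         # Hard target: one-hot vector with 1 at the correct class
         hard_target = F.one_hot(gold, num_classes).float()  # [[0., 0., 1., 0.]]

         # Soft target: a trained teacher's output distribution over all classes
         teacher_logits = torch.tensor([[0.5, 1.0, 4.0, 1.5]])
         soft_target = F.softmax(teacher_logits, dim=-1)  # approx. [[0.03, 0.04, 0.86, 0.07]]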
  4. Overview: Main Concept
     Dataset
     • Labelled Dataset
       • Labelled directly by humans → hard to obtain in large quantities
       • (input, label) pairs built for supervised learning
     • Unlabelled Dataset
       • Data without labels, e.g., the Pingpong corpus
       • Used for unsupervised learning, e.g., Word2Vec, Autoencoder, BERT pre-training
       • Easy to collect
     Model
     • Teacher Model
       • The model whose knowledge is transferred (the most accurate model available)
       • Typically has many parameters and the best-performing architecture (ensembles are often used)
     • Student Model
       • The model that receives the knowledge
       • Typically a model that can actually be served (memory, latency): few parameters and a fast, well-parallelized architecture such as a CNN
       • In practice memory/latency are relative: fast on a server does not mean fast on mobile
  5. Overview: Main Concept
     • Machine Learning: a data-driven approach
     • Unlabelled Data: easy to collect, plentiful (Wiki, Google Image, etc.)
     • Labelled Data: hard to collect, scarce (NLI, STS, etc.)
  6. Overview: Main Concept
     • Unlabelled Data: easy to collect, plentiful (Wiki, Google Image, etc.)
     • Labelled Data: hard to collect, scarce (NLI, STS, etc.)
     • Transfer Data: extracted from an unlabelled corpus with the Teacher Model
       • Relatively plentiful
       • Its distribution may differ from the labelled data; if possible it should have a similar distribution
  7. Overview: Process
     • Teacher Training (on Labeled Data)
       • Many parameters; no concern for latency/memory; the best-performing SOTA model
       • Trained with one-hot labels
     • Make Transfer Data (Unlabeled Data → Transfer Data)
       • Generated from the Teacher model, so not perfectly accurate
       • Hard to build for tasks such as NLI
     • Student Training (on Transfer Data)
       • Few parameters; reasonable memory/latency
       • Trained on the Teacher's outputs
  8. Overview: Distilling the Knowledge in a Neural Network
     • First introduced distillation for neural networks
     • A neural network is trained to assign the highest probability to the correct class
     • It also assigns small probabilities to the incorrect classes
       • Most are tiny, but some are clearly larger than others
     • This relative probability distribution reflects "how the model tends to generalize"
  9. Overview: Distilling the Knowledge in a Neural Network
     (Same points as the previous slide, illustrated with a classifier's output)
     [Bar chart of a classifier's probabilities, e.g., Car 75, Garbage Truck 7, Bus 4, Carrot 0.1: the small but unequal probabilities on the wrong classes are the "Knowledge!"]
  10. Overview: Distilling Method
      • Train on the "Teacher's output" as the "label"!
      • Cross-entropy on post-softmax outputs: H(p, q) = −Σ_x p(x) log q(x), where p is the label (after softmax) and q is the model output (after softmax)
      • Mean squared error on logits: L(p, q) = Σ_i (p_i − q_i)², where p_i are the label logits and q_i the model output logits
      • The "Teacher's output" and the "true label" are sometimes used at the same time
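      A hedged sketch of the loss choices above, assuming PyTorch (the mixing weight alpha is illustrative, not from the deck):

          import torch
          import torch.nn.functional as F

          student_logits = torch.randn(8, 4)  # batch of 8, 4 classes
          teacher_logits = torch.randn(8, 4)
          true_labels = torch.randint(0, 4, (8,))

          # Cross-entropy with the teacher's post-softmax output as the label:
          # H(p, q) = -sum_x p(x) log q(x)
          p = F.softmax(teacher_logits, dim=-1)
          ce_soft = -(p * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

          # Mean squared error directly between the (pre-softmax) logits
          mse = F.mse_loss(student_logits, teacher_logits)

          # Using the teacher's output and the true label at the same time
          alpha = 0.5
          loss = alpha * F.cross_entropy(student_logits, true_labels) + (1 - alpha) * ce_soft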
  11. Overview: Main Idea (Temperature Term)
      • A well-trained model puts "high confidence (≈ 1)" on the correct class
      • Such soft targets can carry almost no information
      • Smoothing before the softmax function gives a useful "soft target"
      • H(p, q) = −Σ_x p(x) log q(x), with p the label and q the model output
  12. Overview: Main Idea (Temperature Term)
      • Standard softmax: q_i = exp(z_i) / Σ_j exp(z_j)
      • Temperature softmax: q_i = exp(z_i / T) / Σ_j exp(z_j / T), where T is the temperature
      [Bar charts: a near-one-hot distribution (e.g., 0.9 on the correct class) vs. a smoothed distribution (e.g., 0.7 on the correct class)]
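      The two formulas above as a small runnable sketch (PyTorch assumed): dividing the logits by a temperature T > 1 flattens the distribution, so the soft target says more about the non-argmax classes.

          import torch
          import torch.nn.functional as F

          logits = torch.tensor([0.5, 1.0, 4.0, 1.5])

          # q_i = exp(z_i) / sum_j exp(z_j)  -- T = 1, near one-hot for a confident model
          print(F.softmax(logits, dim=-1))  # approx. [0.026, 0.043, 0.861, 0.071]

          # q_i = exp(z_i / T) / sum_j exp(z_j / T)  -- T > 1 smooths the distribution
          T = 4.0
          print(F.softmax(logits / T, dim=-1))  # approx. [0.172, 0.195, 0.412, 0.221]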
  13. Overview: Experiments
      • A meaningful experiment on MNIST (predicting the digit 0-9 from an image)
      • All examples labelled "3" were removed from the training set
      • "3" appears only in the distillation labels of other digits such as "2" and "8"
      • Test accuracy on the held-out digit: 877/1010 ≈ 86.8%
  14. Recent Research: Transformer to CNN (Chia et al., 2018)
      Experiments Environment
      • 3 architectures: Bi-LSTM, KimCNN (Char-CNN), BlendCNN
      • Text classification tasks: AG News (4 classes), DBpedia (10 classes), Yahoo Answers (10 classes)
      • ~300x speed-up!
  15. Recent Research: Results
      • Same environment: Bi-LSTM, KimCNN (Char-CNN), BlendCNN on AG News (4 classes), DBpedia (10 classes), Yahoo Answers (10 classes); ~300x speed-up
      • Compares training on Labelled Data vs. Transfer Data
  16. Recent Research: BAM! Main Idea (Training Phase)
      • Multi-Task Learning (with BERT)
      • One model handles multiple tasks: a shared BERT encoder + one classifier per task
      • An input batch mixes (task, input) pairs across Task 1/2/3
  17. (Training phase, continued) A Task 1 sentence passes through the shared encoder and the Task 1 classifier; loss so far: Task 1 loss
  18. (Training phase, continued) A Task 2 sentence adds its loss; loss so far: Task 1 loss + Task 2 loss
  19. (Training phase, continued) A Task 3 sentence adds its loss; loss so far: Task 1 loss + Task 2 loss + Task 3 loss
  20. (Training phase, continued) The summed loss (Task 1 + Task 2 + Task 3) is optimized jointly over the shared encoder and all task classifiers
  21. Recent Research: BAM! Main Idea (Inference Phase)
      • The input sentence is encoded once by the shared BERT encoder
      • Every task classifier reads that same encoding, giving Task 1/2/3 outputs in one pass
  22. Recent Research: BAM! Main Idea (Inference Phase)
      • Advantages (see the sketch below):
        • A single sentence needs only one-time inference!
        • Fits "related" tasks such as intent, dialogue act (DA), and sentiment
        • Training robustness
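      A minimal sketch of this architecture, assuming PyTorch; the linear "encoder" stands in for BERT and all names and sizes are illustrative.

          import torch
          import torch.nn as nn

          class MultiTaskModel(nn.Module):
              """Shared encoder + one classifier head per task."""
              def __init__(self, encoder, hidden_size, num_classes_per_task):
                  super().__init__()
                  self.encoder = encoder  # shared across tasks (BERT in the paper)
                  self.heads = nn.ModuleList(
                      [nn.Linear(hidden_size, n) for n in num_classes_per_task]
                  )

              def forward(self, x, task_id):
                  h = self.encoder(x)            # encode the input sentence once...
                  return self.heads[task_id](h)  # ...then route it to a task head

          # Toy stand-in encoder; a real setup would plug in a BERT encoder
          model = MultiTaskModel(nn.Linear(128, 768), 768, [3, 5, 2])
          logits = model(torch.randn(8, 128), task_id=1)  # (8, 5) logits for Task 2

      At inference the encoder runs once per sentence and all heads can read the same encoding, which is the one-time-inference advantage above.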
  23. Recent Research: BAM! Main Idea (Teacher Annealing)
      • Distillation loss, where ℓ is a loss function (e.g., cross-entropy or MSE):
        L(θ) = Σ_{τ∈T} Σ_{(x_i^τ, y_i^τ)∈D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
  24. Recent Research: BAM! Main Idea (Teacher Annealing)
      • In the loss above, τ is a task and D_τ is the dataset of task τ:
        L(θ) = Σ_{τ∈T} Σ_{(x_i^τ, y_i^τ)∈D_τ} ℓ(f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
  25. Recent Research: BAM! Main Idea (Teacher Annealing)
      • f_τ(x_i^τ, θ_τ) is the teacher's result (used as the label) and f_τ(x_i^τ, θ) is the student's result; teacher and student share the same architecture
  26. Recent Research: BAM! Main Idea (Teacher Annealing)
      • Replace the teacher-only target with a λ-weighted mix of the true label and the teacher's output:
        ℓ(λ y_i^τ + (1 − λ) f_τ(x_i^τ, θ_τ), f_τ(x_i^τ, θ))
      • λ sets the ratio of true label to teacher output in the target
      • True labels are required! → a transfer dataset cannot be used
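      A sketch of the annealed target (PyTorch assumed; BAM increases λ from 0 to 1 over training, so the student starts from the teacher's predictions and ends on the gold labels):

          import torch
          import torch.nn.functional as F

          def annealed_target(true_labels, teacher_logits, step, total_steps):
              # lambda goes 0 -> 1 over training: teacher first, gold labels last
              lam = step / total_steps
              one_hot = F.one_hot(true_labels, teacher_logits.size(-1)).float()
              teacher_probs = F.softmax(teacher_logits, dim=-1)
              return lam * one_hot + (1 - lam) * teacher_probs

          # The distillation loss is then, e.g., cross-entropy against this mixed target
          target = annealed_target(torch.tensor([2, 0]), torch.randn(2, 4),
                                   step=100, total_steps=1000)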
  27. Recent Research: Main Idea (Model & Training Parameters)
      • Large learning rate: 1e-4
      • Task-weighted sampling (multi-task)
      • Layer-wise learning rates
  28. Recent Research: Experiments Environment
      • GLUE, an NLU benchmark
      • Experiments combine multi-task learning and distillation in various ways
      • An ablation study demonstrates the effect of teacher annealing
  29. Recent Research: Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation (Turc et al., 2019 arXiv)
  30. Recent Research: Main Idea
      • Distillation across various sizes of BERT (Google-scale compute..)
      • BERT → BERT distillation
      • Distillation over various hidden sizes and layer depths (24 models), trading off memory and speed
  31. Recent Research: Main Idea
      • A distillation recipe tailored to BERT (outlined below):
        1. Student model pre-training (initialization)
        2. Distillation using a transfer dataset
        3. Fine-tuning using the labelled dataset
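      An outline of the three-step recipe as hedged pseudocode; every function below is a no-op stub standing in for a full training loop, and none of the names come from the paper.

          def pretrain_student(student, unlabeled_corpus):  # step 1: masked-LM init
              ...

          def distill(student, teacher, transfer_data):     # step 2: match teacher outputs
              ...

          def fine_tune(student, labeled_data):             # step 3: supervised training
              ...

          def pretrained_distillation(student, teacher, corpus, transfer, labeled):
              pretrain_student(student, corpus)    # 1. pre-training (initialization)
              distill(student, teacher, transfer)  # 2. distillation on the transfer set
              fine_tune(student, labeled)          # 3. fine-tuning on the labelled set
              return student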
  32. Recent Research: Experiments Environment
      • Text classification tasks (single and pair): SST, Book Review (single); MNLI, RTE (pair)
      • Experiments with 4 methods:
        • #1 Basic Training: direct training without pre-training
        • #2 Distillation: distillation only
        • #3 Pre-training + Fine-Tuning: the standard BERT recipe
        • #4 Pre-training + Distillation + (Fine-Tuning)
  33. (Same setup as the previous slide) Method #4 performs best on every task
  34. (Same setup as the previous slide) Results by number of layers and by hidden size:
      • Up to 8 layers, the number of layers matters far more than the hidden size
      • Architecture scaling: intermediate layer size = hidden size × 4; attention heads = hidden size / 64
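      The scaling rules quoted above, as a quick calculation (following the usual BERT convention of 64-dimensional attention heads):

          # BERT-style scaling across the paper's model sizes:
          # FFN (intermediate) size = 4 * hidden, attention heads = hidden / 64
          for hidden in [128, 256, 512, 768]:
              print(hidden, "-> intermediate:", hidden * 4, "heads:", hidden // 64)
          # 128 -> intermediate: 512 heads: 2
          # 256 -> intermediate: 1024 heads: 4
          # 512 -> intermediate: 2048 heads: 8
          # 768 -> intermediate: 3072 heads: 12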
  35. Recent Research: Main Idea (Patient Distillation)
      • Make the student follow the teacher's intermediate layers as well as its output
      • Uses the [CLS] token of the intermediate layers
      • Assumes the [CLS] token carries much of the information the classifier needs
      • However, the teacher's and student's hidden sizes must match, a major constraint when the goal is speed
      • The student is initialized with the teacher's bottom k layers
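      A hedged sketch of the patient loss (PyTorch assumed): the student's intermediate [CLS] vectors are pulled toward the teacher's at a chosen subset of layers via an MSE between normalized vectors, which is why the hidden sizes must agree.

          import torch
          import torch.nn.functional as F

          def patient_loss(student_cls, teacher_cls):
              # student_cls / teacher_cls: lists of (batch, hidden) [CLS] tensors,
              # one per matched layer; hidden sizes must be identical
              loss = 0.0
              for hs, ht in zip(student_cls, teacher_cls):
                  loss = loss + F.mse_loss(F.normalize(hs, dim=-1),
                                           F.normalize(ht, dim=-1))
              return loss

          # e.g. a 3-layer student matched to 3 of the teacher's 12 layers
          student_cls = [torch.randn(8, 768) for _ in range(3)]
          teacher_cls = [torch.randn(8, 768) for _ in range(3)]
          loss = patient_loss(student_cls, teacher_cls)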
  36. Recent Research: Experiments Environment
      • Datasets: SST, MRPC, QQP, MNLI, QNLI, RTE
      • Experiments with 2 models (3-layer, 6-layer) and 3 methods (FT, KD, PKD):
        • #1 FT (Fine-Tuning): initialize with the teacher's bottom k layers, then fine-tune
        • #2 KD (Knowledge Distillation): standard KD (output only)
        • #3 PKD (Patient Knowledge Distillation): intermediate layers + output
  37. (Same setup as the previous slide) Compared with earlier papers, the reported numbers are slightly lower overall
  38. Thank you ✌ If you have more questions or anything you are curious about, feel free to reach out anytime!
      Youngmin Baek (Machine Learning Engineer, Pingpong)
      [email protected]
      Facebook: bym0313