
Developing a high quality language model for Japanese


LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda
     › Why LINE builds language models
     › What makes a high-quality language model
     › Introducing the Japanese language model created at LINE
  2. Global trend
     › Pretrained language models have become the de facto standard in NLP research
     › 2018 to 2020: ELMo, GPT, BERT, XLNet, T5, GPT-3
     › Pretraining a language model + fine-tuning on the task => SOTA
     › Competition on pretrained language models
  3. Global trend (continued)
     › A high-quality language model can improve many downstream tasks: sentiment analysis, question answering, search ranking, and more (the pretrain + fine-tune recipe is sketched below)
     › Increasing adoption in production (Google, Microsoft, etc.)
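As a concrete illustration of the pretrain + fine-tune recipe, here is a minimal sketch using the Hugging Face transformers API. The checkpoint, the two-example dataset, and the hyperparameters are illustrative placeholders, not what LINE actually used.

```python
# Minimal sketch of "pretraining + fine-tuning": load pretrained
# weights, then train a small task head on labeled data. The checkpoint
# and toy dataset are illustrative placeholders only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

checkpoint = "cl-tohoku/bert-base-japanese"   # any pretrained Japanese BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)                 # e.g. binary sentiment analysis

# Toy labeled data: "It was really good." / "It was the worst."
train_ds = Dataset.from_dict({
    "text": ["本当に良かったです。", "最悪でした。"],
    "label": [1, 0],
}).map(lambda b: tokenizer(b["text"], truncation=True,
                           padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()   # fine-tunes all pretrained weights on the task
```

The same pretrained checkpoint can be fine-tuned separately for each downstream task, which is why one high-quality language model can lift sentiment analysis, question answering, search ranking, and more at once.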
  4. Language models at LINE
     › Already used in CLOVA Text Analysis, LINE AiCall, and more to come
     › Active R&D, with early studies since 2018
     › Targets: a production-grade high-quality model, a large-scale model, and a lightweight model
  5. Model creation: a continual process
     Business requirements → Algorithm selection → Data processing → Model training → Model evaluation
     (Model deployment is a separate topic.)
  6. Challenging issue: Scale
     [Chart: data size (1 GB to 1 TB) versus model size (100M to 100B parameters). BERT, XLNet, and T5 from Google and GPT-3 from OpenAI occupy the large end; most Japanese LMs cluster at the small end.]
  7. Challenging issue: Scale
     [Same chart with the LINE model added: it targets the large-data, large-model region alongside BERT, XLNet, T5, and GPT-3, well beyond most Japanese LMs.]
     A rough sense of the model-size axis is sketched below.
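For orientation on the model-size axis, here is back-of-the-envelope parameter arithmetic for a Transformer: roughly 12·d² weights per layer (4·d² for attention, 8·d² for the feed-forward block) plus a vocab·d embedding matrix, ignoring biases and layer norms.

```python
# Rough parameter-count arithmetic behind the "model size" axis:
# ~12 * d_model^2 weights per Transformer layer plus embeddings.
def approx_params(layers: int, d_model: int, vocab: int) -> int:
    return layers * 12 * d_model**2 + vocab * d_model

# BERT-base (12 layers, d_model=768): ~108M, the low end of the chart
print(f"{approx_params(12, 768, 30522):,}")    # ~108,400,000
# GPT-3 (96 layers, d_model=12288): ~175B, the high end of the chart
print(f"{approx_params(96, 12288, 50257):,}")  # ~174,600,000,000
```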
  8. Data size
     Collect data from various resources for multiple downstream tasks.

     Resource type   Cleaned data size
     News            27 GB
     Blog            10 GB
     Books           0.2 GB
     Q&A             51 GB
     Wikipedia       2 GB
     Web             50 GB
     Total           140 GB
  9. Data cleaning system
     [Slide shows a <BEFORE>/<AFTER> example: a raw blog-style line containing measurements and timestamps (e.g. 101.8kg, 2016-03-28 20:00:30) and its cleaned form; the Japanese sample text is not legible in the transcript.]
     Input: raw text
     Extractor (document filter)
     • Cut samples out of the data resource
     • Preprocessing: remove unnecessary tags (HTML etc.)
     Sentence splitter / preprocessor
     • Extract sentences from each sample
     • Preprocessing: remove or replace special symbols and characters
     Sentence filter
     • Decide whether each cleaned sentence can be used as a sample, under various conditions (grammatical filter etc.); a sketch of the pipeline follows below
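A minimal sketch of the three-stage pipeline above (extractor, sentence splitter/preprocessor, sentence filter). The regexes, the length thresholds, and the filter conditions are illustrative assumptions, not LINE's actual rules.

```python
# Minimal sketch of the cleaning pipeline: extractor -> sentence
# splitter -> sentence filter. All rules below are illustrative.
import re

TAG_RE = re.compile(r"<[^>]+>")                    # strip HTML-like tags
CONTROL_RE = re.compile(r"[\u0000-\u001f\u007f]")  # strip control chars

def extract(document: str) -> str:
    """Document filter: drop markup and normalize whitespace."""
    text = TAG_RE.sub(" ", document)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Split on Japanese sentence enders, then remove special chars."""
    sentences = re.split(r"(?<=[。！？])", text)
    return [CONTROL_RE.sub("", s).strip() for s in sentences if s.strip()]

def keep(sentence: str) -> bool:
    """Sentence filter: toy length/content conditions standing in
    for the grammatical filter mentioned on the slide."""
    return 10 <= len(sentence) <= 200 and "http" not in sentence

def clean(document: str) -> list[str]:
    return [s for s in split_sentences(extract(document)) if keep(s)]

# "It is very nice weather today. Let's go out for a walk."
print(clean("<p>今日はとても良い天気ですね。散歩に出かけましょう。</p>"))
```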
  10. Model training
      › Speeding up large-scale training shortens time-to-market (TTM) and enables more iterations
      › Training is distributed on NSML, an on-prem GPU cluster, to scale up the batch size and reduce training time
      › Training instances are generated dynamically to reduce storage (see the sketch below)
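A minimal sketch of the dynamic-instance idea, assuming masked-LM pretraining: examples are masked on the fly each epoch instead of being pre-generated and written to disk (the approach RoBERTa calls dynamic masking). The token ids, mask id, and masking rate are toy placeholders, and sharding across NSML workers is omitted.

```python
# Sketch of dynamic instance generation: masked-LM examples are built
# on the fly, trading a little CPU per step for far less storage.
import random
import torch
from torch.utils.data import DataLoader, IterableDataset

MASK_ID = 4   # placeholder id of the [MASK] token

class DynamicMLMDataset(IterableDataset):
    def __init__(self, token_lines, mask_prob=0.15):
        self.token_lines = token_lines   # pre-tokenized corpus lines
        self.mask_prob = mask_prob

    def __iter__(self):
        for ids in self.token_lines:
            inputs, labels = list(ids), [-100] * len(ids)  # -100 = ignored
            for i in range(len(ids)):
                if random.random() < self.mask_prob:
                    labels[i] = inputs[i]   # predict the original token
                    inputs[i] = MASK_ID     # replace it with [MASK]
            yield torch.tensor(inputs), torch.tensor(labels)

# Toy corpus of already-tokenized ids; every epoch sees fresh masks,
# so no pre-generated instance files need to be stored.
corpus = [[101, 2054, 2003, 2023, 102]] * 8
loader = DataLoader(DynamicMLMDataset(corpus), batch_size=4)
for inputs, labels in loader:
    print(inputs.shape, labels.shape)
```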
  11. Evaluation benchmarks
      On diverse Japanese NLP tasks: product review, sentiment analysis, textual entailment, named entity recognition, reading comprehension, and question answering, measured by accuracy (Acc), F1, and exact match (EM). The LINE model is compared against the Tohoku Univ, NICT-BPE, and NICT-Word models.
  13. Evaluation benchmarks
      On diverse Japanese NLP tasks (how EM and F1 are computed is sketched below):

      Task (metric)                    LINE model   Tohoku Univ   NICT-BPE   NICT-Word
      Product review (Acc)             57.49
      Product review (F1)              57.27
      Sentiment analysis (Acc)         89.31
      Sentiment analysis (F1)          89.46
      Textual entailment (Acc)         72.66
      Textual entailment (F1)          70.81
      Named entity recognition (Acc)   97.90
      Named entity recognition (F1)    71.99
      Reading comprehension (Acc)                                             83.75
      Question answering (EM)                                                 78.47
      Question answering (F1)                                                 79.49

      [Remaining cell values are not legible in the transcript.]
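For reference, a minimal sketch of the question-answering metrics in the table: exact match (EM) and SQuAD-style token-overlap F1. The answer normalization is simplified, and Japanese evaluations often compare at the character level rather than on whitespace tokens.

```python
# Sketch of the QA metrics: exact match and token-overlap F1.
# Real evaluations also normalize punctuation, case, etc.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1947", "1947"))        # 1.0
print(token_f1("the year 1947", "1947"))  # 0.5: partial credit
```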
  14. Summary
      › High-quality language models are needed in more and more services
      › LINE is building a large Japanese data set and solving the scale issues
      › Many challenging issues remain in continuously improving the language model