
Developing a high quality language model for Japanese


LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda
     › Why LINE builds language models
     › What makes a high-quality language model
     › Introducing the Japanese language model created at LINE
  2. Global trend
     › Pretrained language models have become the de facto standard in NLP research
     › 2018 to 2020: ELMo, GPT, BERT, XLNet, T5, GPT-3
     › Pretraining a language model + fine-tuning on the task => SOTA
     › Competition on pretrained language models
  3. Global trend (continued)
     › A high-quality language model can improve many downstream tasks: sentiment analysis, question answering, search ranking, and more (the pretrain + fine-tune recipe is sketched below)
     › Increasing adoption in production (Google, Microsoft, etc.)
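As a concrete illustration of the pretrain + fine-tune recipe, here is a minimal sketch using the Hugging Face transformers API. The checkpoint, the two-example dataset, and the hyperparameters are illustrative placeholders, not what LINE actually used.

```python
# Minimal sketch of "pretraining + fine-tuning": load pretrained
# weights, then train a small task head on labeled data. The checkpoint
# and toy dataset are illustrative placeholders only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

checkpoint = "cl-tohoku/bert-base-japanese"   # any pretrained Japanese BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)                 # e.g. binary sentiment analysis

# Toy labeled data: "It was really good." / "It was the worst."
train_ds = Dataset.from_dict({
    "text": ["本当に良かったです。", "最悪でした。"],
    "label": [1, 0],
}).map(lambda b: tokenizer(b["text"], truncation=True,
                           padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()   # fine-tunes all pretrained weights on the task
```

The same pretrained checkpoint can be fine-tuned separately for each downstream task, which is why one high-quality language model can lift sentiment analysis, question answering, search ranking, and more at once.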
  4. Language models at LINE
     › Already used in CLOVA Text Analysis, LINE AiCall, and more to come
     › Active R&D, with early studies since 2018
     › Targets: a production-grade high-quality model, a large-scale model, and a lightweight model
  5. Model creation: a continual process
     Business requirements → Algorithm selection → Data processing → Model training → Model evaluation
     (Model deployment is a separate topic.)
  6. Challenging issue: Scale
     [Chart: data size (1 GB to 1 TB) versus model size (100M to 100B parameters). BERT, XLNet, and T5 from Google and GPT-3 from OpenAI occupy the large end; most Japanese LMs cluster at the small end.]
  7. Challenging issue: Scale
     [Same chart with the LINE model added: it targets the large-data, large-model region alongside BERT, XLNet, T5, and GPT-3, well beyond most Japanese LMs.]
     A rough sense of the model-size axis is sketched below.
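For orientation on the model-size axis, here is back-of-the-envelope parameter arithmetic for a Transformer: roughly 12·d² weights per layer (4·d² for attention, 8·d² for the feed-forward block) plus a vocab·d embedding matrix, ignoring biases and layer norms.

```python
# Rough parameter-count arithmetic behind the "model size" axis:
# ~12 * d_model^2 weights per Transformer layer plus embeddings.
def approx_params(layers: int, d_model: int, vocab: int) -> int:
    return layers * 12 * d_model**2 + vocab * d_model

# BERT-base (12 layers, d_model=768): ~108M, the low end of the chart
print(f"{approx_params(12, 768, 30522):,}")    # ~108,400,000
# GPT-3 (96 layers, d_model=12288): ~175B, the high end of the chart
print(f"{approx_params(96, 12288, 50257):,}")  # ~174,600,000,000
```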
  8. Data size
     Collect data from various resources for multiple downstream tasks.

     Resource type   Cleaned data size
     News            27 GB
     Blog            10 GB
     Books           0.2 GB
     Q&A             51 GB
     Wikipedia       2 GB
     Web             50 GB
     Total           140 GB
  9. Data cleaning system
     [Slide shows a <BEFORE>/<AFTER> example: a raw blog-style line containing measurements and timestamps (e.g. 101.8kg, 2016-03-28 20:00:30) and its cleaned form; the Japanese sample text is not legible in the transcript.]
     Input: raw text
     Extractor (document filter)
     • Cut samples out of the data resource
     • Preprocessing: remove unnecessary tags (HTML etc.)
     Sentence splitter / preprocessor
     • Extract sentences from each sample
     • Preprocessing: remove or replace special symbols and characters
     Sentence filter
     • Decide whether each cleaned sentence can be used as a sample, under various conditions (grammatical filter etc.); a sketch of the pipeline follows below
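A minimal sketch of the three-stage pipeline above (extractor, sentence splitter/preprocessor, sentence filter). The regexes, the length thresholds, and the filter conditions are illustrative assumptions, not LINE's actual rules.

```python
# Minimal sketch of the cleaning pipeline: extractor -> sentence
# splitter -> sentence filter. All rules below are illustrative.
import re

TAG_RE = re.compile(r"<[^>]+>")                    # strip HTML-like tags
CONTROL_RE = re.compile(r"[\u0000-\u001f\u007f]")  # strip control chars

def extract(document: str) -> str:
    """Document filter: drop markup and normalize whitespace."""
    text = TAG_RE.sub(" ", document)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Split on Japanese sentence enders, then remove special chars."""
    sentences = re.split(r"(?<=[。！？])", text)
    return [CONTROL_RE.sub("", s).strip() for s in sentences if s.strip()]

def keep(sentence: str) -> bool:
    """Sentence filter: toy length/content conditions standing in
    for the grammatical filter mentioned on the slide."""
    return 10 <= len(sentence) <= 200 and "http" not in sentence

def clean(document: str) -> list[str]:
    return [s for s in split_sentences(extract(document)) if keep(s)]

# "It is very nice weather today. Let's go out for a walk."
print(clean("<p>今日はとても良い天気ですね。散歩に出かけましょう。</p>"))
```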
  10. Model training
      › Speeding up large-scale training shortens time-to-market (TTM) and enables more iterations
      › Training is distributed on NSML, an on-prem GPU cluster, to scale up the batch size and reduce training time
      › Training instances are generated dynamically to reduce storage (see the sketch below)
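A minimal sketch of the dynamic-instance idea, assuming masked-LM pretraining: examples are masked on the fly each epoch instead of being pre-generated and written to disk (the approach RoBERTa calls dynamic masking). The token ids, mask id, and masking rate are toy placeholders, and sharding across NSML workers is omitted.

```python
# Sketch of dynamic instance generation: masked-LM examples are built
# on the fly, trading a little CPU per step for far less storage.
import random
import torch
from torch.utils.data import DataLoader, IterableDataset

MASK_ID = 4   # placeholder id of the [MASK] token

class DynamicMLMDataset(IterableDataset):
    def __init__(self, token_lines, mask_prob=0.15):
        self.token_lines = token_lines   # pre-tokenized corpus lines
        self.mask_prob = mask_prob

    def __iter__(self):
        for ids in self.token_lines:
            inputs, labels = list(ids), [-100] * len(ids)  # -100 = ignored
            for i in range(len(ids)):
                if random.random() < self.mask_prob:
                    labels[i] = inputs[i]   # predict the original token
                    inputs[i] = MASK_ID     # replace it with [MASK]
            yield torch.tensor(inputs), torch.tensor(labels)

# Toy corpus of already-tokenized ids; every epoch sees fresh masks,
# so no pre-generated instance files need to be stored.
corpus = [[101, 2054, 2003, 2023, 102]] * 8
loader = DataLoader(DynamicMLMDataset(corpus), batch_size=4)
for inputs, labels in loader:
    print(inputs.shape, labels.shape)
```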
  11. Evaluation benchmarks
      On diverse Japanese NLP tasks: product review, sentiment analysis, textual entailment, named entity recognition, reading comprehension, and question answering, measured by accuracy (Acc), F1, and exact match (EM). The LINE model is compared against the Tohoku Univ, NICT-BPE, and NICT-Word models.
  13. Evaluation benchmarks
      On diverse Japanese NLP tasks (how EM and F1 are computed is sketched below):

      Task (metric)                    LINE model   Tohoku Univ   NICT-BPE   NICT-Word
      Product review (Acc)             57.49
      Product review (F1)              57.27
      Sentiment analysis (Acc)         89.31
      Sentiment analysis (F1)          89.46
      Textual entailment (Acc)         72.66
      Textual entailment (F1)          70.81
      Named entity recognition (Acc)   97.90
      Named entity recognition (F1)    71.99
      Reading comprehension (Acc)                                             83.75
      Question answering (EM)                                                 78.47
      Question answering (F1)                                                 79.49

      [Remaining cell values are not legible in the transcript.]
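For reference, a minimal sketch of the question-answering metrics in the table: exact match (EM) and SQuAD-style token-overlap F1. The answer normalization is simplified, and Japanese evaluations often compare at the character level rather than on whitespace tokens.

```python
# Sketch of the QA metrics: exact match and token-overlap F1.
# Real evaluations also normalize punctuation, case, etc.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1947", "1947"))        # 1.0
print(token_f1("the year 1947", "1947"))  # 0.5: partial credit
```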
  14. Summary
      › High-quality language models are needed in more and more services
      › LINE is building a large Japanese data set and solving the scale issues
      › Many challenging issues remain in continuously improving the language model