
How to Apply Large ML Models for AI-Text Filtering Models

Tech-Verse2022

November 17, 2022

Transcript

  1. • Hyungrak Kim • NLP engineer • Works on the AI text filter model • Likes to learn and use new technology
  2. Contents › Introduction › Large ML model training tech › Apply large ML model to AI text filter › Experiment Result › Expected Effectiveness › Conclusion
  3. Introduction › What is the AI text filter? › A user message enters the LINE Monitoring System, e.g. 私と付き合いたい場合は連絡してください [email protected] (translation: "Please contact me if you want to date me [email protected]") › 380,000,000 items of data are checked every month › The JP-language AI text filter model outputs a 0/1 decision for each label: Normal, Personal Info, Porn, Harass, Illegal, Advertising
  4. Introduction › The AI text filter problem › The JP-language AI text filter model is built by fine-tuning one of many public pre-training models: JP BERT, JP Char BERT, JP RoBERTa, JP small BERT, JP DistilBERT, ...
  5. Introduction › The AI text filter problem › Which pre-training model performs best? › What if the language is different? › Every fine-tuning choice carries research cost, development cost, and service cost
  6. Introduction › Solution › Replace the 110-million-parameter single-language (Japanese) model with an 11-billion-parameter multi-language AI text filter model covering Japanese, English, Thai, Taiwanese, Indonesian, ... › Impact: a roughly ×100 larger model, cost reduction, and service extension › This requires large-model training techniques
  7. Introduction › Contribution › Introduction and sharing of large ML model training technology › AI text filter advancement using a large multi-language model, with the MLU team of LINE MLOps › Model serving, with the MLU serving team of LINE ML service
  8. Large ML Model Training Tech › Two basic directions › Lightweight: pruning, quantization, knowledge distillation › Scaling: data parallelism, model parallelism, CPU offload
  9. Large ML Model Training Tech › Data parallelism › Data 1, Data 2, and Data 3 each go to their own V100 GPU, and every GPU holds a full copy of the ML model
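
A minimal sketch of data parallelism with PyTorch DDP, assuming a `torchrun` launch with one process per GPU; the linear model and batch are placeholders, not the actual filter model:

```python
# Hedged sketch: data parallelism, full model copy on every GPU.
# Launch with e.g. `torchrun --nproc_per_node=3 ddp_sketch.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(768, 6).cuda(rank)   # placeholder for the filter model
model = DDP(model, device_ids=[rank])        # one full replica per GPU

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
x = torch.randn(32, 768, device=rank)        # this rank's shard of the data
loss = model(x).sum()
loss.backward()                              # gradients averaged across replicas
opt.step()
```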
  10. Large ML Model Training Tech › Model parallelism › A single model (Input → Layer 1 → Layer 2 → Output) is split across GPU 1 and GPU 2 with intra-operator parallelism; the partial results are combined with an all-reduce
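
To make "intra-operator parallelism + all-reduce" concrete, here is a minimal numeric sketch, with NumPy arrays standing in for two GPUs and illustrative shapes:

```python
# One linear layer split across two "GPUs": each holds half the weight,
# computes a partial product, and the all-reduce sums the partials.
import numpy as np

x = np.random.randn(4, 8)            # activations
W = np.random.randn(8, 16)           # full weight of one layer

x1, x2 = np.split(x, 2, axis=1)      # split the inner dimension
W1, W2 = np.split(W, 2, axis=0)

partial_gpu1 = x1 @ W1               # computed on GPU 1
partial_gpu2 = x2 @ W2               # computed on GPU 2

# The all-reduce (here a plain sum) recovers the un-split layer's output.
assert np.allclose(partial_gpu1 + partial_gpu2, x @ W)
```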
  11. Large ML Model Training Tech › Model parallelism + CPU offload › Offloading to CPU memory frees GPU space, so the trainable model size goes up; the layers stay split across GPU 1 and GPU 2 with intra-operator parallelism and all-reduce
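
A minimal sketch of how CPU offload is switched on in a DeepSpeed config; the key names come from the DeepSpeed ZeRO-offload documentation, while the surrounding values are illustrative:

```python
# Hedged sketch: ZeRO stage 3 with CPU offload, so optimizer state and
# parameters spill into CPU memory and free GPU space for a bigger model.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}
```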
  12. Large ML Model Training Tech › Large ML model training framework › Why we chose DeepSpeed: it is open source, supports CPU offload, and supports the best methods from current ML research • (ICML 2022 big model tutorial): https://icml.cc/virtual/2022/tutorial/18440 • (DeepSpeed): https://www.deepspeed.ai
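
A minimal sketch of wiring a model into DeepSpeed, assuming a CUDA environment; the model is a placeholder and `ds_config` is a config dict like the one sketched above:

```python
# Hedged sketch: deepspeed.initialize wraps the model so the config
# (ZeRO stage, offload, fp16) takes effect transparently during training.
import torch
import deepspeed

model = torch.nn.Linear(768, 6)              # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,                        # dict from the previous sketch
)

x = torch.randn(4, 768).to(engine.device)
loss = engine(x).sum()
engine.backward(loss)                        # replaces loss.backward()
engine.step()                                # replaces optimizer.step()
```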
  13. Apply large ML model to AI text filter › Large model training › Infrastructure: a DeepSpeed multi-node setup of 3 nodes, each with 8× A100 40G GPUs, 70 CPU cores, and 1 TB of CPU memory • (DeepSpeed): https://www.deepspeed.ai/
  14. Apply large ML model to AI text filter › Large model training › Training configuration: fine-tune the 11-billion-parameter multi-language pre-training model into the AI text filter with 730,000 items of data, on the same 3-node, 8× A100 40G DeepSpeed setup (launch sketched below) • (DeepSpeed): https://www.deepspeed.ai/
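
A minimal sketch of the multi-node launch; the hostfile format (`<host> slots=<gpus>`) and the `deepspeed --hostfile` launcher are from the DeepSpeed docs, while the hostnames and the training script name are hypothetical:

```python
# Hedged sketch: describe the 3 nodes x 8 GPUs to the DeepSpeed launcher.
import subprocess

with open("hostfile", "w") as f:
    for i in (1, 2, 3):                       # hypothetical node hostnames
        f.write(f"gpu-node{i} slots=8\n")     # 8x A100 40G per node

subprocess.run([
    "deepspeed", "--hostfile", "hostfile",
    "finetune_text_filter.py",                # hypothetical entry point
    "--deepspeed_config", "ds_config.json",
], check=True)
```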
  15. Apply large ML model to AI text filter › Environment setting problem › The DeepSpeed environment depends on the CPU, the GPU, the OS, the system libraries, and the library versions • (DeepSpeed): https://hub.docker.com/r/deepspeed/deepspeed/tags?page=1&ordering=last_updated
  16. Apply large ML model to AI text filter › Environment setting problem › On top of those dependencies, DeepSpeed builds CUDA extensions, which requires a working build system: the CUDA extension toolchain, Ninja, and g++/C++ • (DeepSpeed): https://hub.docker.com/r/deepspeed/deepspeed/tags?page=1&ordering=last_updated
  17. Apply large ML model to AI text filter › Environment setting solution › Build a fixed DeepSpeed environment setting: OS system libraries, the DeepSpeed library, and the multi-node libraries
  18. Apply large ML model to AI text filter › Environment setting solution › The environment pins a fixed, stable DeepSpeed version, includes all functions used in MLU, and keeps the MLU environment free of extra training libraries
  19. Apply large ML model to AI text filter › Environment setting solution › The environment is distributed as a Docker image on Docker Hub together with an installation document
  20. Apply large ML model to AI text filter › Multi-node training file sharing problem 1 › When training starts for the first time in the MLU environment, the GPU server builds the CUDA extension for the GPU accelerator • (DeepSpeed): https://www.deepspeed.ai/tutorials/advanced-install/
  21. Apply large ML model to AI text filter › Multi-node training file sharing problem 2 › In multi-node training, the header GPU node (Node 1) starts training on the worker GPU nodes (Nodes 2 and 3) over ssh, but it is unclear whether the workers have the built CUDA extension • (DeepSpeed): https://www.deepspeed.ai/tutorials/advanced-install/
  22. Apply large ML model to AI text filter › Multi-node training file sharing solution 1 › A multi-node file sharing module: the header GPU node (Node 1) takes a worker-node IP address list and securely transfers the built CUDA extension to worker GPU nodes 2, 3, ..., N
  23. Apply large ML model to AI text filter › Multi-node training file sharing solution 2 › After sharing, the header GPU node starts training over ssh and every worker GPU node already has the CUDA extension • (DeepSpeed): https://www.deepspeed.ai/tutorials/advanced-install/
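
A minimal sketch of the sharing step under stated assumptions: DeepSpeed JIT-builds its CUDA extensions into a local cache (commonly `~/.cache/torch_extensions`), and the header node pushes that cache to each worker over scp; the IPs and paths are illustrative:

```python
# Hedged sketch of the multi-node file-sharing module: copy the header node's
# built CUDA extensions to every worker in the IP address list.
import os
import subprocess

worker_ips = ["10.0.0.2", "10.0.0.3"]                 # worker GPU nodes 2 and 3
cache_dir = os.path.expanduser("~/.cache/torch_extensions")

for ip in worker_ips:
    # scp -r relies on the same passwordless ssh that DeepSpeed itself uses.
    subprocess.run(["scp", "-r", cache_dir, f"{ip}:.cache/"], check=True)
```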
  24. Apply large ML model to AI text filter › Pre-training model parallelism dependency problem › Intra-operator model parallelism (layers split across GPU 1 and GPU 2 with all-reduce) has to be written into the model code
  25. Apply large ML model to AI text filter › Pre-training model parallelism dependency problem › Public pre-training models (JP BERT, JP Char BERT, JP RoBERTa, JP small BERT, JP DistilBERT, ...) ship with un-parallelized model code
  26. Apply large ML model to AI text filter › Pre-training model parallelism dependency problem › Fine-tuning therefore depends on whether the pre-training model's code supports parallelism
  27. Apply large ML model to AI text filter › Pre-training model parallelism dependency solution › A parallelism converter: make the pre-training model's code parallel
  28. Apply large ML model to AI text filter › Pre-training model parallelism dependency solution › The parallelism converter does two things: parallelize the model code and partition the pre-training model weights
  29. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism 1 › A public pre-training model is a Transformer: an encoder and a decoder, each a stack of layers 1 … N
  30. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism 1 › Each layer is multi-head attention (key, query, value) plus a feed-forward network: an intermediate H→4H FFN followed by a 4H→H FFN
  31. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism 2 › In the multi-language pre-training model, the multi-head attention layer (key, query, value) is split across GPU 1 and GPU 2 and recombined with an all-reduce before the feed-forward layer • (Megatron-LM): https://arxiv.org/pdf/1909.08053.pdf
  32. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism 2 › The feed-forward layer is split the same way: the intermediate H→4H part column-wise and the 4H→H part row-wise, with an all-reduce producing the output • (Megatron-LM): https://arxiv.org/pdf/1909.08053.pdf
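
The Megatron-LM FFN split can be checked numerically. A minimal sketch with NumPy standing in for two GPUs: the H→4H weight is split column-wise, the 4H→H weight row-wise, and a single all-reduce (here a `+`) recovers the un-split output.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, applied element-wise
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

H = 8
x = np.random.randn(4, H)            # token activations
A = np.random.randn(H, 4 * H)        # H -> 4H weight, split column-wise
B = np.random.randn(4 * H, H)        # 4H -> H weight, split row-wise

A1, A2 = np.split(A, 2, axis=1)
B1, B2 = np.split(B, 2, axis=0)

# Each "GPU" computes its half independently; no sync is needed in between,
# because the element-wise GELU commutes with the column split of A.
y_parallel = gelu(x @ A1) @ B1 + gelu(x @ A2) @ B2   # "+" is the all-reduce
y_serial = gelu(x @ A) @ B
assert np.allclose(y_parallel, y_serial)
```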
  33. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism › Model parameter partitioning algorithm: parallelize the model code, load the pre-training model weights, partition them, and then fine-tune • (Megatron-LM): https://github.com/NVIDIA/Megatron-LM
  34. Apply large ML model to AI text filter › Pre-training model parallelism dependency, code parallelism › The multi-head attention layer, feed-forward layer, and intermediate feed-forward layer weights are auto-partitioned between GPU 1 and GPU 2 before fine-tuning (see the sketch below) • (Megatron-LM): https://github.com/NVIDIA/Megatron-LM
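
A minimal sketch of the weight-partitioning step under the same Megatron-LM scheme; the shapes and the two-way split are illustrative, and a real converter would walk every layer of the checkpoint:

```python
# Hedged sketch: partition one layer's pre-trained weights for 2-way
# tensor parallelism before fine-tuning.
import numpy as np

H = 768
attn_qkv = np.random.randn(H, 3 * H)   # stand-in for loaded Q/K/V weights
ffn_h_4h = np.random.randn(H, 4 * H)   # intermediate H -> 4H FFN weight
ffn_4h_h = np.random.randn(4 * H, H)   # 4H -> H FFN weight

shards = [
    {"attn_qkv": q, "ffn_h_4h": a, "ffn_4h_h": b}
    for q, a, b in zip(
        np.split(attn_qkv, 2, axis=1),  # column split: output halves per GPU
        np.split(ffn_h_4h, 2, axis=1),  # column split
        np.split(ffn_4h_h, 2, axis=0),  # row split: matching input halves
    )
]
# shards[0] goes to GPU 1, shards[1] goes to GPU 2.
```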
  35. Apply large ML model to AI text filter › Pre-training model parallelism dependency solution › The parallelism converter (model code parallelism + model weight partitioning) turns a public pre-training model with un-parallelized code into a parallelized model spread over a group of N GPUs, ready for fine-tuning
  36. Apply large ML model to AI text filter › Pre-training model parallelism dependency solution, analysis › Advantages: free of the parallelism dependency, and the model size can go up › Disadvantages: convergence can be unstable, model performance can drop, and more research is needed
  37. Apply large ML model to AI text filter › Performance tuning with label correlation › A global correlation embedding algorithm models the correlation between the labels (Normal, Advertising, Personal Info, Porn, Illegal, Harass)
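
The slide does not spell the algorithm out; as one plausible reading, here is a minimal sketch of a global label-correlation embedding: estimate the label-label correlation from the training labels and use it to couple the per-label logits. All names and the mixing step are assumptions, not LINE's actual algorithm.

```python
# Hedged sketch: couple per-label predictions through a global correlation
# matrix estimated from the multi-label training targets.
import numpy as np

labels = ["Normal", "Advertising", "Personal Info", "Porn", "Illegal", "Harass"]
y_train = np.random.randint(0, 2, size=(10_000, len(labels)))  # stand-in targets

corr = np.corrcoef(y_train, rowvar=False)    # global label-label correlation

logits = np.random.randn(4, len(labels))     # model outputs for 4 texts
alpha = 0.8                                  # how much to trust the raw logits
adjusted = alpha * logits + (1 - alpha) * logits @ corr
```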
  38. Apply large ML model to AI text filter › Large model serving › Model optimization: FP16 with loss scaling • (DeepSpeed Inference): https://www.deepspeed.ai/tutorials/inference-tutorial/
  39. Apply large ML model to AI text filter › Large model serving › On top of the FP16-optimized model: GPU kernel optimization and inference parallelism • (DeepSpeed Inference): https://www.deepspeed.ai/tutorials/inference-tutorial/
  40. Apply large ML model to AI text filter › Large model serving › The optimized model is served on V100 GPUs with auto-scaling through MLU Serving (see the sketch below) • (DeepSpeed Inference): https://www.deepspeed.ai/tutorials/inference-tutorial/
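
A minimal sketch of the serving side with DeepSpeed-Inference; the model is a placeholder, and while `mp_size`, `dtype`, and `replace_with_kernel_inject` are real `init_inference` arguments, kernel injection only applies to architectures DeepSpeed supports:

```python
# Hedged sketch: FP16 inference with parallelism across 2 GPUs.
import torch
import deepspeed

model = torch.nn.Linear(768, 6)          # placeholder for the fine-tuned model
engine = deepspeed.init_inference(
    model,
    mp_size=2,                           # inference parallelism across 2 GPUs
    dtype=torch.float16,                 # FP16-optimized serving
    replace_with_kernel_inject=True,     # DeepSpeed's optimized GPU kernels
)

x = torch.randn(1, 768).half().cuda()
scores = engine(x)                       # per-label filter scores
```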
  41. Experiment Result › Experiment setting › The AI text filter service model (a 110-million-parameter Japanese single-language model) vs. the 11-billion-parameter multi-language large model with tuning vs. the same large model without tuning
  42. Experiment Result › Experiment test data (label: count, ratio) › Normal: 99,996 (86.2%) › Info: 10,278 (8.8%) › Porn: 2,299 (1.9%) › Harass: 1,106 (0.9%) › Illegal: 106 (0.09%) › AD: 2,180 (1.8%) › Total: 115,965
  43. Experiment Result › F1 score result › Chart: per-label F1 (Normal, Info, Porn, Harass, Illegal, AD) for the Multi-Tuning, Multi, and JP Service models
  44. Experiment Result › F1 score result › Total average F1 relative to the JP Service model (0% baseline): Multi-Tuning -1%, Multi -9.9%
  45. Experiment Result › AUC result › Chart: per-label AUC (Normal, Info, Porn, Harass, Illegal, AD) for the Multi Tuning, Multi, and JP Service models
  46. Experiment Result › AUC result › Total average AUC relative to the JP Service model (0% baseline): Multi Tuning -1%, Multi -9.1%
  47. Experiment Result › Qualitative evaluation › User message: 経営難で銀行等からの融資待ちの方、収入がなく生活が出来ない……等々、コロナショックで困ってる方🙀 連絡頂ければ即融資可能です😊‼ (translation: "Those in trouble because of the corona shock, such as people waiting on loans from banks due to financial difficulties, or people who cannot make a living without income 🙀 contact us and we can finance you immediately 😊‼")
  48. Experiment Result › Qualitative evaluation › On the same message, the JP Service model scores Illegal at only 12%
  49. Experiment Result › Qualitative evaluation › The tuned multi-language large model scores Illegal at 99%, vs. 12% for the JP Service model
  50. Expected Effectiveness › Effect › Introducing the large ML model brings (1) large ML model training technology, (2) performance improvement, and (3) service extension
  51. Expected Effectiveness › Expectation 1 › An AI text filter about 10% more accurate for the LMP system › A 0.3-point drop in the monitoring rate relative to the current AI text filter service model › Today the JP Service model flags 1.5% of the 380,000,000 monthly messages, i.e. 5,700,000 items of monitoring data
  52. Expected Effectiveness › Expectation 1 › The multi-language large model is expected to flag 1.2%, i.e. 4,560,000 items of monitoring data, versus 5,700,000 for the JP Service model
  53. Expected Effectiveness › Expectation 1 › Monthly monitoring resource: 5,700,000 − 4,560,000 = a reduction of 1,140,000 items
  54. Expected Effectiveness › Expectation 2 › Chart: service resource monitored for a year (axis scale ×10⁵), multi-language large model vs. JP Service model, a difference of -13,680,000
  55. Expected Effectiveness › Expectation 2 › Over one year of monitoring that is 13,680,000 fewer items, a 20% reduction in monitoring resource
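
The numbers on slides 51 through 55 check out; a minimal worked version of the arithmetic:

```python
# Expected monitoring-resource reduction, reproduced from the slides.
total_per_month = 380_000_000
before = total_per_month * 15 // 1000   # 1.5% -> 5,700,000 monitored (JP Service)
after = total_per_month * 12 // 1000    # 1.2% -> 4,560,000 monitored (large model)

monthly_saving = before - after         # 1,140,000 fewer items per month
yearly_saving = monthly_saving * 12     # 13,680,000 fewer items per year

assert abs(yearly_saving / (before * 12) - 0.20) < 1e-12   # the -20% per year
```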
  56. Conclusion › Conclusion & Future Work › Conclusion › Large model training was not easy to understand and put into practice › It was as fun to study as it was difficult › The large model proved effective › More collaboration with other teams is needed
  57. Conclusion › Conclusion & Future Work › Future work › Large model hyper-parameter tuning