LFL Client Platform for Supporting Multiple Federated Learning Instances

by Tech-Verse2022

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

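What is federated learning?
- Cloud-based machine learning: clients send user data to the server, and the server trains the ML model on the collected user data.
- Federated learning: the server distributes the ML model; each client downloads it, trains it on its own user data, and uploads the trained model; the server aggregates the uploaded models into the aggregated ML model.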

Slide 3

Slide 3 text

Federated learning examples
- Research papers
- Google’s Gboard
- Apple, Meta, etc.
- NAVER SmartBoard

Slide 4

Slide 4 text

Federated learning at LINE
- Premium stickers recommendation
  - Before: recommend by usage history
  - After: recommend by federated learning
- A/B test result
  - Premium stickers download: 5.56% uplift

Slide 5

Slide 5 text

Expecting multiple FL adoption
Services shown: Chat, Call, Pay, Music, News, Video

Slide 6

Slide 6 text

Common functionalities for FL
FL structure overview (diagram labels): Client app, Server, Model aggregation, ML Model Repository, Common functionality, User interaction, User log, Model, Train, Inference, Model (updated)

Slide 7

Slide 7 text

Resource-intensive functionalities for FL
On-device training
- Resource-limited mobile environment
- Simultaneous on-device training
- May degrade user experience

Slide 8

Slide 8 text

Table of contents
- What is Federated Learning?
- Federated Learning at LINE
- Why do we need a platform supporting multiple Federated Learning instances?
- LFL client platform supporting multiple Federated Learning instances
- On-device training of LFL client platform
- Inside of LFL client platform
- Lesson learned

Slide 9

Slide 9 text

LINE Federated Learning architecture
Diagram labels:
- Components: LINE app, Feature service client, Feature service server, LFL Platform (Client-side), LFL Platform (Server-side), Feature Model Repository, ML Model Repository
- Client-side steps: User interaction, User log, Inference request, Inference, Train, Model, Model (updated)
- Flows: Download ML model, Download to feature service, Upload ML model, Model aggregation

Slide 10

Slide 10 text

LFL client platform structure
Diagram labels:
- Components: LINE Feature service, LFL Client Platform, Common module, LFL Application module (one per LFL Application), ML library
- Common module: Model version check, Model download, Result upload
- Calls into an LFL Application: Update model, config; Start training; Get training result
- LFL Application: Init with config, Push logs, Inference, Model training
- Data: User logs, ML model

Slide 11

Slide 11 text

Dependency inversion in LFL client platform
LFL Common's dependency on Application Modules
- LFL_Common_Module
  - LFLApplicationManager (class): fun updateModelA(), fun updateModelB(), fun updateConfigA(), fun startTrainA(), fun startTrainB(), fun getTrainResultA(), fun getTrainResultB()
- LFL_Application_A_Module
  - Application_A (class): fun updateModel(), fun updateConfig(), fun startTrain(), fun getTrainResult()
- LFL_Application_B_Module
  - Application_B (class): fun updateModel(), fun startTrain(), fun getTrainResult()
- LFLApplicationManager calls the functions of Application_A and Application_B directly.

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Dependency inversion in LFL client platform
Dependency Inversion of LFL Common and Application Module

Before: LFLApplicationManager (class, in LFL_Common_Module) calls the functions of Application_A and Application_B directly, through one function per application: fun updateModelA(), fun updateModelB(), fun updateConfigA(), fun startTrainA(), fun startTrainB(), fun getTrainResultA(), fun getTrainResultB().

After:
- LFL_Common_Module
  - LFLApplicationManager (class): calls functions of the LFLApplication interface
  - LFLApplication (interface): fun updateModel(), fun updateConfig(), fun startTrain(), fun getTrainResult()
- LFL_Application_A_Module
  - LFLApplication_A_Impl (class): fun updateModelImpl(), fun updateConfigImpl(), fun startTrainImpl(), fun getTrainResultImpl()
- LFL_Application_B_Module
  - LFLApplication_B_Impl (class): fun updateModelImpl(), fun updateConfigImpl(), fun startTrainImpl(), fun getTrainResultImpl()
- The application modules implement the interface, and the implementations are provided to the common module by dependency injection.
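
The inverted dependency could look roughly like the sketch below. It is only an illustration: the interface and function names follow the slide, but the return type of getTrainResult(), the class names LFLApplicationAImpl and LFLApplicationBImpl, the trainAll() helper, and the use of plain overrides (instead of the slide's ...Impl() function names) are assumptions.

```kotlin
// Interface owned by the common module; application modules depend on it, not the reverse.
interface LFLApplication {
    fun updateModel()
    fun updateConfig()
    fun startTrain()
    fun getTrainResult(): String // return type is an assumption
}

// Hypothetical implementation that would live in LFL_Application_A_Module.
class LFLApplicationAImpl : LFLApplication {
    private var trained = false
    override fun updateModel() {}
    override fun updateConfig() {}
    override fun startTrain() { trained = true }
    override fun getTrainResult(): String = if (trained) "A: trained" else "A: idle"
}

// Hypothetical implementation that would live in LFL_Application_B_Module.
class LFLApplicationBImpl : LFLApplication {
    private var trained = false
    override fun updateModel() {}
    override fun updateConfig() {}
    override fun startTrain() { trained = true }
    override fun getTrainResult(): String = if (trained) "B: trained" else "B: idle"
}

// The manager only sees the interface; concrete implementations are injected.
class LFLApplicationManager(private val applications: List<LFLApplication>) {
    // trainAll() is a hypothetical helper used only for this illustration.
    fun trainAll(): List<String> = applications.map {
        it.startTrain()
        it.getTrainResult()
    }
}

fun main() {
    val manager = LFLApplicationManager(listOf(LFLApplicationAImpl(), LFLApplicationBImpl()))
    println(manager.trainAll()) // prints [A: trained, B: trained]
}
```

With this shape, adding a third LFL application means adding one more implementation and injecting it; the manager itself does not change.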

Slide 14

Slide 14 text

Interfaces of LFL client platform
Diagram labels: LFL Client Platform, Common module, LFL application manager, LFL application interface, LFL application module (several), Machine learning library, Local trigger
1. Training trigger
2. Start train
3. Start train to actual instance
The slide also repeats the "after" class diagram of Dependency Inversion of LFL Common and Application Module: LFLApplicationManager calls functions of the LFLApplication interface, LFLApplication_A_Impl and LFLApplication_B_Impl implement it, and the implementations are provided by dependency injection.

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Requirements for on-device training
- Background processing
  - iOS: BGTaskScheduler and BGProcessingTask
  - Android: WorkManager
  - Background processing can run for more than 10 minutes
  - The OS can interrupt the processing at any time
- Requirements for on-device training
  - Battery, storage, device idle (background processing)

Slide 17

Slide 17 text

Excessively skewed FL participation
Figure: participation over days (1st to 7th) without rollout.
Diagram labels: Initial model, Client, User data, Train, Locally trained model, Upload model, Cloud, Model aggregation

Slide 18

Slide 18 text

Excessively skewed FL participation

    rollout = {hash(concat(userKey, saltKey)) % bucketSize} in [rolloutBegin, rolloutEnd]

model_config.json

    {
      "training": {
        …
        "uploading limit": 2
      },
      "rollout": {
        "salt_key": "ranker",
        "slots": { "begin": 0, "end": 10 }
      }
    }

Figure: participation over days (1st to 7th) with rollout.
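
A minimal sketch of such a rollout check is shown below. The slide does not name the hash function, the bucket size, or whether the slot range is inclusive, so SHA-256, a caller-supplied bucketSize, and an inclusive range are all assumptions; rolloutSlot and isInRollout are hypothetical names.

```kotlin
import java.security.MessageDigest

// Maps (userKey, saltKey) to a stable slot in 0 until bucketSize.
// SHA-256 is an assumption; the slide only says "hash".
fun rolloutSlot(userKey: String, saltKey: String, bucketSize: Int): Int {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest((userKey + saltKey).toByteArray(Charsets.UTF_8))
    // Fold the first four bytes into an Int (may be negative).
    val value = digest.take(4).fold(0) { acc, b -> (acc shl 8) or (b.toInt() and 0xFF) }
    // floorMod keeps the result non-negative even when value is negative.
    return Math.floorMod(value, bucketSize)
}

// True when the user's slot falls inside [begin, end] (inclusive range is an assumption).
fun isInRollout(userKey: String, saltKey: String, bucketSize: Int, begin: Int, end: Int): Boolean =
    rolloutSlot(userKey, saltKey, bucketSize) in begin..end

fun main() {
    // "ranker", 0, and 10 come from the config on the slide; bucketSize = 100 is made up.
    println(isInRollout("user-1", "ranker", 100, 0, 10))
}
```

Because the slot depends on the salt key as well as the user key, changing the salt reshuffles which users fall into a given slot range.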

Slide 19

Slide 19 text

Trigger background processing repeatedly
Need to wait for configuration changes
Flow (diagram): the LINE app registers a background task; on the OS trigger, the LFL Platform (Client-side) checks for updates from the LFL Platform (Server-side) and checks the config (e.g. rollout); when "Ready to train" is false (F), the task is registered again.

Slide 20

Slide 20 text

Slide 21

Slide 21 text

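Trigger background processing repeatedly
Need to wait for additional user logs
Flow (diagram): the LINE app registers a background task; on the OS trigger, the LFL Platform (Client-side) checks for updates from the LFL Platform (Server-side) and updates the model. When "Ready to train" is true (T), it trains the model, uploads the model, and deletes the user logs; when it is false (F), the task is registered again.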

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Inefficient background processing of FL
- Trigger background processing repeatedly
- Need to wait for configuration changes or additional user logs
- Introduce retry interval and train interval

Slide 24

Slide 24 text

Interval-based scheduling
Schedule multiple LFL applications’ training
⎯ Diverse retry interval and train interval
⎯ Single training session at a time → training duration estimation
Timeline (diagram): for each of Application A, B, and C, a triggered BGTask either starts training, which is followed by that application's train interval, or finds the application not ready, which is followed by its retry interval.

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Interval-based scheduling
⎯ Share a single background processing for all LFL applications’ model training
⎯ Register background processing trigger with interval-based delay

    fun registerNextTrainingWithDelay() {
        // Earliest time at which any application satisfies both its retry interval
        // and its train interval, expressed as a delay from now.
        val waitingTime = LFLApplications.minOf { application ->
            maxOf(
                application.getLatestTrainReadyCheckTime() + application.getRetryInterval(),
                application.getLatestTrainSuccessTime() + application.getTrainInterval()
            )
        } - System.currentTimeMillis()
        registerNextBackgroundProcessingTriggerWithDelay(waitingTime)
    }

Timeline (diagram): a single background processing runs Training A, Training C, and Training B one after another, each registered after the minimum delay.
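
To make the delay calculation concrete, here is a self-contained sketch with made-up numbers. AppSchedule, nextEligibleTime, and nextDelay are hypothetical names, and clamping the delay at zero is an added assumption rather than something the slide states.

```kotlin
// Hypothetical scheduling state of one LFL application (all times in milliseconds).
data class AppSchedule(
    val name: String,
    val latestTrainReadyCheckTime: Long, // last time readiness was checked
    val latestTrainSuccessTime: Long,    // last successful training
    val retryInterval: Long,
    val trainInterval: Long
) {
    // Earliest time at which both interval conditions hold.
    fun nextEligibleTime(): Long =
        maxOf(latestTrainReadyCheckTime + retryInterval, latestTrainSuccessTime + trainInterval)
}

// Delay until the earliest eligible application; clamped at zero (assumption).
// Throws NoSuchElementException if apps is empty.
fun nextDelay(apps: List<AppSchedule>, now: Long): Long =
    (apps.minOf { it.nextEligibleTime() } - now).coerceAtLeast(0L)

fun main() {
    val hour = 3_600_000L
    val now = 100 * hour
    val apps = listOf(
        // A: checked 1 h ago, trained 30 h ago -> eligible at hour 105
        AppSchedule("A", now - 1 * hour, now - 30 * hour, 6 * hour, 24 * hour),
        // B: checked 2 h ago, trained 10 h ago -> eligible at hour 114
        AppSchedule("B", now - 2 * hour, now - 10 * hour, 3 * hour, 24 * hour)
    )
    println(nextDelay(apps, now) / hour) // prints 5
}
```

Application A has waited out its train interval but not its retry interval, so it becomes eligible first and determines the delay.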

Slide 27

Slide 27 text

On-device training with interval-based scheduling
Flow (diagram):
1. The OS trigger starts the background task.
2. Filter applications with interval conditions from the LFL Application List (Application A, Application B, Application C).
3. Select application to train (Application A in the diagram).
4. Model downloader (delegate): check update; retrieve model config, inference model, and training model; update latest_retry_time.
5. Ready to train? If true (T), LFL Trainer (delegate): train model; update latest_train_time.
6. Ready to upload? If true (T), Model uploader (delegate): upload model.
7. Calculate next delay time (minimum waiting time) and register the next background task; the false (F) branches also lead here.
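
The filtering and selection steps might be sketched as follows. The slide only says that applications are filtered by interval conditions and that one is selected; choosing the least recently trained application is an assumption made for this example, and TrainCandidate, filterByIntervals, and selectApplication are hypothetical names.

```kotlin
// Hypothetical per-application state (all times in milliseconds).
data class TrainCandidate(
    val name: String,
    val latestRetryTime: Long,
    val latestTrainTime: Long,
    val retryInterval: Long,
    val trainInterval: Long
)

// Keep only applications whose retry interval and train interval have both elapsed.
fun filterByIntervals(apps: List<TrainCandidate>, now: Long): List<TrainCandidate> =
    apps.filter {
        now >= it.latestRetryTime + it.retryInterval &&
            now >= it.latestTrainTime + it.trainInterval
    }

// Pick one application for this background task, or null if none is eligible.
// Selection policy (least recently trained first) is an assumption.
fun selectApplication(apps: List<TrainCandidate>, now: Long): TrainCandidate? =
    filterByIntervals(apps, now).minByOrNull { it.latestTrainTime }

fun main() {
    val apps = listOf(
        TrainCandidate("A", latestRetryTime = 0, latestTrainTime = 0, retryInterval = 10, trainInterval = 100),
        TrainCandidate("B", latestRetryTime = 0, latestTrainTime = 50, retryInterval = 10, trainInterval = 20)
    )
    println(selectApplication(apps, now = 80)?.name) // prints B (A's train interval has not elapsed)
}
```

Training one application per background task matches the "single training session at a time" constraint from the scheduling slide.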

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

ML library for training and inference
Yuki Federated Learning (YFL) SDK
Focused on lightweight library
⎯ A federated learning library for cross mobile platform (support iOS and Android)
⎯ Based on ONNX Runtime ( https://onnx.ai )
⎯ Model conversion from TensorFlow or PyTorch
⎯ Currently, around 1.2MB
⎯ Limited number of operations and only CPU backend supported
⎯ Local Differential Privacy (LDP) supported
⎯ Use Gaussian mechanism for differential privacy
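
As an illustration of the Gaussian mechanism applied on the device, the sketch below clips a model update to a maximum L2 norm and then adds Gaussian noise to every coordinate. This is not the YFL SDK API: clipL2, privatize, clipNorm, and noiseMultiplier are hypothetical names, and how the noise scale maps to a concrete privacy budget is outside the scope of this sketch.

```kotlin
import java.util.Random
import kotlin.math.sqrt

// Scale the update down so that its L2 norm is at most clipNorm.
fun clipL2(update: DoubleArray, clipNorm: Double): DoubleArray {
    val norm = sqrt(update.sumOf { it * it })
    val scale = if (norm > clipNorm) clipNorm / norm else 1.0
    return DoubleArray(update.size) { update[it] * scale }
}

// Gaussian mechanism: clip, then add N(0, sigma^2) noise per coordinate.
// sigma = noiseMultiplier * clipNorm is an assumed parameterization.
fun privatize(
    update: DoubleArray,
    clipNorm: Double,
    noiseMultiplier: Double,
    rng: Random
): DoubleArray {
    val clipped = clipL2(update, clipNorm)
    val sigma = noiseMultiplier * clipNorm
    return DoubleArray(clipped.size) { clipped[it] + sigma * rng.nextGaussian() }
}

fun main() {
    val update = doubleArrayOf(3.0, 4.0) // L2 norm 5.0, clipped to norm 1.0 below
    val noisy = privatize(update, clipNorm = 1.0, noiseMultiplier = 0.5, rng = Random())
    println(noisy.joinToString())
}
```

Clipping bounds how much any single client's update can influence the result, which is what makes a fixed noise scale meaningful.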

Slide 34

Slide 34 text

Slide 35

Slide 35 text

LFL application’s storage management
ML model storage
⎯ Training model and inference model
⎯ Model configurations
⎯ Integrity check for privacy configurations (e.g. rollout, uploading limit, LDP)
Version update of ML model
⎯ Different policy for major update and patch update
⎯ Delete user logs or reset uploading limit
⎯ Version matching with feature model
User log DB
⎯ Delete old logs beyond the maximum training input
⎯ Delete logs used for training

Slide 36

Slide 36 text

Table of contents
- What is Federated Learning?
- Federated Learning at LINE
- Why do we need a platform supporting multiple Federated Learning instances?
- LFL client platform supporting multiple Federated Learning instances
- On-device training of LFL client platform
- Inside of LFL client platform
- Lesson learned

Slide 37

Slide 37 text

Lesson Learned
A large-scale project with the collaboration of multiple teams
- Larger project, higher complexity
- Blind spots and discrepancies can exist in the process of collaboration

Slide 38

Slide 38 text

Lesson Learned
The importance of testing cannot be overemphasized
- Sample app for end-to-end testing
- Test tools for background processing
- Remote logging for critical sections

Slide 39

Slide 39 text

Thank you