
LFL Client Platform for Supporting Multiple Federated Learning Instances

Hyukjae Jang (LINE Plus / Messaging Client Engineering / Software Engineer)

https://tech-verse.me/ja/sessions/25
https://tech-verse.me/en/sessions/25
https://tech-verse.me/ko/sessions/25

Tech-Verse2022

November 18, 2022

Transcript

  1. What is federated learning? [Diagram comparing the two approaches: in cloud-based machine learning, clients send collected user data to the server and the server trains the ML model; in federated learning, the server distributes the ML model, each client trains it on its local user data, clients download and upload the ML model, and the server aggregates the uploaded models.]
  2. Federated learning at LINE: premium sticker recommendation. Before: recommendations based on usage history; after: recommendations by federated learning. A/B test result: a 5.56% uplift in premium sticker downloads.
  3. FL structure overview: common functionalities for FL. [Diagram: the client app handles user interaction, user logs, inference, and model training, producing an updated model; the server hosts the ML Model Repository and performs model aggregation.]
  4. On-device training: a resource-intensive functionality for FL. The mobile environment is resource-limited, and simultaneous on-device training may degrade the user experience.
  5. Table of contents
     - What is Federated Learning?
     - Federated Learning at LINE
     - Why do we need a platform supporting multiple Federated Learning instances?
     - LFL client platform supporting multiple Federated Learning instances
     - On-device training of LFL client platform
     - Inside of LFL client platform
     - Lessons learned
  6. LINE Federated Learning architecture. [Diagram: the LINE app contains the feature service client and the LFL Platform (client-side); it downloads the ML model, runs inference and training on user logs, and uploads the trained ML model to the LFL Platform (server-side), which holds the ML Model Repository and performs model aggregation; the updated model is also downloaded to the feature service, whose server holds the Feature Model Repository and serves inference requests driven by user interaction.]
  7. LFL client platform structure. [Diagram: the LFL Client Platform consists of a common module and one LFL application module per LFL application, on top of an ML library. The common module performs model version check, model download, result upload, init with config, and pushing logs to the LINE feature service; each LFL application module keeps its ML model and user logs and performs inference and model training; the common module updates the model and config, starts training, and gets the training result from the application modules.]
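
     To make the division of responsibilities concrete, here is a minimal Kotlin sketch of the common module's surface expressed as an interface. The names and types are illustrative assumptions, not the actual LFL client platform API.

     // Illustrative types standing in for the platform's real data structures.
     data class LFLConfig(val appId: String, val modelEndpoint: String)
     data class TrainResult(val modelBytes: ByteArray, val metrics: Map<String, Double>)
     data class UserLog(val timestamp: Long, val payload: String)

     // Hypothetical interface mirroring the common-module responsibilities listed on the slide.
     interface LFLCommonModule {
         fun initWithConfig(config: LFLConfig)                 // Init with config
         fun checkModelVersion(appId: String): Boolean         // Model version check
         fun downloadModel(appId: String): ByteArray           // Model download
         fun uploadResult(appId: String, result: TrainResult)  // Result upload
         fun pushLogs(appId: String, logs: List<UserLog>)      // Push user logs
     }
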
  8. Dependency inversion in LFL client platform: LFL Common's dependency on the application modules (before inversion). [Diagram: LFL_Common_Module contains an LFLApplicationManager with per-application functions (updateModelA(), updateModelB(), updateConfigA(), startTrainA(), startTrainB(), getTrainResultA(), getTrainResultB()) that directly call updateModel(), updateConfig(), startTrain(), and getTrainResult() on Application_A in LFL_Application_A_Module and on Application_B in LFL_Application_B_Module.]
  10. Dependency inversion in LFL client platform: dependency inversion between the LFL common module and the application modules (after inversion). [Diagram: LFL_Common_Module defines an LFLApplication interface with updateModel(), updateConfig(), startTrain(), and getTrainResult(); LFLApplication_A_Impl and LFLApplication_B_Impl in the application modules implement the interface, and the implementations are supplied to the LFLApplicationManager through dependency injection, so the common module only calls the interface.]
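
     A minimal Kotlin sketch of the dependency-inverted structure, with assumed signatures: the common module owns the LFLApplication interface, each application module implements it, and the implementations are injected into the manager so the common module never references a concrete application.

     // Interface owned by the common module (parameter types are assumptions).
     interface LFLApplication {
         fun updateModel(model: ByteArray)
         fun updateConfig(config: String)
         fun startTrain()
         fun getTrainResult(): ByteArray?
     }

     // Implementation living in an application module.
     class LFLApplicationAImpl : LFLApplication {
         override fun updateModel(model: ByteArray) { /* application A's model update */ }
         override fun updateConfig(config: String) { /* application A's config update */ }
         override fun startTrain() { /* run application A's on-device training */ }
         override fun getTrainResult(): ByteArray? = null // return A's locally trained model when available
     }

     // Manager in the common module, which only knows the interface.
     class LFLApplicationManager(private val applications: Map<String, LFLApplication>) {
         fun startTrain(appId: String) = applications[appId]?.startTrain()
     }

     // Dependency injection: the hosting app wires the concrete implementations into the common module.
     fun buildManager(): LFLApplicationManager =
         LFLApplicationManager(mapOf("applicationA" to LFLApplicationAImpl()))
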
  11. LFL application modules and the interfaces of the LFL client platform. [Diagram: (1) a local trigger fires a training trigger, (2) the LFL application manager in the common module starts training through the LFL application interface, and (3) the call reaches the actual application instance. The dependency-inverted structure from the previous slide (LFLApplication interface, LFLApplication_A_Impl / LFLApplication_B_Impl, dependency injection) connects the common module, the application modules, and the machine learning library.]
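
     Reusing the LFLApplicationManager from the sketch above, the trigger flow on this slide could look like the following; the class and method names are assumptions.

     // (1) a local trigger fires, (2) the manager dispatches through the interface,
     // (3) the injected implementation in the application module runs the actual training.
     class TrainingTrigger(private val manager: LFLApplicationManager) {
         fun onLocalTrigger(appId: String) {
             manager.startTrain(appId)
         }
     }
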
  12. Table of contents
     - What is Federated Learning?
     - Federated Learning at LINE
     - Why do we need a platform supporting multiple Federated Learning instances?
     - LFL client platform supporting multiple Federated Learning instances
     - On-device training of LFL client platform
     - Inside of LFL client platform
     - Lessons learned
  13. Requirements for on-device training. iOS: BGTaskScheduler and BGProcessingTask; Android: WorkManager. Background processing can run for more than 10 minutes, but the OS can interrupt it at any time. On-device training itself also has requirements: sufficient battery, sufficient storage, and an idle device (background processing).
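
     On Android, the conditions above map naturally onto WorkManager constraints. A sketch using real WorkManager APIs, with an illustrative worker and scheduling policy (the platform's actual worker is not shown in the deck):

     import android.content.Context
     import androidx.work.Constraints
     import androidx.work.ExistingWorkPolicy
     import androidx.work.OneTimeWorkRequestBuilder
     import androidx.work.WorkManager
     import androidx.work.Worker
     import androidx.work.WorkerParameters
     import java.util.concurrent.TimeUnit

     // Worker that would run one on-device training session; the OS may still interrupt it.
     class TrainingWorker(context: Context, params: WorkerParameters) : Worker(context, params) {
         override fun doWork(): Result {
             // run one training session here
             return Result.success()
         }
     }

     fun scheduleTraining(context: Context, delayMinutes: Long) {
         val constraints = Constraints.Builder()
             .setRequiresBatteryNotLow(true)  // battery condition
             .setRequiresStorageNotLow(true)  // storage condition
             .setRequiresDeviceIdle(true)     // device idle condition (API 23+)
             .build()
         val request = OneTimeWorkRequestBuilder<TrainingWorker>()
             .setConstraints(constraints)
             .setInitialDelay(delayMinutes, TimeUnit.MINUTES)
             .build()
         WorkManager.getInstance(context)
             .enqueueUniqueWork("lfl-training", ExistingWorkPolicy.REPLACE, request)
     }
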
  14. Excessively skewed FL participation: the 1st to 7th days without rollout. [Diagram: every day, each client trains the initial model locally on its user data and uploads the locally trained model to the cloud for model aggregation.]
  15. Excessively skewed FL participation: the 1st to 7th days with rollout. Rollout condition: rollout = {hash(concat(userKey, saltKey)) % bucketSize} in [rolloutBegin, rolloutEnd]. Example model_config.json: { "training": { …, "uploading limit": 2 }, "rollout": { "salt_key": "ranker", "slots": { "begin": 0, "end": 10 } } }
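
     A sketch of how a client could evaluate the rollout condition from model_config.json. The hash algorithm (SHA-256) and the bucket size of 100 are assumptions; the platform's actual choices are not shown in the deck.

     import java.math.BigInteger
     import java.security.MessageDigest

     // rollout = {hash(concat(userKey, saltKey)) % bucketSize} in [rolloutBegin, rolloutEnd]
     fun isInRollout(userKey: String, saltKey: String, begin: Int, end: Int, bucketSize: Int = 100): Boolean {
         val digest = MessageDigest.getInstance("SHA-256").digest((userKey + saltKey).toByteArray())
         val slot = BigInteger(1, digest).mod(BigInteger.valueOf(bucketSize.toLong())).toInt()
         return slot in begin..end
     }

     // With "salt_key":"ranker" and "slots":{"begin":0,"end":10}, a client participates only if
     // isInRollout(itsUserKey, "ranker", begin = 0, end = 10) is true, which bounds daily participation.
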
  16. Trigger background processing repeatedly: need to wait for configuration changes. [Flow: the OS triggers the background task; the LFL Platform (client-side) in the LINE app checks the config (e.g. rollout) and checks for updates against the LFL Platform (server-side); if it is not ready to train (F), it registers the next background task.]
  17. Trigger background processing repeatedly: need to wait for configuration changes. [Same flow as the previous slide, for the case where rollout is disabled: the client is not ready to train, so it registers the next background task and waits.]
  18. Trigger background processing repeatedly: need to wait for additional user logs. [Flow: the OS triggers the background task; the client checks for updates; if it is ready to train (T), it trains the model, uploads it to the LFL Platform (server-side), updates the model, and deletes the used user logs; otherwise (F) it registers the next background task.]
  19. Trigger background processing repeatedly: need to wait for additional user logs. [Same flow as the previous slide, for the case where there are not enough user logs: the client is not ready to train, so it registers the next background task and waits.]
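
     A compact sketch of the decision flow on slides 16-19, with illustrative names standing in for the LFL Platform (client-side):

     // Hypothetical interface standing in for the client-side LFL platform in the diagrams.
     interface LFLClientPlatform {
         fun checkUpdate()                       // check config (e.g. rollout) and model updates
         fun isReadyToTrain(): Boolean           // false when rollout is disabled or user logs are insufficient
         fun trainModel(): ByteArray
         fun uploadModel(model: ByteArray)
         fun deleteUsedUserLogs()
         fun registerNextBackgroundTask()
     }

     class LFLBackgroundTask(private val platform: LFLClientPlatform) {
         fun run() {
             platform.checkUpdate()
             if (!platform.isReadyToTrain()) {
                 platform.registerNextBackgroundTask() // wait and try again later
                 return
             }
             val trainedModel = platform.trainModel()  // on-device training on accumulated user logs
             platform.uploadModel(trainedModel)        // the server side aggregates uploaded models
             platform.deleteUsedUserLogs()             // logs used for training are deleted on the device
             platform.registerNextBackgroundTask()
         }
     }
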
  20. Trigger background processing repeatedly: background processing for FL is inefficient because the client needs to wait for configuration changes or additional user logs. Solution: introduce a retry interval and a train interval.
  21. Interval-based scheduling: schedule the training of multiple LFL applications. Challenges: diverse retry intervals and train intervals, and only a single training session at a time → training duration estimation. [Timeline: applications A, B, and C alternate between retry intervals (BGTask triggered but not ready) and training sessions followed by train intervals.]
  22. Interval-based scheduling (continued): the same timeline, highlighting the two problems to solve: duration estimation and training scheduling.
  23. Interval-based scheduling: share a single background processing slot for all LFL applications' model training, and register the background processing trigger with an interval-based delay. [Diagram: background processing runs Training A, Training C, and Training B back to back, each scheduled after the minimum delay.]

     fun registerNextTrainingWithDelay() {
         val waitingTime = LFLApplications.minOf { application ->
             maxOf(
                 application.getLatestTrainReadyCheckTime() + application.getRetryInterval(),
                 application.getLatestTrainSuccessTime() + application.getTrainInterval()
             )
         } - System.currentTimeMillis()
         registerNextBackgroundProcessingTriggerWithDelay(waitingTime)
     }
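
     One way the computed waitingTime could be handed to the OS scheduler on Android, assuming the WorkManager-based TrainingWorker sketched after slide 13; the Context parameter and the unique-work name are assumptions added for the sketch.

     import android.content.Context
     import androidx.work.ExistingWorkPolicy
     import androidx.work.OneTimeWorkRequestBuilder
     import androidx.work.WorkManager
     import java.util.concurrent.TimeUnit

     fun registerNextBackgroundProcessingTriggerWithDelay(context: Context, waitingTimeMillis: Long) {
         val request = OneTimeWorkRequestBuilder<TrainingWorker>()
             .setInitialDelay(waitingTimeMillis.coerceAtLeast(0L), TimeUnit.MILLISECONDS)
             .build()
         WorkManager.getInstance(context)
             .enqueueUniqueWork("lfl-training", ExistingWorkPolicy.REPLACE, request)
     }
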
  24. Select application to train: on-device training with interval-based scheduling. [Flow: the OS triggers the background task; applications in the LFL application list (A, B, C) are filtered by the interval conditions and one of them (e.g. Application A) is selected; the model downloader checks for updates (model config, inference model, training model) and updates latest_retry_time; if the application is ready to train, the LFL trainer trains the model and updates latest_train_time; if it is ready to upload, the model uploader uploads the model; finally the next delay time (the minimum waiting time) is calculated and the next background task is registered.]
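
     A sketch of the application-selection step, with illustrative data types; the tie-breaking rule (prefer the application trained least recently) is an assumption, not necessarily the platform's actual policy.

     data class LFLAppState(
         val id: String,
         val latestRetryTime: Long,      // updated after "check update" (latest_retry_time)
         val latestTrainTime: Long,      // updated after a successful training (latest_train_time)
         val retryIntervalMillis: Long,
         val trainIntervalMillis: Long
     )

     // Filter applications whose retry and train intervals have both elapsed, then pick one to train.
     fun selectApplicationToTrain(apps: List<LFLAppState>, now: Long): LFLAppState? =
         apps.filter { app ->
             now >= app.latestRetryTime + app.retryIntervalMillis &&
                 now >= app.latestTrainTime + app.trainIntervalMillis
         }.minByOrNull { it.latestTrainTime }
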
  29. Table of contents
     - What is Federated Learning?
     - Federated Learning at LINE
     - Why do we need a platform supporting multiple Federated Learning instances?
     - LFL client platform supporting multiple Federated Learning instances
     - On-device training of LFL client platform
     - Inside of LFL client platform
     - Lessons learned
  30. Yuki Federated Learning (YFL) SDK: an ML library for training and inference, focused on being lightweight.
     - A federated learning library for cross-platform mobile development (supports iOS and Android)
     - Based on ONNX Runtime ( https://onnx.ai )
     - Model conversion from TensorFlow or PyTorch
     - Currently around 1.2 MB
     - Limited set of operations, and only a CPU backend is supported
     - Local Differential Privacy (LDP) supported, using the Gaussian mechanism for differential privacy
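
     The deck does not show how the YFL SDK parameterizes its noise, so the following is a generic textbook sketch of the Gaussian mechanism for (epsilon, delta)-LDP applied to a model update before upload; the clipping step and the sigma calibration are the standard analytic form, not YFL's actual implementation.

     import kotlin.math.PI
     import kotlin.math.cos
     import kotlin.math.ln
     import kotlin.math.sqrt
     import kotlin.random.Random

     fun privatizeUpdate(update: DoubleArray, clipNorm: Double, epsilon: Double, delta: Double): DoubleArray {
         // Clip the update to L2 norm <= clipNorm so its sensitivity is bounded.
         var squaredNorm = 0.0
         for (v in update) squaredNorm += v * v
         val norm = sqrt(squaredNorm)
         val scale = if (norm > clipNorm) clipNorm / norm else 1.0
         // Classic calibration: sigma = S * sqrt(2 * ln(1.25 / delta)) / epsilon (valid for epsilon <= 1).
         val sigma = clipNorm * sqrt(2.0 * ln(1.25 / delta)) / epsilon
         return DoubleArray(update.size) { i -> update[i] * scale + gaussianNoise(sigma) }
     }

     // Box-Muller sample from N(0, sigma^2); the Kotlin standard library has no Gaussian sampler.
     fun gaussianNoise(sigma: Double): Double {
         val u1 = Random.nextDouble().coerceAtLeast(1e-12)
         val u2 = Random.nextDouble()
         return sigma * sqrt(-2.0 * ln(u1)) * cos(2.0 * PI * u2)
     }
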
  32. LFL application's storage management.
     ML model storage:
     - Training model and inference model
     - Model configurations
     - Integrity check for privacy configurations (e.g. rollout, uploading limit, LDP)
     Version update of the ML model:
     - Different policies for major updates and patch updates
     - Delete user logs or reset the uploading limit
     - Version matching with the feature model
     User log DB:
     - Delete old logs beyond the maximum training input
     - Delete logs used for training
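
     A sketch of the user-log retention rules above, with illustrative types and fields:

     data class StoredUserLog(val id: Long, val createdAt: Long, val usedForTraining: Boolean)

     // Drop logs already used for training, then keep only the newest logs
     // up to the maximum training input size.
     fun pruneUserLogs(logs: List<StoredUserLog>, maxTrainingInput: Int): List<StoredUserLog> =
         logs.filterNot { it.usedForTraining }
             .sortedByDescending { it.createdAt }
             .take(maxTrainingInput)
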
  33. Table of contents
     - What is Federated Learning?
     - Federated Learning at LINE
     - Why do we need a platform supporting multiple Federated Learning instances?
     - LFL client platform supporting multiple Federated Learning instances
     - On-device training of LFL client platform
     - Inside of LFL client platform
     - Lessons learned
  34. Lessons learned: a large-scale project built in collaboration with multiple teams. The larger the project, the higher the complexity, and blind spots and discrepancies can appear in the process of collaboration.
  35. Lessons learned: the importance of testing cannot be overemphasized. A sample app for end-to-end testing, test tools for background processing, and remote logging for critical sections.