Explore in Timeline : from a ML pipeline perspective

Hochul Kim
LINE Plus, Data Science Dev & ML Platform Dev, Machine Learning Engineer
https://linedevday.linecorp.com/2020/ja/sessions/5543
https://linedevday.linecorp.com/2020/en/sessions/5543

LINE DevDay 2020

November 26, 2020

Transcript

  1. Agenda

    › About Explore in Timeline › What is a Machine Learning Pipeline? › Explore in Timeline: from a ML Pipeline perspective › Summary › Next mission
  2. About Explore in Timeline

    Available Features › Post › Post contents › From other LINE Family services › News, Live, etc. › User › User’s preference / category
  3. About Explore in Timeline

    But we have to consider… › Large Scale Data › Service Requirement › Resource Limitation › Training Time › Model Complexity › etc.
  4. What is a ML Pipeline?

    Features › High cost in terms of time and complexity › Independent tasks that depend on each other from a service perspective Data Driven, But not Data Pipeline › Requires an understanding of domain knowledge › The goal is to provide a service, so the service environment must be understood and considered Definition › The continuous process of planning and managing the tasks of Data Science, ML Model Development, Production, and Infra in order to use Machine Learning technology as a service.
  5. What is a ML Pipeline?

    › Explore › A type of machine learning based service › How does it differ from a classical service?
  6. What is a ML Pipeline?

    Machine Learning Approach vs. Classical Approach › Classical: Logic and Data are given, Result is wanted › ML: Data and Result are given, Model is wanted
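The contrast on this slide can be illustrated with a minimal sketch. The "long post" task, the threshold of 100, and the midpoint learner are all hypothetical choices made for illustration; none of them come from the talk.

```python
from statistics import mean


def classical_is_long_post(length: int) -> bool:
    """Classical approach: the logic (a threshold of 100) is written by hand."""
    return length > 100


def learn_threshold(lengths, labels):
    """ML approach: derive the logic (a threshold) from data plus known results.

    Uses the midpoint between the two class means, a deliberately simple learner.
    """
    positives = [x for x, y in zip(lengths, labels) if y]
    negatives = [x for x, y in zip(lengths, labels) if not y]
    return (mean(positives) + mean(negatives)) / 2


# Toy training data (hypothetical): post lengths and whether each was labeled "long".
lengths = [20, 40, 60, 140, 160, 180]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(lengths, labels)


def learned_is_long_post(length: int) -> bool:
    """Inference with the learned model, used exactly like the classical rule."""
    return length > threshold
```

In both cases the service calls a function on input data; the difference is only whether a person or a training step produced the rule inside it.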
  7. What is a ML Pipeline?

    Machine Learning Approach vs. Classical Approach › (Diagram labels: Machine Learning Part, Classical Service, Data, Training Data, Learning Process, Model, Input Data, Inference Process, Result)
  8. What is a ML Pipeline?

    Machine Learning Approach vs. Classical Approach › Components: Feature Extraction, Data Verification, Data Collection, Data Transform, Process Management, Modeling, Serving Infra, Automation, Data Infra, GPU, Analysis Tool & Knowledge, Security
  9. What is a ML Pipeline?

    Machine Learning Approach vs. Classical Approach › Components: Feature Extraction, Data Verification, Data Collection, Data Transform, Process Management, Modeling, Serving Infra, Automation, Data Infra, GPU, Security, Analysis Tool & Knowledge › Groups: Data Science & ML, System & Infra
  10. What is a ML Pipeline?

    Machine Learning Approach vs. Classical Approach › Components: Feature Extraction, Data Verification, Data Collection, Data Transform, Process Management, Modeling, Serving Infra, Automation, Data Infra, GPU, Analysis Tool & Knowledge, Security › Groups: Data Science & ML, System & Infra, Model Core
  11. What is a ML Pipeline?

    ML Pipeline by Data flow (all stages under Scheduling / Managing) › Collect data: Prepare Data Infra, Collect & Store Data, Security › Data Analysis: Validate Data, EDA › Feature Eng: Create Feature, Prepare Feature for train & service › Train ML Model: GPU / Computing, Develop & Tune ML Model › Model Serving: Prepare Serving Infra, Deploy trained model for service › Production: API for Service, Service Monitoring
  12. Explore in Timeline: from a ML Pipeline perspective

    › Steps of pipeline › Infra Structure of Explore › Case Study
  13. Infra Structure of Explore

    (Diagram) Pipeline stages: Collect data, Feature Eng, Data Analysis, Train ML Model, Model Serving, Production › Components: MySQL/Redis, Serving Cluster, Filtering, API, Object Storage, Data Cluster, Jupyter › Automation / Scheduling › User approach › Operation
  14. Infra Structure of Explore

    Infra in use › Raw Data Storage: Data Cluster (with IU), HDFS + Hive with Spark, Kafka › Data Analysis and Modeling: Jupyter (with Jutopia), On Demand Jupyter environment, Distributed GPU training environment › Serving Data Storage: RDB: MySQL, IMDB: Redis Cluster › Serving Platform: Kubernetes Serving Cluster, Clipper, API › Workflow automation: Airflow
  20. General approach

    ML Pipeline by Data flow (all stages under Scheduling / Managing) › Collect data: Prepare Data Infra, Collect & Store Data, Security › Data Analysis: Validate Data, EDA › Feature Eng: Create Feature, Prepare Feature for train & service › Train ML Model: GPU / Computing, Develop & Tune ML Model › Model Serving: Prepare Serving Infra, Deploy trained model for service › Production: API for Service, Service Monitoring
  21. Collect Data

    Roles: App Developer, Front Engineer, Backend Engineer, Data Engineer, ML Developer, Domain Expert › Infra: Data Cluster, MySQL/Redis, Object Storage › Define the data to collect › With a domain expert › What we collect › Users’ views and clicks › Content objects / metadata › Some data is dumped from the cluster to serving
  22. Data Analysis

    Infra: MySQL/Redis, Object Storage, Data Cluster, Jupyter › Validate data › Analyze data › EDA & visualization
  23. Feature Engineering

    Infra: MySQL/Redis, Jupyter, Serving Cluster + Kafka, Object Storage, Data Cluster › Develop features › Execution type › Batch: Jupyter + papermill › Stream: Kafka + Spark
  25. Train ML Model

    Infra: MySQL/Redis, Filtering, Data Cluster, Jupyter + GPUs › Aggregate data › Train the model › Prepare data for inference
  26. Model Serving

    Infra: MySQL/Redis, Serving Cluster › Deploy the trained model › Load balancing
  27. Production

    Infra: MySQL/Redis, Serving Cluster, Filtering, API, Object Storage, Data Cluster › Map the API to the model › Monitoring › Apply required filters
  28. Automation / Scheduling

    (Diagram) Components: MySQL/Redis, Serving Cluster, Filtering, API, Object Storage, Data Cluster, Jupyter › User approach › Operation
  29. Automation / Scheduling

    › Task › Providing the service involves tasks such as data preprocessing, modeling, and so on › Each task must be executed in the order determined by its data dependencies › Things to manage › Dependencies › Task execution schedule & status monitoring
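Running tasks in data-dependency order while tracking status can be sketched with the Python standard library alone. The talk lists Airflow for workflow automation; the sketch below does not use Airflow's API, and the task names are hypothetical placeholders that mirror the pipeline stages.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "collect_data": set(),
    "validate_data": {"collect_data"},
    "create_features": {"validate_data"},
    "train_model": {"create_features"},
    "prepare_inference_data": {"create_features"},
    "deploy_model": {"train_model", "prepare_inference_data"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def run_all(deps, runner):
    """Run tasks in dependency order and record a status for each task."""
    status = {}
    for task in execution_order(deps):
        # A task runs only if every upstream task succeeded.
        if all(status.get(upstream) == "success" for upstream in deps[task]):
            try:
                runner(task)
                status[task] = "success"
            except Exception:
                status[task] = "failed"
        else:
            status[task] = "skipped"
    return status


order = execution_order(dependencies)
```

The returned status dictionary covers the two things the slide says must be managed: which task depends on which, and what happened to each task during a run.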
  30. Case Study

    In Real World › Various types of cases › Some cases do not perfectly match the ML Pipeline perspective › Understand the concept of the Machine Learning Pipeline through some cases › Follow embedding › Perfectly matches the ML pipeline steps › Offline Recommendation (for LINE Smart Channel) › Uses a pre-trained model
  31. Case: Follow embedding

    About Case › Background › Previous User Embedding › Created embeddings only for the authors of posts to recommend › Follow Relation › A directed relation (e.g. ‘A follows B’ is represented as ‘A→B’) › If A follows B, A can subscribe to B’s posts even if A and B are not friends › Problem › Too many follow relations › JP, TH: about 2 billion, TW: about 0.5 billion
  32. Case: Follow embedding

    Follow relations to embedding › (Figure: a list of users, ending with User n, each paired with an embedding vector [ … ])
  33. Case: Follow embedding

    About Case › Goal › Implement a model that generates an embedding for each user › A user’s embedding represents the users that user follows › The higher the similarity between two embeddings, the more similar the followed users › Daily update › Difficulty › Too many relations
  34. Case: Follow embedding

    Collect Data › Infra: Data Cluster, MySQL/Redis, Object Storage › Define the data to collect › With a domain expert › What we collect › Users’ views and clicks › Content objects / metadata › Some data is dumped from the cluster to serving
  35. Case: Follow embedding

    Collect Data › Infra: Data Cluster, MySQL/Redis, Object Storage › Data to collect › Follow relations › Uses data that already exists in the cluster
  36. Case: Follow embedding

    Data Analysis › Infra: MySQL/Redis, Object Storage, Data Cluster, Jupyter › Validate data › Analyze data › EDA & visualization
  37. Case: Follow embedding

    Data Analysis › Infra: MySQL/Redis, Object Storage, Data Cluster, Jupyter › Data Analysis › Find out whether the relation data is meaningful › Average number of users a user follows, average number of followers › Ratio of top-k users based on follower count
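The statistics named on this slide can be computed from a list of directed follow edges, as in the minimal sketch below. The slide does not define them precisely, so the definitions here are assumptions: averages are taken over users who appear on the relevant side of at least one edge, and the top-k ratio is the share of all follow edges that point at the k most-followed users. The edge list is toy data.

```python
from collections import Counter


def follow_stats(edges, k):
    """Summarize directed follow edges given as (follower, followee) pairs.

    Assumes a non-empty edge list without duplicate edges.
    """
    following = Counter(src for src, _ in edges)  # users followed, per follower
    followers = Counter(dst for _, dst in edges)  # followers, per followed user
    top_k_edges = sum(count for _, count in followers.most_common(k))
    return {
        "avg_following": len(edges) / len(following),
        "avg_followers": len(edges) / len(followers),
        "top_k_ratio": top_k_edges / len(edges),
    }


# Toy data: "a" follows "b", "c" follows "b", and so on.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
stats = follow_stats(edges, k=1)
```

A high top-k ratio would indicate that follows concentrate on a few popular accounts, which is one way to judge whether the relation data carries useful signal.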
  38. Case: Follow embedding

    Feature Engineering › Infra: MySQL/Redis, Jupyter, Serving Cluster, Object Storage, Data Cluster › Develop features › Execution type › Batch: Jupyter + papermill › Stream: Kafka + Spark
  39. Case: Follow embedding

    Feature Engineering › Infra: MySQL/Redis, Jupyter, Serving Cluster, Object Storage, Data Cluster › Tried graph based embedding models › Tried APP, ASNE, and so on › They only worked for small sets of relations › PBG (PyTorch BigGraph) › Can model the whole set of relations
  40. Case: Follow embedding

    Train ML Model › Infra: MySQL/Redis, Filtering, Data Cluster, Jupyter + GPUs › Aggregate data › Train the model › Prepare data for inference
  41. Case: Follow embedding

    Train ML Model › Infra: MySQL/Redis, Filtering, Data Cluster, Jupyter + GPUs › CPU based PBG › Running time > 1 day (measured with JP relations) › GPU based PBG › Set up an environment to run it and fixed some code › Running time about 6 hours (measured with JP relations)
  42. Case: Follow embedding

    Model Serving › Find users who follow users similar to the ones I follow › User recommendation › Based on the users that similar users follow, recommend users that I have not followed yet › Post recommendation › Recommend posts written by similar users
  43. Case: Follow embedding

    Model Serving › Infra: MySQL/Redis, Serving Cluster › Deploy the trained model › Load balancing
  44. Case: Follow embedding

    Model Serving › Infra: MySQL/Redis, Serving Cluster, Jupyter › Find users who follow users similar to the ones I follow › Based on the followed users › User recommendation › Recommend users that I have not followed yet › Post recommendation › Recommend posts written by similar users
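The user recommendation step can be sketched as a simple vote over what similar users follow. This is an illustration under assumptions, not the production logic: the list of similar users is taken as given (in the talk it comes from embedding similarity), and the ranking by vote count and the user names are hypothetical.

```python
from collections import Counter


def recommend_users(me, similar_users, follows, limit=3):
    """Recommend accounts followed by similar users that `me` does not follow yet."""
    already_followed = follows.get(me, set())
    votes = Counter()
    for other in similar_users:
        for candidate in follows.get(other, set()):
            if candidate != me and candidate not in already_followed:
                votes[candidate] += 1
    # Most frequently followed candidates first; ties broken by name for determinism.
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [user for user, _ in ranked[:limit]]


# Toy follow graph: user -> set of users they follow.
follows = {
    "me": {"u1"},
    "s1": {"u1", "u2", "u3"},
    "s2": {"u2", "me"},
}
recommendations = recommend_users("me", ["s1", "s2"], follows)
```

Post recommendation follows the same pattern, with the posts of similar users as candidates instead of the accounts they follow.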
  45. Case: Follow embedding

    Model Serving › Infra: MySQL/Redis, Jupyter, Embedding, Serving Cluster › Create the embeddings first › Save the created embeddings for production use
  46. Case: Follow embedding

    Production › Hard to calculate the similarity between all users in real time › Offer pre-inferred results for user/post recommendation › Use embedding based search
  47. Case: Follow embedding

    Production › Infra: MySQL/Redis, Serving Cluster, API, Object Storage, Data Cluster › Problems › Data size is too large › Cannot satisfy the time requirements
  48. Case: Follow embedding

    Production › (Figure: the input is a collection of post embedding vectors, ending with Post n, plus a query vector [ … ]; the desired output is a list of posts)
  49. Case: Follow embedding

    Production › Embedding Search Engine › Collection of vectors: one vector [ … ] per post, up to Post n › Query by vector [ … ] › Returns a list of similar posts with scores
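Query-by-vector search can be sketched as a brute-force scan that scores every stored vector against the query. The talk does not name the search engine it uses, so this only illustrates the interface (a collection of vectors in, a scored list out); cosine similarity, the post ids, and the two-dimensional vectors are assumptions made for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(collection, query, top_k=2):
    """Return the top_k (item_id, score) pairs most similar to `query`, best first."""
    scored = ((item_id, cosine(vector, query)) for item_id, vector in collection.items())
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Toy collection of post embeddings (hypothetical ids and values).
collection = {
    "post1": [1.0, 0.0],
    "post2": [0.0, 1.0],
    "post3": [1.0, 1.0],
}
results = search(collection, [1.0, 0.1])
```

A full scan costs time proportional to the collection size per query, which is why large collections are typically served from a dedicated index rather than computed per request.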
  50. Case: Follow embedding

    Production › (Figure: a query vector [ … ] goes to the Search Engine, which returns posts; Redis also appears in the diagram)
  51. Case: Follow embedding

    Production › (Diagram) Components: MySQL/Redis, Jupyter, Embedding, Serving Cluster, API, Search Engine, Offline Inference
  52. Case: Offline recommendation

    About Case › Background › Required recommendation › Recommend a personalized post for each user › For Smart Channel › Offline method › Inference must be run in advance for all users after the model is updated › Problem › Too many users to run inference for
  53. Case: Offline recommendation

    About Case › Goal › Use the most recently trained model › Do not affect the service that is currently serving › Provide recommendation results in the format Smart Channel expects › Difficulty › The present system does not support offline serving
  54. Case: Offline recommendation

    Existing tasks to serving › Task-1, Task-2, …, Task-n › Processed Data › Trained Model › Inference Results
  55. Case: Offline recommendation

    Collect Data / Data Analysis / Feature Engineering / Train Model › Use the already processed data and trained model › Only need to copy the model file and the corresponding data › Offline inference for users › Smart Channel needs a recommendation for ALL users › Needs a list of all users › Inference test with the existing system › Takes about 72 hours
  56. Case: Offline recommendation

    Model Serving / Production › Model Serving › The running service cannot produce recommendations for all users › Due to lack of resources, latency, etc. › Set up a separate cluster for offline serving › Inference test with the new system: < 5h › Production › Save the inference results on HDFS first › Insert the results into Smart Channel
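The offline flow (run inference for every user in advance, store the results, then load them downstream) can be sketched as a batched loop. Everything named here is a stand-in: `fake_predict` replaces the trained model, the in-memory list replaces HDFS, and the batch size is arbitrary.

```python
def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def offline_inference(user_ids, predict, batch_size, sink):
    """Run `predict` over every user in batches and pass each result batch to `sink`."""
    total = 0
    for chunk in batches(user_ids, batch_size):
        results = predict(chunk)  # one model call per batch, not per user
        sink(results)             # e.g. append to storage that is loaded downstream later
        total += len(results)
    return total


def fake_predict(chunk):
    """Stand-in model: returns one (user, recommended post) pair per user."""
    return [(user, f"post-for-{user}") for user in chunk]


stored = []  # stand-in for the storage that receives inference results
count = offline_inference([f"u{i}" for i in range(5)], fake_predict, 2, stored.extend)
```

Because the loop only needs the model file, the user list, and a place to write, it can run on a separate cluster without touching the online serving path.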
  57. Case: Offline recommendation

    Model Serving / Production › (Diagram labels: Offline serving Cluster, Inference Instance, Model, Data for Serving, HDFS, Smart Channel)
  58. Summary

    › ML based service › Various types of tasks › High complexity › From logging to serving › From Data Science to Frontend/Backend › From Math to Computer Science › Periodic
  59. Summary

    If you want to serve a ML based service › Approach it from the ML Pipeline perspective › Divide tasks according to the stages of the ML pipeline › Define dependencies between tasks according to their data relationships › Run tasks continuously › With a ML Pipeline › Visualize the tasks of a project (service) › Automate the ETL / train / deploy process to enable tracking and debugging › Easy to find things to improve in both model and infra
  60. Develop Process

    (Diagram labels: Jupyter Develop Env, Jupyter Notebook, Convert Notebook, Build Image, Docker Images, Prepare Data, Data Cluster, MySQL/Redis, Prepare Serving, Serving Cluster)
  61. Next mission

    Test Environment != Serving Environment › The test environment uses Jutopia, whose environment is organized by the user (data scientist, model developer) › Difference › Packages / Resource / ACL … are different Solution › Prepare a cluster that runs on the Jupyter notebook environment
  62. Test Environment != Serving Environment

    (Diagram labels: Jupyter Develop Env + Serving Cluster, Jupyter Notebook, Docker Images, Prepare Data, Data Cluster, MySQL/Redis, Prepare Serving)
  63. Next mission

    Separated data and feature › Storage for analysis / training purposes › Large scale data › Raw data is saved on the data cluster › Features / models are also saved on Hive/HDFS › For service › Needs only a part of the data › Low latency, large scale traffic › Dump the needed data from Hive / HDFS Solution › Offer Feature as a Service › For large scale data
  64. Separated data and feature

    (Diagram labels: Jupyter Develop Env + Serving Cluster, Jupyter Notebook, Docker Images, Feature Storage, Data Cluster)