Definition
› The continuous process of planning and managing tasks across data science, ML model development, production, and infra to deliver machine learning technology as a service
Features
› Tasks are independent in terms of time and complexity, but have dependencies from a service perspective
› Data driven, but not a data pipeline
› Need to understand domain knowledge
› The goal is to provide a service, so the service environment must be understood and taken into account
Approach
› Components: Data Collection, Data Verification, Data Transform, Feature Extraction, Analysis Tool & Knowledge, Modeling, Serving, Process Management, Infra Automation, Data Infra, GPU, Security
› Grouped into three areas: Data Science & ML, System & Infra, and the Model Core
By data flow
› Collect Data: prepare data infra, collect & store data, security
› Data Analysis: validate data, EDA
› Feature Eng: create features, prepare features for train & service
› Train ML Model: GPU / computing, develop & tune the ML model
› Model Serving: prepare serving infra, deploy the trained model for service
› Production: API for the service, service monitoring
[Diagram: pipeline stages (Collect Data, Data Analysis, Feature Eng, Train ML Model, Model Serving, Production) mapped onto infra components (Data Cluster, Jupyter, MySQL/Redis, Object Storage, Serving Cluster, Filtering, API), from user approach to operation]
Infra Structure of Explore
› Raw Data Storage: HDFS + Hive with Spark, Kafka
› Data Analysis and Modeling: Jupyter (with Jutopia), on-demand Jupyter environments, distributed GPU training environment
› Serving Data Storage: RDB (MySQL), IMDB (Redis Cluster)
› Serving Platform: Kubernetes serving cluster, Clipper, API
› Workflow automation: Airflow
Collect Data (roles: App Developer, Front Engineer, Backend Engineer, Data Engineer, ML Developer, Domain Expert; infra: Data Cluster, MySQL/Redis, Object Storage)
› Define the data to collect, together with the domain expert
› What we collect: users' views and clicks, content objects / metadata
› Some data is dumped from the data cluster to the serving side
Data Analysis (infra: Data Cluster, Jupyter)
› Validate the data
› Analyze the data: EDA & visualization
Train ML Model (infra: Data Cluster, Jupyter, Production + GPUs)
› Aggregate data
› Train the model
› Prepare data for inference
Model Serving (infra: Serving Cluster)
› Deploy the trained model
› Load balancing
Production (infra: Serving Cluster, Filtering, API)
› Map the API to the model
› Monitoring
› Apply the required filters
› In each stage there are tasks such as data preprocessing, modeling, and so on
› Each task must be executed in an order that follows its data dependencies
› Things to manage
  › Dependencies
  › Task execution schedule & status monitoring (a minimal Airflow sketch follows)
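The deck lists Airflow for workflow automation, so here is a minimal sketch of how such dependencies and a schedule could be declared; the DAG id, task names, schedule, and task bodies are illustrative assumptions, not the actual pipeline.

```python
# Minimal Airflow DAG sketch: tasks run in data-dependency order and the
# scheduler tracks execution status. All names and the schedule are
# illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess():
    pass  # e.g. build features from the raw logs


def train():
    pass  # e.g. train the embedding model


def deploy():
    pass  # e.g. push the trained model to the serving cluster


with DAG(
    dag_id="ml_pipeline_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # daily update, as in the follow-embedding case
    catchup=False,
) as dag:
    t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_deploy = PythonOperator(task_id="deploy", python_callable=deploy)

    # data dependency: preprocess -> train -> deploy
    t_preprocess >> t_train >> t_deploy
```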
› Understand the concept of the machine learning pipeline through some cases
› Some cases do not perfectly match the ML pipeline perspective
› Follow Embedding: matches the ML pipeline steps perfectly
› Offline Recommendation (for LINE Smart Channel): uses a pre-trained model
Case: Follow Embedding
› Create only the embeddings of the authors of the posts to recommend
› Follow relation
  › A directed relation (e.g. "A follows B" is represented as "A→B")
  › If A follows B, then A can subscribe to B's posts even if A and B are not friends
› Problem: too many follow relations
  › JP, TH: about 2 billion; TW: about 0.5 billion
› Train a model that generates an embedding for each user
› A user's embedding represents the users that user follows
› The higher the similarity between two embeddings, the more similar the followed users (see the sketch below)
› Updated daily
› Difficulty: too many relations
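A minimal sketch of the similarity idea, assuming cosine similarity between embedding vectors (the talk does not state the exact similarity measure); the vectors are made up.

```python
# Toy comparison of two user embeddings with cosine similarity: a higher
# score means the two users follow a more similar set of accounts.
# The vectors here are made up; real ones come from the trained model.
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


user_a = np.array([0.1, 0.8, -0.3])
user_b = np.array([0.2, 0.7, -0.1])
print(cosine_similarity(user_a, user_b))
```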
Case: Follow Embedding / Collect Data
› Data to collect: follow relations
› Uses data that already exists in the data cluster
Case: Follow Embedding / Data Analysis
› Find out whether the relation data is meaningful (an EDA sketch follows)
  › Average number of users a user follows, average number of followers
  › Ratio of the top-k users by follower count
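A sketch of this kind of EDA with Spark, since the data lives on Hive/HDFS; the table and column names are assumptions, and "ratio of the top-k users" is interpreted here as the share of follow edges pointing to the k most-followed users.

```python
# EDA sketch over the follow-relation table (names are assumptions):
# average followed count, average follower count, and the share of all
# follow edges that point to the top-k most-followed users.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("follow_relation_eda").enableHiveSupport().getOrCreate()
edges = spark.table("graph_db.follow_edges")  # columns: src (follower), dst (followed)

avg_followed = edges.groupBy("src").count().agg(F.avg("count")).first()[0]
avg_follower = edges.groupBy("dst").count().agg(F.avg("count")).first()[0]

k = 1000
top_k = edges.groupBy("dst").count().orderBy(F.desc("count")).limit(k)
top_k_ratio = top_k.agg(F.sum("count")).first()[0] / edges.count()

print(avg_followed, avg_follower, top_k_ratio)
```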
Case: Follow Embedding / Feature Engineering
› Tried graph-based embedding models (APP, ASNE, and so on): they only worked for small-scale relations
› PBG (PyTorch-BigGraph): can model the whole relation graph (a rough config sketch follows)
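A rough PyTorch-BigGraph config sketch for a single directed "follow" relation; the paths, partition count, dimension, and hyperparameters are assumptions, not the values used in the talk.

```python
# Rough PyTorch-BigGraph config sketch for one entity type (user) and one
# directed relation (follow). Paths and hyperparameters are assumptions.
def get_torchbiggraph_config():
    return dict(
        entity_path="data/follow_graph",            # preprocessed entities
        edge_paths=["data/follow_graph/edges"],     # preprocessed edge lists
        checkpoint_path="model/follow_embedding",   # where embeddings are written
        entities={"user": {"num_partitions": 16}},  # partition so the graph fits in memory
        relations=[{"name": "follow", "lhs": "user", "rhs": "user", "operator": "none"}],
        dimension=128,
        comparator="cos",       # cosine similarity between embeddings
        num_epochs=10,
        num_edge_chunks=10,
        batch_size=10000,
        lr=0.01,
    )
```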
Case: Follow Embedding / Train ML Model
› Aggregate data, train the model, prepare data for inference
› CPU-based PBG: running time > 1 day (measured on the JP relations)
› GPU-based PBG: set up the environment and fixed some code; running time about 6 hours (measured on the JP relations)
Case: Follow Embedding / Model Serving
› Find users who follow users similar to me (based on the followed users)
› User recommendation: recommend users I have not followed yet, based on the users that similar users follow
› Post recommendation: recommend posts written by similar users
› Create the embeddings first, then save them for production use (Jupyter → embedding store → Serving Cluster); a sketch of this follows
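A small sketch of pushing the created embeddings into the serving store, assuming Redis (which the deck lists as the IMDB for serving) and a hypothetical key format.

```python
# Push trained user embeddings into Redis so the serving side can look them
# up with low latency. Host, port, and the key format are assumptions.
import json

import numpy as np
import redis

r = redis.Redis(host="serving-redis", port=6379)


def save_embeddings(embeddings):
    """embeddings: dict mapping user_id -> numpy vector."""
    pipe = r.pipeline()
    for user_id, vector in embeddings.items():
        pipe.set(f"follow_emb:{user_id}", json.dumps(vector.tolist()))
    pipe.execute()


def load_embedding(user_id):
    raw = r.get(f"follow_emb:{user_id}")
    return np.array(json.loads(raw)) if raw is not None else None
```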
Case: Follow Embedding / Production
› Problems: the data size is too large, and computing the similarity between all users in real time cannot satisfy the time requirements
› Offer pre-computed inference results for user/post recommendation
› Use embedding-based search (sketched below)
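One common way to implement embedding-based search is a nearest-neighbor index over the user embeddings; the talk does not name a library, so FAISS is used here only as an illustrative choice, with toy data.

```python
# Embedding-based search sketch: index all user embeddings and retrieve the
# top-k most similar users ahead of time (pre-inference), instead of
# computing all-pairs similarity at request time. FAISS is an assumption.
import faiss
import numpy as np

dim = 128
user_vectors = np.random.rand(100_000, dim).astype("float32")  # toy embeddings
faiss.normalize_L2(user_vectors)        # normalize so inner product == cosine

index = faiss.IndexFlatIP(dim)          # exact inner-product search
index.add(user_vectors)

query = user_vectors[:1]                # e.g. one user's embedding
scores, neighbor_ids = index.search(query, 10)  # top-10 most similar users
```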
Case: Offline Recommendation (for LINE Smart Channel)
› Recommend a personalized post for each user, for Smart Channel
› Offline method: pre-compute inference for every user after the model is updated
› Problem: too many users to run inference on
› Use the most recently trained model
› Must not affect the service that is currently serving
› Provide recommendation results in the format Smart Channel expects
› Difficulty: the present system does not support offline serving
Collect Data / Data Analysis / Feature Engineering / Train Model
› Use already-processed data and an already-trained model
› Only need to copy the model file and the corresponding data
Offline inference per user
› Smart Channel needs a recommendation for ALL users, so a list of all users is required
› An inference test with the existing system takes about 72 hours
› The running service cannot produce recommendations for all users, due to lack of resources, latency, etc.
› Set up a separate cluster for offline serving: an inference test with the new system takes < 5 hours
› Production
  › Save the inference results on HDFS first (see the sketch below)
  › Insert the results into Smart Channel
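A sketch of the offline-serving flow under stated assumptions: the Hive table, output path, and the toy recommend() are hypothetical; only the overall shape (batch inference for all users on a separate cluster, results written to HDFS first) follows the talk.

```python
# Offline batch inference for ALL users, writing the result to HDFS first;
# a separate step then inserts it into Smart Channel. Names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("offline_recommendation")
         .enableHiveSupport()
         .getOrCreate())

# hypothetical Hive table holding the list of all users
users = spark.table("service.all_users").select("user_id")


def recommend(user_id):
    # placeholder for inference with the most recently trained model
    return f"post_for_{user_id}"


recs = (users.rdd
        .map(lambda row: (row.user_id, recommend(row.user_id)))
        .toDF(["user_id", "recommended_post"]))

recs.write.mode("overwrite").parquet("hdfs:///tmp/smart_channel_recommendations")
```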
› Approach with an ML pipeline perspective
  › Divide tasks according to the stages of the ML pipeline
  › Define dependencies between tasks according to their data relationships
  › Run the tasks continuously
› With an ML pipeline
  › Visualize the tasks of a project (service)
  › Automate the ETL / train / deploy process to enable tracking and debugging
  › Easy to find improvements in both the model and the infra
Test Environment != Serving Environment
› The test environment uses Jutopia, whose environment is organized per user (data scientist, model developer)
› Difference: packages / resources / ACLs … are different
› Solution: … that runs on the Jupyter notebook environment
Storage for a service with large-scale data
Solution
› For analysis / training purposes
  › Large-scale data
  › Raw data is saved on the data cluster
  › Features / models are also saved on Hive/HDFS
› For the service
  › Only a part of the data is needed
  › Low latency, large-scale traffic
  › Dump the needed data from Hive / HDFS (a sketch follows)
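A sketch of the dump step, assuming the serving RDB is MySQL (listed in the infra slide); the table names, columns, and connection settings are made up.

```python
# Dump only the subset of data the service needs from Hive/HDFS into the
# serving MySQL database via JDBC. All names and settings are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dump_serving_data")
         .enableHiveSupport()
         .getOrCreate())

# the full feature table lives on Hive/HDFS; the service needs only a few columns
serving_df = (spark.table("feature_db.follow_recommendation")
              .select("user_id", "recommended_users"))

(serving_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://serving-mysql:3306/recsys")
    .option("dbtable", "follow_recommendation")
    .option("user", "writer")
    .option("password", "********")
    .mode("overwrite")
    .save())
```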