Slide 1

Slide 1 text

Introduction to Apache Submarine: Cloud-Native Machine Learning Platform

Slide 2

Slide 2 text

About me
Wen-Chih (Ryan) Lo
◦ Apache Submarine committer
◦ Master’s student at NCKU
Contact me: [email protected]

Slide 3

Slide 3 text

Agenda
◦ What is MLOps?
◦ Apache Submarine Intro
◦ Community & Collaboration
◦ Release Plan
◦ Q & A

Slide 4

Slide 4 text

1. What is MLOps?

Slide 5

Slide 5 text

It’s similar to DevOps:
◦ Robust automation to shorten the development lifecycle (build, test, release...)
◦ Dependable releases
◦ Monitoring
◦ etc.
...but there is a critical difference: data science.

Slide 6

Slide 6 text

Adapt DevOps to Machine Learning (MLOps = ML + Dev + Ops)
[Figure: MLOps lifecycle with three phases labeled ML, Dev, and Ops]
◦ Experiment (ML): Data acquisition, Business Understanding, Initial Modeling
◦ Development (Dev): Modeling + Testing, Continuous Integration, Continuous Deployment
◦ Operations (Ops): Continuous Delivery, Data Feedback Loop, System + Model Monitoring

Slide 7

Slide 7 text

Challenges of using Machine Learning
[Figure: components of a production ML system: Configuration, Data Verification, Data Collection, Feature Extraction, Data Analysis, Process Management Tools, ML Code, Serving Infrastructure, Monitoring, Machine Resource Management]
Source: Hidden Technical Debt in Machine Learning Systems, Google (NIPS ’15)

Slide 8

Slide 8 text

Google researchers analyzed 3,000 production ML pipelines, comprising over 450,000 trained models. The findings are inconsistent with the widespread understanding that ML is mostly about training.
Source: Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities (SIGMOD ’21)

Slide 9

Slide 9 text

Challenges of using Machine Learning (cont.)
◦ Training-Serving Skew
  ➢ The difference between performance during training and performance during serving
◦ Model Drift
  ➢ The performance of a model deployed to production deteriorates on new, unseen data, or the underlying assumptions about the data change
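One way to make skew and drift concrete is to compare the distribution of a single feature at training time with the same feature at serving time. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not named in the slides; the function name, bin count, and the roughly 0.2 alert level (a commonly cited rule of thumb) are assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training-time sample and a serving-time sample of one feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width buckets over the training range; fall back to 1.0 if all values match.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range serving values into the first or last bucket.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # A small floor keeps log() defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    train = [i / 100 for i in range(100)]            # feature values seen in training
    serving = [0.5 + i / 200 for i in range(100)]    # same feature, shifted in production
    print("PSI:", population_stability_index(train, serving))
```

A monitoring job could compute this per feature on a schedule and raise an alert when the value crosses the chosen threshold; identical distributions give a PSI of zero.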

Slide 10

Slide 10 text

Roles and Responsibilities
Data Engineer
◦ Build, maintain and optimize data pipelines
◦ Big data infra, SQL/NoSQL, ETL tools, message queues (e.g. Kafka), etc.
◦ Support data scientists with data-related requirements
Data Scientist
◦ Data and algorithm exploration
◦ Build ML models that address the business question
◦ Python, R, ML / DL, CV, NLP, etc.
◦ Not familiar with platforms and infrastructure
ML Infra Engineer
◦ Build and scale machine learning infrastructure
◦ Work with ML pipelines and products
◦ Monitor model performance in production
◦ Distributed systems, DevOps, system design, etc.
[Figure: pipeline stages: Data extraction → Data preparation → Build models → Training → Deploy & monitor]

Slide 11

Slide 11 text

Netflix Tech Blog
https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9

Slide 12

Slide 12 text

“Data scientists want to focus on models and quickly test how they work in production.”

Slide 13

Slide 13 text

What do data scientists expect?
◦ Data / model exploration and analysis
  ➢ Experiment with a sampled dataset in notebooks, or with the full dataset, to get the best results
◦ Reproducible experiments
  ➢ Record the parameters, code and metrics of each experiment
  ➢ Dependency management: code once, run everywhere
◦ Model management
  ➢ Automated model packaging and delivery for easy deployment to production
  ➢ Visibility into the performance of models
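The "reproducible experiments" expectation can be sketched as a small run record that stores parameters, a fingerprint of the training code, and the resulting metrics side by side. This is a minimal illustration in plain Python, not Submarine's tracking API; `record_experiment`, the `run.json` file name, and the record fields are all hypothetical.

```python
import hashlib
import json
import platform
import sys
import time
from pathlib import Path
from typing import Any, Dict


def record_experiment(
    run_dir: Path, script: Path, params: Dict[str, Any], metrics: Dict[str, float]
) -> Path:
    """Write parameters, a hash of the training code, and metrics to one JSON file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # The hash ties the metrics to the exact code that produced them.
        "code_sha256": hashlib.sha256(script.read_bytes()).hexdigest(),
        "params": params,
        "metrics": metrics,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "run.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out


if __name__ == "__main__":
    # Record this very file as the "training script" for demonstration purposes.
    path = record_experiment(
        Path("runs/demo"), Path(__file__), {"learning_rate": 0.01}, {"accuracy": 0.9}
    )
    print(path.read_text())
```

Two runs with the same code hash and parameters but different metrics point to nondeterminism or a changed environment, which is why dependency management is listed alongside parameter and metric recording.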

Slide 14

Slide 14 text

2. Apache Submarine Intro

Slide 15

Slide 15 text

Apache Submarine (Top-Level Project)
The goals are:
◦ Data scientists can create end-to-end ML workflows that cover each stage of the ML model lifecycle.
◦ Data scientists can easily manage experiment versions and environment dependencies.
◦ Support popular machine learning frameworks, including PyTorch and TensorFlow.
◦ Run and track distributed training experiments via an easy-to-use UI.
◦ Support Kubernetes/Hadoop YARN and multiple compute resources (e.g. CPU and GPU).

Slide 16

Slide 16 text

Features of Apache Submarine
◦ Experiment: Submarine comes with a Python SDK that enables users to run TensorFlow/PyTorch experiments from a notebook.
◦ Environment: Users can save different environment profiles, each defining a set of libraries and a Docker image used to run an experiment or a notebook instance.
◦ Notebook: Users can launch their own Jupyter Notebook instances in the same cluster.
◦ Easy install: A single Helm command installs Submarine on an existing K8s cluster.
◦ Model management: Integration with MLflow for experiment tracking and a model registry. (coming soon)
◦ Web UI: The Submarine workbench provides a data-scientist-friendly UI.

Slide 17

Slide 17 text

Submarine Workbench (Experiment)

Slide 18

Slide 18 text

Submarine Workbench (Experiment)

Slide 19

Slide 19 text

Submarine Workbench (Experiment)

Slide 20

Slide 20 text

Submarine Workbench (Experiment)

Slide 21

Slide 21 text

Submarine Workbench (Experiment)

Slide 22

Slide 22 text

Submarine Environment
◦ Manage Docker/Conda environments, with Git integration for experiments/notebooks
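To illustrate what an environment profile bundles (a Docker image plus a set of libraries), here is a rough sketch. The `EnvironmentProfile` class, its field names, and the image tag are hypothetical and do not reflect Submarine's actual schema; only the rendered output follows conda's standard `environment.yml` layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnvironmentProfile:
    """Hypothetical profile: a Docker image plus the libraries layered on top."""

    name: str
    docker_image: str  # base image the experiment or notebook would run in
    conda_dependencies: List[str] = field(default_factory=list)
    pip_dependencies: List[str] = field(default_factory=list)

    def to_conda_yaml(self) -> str:
        """Render the library list in conda's environment.yml layout."""
        lines = [f"name: {self.name}", "dependencies:"]
        lines += [f"  - {dep}" for dep in self.conda_dependencies]
        if self.pip_dependencies:
            # pip packages are nested under a "pip:" entry in environment.yml.
            lines += ["  - pip", "  - pip:"]
            lines += [f"      - {dep}" for dep in self.pip_dependencies]
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    profile = EnvironmentProfile(
        name="tf-demo",
        docker_image="python:3.7",  # illustrative image tag
        conda_dependencies=["python=3.7"],
        pip_dependencies=["tensorflow"],
    )
    print(profile.to_conda_yaml())
```

Saving such a profile once and reusing it for both experiments and notebooks is what keeps the two running against the same library versions.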

Slide 23

Slide 23 text

Submarine Notebook Support
◦ Users can launch Jupyter notebook instances in the same cluster.
◦ Choose among different Environment profiles (Docker/conda kernel) when launching a notebook instance.

Slide 24

Slide 24 text

Submarine Python SDK
◦ The Submarine Python SDK includes an experiment package that makes it easy to run distributed or non-distributed TensorFlow / PyTorch experiments.
◦ Release 0.6.0 will provide a model package with initial support for model management.

Slide 25

Slide 25 text

3. Community & Collaboration

Slide 26

Slide 26 text

Submarine community
◦ Diverse community: Contributors come from Cloudera, DiDi, NetEase, JD.com, NTU, NCKU, NTCU, etc.
◦ Growing project: There are 49 contributors, including 9 PMC members and 28 committers.
◦ Beginner-friendly: You can raise an issue on GitHub or WeChat (communicating in Chinese), even for a small problem.
Special thanks to all the people who contribute to the project!
GitHub: https://github.com/apache/submarine
Zoom: https://cloudera.zoom.us/j/880548968 (every Tuesday at 12:40 PM GMT+8)
Jira: https://issues.apache.org/jira/projects/SUBMARINE/issues/

Slide 27

Slide 27 text

4. Release Plan

Slide 28

Slide 28 text

Roadmap
1. Initial release 0.1.0 (Jan, 2019): Support TensorFlow distributed training on YARN
2. Release 0.2.0 (Jul, 2019): Support PyTorch and integrate with TonY
3. Release 0.3.0 (Feb, 2020): Provide mini-submarine and basic K8s support
4. Release 0.4.0 (Jul, 2020): Fully support training TensorFlow and PyTorch on K8s
5. Release 0.5.0 (Dec, 2020): Support notebook service, environment profile and UI
6. Release 0.6.0 (coming soon): Initial support for model management. Refactor Submarine Operator and CI/CD.

Slide 29

Slide 29 text

Thank you!
Q & A
You can find me at [email protected]