Slide 1

Slide 1 text

The Hitchhiker's Guide to MLOps
Montreal, Canada
David Cardozo
Google Developer Expert, ML

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Imagine if you will ...

Slide 4

Slide 4 text

You’re an Online Retailer Selling Shoes ...
Your model predicts click-through rates (CTR), helping you decide how much inventory to order.

Slide 5

Slide 5 text

When suddenly ...
Your AUC and prediction accuracy have dropped on men’s dress shoes!

Slide 6

Slide 6 text

Why?

Slide 7

Slide 7 text

How do we know that we have a problem?
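
One way to notice a problem like the men's dress shoes drop is to compute the metric per data slice instead of only globally. The sketch below is a minimal illustration in plain Python; the slice names, labels, and scores are made-up values, and `auc` and `auc_by_slice` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a class is missing
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_slice(rows):
    """rows: iterable of (slice_name, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for name, label, score in rows:
        grouped[name][0].append(label)
        grouped[name][1].append(score)
    return {name: auc(ls, ss) for name, (ls, ss) in grouped.items()}


# Made-up click data: (product category, clicked?, predicted CTR)
rows = [
    ("sneakers", 1, 0.9), ("sneakers", 0, 0.2),
    ("sneakers", 1, 0.8), ("sneakers", 0, 0.3),
    ("mens_dress", 1, 0.4), ("mens_dress", 0, 0.6),
    ("mens_dress", 1, 0.7), ("mens_dress", 0, 0.5),
]
per_slice = auc_by_slice(rows)
for name, value in per_slice.items():
    print(name, value)
```

A global AUC over these rows would hide the fact that one slice ranks no better than chance.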

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

What causes problems?
Kinds of problems
● Fast - Example: bad sensor, bad software update
● Slow - Example: drift

Slide 11

Slide 11 text

Sudden Problems
Problem with data collection
○ Bad sensor/camera
○ Bad log data
○ Moved or disabled sensors/cameras
Systems problem
○ Bad software update
○ Loss of network connectivity
○ System down
○ Bad credentials

Slide 12

Slide 12 text

Gradual Problems
Data changes
○ Trend and seasonality
○ Distribution of features changes
○ Relative importance of features changes
World changes
○ Styles change
○ Competitors change
○ Business expands to other geos
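
A change in the distribution of a feature can be quantified by comparing a training-time sample with a recent serving sample. The sketch below uses the population stability index (PSI) as one possible measure; the bin count, the smoothing constant, and the 0.25 alert level are assumptions (0.25 is a common rule of thumb, not something the slides prescribe), and the data is synthetic.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # floor each fraction at eps so the logarithm stays finite
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values: the serving sample is shifted upward by 0.5
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]

print("no drift:", psi(baseline, baseline))
print("shifted: ", psi(baseline, shifted))
```

Running a check like this per feature on a schedule is one way to catch slow problems before they show up in business metrics.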

Slide 13

Slide 13 text

Why “Understand” the model?
● Mispredictions do not have uniform cost to your business.
● The data you have is rarely the data you wish you had.
● Model objective is nearly always a proxy for your business objectives.
● Some percentage of your customers may have a bad experience.
● The real world doesn’t stand still.
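
To make the first point concrete, here is a small sketch in which two models have the same accuracy but very different business cost. The cost values (1 for a false positive, 5 for a false negative) and the predictions are invented for illustration, and `total_cost` is a hypothetical helper.

```python
def total_cost(labels, predictions, fp_cost, fn_cost):
    """Sum the cost of each misprediction, pricing the two error types differently."""
    cost = 0.0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 0:
            cost += fp_cost  # predicted demand that never came (overstock)
        elif p == 0 and y == 1:
            cost += fn_cost  # missed real demand (lost sales)
    return cost


labels = [1, 1, 0, 0]
model_a = [1, 0, 0, 0]  # one false negative, accuracy 3/4
model_b = [1, 1, 1, 0]  # one false positive, accuracy 3/4

cost_a = total_cost(labels, model_a, fp_cost=1.0, fn_cost=5.0)
cost_b = total_cost(labels, model_b, fp_cost=1.0, fn_cost=5.0)
print(cost_a, cost_b)
```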

Slide 14

Slide 14 text

Production ML and Change

Slide 15

Slide 15 text

Easy Problems
● Ground truth changes slowly (months, years)
● Model retraining driven by:
  ○ Model improvements, better data
  ○ Changes in software and/or systems
● Labeling
  ○ Curated datasets
  ○ Crowd-based

Slide 16

Slide 16 text

Harder Problems
● Ground truth changes faster (weeks)
● Model retraining driven by:
  ○ Declining model performance
  ○ Model improvements, better data
  ○ Changes in software and/or systems
● Labeling
  ○ Direct feedback
  ○ Crowd-based

Slide 17

Slide 17 text

Really Hard Problems
● Ground truth changes very fast (days, hours, min)
● Model retraining driven by:
  ○ Declining model performance
  ○ Model improvements, better data
  ○ Changes in software and/or systems
● Labeling
  ○ Direct feedback
  ○ Weak supervision
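
Retraining driven by declining model performance can be sketched as a rolling check against a baseline metric. The class below is a hypothetical illustration: the window size, the tolerance, and the metric values are assumptions, and a real system would feed it from its monitoring stack.

```python
from collections import deque


class RetrainTrigger:
    """Signal retraining when the rolling mean of a metric falls below baseline - tolerance."""

    def __init__(self, baseline, tolerance=0.05, window=3):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def observe(self, metric):
        self.recent.append(metric)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough observations yet
        rolling_mean = sum(self.recent) / len(self.recent)
        return rolling_mean < self.baseline - self.tolerance


# Made-up daily AUC readings that slowly degrade
trigger = RetrainTrigger(baseline=0.80)
decisions = [trigger.observe(m) for m in [0.81, 0.79, 0.78, 0.72, 0.70]]
print(decisions)
```

The faster the ground truth changes, the shorter the window and the more automated the response needs to be.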

Slide 18

Slide 18 text

Machine Learning

Slide 19

Slide 19 text

In addition to training an amazing model ...
Modeling Code

Slide 20

Slide 20 text

… a production solution requires so much more
● Configuration
● Data Collection
● Data Verification
● Feature Extraction
● Process Management Tools
● Analysis Tools
● Machine Resource Management
● Serving Infrastructure
● Monitoring
● ML Code

Slide 21

Slide 21 text

Tales From The Trenches
https://twitter.com/ginablaber/status/971450218095943681

Slide 22

Slide 22 text

Production Machine Learning
Modern Software Development
● Scalability
● Extensibility
● Configuration
● Consistency & Reproducibility
● Modularity
● Best Practices
● Testability
● Monitoring
● Safety & Security
+
Machine Learning Development
● Labeled data
● Feature space coverage
● Minimal dimensionality
● Maximum predictive data
● Fairness
● Rare conditions
● Data lifecycle management

Slide 23

Slide 23 text

Leading ML best practices
● Continuous Training for Production ML in the TFX Platform. OpML (2019).
● Slice Finder: Automated Data Slicing for Model Validation. ICDE (2019).
● Data Validation for Machine Learning. SysML (2019).
● TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
● Data Management Challenges in Production Machine Learning. SIGMOD (2017).
● Rules of Machine Learning: Best Practices for ML Engineering. Google AI Web (2017).
● Machine Learning: The High Interest Credit Card of Technical Debt. NeurIPS (2015).

Slide 24

Slide 24 text

What is MLOps?
“MLOps is a practice for collaboration and communication between data scientists and operations professionals to help manage production ML lifecycle.”
“Similar to the DevOps or DataOps approaches, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements.”
https://en.wikipedia.org/wiki/MLOps

Slide 25

Slide 25 text

Production ML Infrastructure
CD Foundation MLOps reference architecture
https://cd.foundation/blog

Slide 26

Slide 26 text

What now?
How we diverge from plain old software engineering

Slide 27

Slide 27 text

Enter the machine learning and deep learning revolution

Slide 28

Slide 28 text

A historical perspective
● AlexNet 2012
  ○ A technique that lets the computer figure out the rules
  ○ Inherently parallel problem
  ○ Matrix operations
● GPUs for 2D convolutions

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Computer Vision goes BRRR

Slide 31

Slide 31 text

● AlexNet required 2 of these cards
● Beginning of the GPGPU
  ○ CUDA
  ○ cuDNN
● The start of our conundrums!

Slide 32

Slide 32 text

CUDA and cuDNN
● cuDNN is a GPU-accelerated library of primitives for deep neural networks.
● Convolution forward and backward
● Pooling forward and backward
● Softmax forward and backward
● Neuron activations forward and backward:
  ○ Rectified linear (ReLU)
  ○ Sigmoid
  ○ Hyperbolic tangent (TANH)
● Tensor transformation functions
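
To show what "forward and backward" means for one of these primitives, here is a plain-Python sketch of softmax. It only illustrates the math; it is not cuDNN's API, and the function names and input values are made up.

```python
import math


def softmax_forward(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(probs, grad_out):
    """Gradient w.r.t. the logits: dL/dz_i = p_i * (g_i - sum_j g_j * p_j)."""
    dot = sum(g * p for g, p in zip(grad_out, probs))
    return [p * (g - dot) for p, g in zip(probs, grad_out)]


probs = softmax_forward([2.0, 1.0, 0.1])
grads = softmax_backward(probs, grad_out=[1.0, 0.0, 0.0])
print(probs)
print(grads)
```

A GPU library provides the same pair of operations as tuned kernels over whole tensors, which is why frameworks build on it instead of reimplementing it.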

Slide 33

Slide 33 text

It should be easy, right?
Just buy a GPU and install CUDA and cuDNN

Slide 34

Slide 34 text

Linus and NVIDIA, they have their issues.
“Near the end of his talk, when asked by one of the attendees about NVIDIA's hardware support and lack of open-source driver enablement / documentation, he had a few choice words for the Santa Clara company.”

Slide 35

Slide 35 text

Enter the nvidia-driver and the ML ecosystem

Slide 36

Slide 36 text

Back to modern times
● Let us explore the workflow of generating a machine learning model from zero (DevOps perspective)

Slide 37

Slide 37 text

Install the drivers
Let us summarize the experience

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Assume you have a GPU
Driver installation

Slide 40

Slide 40 text

Now trying to get CUDA

Slide 41

Slide 41 text

Standing on the shoulders of giants
cuDNN and CUDA are too low-level as APIs

Slide 42

Slide 42 text

Deep learning moves extremely fast
CUDA incompatibility between projects

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Technology adapts

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

● Packages up software binaries and dependencies
● Isolates software from each other
● Container is a standard format
● Easily portable across environments
● Allows an ecosystem to develop around the standard

Slide 49

Slide 49 text

Lets you build and run GPU-accelerated containers

Slide 50

Slide 50 text

Solving the issue of injecting GPU devices
├─ nvidia-docker2
│  ├─ docker-ce
│  ├─ docker-ee
│  ├─ docker.io (>= 18.06.0)
│  └─ nvidia-container-runtime
├─ nvidia-container-runtime
│  └─ nvidia-container-toolkit
├─ nvidia-container-toolkit
│  └─ libnvidia-container-tools
├─ libnvidia-container-tools
│  └─ libnvidia-container1
└─ libnvidia-container1

Slide 51

Slide 51 text

Reduced complexity
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Slide 52

Slide 52 text

    docker run --gpus all -it --rm tensorflow/tensorflow:2.7.0-gpu \
        python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

    docker run --gpus all -it --rm tensorflow/tensorflow:2.0.0-gpu \
        python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

    docker run --gpus all -it --rm --ipc=host \
        --name mypytorchproject pytorch/pytorch:1.4-cuda10.1-cudnn7-devel

Slide 53

Slide 53 text

Anatomy and Structure of Base Images
Common guidelines
NVIDIA CUDA
● Variants: base, runtime, devel
● Template: 11.4.2-cudnn8-runtime-ubuntu20.04
TensorFlow
● Variants: GPU, Jupyter
● Template: tensorflow/tensorflow:2.7.0-gpu-jupyter
PyTorch
● Variants: CUDA, cuDNN
● Template: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

Slide 54

Slide 54 text

Considerations while building GPU Images
Pinpoint your dependencies
● Do multistage builds.
● Images can grow pretty fast.
    FROM nvidia/cuda:10.2-cudnn7-devel AS builder
● Define constraints
    NVIDIA_REQUIRE_CUDA "cuda>=11.0 driver>=450"
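
To get a feel for what a constraint such as "cuda>=11.0 driver>=450" expresses, the sketch below evaluates that string against a hypothetical host. This is only an illustration of the idea: the real check is performed by the NVIDIA container runtime, the `satisfies` helper handles only the `>=` operator, and the host versions are invented.

```python
def parse_version(text):
    """Turn '470.57.02' into (470, 57, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(requirement, host):
    """Return True if every 'name>=version' clause holds for the host's versions."""
    for clause in requirement.split():
        name, _, minimum = clause.partition(">=")
        if not minimum:
            raise ValueError(f"unsupported clause: {clause!r}")
        if name not in host:
            return False
        if parse_version(host[name]) < parse_version(minimum):
            return False
    return True


requirement = "cuda>=11.0 driver>=450"
new_host = {"cuda": "11.4", "driver": "470.57.02"}  # hypothetical host
old_host = {"cuda": "10.2", "driver": "440.33.01"}  # hypothetical host

print(satisfies(requirement, new_host))
print(satisfies(requirement, old_host))
```

Declaring the constraint in the image means an incompatible host fails at container start instead of deep inside a training job.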

Slide 55

Slide 55 text

Considerations while running GPU Containers

    docker run --rm --runtime=nvidia \
        -e NVIDIA_VISIBLE_DEVICES=2,3 \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        nvidia/cuda nvidia-smi

    docker run --gpus all --rm \
        --ipc=host \
        -v local_dir:container_dir \
        nvcr.io/nvidia/pytorch:xx.xx-py3

● Keep in mind what --gpus does: it lets Docker call nvidia-docker to inject devices and environment variables.
● Either use --ipc=host (so that multiple workers can communicate) or increase the shared memory with --shm-size.
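
When these commands are launched from automation, it can help to build the argument list in one place so the choice between `--ipc=host` and `--shm-size` is explicit. The helper below is a hypothetical sketch that only assembles the command; it uses standard `docker run` flags and does not execute anything.

```python
import shlex


def docker_run_args(image, command=(), gpus="all", ipc_host=False,
                    shm_size=None, volumes=(), env=None):
    """Assemble a `docker run` argument list for a GPU container."""
    args = ["docker", "run", "--rm", "--gpus", gpus]
    if ipc_host:
        args.append("--ipc=host")           # share the host IPC namespace
    elif shm_size:
        args += ["--shm-size", shm_size]    # or enlarge /dev/shm instead
    for host_dir, container_dir in volumes:
        args += ["-v", f"{host_dir}:{container_dir}"]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    args += list(command)
    return args


smoke_test = docker_run_args("nvidia/cuda:11.0-base", ["nvidia-smi"])
training = docker_run_args(
    "pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime",
    ["python", "train.py"],                 # train.py is a placeholder script name
    shm_size="8g",
    volumes=[("local_dir", "container_dir")],
)
print(shlex.join(smoke_test))
print(shlex.join(training))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting problems.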

Slide 56

Slide 56 text

Consider the yolov5 Dockerfile

    # Start from NVIDIA's PyTorch image
    FROM nvcr.io/nvidia/pytorch:21.10-py3

    # System packages
    RUN apt update && apt install -y zip htop \
        screen libgl1-mesa-glx

    # Python dependencies
    COPY requirements.txt .
    RUN python -m pip install --upgrade pip
    RUN pip uninstall -y nvidia-tensorboard nvidia-tensorboard-plugin-dlprof
    # Quote the version specifier so the shell does not treat ">" as a redirect
    RUN pip install --no-cache -r requirements.txt coremltools \
        onnx gsutil notebook 'wandb>=0.12.2'
    RUN pip install --no-cache -U torch torchvision numpy Pillow

    # Application code
    RUN mkdir -p /usr/src/app
    WORKDIR /usr/src/app
    COPY . /usr/src/app

    # Download the Arial font file into the Ultralytics config directory
    ADD https://ultralytics.com/assets/Arial.ttf /root/.config/Ultralytics/

Slide 57

Slide 57 text

Developing in Containers
● Most TensorFlow and PyTorch GPU images will try to run your code on the GPU, but they fall back to the CPU if no GPU is present (be careful about custom layers).
● Newer CUDA images are now hosted on nvcr.io/nvidia/cuda.
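
Because the fallback to CPU is silent, it is worth logging which device the code actually ended up on. The sketch below assumes PyTorch may or may not be installed in the image; `pick_device` is a hypothetical helper, while `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # optional dependency: absent in non-PyTorch images
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on: {device}")
```

Failing fast when `device` is "cpu" in a job that was scheduled on GPU nodes is often cheaper than discovering a slow run hours later.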

Slide 58

Slide 58 text

Production ML Infrastructure
CD Foundation MLOps reference architecture
https://cd.foundation/blog

Slide 59

Slide 59 text

Kubernetes
Greek for “Helmsman”; also the root of the words “governor” and “cybernetic”
● Manages container clusters
● Inspired and informed by Google’s experiences and internal systems
● Supports multiple cloud and bare-metal environments
● Supports multiple container runtimes
● 100% Open source, written in Go
Manage applications, not machines

Slide 60

Slide 60 text

The 10000 foot view
[Diagram: users reach the master(s) through the UI, CLI, and API; the master(s) run the apiserver, scheduler, controllers, and etcd; each node runs a kubelet]

Slide 61

Slide 61 text

All you really care about
[Diagram: UI and API in front of a container cluster]

Slide 62

Slide 62 text

Building ML Pipelines in Kubernetes
Kubeflow Pipelines
AI Platform

Slide 63

Slide 63 text

Very High Level Architecture
● Kubeflow Pipelines
● Vertex AI
● GCS
● BigQuery
● Dataflow
● Google Kubernetes Engine (GKE)
● TensorFlow
● JAX
● PyTorch

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Vertex AI Pipelines
● Add conditional logic and branches to your pipeline
● Store metadata for every artifact produced by the pipeline
● Track artifacts, lineage, metrics, and execution across your ML workflow

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Custom container components

    from kfp import dsl
    from kfp.dsl import Output, Dataset

    @dsl.container_component
    def create_dataset(
        text: str,
        output_gcs: Output[Dataset],
    ):
        return dsl.ContainerSpec(
            image='alpine',
            command=[
                'sh',
                '-c',
                'mkdir --parents $(dirname "$1") && echo "$0" > "$1"',
            ],
            args=[text, output_gcs.path])

Slide 68

Slide 68 text

Lightweight Python function-based components

    from kfp import dsl
    from kfp.dsl import Input, Output, Dataset, Model

    @dsl.component(
        base_image='python:3.9',
        packages_to_install=['tensorflow==2.10.0'],
    )
    def train_model(
        dataset: Input[Dataset],
        num_epochs: int,
        model: Output[Model],
    ):
        # imports live inside the function because it runs in its own container
        from tensorflow import keras

        # load and process the Dataset artifact
        with open(dataset.path) as f:
            x, y = ...

        my_model = keras.Sequential(
            [
                keras.layers.Dense(4, activation='relu', name='layer_1'),
                keras.layers.Dense(2, activation='relu', name='layer_2'),
                keras.layers.Dense(1, name='layer_3'),
            ]
        )
        my_model.compile(...)
        # train for num_epochs
        my_model.fit(x, y, epochs=num_epochs)
        # save the Model artifact
        my_model.save(model.path)

Slide 69

Slide 69 text

EXAMPLE
https://davidcardozo.notion.site/Building-a-Breast-Cancer-Classification-Pipeline-with-Flax-NNX-Kubeflow-Pipelines-and-Vertex-AI-9240294772df4b24bca66cfa8cc156ba

Slide 70

Slide 70 text

Demo time?

Slide 71

Slide 71 text

Status
What’s next?
The End.