The Hitchhiker's Guide to MLOps by David Cardozo

The Hitchhiker's Guide to MLOps Montreal, Canada David Cardozo Google
Developer Expert ML

Imagine if you will ...

You’re an Online Retailer Selling Shoes ... Your model predicts
click-through rates (CTR), helping you decide how much inventory to order

When suddenly ... your AUC and prediction accuracy have dropped on
men’s dress shoes!

How do we know that we have a problem?
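One way to notice a problem like the men's dress shoes drop is to compute the metric per slice rather than only over all traffic. The following is a minimal sketch, assuming each logged prediction carries a category, a binary click label, and the model's score. The tuple layout, the category names, and the toy numbers are illustrative assumptions, not data from the talk.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined when one class is missing
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def auc_by_slice(examples):
    """examples: iterable of (slice_name, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for slice_name, label, score in examples:
        grouped[slice_name][0].append(label)
        grouped[slice_name][1].append(score)
    return {name: auc(labels, scores) for name, (labels, scores) in grouped.items()}


# Hypothetical logged predictions: (category, clicked, predicted CTR).
examples = [
    ("sneakers", 1, 0.9), ("sneakers", 0, 0.2),
    ("sneakers", 1, 0.7), ("sneakers", 0, 0.1),
    ("mens_dress", 1, 0.3), ("mens_dress", 0, 0.6),
    ("mens_dress", 1, 0.5), ("mens_dress", 0, 0.4),
]
per_slice = auc_by_slice(examples)
overall = auc([y for _, y, _ in examples], [s for _, _, s in examples])
print(overall, per_slice)
```

In this toy data the overall AUC still looks acceptable while the "mens_dress" slice is worse than random, which is the kind of gap a single global metric hides.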

What causes problems?
Kinds of problems
• Fast - Example: bad sensor, bad software update
• Slow - Example: drift

Sudden Problems
Problem with data collection
◦ Bad sensor/camera
◦ Bad log data
◦ Moved or disabled sensors/cameras
Systems problem
◦ Bad software update
◦ Loss of network connectivity
◦ System down
◦ Bad credentials
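Sudden problems such as a bad sensor or broken logging often show up as missing or out-of-range values in the very next batch. A minimal sketch of a batch check, assuming records arrive as dictionaries and that expected ranges are written by hand. The field names, ranges, and the 5% missing-value threshold are hypothetical.

```python
def validate_batch(records, expectations, max_missing_rate=0.05):
    """Return a list of human-readable anomalies found in one batch.

    expectations maps a field name to an inclusive (low, high) range.
    """
    if not records:
        return ["batch is empty (collection or connectivity problem?)"]
    anomalies = []
    for field, (low, high) in expectations.items():
        values = [record.get(field) for record in records]
        missing_rate = sum(v is None for v in values) / len(records)
        if missing_rate > max_missing_rate:
            anomalies.append(f"{field}: {missing_rate:.0%} missing")
        out_of_range = sum(v is not None and not (low <= v <= high) for v in values)
        if out_of_range:
            anomalies.append(f"{field}: {out_of_range} value(s) outside [{low}, {high}]")
    return anomalies


# Hypothetical expectations and batches for the shoe retailer.
expectations = {"price": (0.0, 1000.0), "shoe_size": (1.0, 60.0)}
healthy = [{"price": 79.0, "shoe_size": 42.0}, {"price": 120.0, "shoe_size": 44.0}]
broken = [{"price": -1.0, "shoe_size": None}, {"price": 95.0, "shoe_size": None}]

print(validate_batch(healthy, expectations))
print(validate_batch(broken, expectations))
```

Running a check like this before training or serving turns a silent data collection failure into an explicit alert.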

Gradual Problems
Data changes
◦ Trend and seasonality
◦ Distribution of features changes
◦ Relative importance of features changes
World changes
◦ Styles change
◦ Competitors change
◦ Business expands to other geos
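A change in the distribution of a feature can be quantified by comparing a reference sample (for example, the training data) against recent serving data. The sketch below uses the Population Stability Index with equal-width bins. PSI is one common choice and is swapped in here for illustration; the talk does not name a specific drift statistic, and the bin count and sample data are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned over the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # avoid zero width for constant features

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            index = int((v - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp values outside the range
        # A small floor keeps log() defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bucket_shares(reference)
    cur_shares = bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Hypothetical feature values: training data versus two serving windows.
reference = [i / 100 for i in range(100)]
unchanged = list(reference)
shifted = [v + 0.3 for v in reference]

psi_unchanged = population_stability_index(reference, unchanged)
psi_shifted = population_stability_index(reference, shifted)
print(psi_unchanged, psi_shifted)
```

A PSI of zero means the binned distributions match. Values above roughly 0.2 are often treated as a shift worth investigating, although that cutoff is a rule of thumb and should be tuned per feature.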

Why “Understand” the model?
• Mispredictions do not have uniform cost to your business.
• The data you have is rarely the data you wish you had.
• The model objective is nearly always a proxy for your business objectives.
• Some percentage of your customers may have a bad experience.
• The real world doesn’t stand still.
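To make the point about non-uniform misprediction cost concrete, the sketch below compares two sets of binary predictions that have the same accuracy but different business cost. The cost values and the labels are hypothetical, and the function is an illustration rather than a method from the talk.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(y == y_hat for y, y_hat in zip(labels, predictions)) / len(labels)


def expected_cost(labels, predictions, cost_false_positive, cost_false_negative):
    """Average business cost per example for binary predictions."""
    total = 0.0
    for y, y_hat in zip(labels, predictions):
        if y_hat == 1 and y == 0:
            total += cost_false_positive
        elif y_hat == 0 and y == 1:
            total += cost_false_negative
    return total / len(labels)


labels = [1, 1, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0]  # one false negative
model_b = [1, 1, 1, 0, 0, 0]  # one false positive

# Assumed costs: a missed positive (stock-out) hurts five times more
# than a false alarm (some excess inventory).
cost_a = expected_cost(labels, model_a, cost_false_positive=2.0, cost_false_negative=10.0)
cost_b = expected_cost(labels, model_b, cost_false_positive=2.0, cost_false_negative=10.0)
print(accuracy(labels, model_a), accuracy(labels, model_b), cost_a, cost_b)
```

Both models score the same on accuracy, yet under these assumed costs model_b is clearly cheaper to run, which is why the model objective alone is only a proxy.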

Production ML and Change

Easy Problems
• Ground truth changes slowly (months, years)
• Model retraining driven by:
◦ Model improvements, better data
◦ Changes in software and/or systems
• Labeling
◦ Curated datasets
◦ Crowd-based

Harder Problems
• Ground truth changes faster (weeks)
• Model retraining driven by:
◦ Declining model performance
◦ Model improvements, better data
◦ Changes in software and/or systems
• Labeling
◦ Direct feedback
◦ Crowd-based

Really Hard Problems
• Ground truth changes very fast (days, hours, min)
• Model retraining driven by:
◦ Declining model performance
◦ Model improvements, better data
◦ Changes in software and/or systems
• Labeling
◦ Direct feedback
◦ Weak supervision
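When retraining is driven by declining model performance, the decision can be automated with a simple rule over recent evaluation results. The sketch below signals retraining once the rolling mean of a metric falls below a baseline by more than a tolerance. The class name, the window size, and the thresholds are assumptions for illustration.

```python
from collections import deque


class RetrainingTrigger:
    """Signal retraining when the rolling mean of a metric drops below a baseline."""

    def __init__(self, baseline, tolerance=0.05, window=3):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def observe(self, metric_value):
        """Record one evaluation result; return True if retraining should start."""
        self.recent.append(metric_value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        rolling_mean = sum(self.recent) / len(self.recent)
        return rolling_mean < self.baseline - self.tolerance


# Hypothetical daily AUC values for a model whose baseline AUC was 0.80.
trigger = RetrainingTrigger(baseline=0.80, tolerance=0.05, window=3)
decisions = [trigger.observe(value) for value in [0.81, 0.79, 0.78, 0.72, 0.70, 0.69]]
print(decisions)
```

The rolling window trades reaction time for stability: a shorter window suits the fast-changing problems, while a longer one avoids retraining on a single noisy evaluation.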

Machine Learning

In addition to training an amazing model ... Modeling Code

… a production solution requires so much more
• Configuration
• Data Collection
• Data Verification
• Feature Extraction
• Process Management Tools
• Analysis Tools
• Machine Resource Management
• Serving Infrastructure
• Monitoring
• ML Code

Tales From The Trenches https://twitter.com/ginablaber/status/971450218095943681

Production Machine Learning
Modern Software Development
• Scalability
• Extensibility
• Configuration
• Consistency & Reproducibility
• Modularity
• Best Practices
• Testability
• Monitoring
• Safety & Security
+
Machine Learning Development
• Labeled data
• Feature space coverage
• Minimal dimensionality
• Maximum predictive data
• Fairness
• Rare conditions
• Data lifecycle management

Leading ML best practices
• Continuous Training for Production ML in the TFX Platform. OpML (2019).
• Slice Finder: Automated Data Slicing for Model Validation. ICDE (2019).
• Data Validation for Machine Learning. SysML (2019).
• TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
• Data Management Challenges in Production Machine Learning. SIGMOD (2017).
• Rules of Machine Learning: Best Practices for ML Engineering. Google AI Web (2017).
• Machine Learning: The High Interest Credit Card of Technical Debt. NeurIPS (2015).

What is MLOps?
“MLOps is a practice for collaboration and communication between data scientists and operations professionals to help manage production ML lifecycle.”
“Similar to the DevOps or DataOps approaches, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements.”
https://en.wikipedia.org/wiki/MLOps

Production ML Infrastructure
CD Foundation MLOps reference architecture
https://cd.foundation/blog

What now? How we diverge from plain old software engineering

Enter the machine learning and deep learning revolution

A historical perspective
• AlexNet 2012
◦ A technique that lets the computer figure out the rules.
◦ Inherently parallel problem
◦ Matrix operations
• GPUs for 2D convolutions

Computer Vision goes BRRR

• AlexNet required 2 of these cards
• Beginning of the GPGPU.
◦ CUDA
◦ cuDNN
• The start of our conundrums!

CUDA and cuDNN
• cuDNN is a GPU-accelerated library of primitives for deep neural networks.
• Convolution forward and backward
• Pooling forward and backward
• Softmax forward and backward
• Neuron activations forward and backward:
◦ Rectified linear (ReLU)
◦ Sigmoid
◦ Hyperbolic tangent (TANH)
• Tensor transformation functions

It should be easy, right? Just buy a GPU
and install CUDA and cuDNN

Linus and Nvidia, they have their issues.
“Near the end of his talk, when asked by one of the attendees about NVIDIA's hardware support and lack of open-source driver enablement / documentation, he had a few choice words for the Santa Clara company.”

Enter the nvidia-driver and the ML ecosystem

Back to modern times
• Let us explore the workflow of generating a machine learning model from zero (DevOps perspective)

Install the drivers Let us summarize the experience

Assume you have a GPU
Driver installation

Now trying to get CUDA

Standing on the shoulders of giants
cuDNN and CUDA are too low-level as APIs

Deep learning moves extremely fast
CUDA incompatibility between projects

Technology adapts

• Packages up software binaries and dependencies
• Isolates pieces of software from each other
• Container is a standard format
• Easily portable across environments
• Allows an ecosystem to develop around its standard

Allows you to build and run GPU-accelerated containers

Solving the issue of injecting GPU Devices
├─ nvidia-docker2
│    ├─ docker-ce
│    ├─ docker-ee
│    ├─ docker.io (>= 18.06.0)
│    └─ nvidia-container-runtime
├─ nvidia-container-runtime
│    └─ nvidia-container-toolkit
├─ nvidia-container-toolkit
│    └─ libnvidia-container-tools
├─ libnvidia-container-tools
│    └─ libnvidia-container1
└─ libnvidia-container1

Reduced complexity
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

docker run --gpus all -it --rm tensorflow/tensorflow:2.7.0-gpu \
    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
docker run --gpus all -it --rm tensorflow/tensorflow:2.0.0-gpu \
    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
docker run --gpus all -it --rm --ipc=host \
    --name mypytorchproject pytorch/pytorch:1.4-cuda10.1-cudnn7-devel

Anatomy and Structure of Base Images
Common guidelines
• NVIDIA CUDA
◦ Variants: base, runtime, devel
◦ Template: 11.4.2-cudnn8-runtime-ubuntu20.04
• TensorFlow
◦ Variants: GPU, Jupyter
◦ Template: tensorflow/tensorflow:2.7.0-gpu-jupyter
• PyTorch
◦ Variants: CUDA, cuDNN
◦ Template: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

Considerations while building GPU Images
Pin your dependencies
• Do multistage builds.
• Images can grow pretty fast.
FROM nvidia/cuda:10.2-cudnn7-devel AS builder
• Define constraints
NVIDIA_REQUIRE_CUDA "cuda>=11.0 driver>=450"

Considerations while running GPU Containers
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=2,3 \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    nvidia/cuda nvidia-smi
docker run --gpus all --rm \
    --ipc=host \
    -v local_dir:container_dir \
    nvcr.io/nvidia/pytorch:xx.xx-py3
(--shm-size can be used instead of --ipc=host)
• Keep in mind to use --gpus, since this lets Docker call nvidia-docker to inject the devices and environment variables.
• Either use the host IPC namespace (--ipc=host), so that multiple workers can communicate, or increase the shared memory size (--shm-size).

Consider the yolov5 Dockerfile
FROM nvcr.io/nvidia/pytorch:21.10-py3
# System packages
RUN apt update && apt install -y zip htop \
    screen libgl1-mesa-glx
# Python dependencies
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip uninstall -y nvidia-tensorboard nvidia-tensorboard-plugin-dlprof
# Quote the version constraint so the shell does not treat ">" as a redirect
RUN pip install --no-cache -r requirements.txt coremltools \
    onnx gsutil notebook 'wandb>=0.12.2'
RUN pip install --no-cache -U torch torchvision numpy Pillow
# Application code
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app
# Font used by the project
ADD https://ultralytics.com/assets/Arial.ttf /root/.config/Ultralytics/

Developing in Containers
• Most TensorFlow and PyTorch GPU images will try to run your code on the GPU, but they fall back to the CPU if no GPU is present (be careful with custom layers).
• Newer CUDA images are now also hosted on nvcr.io/nvidia/cuda.
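Because of the silent CPU fallback, it can help to fail fast when a job expects a GPU. The sketch below is framework-independent and only assumes that the nvidia-smi binary is injected into the container when it is started with GPU access. The function names are hypothetical, and the injectable which and run parameters exist only so the check can be exercised on a machine without a GPU.

```python
import shutil
import subprocess


def gpu_visible(which=shutil.which, run=subprocess.run):
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    if which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)".
        result = run(["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def require_gpu(**probe):
    """Raise instead of letting the framework fall back to the CPU unnoticed."""
    if not gpu_visible(**probe):
        raise RuntimeError("No GPU visible in this container; was it started with --gpus?")


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

Calling require_gpu() at the start of a training script turns a slow, confusing CPU run into an immediate and explicit error.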

Production ML Infrastructure
CD Foundation MLOps reference architecture
https://cd.foundation/blog

Kubernetes
Greek for “Helmsman”; also the root of the words “governor” and “cybernetic”
• Manages container clusters
• Inspired and informed by Google’s experiences and internal systems
• Supports multiple cloud and bare-metal environments
• Supports multiple container runtimes
• 100% open source, written in Go
Manage applications, not machines

The 10000 foot view
[Diagram: users reach the API through the UI and CLI; the master(s) run the apiserver, scheduler, controllers, and etcd; each node runs a kubelet]

All you really care about
[Diagram: UI, API, Container Cluster]

Building ML Pipelines in Kubernetes
Kubeflow Pipelines
AI Platform

Very High Level Architecture
[Diagram: Kubeflow Pipelines, Vertex AI, GCS, BigQuery, Dataflow, Google Kubernetes Engine (GKE), TensorFlow, JAX, PyTorch]

Vertex AI Pipelines
• Add conditional logic and branches to your pipeline
• Store metadata for every artifact produced by the pipeline
• Track artifacts, lineage, metrics, and execution across your ML workflow
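To illustrate what tracking artifacts and lineage means, here is a conceptual in-memory sketch. It is not the Vertex AI or ML Metadata API: the class names, the artifact identifiers, and the gs:// path are all hypothetical, and a managed pipeline service records this information for you.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    """One pipeline step run: what it consumed, produced, and measured."""
    step: str
    inputs: list
    outputs: list
    metrics: dict = field(default_factory=dict)


class LineageStore:
    """In-memory record of which step produced which artifact from which inputs."""

    def __init__(self):
        self.executions = []

    def record(self, step, inputs, outputs, metrics=None):
        self.executions.append(Execution(step, list(inputs), list(outputs), metrics or {}))

    def upstream(self, artifact):
        """Return every artifact that the given artifact was derived from."""
        ancestors = []
        for execution in self.executions:
            if artifact in execution.outputs:
                for parent in execution.inputs:
                    if parent not in ancestors:
                        ancestors.append(parent)
                    for grandparent in self.upstream(parent):
                        if grandparent not in ancestors:
                            ancestors.append(grandparent)
        return ancestors


# Hypothetical three-step pipeline run.
store = LineageStore()
store.record("ingest", inputs=["gs://bucket/raw.csv"], outputs=["dataset:v1"])
store.record("train", inputs=["dataset:v1"], outputs=["model:v1"], metrics={"auc": 0.81})
store.record("evaluate", inputs=["model:v1", "dataset:v1"], outputs=["report:v1"])
print(store.upstream("report:v1"))
```

With such a record, a question like "which raw data went into the model behind this report?" becomes a lookup rather than an investigation.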

Custom container components
from kfp import dsl
from kfp.dsl import Output, Dataset

@dsl.container_component
def create_dataset(
    text: str,
    output_gcs: Output[Dataset],
):
    return dsl.ContainerSpec(
        image='alpine',
        command=[
            'sh',
            '-c',
            # $0 is the text argument, $1 is the output artifact path
            'mkdir --parents $(dirname "$1") && echo "$0" > "$1"',
        ],
        args=[text, output_gcs.path])

Lightweight Python function-based components
from kfp import dsl
from kfp.dsl import Input, Output, Dataset, Model

@dsl.component(
    base_image='python:3.9',
    packages_to_install=['tensorflow==2.10.0'],
)
def train_model(
    dataset: Input[Dataset],
    num_epochs: int,
    model: Output[Model],
):
    # Imports live inside the function because it runs in its own container
    from tensorflow import keras
    from tensorflow.keras import layers

    # load and process the Dataset artifact
    with open(dataset.path) as f:
        x, y = ...

    my_model = keras.Sequential(
        [
            layers.Dense(4, activation='relu', name='layer_1'),
            layers.Dense(2, activation='relu', name='layer_2'),
            layers.Dense(1, name='layer_3'),
        ]
    )
    my_model.compile(...)
    # train for num_epochs
    my_model.fit(x, y, epochs=num_epochs)
    # save the Model artifact
    my_model.save(model.path)

EXAMPLE
https://davidcardozo.notion.site/Building-a-Breast-Cancer-Classification-Pipeline-with-Flax-NNX-Kubeflow-Pipelines-and-Vertex-AI-9240294772df4b24bca66cfa8cc156ba

Demo time?

Status What’s next? The End.
