Before we start… ● I’ve been in industry a long time - if I start talking about unfamiliar concepts, please shout. ● Some material might be familiar, some unfamiliar; we can speed up or slow down, so keep me updated on progress!
About me ● PhD, University of Bristol, 2009-2013 ○ DTC BCCS ● ML consultant at SecondSync, 2011-2013 ● Postdoc at QMUL in computational creativity, 2013 ● Data scientist/data architect at Black Swan Data, 2013-2014 ● Research engineer at Gluru, 2015-2016 ● AI lead at Adarga, 2016-2018 ● Currently Machine Learning Infrastructure Lead at Cookpad ● Co-organiser of the Bristol Machine Learning meetup @ettieeyre
Cookpad ● Cookpad is a community platform that enables people to share recipe ideas and cooking tips. ● Started in Japan in 1997; now a listed company on the Tokyo Stock Exchange. ● We’re a global company with offices in 10 countries and a team of 700. ● In 2017 we set up our global HQ in Bristol, where we currently have about 100 employees.
Our mission is to make everyday cooking fun. Because we believe that cooking is the key to a happier and healthier life for people, communities and the planet. The choices we make shape our world. And when we cook, the choices we make have an impact on ourselves, the people we cook for, the growers and producers we buy from and the wider environment. By building the platform that solves the issues related to everyday cooking and helps more people to cook, we believe we can help build a better world.
DevOps “DevOps is a set of practices that combines software development and information-technology operations which aims to shorten the systems development life cycle and provide continuous delivery with high software quality.”
DevOps - a history ● Edit code live on a server ● An ops team builds your executables and deploys them onto a server ● Cloud services (AWS, GCP, Azure, IBM, Oracle) ● Immutable infrastructure, Docker and Kubernetes ● Continuous integration and continuous delivery ● GitOps!
[Timeline graphic: 2013, 2014-15] Verma, Abhishek, et al. “Large-scale cluster management at Google with Borg.” Proceedings of the Tenth European Conference on Computer Systems (EuroSys). ACM, 2015.
A production system is >> the ML code. Sculley, David, et al. “Hidden technical debt in machine learning systems.” Advances in Neural Information Processing Systems. 2015.
Pillars of MLOps ● Model development ● ML infrastructure ● Monitoring for ML For each of these we want to achieve: ● Reliability ● Test coverage ● Reproducible behaviour ● Joint ownership of development and operations
Model Development ● Model configuration is code reviewed and checked into a repository. ● Online metrics vs. proxy metrics - do they correlate? ● Model staleness should be monitored (concept drift). ● Simple models provide sensible baseline measures (see the sketch below). ● Model bias is tested/monitored.
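To illustrate the baseline point above, here is a minimal, hypothetical pytest-style check - using scikit-learn and a synthetic dataset as stand-ins for your real model and evaluation split - that refuses to promote a candidate model unless it clearly beats a trivial baseline. The 0.05 margin is an arbitrary project choice, not a standard.

```python
# test_baseline.py -- hypothetical gate: the candidate model must beat a
# trivial baseline before it is allowed further down the release pipeline.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification


def test_model_beats_baseline():
    # Stand-in dataset; in practice load the held-out evaluation split.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
    candidate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
    candidate_acc = accuracy_score(y_te, candidate.predict(X_te))

    # Required margin over the baseline is a project-specific choice.
    assert candidate_acc > baseline_acc + 0.05
```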
ML Infrastructure ● Training can be reproduced (two training runs produce models with the same prediction distribution - see the sketch below). ● Unit tests for model code. ● Integration tests on the full ML pipeline (data - features - training - model - serving). ● Model quality is automatically assessed before serving. ● Models can be rolled back in production if quality degrades.
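A sketch of the reproducibility check above, assuming a scikit-learn model and synthetic data as stand-ins: two training runs with the same data and seed should produce models whose predictions agree. For frameworks with non-deterministic GPU kernels you would compare the prediction distributions statistically rather than element-wise.

```python
# test_reproducibility.py -- sketch: training twice with the same data and
# seed should yield models that make (numerically) identical predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def train(seed: int):
    # Fixed dataset; only the training seed is a parameter here.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    return model, X


def test_training_is_reproducible():
    model_a, X = train(seed=42)
    model_b, _ = train(seed=42)
    # Compare predicted class probabilities rather than internal weights.
    np.testing.assert_allclose(
        model_a.predict_proba(X), model_b.predict_proba(X), atol=1e-8
    )
```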
Monitoring for ML ● Observe (in)stability of the data distribution - upstream changes can silently degrade the model. ● Training-time and serving-time feature generation should compute the same values. ● Monitor for model staleness. ● NaNs in your data pipelines - what will the effect be? ● Regressions in prediction quality - where will you generate and store the data needed to detect them? (Two of these checks are sketched below.)
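The NaN guard and the drift check between training and serving feature distributions can be sketched in a few lines. The column handling, thresholds and the choice of a two-sample KS test here are illustrative assumptions, not part of any particular monitoring framework.

```python
# monitor_features.py -- illustrative monitoring checks only.
import pandas as pd
from scipy.stats import ks_2samp


def check_no_nans(df: pd.DataFrame) -> None:
    # NaNs silently propagate through most pipelines; fail loudly instead.
    nan_counts = df.isna().sum()
    bad = nan_counts[nan_counts > 0]
    if not bad.empty:
        raise ValueError(f"NaNs found in columns: {bad.to_dict()}")


def check_distribution_drift(train_col: pd.Series, serve_col: pd.Series,
                             alpha: float = 0.01) -> None:
    # Two-sample Kolmogorov-Smirnov test as a cheap drift alarm between the
    # training data and what the model actually sees in production.
    stat, p_value = ks_2samp(train_col.dropna(), serve_col.dropna())
    if p_value < alpha:
        raise RuntimeError(
            f"Feature drift detected (KS={stat:.3f}, p={p_value:.4f})"
        )
```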
Pachyderm ● Company founded in 2014. ● The Pachyderm platform runs on Kubernetes. ● Git for data: data is stored in a custom distributed file system, PFS (Pachyderm File System), which tracks data commits. ● Pachyderm pipelines automate containerised processing between PFS repos - a minimal pipeline step is sketched below.
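A minimal sketch of the code a Pachyderm pipeline step might run. Pachyderm mounts each input repo under /pfs/<repo_name> inside the container and collects anything written to /pfs/out as the output commit; the repo name raw_text and the lower-casing "transformation" are assumptions for illustration only.

```python
# step.py -- sketch of a containerised Pachyderm pipeline step.
# Input repos are mounted at /pfs/<repo_name>; output goes to /pfs/out.
import os

INPUT_DIR = "/pfs/raw_text"   # assumed input repo name
OUTPUT_DIR = "/pfs/out"       # Pachyderm collects this as the output commit

for name in os.listdir(INPUT_DIR):
    with open(os.path.join(INPUT_DIR, name)) as f:
        text = f.read()
    # Trivial "processing" stage: lower-case the text.
    with open(os.path.join(OUTPUT_DIR, name), "w") as f:
        f.write(text.lower())
```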
Get it working... ● You will need a running k8s cluster, or minikube ● Install kubectl, pachctl ● Lots of examples at http://docs.pachyderm.io/en/latest/cookbook/ml.html
Kubeflow ● An ML platform which runs on Kubernetes ● Main features: ○ Pipelines ○ Katib ○ Kubebench ○ Notebooks ○ Model store (versioned) ○ Versioned data through the metadata store
Kubeflow architecture ● Kubeflow makes it easy to deploy ML apps ● Composable - use the libraries/frameworks of your choice ● Scalable - in number of users and workload size ● Portable - on-prem, public cloud, local [Architecture diagram, three layers: Libraries and CLIs (focus on end users: arena, kfctl, kubectl, fairing); Systems (combine multiple services: Katib, pipelines, notebooks); Low-level APIs/services (single function: TFJob, PyTorch Job, Jupyter CR, Seldon CR, kube-bench, Metadata, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, IAM, scheduling). Components are developed both inside and outside Kubeflow; not all components shown. Source: https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g501dac0d64_0_1] A minimal pipeline sketch follows.
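To make the pipelines component concrete, here is a minimal sketch using the Kubeflow Pipelines (kfp) v1 Python SDK: two placeholder container steps chained together and compiled into an Argo workflow. The image names and commands are assumptions, and newer kfp releases expose a different, component-based API.

```python
# pipeline.py -- sketch with the kfp v1 SDK; images/commands are placeholders.
import kfp
from kfp import dsl


@dsl.pipeline(name="train-and-evaluate",
              description="Toy two-step pipeline: preprocess then train.")
def train_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="example.registry/preprocess:latest",   # placeholder image
        command=["python", "preprocess.py"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="example.registry/train:latest",        # placeholder image
        command=["python", "train.py"],
    )
    train.after(preprocess)   # run training once preprocessing has finished


if __name__ == "__main__":
    # Compile to an Argo workflow that the Pipelines UI/API can run.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```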
But what about TFX? [Architecture diagram repeated from the previous slide.] ● TFX is an end-to-end ML platform which can use Kubernetes/Kubeflow as its orchestrator.
Kubeflow is winning the race ● It looks set to standardise the MLOps pipeline ● Technologies that solve parts of the MLOps problem have been integrated into the Kubeflow ecosystem ○ Pachyderm ○ Seldon ○ TFX ○ PyTorch