Slide 1

Slide 1 text

Machine Learning in production Ettie Eyre Interactive AI CDT, Bristol University, 9th December 2019 @ettieeyre

Slide 2

Slide 2 text

Before we start… ● I’ve been in industry a long time, so if I start talking about unfamiliar concepts please shout. ● Some material might be familiar, some unfamiliar; we can speed up or slow down, so keep me updated on progress!

Slide 3

Slide 3 text

Overview ● Intro ● Devops primer ○ Docker ○ Kubernetes ○ CI/CD ● MLOps ○ Pachyderm ○ Kubeflow

Slide 4

Slide 4 text

About me ● PhD Bristol University 2009-2013 ○ DTC BCCS ● ML consultant at SecondSync 2011-2013 ● Postdoc at QMUL 2013 in computational creativity ● Data Scientist/data architect at Black Swan Data 2013-2014 ● Research Engineer at Gluru 2015-2016 ● AI lead at Adarga 2016-2018 ● Currently Machine Learning Infrastructure Lead at Cookpad ● Co-organiser of the Bristol Machine Learning meetup @ettieeyre

Slide 5

Slide 5 text

Cookpad ● Cookpad is a community platform that enables people to share recipe ideas and cooking tips. ● Started in Japan in 1997; now a listed company on the Tokyo Stock Exchange. ● We’re a global company with offices in 10 countries and a team of 700. ● In 2017, we set up our Global HQ in Bristol. We currently have about 100 employees in Bristol.

Slide 6

Slide 6 text

Our mission is to make everyday cooking fun Because we believe that cooking is the key to a happier and healthier life for people, communities and the planet. The choices we make shape our world. And when we cook, the choices we make have an impact on ourselves, the people we cook for, the growers and producers we buy from and the wider environment. By building the platform that solves the issues related to everyday cooking and helps more people to cook, we believe we can help build a better world.

Slide 7

Slide 7 text

105m people on average use Cookpad every month ● in 75+ countries ● 6m recipes on the platform ● in 30+ languages

Slide 8

Slide 8 text

Devops primer

Slide 9

Slide 9 text

Devops “DevOps is a set of practices that combines software development and information-technology operations which aims to shorten the systems development life cycle and provide continuous delivery with high software quality.”

Slide 10

Slide 10 text

Devops - a history ● Edit code live on a server ● Ops team build your executables, and deploy onto a server ● Cloud services (AWS, GCP, Azure, IBM, Oracle) ● Immutable infrastructure, docker and kubernetes. ● Continuous integration and continuous delivery ● Gitops!

Slide 11

Slide 11 text

Devops timeline (figure; axis marks 1995, 2000, 2005, 2010, 2013, 2015) ● Edit code live on server? ● Hand binaries to ops team ● Continuous Integration ● Continuous delivery

Slide 12

Slide 12 text

2013 2014-5 Verma, Abhishek, et al. "Large-scale cluster management at Google with Borg." Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015.

Slide 13

Slide 13 text

https://blog.netapp.com/blogs/containers-vs-vms/

Slide 14

Slide 14 text

https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/

Slide 15

Slide 15 text

Git reminder https://www.nobledesktop.com/blog/what-is-git-and-why-should-you-use-it

Slide 16

Slide 16 text

Gitflow (feature branch development) Trunk-based development https://codeburst.io/trunk-based-development-vs-git-flow-a0212a6cae64

Slide 17

Slide 17 text

What does a CI pipeline look like?

Slide 18

Slide 18 text

What does a CD pipeline look like? @jnavarro86

Slide 19

Slide 19 text

CD Demo ● Make PR on github ● Wait for CI approval ● Merge PR ● CI will build and push new image ● Flux deploys latest image
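
The last two steps of the demo can be pictured as a reconcile loop: CI pushes a new image tag, and a deploy agent (Flux in the demo) notices that the cluster is behind and rolls it forward. The sketch below is a toy in-memory model of that idea, not how Flux is implemented; the `Registry` and `Cluster` classes, the `reconcile` function and the `recipes-api` image name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy image registry: image name -> list of pushed tags, oldest first."""
    tags: dict = field(default_factory=dict)

    def push(self, image: str, tag: str) -> None:
        self.tags.setdefault(image, []).append(tag)

    def latest(self, image: str) -> str:
        return self.tags[image][-1]


@dataclass
class Cluster:
    """Toy cluster: image name -> tag currently running."""
    running: dict = field(default_factory=dict)


def reconcile(registry: Registry, cluster: Cluster, image: str) -> bool:
    """Deploy the latest pushed tag if the cluster is behind.

    Returns True when a rollout happened, False when nothing changed.
    """
    wanted = registry.latest(image)
    if cluster.running.get(image) == wanted:
        return False
    cluster.running[image] = wanted
    return True


registry = Registry()
cluster = Cluster()

registry.push("recipes-api", "v1")                     # CI builds and pushes an image
first = reconcile(registry, cluster, "recipes-api")    # agent deploys v1
second = reconcile(registry, cluster, "recipes-api")   # nothing new, no rollout
registry.push("recipes-api", "v2")                     # a PR is merged, CI pushes again
third = reconcile(registry, cluster, "recipes-api")    # agent deploys v2
```

The point of the design is that nobody runs a deploy command by hand: merging the PR is the deployment request, and the agent only ever moves the cluster towards what has been pushed.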

Slide 20

Slide 20 text

Devops reading list http://www.davefarley.net/

Slide 21

Slide 21 text

Devops summary ● Fast ● Reliable ● Delivery

Slide 22

Slide 22 text

MLOps - principles DataOps MLInfra MLEng

Slide 23

Slide 23 text

Machine Learning & Software Delivery ● Continuous integration ● Container orchestration ● Continuous delivery/deployment ● ? ● ? ● ? Traditional Software Input Data Program Results Machine Learning Program Input Data Target Data
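
The contrast on this slide (traditional software takes input data and a program and produces results; machine learning takes input data and target data and produces a program) can be shown with a deliberately tiny example. This is an illustrative sketch only: the temperature conversion task and the names `fahrenheit_rule` and `fit_line` are assumptions chosen for the example, and least squares stands in for whatever learning algorithm is actually used.

```python
def fahrenheit_rule(celsius: float) -> float:
    """Traditional software: a person writes the program."""
    return celsius * 9.0 / 5.0 + 32.0


def fit_line(inputs, targets):
    """Machine learning: derive the program from input data and target data.

    Ordinary least squares for y = slope * x + intercept.
    """
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(targets) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, targets))
    var = sum((x - mean_x) ** 2 for x in inputs)
    slope = cov / var
    intercept = mean_y - slope * mean_x

    def learned_program(x: float) -> float:
        return slope * x + intercept

    return learned_program


# Input data plus target data go in, a program comes out.
inputs = [-10.0, 0.0, 20.0, 37.0, 100.0]
targets = [fahrenheit_rule(c) for c in inputs]
learned = fit_line(inputs, targets)
```

Because the output of training is itself a program, it needs the same delivery machinery as hand-written code, plus versioning of the data that produced it.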

Slide 24

Slide 24 text

MLOps in the literature NIPS, 2013 NIPS, 2017

Slide 25

Slide 25 text

A production system is >> ML Code Sculley, David, et al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems. 2015.

Slide 26

Slide 26 text

Pillars of MLOps ● Model Development ● ML infrastructure ● Monitoring for ML For each of these we want to achieve ● Reliability ● Test coverage ● Reproducible behaviour ● Joint ownership of development and operations

Slide 27

Slide 27 text

Model Development ● Model configuration is code reviewed and checked into a repository. ● Online metrics vs. proxy metrics - do they correlate? ● Model staleness should be monitored (concept drift). ● Simple models provide sensible baseline measures. ● Model bias is tested/monitored.
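
As a minimal sketch of the "simple models provide sensible baseline measures" point, the check below compares a candidate model against a majority-class baseline and only accepts it if it wins by a margin. Everything here is hypothetical: the tiny cuisine dataset, the keyword-based `candidate` model, and the 0.05 margin are assumptions made for illustration.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always answers with the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(predict, features, labels):
    """Fraction of examples where the prediction equals the label."""
    correct = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return correct / len(labels)


def beats_baseline(model_acc, baseline_acc, min_margin=0.05):
    """A model is only worth shipping if it clearly beats the trivial baseline."""
    return model_acc - baseline_acc >= min_margin


def candidate(text):
    """Hypothetical candidate model: a keyword rule standing in for a real classifier."""
    return "japanese" if any(word in text for word in ("miso", "sushi")) else "italian"


train_labels = ["italian", "italian", "japanese", "italian"]
test_features = ["pasta with basil", "miso soup", "pizza", "sushi rice"]
test_labels = ["italian", "japanese", "italian", "japanese"]

baseline_acc = accuracy(majority_baseline(train_labels), test_features, test_labels)
candidate_acc = accuracy(candidate, test_features, test_labels)
ship_it = beats_baseline(candidate_acc, baseline_acc)
```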

Slide 28

Slide 28 text

ML Infrastructure ● Training can be reproduced (two models generate the same distribution). ● Unit tests for model code. ● Integration tests on a full ML pipeline (data-features-training-model-serving). ● Model quality is automatically assessed before serving. ● Rollbacks in production if
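
Two of the checks above, reproducible training and automatic quality assessment before serving, can be sketched as ordinary unit-testable functions. This is a toy under stated assumptions: `train` is a stand-in that averages a seeded random sample rather than a real training job, and the metric values and `tolerance` are made up. Real training is often only reproducible up to a distribution, so an exact equality check like this is the strictest possible form.

```python
import random


def train(data, seed):
    """Toy training run: 'learn' a threshold from a random half of the data."""
    rng = random.Random(seed)  # all randomness flows from one explicit seed
    sample = rng.sample(data, k=len(data) // 2)
    return sum(sample) / len(sample)


def quality_gate(candidate_metric, production_metric, tolerance=0.01):
    """Allow serving only if the candidate is not worse than production by more than tolerance."""
    return candidate_metric >= production_metric - tolerance


data = [float(i) for i in range(100)]

# Reproducibility: same data and same seed must give the same model.
run_a = train(data, seed=42)
run_b = train(data, seed=42)
reproducible = run_a == run_b

# Quality gate: run automatically in the pipeline before a model is served.
promote_good = quality_gate(candidate_metric=0.91, production_metric=0.90)
promote_bad = quality_gate(candidate_metric=0.80, production_metric=0.90)
```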

Slide 29

Slide 29 text

Monitoring for ML ● Observe (in)stability of data distribution - can cause upstream problems. ● Training and serving feature generation should compute the same values. ● Monitor for model staleness. ● NaNs in your data pipelines - what will the effect be? ● Regressions in prediction quality - where will you generate and store the data to be able to do this?
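
A rough sketch of three of these monitors follows: NaN rate in a feature, a crude distribution-stability check, and a training/serving skew check. The function names, feature names, and data are hypothetical, and the mean-shift measure is a simple stand-in for whatever drift statistic a production system would use.

```python
import math
import statistics


def nan_rate(values):
    """Fraction of values that are NaN."""
    return sum(1 for v in values if math.isnan(v)) / len(values)


def mean_shift(reference, live):
    """Distance between live and reference means, in reference standard deviations.

    NaNs in the live data are ignored here and reported separately by nan_rate.
    """
    clean_live = [v for v in live if not math.isnan(v)]
    ref_std = statistics.pstdev(reference)
    return abs(statistics.mean(clean_live) - statistics.mean(reference)) / ref_std


def skewed_features(training_row, serving_row, tolerance=1e-9):
    """Names of features whose training-time and serving-time values disagree."""
    return sorted(
        name
        for name in training_row
        if not math.isclose(
            training_row[name], serving_row.get(name, math.nan), abs_tol=tolerance
        )
    )


reference = [10.0, 12.0, 11.0, 9.0, 13.0]     # feature values seen at training time
live = [20.0, float("nan"), 22.0, 21.0]       # the same feature in production

live_nan_rate = nan_rate(live)
shift = mean_shift(reference, live)

# The same recipe run through the training and the serving feature code.
training_row = {"n_ingredients": 7.0, "title_length": 18.0}
serving_row = {"n_ingredients": 7.0, "title_length": 21.0}
skew = skewed_features(training_row, serving_row)
```

In practice each of these values would be exported as a metric and alerted on, which is where the question of where to generate and store the data becomes concrete.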

Slide 30

Slide 30 text

MLOps - technologies

Slide 31

Slide 31 text

The rise of machine learning libraries

Slide 32

Slide 32 text

The rise of cloud native machine learning

Slide 33

Slide 33 text

Pachyderm ● Company founded in 2014 ● Pachyderm platform runs on kubernetes ● Git for data, stored in custom distributed file system PFS (pachyderm file system). Tracks data commits and automates containerised processing between pfs buckets ● Pachyderm pipelines automate data processing
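
To make "git for data" concrete, here is a toy in-memory model of versioned data repositories with a processing step between them. It is only an analogy under stated assumptions: the `DataRepo` class and `run_pipeline` function are hypothetical and say nothing about how PFS is actually implemented or how Pachyderm schedules containers.

```python
import hashlib
import json


class DataRepo:
    """Toy versioned data repo: every commit is an immutable snapshot with a content hash."""

    def __init__(self):
        self.commits = []  # list of (commit_id, snapshot) tuples, oldest first

    def commit(self, files: dict) -> str:
        payload = json.dumps(files, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits.append((commit_id, dict(files)))
        return commit_id

    def checkout(self, commit_id: str) -> dict:
        for cid, snapshot in self.commits:
            if cid == commit_id:
                return dict(snapshot)
        raise KeyError(commit_id)


def run_pipeline(input_repo, output_repo, transform):
    """Process the newest input commit and commit the result to the output repo."""
    _, latest = input_repo.commits[-1]
    return output_repo.commit({name: transform(body) for name, body in latest.items()})


recipes = DataRepo()
cleaned = DataRepo()

v1 = recipes.commit({"a.txt": "Pasta  "})
out1 = run_pipeline(recipes, cleaned, str.strip)

v2 = recipes.commit({"a.txt": "Pasta  ", "b.txt": " Ramen"})
out2 = run_pipeline(recipes, cleaned, str.strip)
```

Because every output commit is derived from a known input commit, any result can be traced back to the exact data that produced it, and old versions stay available via `checkout`.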

Slide 34

Slide 34 text

Pachyderm Data Pipelines Version control for data

Slide 35

Slide 35 text

ML research pipeline (diagram) ● Data: recipes, embeddings, user interaction data, recommender model, metrics ● Steps: train embeddings, train recommender

Slide 36

Slide 36 text

Cuisine inference pipeline (diagram) ● Data: tagged recipes, model, metrics, recipes, cuisine tags ● Steps: train classifier, inference
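
One way the cuisine pipeline could be written down as Pachyderm pipeline specifications is sketched below, built as Python dictionaries and serialised to JSON. The wiring is an assumption about the diagram: tagged recipes feed a training step, and its output together with untagged recipes feeds inference. Repo names, image names, and script paths are hypothetical. The field names (`pipeline`, `transform`, `input`, `pfs`, `cross`, `glob`) follow the Pachyderm pipeline specification as documented around the 1.x releases; check them against the version you run.

```python
import json

# Training step: reads the tagged recipes repo, writes a model and metrics.
train_classifier = {
    "pipeline": {"name": "train_classifier"},
    "transform": {
        "image": "example/cuisine-train:latest",   # hypothetical image
        "cmd": ["python3", "/code/train.py"],      # hypothetical script
    },
    "input": {"pfs": {"repo": "tagged_recipes", "glob": "/"}},
}

# Inference step: combines the trained model with each untagged recipe.
# A pipeline's output repo carries the pipeline's name, so the model is
# read from the "train_classifier" repo.
inference = {
    "pipeline": {"name": "inference"},
    "transform": {
        "image": "example/cuisine-infer:latest",   # hypothetical image
        "cmd": ["python3", "/code/infer.py"],      # hypothetical script
    },
    "input": {
        "cross": [
            {"pfs": {"repo": "train_classifier", "glob": "/"}},
            {"pfs": {"repo": "recipes", "glob": "/*"}},
        ]
    },
}


def input_repos(spec):
    """List the repos a pipeline spec reads from (handles plain pfs and cross inputs)."""
    pipeline_input = spec["input"]
    if "pfs" in pipeline_input:
        return [pipeline_input["pfs"]["repo"]]
    return [part["pfs"]["repo"] for part in pipeline_input["cross"]]


specs_json = json.dumps([train_classifier, inference], indent=2)
```

Each spec would normally be saved to its own JSON file and submitted with `pachctl`. A new commit to `tagged_recipes` then retrains the classifier, which in turn triggers inference over the recipes.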

Slide 37

Slide 37 text

Pachyderm

Slide 38

Slide 38 text

Get it working... ● You will need a running k8s cluster, or minikube ● Install kubectl, pachctl ● Lots of examples at http://docs.pachyderm.io/en/latest/cookbook/ml.html

Slide 39

Slide 39 text

Kubeflow ● ML platform which runs on kubernetes ● Main features: ○ pipelines ○ katib ○ Kube bench ○ Notebooks ○ Model store (versioned) ○ Versioned data through metadata store ● Pachyderm pipelines automate data processing

Slide 40

Slide 40 text

Kubeflow https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g51f7fee6c8_14_77

Slide 41

Slide 41 text

Kubeflow architecture ● Kubeflow makes it easy to deploy ML apps ● Composable - Use the libraries/frameworks of your choice ● Scalable - number of users & workload size ● Portable - on prem, public cloud, local
Diagram layers: Libraries and CLIs (focus on end users) ● Systems (combine multiple services) ● Low Level APIs / Services (single function)
Components shown: Arena, kfctl, kubectl, katib, pipelines, notebooks, fairing, TFJob, PyTorch Job, Jupyter CR, Seldon CR, kube bench, Metadata, Orchestration, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, IAM, Scheduling. Legend: Developed By Kubeflow / Developed Outside Kubeflow (* not all components shown)
https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g501dac0d64_0_1

Slide 42

Slide 42 text

But what about TFX? ● TFX is an e2e solution to ML, using kubernetes/kubeflow as an orchestrator. (Diagram: the same Kubeflow architecture figure as the previous slide, which includes TFX among its components.)

Slide 43

Slide 43 text

Kubeflow is winning the race ● Will standardise the ML devops pipeline ● Technologies which solve parts of the MLOps problem have been integrated into the Kubeflow ecosystem ○ Pachyderm ○ Seldon ○ TFX ○ PyTorch

Slide 44

Slide 44 text

Kubeflow is winning the race
