Machine Learning in Production

Cookpad Bristol
December 09, 2019
Presentation given at Interactive AI CDT, Bristol University, 9th December 2019

Transcript

  1. Before we start… • I’ve been in industry a long time - if I start talking about unfamiliar concepts, please shout. • Some material might be familiar, some unfamiliar; we can speed up or slow down, so keep me updated on progress!
  2. Overview • Intro • Devops primer ◦ Docker ◦ Kubernetes

    ◦ CI/CD • MLOps ◦ Pachyderm ◦ Kubeflow
  3. About me • PhD, Bristol University 2009-2013 ◦ DTC BCCS • ML consultant at SecondSync 2011-2013 • Postdoc at QMUL 2013 in computational creativity • Data scientist/data architect at Black Swan Data 2013-2014 • Research engineer at Gluru 2015-2016 • AI lead at Adarga 2016-2018 • Currently Machine Learning Infrastructure Lead at Cookpad • Co-organiser of the Bristol Machine Learning meetup @ettieeyre
  4. Cookpad • Cookpad is a community platform that enables people to share recipe ideas and cooking tips. • Started in Japan in 1997; listed on the Tokyo Stock Exchange. • We’re a global company with offices in 10 countries and a team of 700. • In 2017, we set up our Global HQ in Bristol. We currently have about 100 employees in Bristol.
  5. Our mission is to make everyday cooking fun Because we

    believe that cooking is the key to a happier and healthier life for people, communities and the planet. The choices we make shape our world. And when we cook, the choices we make have an impact on ourselves, the people we cook for, the growers and producers we buy from and the wider environment. By building the platform that solves the issues related to everyday cooking and helps more people to cook, we believe we can help build a better world.
  6. 105m people on average use Cookpad every month • in 75+ countries • 6m recipes on the platform • in 30+ languages
  7. Devops “DevOps is a set of practices that combines software

    development and information-technology operations which aims to shorten the systems development life cycle and provide continuous delivery with high software quality.”
  8. Devops - a history • Edit code live on a

    server • Ops team build your executables, and deploy onto a server • Cloud services (AWS, GCP, Azure, IBM, Oracle) • Immutable infrastructure, docker and kubernetes. • Continuous integration and continuous delivery • Gitops!
  9. Devops timeline (approximate) • 1995: edit code live on server? • 2000: hand binaries to ops team • 2005: continuous integration • 2010: continuous delivery • 2013-2015: containers and orchestration
  10. 2013, 2014-15: Verma, Abhishek, et al. "Large-scale cluster management at Google with Borg." Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015.
  11. CD Demo • Make a PR on GitHub • Wait for CI approval • Merge the PR • CI builds and pushes a new image • Flux deploys the latest image
  12. Machine Learning & Software Delivery • Continuous integration • Container orchestration • Continuous delivery/deployment • ? • ? • ? [Diagram: traditional software combines input data with a program to produce results; machine learning combines input data with target data to produce a program.]
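The inversion in the slide's diagram can be made concrete in a few lines of Python (a minimal sketch; the function names are illustrative, not from the talk): in traditional software the program is fixed and maps data to results, while in machine learning the input data and target data together produce the program.

```python
# Traditional software: a fixed program maps input data to results.
def double(x):
    return 2 * x

# Machine learning: input data plus target data produce the "program"
# (here, a slope learned by closed-form least squares through the origin).
def fit_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # targets generated by the relation y = 2x
slope = fit_slope(xs, ys)   # the learned "program" is just this parameter
```

The learned parameter plays the role the hand-written function plays in traditional software; everything downstream (serving, testing, monitoring) has to account for the fact that it came from data.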
  13. A production system is >> ML Code Sculley, David, et

    al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems. 2015.
  14. Pillars of MLOps • Model development • ML infrastructure • Monitoring for ML. For each of these we want to achieve: • Reliability • Test coverage • Reproducible behaviour • Joint ownership of development and operations
  15. Model Development • Model configuration is code reviewed and checked into a repository. • Online metrics vs. proxy metrics - do they correlate? • Model staleness should be monitored (concept drift). • Simple models provide sensible baseline measures. • Model bias is tested/monitored.
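The baseline bullet can be sketched in a few lines (illustrative names, not from the talk): a model that always predicts the training-set mean gives a floor that any real model must beat on held-out data; if it cannot, something is wrong with the features, the pipeline, or the evaluation.

```python
# A minimal baseline sketch: ignore the input, predict the training mean.
def mean_baseline(train_targets):
    mean = sum(train_targets) / len(train_targets)
    return lambda _x: mean

# Mean squared error of any predictor over a held-out set.
def mse(predict, inputs, targets):
    return sum((predict(x) - t) ** 2 for x, t in zip(inputs, targets)) / len(targets)

baseline = mean_baseline([1.0, 2.0, 3.0])
# baseline ignores its input and always predicts 2.0
```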
  16. ML Infrastructure • Training can be reproduced (two models generate the same distribution). • Unit tests for model code. • Integration tests on the full ML pipeline (data-features-training-model-serving). • Model quality is automatically assessed before serving. • Rollbacks in production if a deployed model misbehaves.
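The reproducibility bullet can be sketched as follows (a toy stand-in for a real training loop, assuming all randomness is routed through a single seeded generator):

```python
import random

# Reproducibility sketch: two runs with the same seed should produce
# bit-identical parameters. "train" stands in for a real training loop.
def train(seed, steps=100):
    rng = random.Random(seed)           # all randomness flows from one seed
    weight = 0.0
    for _ in range(steps):
        weight += rng.gauss(0.0, 0.01)  # stand-in for a noisy gradient step
    return weight

run_a = train(seed=42)
run_b = train(seed=42)
assert run_a == run_b  # identical seeds -> identical models
```

In practice this also means pinning data snapshots, library versions, and hardware-dependent sources of non-determinism, not just the random seed.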
  17. Monitoring for ML • Observe (in)stability of the data distribution - upstream changes can cause problems downstream. • Training and serving feature generation should compute the same values. • Monitor for model staleness. • NaNs in your data pipelines - what will the effect be? • Regressions in prediction quality - where will you generate and store the data needed to detect them?
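A minimal sketch of two of the checks above, with hypothetical helper names: flag NaNs in live data, and alert when a feature's mean drifts too far from the training distribution (a real system would use a proper divergence measure such as PSI or KL rather than a raw mean shift).

```python
import math

# Flag any NaN values in a batch of live feature values.
def has_nans(values):
    return any(math.isnan(v) for v in values)

# Absolute shift between the training-time and live feature means.
def mean_shift(train_values, live_values):
    train_mean = sum(train_values) / len(train_values)
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - train_mean)

# Alert if the live batch contains NaNs or has drifted past a threshold.
def drift_alert(train_values, live_values, threshold=0.5):
    return has_nans(live_values) or mean_shift(train_values, live_values) > threshold
```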
  18. Pachyderm • Company founded in 2014 • The Pachyderm platform runs on Kubernetes • Git for data: data is stored in a custom distributed file system, PFS (Pachyderm File System), which tracks data commits • Pachyderm pipelines automate containerised processing between PFS repos
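A Pachyderm pipeline is declared as a JSON spec. The sketch below is hypothetical (the pipeline name, input repo, image and command are illustrative): the container reads committed input data from /pfs/<repo> and writes results to /pfs/out, and Pachyderm re-runs it whenever the input repo receives a new commit.

```json
{
  "pipeline": { "name": "train-embeddings" },
  "input": {
    "pfs": { "repo": "recipes", "glob": "/*" }
  },
  "transform": {
    "image": "example/train:latest",
    "cmd": ["python", "/train.py", "/pfs/recipes", "/pfs/out"]
  }
}
```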
  19. ML research pipeline [Diagram: recipes feed a train-embeddings step that produces embeddings; embeddings and user interaction data feed a train-recommender step that produces a recommender model and metrics.]
  20. Get it working... • You will need a running k8s

    cluster, or minikube • Install kubectl, pachctl • Lots of examples at http://docs.pachyderm.io/en/latest/cookbook/ml.html
  21. Kubeflow • ML platform which runs on Kubernetes • Main features: ◦ Pipelines ◦ Katib ◦ kube-bench ◦ Notebooks ◦ Model store (versioned) ◦ Versioned data through the metadata store • Kubeflow Pipelines automate ML workflows
  22. Kubeflow architecture • Kubeflow makes it easy to deploy ML apps • Composable - use the libraries/frameworks of your choice • Scalable - in number of users and workload size • Portable - on prem, public cloud, local [Architecture diagram: libraries and CLIs focused on end users (arena, kfctl, kubectl); systems combining multiple services (Katib, Pipelines, Notebooks, Fairing); low-level single-function APIs/services (TFJob, PyTorch Job, Jupyter CR, Seldon CR, kube-bench, Metadata, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX); IAM and scheduling; some components developed outside Kubeflow; not all components shown.] https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g501dac0d64_0_1
  23. But what about TFX? • TFX is an end-to-end solution to ML, using Kubernetes/Kubeflow as an orchestrator. [Same architecture diagram as the previous slide.]
  24. Kubeflow is winning the race • Will standardise the ML devops pipeline • Technologies which solve parts of the MLOps problem have been integrated into the Kubeflow ecosystem ◦ Pachyderm ◦ Seldon ◦ TFX ◦ PyTorch