Machine Learning in Production

Cookpad Bristol
December 09, 2019

Presentation given at Interactive AI CDT, Bristol University, 9th December 2019

Transcript

  1. Machine Learning in production
    Ettie Eyre
    Interactive AI CDT, Bristol University, 9th December 2019
    @ettieeyre

  2. Before we start…
    ● I’ve been in industry a long time - if I start
    talking about unfamiliar concepts please
    shout.
    ● Some material might be familiar, some
    unfamiliar, we can speed up or slow down so
    keep me updated on progress!

  3. Overview
    ● Intro
    ● Devops primer
    ○ Docker
    ○ Kubernetes
    ○ CI/CD
    ● MLOps
    ○ Pachyderm
    ○ Kubeflow

  4. About me
    ● PhD, Bristol University 2009-2013
    ○ DTC BCCS
    ● ML consultant at SecondSync 2011-2013
    ● Postdoc at QMUL 2013 in computational
    creativity
    ● Data Scientist/Data Architect at Black Swan
    Data 2013-2014
    ● Research Engineer at Gluru 2015-2016
    ● AI lead at Adarga 2016-2018
    ● Currently Machine Learning Infrastructure
    Lead at Cookpad
    ● Co-organise Bristol Machine Learning meetup
    @ettieeyre

  5. Cookpad
    ● Cookpad is a community platform that enables people to share recipe ideas
    and cooking tips.
    ● Started in Japan in 1997; a listed company on the Tokyo Stock Exchange.
    ● We’re a global company with offices in 10 countries and a team of 700.
    ● In 2017, we set up our Global HQ in Bristol. We currently have about 100
    employees in Bristol.

  6. Our mission is to make everyday cooking fun
    Because we believe that cooking is the key to a happier and healthier life for
    people, communities and the planet.
    The choices we make shape our world.
    And when we cook, the choices we make have an impact on ourselves, the
    people we cook for, the growers and producers we buy from and the wider
    environment.
    By building the platform that solves the issues related to everyday cooking and
    helps more people to cook, we believe we can help build a better world.

  7. 105 m people on average use Cookpad every month in 75+ countries
    6 m recipes on the platform in 30+ languages

  8. Devops primer

  9. Devops
    “DevOps is a set of practices that combines software development and
    information-technology operations which aims to shorten the systems
    development life cycle and provide continuous delivery with high software
    quality.”

  10. Devops - a history
    ● Edit code live on a server
    ● Ops team build your executables, and deploy onto a server
    ● Cloud services (AWS, GCP, Azure, IBM, Oracle)
    ● Immutable infrastructure, Docker and Kubernetes.
    ● Continuous integration and continuous delivery
    ● GitOps!

  11. Devops
    [Timeline figure with year markers 1995, 2000, 2005, 2010, 2013, 2015. Labels: Edit code live on server?; Hand binaries to ops team; Continuous integration; Continuous delivery]

  12. 2013 2014-5
    Verma, Abhishek, et al. "Large-scale cluster
    management at Google with Borg." Proceedings of the
    Tenth European Conference on Computer Systems.
    ACM, 2015.

  13. https://blog.netapp.com/blogs/containers-vs-vms/

  14. https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/

  15. Git reminder
    https://www.nobledesktop.com/blog/what-is-git-and-why-should-you-use-it

  16. Gitflow (feature branch development)
    Trunk-based development
    https://codeburst.io/trunk-based-development-vs-git-flow-a0212a6cae64

  17. What does a CI pipeline look like?

  18. What does a CD pipeline look like?
    @jnavarro86

  19. CD Demo
    ● Make PR on GitHub
    ● Wait for CI approval
    ● Merge PR
    ● CI will build and push new image
    ● Flux deploys latest image
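
The last two steps of the demo amount to a reconciliation loop: something notices a newer image in the registry and moves the cluster to it. The sketch below is a toy, in-memory model of that idea and not how Flux is actually implemented; the `Registry` and `Cluster` classes and the image tags are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a container image registry."""
    tags: list = field(default_factory=list)

    def push(self, tag: str) -> None:
        self.tags.append(tag)

    def latest(self) -> str:
        return self.tags[-1]


@dataclass
class Cluster:
    """Hypothetical stand-in for the deployed state of a cluster."""
    running_image: str = ""


def reconcile(registry: Registry, cluster: Cluster) -> bool:
    """Move the cluster to the newest image; return True if anything changed."""
    desired = registry.latest()
    if cluster.running_image == desired:
        return False
    cluster.running_image = desired
    return True


registry = Registry()
cluster = Cluster()
registry.push("app:1")          # initial image built by CI
reconcile(registry, cluster)    # first deployment
registry.push("app:2")          # CI builds and pushes after the PR is merged
changed = reconcile(registry, cluster)
```

Running `reconcile` again with no new image changes nothing, which is the property that makes this style of deployment safe to repeat.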

  20. Devops reading list
    http://www.davefarley.net/

  21. Devops summary
    ● Fast
    ● Reliable
    ● Delivery

  22. MLOps - principles
    DataOps
    MLInfra
    MLEng

  23. Machine Learning & Software Delivery
    ● Continuous integration
    ● Container orchestration
    ● Continuous delivery/deployment
    ● ?
    ● ?
    ● ?
    Traditional Software: Input Data + Program → Results
    Machine Learning: Input Data + Target Data → Program
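
The contrast between the two columns can be made concrete with a toy example. In the traditional case a person writes the program; in the machine learning case a program is fitted from input and target data. The sketch uses ordinary least squares on a temperature conversion purely for illustration; the function names and data are assumptions and do not come from the talk.

```python
def traditional_program(celsius: float) -> float:
    """Traditional software: a hand-written rule maps input data to results."""
    return celsius * 9.0 / 5.0 + 32.0


def learn_program(inputs, targets):
    """Machine learning: fit slope and intercept from input and target data."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(targets) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, targets))
    var = sum((x - mean_x) ** 2 for x in inputs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


input_data = [-10.0, 0.0, 10.0, 25.0, 40.0]
# Targets are generated with the hand-written rule so the two can be compared.
target_data = [traditional_program(c) for c in input_data]
learned_program = learn_program(input_data, target_data)
```

Here `learned_program` is itself the output of the process, which is why delivering ML needs more than the usual software pipeline.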

  24. MLOps in the literature
    NIPS, 2013 NIPS, 2017

  25. A production system is >> ML Code
    Sculley, David, et al. "Hidden technical debt in machine learning systems." Advances in neural information processing systems. 2015.

  26. Pillars of MLOps
    ● Model Development
    ● ML infrastructure
    ● Monitoring for ML
    For each of these we want to achieve
    ● Reliability
    ● Test coverage
    ● Reproducible behaviour
    ● Joint ownership of development and operations

  27. Model Development
    ● Model configuration is code reviewed and is checked into a repository.
    ● Online metrics vs. proxy metrics - do they correlate?
    ● Model staleness should be monitored (concept drift).
    ● Simple models provide sensible baseline measures.
    ● Model bias is tested/monitored.
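
One way to read the baseline bullet: before trusting a model's score, compare it with the simplest possible predictor. A minimal sketch, assuming a classification task with accuracy as the metric; the labels and the model predictions are made-up illustration data.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(train_labels, n_predictions):
    """Always predict the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_predictions


train_labels = ["italian", "italian", "japanese", "italian", "thai"]
test_labels = ["italian", "japanese", "italian", "thai"]
# Hypothetical output of the model under evaluation.
model_predictions = ["italian", "japanese", "italian", "italian"]

baseline_accuracy = accuracy(majority_baseline(train_labels, len(test_labels)), test_labels)
model_accuracy = accuracy(model_predictions, test_labels)
beats_baseline = model_accuracy > baseline_accuracy
```

If `beats_baseline` is false, the extra model complexity is not yet paying for itself.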

  28. ML Infrastructure
    ● Training can be reproduced (two models generate the same distribution).
    ● Unit tests for model code.
    ● Integration tests on a full ML pipeline
    (data-features-training-model-serving).
    ● Model quality is automatically assessed before serving.
    ● Rollbacks in production if
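
The reproducibility and automatic quality assessment bullets can be turned into checks that run in CI. The sketch below is an assumption about how that might look, not a description of any particular system: `train` is a toy stand-in for real training, and the scores and tolerance are invented.

```python
import random


def train(seed: int, n_samples: int = 200) -> float:
    """Toy 'training': estimate a parameter from seeded random data."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(samples) / n_samples


def quality_gate(candidate_score: float, serving_score: float, tolerance: float = 0.01) -> bool:
    """Allow promotion only if the candidate is at most `tolerance` worse than the serving model."""
    return candidate_score >= serving_score - tolerance


# Reproducibility: the same seed and data should give the same model.
model_a = train(seed=42)
model_b = train(seed=42)
reproducible = model_a == model_b

# Quality gate: hypothetical evaluation scores for two candidate models.
promote = quality_gate(candidate_score=0.91, serving_score=0.90)
block = quality_gate(candidate_score=0.80, serving_score=0.90)
```

Real training is rarely bit-for-bit deterministic, so in practice the equality check would usually be relaxed to a comparison of metric distributions.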

  29. Monitoring for ML
    ● Observe (in)stability of data distribution - can cause upstream problems.
    ● Training and serving feature generation should compute the same values.
    ● Monitor for model staleness.
    ● NaNs in your data pipelines - what will the effect be?
    ● Regressions in prediction quality - where will you generate and store the
    data to be able to do this?
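
Two of the checks above, NaNs in the pipeline and a shift between training and serving feature values, can be sketched with the standard library alone. The thresholds and feature values are assumptions chosen for illustration; a production monitor would use proper statistical tests and real data.

```python
import math
from statistics import mean, stdev


def nan_fraction(values):
    """Share of values that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)


def mean_shift(training_values, serving_values):
    """Shift of the serving mean, in units of the training standard deviation."""
    clean_serving = [v for v in serving_values if not math.isnan(v)]
    return abs(mean(clean_serving) - mean(training_values)) / stdev(training_values)


training_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
serving_feature = [6.0, 7.0, float("nan"), 8.0, 9.0]

alerts = []
if nan_fraction(serving_feature) > 0.1:  # threshold is an assumption
    alerts.append("too many NaNs")
if mean_shift(training_feature, serving_feature) > 2.0:  # threshold is an assumption
    alerts.append("feature distribution shifted")
```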

  30. MLOps - technologies

  31. The rise of machine
    learning libraries

  32. The rise of cloud
    native machine
    learning

  33. Pachyderm
    ● Company founded in 2014
    ● Pachyderm platform runs on Kubernetes
    ● Git for data, stored in a custom distributed file system, PFS (Pachyderm
    File System). Tracks data commits and automates containerised processing
    between PFS buckets
    ● Pachyderm pipelines automate data processing
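
A Pachyderm pipeline is declared as a JSON spec that names the pipeline, the container to run, and the input repository. The sketch below builds such a spec in Python for a hypothetical classifier-training step. The repo name, image, and command are invented, and the `pipeline` / `transform` / `input.pfs` layout is assumed from the Pachyderm 1.x documentation, so check it against the version you run.

```python
import json

pipeline_spec = {
    "pipeline": {"name": "train-classifier"},  # hypothetical pipeline name
    "transform": {
        "image": "example/cuisine-trainer:latest",  # hypothetical container image
        # Input repos are mounted under /pfs/<repo>; results are written to /pfs/out.
        "cmd": ["python3", "/train.py", "/pfs/tagged-recipes", "/pfs/out"],
    },
    "input": {
        "pfs": {"repo": "tagged-recipes", "glob": "/"},  # hypothetical input repo
    },
}

spec_json = json.dumps(pipeline_spec, indent=2)
```

Written to a file, a spec like this is what gets handed to `pachctl` when creating the pipeline; each new commit to the input repo then triggers the containerised step.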

  34. Pachyderm
    Data Pipelines
    Version control for data

  35. ML research pipeline
    [Pipeline diagram. Node labels: recipes, train embeddings, embeddings, user interaction data, train recommender, recommender model, metrics]

  36. Cuisine inference pipeline
    [Pipeline diagram. Node labels: tagged recipes, train classifier, metrics, model, inference, recipes, cuisine tags]

  37. Pachyderm

  38. Get it working...
    ● You will need a running k8s cluster, or minikube
    ● Install kubectl, pachctl
    ● Lots of examples at http://docs.pachyderm.io/en/latest/cookbook/ml.html

  39. Kubeflow
    ● ML platform which runs on Kubernetes
    ● Main features:
    ○ Pipelines
    ○ Katib
    ○ Kubebench
    ○ Notebooks
    ○ Model store (versioned)
    ○ Versioned data through metadata store
    ● Pachyderm pipelines automate data processing

  40. Kubeflow
    https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g51f7fee6c8_14_77

  41. Kubeflow architecture
    ● Kubeflow makes it easy to deploy ML apps
    ● Composable - Use the libraries/frameworks of your choice
    ● Scalable - number of users & workload size
    ● Portable - on prem, public cloud, local
    [Architecture diagram with three layers: Libraries and CLIs (focus on end users); Systems (combine multiple services); Low Level APIs / Services (single function). Components shown: Arena, kfctl, kubectl, katib, pipelines, notebooks, fairing, TFJob, PyTorch Job, Jupyter CR, Seldon CR, kube bench, Metadata, Orchestration, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, IAM, Scheduling. Legend: Developed By Kubeflow / Developed Outside Kubeflow. * Not all components shown]
    https://docs.google.com/presentation/d/13a3shc98F-G779tnNFXb4ZYfhniInI--NtUanIivkOM/edit#slide=id.g501dac0d64_0_1

  42. But what about TFX?
    [Kubeflow architecture diagram, repeated from slide 41, with TFX among the components developed outside Kubeflow]
    ● TFX is an e2e solution to ML, using Kubernetes/Kubeflow as an
    orchestrator.

  43. Kubeflow is winning the race
    ● Will standardise the ML DevOps pipeline
    ● Technologies which solve parts of the MLOps problem have been
    integrated into the Kubeflow ecosystem
    ○ Pachyderm
    ○ Seldon
    ○ TFX
    ○ PyTorch

  44. Kubeflow is winning the race
