Docker and Python: making them play nicely and securely for ML and DS

Docker has become a standard tool for developers around the world to deploy applications in a reproducible and robust manner. Docker and Docker Compose have reduced the time needed to set up new software and implement complex technology stacks for our applications.

Now, six years after the initial release of Docker, we can say with confidence that containers and container orchestration have become defaults in current technology stacks.

There are thousands of tutorials and getting-started documents for those wanting to adopt Docker for app deployment. However, if you are a data scientist, a researcher or someone working on scientific computing, the story is quite different. Compared with app and web development, there are very few tutorials and documents focused on Docker best practices for DS and scientific computing. If you are working on DS, ML or scientific computing, this talk is for you. We’ll cover best practices for building Docker containers for data-intensive applications, from optimising your image builds to securing your containers and making your deployment workflows efficient. We will talk about the most common problems faced when using Docker with data-intensive applications and how you can overcome most of them. Finally, I’ll give some practical tips to improve your Docker workflows and practices.

Attendees will leave the talk feeling confident about adopting Docker across a range of DS, ML and research projects.

Tania Allard

October 20, 2020

Transcript

  1. Python and Docker
    Making them play nicely and securely for data science and machine learning
    Tania Allard, PhD
    @ixek

  2. TABLE OF CONTENTS
    01 Introduction: Python and the ML scene in 2020
    02 Docker for ML: Docker for data science, caveats and gotchas
    03 Best practices: making the most of Docker
    04 Summary: top tips

  3. Key takeaways
    - Beginner: why you’d want to use Docker to isolate your environments
    - Intermediate: best practices for using Docker for ML and DS
    - Advanced: some techniques and tips to optimise your Docker images and workflows

  4. Hi I am Tania
    - I love Open Source and all things data
    - I am a Sr. Developer Advocate @Microsoft
    - I am obsessed with Outrun and Cyberpunk aesthetics
    - I love mechanical keyboards
    - You can find me at trallard.dev
    My dog loves barking while I am giving talks

  5. THESE SLIDES
    BIT.LY/ATO-ML-DOCKER

  6. INTRODUCTION
    Understanding the ML ecosystem
    in 2020
    01

  7. Why Python?
    https://octoverse.github.com/

  8. Data science growth
    https://octoverse.github.com/

  9. Within the Python community
    Data analysis and Machine Learning
    are within the top 3 uses for
    Python
    ixek | https://bit.ly/ato-ml-docker

  10. Pythonistas’ pet peeve
    https://xkcd.com/1987/

  11. Installation trends
    Down the rabbit hole - core
    Python installation
    https://www.jetbrains.com/lp/python-developers-survey-2019/

  12. Frameworks popularity
    NumPy is the most used library; it basically holds up the entire scientific Python ecosystem
    https://www.jetbrains.com/lp/python-developers-survey-2019/

  13. Env isolation trends
    Docker has steadily gained
    momentum over the years
    https://www.jetbrains.com/lp/python-developers-survey-2019/

  14. What is Docker?
    A tool that helps you to create, deploy and run
    your applications or projects by using
    containers.
    This is a container

  15. How do containers help me?
    They provide a solution to the
    problem of how to get software to
    run reliably when moved from one
    computing environment to another
    Your laptop
    Test environment
    Staging environment
    Production environment

  16. Everything works fine… on my laptop
    Your application
    How are your users or colleagues meant to know what dependencies they need?
    ImportError: no module named x, y, x

  17. Everything works fine… on my laptop
    Your application
    And even with package managers!
    ImportError: no module named x, y, x

  18. Dev life with containers
    Your application
    Libraries, dependencies,
    runtime environment,
    configuration files

  19. Bliss
    But… are containers the one-stop solution?

  20. Docker for ML/DS
    Caveats and gotchas
    02

  21. The good…
    Good: provides app-level environment isolation, so it does not mess with your local environment.
    Better: as each image is tagged, you can keep track not only of your app/library versions but also of your dev environment.
    [Diagram: a Docker image with the tags "Latest" and "1.0.2", started with `$ docker run`]

  22. The bad and the ugly
    Bad
    As a beginner - most Docker
    tutorials out there do not
    focus on ML

  23. The bad and the ugly
    Bad: as a beginner, most tutorials out there do not focus on ML. Plus, Docker can have a steep learning curve.
    The ugly*: the scientific Python ecosystem is great… but dealing with dependencies can be a pain. Add GPUs, multiple architectures, multiple OSes, Python versions, notebooks, dashboards and APIs that expose models.

  24. Common pain points in DS and ML
    - Complex setups / dependencies
    - Might need GPUs or libraries like Dask
    - Not everything can be exposed as an API
    - Reliance on data / databases
    - Fast evolving projects (iterative R&D process)
    - Docker is complex and can take a lot of time to upskill
    - Are containers secure enough for my data / model / algorithm?

  25. How is it different from web apps, for example?
    https://twitter.com/dstufft/status/1095164069802397696

  26. How is it different from web apps?
    - Not every deliverable is an app
    - Not every deliverable is a model either
    - Heavily relies on data
    - Mixture of wheels and compiled packages
    - Security access levels - for data and software
    - Mixture of stakeholders: data scientists, software engineers, ML engineers

  27. ML goes way beyond a “model”

  28. The pillars of reproducibility
    Environment
    Code
    Data
    If one changes, everything changes

  29. Building Docker images
    Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files

  30. Dissecting Docker images
    - Base image
    - Main instructions
    - Entry command
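
    The three parts named on this slide can be sketched as a small Dockerfile. This is only an illustration, not the file from the talk: the base image tag, the pinned pandas version and the script name `analysis.py` are assumptions.

    ```dockerfile
    # Base image: the starting point for everything below
    FROM python:3.8-slim-buster

    # Main instructions: install software, configure the image, copy files
    WORKDIR /app
    RUN pip install --no-cache-dir pandas==1.1.3
    # analysis.py is a hypothetical script sitting next to the Dockerfile
    COPY analysis.py .

    # Entry command: what runs when a container is started from this image
    CMD ["python", "analysis.py"]
    ```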

  31. Dissecting Docker images
    Each instruction creates a layer (like an onion)
    [Diagram: stacked layers labelled Install pandas, Install requests, Install flask and Base image]

  32. Choosing the best base image
    https://github.com/docker-library/docs/tree/master/python
    - If building from scratch use the official Python images
    https://hub.docker.com/_/python

  33. The Jupyter Docker stacks
    Need Conda, notebooks and the scientific Python ecosystem?
    Try Jupyter Docker stacks
    https://jupyter-docker-stacks.readthedocs.io/
    Images: base-notebook, minimal-notebook, scipy-notebook, r-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, all-spark-notebook

  34. Best practices
    - Always know what you are expecting
    - Provide context with LABELS
    - Split complex RUN statements and sort them
    - Prefer COPY over ADD for adding files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
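
    A rough sketch of how these four practices might look together in one Dockerfile. The label values, the system packages and `requirements.txt` are placeholders chosen for illustration, not taken from the talk.

    ```dockerfile
    # Know what you are expecting: pin the base image to a specific tag
    FROM python:3.8-slim-buster

    # Provide context with labels (all values here are placeholders)
    LABEL maintainer="Your Name <you@example.com>" \
          org.opencontainers.image.description="Example data science image" \
          version="0.1.0"

    # Split a complex RUN statement over several lines and sort the packages
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
            build-essential \
            curl \
            git \
        && rm -rf /var/lib/apt/lists/*

    # Prefer COPY for plain local files
    WORKDIR /app
    COPY requirements.txt .
    ```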

  35. Speed up your build
    - Leverage build cache
    - Install only necessary packages
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  36. Speed up your build and proof
    - Leverage build cache
    - Install only necessary packages
    - Explicitly ignore files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
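
    One common way to leverage the build cache is to order instructions from least to most frequently changing. The sketch below assumes a project with a `requirements.txt` and a `main.py`; both names are hypothetical.

    ```dockerfile
    FROM python:3.8-slim-buster
    WORKDIR /app

    # Copy only the dependency list first: this layer and the install below
    # are reused from the cache as long as requirements.txt does not change
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the frequently changing project code last, so editing it does not
    # force the dependencies to be reinstalled
    COPY . .

    CMD ["python", "main.py"]
    ```

    Because `COPY . .` sends the whole build context, this is also where explicitly ignoring files pays off: a `.dockerignore` file next to the Dockerfile can list things like `.git` or local data directories so they never reach the image.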

  37. Mount volumes to access data
    - You can use bind mounts to directories (unless you are using a database)
    - Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  38. Mount volumes to access data
    - You can use bind mounts to directories (unless you are using a database)
    - Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
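
    A minimal sketch of creating a non-root user in a Debian-based image. The user name `scientist`, the UID and the directory layout are assumptions made for this example.

    ```dockerfile
    FROM python:3.8-slim-buster

    # Placeholder user name and UID; on Linux, matching the UID of your host
    # user helps avoid permission problems on bind-mounted directories
    ARG USER_NAME=scientist
    ARG USER_UID=1000
    RUN useradd --create-home --uid "${USER_UID}" --shell /bin/bash "${USER_NAME}"

    # Every instruction from here on, and the container itself, runs as this user
    USER ${USER_NAME}
    RUN mkdir -p "/home/${USER_NAME}/project/data"
    WORKDIR /home/${USER_NAME}/project
    ```

    With the defaults above, a local `data` directory could then be bind mounted at run time with something like `docker run -v "$(pwd)/data:/home/scientist/project/data" ...`.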

  39. Use multi-stage builds
    - Fetch and manage secrets in an intermediate layer
    - Not all your dependencies will have been packed as wheels, so you might need a compiler: build a compile image and a runtime image
    - Smaller images overall

  40. Use multi-stage builds
    [Diagram: compile-image (Docker image), copy virtual environment, runtime-image (Docker image)]
    $ docker build --pull --rm -f "Dockerfile" \
        -t trallard:data-scratch-1.0 "."
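
    The compile-image / runtime-image split in the diagram could look roughly like the Dockerfile below. It is a sketch rather than the Dockerfile used in the talk: the base tag, the virtual environment path `/opt/venv` and `requirements.txt` are assumptions.

    ```dockerfile
    # Stage 1: compile-image, which carries the compilers needed to build
    # any dependency that is not available as a wheel
    FROM python:3.8-slim-buster AS compile-image
    RUN apt-get update \
        && apt-get install -y --no-install-recommends build-essential gcc \
        && rm -rf /var/lib/apt/lists/*

    # Install all dependencies into a virtual environment
    RUN python -m venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Stage 2: runtime-image, with no compilers, only the finished environment
    FROM python:3.8-slim-buster AS runtime-image
    COPY --from=compile-image /opt/venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    CMD ["python"]
    ```

    Only the last stage ends up in the tagged image, which is where the smaller overall size comes from.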

  41. Use multi-stage builds
    [Diagram: runtime-image (Docker image) becomes the FINAL IMAGE, trallard:data-scratch-1.0]

  42. Do not reinvent the wheel
    Leverage the existence and usage of tools like repo2docker. Already configured and optimised for Data Science / Scientific computing.
    https://repo2docker.readthedocs.io/en/latest
    $ conda install jupyter repo2docker
    $ jupyter-repo2docker "."

  43. Delegate to your continuous integration Tool
    Set up continuous integration (Travis, GitHub Actions, whatever you prefer) and delegate your build to it. Also build often.

  44. This workflow
    - Code in version control
    - Trigger on tag / Also scheduled trigger
    - Build image
    - Push image

  45. Top tips
    My top recommendations
    04

  46. Top tips
    1. Rebuild your images frequently - get security updates for system packages
    2. Never work as root / minimise the privileges
    3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
    4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
    5. Leverage build cache

  47. Top tips
    6. Use one Dockerfile per project
    7. Use multi-stage builds - need to compile code? Need to reduce your image size?
    8. Make your images identifiable (test, production, R&D) - also be careful when accessing databases and using ENV variables / build variables
    9. Do not reinvent the wheel! Use repo2docker
    10. Automate - no need to build and push manually
    11. Use a linter
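
    For tip 8, one possible way to make an image identifiable is to pass the stage in as a build variable and record it as a label. The names `IMAGE_STAGE` and `APP_STAGE` are made up for this sketch.

    ```dockerfile
    FROM python:3.8-slim-buster

    # Build variable with a default; it can be overridden at build time with
    #   docker build --build-arg IMAGE_STAGE=production ...
    ARG IMAGE_STAGE=test

    # Record the stage as a label so the image can be identified later
    LABEL stage="${IMAGE_STAGE}"

    # Expose the same value to the running application (APP_STAGE is a made-up name)
    ENV APP_STAGE="${IMAGE_STAGE}"
    ```

    This is also why care is needed with these variables: ENV values are stored in the image, and build arguments can show up in `docker history`, so neither is a good place for database passwords or other secrets.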

  48. THANKS
    Get in touch:
    trallard.dev