
Docker and Python: making them play nicely and securely for ML and DS

Docker has become a standard tool for developers around the world to deploy applications in a reproducible and robust manner. Docker and Docker Compose have reduced the time needed to set up new software and implement complex technology stacks for our applications.

Now, six years after the initial release of Docker, we can say with confidence that containers and container orchestration have become defaults in current technology stacks.

There are thousands of tutorials and getting-started documents for those wanting to adopt Docker for app deployment. However, if you are a data scientist, a researcher or someone working on scientific computing, the story is quite different. Compared with app and web development, there are very few tutorials and documents focused on Docker best practices for DS and scientific computing. If you are working on DS, ML or scientific computing, this talk is for you. We’ll cover best practices for building Docker containers for data-intensive applications, from optimising your image build to keeping your containers secure and your deployment workflows efficient. We will talk about the most common problems faced while using Docker with data-intensive applications and how you can overcome most of them. Finally, I’ll give some practical tips to improve your Docker workflows and practices.

Attendees will leave the talk feeling confident about adopting Docker across a range of DS, ML and research projects.


Tania Allard

October 20, 2020


  1. Python and Docker

    Making them play nicely and securely for data science and machine learning. Tania Allard, PhD @ixek
  2. TABLE OF CONTENTS

    01 Introduction: Python and the ML scene in 2020
    02 Docker for ML: Docker for data science, caveats and gotchas
    03 Best practices: Making the most of Docker
    04 Top tips: Summary
  3. Key takeaways

    Beginner: Why you’d want to use Docker to isolate your environments
    Intermediate: Best practices for using Docker for ML and DS
    Advanced: Some techniques and tips to optimise your Docker images and workflows
  4. Hi I am Tania

    - I love Open Source and all things data
    - I am a Sr. Developer Advocate @Microsoft
    - I am obsessed with Outrun and Cyberpunk aesthetics
    - I love mechanical keyboards
    - You can find me at trallard.dev
    My dog loves barking while I am giving talks

  6. INTRODUCTION Understanding the ML ecosystem in 2020 01

  7. Why Python? https://octoverse.github.com/

  8. Data science growth https://octoverse.github.com/

  9. Within the Python community

    Data analysis and Machine Learning are within the top 3 uses for Python ixek | https://bit.ly/ato-ml-docker
  10. Pythonistas’ pet-peeve https://xkcd.com/1987/

  11. Installation trends

    Down the rabbit hole - core Python installation https://www.jetbrains.com/lp/python-developers-survey-2019/
  12. Frameworks popularity

    NumPy is the most used library - it basically holds up the entire scientific Python ecosystem https://www.jetbrains.com/lp/python-developers-survey-2019/
  13. Env isolation trends

    Docker has steadily gained momentum over the years https://www.jetbrains.com/lp/python-developers-survey-2019/
  14. What is Docker?

    A tool that helps you to create, deploy and run your applications or projects by using containers. This is a container
  15. How do containers help me?

    They provide a solution to the problem of how to get software to run reliably when moved from one computing environment to another: your laptop, test environment, staging environment, production environment
  16. Everything works fine… on my laptop

    Your application: how are your users or colleagues meant to know what dependencies they need? ImportError: no module named x, y, z
  17. Everything works fine… on my laptop

    Your application: and even with package managers! ImportError: no module named x, y, z
  18. Dev life with containers

    Your application: libraries, dependencies, runtime environment, configuration files
  19. Bliss But… are containers the one-stop solution?

  20. Docker for ML/DS Caveats and gotchas 02

  21. The good…

    Good: provides app-level environment isolation, so it does not mess with your local environment.
    Better: as each image is tagged, you can keep track not only of your app / library versions but also of your dev environment.
    Diagram labels: Docker image, $ docker run, tags latest and 1.0.2
  22. The bad and the ugly

    Bad: as a beginner, most Docker tutorials out there do not focus on ML
  23. The bad and the ugly

    Bad: as a beginner, most tutorials out there do not focus on ML. Plus Docker can have a steep learning curve.
    The ugly*: the Scientific Python ecosystem is great… but dealing with dependencies can be a pain. Add GPUs, multiple architectures, multiple OSes, Python versions, notebooks, dashboards, APIs that expose models
  24. Common pain points in DS and ML

    - Complex setups / dependencies
    - Might need GPUs or libraries like Dask
    - Not everything can be exposed as an API
    - Reliance on data / databases
    - Fast-evolving projects (iterative R&D process)
    - Docker is complex and can take a lot of time to upskill
    - Are containers secure enough for my data / model / algorithm?
  25. How is it different from web apps for example? https://twitter.com/dstufft/status/1095164069802397696
  26. How is it different from web apps?

    - Not every deliverable is an app
    - Not every deliverable is a model either
    - Heavily relies on data
    - Mixture of wheels and compiled packages
    - Security access levels - for data and software
    - Mixture of stakeholders: data scientists, software engineers, ML engineers
  27. ML goes way beyond a “model”

  28. The pillars of reproducibility

    Environment, code, data. If one changes, everything changes
  29. Building Docker images

    Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files
  30. Dissecting Docker images

    Base image, main instructions, entry command
  31. Dissecting Docker images

    Base image, install pandas, install requests, install flask. Each instruction creates a layer (like an onion)
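
The anatomy described in slides 29 to 31 could look roughly like the sketch below. It is a minimal illustration rather than the Dockerfile shown in the talk: the base image tag and the `app.py` file are assumptions, and the three packages are the ones named on the slide.

```dockerfile
# Base image: an official Python image (the tag is an illustrative assumption)
FROM python:3.8-slim-buster

# Main instructions: each RUN below adds a layer on top of the base image
RUN pip install --no-cache-dir pandas
RUN pip install --no-cache-dir requests
RUN pip install --no-cache-dir flask

# Hypothetical application file, copied into the image as a further layer
WORKDIR /app
COPY app.py .

# Entry command: what runs when a container is started from the image
CMD ["python", "app.py"]
```

Because the layers stack in order, changing one instruction invalidates that layer and every layer after it, while the layers before it can be reused.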
  32. Choosing the best base image https://github.com/docker-library/docs/tree/master/python

    • If building from scratch use the official Python images https://hub.docker.com/_/python
  33. The Jupyter Docker stacks

    Need Conda, notebooks and the scientific Python ecosystem? Try Jupyter Docker stacks https://jupyter-docker-stacks.readthedocs.io/
    Images: ubuntu@SHA, base-notebook, minimal-notebook, scipy-notebook, r-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, all-spark-notebook
  34. Best practices

    - Always know what you are expecting
    - Provide context with LABELS
    - Split complex RUN statements and sort them
    - Prefer COPY to add files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
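
One way the four practices on slide 34 might be combined in a single Dockerfile is sketched here. The label values, the system packages and the `requirements.txt` file are placeholders chosen for illustration, not details from the talk.

```dockerfile
# Know what you are expecting: pin the base image tag (tag is an assumption)
FROM python:3.8.6-slim-buster

# Provide context with labels (all values here are placeholders)
LABEL maintainer="Your Name <you@example.com>" \
      version="0.1.0" \
      description="Example image for a data science project"

# Split long RUN statements: one package per line, sorted alphabetically,
# so that changes are easy to review
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        build-essential \
        curl \
        git \
    && rm -rf /var/lib/apt/lists/*

# Prefer COPY to ADD when adding local files
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```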
  35. Speed up your build

    - Leverage build cache
    - Install only necessary packages
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
  36. Speed up your build and proof

    - Leverage build cache
    - Install only necessary packages
    - Explicitly ignore files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
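
Leveraging the build cache mostly comes down to instruction order. The sketch below assumes a hypothetical project with a `requirements.txt` file next to the source code, and places the rarely changing steps first.

```dockerfile
FROM python:3.8-slim-buster
WORKDIR /app

# Dependencies change rarely: copy only the requirements file first, so this
# layer and the install below are reused from cache when just the code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code changes often: copy it last
COPY . .
```

For the "explicitly ignore files" point, a `.dockerignore` file in the build context lists paths that should never be sent to the Docker daemon (for example `.git` or a local data directory), which also keeps them out of `COPY . .`.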
  37. Mount volumes to access data

    - You can use bind mounts to directories (unless you are using a database)
    - Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
  38. Mount volumes to access data

    - You can use bind mounts to directories (unless you are using a database)
    - Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
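
A non-root user can be created in the Dockerfile itself. In this sketch the user name, the UID and the mount path are illustrative assumptions. On Linux hosts, files written to a bind mount keep the UID of the container process, so running as root inside the container tends to leave root-owned files on the host.

```dockerfile
FROM python:3.8-slim-buster

# Create an unprivileged user; the name and UID are illustrative assumptions
ARG USERNAME=analyst
ARG USER_UID=1000
RUN useradd --create-home --uid ${USER_UID} --shell /bin/bash ${USERNAME}

# Everything from here on runs as that user instead of root
USER ${USERNAME}
WORKDIR /home/${USERNAME}

# Data is not copied into the image; it is bind mounted at run time, e.g.:
#   docker run -v "$(pwd)/data:/home/analyst/data" <image>
```

Matching `USER_UID` to your own UID on the host (passed with `--build-arg USER_UID=...`) is one way to keep ownership of the mounted files consistent.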
  39. Use multi-stage builds

    - Fetch and manage secrets in an intermediate layer
    - Not all your dependencies will have been packed as wheels, so you might need a compiler: build a compile image and a runtime image
    - Smaller images overall
  40. Use multi-stage builds

    Diagram: compile-image (Docker image), copy virtual environment, runtime-image (Docker image)
    $ docker build --pull --rm -f "Dockerfile" -t trallard:data-scratch-1.0 "."
  41. Use multi-stage builds

    Diagram: runtime-image (Docker image). FINAL IMAGE: trallard:data-scratch-1.0
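
A multi-stage build along the lines of slides 39 to 41 might be sketched as follows. The stage names follow the slide diagram; the base image tag, the system packages and `requirements.txt` are assumptions made for illustration.

```dockerfile
# Stage 1: compile image, with the build tools needed by packages that do
# not ship wheels
FROM python:3.8-slim-buster AS compile-image
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential gcc \
    && rm -rf /var/lib/apt/lists/*

# Install everything into a virtual environment so it can be copied as a unit
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image, without compilers or build leftovers
FROM python:3.8-slim-buster AS runtime-image
COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python"]
```

Only what is explicitly copied with `COPY --from` reaches the final image, so files that exist only in the first stage are left behind. Both stages use the same base image here so that the copied virtual environment still points at a valid Python interpreter.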
  42. Do not reinvent the wheel

    Leverage existing tools like repo2docker, which is already configured and optimised for data science / scientific computing. https://repo2docker.readthedocs.io/en/latest
    $ conda install jupyter repo2docker
    $ jupyter-repo2docker "."
  43. Delegate to your continuous integration tool

    Set up continuous integration (Travis, GitHub Actions, whatever you prefer) and delegate your build to it. Also build often.
  44. This workflow

    - Code in version control
    - Trigger on tag / also scheduled trigger
    - Build image
    - Push image
  45. Top tips My top recommendations 04

  46. Top tips

    1. Rebuild your images frequently - get security updates for system packages
    2. Never work as root / minimise the privileges
    3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
    4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
    5. Leverage build cache
  47. Top tips

    6. Use one Dockerfile per project
    7. Use multi-stage builds - need to compile code? Need to reduce your image size?
    8. Make your images identifiable (test, production, R&D) - also be careful when accessing databases and using ENV variables / build variables
    9. Do not reinvent the wheel! Use repo2docker
    10. Automate - no need to build and push manually
    11. Use a linter
  48. THANKS Get in touch: trallard.dev