Slide 1

Slide 1 text

Python and Docker: making them play nicely and securely for data science and machine learning
Tania Allard, PhD (@ixek)

Slide 2

Slide 2 text

TABLE OF CONTENTS
01 Introduction: Python and the ML scene in 2020
02 Docker for ML: Docker for data science, caveats and gotchas
03 Best practices: Making the most of Docker
04 Top tips: Summary

Slide 3

Slide 3 text

Key takeaways
- Beginner: why you’d want to use Docker to isolate your environments
- Intermediate: best practices for using Docker for ML and DS
- Advanced: some techniques and tips to optimise your Docker images and workflows

Slide 4

Slide 4 text

Hi, I am Tania
- I love Open Source and all things data
- I am a Sr. Developer Advocate @Microsoft
- I am obsessed with Outrun and Cyberpunk aesthetics
- I love mechanical keyboards
- You can find me at trallard.dev
My dog loves barking while I am giving talks

Slide 5

Slide 5 text

THESE SLIDES: https://bit.ly/ato-ml-docker

Slide 6

Slide 6 text

01 INTRODUCTION: Understanding the ML ecosystem in 2020

Slide 7

Slide 7 text

Why Python? https://octoverse.github.com/

Slide 8

Slide 8 text

Data science growth https://octoverse.github.com/

Slide 9

Slide 9 text

Within the Python community, data analysis and machine learning are among the top 3 uses for Python.
ixek | https://bit.ly/ato-ml-docker

Slide 10

Slide 10 text

Pythonistas’ pet peeve https://xkcd.com/1987/

Slide 11

Slide 11 text

Installation trends: down the rabbit hole of core Python installation
https://www.jetbrains.com/lp/python-developers-survey-2019/

Slide 12

Slide 12 text

Framework popularity: NumPy is the most used library; it basically underpins the entire scientific Python ecosystem
https://www.jetbrains.com/lp/python-developers-survey-2019/

Slide 13

Slide 13 text

Environment isolation trends: Docker has steadily gained momentum over the years
https://www.jetbrains.com/lp/python-developers-survey-2019/

Slide 14

Slide 14 text

What is Docker? A tool that helps you to create, deploy and run your applications or projects by using containers. This is a container

Slide 15

Slide 15 text

How do containers help me? They provide a solution to the problem of how to get software to run reliably when moved from one computing environment to another: your laptop, test environment, staging environment, production environment.

Slide 16

Slide 16 text

Everything works fine… on my laptop
Your application: how are your users or colleagues meant to know what dependencies they need?
ImportError: no module named x, y, z

Slide 17

Slide 17 text

Everything works fine… on my laptop
Your application: and even with package managers!
ImportError: no module named x, y, z

Slide 18

Slide 18 text

Dev life with containers: your application, libraries, dependencies, runtime environment and configuration files, all packaged together

Slide 19

Slide 19 text

Bliss But… are containers the one-stop solution?

Slide 20

Slide 20 text

02 Docker for ML/DS: Caveats and gotchas

Slide 21

Slide 21 text

The good…
Good: provides app-level environment isolation, so it does not mess with your local environment.
Better: as each image is tagged, you can keep track not only of your app / library versions but also of your dev environment.
Docker image tags: latest, 1.0.2 ($ docker run)

Slide 22

Slide 22 text

The bad and the ugly
Bad: as a beginner, most Docker tutorials out there do not focus on ML.

Slide 23

Slide 23 text

The bad and the ugly
Bad: as a beginner, most tutorials out there do not focus on ML. Plus, Docker can have a steep learning curve.
The ugly*: the scientific Python ecosystem is great… but dealing with dependencies can be a pain. Add GPUs, multiple architectures, multiple OSes, Python versions, notebooks, dashboards and APIs that expose models.

Slide 24

Slide 24 text

Common pain points in DS and ML
- Complex setups / dependencies
- Might need GPUs or libraries like Dask
- Not everything can be exposed as an API
- Reliance on data / databases
- Fast-evolving projects (iterative R&D process)
- Docker is complex and can take a lot of time to learn
- Are containers secure enough for my data / model / algorithm?

Slide 25

Slide 25 text

How is it different from web apps, for example? https://twitter.com/dstufft/status/1095164069802397696

Slide 26

Slide 26 text

How is it different from web apps?
- Not every deliverable is an app
- Not every deliverable is a model either
- Heavily relies on data
- Mixture of wheels and compiled packages
- Security access levels, for data and software
- Mixture of stakeholders: data scientists, software engineers, ML engineers

Slide 27

Slide 27 text

ML goes way beyond a “model”

Slide 28

Slide 28 text

The pillars of reproducibility: environment, code, data. If one changes, everything changes.

Slide 29

Slide 29 text

Building Docker images: Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files.

Slide 30

Slide 30 text

Dissecting Docker images: base image, main instructions, entry command

Slide 31

Slide 31 text

Dissecting Docker images: each instruction creates a layer (like an onion). Layers shown: base image, install pandas, install requests, install flask.
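A minimal sketch of the anatomy described on these slides: a base image, a few main instructions that each add a layer, and an entry command. The image tag, the pinned package versions and the script name `app.py` are assumptions for illustration, not taken from the talk.

```dockerfile
# Base image: an official Python image (the tag is illustrative)
FROM python:3.11-slim-bookworm

# Main instructions: each RUN below adds one layer on top of the base image
RUN pip install --no-cache-dir pandas==2.1.4
RUN pip install --no-cache-dir requests==2.31.0
RUN pip install --no-cache-dir flask==3.0.0

# Entry command: what runs when a container starts (app.py is hypothetical)
CMD ["python", "app.py"]
```

Keeping one install per instruction mirrors the onion picture; in practice the installs are often grouped into a single RUN to produce fewer layers.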

Slide 32

Slide 32 text

Choosing the best base image
- If building from scratch, use the official Python images
https://github.com/docker-library/docs/tree/master/python
https://hub.docker.com/_/python

Slide 33

Slide 33 text

The Jupyter Docker stacks
Need Conda, notebooks and the scientific Python ecosystem? Try the Jupyter Docker stacks: https://jupyter-docker-stacks.readthedocs.io/
Image hierarchy: ubuntu@SHA, base-notebook, minimal-notebook, scipy-notebook, r-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, all-spark-notebook

Slide 34

Slide 34 text

Best practices
- Always know what you are expecting
- Provide context with LABELS
- Split complex RUN statements and sort them
- Prefer COPY to add files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
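The practices above could look roughly like this in a Dockerfile. This is a sketch under stated assumptions: the base image tag, the label values, the apt packages and the presence of a `requirements.txt` next to the Dockerfile are all illustrative.

```dockerfile
# Know what you are expecting: name an explicit tag rather than relying on "latest"
FROM python:3.11-slim-bookworm

# Provide context with LABELs (OCI annotation keys; the values are placeholders)
LABEL org.opencontainers.image.title="data-scratch" \
      org.opencontainers.image.description="Example data science image" \
      org.opencontainers.image.version="1.0"

# Split a complex RUN over several lines, one package per line, sorted alphabetically
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        curl \
        git \
    && rm -rf /var/lib/apt/lists/*

# Prefer COPY over ADD for plain local files
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Sorting the package list makes duplicates easy to spot and keeps diffs small when a dependency is added.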

Slide 35

Slide 35 text

Speed up your build
- Leverage build cache
- Install only necessary packages
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Slide 36

Slide 36 text

Speed up your build and proof
- Leverage build cache
- Install only necessary packages
- Explicitly ignore files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
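One common way to leverage the build cache is to order instructions from least to most frequently changed. The sketch below assumes a project with a `requirements.txt` and source files in the build context; the paths and image tag are illustrative.

```dockerfile
FROM python:3.11-slim-bookworm
WORKDIR /app

# Dependencies change rarely: copy only the requirements file first, so this
# layer and the install below are reused from cache while the code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code changes often: copy it last so edits only invalidate this layer.
# Anything listed in a .dockerignore file (for example data/ or .git/) is
# excluded from the build context and never reaches this COPY
COPY . .
```

With this ordering, a code-only change rebuilds just the final layer instead of reinstalling every package.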

Slide 37

Slide 37 text

Mount volumes to access data
- You can use bind mounts to directories (unless you are using a database)
- Avoid issues by creating a non-root user
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Slide 38

Slide 38 text

Mount volumes to access data
- You can use bind mounts to directories (unless you are using a database)
- Avoid issues by creating a non-root user
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
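A possible way to set up a non-root user whose home directory can receive a bind-mounted data folder. The user name `analyst`, the UID build argument and the directory layout are hypothetical choices for this sketch.

```dockerfile
FROM python:3.11-slim-bookworm

# Create an unprivileged user; the name and default UID are assumptions.
# Matching the UID to your host user helps avoid permission problems on
# files written into a bind-mounted directory
ARG USER_UID=1000
RUN useradd --create-home --uid ${USER_UID} --shell /bin/bash analyst

# Everything from here on, including the running container, uses that user
USER analyst
WORKDIR /home/analyst

# Directory intended as the bind mount target at run time, for example:
#   docker run -v "$(pwd)/data:/home/analyst/data" <image>
RUN mkdir -p /home/analyst/data
```

The UID can be overridden at build time with `--build-arg USER_UID=...` if your host user is not 1000.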

Slide 39

Slide 39 text

Use multi-stage builds
- Fetch and manage secrets in an intermediate layer
- Not all your dependencies will have been packaged as wheels, so you might need a compiler: build a compile image and a runtime image
- Smaller images overall
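A minimal sketch of a two-stage build along these lines: one stage with a compiler that installs everything into a virtual environment, and a runtime stage that only copies that environment. The stage names, image tag, apt packages and `requirements.txt` are assumptions for illustration.

```dockerfile
# Stage 1: compile image, with a toolchain for dependencies that ship no wheels
FROM python:3.11-slim-bookworm AS compile-image
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install into a virtual environment so it can be copied across as one unit
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image, same base but without the compiler toolchain
FROM python:3.11-slim-bookworm AS runtime-image
COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python"]
```

Both stages use the same base image so the copied virtual environment points at a matching Python interpreter; only the last stage ends up in the final tagged image.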

Slide 40

Slide 40 text

Use multi-stage builds
Compile-image (Docker image), copy virtual environment, runtime-image (Docker image)
$ docker build --pull --rm -f "Dockerfile" -t trallard:data-scratch-1.0 "."

Slide 41

Slide 41 text

Use multi-stage builds
Runtime-image (Docker image): FINAL IMAGE trallard:data-scratch-1.0

Slide 42

Slide 42 text

Do not reinvent the wheel
Leverage existing tools like repo2docker, which is already configured and optimised for data science / scientific computing.
https://repo2docker.readthedocs.io/en/latest
$ conda install jupyter-repo2docker
$ jupyter-repo2docker "."

Slide 43

Slide 43 text

Delegate to your continuous integration tool
Set up continuous integration (Travis, GitHub Actions, whatever you prefer) and delegate your build to it. Also, build often.

Slide 44

Slide 44 text

This workflow
- Code in version control
- Trigger on tag / also scheduled trigger
- Build image
- Push image

Slide 45

Slide 45 text

04 Top tips: My top recommendations

Slide 46

Slide 46 text

Top tips
1. Rebuild your images frequently: get security updates for system packages
2. Never work as root / minimise the privileges
3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
5. Leverage build cache

Slide 47

Slide 47 text

Top tips
6. Use one Dockerfile per project
7. Use multi-stage builds: need to compile code? Need to reduce your image size?
8. Make your images identifiable (test, production, R&D); also be careful when accessing databases and using ENV variables / build variables
9. Do not reinvent the wheel! Use repo2docker
10. Automate: no need to build and push manually
11. Use a linter

Slide 48

Slide 48 text

THANKS Get in touch: trallard.dev