Slide 1

Slide 1 text

DOCKER AND PYTHON
Making them play nicely and securely for Data Science and Machine Learning
TANIA ALLARD, PHD
Sr. Developer Advocate @Microsoft
ixek | https://bit.ly/pycon2020-ml-docker

Slide 2

Slide 2 text

@ixek @trallard trallard.dev

Slide 3

Slide 3 text

THESE SLIDES
https://bit.ly/pycon2020-ml-docker

Slide 4

Slide 4 text

WHAT YOU’LL LEARN TODAY
- Why use Docker?
- Docker for Data Science and Machine Learning
- Security and performance
- Do not reinvent the wheel, automate
- Tips and tricks for using Docker

Slide 5

Slide 5 text

WHY DOCKER?

Slide 6

Slide 6 text

DEV LIFE WITHOUT DOCKER OR CONTAINERS
Your application
How are your users or colleagues meant to know what dependencies they need?
ImportError: No module named x, y, z
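The import error above can be caught before the application starts. The following is a minimal sketch, not something shown in the talk: it asks Python which top-level modules are importable in the current environment. The dependency list is a hypothetical example.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical dependency list for an analysis project.
    required = ["json", "numpy", "pandas"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

A check like this only reports the problem; packaging the dependencies with the application, as the following slides describe, is what removes it.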

Slide 7

Slide 7 text

WHAT IS DOCKER?
A tool that helps you to create, deploy and run your applications or projects by using containers.
(Diagram label: This is a container)

Slide 8

Slide 8 text

HOW DO CONTAINERS HELP ME?
They provide a solution to the problem of how to get software to run reliably when moved from one computing environment to another.
(Diagram: Your laptop, Test environment, Staging environment, Production environment)

Slide 9

Slide 9 text

DEV LIFE WITH CONTAINERS
(Diagram: a container holding Your application together with Libraries, dependencies, runtime environment, configuration files)

Slide 10

Slide 10 text

THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
Each app is containerised
At the app level: Each runs as an isolated process
(Diagram, bottom to top: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APPs)

Slide 11

Slide 11 text

THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
CONTAINERS (diagram, bottom to top: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APPs)
VIRTUAL MACHINES (diagram, bottom to top: INFRASTRUCTURE, HYPERVISOR, then one GUEST OS and APP per VIRTUAL MACHINE)
At the hardware level: Full OS + app + binaries + libraries

Slide 12

Slide 12 text

IMAGE VS CONTAINER
- Image: archive with all the data needed to run the app
- When you run an image it creates a container
(Diagram: a Docker image with tags Latest and 1.0.2, started with $ docker run)

Slide 13

Slide 13 text

COMMON PAIN POINTS IN DS AND ML
- Complex setups / dependencies
- Reliance on data / databases
- Fast-evolving projects (iterative R&D process)
- Docker is complex and can take a lot of time to learn
- Are containers secure enough for my data / model / algorithm?

Slide 14

Slide 14 text

DOCKER FOR DATA SCIENCE AND MACHINE LEARNING

Slide 15

Slide 15 text

HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
https://twitter.com/dstufft/status/1095164069802397696

Slide 16

Slide 16 text

HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
- Not every deliverable is an app
- Not every deliverable is a model either
- Heavily relies on data
- Mixture of wheels and compiled packages
- Security access levels - for data and software
- Mixture of stakeholders: data scientists, software engineers, ML engineers

Slide 17

Slide 17 text

BUILDING DOCKER IMAGES
Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files.

Slide 18

Slide 18 text

DISSECTING DOCKER IMAGES
(Annotated Dockerfile: Base image, Main instructions, Entry command)

Slide 19

Slide 19 text

DISSECTING DOCKER IMAGES
Each instruction creates a layer (like an onion)
(Diagram, top to bottom: INSTALL PANDAS, INSTALL REQUESTS, INSTALL FLASK, BASE IMAGE)
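The three parts named on the previous slide (base image, main instructions, entry command) can be illustrated with a small Dockerfile. This is a sketch under stated assumptions: the Dockerfile is not the one from the talk, the `python:3.8-slim-buster` tag and `app.py` are illustrative, and the text is held in a Python string so a helper can split it into its parts.

```python
# Hypothetical Dockerfile mirroring the diagram: a base image plus three installs.
DOCKERFILE = """\
FROM python:3.8-slim-buster
RUN pip install flask
RUN pip install requests
RUN pip install pandas
CMD ["python", "app.py"]
"""


def dissect(dockerfile):
    """Split a simple Dockerfile into base image, main instructions and entry command.

    Assumes one instruction per line, a single FROM and no line continuations.
    """
    lines = [ln.strip() for ln in dockerfile.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    base = [ln for ln in lines if ln.upper().startswith("FROM ")]
    entry = [ln for ln in lines if ln.upper().startswith(("CMD ", "ENTRYPOINT "))]
    main = [ln for ln in lines if ln not in base and ln not in entry]
    return {"base_image": base, "main_instructions": main, "entry_command": entry}


if __name__ == "__main__":
    for part, instructions in dissect(DOCKERFILE).items():
        print(part, "->", instructions)
```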

Slide 20

Slide 20 text

CHOOSING THE BEST BASE IMAGE
If building from scratch use the official Python images
https://hub.docker.com/_/python
https://github.com/docker-library/docs/tree/master/python

Slide 21

Slide 21 text

THE JUPYTER DOCKER STACK
Need Conda, notebooks and the scientific Python ecosystem? Try the Jupyter Docker stacks
https://jupyter-docker-stacks.readthedocs.io/
(Diagram, image hierarchy: ubuntu@SHA, base-notebook, minimal-notebook, scipy-notebook, r-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, all-spark-notebook)

Slide 22

Slide 22 text

BEST PRACTICES
- Always know what you are expecting
- Provide context with LABELS
- Split complex RUN statements and sort them
- Prefer COPY to add files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
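Two of these practices, sorted multi-line RUN statements and LABEL metadata, can be sketched with a small generator. This is an illustration rather than the talk's own tooling: the helper names, package versions and label values are hypothetical.

```python
def render_run_install(pinned):
    """Render one multi-line RUN statement with sorted, pinned packages."""
    specs = [f"{name}=={version}" for name, version in sorted(pinned.items())]
    lines = ["RUN pip install --no-cache-dir"] + [f"    {spec}" for spec in specs]
    return " \\\n".join(lines)


def render_labels(labels):
    """Render one LABEL instruction per key, sorted by key."""
    return "\n".join(f'LABEL {key}="{value}"' for key, value in sorted(labels.items()))


if __name__ == "__main__":
    # Hypothetical versions and metadata, for illustration only.
    print(render_labels({"maintainer": "someone@example.com", "version": "1.0"}))
    print(render_run_install({"pandas": "1.0.3", "flask": "1.1.2"}))
```

Keeping one package per line in sorted order makes duplicates easy to spot and keeps diffs small when a version changes.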

Slide 23

Slide 23 text

SPEED UP YOUR BUILD
- Leverage build cache
- Install only necessary packages
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Slide 24

Slide 24 text

SPEED UP YOUR BUILD AND PROOF
- Leverage build cache
- Install only necessary packages
- Explicitly ignore files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
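Files are ignored explicitly through a `.dockerignore` file next to the Dockerfile. The sketch below writes one from Python. The pattern list is an assumption about a typical data science project layout, not a list taken from the talk.

```python
import tempfile
from pathlib import Path

# Illustrative patterns; adjust to the project layout.
IGNORE_PATTERNS = [
    ".git",
    ".env",
    "data",
    "**/__pycache__",
    "**/*.pyc",
    "**/.ipynb_checkpoints",
]


def write_dockerignore(project_dir, patterns=IGNORE_PATTERNS):
    """Write a .dockerignore file so the listed paths stay out of the build context."""
    target = Path(project_dir) / ".dockerignore"
    target.write_text("\n".join(patterns) + "\n")
    return target


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than a real project.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_dockerignore(tmp).read_text())
```

Keeping large data folders and secrets such as `.env` out of the build context both speeds up the build and avoids copying them into an image by accident.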

Slide 25

Slide 25 text

MOUNT VOLUMES TO ACCESS DATA
- You can use bind mounts to directories (unless you are using a database)
- Avoid issues by creating a non-root user
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
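A bind mount combined with a non-root user might look like the following sketch, which only assembles the `docker run` command line. The function name, the container path `/home/app/data` and the UID/GID of 1000 are assumptions; the image name is the one used later in these slides.

```python
from pathlib import Path


def docker_run_command(image, host_dir, container_dir, uid=1000, gid=1000):
    """Build a `docker run` command that bind-mounts host_dir and runs as uid:gid."""
    source = Path(host_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",
        "--mount", f"type=bind,source={source},target={container_dir}",
        image,
    ]


if __name__ == "__main__":
    command = docker_run_command("trallard:data-scratch-1.0", "data", "/home/app/data")
    print(" ".join(command))
```

Running with a UID that matches the host user is one way to avoid root-owned files appearing in the mounted directory.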

Slide 26

Slide 26 text

SECURITY AND PERFORMANCE

Slide 27

Slide 27 text

MINIMISE PRIVILEGE - FAVOUR LESS PRIVILEGED USER
Lock down your container:
- Run as a non-root user (Docker containers run as root by default)
- Minimise capabilities
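One way to enforce the non-root rule from inside the application is a guard that runs at start-up. This is a minimal sketch with hypothetical function names; it assumes a Unix-like container, where `os.geteuid` is available.

```python
import os


def running_as_root(euid=None):
    """Return True when the effective user ID is 0 (root)."""
    if euid is None:
        euid = os.geteuid()  # available on Unix-like systems, including Linux containers
    return euid == 0


def ensure_not_root(euid=None):
    """Raise PermissionError if the process is running as root."""
    if running_as_root(euid):
        raise PermissionError("Refusing to run as root: start the container as a non-root user.")


# Typical use: call ensure_not_root() at the top of the application's entry point.
```

The guard complements, rather than replaces, creating a dedicated user in the image and selecting it with the USER instruction.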

Slide 28

Slide 28 text

DON’T LEAK SENSITIVE INFORMATION
Remember: Docker images are like onions. If you copy keys into an intermediate layer, they are cached in that layer. Keep them out of your Dockerfile.
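Keeping keys out of the Dockerfile usually means supplying them when the container runs. The sketch below reads a secret from a mounted file and falls back to an environment variable. The function name and secret names are hypothetical, and `/run/secrets` is assumed as the mount location.

```python
import os
from pathlib import Path


def read_secret(name, secrets_dir="/run/secrets"):
    """Return a secret from a mounted file, falling back to an environment variable.

    Returns None when the secret is not available.
    """
    secret_file = Path(secrets_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    return os.environ.get(name.upper())


if __name__ == "__main__":
    # "api_token" is a hypothetical secret name.
    token = read_secret("api_token")
    print("Secret found." if token else "Secret not provided.")
```

Because the value is read at runtime, it never becomes part of an image layer.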

Slide 29

Slide 29 text

USE MULTI STAGE BUILDS
- Fetch and manage secrets in an intermediate build stage
- Not all your dependencies will have been packaged as wheels, so you might need a compiler - build a compile image and a runtime image
- Smaller images overall

Slide 30

Slide 30 text

USE MULTI STAGE BUILDS
(Diagram: Compile-image -> copy virtual environment -> Runtime-image)
$ docker build --pull --rm -f "Dockerfile" \
    -t trallard:data-scratch-1.0 "."

Slide 31

Slide 31 text

USE MULTI STAGE BUILDS
(Diagram: only the Runtime-image ends up in the final Docker image)
FINAL IMAGE: trallard:data-scratch-1.0
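A multi-stage Dockerfile along the lines of the diagrams could look like the sketch below. It is not the Dockerfile from the talk: the base image tag, the `/opt/venv` path, `requirements.txt` and the `my_project` module are assumptions. The text is held in a Python string so a small helper can list its stages.

```python
# Hypothetical two-stage Dockerfile: compile in one stage, run in another.
MULTI_STAGE_DOCKERFILE = """\
FROM python:3.8-slim-buster AS compile-image
RUN apt-get update && apt-get install -y --no-install-recommends build-essential gcc
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.8-slim-buster AS runtime-image
COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python", "-m", "my_project"]
"""


def stage_names(dockerfile):
    """Return the names given to build stages via `FROM <image> AS <name>`."""
    names = []
    for line in dockerfile.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0].upper() == "FROM" and parts[2].upper() == "AS":
            names.append(parts[3])
    return names


if __name__ == "__main__":
    print(stage_names(MULTI_STAGE_DOCKERFILE))
```

Only the last stage is kept in the final image, so the compiler toolchain installed in the first stage does not add to its size.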

Slide 32

Slide 32 text

AUTOMATE

Slide 33

Slide 33 text

PROJECT TEMPLATES
Need a standard project template?
Use Cookiecutter Data Science: https://drivendata.github.io/cookiecutter-data-science/
Or Cookiecutter Docker Science: https://github.com/docker-science/cookiecutter-docker-science

Slide 34

Slide 34 text

DO NOT REINVENT THE WHEEL
Use existing tools like repo2docker. They are already configured and optimised for Data Science / Scientific computing.
https://repo2docker.readthedocs.io/en/latest
$ conda install jupyter repo2docker
$ jupyter-repo2docker "."

Slide 35

Slide 35 text

DO NOT REINVENT THE WHEEL
Use existing tools like repo2docker. They are already configured and optimised for Data Science / Scientific computing.
https://repo2docker.readthedocs.io/en/latest

Slide 36

Slide 36 text

DELEGATE TO YOUR CONTINUOUS INTEGRATION TOOL
Set up continuous integration (Travis, GitHub Actions, whatever you prefer) and delegate your build to it. Also build often.
https://repo2docker.readthedocs.io/en/latest

Slide 37

Slide 37 text

THIS WORKFLOW
- Code in version control
- Trigger on tag / Also scheduled trigger
- Build image
- Push image
(Diagram: Docker image)
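The build and push steps of this workflow can be sketched as a script that a CI job runs once it has been triggered. The function names are hypothetical, and the `docker build` flags are the ones shown on the multi-stage build slide.

```python
import subprocess


def build_and_push_commands(repository, tag, dockerfile="Dockerfile", context="."):
    """Return the docker CLI calls for building and pushing repository:tag."""
    image = f"{repository}:{tag}"
    build = ["docker", "build", "--pull", "--rm", "-f", dockerfile, "-t", image, context]
    push = ["docker", "push", image]
    return [build, push]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# In a CI job, the tag would typically come from the Git tag that triggered the run:
# run_all(build_and_push_commands("trallard", "data-scratch-1.0"))
```

Separating command construction from execution keeps the script easy to check without a Docker daemon.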

Slide 38

Slide 38 text

TOP TIPS

Slide 39

Slide 39 text

TOP TIPS
1. Rebuild your images frequently - get security updates for system packages
2. Never work as root / minimise the privileges
3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
5. Leverage build cache
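Tip 4 can be checked automatically. The sketch below flags requirement lines that are not pinned to an exact version. It is a simplified illustration that assumes plain `name==version` entries, not URLs or editable installs, and it does not replace the tools named in the tip.

```python
def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Hypothetical requirements, for illustration only.
    example = ["pandas==1.0.3", "flask>=1.1", "requests", "# a comment"]
    print(unpinned_requirements(example))
```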

Slide 40

Slide 40 text

TOP TIPS
6. Use one Dockerfile per project
7. Use multi-stage builds - need to compile code? Need to reduce your image size?
8. Make your images identifiable (test, production, R&D) - also be careful when accessing databases and using ENV variables / build variables
9. Do not reinvent the wheel! Use repo2docker
10. Automate - no need to build and push manually
11. Use a linter

Slide 41

Slide 41 text

THANK YOU @ixek @trallard trallard.dev