Docker and Python: making them play nicely and securely for ML and DS

Python and Docker: making them play nicely and securely for data science and machine learning
Tania Allard, PhD
@ixek

TABLE OF CONTENTS
01 Introduction: Python and the ML scene in 2020
02 Docker for ML: Docker for data science: caveats and gotchas
03 Best practices: Making the most of Docker
04 Top tips: Summary

Key takeaways
Beginner: Why you’d want to use Docker to isolate your environments
Intermediate: Best practices for using Docker for ML and DS
Advanced: Some techniques and tips to optimise your Docker images and workflows

Hi, I am Tania
- I love Open Source and all things data
- I am a Sr. Developer Advocate @Microsoft
- I am obsessed with Outrun and Cyberpunk aesthetics
- I love mechanical keyboards
- You can find me at trallard.dev
My dog loves barking while I am giving talks

THESE SLIDES: https://bit.ly/ato-ml-docker

01 INTRODUCTION: Understanding the ML ecosystem in 2020

Why Python?
https://octoverse.github.com/

Data science growth
https://octoverse.github.com/

Within the Python community
Data analysis and machine learning are within the top 3 uses for Python
ixek | https://bit.ly/ato-ml-docker

Pythonistas’ pet peeve
https://xkcd.com/1987/

Installation trends
Down the rabbit hole: core Python installation
https://www.jetbrains.com/lp/python-developers-survey-2019/

Frameworks popularity
NumPy is the most used library; it basically underpins the entire scientific Python ecosystem
https://www.jetbrains.com/lp/python-developers-survey-2019/

Env isolation trends
Docker has steadily gained momentum over the years
https://www.jetbrains.com/lp/python-developers-survey-2019/

What is Docker?
A tool that helps you to create, deploy and run your applications or projects by using containers.
This is a container

How do containers help me?
They provide a solution to the problem of how to get software to run reliably when moved from one computing environment to another:
Your laptop / Test environment / Staging environment / Production environment

Everything works fine… on my laptop
Your application
How are your users or colleagues meant to know what dependencies they need?
ImportError: no module named x, y, z

Everything works fine… on my laptop
Your application
And even with package managers!
ImportError: no module named x, y, z
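The missing-dependency error above can be caught before the application starts. This is a minimal sketch, not something from the talk: the helper name find_missing_modules and the placeholder package name are made up for illustration, and the check only handles top-level module names.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "json" ships with Python; the second name is a made-up placeholder.
    print(find_missing_modules(["json", "hypothetical_missing_pkg_xyz"]))
```

A check like this only reports the problem. Shipping the dependencies inside the image is what removes it.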

Dev life with containers
Your application
Libraries, dependencies, runtime environment, configuration files

Bliss
But… are containers the one-stop solution?

02 Docker for ML/DS: Caveats and gotchas

The good…
Good: Provides app-level environment isolation, so it does not mess with your local environment.
Better: As each image is tagged, you can keep track not only of your app/library versions but also of your dev environment.
Docker image tags: latest, 1.0.2
$ docker run

The bad and the ugly
Bad: For beginners, most Docker tutorials out there do not focus on ML. Plus, Docker can have a steep learning curve.
The ugly*: The scientific Python ecosystem is great… but dealing with dependencies can be a pain. Add GPUs, multiple architectures, multiple OSs, Python versions, notebooks, dashboards and APIs that expose models.

Common pain points in DS and ML
- Complex setups / dependencies
- Might need GPUs or libraries like Dask
- Not everything can be exposed as an API
- Reliance on data / databases
- Fast evolving projects (iterative R&D process)
- Docker is complex and can take a lot of time to upskill
- Are containers secure enough for my data / model / algorithm?

How is it different from web apps, for example?
https://twitter.com/dstufft/status/1095164069802397696

How is it different from web apps?
- Not every deliverable is an app
- Not every deliverable is a model either
- Heavily relies on data
- Mixture of wheels and compiled packages
- Security access levels for data and software
- Mixture of stakeholders: data scientists, software engineers, ML engineers

ML goes way beyond a “model”

The pillars of reproducibility
Environment / Code / Data
If one changes, everything changes
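One way to picture "if one changes, everything changes" is a single fingerprint computed over all three pillars. The sketch below is an illustration under assumptions, not part of the talk: the function name fingerprint, the way each pillar is represented (a string of pinned packages, source text, raw data bytes) and the sample values are all hypothetical.

```python
import hashlib


def fingerprint(environment: str, code: str, data: bytes) -> str:
    """Combine the three pillars into one short SHA-256 fingerprint."""
    digest = hashlib.sha256()
    for part in (environment.encode("utf-8"), code.encode("utf-8"), data):
        # Hash each part separately first so boundaries between parts are unambiguous.
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    # Illustrative values only.
    baseline = fingerprint("numpy==1.19.2", "print('train')", b"1,2,3\n")
    new_env = fingerprint("numpy==1.19.4", "print('train')", b"1,2,3\n")
    print(baseline != new_env)  # a changed environment gives a different fingerprint
```

Recording such a fingerprint next to a result makes it possible to tell later whether the environment, the code or the data has drifted.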

Building Docker images
Dockerfiles are used to create Docker images by providing a set of instructions to install software, configure your image or copy files

Dissecting Docker images
- Base image
- Main instructions
- Entry command

Dissecting Docker images
- Install pandas
- Install requests
- Install flask
- Base image
Each instruction creates a layer (like an onion)

Choosing the best base image
• If building from scratch, use the official Python images: https://hub.docker.com/_/python
https://github.com/docker-library/docs/tree/master/python

The Jupyter Docker stacks
Need Conda, notebooks and the scientific Python ecosystem? Try the Jupyter Docker stacks: https://jupyter-docker-stacks.readthedocs.io/
Images: ubuntu@SHA, base-notebook, minimal-notebook, scipy-notebook, r-notebook, tensorflow-notebook, datascience-notebook, pyspark-notebook, all-spark-notebook

Best practices
- Always know what you are expecting
- Provide context with LABELS
- Split complex RUN statements and sort them
- Prefer COPY over ADD to add files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Speed up your build and proof
- Leverage build cache
- Install only necessary packages
- Explicitly ignore files
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
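To see why instruction order matters for the build cache, here is a deliberately simplified model. It is a sketch under assumptions, not Docker's actual algorithm: it compares instruction text only (the real builder also checksums files for COPY and ADD), and the names layer_keys and layers_to_rebuild, as well as the sample instructions, are hypothetical. What it does capture is that once one layer changes, every layer after it is rebuilt.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per instruction, each chained to the key before it."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + instruction).encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def layers_to_rebuild(old_instructions, new_instructions):
    """List the instructions in the new build that cannot reuse a cached layer."""
    old_keys = layer_keys(old_instructions)
    new_keys = layer_keys(new_instructions)
    rebuilt = []
    for index, instruction in enumerate(new_instructions):
        cached = index < len(old_keys) and old_keys[index] == new_keys[index]
        if not cached:
            rebuilt.append(instruction)
    return rebuilt


if __name__ == "__main__":
    old = ["FROM python:3.8-slim", "RUN pip install pandas", "RUN pip install flask"]
    new = ["FROM python:3.8-slim", "RUN pip install pandas==1.1.3", "RUN pip install flask"]
    # The flask layer is rebuilt too, even though its instruction did not change.
    print(layers_to_rebuild(old, new))
```

The practical consequence: put instructions that rarely change (system packages, pinned dependencies) early, and the ones that change often (copying your own code) as late as possible.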

Mount volumes to access data
- You can use bind mounts to directories (unless you are using a database)
- Avoid issues by creating a non-root user
https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
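On Linux, files that a container writes to a bind mount as root typically end up owned by root on the host, which is one of the issues a non-root user avoids. As a complement to creating that user in the image, a script can refuse to start as root. This is a minimal sketch, assuming a Unix-like container; the helper names is_root and refuse_root are hypothetical.

```python
import os


def is_root(euid=None):
    """Return True when the effective user id is 0 (root)."""
    if euid is None:
        # os.geteuid is available on Unix-like systems, which includes Linux containers.
        euid = os.geteuid()
    return euid == 0


def refuse_root(euid=None):
    """Raise an error instead of silently running as root."""
    if is_root(euid):
        raise PermissionError(
            "Refusing to run as root; start the container as a non-root user."
        )
```

Calling refuse_root() at the top of an entry script turns a silent permissions problem into an immediate, readable failure.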

Use multi-stage builds
- Fetch and manage secrets in an intermediate layer
- Not all your dependencies will have been packed as wheels, so you might need a compiler: build a compile image and a runtime image
- Smaller images overall

Use multi-stage builds
Compile-image (Docker image) -> copy virtual environment -> Runtime-image (Docker image)
$ docker build --pull --rm -f "Dockerfile" \
    -t trallard:data-scratch-1.0 "."

Use multi-stage builds
Runtime-image (Docker image)
FINAL IMAGE: trallard:data-scratch-1.0

Do not reinvent the wheel
Leverage existing tools like repo2docker, which is already configured and optimised for data science / scientific computing.
https://repo2docker.readthedocs.io/en/latest
$ conda install jupyter repo2docker
$ jupyter-repo2docker "."

Delegate to your continuous integration tool
Set up continuous integration (Travis, GitHub Actions, whatever you prefer) and delegate your build to it. Also build often.

This workflow
- Code in version control
- Trigger on tag / also scheduled trigger
- Build image
- Push image
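The build and push steps of the workflow above can be scripted so that a CI job never relies on someone typing them by hand. The sketch below only assembles the command lines and runs nothing. The function name build_and_push_commands is hypothetical, the build flags mirror the docker build command shown earlier in the deck, and in practice the tag would usually include your registry or user namespace.

```python
import shlex


def build_and_push_commands(image_tag, dockerfile="Dockerfile", context="."):
    """Return the docker CLI argument lists for one build-and-push cycle."""
    build = ["docker", "build", "--pull", "--rm", "-f", dockerfile, "-t", image_tag, context]
    push = ["docker", "push", image_tag]
    return [build, push]


if __name__ == "__main__":
    # Print the commands; a CI job could pass each list to subprocess.run(cmd, check=True).
    for command in build_and_push_commands("trallard:data-scratch-1.0"):
        print(shlex.join(command))
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems when tags or paths contain unusual characters.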

04 Top tips: My top recommendations

Top tips
1. Rebuild your images frequently: get security updates for system packages
2. Never work as root / minimise the privileges
3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
5. Leverage build cache

Top tips
6. Use one Dockerfile per project
7. Use multi-stage builds: need to compile code? Need to reduce your image size?
8. Make your images identifiable (test, production, R&D). Also be careful when accessing databases and using ENV variables / build variables
9. Do not reinvent the wheel! Use repo2docker
10. Automate: no need to build and push manually
11. Use a linter
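Tip 4 (pin / version EVERYTHING) is easy to check automatically. The sketch below is a simplified illustration, not a replacement for pip-tools, conda, poetry or pipenv: it treats any requirement line without "==" as unpinned, ignores pip options such as -r, and does not handle URLs or wildcard versions. The function name unpinned_requirements and the sample lines are hypothetical.

```python
def unpinned_requirements(lines):
    """Return requirement lines that are not pinned with an exact '==' version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r or -e
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = ["numpy==1.19.2", "pandas>=1.0", "", "# plotting", "flask"]
    print(unpinned_requirements(sample))
```

Run against a requirements file in CI, a non-empty result is a reasonable signal to fail the build before an image is produced.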

THANKS Get in touch: trallard.dev
