
Data Science in the Cloud Using Python

Graham Dumpleton

July 29, 2018
Transcript

  1. Who am I? Author of mod_wsgi, the Apache module for hosting Python web applications using the WSGI interface. Have a keen interest in Docker and Platform as a Service (PaaS) technologies. Currently a developer advocate for OpenShift at Red Hat.
  2. Topics for today • What is Jupyter and who is using it? • How to provide Jupyter notebooks to many users. • Deploying a single Jupyter notebook to OpenShift. • Deploying JupyterHub to OpenShift. • Deploying a data processing cluster using Dask.
  3. Who is using Jupyter? Individuals, collaborators, teachers. Data-driven organisations use Jupyter to analyse their data, share their insights, and create dynamic, reproducible data science.
  4. Local installation (Pros) • Save notebooks/data locally. • Python virtual environments. • Select Python version. • Select package versions.
  5. Local installation (Cons) • Operating system differences. • Python distribution differences. • Python version differences. • Package index differences: PyPI (pip) vs Anaconda (conda). • Effort to set up and maintain.
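The local-installation workflow above boils down to a few commands; a minimal sketch (the environment name is illustrative):

```shell
# Create an isolated virtual environment using the Python version on PATH.
python3 -m venv notebook-env

# Activate it, then install Jupyter and any analysis packages into it,
# pinning versions as needed.
. notebook-env/bin/activate
pip install notebook
```

The trade-off the cons slide describes follows directly: the environment is tied to the local operating system and Python distribution, so reproducing it on another machine takes extra effort.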
  6. Running in a container:

     docker run -it --rm -p 8888:8888 jupyter/minimal-notebook

     https://github.com/jupyter/docker-stacks
  7. Container images (Pros) • Pre-created images. • Bundled operating system packages. • Known Python distribution/vendor. • Bundled Python packages. • Docker images are read only. • Run in an isolated environment. • Don’t need to maintain the image.
  8. Container images (Cons #1) • More effort to customise the experience: build a custom Docker image to extend it, or install extra packages each time you run it. • Images can be very large: multiple Python versions, packages that you do not need.
  9. Container images (Cons #2) • Access to and saving of your notebooks/data: need to mount persistent storage volumes, or configure a contents manager for external storage. • Ensuring access is done securely.
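Mounting a persistent storage volume for notebooks is a small extension of the earlier docker run command; a sketch along the lines of the jupyter/docker-stacks documentation, with the host directory name being illustrative:

```shell
# Persist notebooks across container runs by mounting a host directory
# over the notebook user's work directory. The jupyter/* images run as
# user "jovyan", whose work directory is /home/jovyan/work.
docker run -it --rm -p 8888:8888 \
    -v "$PWD/notebooks:/home/jovyan/work" \
    jupyter/minimal-notebook
```

Anything saved outside the mounted path is still lost when the container exits, which is why the slide also mentions contents managers for external storage.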
  10. JupyterHub images [diagram] An S2I build chain: the openshift/python image (OpenShift image registry) plus the jupyter-on-openshift/jupyterhub-quickstart GitHub repository build the jupyterhub-s2i base JupyterHub image; that image plus the jupyterhub directory of the jupyter-on-openshift/poc-hub-dask-cluster GitHub repository build the final jupyterhub-img JupyterHub image, pushed to the OpenShift image registry.
  11. Notebook images [diagram] The same S2I pattern: the openshift/python image plus the minimal-notebook directory of the jupyter-on-openshift/jupyter-notebooks GitHub repository build the jupyterhub-notebook-s2i base notebook image; that image plus the notebook directory of the jupyter-on-openshift/poc-hub-dask-cluster GitHub repository build the jupyterhub-notebook-img notebook image, which also serves as the Dask scheduler/worker image.
  12. Deployments [diagram] Common components: JupyterHub (with its route and service), a PostgreSQL database for JupyterHub, and Keycloak (with its service) backed by its own PostgreSQL database. Per-user components: a notebook instance (with its route and service) and a Dask cluster of one scheduler and multiple workers, each exposed via a service.
  13. Jupyter on OpenShift • Provides simplified deployment using a template. • JupyterHub customisable through use of Source-to-Image. • Can pre-build images with notebooks and required Python packages. • Can attach storage to notebooks for persistence. • Integrates with Keycloak for user authentication. • Connects to a backend cluster for distributed data analytics.
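Template-based deployment reduces to a couple of oc commands; the template file, template name, and parameter below are illustrative, so check the jupyter-on-openshift repositories for the actual ones:

```shell
# Load the JupyterHub template into the current OpenShift project.
oc create -f jupyterhub-deployer.json

# Instantiate the template; the template and parameter names here are
# hypothetical placeholders.
oc new-app --template jupyterhub-deployer \
    --param APPLICATION_NAME=jupyterhub
```

OpenShift then creates the deployments, services, and routes described on the previous slide from the single template.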
  14. My other side projects • Using JupyterHub for teaching at university. • Using a remote kernel gateway with Jupyter notebooks. • Working with Spark as the backend data analytics cluster. • Utilising GPUs for high-performance AI/ML tasks.