
Data Science in the Cloud Using Python

Graham Dumpleton

July 29, 2018
Transcript

  1. Who am I? Author of mod_wsgi, the Apache module for hosting Python web applications using the WSGI interface. Have a keen interest in Docker and Platform as a Service (PaaS) technologies. Currently a developer advocate for OpenShift at Red Hat.
  2. Topics for today • What is Jupyter and who is using it? • How to provide Jupyter notebooks to many users. • Deploying a single Jupyter notebook to OpenShift. • Deploying JupyterHub to OpenShift. • Deploying a data processing cluster using Dask.
  3. Who is using Jupyter? Individuals, collaborators, teachers. Data-driven organisations use Jupyter to analyse their data, share their insights, and create dynamic, reproducible data science.
  4. Local installation (Pros) • Save notebooks/data locally. • Python virtual environments. • Select Python version. • Select package versions.
  5. Local installation (Cons) • Operating system differences. • Python distribution differences. • Python version differences. • Package index differences: PyPI (pip) vs Anaconda (conda). • Effort to set up and maintain.
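The local-installation workflow above boils down to a few commands; a minimal sketch (the environment name is illustrative):

```shell
# Create an isolated virtual environment using the Python version on PATH.
python3 -m venv notebook-env

# Activate it, then install Jupyter and any analysis packages into it,
# pinning versions as needed.
. notebook-env/bin/activate
pip install notebook
```

The trade-off the cons slide describes follows directly: the environment is tied to the local operating system and Python distribution, so reproducing it on another machine takes extra effort.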
  6. Running in a container:

     docker run -it --rm -p 8888:8888 jupyter/minimal-notebook

     https://github.com/jupyter/docker-stacks
  7. Container images (Pros) • Pre-created images. • Bundled operating system packages. • Known Python distribution/vendor. • Bundled Python packages. • Docker images are read only. • Run in an isolated environment. • Don’t need to maintain the image.
  8. Container images (Cons #1) • More effort to customise the experience: build a custom Docker image to extend it, or install extra packages each time you run it. • Images can be very large: multiple Python versions, packages that you do not need.
  9. Container images (Cons #2) • Access to and saving of your notebooks/data: need to mount persistent storage volumes, or configure a contents manager for external storage. • Ensuring access is done securely.
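Mounting a persistent storage volume for notebooks is a small extension of the earlier docker run command; a sketch along the lines of the jupyter/docker-stacks documentation, with the host directory name being illustrative:

```shell
# Persist notebooks across container runs by mounting a host directory
# over the notebook user's work directory. The jupyter/* images run as
# user "jovyan", whose work directory is /home/jovyan/work.
docker run -it --rm -p 8888:8888 \
    -v "$PWD/notebooks:/home/jovyan/work" \
    jupyter/minimal-notebook
```

Anything saved outside the mounted path is still lost when the container exits, which is why the slide also mentions contents managers for external storage.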
  10. JupyterHub images [diagram] An S2I build chain: the openshift/python image (OpenShift image registry) plus the jupyter-on-openshift/jupyterhub-quickstart GitHub repository build the jupyterhub-s2i base JupyterHub image; that image plus the jupyterhub directory of the jupyter-on-openshift/poc-hub-dask-cluster GitHub repository build the final jupyterhub-img JupyterHub image, pushed to the OpenShift image registry.
  11. Notebook images [diagram] The same S2I pattern: the openshift/python image plus the minimal-notebook directory of the jupyter-on-openshift/jupyter-notebooks GitHub repository build the jupyterhub-notebook-s2i base notebook image; that image plus the notebook directory of the jupyter-on-openshift/poc-hub-dask-cluster GitHub repository build the jupyterhub-notebook-img notebook image, which also serves as the Dask scheduler/worker image.
  12. Deployments [diagram] Common components: JupyterHub (with its route and service), a PostgreSQL database for JupyterHub, and Keycloak (with its service) backed by its own PostgreSQL database. Per-user components: a notebook instance (with its route and service) and a Dask cluster of one scheduler and multiple workers, each exposed via a service.
  13. Jupyter on OpenShift • Provides simplified deployment using a template. • JupyterHub customisable through use of Source-to-Image. • Can pre-build images with notebooks and required Python packages. • Can attach storage to notebooks for persistence. • Integrates with Keycloak for user authentication. • Connects to a backend cluster for distributed data analytics.
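Template-based deployment reduces to a couple of oc commands; the template file, template name, and parameter below are illustrative, so check the jupyter-on-openshift repositories for the actual ones:

```shell
# Load the JupyterHub template into the current OpenShift project.
oc create -f jupyterhub-deployer.json

# Instantiate the template; the template and parameter names here are
# hypothetical placeholders.
oc new-app --template jupyterhub-deployer \
    --param APPLICATION_NAME=jupyterhub
```

OpenShift then creates the deployments, services, and routes described on the previous slide from the single template.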
  14. My other side projects • Using JupyterHub for teaching at university. • Using a remote kernel gateway with Jupyter notebooks. • Working with Spark as the backend data analytics cluster. • Utilising GPUs for high-performance AI/ML tasks.