Data Science in the Cloud Using Python

Graham Dumpleton

July 29, 2018


  1. Data science in the cloud using Python
    Graham Dumpleton @GrahamDumpleton
  2. Who am I?
    Author of mod_wsgi, the Apache module for hosting of Python web applications using the WSGI interface. Have a keen interest in Docker and Platform as a Service (PaaS) technologies. Currently a developer advocate for OpenShift at Red Hat.
  3. Topics for today
    • What is Jupyter & who is using it?
    • How to provide Jupyter notebooks to many users.
    • Deploying a single Jupyter notebook to OpenShift.
    • Deploying JupyterHub to OpenShift.
    • Deploying a data processing cluster using Dask.
  4. What is Jupyter? http://jupyter.org/

  5. Who is using Jupyter?
    • Individuals
    • Collaborators
    • Teachers
    Data-driven organisations use Jupyter to analyse their data, share their insights, and create dynamic, reproducible data science.
  6. https://blog.data.gov.sg/how-we-caught-the-circle-line-rogue-train-with-data-79405c86ab6a

  7. None
  8. https://jupyterhub.readthedocs.io/

  9. Running a single Jupyter notebook

  10. Running on your local system
    pip3 install jupyter
    jupyter notebook

  11. Local installation (Pros)
    • Save notebooks/data locally.
    • Python virtual environments.
    • Select Python version.
    • Select package versions.
  12. Local installation (Cons)
    • Operating system differences.
    • Python distribution differences.
    • Python version differences.
    • Package index differences.
    • PyPI (pip) vs Anaconda (conda)
    • Effort to set up and maintain.
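
The version and distribution differences listed above are easier to diagnose when each environment can report what it is actually running. The following is a minimal sketch, not something from the talk: the helper name `describe_environment` and the package names passed to it are hypothetical, chosen only for illustration.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter and selected package versions."""
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report["packages"][name] = None
    return report


if __name__ == "__main__":
    # Illustrative package names; list the ones your notebooks import.
    for key, value in describe_environment(["notebook", "numpy", "pandas"]).items():
        print(key, value)
```

Running this in two environments and comparing the output is one quick way to spot a mismatched Python or package version before it shows up as a notebook error.
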
  13. Running in a container
    docker run -it --rm -p 8888:8888 jupyter/minimal-notebook
    https://github.com/jupyter/docker-stacks
  14. Container images (Pros)
    • Pre-created images.
    • Bundled operating system packages.
    • Known Python distribution/vendor.
    • Bundled Python packages.
    • Docker images are read only.
    • Run in an isolated environment.
    • Don’t need to maintain the image.
  15. Container images (Cons #1)
    • More effort to customise experience.
    • Build a custom Docker image to extend.
    • Install extra packages each time you run it.
    • Images can be very large.
    • Multiple Python versions.
    • Packages that you do not need.
  16. Container images (Cons #2)
    • Access to and saving your notebooks/data.
    • Need to mount persistent storage volumes.
    • Configure a contents manager for external storage.
    • Ensuring access is done securely.
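
As a rough illustration of the volume-mount point above, the sketch below assembles a `docker run` argument list that maps a host directory into the container and publishes the port on the loopback interface only. The function name `notebook_run_command` is hypothetical, and `/home/jovyan/work` is assumed here to be the notebook working directory, as used by the jupyter/docker-stacks images.

```python
import shlex
from pathlib import Path


def notebook_run_command(host_dir, image="jupyter/minimal-notebook", port=8888):
    """Build a `docker run` argument list that mounts host_dir into the container."""
    # Assumption: /home/jovyan/work is the working directory inside the image.
    mount = f"{Path(host_dir).resolve()}:/home/jovyan/work"
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"127.0.0.1:{port}:8888",  # publish on the loopback interface only
        "-v", mount,                      # notebooks saved here survive the container
        image,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be reviewed first.
    print(shlex.join(notebook_run_command("notebooks")))
```

Because the container is started with `--rm`, anything saved outside the mounted directory is lost when it exits, which is the persistence problem this slide describes.
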
  17. None
  18. Jupyter notebooks on OpenShift https://github.com/jupyter/docker-stacks/tree/master/examples/openshift

  19. None
  20. Source-to-Image enabled https://github.com/jupyter/docker-stacks/tree/master/examples/source-to-image

  21. None
  22. https://github.com/jakevdp/PythonDataScienceHandbook

  23. Images built for OpenShift https://github.com/jupyter-on-openshift/jupyter-notebooks

  24. Hosting multiple Jupyter notebooks

  25. None
  26. JupyterHub on OpenShift https://github.com/jupyter-on-openshift/jupyterhub-quickstart

  27. POC - JupyterHub with KeyCloak https://github.com/jupyter-on-openshift/poc-hub-keycloak-auth

  28. POC - JupyterHub with Dask https://github.com/jupyter-on-openshift/poc-hub-dask-cluster

  29. What is Dask? http://dask.pydata.org/

  30. None
  31. None
  32. Deployments

  33. User access managed by KeyCloak

  34. Persistent workspace for notebooks

  35. Access to Dask cluster

  36. Created at login

  37. Control panel

  38. JupyterHub

  39. Hub + Proxy

  40. Managed services Idle Notebook Culler Dask Cluster Manager
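
The idle notebook culler named on this slide stops notebook servers that have gone unused. The sketch below shows only the selection logic, under the assumption that a last-activity timestamp is available per user; it is not the culler used in the proof of concept, and `servers_to_cull` is a hypothetical name. A real culler would obtain activity data from JupyterHub and then ask it to stop each server.

```python
from datetime import datetime, timedelta, timezone


def servers_to_cull(last_activity, timeout, now=None):
    """Return the users whose notebook server has been idle longer than timeout."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        user for user, seen in last_activity.items()
        if now - seen > timeout
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up activity data, for illustration only.
    activity = {
        "alice": now - timedelta(minutes=5),
        "bob": now - timedelta(hours=2),
    }
    print(servers_to_cull(activity, timedelta(hours=1), now=now))
```
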

  41. Builds

  42. KeyCloak image (build diagram)
    • Docker Hub: jboss/keycloak-openshift:4.0.0.Final
    • GitHub: keycloak
    • S2I
    • OpenShift Image: keycloak-img
  43. JupyterHub images (build diagram)
    • OpenShift Image Registry: openshift/python
    • GitHub: jupyter-on-openshift/jupyterhub-quickstart
    • S2I
    • OpenShift Image: jupyterhub-s2i
    • GitHub: jupyter-on-openshift/poc-hub-dask-cluster (jupyterhub)
    • S2I
    • OpenShift Image Registry: jupyterhub-img
    • Labels: Base JupyterHub Image, JupyterHub
  44. Notebook images (build diagram)
    • OpenShift Image Registry: openshift/python
    • GitHub: jupyter-on-openshift/jupyter-notebooks (minimal-notebook)
    • S2I
    • OpenShift Image Registry: jupyterhub-notebook-s2i
    • GitHub: jupyter-on-openshift/poc-hub-dask-cluster (notebook)
    • S2I
    • OpenShift Image Registry: jupyterhub-notebook-img
    • Labels: Base Image, Notebook Image, Dask Scheduler/Worker Image
  45. Deployments (diagram)
    • Components: JupyterHub, KeyCloak, PostgreSQL (KeyCloak), PostgreSQL (JupyterHub), Notebook, Dask Scheduler, Dask Workers
    • Groupings: Common, Per User
    • Connections: Routes and Services
  46. Jupyter on OpenShift
    • Provides simplified deployment using a template.
    • JupyterHub customisable through use of Source-to-Image.
    • Can pre-build images with notebooks and required Python packages.
    • Can attach storage to notebooks for persistence.
    • Integrate with KeyCloak for user authentication.
    • Connect to backend cluster for distributed data analytics.
  47. My other side projects
    • Using JupyterHub for teaching in university.
    • Using remote kernel gateway with Jupyter notebooks.
    • Working with Spark as backend data analytics cluster.
    • Utilising GPUs for high performance AI/ML tasks.
  48. Resources
    • https://github.com/jupyter-on-openshift
    • https://www.jupyteronopenshift.org/
    • https://github.com/jupyter/docker-stacks
    • examples/openshift &

  49. Community
    • https://commons.openshift.org/
    • https://commons.openshift.org/sig/OpenshiftMachineLearning.html
    • https://commons.openshift.org/sig/OpenshiftBigData.html
    • https://commons.openshift.org/sig/OpenshiftEDU.html

  50. Try it
    • https://www.openshift.org/minishift/
    • https://www.openshift.com/products/online/

  51. Contact me
    Graham Dumpleton @GrahamDumpleton