Project Jupyter: Architecture and Evolution of an Open Platform for Modern Data Science

Lecture presented at BIDS, the Berkeley Institute for Data Science, on April 17, 2018. Description and video of the talk:

https://bids.berkeley.edu/resources/videos/project-jupyter-architecture-and-evolution-open-platform-modern-data-science

Code for Polyglot DS demo: http://nbviewer.jupyter.org/gist/fperez/5b49246af4e340c37549265a90894ce6/polyglot-ds.ipynb

Fernando Perez

April 17, 2018

Transcript

  1. Fernando Pérez fernando.perez@berkeley.edu Building an open platform for research and

    education in data science Project Jupyter
  2. A few bits about me Medellín, Colombia University of Colorado,

    Boulder Physics Applied Math Computation
  3. Statistics & me: then and now If your result needs

    a statistician then you should design a better experiment (prob. mis-attributed) E. Rutherford PhD: Lattice QCD Simulations
  4. Why?

  5. Why? ❖ Ethical: openness as fairness ❖ Human/social: openness fosters

    collaboration. ❖ Epistemological: proprietary science is an oxymoron. ❖ Technical: Python was cool :)
  6. Python - The Beginning the most important lesson I learned

    was about sharing – Guido van Rossum http://neopythonic.blogspot.com/2016/04/kings-day-speech.html Slide credit: C. Willing
  7. Designed for Learning In reality, programming languages are how programmers

    express and communicate ideas — and the audience for those ideas is other programmers, not computers. http://neopythonic.blogspot.com/2016/04/kings-day-speech.html – Guido van Rossum Slide credit: C. Willing
  8. What?

  9. None
  10. IPython: Interactive Python, 2001 A humble start: IPython 0.0.1, 259

    LOC “Just an afternoon hack” https://gist.github.com/fperez/1579699
  11. Team today: where all the credit goes Plus ~ 1500

    more Open source contributors!
  12. The IPython/Jupyter Notebook ❖ Rich web client ❖ Text &

    math ❖ Code ❖ Results ❖ Share, reproduce.
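The notebook's mix of text, math, code and results is stored as a JSON document. The sketch below builds a minimal notebook by hand using only the standard library. The field names follow version 4 of the notebook format as I understand it, the helper `make_notebook` is hypothetical, and in practice the `nbformat` package would normally be used to create and validate such files.

```python
import json


def make_notebook(markdown_text, code_text):
    """Build a minimal nbformat-4 style notebook as a plain dict (illustrative helper)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 2,
        "metadata": {},
        "cells": [
            # A text & math cell: Markdown source, no outputs.
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            # A code cell: source plus a (still empty) list of results.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": code_text,
            },
        ],
    }


notebook = make_notebook("# Text & math: $e^{i\\pi} + 1 = 0$", "print(1 + 1)")
notebook_json = json.dumps(notebook, indent=1)
```

Because the file is plain JSON, it can be shared, diffed and re-executed by any tool that understands the format.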
  13. Core ideas of the web: HTTP & HTML HTML: format

    to represent content HyperText Markup Language HTTP: protocol to connect clients and servers HyperText Transfer Protocol Image credit: eviltester.com
  14. Core ideas of Jupyter Document Format https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers Interactive Computing Protocol

    [Diagram: Clients with SUB and DEALER sockets connected to a Kernel with PUB and ROUTER sockets] ØMQ + JSON
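To make "ØMQ + JSON" concrete, here is a rough sketch of the request a client sends when asking a kernel to run code. The four-part layout (header, parent header, metadata, content) and the field names follow the Jupyter messaging specification as I recall it. The helper `make_execute_request`, the user name and the protocol version string are assumptions for illustration, and the ZeroMQ framing and HMAC signing used on a real connection are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="demo-user"):
    """Assemble an execute_request-style message as plain dicts (illustrative helper)."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "session": session_id,
        "username": username,  # hypothetical user name
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",  # assumed protocol version
    }
    content = {
        "code": code,
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {"header": header, "parent_header": {}, "metadata": {}, "content": content}


message = make_execute_request("1 + 1", session_id=uuid.uuid4().hex)

# Each part is serialized to JSON separately before being sent as a frame.
part_names = ("header", "parent_header", "metadata", "content")
wire_parts = [json.dumps(message[name]).encode("utf-8") for name in part_names]
```

Nothing in the message depends on Python: the code travels as an opaque string, so any language able to read JSON from a socket can act as the kernel.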
  15. Jupyter Protocol web-age capture of the process of interactive computing

    any mime-type output ❖ text ❖ svg, png, jpeg ❖ latex, pdf ❖ html, javascript ❖ interactive widgets
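One way an object can offer several of these output types is through IPython's `_repr_html_` and `_repr_svg_` hooks. The sketch below is a minimal illustration: the `Circle` class is hypothetical, and the `mime_bundle` function only imitates by hand the collection step that the real display machinery performs.

```python
class Circle:
    """Toy object with plain-text, HTML and SVG representations."""

    def __init__(self, radius):
        self.radius = radius

    def __repr__(self):
        return f"Circle(radius={self.radius})"

    def _repr_html_(self):
        return f"<b>Circle</b> with radius {self.radius}"

    def _repr_svg_(self):
        size = 2 * self.radius + 4
        center = size / 2
        return (
            f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
            f'<circle cx="{center}" cy="{center}" r="{self.radius}" fill="steelblue"/></svg>'
        )


def mime_bundle(obj):
    """Collect the available representations keyed by mime type (hand-rolled stand-in)."""
    bundle = {"text/plain": repr(obj)}
    if hasattr(obj, "_repr_html_"):
        bundle["text/html"] = obj._repr_html_()
    if hasattr(obj, "_repr_svg_"):
        bundle["image/svg+xml"] = obj._repr_svg_()
    return bundle


bundle = mime_bundle(Circle(20))
```

The client then picks the richest mime type it knows how to render, which is why the same result can appear as text in a terminal and as a drawing in the notebook.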
  16. Jupyter Protocol is language agnostic

    ~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
  17. None
  18. Classic ‘Notebook’…

  19. JupyterLab: a grand unified theory of Jupyter Huge Team Effort!

    C. Colbert, S. Corlay, A. Darian, B. Granger, J. Grout, P. Ivanov, I. Rose, S. Silvester, C. Willing, J. Zosa-Forde …
  20. Live Demo!

  21. Reproducible Research An article about computational science in a scientific

    publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. Buckheit and Donoho, WaveLab and Reproducible Research, 1995
  22. None
  23. None
  24. None
  25. None
  26. JupyterHub: multiuser support

  27. CODING ENVIRONMENT AUTHENTICATION Slides credit: C. Holdgraf

  28. What does this mean for science + education? ❖ Can

    utilize… ❖ ...shared hardware/compute for running code ❖ ...shared data storage for big datasets ❖ ...shared environments for doing work ❖ ...shared workflows, ideas, and results
  29. CODING ENVIRONMENT AUTHENTICATION FANCY HARDWARE

  30. CODING ENVIRONMENT AUTHENTICATION CONTENT ON THE WEB

  31. mybinder.org: shareable reproducibility github.com/freeman-lab Explicit Dependencies + + Origins
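As a rough illustration of how a repository with explicit dependencies becomes a shareable link, the sketch below builds a launch URL. The `/v2/gh/<org>/<repo>/<ref>` pattern is the one I believe mybinder.org uses for GitHub repositories. The function `binder_url`, the default `ref` and the example names are hypothetical, and the repository is assumed to declare its dependencies in a file such as `requirements.txt` or `environment.yml`.

```python
from urllib.parse import quote


def binder_url(org, repo, ref="master", base="https://mybinder.org"):
    """Build a Binder launch URL for a GitHub repository (assumed URL pattern)."""
    return f"{base}/v2/gh/{quote(org)}/{quote(repo)}/{quote(ref, safe='')}"


# Hypothetical organization and repository names, for illustration only.
example_url = binder_url("example-org", "example-repo")
```

Anyone following such a link gets a fresh environment built from the repository's declared dependencies, without installing anything locally.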

  32. CONTENT ON THE WEB ON-DEMAND ENVIRONMENTS BinderHub

  33. None
  34. A long time ago in a galaxy far, far away…

    $$R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4} T_{\mu\nu}$$ Einstein’s Field Equations of General Relativity Annalen der Physik, 1916
  35. Two identical detectors: Hanford, WA and Livingston, LA LIGO: a

    feat of science & engineering Detection problem: • ~ 1/1000 of a proton's width over 4 km. • Sensitivity ~ 1e-21 • Milky Way: 1e+21 m across!
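The numbers on this slide can be checked with a few lines of arithmetic: a strain sensitivity of 1e-21 multiplies a length to give the change in that length. The proton diameter used below (about 1.7e-15 m) is my own assumption, not a figure from the talk, so the result is only an order-of-magnitude check.

```python
STRAIN = 1e-21               # dimensionless sensitivity quoted on the slide
ARM_LENGTH_M = 4_000.0       # 4 km detector arm
PROTON_DIAMETER_M = 1.7e-15  # assumption: roughly twice the proton charge radius
MILKY_WAY_M = 1e21           # galaxy diameter quoted on the slide

# Change in arm length that the detector has to resolve.
arm_change_m = STRAIN * ARM_LENGTH_M

# The same change expressed as a fraction of a proton's width.
fraction_of_proton = arm_change_m / PROTON_DIAMETER_M

# The same strain applied across the whole Milky Way.
milky_way_change_m = STRAIN * MILKY_WAY_M

print(f"arm change: {arm_change_m:.1e} m")
print(f"fraction of a proton: {fraction_of_proton:.1e}")
print(f"Milky Way change: {milky_way_change_m:.1f} m")
```

Under these assumptions the arm length changes by about 4e-18 m, a few thousandths of a proton's width, which is equivalent to measuring the width of the galaxy to within about a meter.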
  36. September 14, 2015

  37. The song of the universe Using the IPython.display.Audio object
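To give a feel for how a signal becomes sound in a notebook, the sketch below synthesizes a rising "chirp" with the standard library and packs it into WAV bytes. The frequencies, duration and amplitude ramp are made-up illustration values, not LIGO data, and `chirp_wav_bytes` is a hypothetical helper. In a notebook the resulting bytes could be handed to `IPython.display.Audio`.

```python
import io
import math
import struct
import wave


def chirp_wav_bytes(duration_s=1.0, rate=8000, f_start=35.0, f_end=250.0):
    """Synthesize a rising tone and return it as 16-bit mono WAV bytes (toy signal)."""
    n_samples = int(duration_s * rate)
    frames = bytearray()
    phase = 0.0
    for i in range(n_samples):
        t = i / n_samples                                   # progress from 0 to 1
        freq = f_start + (f_end - f_start) * t ** 3         # frequency sweeps upward
        phase += 2 * math.pi * freq / rate                  # integrate frequency into phase
        amplitude = 0.2 + 0.8 * t                           # volume grows toward the end
        frames += struct.pack("<h", int(32767 * amplitude * math.sin(phase)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


wav_bytes = chirp_wav_bytes()

# In a notebook (requires IPython):
#   from IPython.display import Audio
#   Audio(data=wav_bytes)
```

The display object embeds an audio player in the output cell, so the sound travels with the notebook alongside the code that produced it.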

  38. LIGO: Open Science with Jupyter

  39. Binder: reproducible, executable scholarship from averaging ~150 people per week

    to averaging ~2,900 people per week Berkeley: Yuvi Panda, Chris Holdgraf Cal Poly: Carol Willing Simula: Min Ragan-Kelley Jessica Zosa-Forde, Tim Head
  40. A tool FOR research, a subject OF research

  41. Anatomy of a notebook http://adamrule.com/files/papers/chi_2018_computational_notebooks_final_web.pdf https://blog.jupyter.org/we-analyzed-1-million-jupyter-notebooks-now-you-can-too-guest-post-8116a964b536 Structure and design

    • Adam Rule et al. (UCSD) • analyzed 1 million notebooks • design opportunities • Dataset is PUBLIC! Slide credit: C. Willing
  42. Education

  43. Berkeley’s Data Science Courses http://data8.org ❖ Freshmen & upper division

    ❖ Interactive textbooks: Jupyter Notebooks ❖ Course deployment: JupyterHub http://ds100.org
  44. DataHub datahub.berkeley.edu Supporting 2,500+ users Being used for Data 8,

    as well as several other courses Requires @berkeley.edu to access Running on Azure with almost zero maintenance Slide: C. Holdgraf
  45. Data 8 & Data100: massive uptake D100 Sp18: ~650 students

    D8 Sp18: ~1,100 students
  46. Fastest growing courses in Berkeley history Thanks to Yuvi Panda

    (DSEP), Ryan Lovett (Statistics), DSEP team
  47. Berkeley in a few years… “We are witnessing a monumental

    phase shift in data science knowledge on campus - undergrads are extremely well trained…” Ciera Martinez, BIDS Fellow
  48. Today! (April 17, 2018)

  49. From K-12 to HPC!

  50. Wide industrial adoption

  51. 2018! Save 20% with PJ20 jupytercon.com @JupyterCon

  52. You may have seen this last week :) https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676

  53. The world of science and education wants open platforms https://github.com/parente/nbestimate

    ~1.7M notebooks on GitHub in Jan 2018
  54. Back to openness: ethics and inclusion

  55. Jupyter @ Berkeley and LBNL

  56. Funding and resources

  57. Thank You!