Project Jupyter: Architecture and Evolution of an Open Platform for Modern Data Science

Lecture presented at BIDS, the Berkeley Institute for Data Science on April 17, 2018. Description and video of the talk:

https://bids.berkeley.edu/resources/videos/project-jupyter-architecture-and-evolution-open-platform-modern-data-science

Code for Polyglot DS demo: http://nbviewer.jupyter.org/gist/fperez/5b49246af4e340c37549265a90894ce6/polyglot-ds.ipynb

Fernando Perez

April 17, 2018

Transcript

  1. Fernando Pérez
    Building an open platform for
    research and education in data
    science
    Project Jupyter

  2. A few bits about me
    Medellín, Colombia
    University of Colorado, Boulder
    Physics
    Applied Math
    Computation

  3. Statistics & me: then and now
    If your result needs a statistician then
    you should design a better experiment
    (prob. mis-attributed)
    E. Rutherford
    PhD: Lattice QCD
    Simulations

  4. Why?

  5. Why?
    ❖ Ethical: openness as fairness
    ❖ Human/social: openness fosters collaboration.
    ❖ Epistemological: proprietary science is an oxymoron.
    ❖ Technical: Python was cool :)

  6. Python - The Beginning
    the most important
    lesson I learned
    was about sharing
    – Guido van Rossum
    http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
    Slide credit: C. Willing

  7. Designed for Learning
    In reality, programming languages
    are how programmers express and
    communicate ideas — and the
    audience for those ideas is other
    programmers, not computers.
    http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
    – Guido van Rossum
    Slide credit: C. Willing

  8. What?

  10. IPython: Interactive Python, 2001
    A humble start:
    IPython 0.0.1, 259 LOC
    “Just an afternoon hack”
    https://gist.github.com/fperez/1579699

  11. Team today: where all the credit goes
    Plus ~1,500 more open source contributors!

  12. The IPython/Jupyter Notebook
    ❖ Rich web client
    ❖ Text & math
    ❖ Code
    ❖ Results
    ❖ Share, reproduce.

  13. Core ideas of the web: HTTP & HTML
    HTML: format to represent content
    HyperText Markup Language
    HTTP: protocol to connect clients and servers
    HyperText Transfer Protocol
    Image credit: eviltester.com

  14. Core ideas of Jupyter
    Document Format
    https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
    Interactive Computing Protocol
    [Diagram: client SUB and DEALER sockets connected to kernel PUB and ROUTER sockets]
    ØMQ + JSON
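
The two core ideas above, a document format and a JSON-based protocol, can be sketched with plain dictionaries. This is a minimal illustration using only the standard library: the field names follow the published notebook format (version 4) and the `execute_request` message layout, while the helper names `make_notebook` and `make_execute_request`, the user name, and the sample values are hypothetical. The ØMQ transport and message signing are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_notebook(markdown_text, code_text):
    """Build a minimal notebook document as a plain dict (nbformat 4 layout)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            # A text cell: prose and math live next to the code.
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            # A code cell: results are stored in "outputs" once it has run.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": code_text,
            },
        ],
    }


def make_execute_request(code, session_id):
    """Build the JSON parts of an `execute_request` message (no transport)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "username": "demo",  # hypothetical user name
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


notebook = make_notebook("# A tiny notebook", "print(1 + 1)")
request = make_execute_request("print(1 + 1)", session_id=uuid.uuid4().hex)
print(json.dumps(notebook, indent=1))
print(json.dumps(request, indent=1))
```

Because both structures are plain JSON, any client or kernel, in any language, can read and write them.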

  15. Jupyter Protocol
    web-age capture of the process of interactive computing
    any mime-type output
    ❖ text
    ❖ svg, png, jpeg
    ❖ latex, pdf
    ❖ html, javascript
    ❖ interactive widgets
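
One way to picture "any mime-type output" is an object that offers several representations of itself, keyed by MIME type, and lets the client pick the richest one it can render. The sketch below uses the `_repr_mimebundle_` hook from IPython's display protocol; the `Ratio` class is a hypothetical example, and the code does not need IPython installed to run.

```python
class Ratio:
    """A hypothetical value that can describe itself in several MIME types."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def __repr__(self):
        return f"{self.numerator}/{self.denominator}"

    def _repr_mimebundle_(self, include=None, exclude=None):
        # Hook looked up by IPython when displaying an object: return a
        # dict mapping MIME types to representations of the same value.
        n, d = self.numerator, self.denominator
        return {
            "text/plain": repr(self),
            "text/html": f"<sup>{n}</sup>&frasl;<sub>{d}</sub>",
            "text/latex": f"$\\frac{{{n}}}{{{d}}}$",
        }


bundle = Ratio(3, 4)._repr_mimebundle_()
for mime_type, payload in bundle.items():
    print(mime_type, "->", payload)
```

A terminal client would fall back to `text/plain`, while a notebook front end could render the HTML or LaTeX form.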

  16. Jupyter Protocol
    is language agnostic
    ~100 different kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels

  18. Classic ‘Notebook’…

  19. JupyterLab: a grand unified theory of Jupyter
    Huge Team Effort!
    C. Colbert, S. Corlay, A. Darian, B. Granger, J. Grout, P.
    Ivanov, I. Rose, S. Silvester, C. Willing, J. Zosa-Forde …

  20. Live Demo!

  21. Reproducible Research
    An article about computational science in a scientific
    publication is not the scholarship itself, it is merely
    advertising of the scholarship. The actual scholarship is
    the complete software development environment and the
    complete set of instructions which generated the figures.
    Buckheit and Donoho, WaveLab and Reproducible Research, 1995

  26. JupyterHub: multiuser support

  27. CODING
    ENVIRONMENT
    AUTHENTICATION
    Slides credit: C. Holdgraf

  28. What does this mean for science + education?
    ❖ Can utilize…
    ❖ ...shared hardware/compute for running code
    ❖ ...shared data storage for big datasets
    ❖ ...shared environments for doing work
    ❖ ...shared workflows, ideas, and results

  29. CODING
    ENVIRONMENT
    AUTHENTICATION
    FANCY HARDWARE

  30. CODING
    ENVIRONMENT
    AUTHENTICATION
    CONTENT ON
    THE WEB

  31. mybinder.org: shareable reproducibility
    github.com/freeman-lab
    Explicit Dependencies
    +
    +
    Origins

  32. CONTENT ON
    THE WEB
    ON-DEMAND
    ENVIRONMENTS
    BinderHub

  34. A long time ago in a galaxy far, far away…
    $$R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4}\, T_{\mu\nu}$$
    Einstein’s Field Equations of General Relativity
    Annalen der Physik, 1916

  35. Two identical detectors: Hanford, WA and Livingston, LA
    LIGO: a feat of science &
    engineering
    Detection problem:
    • ~ 1/1000 proton over 4 km.
    • Sensitivity ~ 1e-21
    • Milky Way: 1e+21m across!
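
The numbers in the detection problem can be checked with a few lines of arithmetic: a strain (fractional length change) of 1e-21 is multiplied by a length to get the absolute change. The arm length, sensitivity, and Milky Way size are taken from the slide; the proton diameter of roughly 1.7e-15 m is an assumption added here, so the proton comparison is only an order-of-magnitude check.

```python
ARM_LENGTH_M = 4_000.0       # 4 km detector arm (from the slide)
STRAIN = 1e-21               # sensitivity (from the slide)
MILKY_WAY_M = 1e21           # size of the Milky Way (from the slide)
PROTON_DIAMETER_M = 1.7e-15  # assumption: rough proton diameter


def length_change(strain, length_m):
    """Absolute length change produced by a given strain over a given length."""
    return strain * length_m


arm_change_m = length_change(STRAIN, ARM_LENGTH_M)
galaxy_change_m = length_change(STRAIN, MILKY_WAY_M)
proton_fraction = arm_change_m / PROTON_DIAMETER_M

print(f"Arm length change:    {arm_change_m:.1e} m")
print(f"Fraction of a proton: {proton_fraction:.1e}")
print(f"Milky Way equivalent: {galaxy_change_m:.1f} m")
```

Under these assumptions the arm changes by about 4e-18 m, a small fraction of a proton's width, and the same strain applied across the whole Milky Way would amount to about one metre.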

  36. September 14, 2015

  37. The song of the universe
    Using the IPython.display.Audio object
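
A rough idea of how a signal becomes sound in a notebook: build a list of samples and hand it to `IPython.display.Audio` together with a sample rate. The sketch below generates a synthetic rising tone, not LIGO data; the function names `toy_chirp` and `play` and all parameter values are hypothetical choices made for illustration.

```python
import math


def toy_chirp(duration_s=1.0, rate=8000, f_start=50.0, f_end=400.0):
    """Return samples of a tone whose pitch and loudness rise over time."""
    n_samples = int(duration_s * rate)
    samples = []
    for i in range(n_samples):
        t = i / rate
        # Linear chirp: the phase is 2*pi times the integral of the frequency.
        phase = 2 * math.pi * (
            f_start * t + 0.5 * (f_end - f_start) * t * t / duration_s
        )
        amplitude = t / duration_s  # grows from 0 toward 1
        samples.append(amplitude * math.sin(phase))
    return samples


def play(samples, rate=8000):
    """Wrap samples in an audio player widget (requires IPython)."""
    from IPython.display import Audio

    return Audio(samples, rate=rate)


samples = toy_chirp()
print(len(samples), "samples generated")
```

In a notebook cell, evaluating `play(samples)` as the last expression would show an audio player for the generated tone.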

  38. LIGO: Open Science with Jupyter

  39. Binder: reproducible, executable scholarship
    Usage grew from an average of ~150 people per week to ~2,900 people per week
    Berkeley: Yuvi Panda, Chris Holdgraf
    Cal Poly: Carol Willing
    Simula: Min Ragan-Kelley
    Jessica Zosa-Forde, Tim Head

  40. A tool FOR research,
    a subject OF research

  41. Anatomy of
    a notebook
    http://adamrule.com/files/papers/chi_2018_computational_notebooks_final_web.pdf
    https://blog.jupyter.org/we-analyzed-1-million-jupyter-notebooks-now-you-can-too-guest-post-8116a964b536
    Structure and design
    • Adam Rule et al. (UCSD)
    • analyzed 1 million
    notebooks
    • design opportunities
    • Dataset is PUBLIC!
    Slide credit: C. Willing

  42. Education

  43. Berkeley’s Data Science Courses
    http://data8.org
    ❖ Freshmen & upper
    division
    ❖ Interactive textbooks:
    Jupyter Notebooks
    ❖ Course deployment:
    JupyterHub
    http://ds100.org

  44. DataHub
    datahub.berkeley.edu
    Supporting 2,500+ users
    Being used for Data 8, as well
    as several other courses
    Requires @berkeley.edu to
    access
    Running on Azure with almost
    zero maintenance
    Slide: C. Holdgraf

  45. Data 8 & Data100: massive uptake
    D100 Sp18: ~650
    students
    D8 Sp18: ~1,100 students

  46. Fastest growing courses
    in Berkeley history
    Thanks to

    Yuvi Panda (DSEP), Ryan Lovett (Statistics),
    DSEP team

  47. Berkeley in a few years…
    “We are witnessing a monumental phase shift in data science knowledge on campus -
    undergrads are extremely well trained…”
    Ciera Martinez, BIDS Fellow

  48. Today! (April 17, 2018)

  49. From K-12 to HPC!

  50. Wide industrial adoption

  51. 2018!
    Save 20% with PJ20
    jupytercon.com @JupyterCon

  52. You may have seen this last week :)
    https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676

  53. The world of science and education
    wants open platforms
    https://github.com/parente/nbestimate
    ~1.7M notebooks
    on GitHub in Jan 2018

  54. Back to openness: ethics and inclusion

  55. Jupyter @ Berkeley and LBNL

  56. Funding and resources

  57. Thank You!
