Project Jupyter as a Foundation for Open Data Science

I gave this talk as a keynote at the Open Data Science Conference in SF in November of 2015.

Brian E. Granger

November 14, 2015

Transcript

  1. Project Jupyter as a Foundation for Open Data Science Brian

    Granger Open Data Science Conference, SF 2015 @ellisonbg
  2. About Me • Started to work on IPython in 2004.

    • Wrote the first version of the Jupyter/IPython Notebook in the summer of 2011. • Continue as a core developer and leader of Jupyter/IPython. • Physics Professor @ Cal Poly: • 20%: Teach Data Science • 80%: Jupyter/IPython • Board member of NumFOCUS
  3. Project Jupyter: The Big Picture

  4. Computational Narratives 1. Computers are optimized for producing, consuming and

    processing data. 2. Humans are optimized for producing, consuming and processing narratives/stories. 3. For code and data to be useful to humans, we need tools for creating and sharing narratives that involve code and data.
  5. Computational Narratives Narrative Code Data The Jupyter Notebook is a

    tool for creating and sharing computational narratives
  6. Live Demo

  7. 10 Things you should know about Project Jupyter

  8. 1) Jupyter is developed by an all-star team Core

    team 100s of open-source contributors Industry collaborators
  9. 2) The Jupyter Notebook works with over 40 languages https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages

  10. 3) Jupyter Notebooks render on GitHub

  11. This enables Open Data Science • Jeremy Singer-Vine and John

    Templon are data journalists at BuzzFeedNews. • For each article they publish on BuzzFeedNews, they share their data and code in a GitHub repo. • Code is shared as Jupyter Notebooks. • https://github.com/BuzzFeedNews/everything
  12. Article

  13. Repository+Notebook

  14. 4) With Binder, users can run notebooks hosted on GitHub

    http://mybinder.org
  15. 5) Instantly spin up a temporary Jupyter Notebook by visiting

    https://tmpnb.org
  16. 6) Companies are building products that incorporate the Jupyter Notebook

    or Kernels • Microsoft AzureML and HD Insight • Quantopian Research Platform • Google Cloud Datalab • IBM Knowledge Anyhow • yhat Rodeo • Dataiku Data Science Studio • Domino Data Lab • O’Reilly Media • DataRobot • Dato GraphLab Create
  17. 7) JupyterHub brings the Jupyter Notebook to teams and organizations

  18. 8) The 4.1 release of the Jupyter Notebook is coming

    soon… • Q4 2015 (soon). • Multi-cell selection and actions (cut, copy, paste, run, etc.). • Notebook-wide find and replace. • Atom-style command palette.
  19. 9) We are building a new frontend • Tentative codename:

    Jupyter Workbench • All frontend code is being refactored into a set of loosely coupled npm packages with well-defined models and views that you can use in building your own apps. • All based on a plug-in system to make it easy for third-party developers to extend. • Flexible layout to freely mix notebooks, text editors, terminals, output areas, interactive visualizations, and dashboards in a single browser tab. • Real-time collaboration. • Funded by the Moore, Sloan, and Helmsley Foundations in collaboration with Continuum Analytics and Bloomberg. • Will start to show up in Q1/Q2 2016.
  20. 10) We are Hiring! • UC Berkeley • Postdocs. •

    Project Manager. • Cal Poly • UI/UX Designer. • Software Engineers. • This is an incredible opportunity to work on open-source software as your day job. • Contact me here at ODSC or ellisonbg at gmail
  21. What is Data Science? …and how is that question related

    to Jupyter?
  22. Skill/Knowledge/Tools The most common way of defining or talking about

    Data Science is in terms of the skills, knowledge, and tools used by its practitioners. O’Reilly Data Science Survey, 2015 Drew Conway
  23. How useful is this approach? • Helpful for hiring, teaching,

    learning. • But let’s try to define something else using this approach and see how it goes. • What is cooking? • What skills/knowledge/tools are required? • Chopping, measuring, mixing, stirring, boiling, frying, operating a stove, purchasing ingredients, tasting, etc. • Knives, spoons, forks, cups, plates, pots, pans, measuring cups, oven, microwave, stove, etc. • These things are all true about cooking, but miss the fundamental point. • What is the fundamental point of cooking?
  24. This is cooking: Cooking produces food that is nutritional and

    pleasant to eat.
  25. The Questions of Cooking • Embedded in the end goal

    of cooking are a set of questions that capture its essence and must be answered by anyone attempting to cook: • What meals/dishes are nutritional, pleasant and affordable to eat? • How long do those meals take to prepare? • What do I feel like eating? • What raw ingredients are required to prepare those meals? • What is the cost and availability of those raw ingredients? • What processes are required to prepare the meals from the raw ingredients? • What tools/skills are required to prepare the meals?
  26. Assertion Data Science is best defined by identifying the underlying

    fundamental questions in common to its diverse realizations
  27. The Questions of Data Science • How/where do we get

    data? • What is the raw format of the data? • How much data and how often? • What variables/fields are present in the data and what are their types? • What relevant variables/fields are not present in the data? • What relationships are present in the data and how are they expressed? • Is the data observational or collected in a controlled manner? • What practical questions can we, or would we like to, answer with the data? • How is the data stored after collection and how does that relate to the practical questions we are interested in answering? • What in-memory data structures are appropriate for answering those practical questions efficiently? • What can we predict with the data? • What can we understand with the data? • What hypotheses can be supported or rejected with the data? • What statistical or machine learning methods are needed to answer these questions? • What user interfaces are required for humans to work with the data efficiently and productively? • How can the data be visualized effectively? • How can code, data and visualizations be embedded into narratives used to communicate results? • What software is needed to support the activities around these questions? • What computational infrastructure is needed? • How can organizations leverage data to meet their goals? • What organizational structures are needed to best take advantage of data? • What are the economic benefits of pursuing these questions? • What are the social benefits of pursuing these questions? • Where do these questions and the activities in pursuit of them intersect important ethical issues? • And many more…
  28. These Questions… • Are, in many cases, nothing more than

    the traditional scientific method. That is the Science in Data Science. • Can be pursued at any level (primary, secondary, undergraduate, graduate, professional, fun). • Open the door for a “data literacy” that extends far beyond the current hyper-technical realization of Data Science: • We can’t be the only ones who understand the fundamental questions related to data. • See, for example, the current national “discussion” surrounding global warming and vaccinations. • Chris Mooney, “The Science of Why We Don’t Believe Science,” Mother Jones (http://www.motherjones.com/politics/2011/03/denial-science-chris-mooney). • Allow for a healthy understanding of the relationship between Data Science and the skills/knowledge of its practitioners: • The questions are fundamental; skills/knowledge are secondary: “implementation details”. • Different individuals will have different skills/knowledge. Everyone will not know everything. • Are silent about education level, gender, race, academic degree, years of experience, etc.
  29. How is this related to Jupyter? • The Jupyter Notebook

    is a tool that allows us to explore the fundamental questions of Data Science • …with a particular dataset • …with code and data • …in a manner that produces a computational narrative • …that can be shared, reproduced, modified, and extended. • At the end of it all, those computational narratives encapsulate the goal or end point of Data Science. • The character of the narrative (prediction, inference, data generation, insight, etc.) will vary from case to case.
  30. This is Data Science: https://github.com/jakevdp/ProntoData The end point of Data

    Science is encoded as computational narratives
  31. Thank You
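
Slide 5 describes a notebook as a mix of narrative, code, and data. As a rough illustration of what that means on disk, the sketch below builds a tiny notebook document by hand using only the Python standard library. It assumes the version 4 notebook JSON layout (a list of cells plus `metadata`, `nbformat`, and `nbformat_minor`). The helper names (`markdown_cell`, `code_cell`, `build_notebook`) and the sample trip durations are hypothetical and not part of the talk; in practice the `nbformat` package is the usual way to create and validate notebooks.

```python
import json


def markdown_cell(text):
    """Return a narrative (Markdown) cell in the v4 notebook layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return a code cell that has not been executed yet (no outputs)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 0}


# Narrative and code sit side by side in one document (made-up example data).
notebook = build_notebook([
    markdown_cell("# Trip durations\nHow long does a typical trip last?"),
    code_cell("durations = [12, 7, 31, 18]\nsum(durations) / len(durations)"),
])

# Serialize to the JSON text that would be saved as an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Writing `notebook_json` to a file with an `.ipynb` extension should give a document that the Notebook can open; running the code cell there would attach its output to the same file, which is what lets the narrative, code, and results travel together.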