Slide 1

Project Jupyter as a Foundation for Open Data Science
Brian Granger
Open Data Science Conference, SF 2015
@ellisonbg

Slide 2

About Me
• Started to work on IPython in 2004.
• Wrote the first version of the Jupyter/IPython Notebook in the summer of 2011.
• Continue as a core developer and leader of Jupyter/IPython.
• Physics Professor @ Cal Poly:
  • 20%: Teach Data Science
  • 80%: Jupyter/IPython
• Board member of NumFOCUS

Slide 3

Project Jupyter: The Big Picture

Slide 4

Computational Narratives
1. Computers are optimized for producing, consuming and processing data.
2. Humans are optimized for producing, consuming and processing narratives/stories.
3. For code and data to be useful to humans, we need tools for creating and sharing narratives that involve code and data.

Slide 5

Computational Narratives
[Diagram: Narrative, Code, Data]
The Jupyter Notebook is a tool for creating and sharing computational narratives.
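
Concretely, a computational narrative is a notebook file that interleaves prose and code. A minimal sketch, assuming the nbformat package and using illustrative file names, of building such a document programmatically:

    # Minimal sketch: build a notebook that mixes narrative (markdown) and code cells.
    # Assumes the nbformat package; "rides.csv" and "narrative.ipynb" are placeholders.
    import nbformat

    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_markdown_cell("# Ride data\nLoad the data and inspect the first rows."),
        nbformat.v4.new_code_cell("import pandas as pd\nrides = pd.read_csv('rides.csv')\nrides.head()"),
    ]
    nbformat.write(nb, "narrative.ipynb")  # the saved file is the shareable narrative

Opening the saved file in the Jupyter Notebook shows the prose, the code, and (after running) the data outputs together.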

Slide 6

Live Demo

Slide 7

10 Things you should know about Project Jupyter

Slide 8

1) Jupyter is developed by an all-star team
• Core team
• 100s of open-source contributors
• Industry collaborators

Slide 9

2) The Jupyter Notebook works with over 40 languages
https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
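
Each language is supported through a separate kernel. A minimal sketch, assuming the jupyter_client package that ships with Jupyter, of listing the kernels installed on a machine:

    # Minimal sketch: enumerate installed kernels (one per supported language setup).
    # Assumes the jupyter_client package is importable.
    from jupyter_client.kernelspec import KernelSpecManager

    for name, info in KernelSpecManager().get_all_specs().items():
        # "spec" holds the kernelspec metadata, including a human-readable name
        print(name, "->", info["spec"]["display_name"])

The same listing is available from the command line with "jupyter kernelspec list".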

Slide 10

3) Jupyter Notebooks render on GitHub

Slide 11

This enables Open Data Science
• Jeremy Singer-Vine and John Templon are data journalists at BuzzFeed News.
• For each article they publish on BuzzFeed News, they share their data and code in a GitHub repo.
• Code is shared as Jupyter Notebooks.
• https://github.com/BuzzFeedNews/everything

Slide 12

Article

Slide 13

Repository + Notebook

Slide 14

4) With Binder, users can run notebooks hosted on GitHub
http://mybinder.org
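
A minimal sketch of a Binder-ready repository (illustrative layout and file names; Binder's exact conventions have evolved): the notebooks live in the repo, and a requirements file tells Binder what to install before serving them in the browser.

    my-analysis/              (illustrative repository layout)
        index.ipynb           (notebook users will open and run)
        requirements.txt      (Python dependencies Binder installs, e.g. numpy, pandas)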

Slide 15

5) Instantly spin up a temporary Jupyter Notebook by visiting https://tmpnb.org

Slide 16

6) Companies are building products that incorporate the Jupyter Notebook or kernels
• Microsoft AzureML and HDInsight
• Quantopian Research Platform
• Google Cloud Datalab
• IBM Knowledge Anyhow
• yhat Rodeo
• Dataiku Data Science Studio
• Domino Data Lab
• O’Reilly Media
• DataRobot
• Dato GraphLab Create
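
"Incorporating a kernel" typically means driving it over the Jupyter messaging protocol. A minimal sketch, assuming the jupyter_client package and an installed python3 kernel, of starting a kernel and executing code in it:

    # Minimal sketch: launch a kernel, send it code, and read the execute reply.
    # Assumes jupyter_client and a "python3" kernelspec are installed.
    from jupyter_client.manager import start_new_kernel

    km, kc = start_new_kernel(kernel_name="python3")  # kernel manager + blocking client
    kc.execute("x = 21 * 2\nprint(x)")                # send code over the shell channel
    reply = kc.get_shell_msg(timeout=10)              # wait for the execute_reply message
    print(reply["content"]["status"])                 # "ok" if the code ran without error
    kc.stop_channels()
    km.shutdown_kernel()

This is the same mechanism a hosted product builds on behind its own user interface.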

Slide 17

7) JupyterHub brings the Jupyter Notebook to teams and organizations
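
A minimal sketch, assuming a jupyterhub_config.py file and placeholder user names (option names vary across JupyterHub versions), of configuring a small deployment:

    # jupyterhub_config.py (sketch; exact option names depend on the JupyterHub version)
    c = get_config()  # config object provided by JupyterHub when it loads this file

    c.JupyterHub.ip = "0.0.0.0"                    # listen on all interfaces
    c.JupyterHub.port = 8000                       # public port for the hub
    c.Authenticator.admin_users = {"brian"}        # placeholder admin user
    c.Authenticator.whitelist = {"alice", "bob"}   # placeholder allowed users (later renamed "allowed_users")

Running "jupyterhub" in the directory containing this file starts the hub, which spawns a single-user notebook server for each person who logs in.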

Slide 18

8) The 4.1 release of the Jupyter Notebook is coming soon…
• Q4 2015 (soon).
• Multi-cell selection and actions (cut, copy, paste, run, etc.).
• Notebook-wide find and replace.
• Atom-style command palette.

Slide 19

9) We are building a new frontend
• Tentative codename: Jupyter Workbench.
• All frontend code is being refactored into a set of loosely coupled npm packages with well-defined models and views that you can use to build your own apps.
• All based on a plug-in system that makes it easy for third-party developers to extend.
• Flexible layout to freely mix notebooks, text editors, terminals, output areas, interactive visualizations, and dashboards in a single browser tab.
• Real-time collaboration.
• Funded by the Moore, Sloan, and Helmsley Foundations in collaboration with Continuum Analytics and Bloomberg.
• Will start to show up in Q1/Q2 2016.

Slide 20

10) We are Hiring!
• UC Berkeley:
  • Postdocs.
  • Project Manager.
• Cal Poly:
  • UI/UX Designer.
  • Software Engineers.
• This is an incredible opportunity to work on open-source software as your day job.
• Contact me here at ODSC or ellisonbg at gmail.

Slide 21

What is Data Science? …and how is that question related to Jupyter?

Slide 22

Skill/Knowledge/Tools
The most common way of defining or talking about Data Science is in terms of the skills, knowledge, and tools used by its practitioners.
[Figures: O’Reilly Data Science Survey, 2015; Drew Conway]

Slide 23

How useful is this approach?
• Helpful for hiring, teaching, learning.
• But let’s try to define something else using this approach and see how it goes.
• What is cooking?
  • What skills/knowledge/tools are required?
  • Chopping, measuring, mixing, stirring, boiling, frying, operating a stove, purchasing ingredients, tasting, etc.
  • Knives, spoons, forks, cups, plates, pots, pans, measuring cups, oven, microwave, stove, etc.
• These things are all true about cooking, but miss the fundamental point.
• What is the fundamental point of cooking?

Slide 24

This is cooking:
Cooking produces food that is nutritious and pleasant to eat.

Slide 25

The Questions of Cooking
• Embedded in the end goal of cooking is a set of questions that capture its essence and must be answered by anyone attempting to cook:
  • What meals/dishes are nutritious, pleasant, and affordable to eat?
  • How long do those meals take to prepare?
  • What do I feel like eating?
  • What raw ingredients are required to prepare those meals?
  • What is the cost and availability of those raw ingredients?
  • What processes are required to prepare the meals from the raw ingredients?
  • What tools/skills are required to prepare the meals?

Slide 26

Assertion
Data Science is best defined by identifying the underlying fundamental questions common to its diverse realizations.

Slide 27

The Questions of Data Science
• How/where do we get data?
• What is the raw format of the data?
• How much data, and how often?
• What variables/fields are present in the data and what are their types?
• What relevant variables/fields are not present in the data?
• What relationships are present in the data and how are they expressed?
• Is the data observational or collected in a controlled manner?
• What practical questions can we, or would we like to, answer with the data?
• How is the data stored after collection, and how does that relate to the practical questions we are interested in answering?
• What in-memory data structures are appropriate for answering those practical questions efficiently?
• What can we predict with the data?
• What can we understand with the data?
• What hypotheses can be supported or rejected with the data?
• What statistical or machine learning methods are needed to answer these questions?
• What user interfaces are required for humans to work with the data efficiently and productively?
• How can the data be visualized effectively?
• How can code, data, and visualizations be embedded into narratives used to communicate results?
• What software is needed to support the activities around these questions?
• What computational infrastructure is needed?
• How can organizations leverage data to meet their goals?
• What organizational structures are needed to best take advantage of data?
• What are the economic benefits of pursuing these questions?
• What are the social benefits of pursuing these questions?
• Where do these questions, and the activities in pursuit of them, intersect important ethical issues?
• And many more…

Slide 28

These Questions…
• Are, in many cases, nothing more than the traditional scientific method. That is the Science in Data Science.
• Can be pursued at any level (primary, secondary, undergraduate, graduate, professional, fun).
• Open the door to a “data literacy” that extends far beyond the current hyper-technical realization of Data Science:
  • We can’t be the only ones who understand the fundamental questions related to data.
  • See, for example, the current national “discussion” surrounding global warming and vaccinations.
  • Chris Mooney, “The Science of Why We Don’t Believe Science,” Mother Jones (http://www.motherjones.com/politics/2011/03/denial-science-chris-mooney).
• Allow for a healthy understanding of the relationship between Data Science and the skills/knowledge of its practitioners:
  • The questions are fundamental; skills/knowledge are secondary (“implementation details”).
  • Different individuals will have different skills/knowledge. Not everyone will know everything.
• Are silent about education level, gender, race, academic degree, years of experience, etc.

Slide 29

How is this related to Jupyter?
• The Jupyter Notebook is a tool that allows us to explore the fundamental questions of Data Science
  • …with a particular dataset
  • …with code and data
  • …in a manner that produces a computational narrative
  • …that can be shared, reproduced, modified, and extended.
• At the end of it all, those computational narratives encapsulate the goal or end point of Data Science.
• The character of the narrative (prediction, inference, data generation, insight, etc.) will vary from case to case.

Slide 30

This is Data Science:
https://github.com/jakevdp/ProntoData
The end point of Data Science is encoded as computational narratives.

Slide 31

Thank You