Jupyter hearts PixieDust: Making Jupyter Notebooks Faster, More Flexible, and Easier to Use

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

© 2017 IBM Corp. Watson Data Platform @rajrsingh
The data problems of tomorrow cannot be solved by data scientists alone
Courtesy of Quinn Dombrowski • https://www.flickr.com/photos/quinnanya/2722672659

Slide 4

Slide 4 text

How do we blur the lines between developers and data scientists?
Disclaimer: All characters and events depicted in this story are entirely fictitious. Any similarity to actual use cases, events or persons is actually intentional.

Slide 5

Slide 5 text

Slide 6

Slide 6 text

MEET BEN
THE FULL STACK DEVELOPER
• Holds a master's degree in computer science
• 10 years of experience, 6 years with the company
• Languages of choice: Java, Node.js, HTML5/CSS3
• Data: NoSQL (Cloudant, Mongo), relational
• No major experience with Big Data
“The best line of code is the one I didn't have to write!”

Slide 7

Slide 7 text

MEET NATASHA
THE DATA SCIENTIST
• Holds a PhD in mathematics
• 5 years of experience, 2 years with the company
• Proficient in Python and R
• Expert in machine learning and data visualization
• Software engineering is not her thing
“In God we trust. All others bring data.” – W. Edwards Deming

Slide 8

Slide 8 text

Surprise meeting with the VP of Development!
“We have an urgent need to build an application for marketing that can provide real-time sentiment analysis on Twitter data.”
Courtesy of Charles Forerunner • https://unsplash.com/photos/3fPXt37X6UQ

Slide 9

Slide 9 text

KEY CONSTRAINTS
• You only have 6 weeks to build the application
• Target consumer is marketing staff, so it must be easy to use
• It must scale out of the box – look at using Apache Spark

Slide 10

Slide 10 text

Ben & Natasha start brainstorming
Ben:
• I’ll work on data acquisition from Twitter and enrichment with sentiment analysis scores using Spark Streaming
• I know Java very well, but I don’t have time to learn Python
• However, I am willing to learn Scala if that helps improve my productivity
Natasha:
• I’ll perform the data exploration and analysis
• I know Python and R, but I am not familiar enough with Java or Scala
• I like pandas and NumPy. I’m OK with learning Spark, but I expect the same level of APIs
• I need to work iteratively with the data
“I’ll need to do some data exploration too.”
“I’ll need APIs to access my data.”

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Open source notebooks
[Figure labels: Text Annotations Code Data Visualizations Widgets Output]
• Web-based UI for running Apache Spark console commands
• Easy, no-install Spark accelerator
• Best way to start working with Apache Spark
• Multiple flavors
  • Jupyter
  • Zeppelin
• Local or cloud hosted
  • IBM Data Science Experience
  • Databricks

Slide 13

Slide 13 text

What is Jupyter?
[Figure labels: Browser, Kernel, Code, Output]
Image: https://www.bluetrack.com/uploads/items_images/kernel-of-corn-stress-balls1_thumb.jpg?r=1
• "Open source, interactive data science and scientific computing"
• Formerly IPython
• Large, open, growing community and ecosystem
• Very popular
  • ~2 million users for IPython
  • $6M in funding in 2015
  • 200 contributors to notebook subproject alone
  • 275,000 public notebooks on GitHub
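The browser/kernel exchange of code and output named on this slide can be illustrated with a toy sketch. This is not Jupyter's real messaging protocol or API: `ToyKernel` and its `execute` method are hypothetical names invented here, and the "browser" is just the calling script. It only shows the core idea that a kernel runs one cell at a time and keeps state between cells.

```python
import contextlib
import io


class ToyKernel:
    """Hypothetical stand-in for a notebook kernel (not the Jupyter API)."""

    def __init__(self):
        # One shared namespace, so variables defined in one cell
        # are still available when the next cell runs.
        self.namespace = {}

    def execute(self, code):
        """Run one cell of code and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = ToyKernel()
kernel.execute("tweets = ['great demo', 'slow wifi']")  # cell 1: define data
output = kernel.execute("print(len(tweets))")           # cell 2: reuse it
print(output)
```

The second cell can see `tweets` because both cells share one namespace, which is what makes iterative, cell-by-cell exploration possible.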

Slide 14

Slide 14 text

What is Spark?
[Figure: two ways to run Spark]
Batch Job (spark-submit): Spark Application (driver), Master (cluster manager), Spark Cluster (Worker Node, Worker Node, ...)
Interactive Notebook: Browser, Notebook Server, Kernel, Master (cluster manager), Spark Cluster (Worker Node, Worker Node, ...)
• RDD partitioning
• Task packaging and dispatching
• Worker node scheduling
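The partitioning and task-dispatching steps named on this slide can be sketched in plain Python. This is only an analogy, not Spark code: a thread pool stands in for the cluster, each thread for a worker node, and the helper names `partition` and `count_words` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions slices, assigned round-robin."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(chunk):
    """The task sent to each worker: count the words in its own partition."""
    return sum(len(line.split()) for line in chunk)


tweets = [
    "spark makes big data simple",
    "notebooks are interactive",
    "pixiedust adds charts",
    "hello world",
]

partitions = partition(tweets, num_partitions=2)

# The pool plays the role of the cluster; each thread stands in for a worker node.
with ThreadPoolExecutor(max_workers=2) as workers:
    partial_counts = list(workers.map(count_words, partitions))

# The driver combines the per-partition results into the final answer.
total_words = sum(partial_counts)
print(partial_counts, total_words)
```

In real Spark the driver and cluster manager handle partitioning, packaging and scheduling for you; the sketch only makes the division of labor visible.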

Slide 15

Slide 15 text

Big Data Analysis
[Figure labels: code, results, Kernel with Spark support, Worker, Worker, ..., Worker]
• Services: Cognitive, …
• Libraries: Statistics, Math, Machine Learning, Plotting, Data (flat files, relational database, NoSQL database, …)

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Enter PixieDust…
Open Source Python helper library for Jupyter Notebooks
• Visualize data (e.g., tables, charts, maps)
• Full stack app development with PixieApps
• Download/export data
• Use Scala directly in a Notebook
• Install packages into Notebook
• Spark job progress monitor
• Extensible
https://github.com/ibm-watson-data-lab/pixiedust

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Enter PixieApps
• Python classes that extend PixieDust, letting you write UI for your analytics
• Easy to build: mostly HTML and CSS with some custom attributes (micro-format style)
• With PixieApps you can:
  • Create different HTML views with routes to invoke them
  • Invoke Python scripts from user interactions
  • Run in the notebook cell output or in a dialog
• Use cases:
  • Dashboards
  • Data browsers
  • Data pipeline management
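The idea of "different HTML views with routes to invoke them" can be illustrated with a small, self-contained sketch. This is not the PixieApps API: the `route` decorator, the `ToyDashboard` class, its `render` method, and the `data-option` attribute are all hypothetical names made up for illustration. The sketch only shows the general pattern of picking a view based on the current state.

```python
def route(**conditions):
    """Hypothetical decorator: tag a view method with the state it responds to."""
    def tag(view):
        view.conditions = conditions
        return view
    return tag


class ToyDashboard:
    """Toy stand-in for an app with several HTML views (not a real PixieApp)."""

    @route()
    def main(self):
        # Default view: shown when no more specific route matches.
        return '<button data-option="view=chart">Show chart</button>'

    @route(view="chart")
    def chart(self):
        return "<div>Sentiment chart goes here</div>"

    def render(self, **state):
        """Return the HTML of the most specific view matching the state."""
        views = [getattr(self, name) for name in dir(self)]
        views = [v for v in views if hasattr(v, "conditions")]
        matches = [
            v for v in views
            if all(state.get(key) == value for key, value in v.conditions.items())
        ]
        # More conditions means a more specific route, so it wins.
        best = max(matches, key=lambda v: len(v.conditions))
        return best()


app = ToyDashboard()
print(app.render())              # default view
print(app.render(view="chart"))  # state set by a user interaction
```

A user interaction changes the state, and the next render picks a different view; the views themselves stay mostly HTML, which matches the "easy to build" point on the slide.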

Slide 20

Slide 20 text

Twitter sentiment analysis with Watson Tone Analyzer and Watson Personality Insights
https://github.com/ibm-watson-data-lab/pixiedust_incubator/tree/master/twitterdemo

Slide 21

Slide 21 text

Slide 22

Slide 22 text

PixieDust demo
Twitter Sentiment with Watson and PixieDust
https://github.com/ibm-watson-data-lab/pixiedust/blob/master/notebook/Twitter%20Sentiment%20with%20Watson%20and%20Pixiedust.ipynb

Slide 23

Slide 23 text

Updating the VP
“This is great, but C-Suite executives need to be able to select filters and see real-time charts without writing code!”
Courtesy of Charles Forerunner • https://unsplash.com/photos/3fPXt37X6UQ

Slide 24

Slide 24 text

PixieApp demo
Sentiment Analysis of Twitter Hashtags with Spark
https://github.com/ibm-watson-data-lab/pixiedust/blob/master/notebook/Twitter%20Sentiment%20with%20Watson%20and%20Pixiedust.ipynb
https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Thanks
• PixieDust: https://github.com/ibm-watson-data-lab/pixiedust
• Project Jupyter: http://jupyter.org/
• IBM Data Science Experience: http://datascience.ibm.com (free 30-day trial)
• Me: [email protected] • Tweet @rajrsingh
Resources
• https://github.com/ibm-watson-data-lab/pixiedust
• https://ibm-watson-data-lab.github.io/pixiedust
• https://medium.com/ibm-watson-data-lab/i-am-not-a-data-scientist-efe7ca6ceba2
• https://spark.apache.org
• https://www.ibm.com/us-en/marketplace/spark-as-a-service
• http://datascience.ibm.com
• https://www.ibm.com/watson/developercloud/tone-analyzer.html
• https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585
• https://gist.github.com/vabarbosa/76d08b1cc6f80d5fc80856a1f3f32014
• https://gist.github.com/vabarbosa/dca176c3a68f0c101cbe475571e56bf7
• https://ibm.biz/pixiedustvis
• https://ibm.biz/pixiedustlab