
Jupyter hearts PixieDust: Making Jupyter Notebooks Faster, Flexible, and Easier to use

PixieDust is a new open source Python library that helps data scientists and
developers working in Jupyter Notebooks and Apache Spark to be more efficient.
PixieDust speeds up data manipulation and display with features like:
• Automated local install of Python and Scala kernels running with Spark
• Real-time Spark job progress monitoring directly from the Notebook
• Use of Scala directly in your Python notebook, with variables automatically transferred from Python to Scala and vice versa
• Auto-visualisation of Spark DataFrames using popular chart engines like Matplotlib, Seaborn, Bokeh, or Mapbox
• Seamless integration with cloud services
• Embedded apps with your own visualisations, built using the PixieDust extensibility APIs

Come along and learn how you can use this tool in your own projects
to visualise and explore data effortlessly with no coding.
If you prefer working with a Scala Notebook, this session is also for you,
as PixieDust can also run on a Scala kernel.
Imagine being able to visualise data with your favourite Python chart engines
from a Scala Notebook! This session will end with a demo combining Twitter,
Watson Tone Analyzer, Spark Streaming, and some fun real-time visualisations,
all running within a Notebook.

Raj Singh

July 06, 2017
Transcript

  1. © 2017 IBM Corp. Watson Data Platform @rajrsingh
    Jupyter ❤ PixieDust: Making Jupyter Notebooks Faster, Flexible, and Easier to use
    Raj Singh, PhD <rrsingh@us.ibm.com>, Developer Advocate, IBM Watson Data Platform
  2. https://hbr.org/2016/02/the-rise-of-data-driven-decision-making-is-real-but-uneven

  3. The data problems of tomorrow cannot be solved by data scientists alone
    Courtesy of Quinn Dumbrowski • https://www.flickr.com/photos/quinnanya/2722672659
  4. How do we blur the lines between developers and data scientists?
    Disclaimer: All characters and events depicted in this story are entirely fictitious. Any similarity to actual use cases, events or persons is actually intentional.
  5. Organizations are Systems of Systems
    • Systems of Orchestration
    • Systems of Operation
    • Systems of Record
    • Systems of Engagement
  6. MEET BEN: THE FULL STACK DEVELOPER
    “The best line of code is the one I didn't have to write!”
    • Holds a master's degree in computer science
    • 10 years of experience, 6 years with the company
    • Languages of choice: Java, Node.js, HTML5/CSS3
    • Data: NoSQL (Cloudant, Mongo), relational
    • No major experience with Big Data
  7. MEET NATASHA: THE DATA SCIENTIST
    “In God we trust. All others bring data.” – W. Edwards Deming
    • Holds a PhD in mathematics
    • 5 years of experience, 2 years with the company
    • Proficient in Python and R
    • Expert in machine learning and data visualization
    • Software engineering is not her thing
  8. Surprise meeting with the VP of Development!
    “We have an urgent need to build an application for marketing that can provide real-time sentiment analysis on Twitter data.”
    Courtesy of Charles Forerunner • https://unsplash.com/photos/3fPXt37X6UQ
  9. KEY CONSTRAINTS
    • You only have 6 weeks to build the application
    • Target consumer is marketing staff, so it must be easy to use
    • It must scale out of the box: look at using Apache Spark
  10. Ben & Natasha start brainstorming
    Ben:
    • I’ll work on data acquisition from Twitter and enrichment with sentiment analysis scores using Spark Streaming
    • I know Java very well, but I don’t have time to learn Python
    • However, I am willing to learn Scala if that helps improve my productivity
    Natasha:
    • I’ll perform the data exploration and analysis
    • I know Python and R, but I am not familiar enough with Java or Scala
    • I like pandas and NumPy. I’m OK with learning Spark but expect the same level of APIs
    • I need to work iteratively with the data
    Speech bubbles: “I’ll need to do some data exploration too.” “I’ll need APIs to access my data.”
  11. How can we collaborate? Notebooks?
  12. Open source notebooks
    (Diagram labels: Text Annotations Code Data Visualizations Widgets Output)
    • Web-based UI for running Apache Spark console commands
    • Easy, no-install Spark accelerator
    • Best way to start working with Apache Spark
    • Multiple flavors
        • Jupyter
        • Zeppelin
    • Local or cloud hosted
        • IBM Data Science Experience
        • Databricks
  13. What is Jupyter?
    (Diagram labels: Browser, Kernel, Code, Output)
    Image: https://www.bluetrack.com/uploads/items_images/kernel-of-corn-stress-balls1_thumb.jpg?r=1
    • "Open source, interactive data science and scientific computing"
    • Formerly IPython
    • Large, open, growing community and ecosystem
    • Very popular
        • ~2 million users for IPython
        • $6m in funding in 2015
        • 200 contributors to notebook subproject alone
        • 275,000 public notebooks on GitHub
  14. What is Spark?
    (Diagram, Batch Job (spark-submit): Spark Application (driver), Master (cluster manager), Spark Cluster, Worker Node, Worker Node, ...)
    (Diagram, Interactive Notebook: Notebook Server, Browser, Kernel, Master (cluster manager), Spark Cluster, Worker Node, Worker Node, ...)
    • RDD Partitioning
    • Task packaging and dispatching
    • Worker node scheduling
  15. Big Data Analysis
    (Diagram labels: code, results, Kernel with Spark support, Worker, Worker, ..., Worker)
    • Services: Cognitive, …
    • Libraries: Statistics, Math, Machine Learning, Plotting
    • Data (flat files, relational database, NoSQL database, …)
  16. Notebooks are powerful data science tools
    “But they seem complicated for developers like me” – BEN
  17. Enter PixieDust…
    Open Source Python helper library for Jupyter Notebooks
    https://github.com/ibm-watson-data-lab/pixiedust
    • Visualize data (e.g., tables, charts, maps)
    • Full stack app development with PixieApps
    • Download/export data
    • Use Scala directly in a Notebook
    • Install packages into Notebook
    • Spark job progress monitor
    • Extensible
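
The "Visualize data" feature on this slide can be illustrated with a minimal sketch. It assumes PixieDust and pandas are installed and that the code runs in a Jupyter notebook cell; `pixiedust.display.display` is assumed to be the library's display entry point, and the sample rows, `text_table`, and `show` are hypothetical names introduced only for this example. Where PixieDust is not available, the sketch falls back to a plain-text table.

```python
import importlib.util

# Hypothetical sample data standing in for per-hashtag sentiment scores.
SAMPLE_ROWS = [
    {"hashtag": "#spark", "tweets": 120, "avg_sentiment": 0.42},
    {"hashtag": "#jupyter", "tweets": 95, "avg_sentiment": 0.61},
    {"hashtag": "#python", "tweets": 210, "avg_sentiment": 0.55},
]


def text_table(rows):
    """Fallback renderer: a tab-separated table with a header row."""
    headers = list(rows[0])
    lines = ["\t".join(headers)]
    for row in rows:
        lines.append("\t".join(str(row[h]) for h in headers))
    return "\n".join(lines)


def show(rows):
    """Render rows with PixieDust if it is installed, else print a text table."""
    needed = ("pixiedust", "pandas")
    if all(importlib.util.find_spec(name) is not None for name in needed):
        import pandas as pd
        # Assumption: PixieDust exposes display() in this module; in a
        # notebook it renders an interactive table/chart view of the data.
        from pixiedust.display import display
        display(pd.DataFrame(rows))
    else:
        print(text_table(rows))


if __name__ == "__main__":
    show(SAMPLE_ROWS)
```

In a notebook, the chart type and options would then be chosen from the rendered output rather than in code, which is the "no coding" exploration the talk describes.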
  18. What about the Line of Business User?
    “Expressing everything in code is nice, but LOB users don’t want to run code” – NATASHA
  19. Enter PixieApps
    • Python classes that extend PixieDust, letting you write a UI for your analytics
    • Easy to build: mostly HTML and CSS with some custom attributes (micro-format style)
    • With PixieApps you can:
        • Create different HTML views with routes to invoke them
        • Invoke Python scripts from user interactions
        • Run in the notebook cell output or in a dialog
    • Use cases:
        • Dashboards
        • Data browsers
        • Data pipeline management
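
The route idea on this slide (different HTML views, selected by the current state) can be sketched in plain Python. This is a toy stand-in, not PixieDust's implementation or API: `route`, `MiniApp`, `render`, `SentimentDashboard`, and the `data-view` attribute are hypothetical names, chosen only to show how a decorator can map state to views.

```python
def route(**conditions):
    """Mark a method as the view to use when all conditions match the state."""
    def decorator(func):
        func._route = conditions
        return func
    return decorator


class MiniApp:
    """Toy dispatcher: picks the most specific view whose route matches."""

    def render(self, **state):
        candidates = []
        for name in dir(type(self)):
            func = getattr(type(self), name, None)
            conditions = getattr(func, "_route", None)
            if conditions is None:
                continue  # not a routed view
            if all(state.get(k) == v for k, v in conditions.items()):
                candidates.append((len(conditions), name))
        if not candidates:
            raise LookupError("no route matches the given state")
        # The route with the most conditions wins; the empty route is the default.
        _, best = max(candidates)
        return getattr(self, best)(**state)


class SentimentDashboard(MiniApp):
    @route()
    def main(self, **state):
        # Default view: a button that would request the chart view.
        return '<button data-view="chart">Show chart</button>'

    @route(view="chart")
    def chart(self, **state):
        hashtag = state.get("hashtag", "all hashtags")
        return f"<div>Sentiment chart for {hashtag}</div>"


if __name__ == "__main__":
    app = SentimentDashboard()
    print(app.render())
    print(app.render(view="chart", hashtag="#spark"))
```

Calling `render()` with no state returns the default view, while `render(view="chart", hashtag="#spark")` returns the chart view. Preferring the route with the most matching conditions keeps the empty route available as a fallback.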
  20. Twitter Sentiment analysis with Watson Tone Analyzer and Watson Personality Insights
    https://github.com/ibm-watson-data-lab/pixiedust_incubator/tree/master/twitterdemo
  21. Architecture

  22. PixieDust demo
    Twitter Sentiment with Watson and PixieDust
    https://github.com/ibm-watson-data-lab/pixiedust/blob/master/notebook/Twitter%20Sentiment%20with%20Watson%20and%20Pixiedust.ipynb
  23. Updating the VP
    “This is great, but C-Suite executives need to be able to select filters and see real-time charts without writing code!”
    Courtesy of Charles Forerunner • https://unsplash.com/photos/3fPXt37X6UQ
  24. PixieApp demo
    Sentiment Analysis of Twitter Hashtags with Spark
    https://github.com/ibm-watson-data-lab/pixiedust/blob/master/notebook/Twitter%20Sentiment%20with%20Watson%20and%20Pixiedust.ipynb
    https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585
  26. Thanks
    • PixieDust: https://github.com/ibm-watson-data-lab/pixiedust
    • Project Jupyter: http://jupyter.org/
    • IBM Data Science Experience: http://datascience.ibm.com (free 30-day trial)
    • Me: rrsingh@us.ibm.com • Tweet @rajrsingh
    • Resources
        • https://github.com/ibm-watson-data-lab/pixiedust
        • https://ibm-watson-data-lab.github.io/pixiedust
        • https://medium.com/ibm-watson-data-lab/i-am-not-a-data-scientist-efe7ca6ceba2
        • https://spark.apache.org
        • https://www.ibm.com/us-en/marketplace/spark-as-a-service
        • http://datascience.ibm.com
        • https://www.ibm.com/watson/developercloud/tone-analyzer.html
        • https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585
        • https://gist.github.com/vabarbosa/76d08b1cc6f80d5fc80856a1f3f32014
        • https://gist.github.com/vabarbosa/dca176c3a68f0c101cbe475571e56bf7
        • https://ibm.biz/pixiedustvis
        • https://ibm.biz/pixiedustlab