Slide 1

Slide 1 text

Introduction to large scale data analytics and interactive visualizations in the browser with Blaze and Bokeh
Christine Doig, PyConES 2014

Slide 2

Slide 2 text

Introduction: About me, Continuum Analytics and this talk

Slide 3

Slide 3 text

About me
Christine Doig (@ch_doig)
Data Scientist, Continuum Analytics

Education background:
• Industrial Engineering, UPC
• Quantitative Techniques for Financial Markets, UPC
• Data Mining and Business Intelligence, UPC

Professional experience:
• Energy - E.ON
• Manufacturing - A&A, P&G
• Banking - La Caixa
• Social media

Experience analyzing diverse datasets using a diverse set of tools: Matlab, Excel, SAS, SQL, R, Python

Talks/events: PyLadiesATX, APUG, PyBCN, PyLadiesBCN, PyTexas, PyConES…

Slide 4

Slide 4 text

About Continuum Analytics
http://continuum.io/
Moving Expertise to Data

Committed to Open Source
• Anaconda: Free Python distribution
• Projects: Conda, Blaze, Numba, Bokeh
• Contributors: NumPy, SciPy, Chaco, SymPy
• Sponsor: PyTexas, SciPy…

Commercial
• Anaconda Server: Enterprise deployment
• Wakari: Cloud Data Analytics
• Add-ons: IOPro, NumbaPro, Accelerate
• Consulting and Training Services

Slide 5

Slide 5 text

About this talk

Objective
Introduction to large scale data analytics and interactive visualizations in the browser

Structure
0. Intro
- About me
- About Continuum Analytics
- About this talk
1. Development environment
- Conda
- Binstar
2. Large scale data analytics
- Overview
- Intro to Blaze
- Examples
3. Interactive data visualization
- Overview
- Intro to Bokeh
- Examples

Slide 6

Slide 6 text

Development environment: Conda and Binstar

Slide 7

Slide 7 text

Conda
http://conda.pydata.org/
• A cross-platform Python-agnostic binary package manager:
  $ conda install scipy
  $ conda install julia
  $ conda install scala
  $ conda install nodejs
  $ conda install mongodb
  $ conda install python=3.4
• homebrew + pip + virtualenv -> conda
• Available:
  Anaconda: https://store.continuum.io/cshop/anaconda/
  Miniconda (conda + python): http://conda.pydata.org/miniconda.html
• Using Conda with Travis CI: http://conda.pydata.org/docs/travis.html

Slide 8

Slide 8 text

Binstar
https://binstar.org/
Package management service to make software development, release, and maintenance easy.
http://docs.binstar.org/

Slide 10

Slide 10 text

Large scale data analytics: Overview, Intro to Blaze and Examples

Slide 11

Slide 11 text

What’s a Data Scientist?

Slide 12

Slide 12 text

What’s a Data Scientist?
• Solid hands-on experience in developing analytical solutions using statistical tools (e.g. R, SAS, or similar)
• Experience in implementing Machine Learning systems which may include classification, clustering, natural language processing and time series analysis.
• Hands-on experience in database management (MS SQL, MySQL, PostgreSQL…)
• Solid hands-on coding experience in Python, Java, C++, or similar
• Experience in dealing with large data sets and a solid understanding of Big Data technologies and applications (AWS, Hadoop, MapReduce, Hive, HBase, etc.).
• Sound presentation skills, visualizing complicated data science results in Tableau, MicroStrategy, or similar
• Comfortable working with front-end development technologies, including: HTML, CSS, JavaScript, D3.js, Django, etc.

Slide 13

Slide 13 text

“Data Mining”: It’s not just about modeling...
…it’s also about business understanding, data understanding, data preparation, evaluation and deployment.

[Process diagram: CRISP-DM, Cross Industry Standard Process for Data Mining. Source: Wikipedia]

“CRISP-DM, still the top methodology for analytics, data mining, or data science projects” [1].

[1] http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

Slide 27

Slide 27 text

Let’s make it easier for users to explore and extract useful insights out of data.
• Anaconda: Free enterprise-ready Python distribution
• Conda: Package manager
• Numba: Power to speed up
• Blaze: Scale
• Bokeh: Interactive data visualizations
• Wakari: Share and deploy

Slide 28

Slide 28 text

“Big Data”: It’s not just about volume...
…it’s also about variety:
- storage mechanisms
- processing engines
- data structures
- data formats
- data location
- data sizes
- user skills
- …
Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

Slide 34

Slide 34 text

Large scale data analytics - An Overview
• BI - DB
• DM/Stats/ML
• Scientific Computing
• Distributed Systems
Numba, bcolz
Analysts?

Slide 35

Slide 35 text

Blaze
Source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/

Slide 39

Slide 39 text

Blaze
Blaze is a NumPy/Pandas interface to big data systems like SQL, HDFS, and Spark.

Motivation:
• NumPy/Pandas limited by memory.
• Picking up new projects/technologies is costly.

Usability:
- A common interface to a variety of backends
- Serve data
- Interactive exploration
- Data migrations

Slide 40

Slide 40 text

Blaze
• Connecting technologies to users
• Connecting technologies to each other
Diagram labels: Distributed Systems, Scientific Computing, BI - DB, DM/Stats/ML, hdf5

Slide 41

Slide 41 text

Blaze
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 42

Slide 42 text

Blaze.expressions
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy
TableSymbol -> Symbol (Array, nested structures… not just Tables)

Slide 43

Slide 43 text

Blaze.data
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 47

Slide 47 text

Blaze.compute
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy
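
The separation between data storage, abstract expressions and computational backends can be illustrated with a small standard-library sketch. This is not Blaze code and not how Blaze is implemented: the `Filter` and `SumBy` classes, the `compute_python` and `compute_sql` functions and the `accounts` table are all hypothetical names chosen for the illustration. The idea it tries to show is that one expression describes the query, and each backend decides how to run it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Abstract expression: keep rows where `column` > `minimum`."""
    column: str
    minimum: int


@dataclass(frozen=True)
class SumBy:
    """Abstract expression: group the filtered rows by `key` and sum `value`."""
    child: Filter
    key: str
    value: str


COLUMNS = ("name", "amount")
ROWS = [("Alice", 100), ("Bob", -200), ("Alice", 300), ("Bob", 50)]


def compute_python(expr, rows):
    """Backend 1: evaluate the expression on a plain list of tuples."""
    col = COLUMNS.index
    totals = {}
    for row in rows:
        if row[col(expr.child.column)] > expr.child.minimum:
            key = row[col(expr.key)]
            totals[key] = totals.get(key, 0) + row[col(expr.value)]
    return totals


def compute_sql(expr, conn, table):
    """Backend 2: translate the same expression to SQL and let SQLite run it.

    Column and table names are interpolated directly, so this toy version
    is only safe with trusted identifiers.
    """
    query = (
        f"SELECT {expr.key}, SUM({expr.value}) FROM {table} "
        f"WHERE {expr.child.column} > ? GROUP BY {expr.key}"
    )
    return dict(conn.execute(query, (expr.child.minimum,)).fetchall())


# One expression, written once, independent of where the data lives.
expr = SumBy(Filter("amount", 0), key="name", value="amount")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", ROWS)

result_python = compute_python(expr, ROWS)
result_sql = compute_sql(expr, conn, "accounts")
print(result_python == result_sql)
```

Both backends receive the same `expr` object; only the evaluation strategy changes, which is the property the three-column diagram describes.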

Slide 48

Slide 48 text

Blaze.API
Table -> Data
Using the interactive Table -> Data object, we can interact with a variety of computational backends with the familiarity of a local DataFrame.

Slide 53

Slide 53 text

Blaze.API
Migrations - into
The into function makes it easy to move data from one container type to another.
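
The idea of moving data between container types through a single function can be sketched with a toy dispatcher. This is a stand-in written with the standard library only, not Blaze's `into`: the `convert` decorator, the `_converters` registry and the two converter functions are hypothetical, and the real function supports far more container types than shown here.

```python
import csv
import io

# Registry mapping (target type, source type) to a converter function.
_converters = {}


def convert(target, source):
    """Register a converter from `source` type to `target` type."""
    def register(func):
        _converters[(target, source)] = func
        return func
    return register


def into(target, source):
    """Toy version: move `source` into a container of type `target`."""
    func = _converters[(target, type(source))]
    return func(source)


@convert(list, io.StringIO)
def csv_to_rows(buf):
    """CSV text -> list of row dictionaries (values stay as strings)."""
    reader = csv.reader(buf)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


@convert(dict, list)
def rows_to_columns(rows):
    """List of row dictionaries -> dictionary of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


csv_text = io.StringIO("name,amount\nAlice,100\nBob,200\n")
rows = into(list, csv_text)     # CSV -> row-oriented container
columns = into(dict, rows)      # rows -> column-oriented container
print(columns)
```

The caller only names the target type; which conversion runs is decided from the type of the source, so adding a new container means registering one more converter rather than changing calling code.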

Slide 57

Slide 57 text

Blaze notebooks

Slide 58

Slide 58 text

Why do I like using Blaze?
- Syntax is very similar to Pandas
- Easy to scale
- Easy to find the best computational backend for a particular dataset
- Easy to adapt my code if someone hands me a dataset in a different format/backend
- Usability

Slide 59

Slide 59 text

Interactive data visualizations: Overview, Intro to Bokeh and Examples

Slide 60

Slide 60 text

Data visualization - An Overview
• Results presentation / Visual analytics
• Static / Interactive
• Small datasets / Large datasets
• Traditional plots / Novel graphics

Slide 61

Slide 61 text

Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• Matplotlib compatibility
• No need to write JavaScript
http://bokeh.pydata.org/
https://github.com/bokeh/bokeh

Slide 62

Slide 62 text

Bokeh - Interactive, Visual analytics
• Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)

Slide 63

Slide 63 text

Bokeh - Interactive, Visual analytics
• Widgets and dashboards

Slide 64

Slide 64 text

Bokeh - Interactive, Visual analytics
• Crossfilter

Slide 65

Slide 65 text

Bokeh - Large datasets
Server-side downsampling and abstract rendering
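
To give a feel for what downsampling before rendering means, here is a minimal sketch of one common strategy, min/max bucket decimation, in plain Python. It is an assumption chosen for illustration and not Bokeh's own algorithm; the function name `downsample_minmax` and its parameters are hypothetical. Abstract rendering is not covered by this sketch.

```python
import math


def downsample_minmax(xs, ys, n_buckets):
    """Reduce a long series by keeping the lowest and highest point
    of each bucket, so that narrow peaks survive the reduction."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(xs)
    if n <= 2 * n_buckets:
        # Already small enough: nothing to drop.
        return list(xs), list(ys)
    out_x, out_y = [], []
    for b in range(n_buckets):
        start = b * n // n_buckets
        stop = (b + 1) * n // n_buckets
        indices = range(start, stop)
        lowest = min(indices, key=ys.__getitem__)
        highest = max(indices, key=ys.__getitem__)
        # Emit the two extremes in their original order.
        for i in sorted({lowest, highest}):
            out_x.append(xs[i])
            out_y.append(ys[i])
    return out_x, out_y


# A smooth signal with one narrow spike that naive striding could miss.
xs = list(range(1000))
ys = [math.sin(x / 50) for x in xs]
ys[500] = 10.0

small_x, small_y = downsample_minmax(xs, ys, n_buckets=10)
print(len(small_x), max(small_y))
```

Only the reduced series would need to be sent to the browser, while the spike at `x = 500` is still present in the output.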

Slide 66

Slide 66 text

Bokeh - No JavaScript
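
As a rough illustration of the "no JavaScript" point, the sketch below builds a small interactive plot entirely from Python and renders it to a standalone HTML page. It assumes a reasonably recent Bokeh installation; the data values and the title are made up, and the API available at the time of the talk may have differed in details.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Build the plot in Python; the toolbar entries are requested by name.
p = figure(title="Example scatter", tools="pan,wheel_zoom,reset,save")
p.scatter([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=10)

# Render a self-contained HTML document that loads BokehJS from a CDN.
html = file_html(p, CDN, "Example scatter")
print(len(html) > 0)
```

The resulting `html` string can be written to a file and opened in a browser, where the pan, zoom, reset and save tools work without any hand-written JavaScript.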

Slide 68

Slide 68 text

Bokeh examples

Slide 69

Slide 69 text

Questions?

Slide 70

Slide 70 text

Thank you! :)