Slide 1

Slide 1 text

Python in Hadoop Ecosystem: Blaze and Bokeh
Presented by: Andy R. Terrel

Slide 2

Slide 2 text

Python and Hadoop with Blaze and Bokeh, SC14 / PyHPC 2014
Intro | Large scale data analytics | Interactive data visualization | A practical example
About Continuum Analytics
http://continuum.io/
We build technologies that enable analysts and data scientists to answer questions from the data all around us.
Areas of Focus
• Software solutions
• Consulting
• Training
Committed to Open Source
• Anaconda: Free Python distribution
• Numba, Conda, Blaze, Bokeh, dynd
• Sponsor

Slide 3

Slide 3 text

About Andy
Andy R. Terrel (@aterrel)
Chief Scientist, Continuum Analytics
President, NumFOCUS
Background:
• High Performance Computing
• Computational Mathematics
• President, NumFOCUS foundation
Experience analyzing diverse datasets:
• Finance
• Simulations
• Web data
• Social media

Slide 4

Slide 4 text

About this talk
Visualizing Data with Blaze and Bokeh
Objective: Introduction to large-scale data analytics and interactive visualization
Structure:
1. Discussion of Hadoop
2. Large scale data analytics - Blaze
3. Interactive data visualization - Bokeh

Slide 5

Slide 5 text

Large scale data analytics - An Overview
Areas: BI - DB, DM/Stats/ML, Scientific Computing, Distributed Systems
Diagram labels: Numba, bcolz, RHadoop

Slide 7

Slide 7 text

Base Hadoop Stack

Slide 8

Slide 8 text

Berkeley Data Science Stack

Slide 9

Slide 9 text

Where is Python?
• Hadoop Streaming
• PySpark
• Pig

Slide 11

Slide 11 text

"At my company X, we have peta/terabytes of data, just lying around, waiting for someone to explore it" - someone at PyTexas
Let's make it easier for users to explore and extract useful insights out of data.
• Anaconda: Free enterprise-ready Python distribution
• Conda: Package manager
• Blaze: Scale
• Bokeh: Interactive data visualizations
• Numba: Power to speed up
• Wakari: Share and deploy

Slide 12

Slide 12 text

Blaze

Slide 13

Slide 13 text

Data Pain
• Dealing with data applications has numerous pain points
 - Hundreds of data formats
 - Basic programs expect all data to fit in memory
 - Data analysis pipelines constantly change from one form to another
 - Sharing analysis carries significant overhead to configure systems
 - Parallelizing analysis requires an expert in a particular distributed computing stack

Slide 14

Slide 14 text

Blaze
Source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/

Slide 15

Slide 15 text

Blaze
• Connecting technologies to users
• Connecting technologies to each other
Diagram labels: Distributed Systems, Scientific Computing, BI - DB, DM/Stats/ML, bcolz, hdf5

Slide 16

Slide 16 text

Blaze
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 17

Slide 17 text

Blaze Architecture
• Flexible architecture to accommodate exploration
• Use compilation of deferred expressions to optimize data interactions
Diagram labels: API, Deferred Expr, Compilers, Interpreters, Data, Compute

Slide 18

Slide 18 text

Deferred Expr
Blaze Expr
• Lazy computation to minimize data movement
• Simple DAG for compilation to
 - parallel application
 - distributed memory
 - static optimizations
Diagram labels: temps.hdf5, nasdaq.sql, tweets.json, Join by date, Select NYC, Find Tech Selloff, Plot
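A minimal sketch of the deferred-expression idea in plain Python may help here. It is not Blaze code: the `Symbol`, `Op`, and `evaluate` names and the sample rows are hypothetical, chosen only to show an expression being built as a small DAG first and touching data only when it is finally evaluated.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Symbol:
    """A named placeholder for data that is bound only at compute time."""
    name: str


@dataclass(frozen=True)
class Op:
    """A node in the expression DAG: an operation applied to child nodes."""
    label: str
    func: Callable[..., Any]
    children: Tuple[Any, ...]


def evaluate(node, bindings: Dict[Symbol, Any]):
    """Walk the DAG bottom-up; no data is touched before this call."""
    if isinstance(node, Symbol):
        return bindings[node]
    args = [evaluate(child, bindings) for child in node.children]
    return node.func(*args)


# Build the expression first. Nothing is computed here.
temps = Symbol("temps")
hot = Op("filter", lambda rows: [r for r in rows if r["temp"] > 30], (temps,))
count = Op("count", len, (hot,))

# Bind concrete (made-up) data and run the whole pipeline in one step.
rows = [
    {"city": "NYC", "temp": 35},
    {"city": "NYC", "temp": 12},
    {"city": "SF", "temp": 31},
]
result = evaluate(count, {temps: rows})
print(result)
```

Because the expression exists as data before it runs, a compiler could in principle inspect or rewrite it (for example, pushing the filter down to the storage layer) before any rows are moved.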

Slide 19

Slide 19 text

Blaze.expressions
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 20

Slide 20 text

Blaze Data
• Single interface for data layers
• Composition of different formats
• Simple API to add custom data formats
Diagram labels: SQL, CSV, HDFS, JSON, Mem, Custom, HDF5

Slide 21

Slide 21 text

Blaze.data
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 22

Slide 22 text

Blaze Compute
• Computation abstraction over numerous data libraries
• Simple multi-dispatched visitors to implement new backends
• Allows plumbing between stacks to be seamless to user
Diagram labels: DyND, Pandas, PyTables, Spark
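The "multi-dispatched visitors" bullet can be illustrated with a small, self-contained sketch. This is an assumption-laden toy, not Blaze's implementation: the `dispatch` decorator, the `compute` function defined here, and the `Sum` and `Count` expression classes are all hypothetical. The point is only that the implementation is chosen from the runtime types of both the expression and the data container, so supporting a new backend means registering new functions rather than editing existing ones.

```python
from typing import Any, Callable, Dict, Tuple

# Registry keyed on (expression type, data container type).
_registry: Dict[Tuple[type, type], Callable[..., Any]] = {}


def dispatch(expr_type: type, data_type: type):
    """Register an implementation for one (expression, backend) pair."""
    def decorator(func):
        _registry[(expr_type, data_type)] = func
        return func
    return decorator


def compute(expr, data):
    """Pick the implementation from the runtime types of both arguments."""
    for e_cls in type(expr).__mro__:
        for d_cls in type(data).__mro__:
            impl = _registry.get((e_cls, d_cls))
            if impl is not None:
                return impl(expr, data)
    raise NotImplementedError((type(expr).__name__, type(data).__name__))


class Sum:
    """Hypothetical expression node: add up all values."""


class Count:
    """Hypothetical expression node: count all values."""


@dispatch(Sum, list)
def _sum_list(expr, data):
    return sum(data)


@dispatch(Count, list)
def _count_list(expr, data):
    return len(data)


@dispatch(Sum, dict)
def _sum_dict(expr, data):
    # A second "backend": values stored in a mapping instead of a list.
    return sum(data.values())


list_total = compute(Sum(), [1, 2, 3])
dict_total = compute(Sum(), {"a": 10, "b": 5})
list_count = compute(Count(), [1, 2, 3])
```

Calling `compute(Count(), {"a": 1})` raises `NotImplementedError` in this sketch, because no `(Count, dict)` pair has been registered.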

Slide 23

Slide 23 text

Blaze.compute
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Slide 24

Slide 24 text

Blaze Example - Counting Weblinks
Common Blaze Code
    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())
    # Data Load
    idx, arc = load_data()
    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})
    in_deg = dict(ans)
    in_deg[u'blogspot.com']
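To make clear what the expression above computes, here is a plain-Python sketch of the same join, group-by, and count. It does not use Blaze; the toy `idx` and `arc` rows are invented for illustration and only mirror the column layout (`name`, `node_id`) and (`node_out`, `node_id`) from the slide.

```python
from collections import Counter

# Toy stand-ins for the two tables (values are made up for illustration).
idx = [("blogspot.com", 1), ("example.org", 2), ("python.org", 3)]  # (name, node_id)
arc = [(2, 1), (3, 1), (1, 3), (2, 3), (3, 2), (2, 1)]              # (node_out, node_id)

# Join: look up the name for each arc's target node_id.
name_of = {node_id: name for name, node_id in idx}

# Group by name and count rows, giving the number of incoming links per site.
in_deg = Counter(
    name_of[node_id] for _node_out, node_id in arc if node_id in name_of
)

print(in_deg["blogspot.com"])
```

Each arc row contributes one incoming link to the site named by its `node_id`, so the counter holds the in-degree of every indexed site.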

Slide 25

Slide 25 text

Blaze Example - Counting Weblinks
load_data
Using Spark + HDFS
    from pyspark import SparkContext
    sc = SparkContext("local", "Simple App")
    # Each line is tab-separated: name, node_id
    idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
    idx = idx.map(lambda x: x.split('\t'))\
             .map(lambda x: [x[0], int(x[1])])
    # Each line is tab-separated: node_out, node_id
    arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
    arc = arc.map(lambda x: x.split('\t'))\
             .map(lambda x: [int(x[0]), int(x[1])])
Using Pandas + Local Disk
    from pandas import DataFrame
    with open("example_index.txt") as f:
        idx = [ln.strip().split('\t') for ln in f.readlines()]
    idx = DataFrame(idx, columns=['name', 'node_id'])
    with open("example_arcs.txt") as f:
        arc = [ln.strip().split('\t') for ln in f.readlines()]
    arc = DataFrame(arc, columns=['node_out', 'node_id'])

Slide 26

Slide 26 text

Blaze.API Table
Using the interactive Table object, we can interact with a variety of computational backends with the familiarity of a local DataFrame.

Slide 27

Slide 27 text

Blaze.API Table

Slide 28

Slide 28 text

Blaze.API Migrations - into
The into function makes it easy to move data from one container type to another.
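The idea of moving data between container types through one entry point can be sketched with the standard library alone. The `convert` and `register` names below are hypothetical stand-ins, not Blaze's `into`, and the supported conversions (row lists, CSV text, dict) are chosen only to keep the example runnable.

```python
import csv
import io
from typing import Any, Callable, Dict, Tuple

# Converters keyed on (target type, source type).
_converters: Dict[Tuple[type, type], Callable[[Any], Any]] = {}


def register(target: type, source: type):
    """Register a function that turns `source` data into `target` data."""
    def decorator(func):
        _converters[(target, source)] = func
        return func
    return decorator


def convert(target: type, source: Any):
    """Hypothetical single entry point: convert(target_type, data)."""
    try:
        func = _converters[(target, type(source))]
    except KeyError:
        raise NotImplementedError(
            f"{type(source).__name__} -> {target.__name__}"
        ) from None
    return func(source)


@register(str, list)
def _rows_to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


@register(list, str)
def _csv_to_rows(text):
    return [row for row in csv.reader(io.StringIO(text))]


@register(dict, list)
def _rows_to_dict(rows):
    return {key: value for key, value in rows}


rows = [["blogspot.com", "3"], ["python.org", "2"]]
as_csv = convert(str, rows)          # row list -> CSV text
round_trip = convert(list, as_csv)   # CSV text -> row list
as_dict = convert(dict, rows)        # row list -> mapping
```

The caller names only the target type; which conversion runs is decided by the type of the data passed in.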

Slide 29

Slide 29 text

Blaze notebooks

Slide 30

Slide 30 text

Why I like using Blaze
- Syntax is very similar to Pandas
- Easy to scale
- Easy to find the best computational backend for a particular dataset
- Easy to adapt my code if someone hands me a dataset in a different format/backend

Slide 31

Slide 31 text

Want to learn more about Blaze?
Free Webinar:
http://www.continuum.io/webinars/getting-started-with-blaze
Blogposts:
http://continuum.io/blog/blaze-expressions
http://continuum.io/blog/blaze-migrations
http://continuum.io/blog/blaze-hmda
Docs and source code:
http://blaze.pydata.org/
https://github.com/ContinuumIO/blaze

Slide 32

Slide 32 text

Data visualization - An Overview
• Results presentation / Visual analytics
• Static / Interactive
• Small datasets / Large datasets
• Traditional plots / Novel graphics

Slide 33

Slide 33 text

Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• Matplotlib compatibility
• No need to write JavaScript
http://bokeh.pydata.org/
https://github.com/ContinuumIO/bokeh

Slide 34

Slide 34 text

Bokeh - Interactive, Visual analytics
• Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)

Slide 35

Slide 35 text

Bokeh - Interactive, Visual analytics
• Widgets and dashboards

Slide 36

Slide 36 text

Bokeh - Interactive, Visual analytics
• Crossfilter

Slide 37

Slide 37 text

Bokeh - Large datasets
Server-side downsampling and abstract rendering
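As a rough illustration of what downsampling before rendering can mean, the sketch below keeps only the minimum and maximum of each bucket of a long series, so peaks and troughs survive while far fewer points are sent to the browser. This is one common decimation technique, not a description of Bokeh's own server-side algorithm; the function name `downsample_minmax` and the synthetic signal are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def downsample_minmax(ys: Sequence[float], n_buckets: int) -> List[Tuple[int, float]]:
    """Keep the min and max point of each bucket as (index, value) pairs."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(ys)
    if n <= 2 * n_buckets:
        # Already small enough: return every point unchanged.
        return list(enumerate(ys))
    kept = []
    for b in range(n_buckets):
        start = b * n // n_buckets
        stop = (b + 1) * n // n_buckets
        bucket = range(start, stop)
        lo = min(bucket, key=lambda i: ys[i])
        hi = max(bucket, key=lambda i: ys[i])
        # Emit in index order; a flat bucket contributes a single point.
        for i in sorted({lo, hi}):
            kept.append((i, ys[i]))
    return kept


# Synthetic series with 10,000 points (values are made up for illustration).
signal = [((i * 37) % 101) / 10.0 for i in range(10_000)]
reduced = downsample_minmax(signal, n_buckets=100)
```

The reduced series has at most two points per bucket and preserves the global extremes of the original, which is usually what matters for a line plot drawn at screen resolution.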

Slide 38

Slide 38 text

Bokeh - No JavaScript

Slide 39

Slide 39 text

Thank you! :)