

Christine Doig
November 09, 2014

Introduction to large scale data analytics and interactive visualizations in the browser

Introduction to Large Scale Data Analytics and Interactive Visualization in the Browser with Blaze and Bokeh.

http://bokeh.pydata.org/
http://blaze.pydata.org/

PyConES 2014, Zaragoza


Transcript

  1. Introduction to large scale data analytics and interactive visualizations in the browser with Blaze and Bokeh

    Christine Doig, PyConES 2014
  2. About me

    Navigation: Intro, Development environment, Large scale data analytics, Interactive data visualization

    Christine Doig, @ch_doig
    Data Scientist, Continuum Analytics
    [email protected]

    Education background:
    • Industrial Engineering, UPC
    • Quantitative Techniques for Financial Markets, UPC
    • Data Mining and Business Intelligence, UPC

    Professional experience:
    • Energy - E.ON
    • Manufacturing - A&A, P&G
    • Banking - La Caixa
    • Social media

    Experience analyzing diverse datasets using a diverse set of tools: Matlab, Excel, SAS, SQL, R, Python

    Talks/events: PyLadiesATX, APUG, PyBCN, PyLadiesBCN, PyTexas, PyConES…
  3. About Continuum Analytics

    http://continuum.io/
    Moving Expertise to Data

    Committed to Open Source:
    • Anaconda: Free Python distribution
    • Projects: Conda, Blaze, Numba, Bokeh
    • Contributors: NumPy, SciPy, Chaco, SymPy
    • Sponsor: PyTexas, SciPy…

    Commercial:
    • Anaconda Server: Enterprise deployment
    • Wakari: Cloud Data Analytics
    • Add-ons: IOPro, NumbaPro, Accelerate
    • Consulting and Training Services
  4. About this talk

    Objective: Introduction to large scale data analytics and interactive visualizations in the browser

    Structure:
    0. Intro
       - About me
       - About Continuum Analytics
       - About this talk
    1. Development environment
       - Conda
       - Binstar
    2. Large scale data analytics
       - Overview
       - Intro to Blaze
       - Examples
    3. Interactive data visualization
       - Overview
       - Intro to Bokeh
       - Examples
  5. Conda

    http://conda.pydata.org/
    • A cross-platform, Python-agnostic binary package manager:
      $ conda install scipy
      $ conda install julia
      $ conda install scala
      $ conda install nodejs
      $ conda install mongodb
      $ conda install python=3.4
    • homebrew + pip + virtualenv -> conda
    • Available:
      Anaconda: https://store.continuum.io/cshop/anaconda/
      Miniconda (conda + python): http://conda.pydata.org/miniconda.html
    • Using Conda with Travis CI: http://conda.pydata.org/docs/travis.html
  6. Binstar

    https://binstar.org/
    Package management service to make software development, release, and maintenance easy.
    http://docs.binstar.org/
  8. What’s a Data Scientist?

    • Solid hands-on experience in developing analytical solutions using statistical tools (e.g. R, SAS, or similar)
    • Experience in implementing Machine Learning systems which may include classification, clustering, natural language processing and time series analysis.
    • Hands-on experience in database management (MS SQL, MySQL, PostgreSQL…)
    • Solid hands-on coding experience in Python, Java, C++, or similar
    • Experience in dealing with large data sets and a solid understanding of Big Data technologies and applications (AWS, Hadoop, MapReduce, Hive, Hbase, etc).
    • Sound presentation skills, visualizing complicated data science results in Tableau, Microstrategy, or similar
    • Comfortable working with front-end development technologies, including: HTML, CSS, JavaScript, D3.js, Django, etc.
  9. “Data Mining”: It’s not just about modeling...

    …it’s also about business understanding, data understanding, data preparation, evaluation and deployment.

    Process diagram: CRISP-DM, the Cross Industry Standard Process for Data Mining. Source: Wikipedia

    “CRISP-DM, still the top methodology for analytics, data mining, or data science projects” [1].

    [1] http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
  10-22. Let’s make it easier for users to explore and extract useful insights out of data.

    • Anaconda: Free enterprise-ready Python distribution
    • Conda: Package manager
    • Numba: Power to speed up
    • Blaze: Scale
    • Bokeh: Interactive data visualizations
    • Wakari: Share and deploy
  23. “Big Data”: It’s not just about volume...

    …it’s also about variety:
    - storage mechanisms
    - processing engines
    - data structures
    - data formats
    - data location
    - data sizes
    - user skills
    - …

    Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
  24-29. Large scale data analytics - An Overview

    • BI - DB
    • DM/Stats/ML
    • Scientific Computing
    • Distributed Systems

    Numba, bcolz

    Analysts?
  30. Blaze

    Blaze is a NumPy/Pandas interface to big data systems like SQL, HDFS, and Spark.

    Motivation:
    • NumPy/Pandas limited by memory.
    • Picking up new projects/technologies is costly.

    Usability:
    - A common interface to a variety of backends
    - Serve data
    - Interactive exploration
    - Data migrations
  31. Blaze

    Connecting technologies to users. Connecting technologies to each other.

    Diagram labels: Distributed Systems, Scientific Computing, BI - DB, DM/Stats/ML, Blaze, hdf5
  32. Blaze

    Diagram with three parts: Data Storage, Abstract expressions, Computational backend.
    Diagram labels: csv, HDF5, bcolz, DataFrame, HDFS, selection, filter, group by, join, column wise, Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy, json
  33. Blaze.expressions (same diagram)

    TableSymbol -> Symbol (Array, nested structures… not just Tables)
  34-37. Blaze.data (same diagram)
  38. Blaze.compute (same diagram)
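The split between abstract expressions and computational backends can be illustrated with a toy sketch. This is not Blaze's API: `FilterSum`, `compute_python`, and `compute_sql` are hypothetical names invented here, and the example uses only the Python standard library. It shows one expression object being evaluated by two different backends.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterSum:
    """Hypothetical abstract expression: sum value_col where filter_col > threshold."""
    table: str
    filter_col: str
    threshold: float
    value_col: str


def compute_python(expr, rows):
    """Backend 1: evaluate the expression over an in-memory list of dicts."""
    return sum(r[expr.value_col] for r in rows if r[expr.filter_col] > expr.threshold)


def compute_sql(expr, conn):
    """Backend 2: translate the same expression to SQL and let the database run it."""
    # In this sketch the identifiers come from the expression object, not from user input.
    query = (
        f"SELECT COALESCE(SUM({expr.value_col}), 0) "
        f"FROM {expr.table} WHERE {expr.filter_col} > ?"
    )
    return conn.execute(query, (expr.threshold,)).fetchone()[0]


# The same small dataset stored two ways: a Python list and a SQLite table.
rows = [
    {"name": "Alice", "amount": 100, "age": 34},
    {"name": "Bob", "amount": 200, "age": 25},
    {"name": "Carol", "amount": 50, "age": 41},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, amount INTEGER, age INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (:name, :amount, :age)", rows)

# One expression, two backends.
expr = FilterSum(table="accounts", filter_col="age", threshold=30, value_col="amount")
result_python = compute_python(expr, rows)
result_sql = compute_sql(expr, conn)
print(result_python, result_sql)
```

The user-facing code only builds `expr`; where the work happens depends on which backend receives it, which is the idea the diagram conveys.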
  39. Blaze.API: Table -> Data

    Using the interactive Table (-> Data) object, we can interact with a variety of computational backends with the familiarity of a local DataFrame.
  40-43. Blaze.API: Migrations - into

    The into function makes it easy to move data from one container type to another.
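As a rough illustration of type-driven data migration, the sketch below picks a conversion routine from the pair (target type, source type). It is not Blaze's `into` implementation: `migrate`, `register`, and the two converters are hypothetical names, and only the standard library is used.

```python
import csv
import io

# Hypothetical registry: (target type, source type) -> conversion function.
_CONVERTERS = {}


def register(target, source):
    """Decorator that records a converter for one (target, source) type pair."""
    def decorator(func):
        _CONVERTERS[(target, source)] = func
        return func
    return decorator


def migrate(target, source):
    """Move data into a container of type `target`, choosing the converter by type."""
    try:
        convert = _CONVERTERS[(target, type(source))]
    except KeyError:
        raise NotImplementedError(
            f"no route from {type(source).__name__} to {target.__name__}"
        ) from None
    return convert(source)


@register(list, str)
def csv_text_to_rows(text):
    """CSV text -> row-oriented list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


@register(dict, list)
def rows_to_columns(rows):
    """Row-oriented list of dicts -> column-oriented dict of lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


csv_text = "name,amount\nAlice,100\nBob,200\n"
rows = migrate(list, csv_text)      # CSV text into a list of rows
columns = migrate(dict, rows)       # rows into columns
print(rows)
print(columns)
```

The caller names only the destination type; supporting a new container means registering one more converter rather than changing calling code.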
  44. Why do I like using Blaze?

    - Syntax is very similar to Pandas
    - Easy to scale
    - Easy to find the best computational backend for a particular dataset
    - Easy to adapt my code if someone hands me a dataset in a different format/backend
    - Usability
  45. Data visualization - An Overview

    • Results presentation / Visual analytics
    • Static / Interactive
    • Small datasets / Large datasets
    • Traditional plots / Novel graphics
  46. Bokeh

    • Interactive visualization
    • Novel graphics
    • Streaming, dynamic, large data
    • For the browser, with or without a server
    • Matplotlib compatibility
    • No need to write JavaScript

    http://bokeh.pydata.org/
    https://github.com/bokeh/bokeh
  47. Bokeh - Interactive, Visual analytics

    • Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)
  48. Bokeh - Interactive, Visual analytics

    • Widgets and dashboards
  49. Bokeh - Interactive, Visual analytics

    • Crossfilter
  50. Bokeh - Large datasets

    Server-side downsampling and abstract rendering
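To give a feel for why downsampling before sending data to the browser helps, here is a minimal sketch of one common approach, min/max bucket decimation. It is a stand-in chosen for illustration, not Bokeh's own downsampling or abstract rendering code; `downsample_minmax` is a hypothetical helper and the data is synthetic.

```python
def downsample_minmax(values, n_buckets):
    """Reduce `values` to at most 2 * n_buckets (index, value) points.

    Each bucket contributes its minimum and maximum, so spikes survive
    the reduction even though most points are dropped.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    if len(values) <= 2 * n_buckets:
        return list(enumerate(values))
    bucket_size = -(-len(values) // n_buckets)  # ceiling division
    points = []
    for start in range(0, len(values), bucket_size):
        bucket = values[start:start + bucket_size]
        lo = start + min(range(len(bucket)), key=bucket.__getitem__)
        hi = start + max(range(len(bucket)), key=bucket.__getitem__)
        # Keep the two extremes in their original order along the x axis.
        for index in sorted({lo, hi}):
            points.append((index, values[index]))
    return points


# Synthetic series: a repeating ramp with one spike that plain striding could miss.
series = [i % 100 for i in range(10_000)]
series[4_321] = 1_000
reduced = downsample_minmax(series, n_buckets=50)
print(len(series), "->", len(reduced))
```

Only the reduced points would need to travel to the browser, while the overall shape of the series, including the spike, is preserved.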
  51-52. Bokeh - No JavaScript