
Unlocking the True Value of Big Data with Open Data Science

Big Techday 9
Munich, Germany
June 3, 2016

Kristopher Overholt


Transcript

  1. Unlocking the True Value of Big Data with Open Data Science
     Kristopher Overholt, Solution Architect
     Big Techday 9, Munich, June 3, 2016
  2. Overview
     • Overview of Open Data Science
     • Open Data Science Ecosystem
     • Packaging - Conda
     • Optimization - Numba
     • Visualization - Bokeh and Datashader
     • Parallelization - Dask
     • Example Parallel Workflows with Python
     • Summary
  3. Open Data Science Needs
     Collaboration
     • Iterate on analysis
     • Share discoveries with the team
     • Interact with teams across the globe
     Interactivity
     • Interact with data
     • Build high-performance models
     • Visualize results in context
     Integration
     • Work with open source and legacy data systems
     • Leverage data science languages: Python, R, Matlab, SAS, SPSS, Excel, Java, C, C++, C#, .NET, Fortran, and more
     Predict, Share, and Deploy with Open Data Science
  4. Data science is not just machine learning…
     It also spans Distributed Systems, Business Intelligence, Web, and Scientific Computing / HPC alongside Machine Learning / Statistics.
  5. Data science is interdisciplinary
     • Distributed Systems: Hadoop, Spark
     • Scientific Computing / HPC: GPUs, multi-core
     • Machine Learning / Statistics: classification, deep learning, regression, PCA
     • Web: web crawling, scraping, third-party data and API providers, predictive services and APIs
     • Business Intelligence: data warehouse, querying, reporting
  6. Open Data Science Team: the right technology and tools for the problem
     • Data Scientist: Hadoop / Spark, programming languages, analytic libraries, IDE, notebooks, visualization
     • Business Analyst: spreadsheets, visualization, notebooks, analytic development environment
     • Data Engineer: database / data warehouse, ETL development, programming languages, analytic libraries, IDE, notebooks, visualization
     • DevOps: database / data warehouse, middleware, programming languages
  7. Open Data Science is…
     an inclusive movement that makes the open source tools of data science (data, analytics, and computation) work easily together as a connected ecosystem
  8. Open Data Science means…
     • Availability
     • Innovation
     • Interoperability
     • Transparency
     for everyone on the data science team. Open Data Science is the foundation of modernization.
  9. Open Source Communities Create Powerful Technology for Data Science
     Projects such as Numba, xlwings, Airflow, Blaze, Bokeh, and Lasagne span Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  10. Python is the Common Language
     The same ecosystem (Numba, xlwings, Airflow, Blaze, Bokeh, Lasagne, and more across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC) shares Python as its common language.
  11. Not the Only One…
     SQL also cuts across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  12. Python is also a great glue language
     Python connects these domains and tools, including SQL, across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  13. Anaconda is the Open Data Science Platform, Bringing Technology Together…
     Anaconda spans the full landscape: Numba, Airflow, Blaze, Bokeh, Lasagne, SQL, and more, across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  14. Continuum-Supported Foundational Open-Source Components
     • Package Management: Conda, Anaconda
     • Data Analysis: NumPy, SciPy, Pandas
     • Optimization: Numba
     • Visualization: Bokeh, Datashader, matplotlib, HoloViews
     • Parallelization: Dask
  15. Conda
     • Open source, cross-platform (Windows, Mac, Linux) package and environment management
     • Install multiple versions of software packages and their dependencies
     • Easily create multiple environments and switch between them
     • Powered by Python, but can package anything: Python, R, Java, Scala, and more
     • http://conda.pydata.org
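     As a minimal sketch of a typical conda workflow (the environment name and package list here are illustrative, not from the talk):

         conda create --name analytics python=3.5 numpy pandas
         source activate analytics    # "activate analytics" on Windows
         conda install scikit-learn
         conda env list               # list the available environments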
  16. Anaconda
     Leading Open Data Science platform powered by Python
     • 720+ popular packages, optimized and compiled
     • Free for everyone
     • Flexible conda package manager
     • Sandboxed packages and libraries
     • Cross-platform: Windows, Linux, Mac
     • Not just Python: over 230 R packages
  17. conda-forge
     • Community-driven repository and build/CI framework
     • Builds on Windows (AppVeyor), Mac (Travis CI), and Linux (CircleCI)
     • https://conda-forge.github.io

         conda config --add channels conda-forge
         conda install <package-name>
  18. Python for High Performance
     • Easy to use: simple, easy-to-read/write syntax
     • Batteries included: ships with lots of basic functionality
     • Innovations from open source: open access to a huge variety of existing libraries and algorithms
     • Very easy to get high performance when you need it…
  19. Numba
     • Speed up your applications with high-performance Python functions
     • Compiles Python functions to machine code (for CPUs and GPUs)
     • Increases performance by 2-200x, reaching near-C/C++/Fortran speeds
     • Use data, code, and in-notebook profilers to identify bottlenecks
     • numba.pydata.org
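     A minimal sketch of the typical Numba usage pattern; the Monte Carlo function below is an illustrative example, not from the talk:

         import numpy as np
         from numba import jit

         @jit(nopython=True)  # compile to machine code, bypassing the Python interpreter
         def mc_pi(n):
             # Monte Carlo estimate of pi; the loop runs at native speed once compiled
             acc = 0
             for i in range(n):
                 x = np.random.random()
                 y = np.random.random()
                 if x * x + y * y <= 1.0:
                     acc += 1
             return 4.0 * acc / n

         mc_pi(10)                # the first call triggers compilation
         print(mc_pi(10000000))   # later calls run at near-C speed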
  20. Numba: same code, different devices, maximum performance
     • 4 CPU cores, mobile GPU
     • 8 CPU cores, midrange GPU
     • 16 CPU cores per node, high-end GPUs
     • 64 cores, no GPU
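     One hedged illustration of the "same code, different devices" idea is Numba's @vectorize decorator, whose target argument selects the backend; 'cpu', 'parallel', and 'cuda' are real Numba target names, while the kernel below is illustrative:

         import numpy as np
         from numba import vectorize

         # The same scalar kernel compiles for target='cpu' (serial),
         # 'parallel' (multi-core), or 'cuda' (NVIDIA GPU) without code changes.
         @vectorize(['float64(float64, float64)'], target='parallel')
         def rel_diff(x, y):
             return 2.0 * (x - y) / (x + y)

         a = np.random.rand(1000000)
         b = np.random.rand(1000000)
         print(rel_diff(a, b)[:5])  # broadcasts over the arrays in parallel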
  21. Bokeh
     • Interactive visualization framework for web browsers
     • No need to write JavaScript
     • Python, R, Scala, and Lua bindings
     • Easy to embed in web applications
     • Easy to develop interactive applications and dashboards
     • bokeh.pydata.org
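     A minimal sketch of Bokeh's Python API (the sine-wave data is illustrative):

         import numpy as np
         from bokeh.plotting import figure, output_file, show

         x = np.linspace(0, 4 * np.pi, 200)

         p = figure(title="Interactive sine wave")  # pan/zoom tools come for free
         p.line(x, np.sin(x), line_width=2)

         output_file("sine.html")  # standalone HTML; no hand-written JavaScript
         show(p)                   # open the interactive plot in a browser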
  22. Datashader
     • Graphics pipeline system for creating meaningful representations of large amounts of data
     • Handles very large datasets, in and out of core (e.g., billions of data points)
     • datashader.readthedocs.io
     [Example image: NYC census data by race]
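     A minimal sketch of the Datashader pipeline, using random points as a stand-in for a large dataset (the column names and sizes are illustrative):

         import numpy as np
         import pandas as pd
         import datashader as ds
         import datashader.transfer_functions as tf

         df = pd.DataFrame({'x': np.random.randn(1000000),
                            'y': np.random.randn(1000000)})

         canvas = ds.Canvas(plot_width=400, plot_height=400)
         agg = canvas.points(df, 'x', 'y')   # aggregate points onto a fixed grid
         img = tf.shade(agg)                 # map aggregate counts to pixel colors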
  23. Parallelizing the PyData ecosystem… without rewriting everything
     Python's scientific ecosystem spans NumPy, Pandas, scikit-learn, and many, many more.
     Source: "The State of the Stack", Jake VanderPlas, SciPy 2015 (#SciPy2015), https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote
  24. Dask complements the Python ecosystem. It was developed with NumPy, Pandas, and scikit-learn developers.
  25. Overview of Dask
     Dask is a Python parallel computing library that is:
     • Familiar: implements parallel NumPy and Pandas objects
     • Fast: optimized for demanding numerical applications
     • Flexible: supports sophisticated and messy algorithms
     • Scales up: runs resiliently on clusters of hundreds of machines
     • Scales down: pragmatic in a single process on a laptop
     • Interactive: responsive and fast for interactive data science
  26. Spectrum of Parallelization
     From explicit control (fast but hard) to implicit control (restrictive but easy):
     Threads, Processes, MPI, ZeroMQ, Dask, Hadoop, Spark, SQL (Hive, Pig, Impala)
     Dask sits in the middle of this spectrum.
  27. Dask Collections: Familiar Expressions and API
     • Dask array (mimics NumPy): x.T - x.mean(axis=0)
     • Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean()
     • Dask bag (collection of data): b.map(json.loads).foldby(...)
     • Dask delayed (wraps custom code): def load(filename): ..., def clean(data): ..., def analyze(result): ...
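     A brief sketch of how the expressions above fit into runnable code (the file patterns in the comments are illustrative placeholders):

         import json
         import dask.array as da
         import dask.bag as db
         import dask.dataframe as dd
         from dask import delayed

         # Dask array (mimics NumPy): expressions build a task graph lazily
         x = da.random.random((10000, 10000), chunks=(1000, 1000))
         centered = x.T - x.mean(axis=0)
         print(centered.std().compute())     # executes chunk-wise in parallel

         # Dask dataframe (mimics Pandas)
         # df = dd.read_csv('data-*.csv')
         # means = df.groupby(df.index).value.mean().compute()

         # Dask bag: semi-structured records, e.g., newline-delimited JSON
         # b = db.read_text('records-*.json').map(json.loads)

         # Dask delayed: wrap custom Python functions into lazy tasks
         @delayed
         def load(filename):
             with open(filename) as f:
                 return f.read()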
  28. Jupyter
     • Open source, interactive data science and scientific computing across over 40 programming languages
     • The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text
  29. Examples
     1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames: demonstrate Pandas at scale; observe a responsive user interface
     2. Distributed language processing with text data using Dask Bags: explore data using a distributed-memory cluster; interactively query data using libraries from Anaconda
     3. Analyzing global temperature data using Dask Arrays: visualize complex algorithms; learn about Dask collections and tasks
     4. Handling custom code and workflows using Dask Delayed: deal with messy situations; learn about scheduling
  30. Example 1: Distributed DataFrames
     • Built from Pandas DataFrames (e.g., one partition per month: January 2016, February 2016, March 2016, …)
     • Match the Pandas interface
     • Access data from HDFS, S3, local disk, etc.
     • Fast, low latency
     • Responsive user interface
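     A minimal sketch of this workflow (the file pattern and column names are assumptions about the NYC taxi CSV layout):

         import dask.dataframe as dd

         # One CSV file per month, on local disk, HDFS, or S3
         df = dd.read_csv('nyc-taxi-2016-*.csv')

         # Same expressions as Pandas, evaluated lazily across partitions
         df['tip_fraction'] = df.tip_amount / df.fare_amount
         by_passengers = df.groupby('passenger_count').tip_fraction.mean()

         print(by_passengers.compute())  # triggers the parallel computation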
  31. Example 2: Natural Language Processing
     • Distributed natural language processing with text data stored in HDFS
     • Handles standard computations
     • Looks like other parallel frameworks (Spark, Hive, etc.)
     • Access data from HDFS, S3, local disk, etc.
     • Handles the common case
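     A minimal sketch of a distributed word count with Dask Bags (the HDFS path and record layout are illustrative):

         import json
         import dask.bag as db

         # Newline-delimited JSON records stored in HDFS (or S3, or local disk)
         b = db.read_text('hdfs:///data/records-*.json').map(json.loads)

         # Top ten most frequent words across all documents
         top_words = (b.map(lambda d: d['text'].lower().split())
                       .flatten()
                       .frequencies()
                       .topk(10, key=lambda kv: kv[1]))

         print(top_words.compute())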
  32. Example 3: Distributed Arrays
     • Built from NumPy n-dimensional arrays (a Dask array is a grid of NumPy arrays)
     • Matches the NumPy interface (a subset)
     • Solve medium-to-large problems
     • Supports complex algorithms
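     A minimal sketch with random data standing in for the temperature measurements (the array shape is illustrative):

         import dask.array as da

         # e.g., one year of daily global temperature grids
         temps = da.random.random((365, 720, 1440), chunks=(10, 720, 1440))

         climatology = temps.mean(axis=0)   # mean over the time axis
         anomalies = temps - climatology    # same expressions as NumPy

         print(anomalies.std().compute())   # streams through the chunks in parallel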
  33. Example 4: Custom Workflows
     • Manually handle functions to support messy situations
     • A life saver when collections aren't flexible enough
     • Combine futures with collections for the best of both worlds
     • The scheduler provides resilient and elastic execution
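     A minimal sketch of dask.delayed (the pipeline steps and filenames are hypothetical placeholders):

         from dask import delayed

         @delayed
         def load(filename):
             return open(filename).read()

         @delayed
         def clean(data):
             return data.strip().lower()

         @delayed
         def analyze(results):
             return sum(len(r) for r in results)

         # Ordinary-looking Python builds a task graph instead of executing eagerly
         cleaned = [clean(load(fn)) for fn in ['a.txt', 'b.txt', 'c.txt']]
         total = analyze(cleaned)

         print(total.compute())  # the scheduler runs independent tasks in parallel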
  34. Scale Up and Scale Out with Python
     • Scale up (bigger nodes): large-memory and multi-core (CPU and GPU) machines
     • Scale out (more nodes): multiple nodes in a cluster
     • Best of both: e.g., a GPU cluster
  35. Python and Big Data
     • Interact with data in HDFS and Amazon S3 natively from Python
     • Distributed computations without the JVM and Python/Java serialization
     • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
     • Interactive, distributed computing with in-memory persistence/caching
     • Leverage Python & R with Spark (PySpark & SparkR)
     Bottom line: 10-100x faster performance for high-performance, interactive, and batch processing, with native read & write and the Python & R ecosystem (NumPy, Pandas, and 720+ packages)
     [Architecture diagram labels: Batch Processing, Interactive Processing, HDFS, YARN, JVM, Ibis, Impala, PySpark & SparkR, MPI]
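     A minimal sketch of this native access from Python, assuming a running Dask distributed cluster (the scheduler address, bucket, and column name are illustrative):

         import dask.dataframe as dd
         from dask.distributed import Client

         client = Client('scheduler-address:8786')  # connect to the cluster

         # Read directly from S3 (or 'hdfs://...') with no JVM in the path
         df = dd.read_csv('s3://my-bucket/events-*.csv')
         df = client.persist(df)   # cache partitions in distributed memory

         print(df.value.mean().compute())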
  36. Open Data Science Stack
     • Application: Jupyter/IPython Notebook
     • Analytics: pandas, NumPy, SciPy, Numba, NLTK, scikit-learn, scikit-image, and more from Anaconda
     • Parallel computation: Dask, Spark, Hive, Impala
     • Data and resource management: HDFS, YARN, SGE, Slurm, or other distributed systems
     • Server: bare-metal or cloud-based cluster
  37. Continuum-Supported Foundational Open-Source Components
     • Package Management: Conda, Anaconda
     • Data Analysis: NumPy, SciPy, Pandas
     • Optimization: Numba
     • Visualization: Bokeh, Datashader, matplotlib, HoloViews
     • Parallelization: Dask
  38. Additional Resources
     • Packaging - Conda - conda.pydata.org
     • Optimization - Numba - numba.pydata.org
     • Visualization - Bokeh - bokeh.pydata.org
     • Visualization - Datashader - datashader.readthedocs.io
     • Parallelization - Dask - dask.pydata.org
     • Notebooks - anaconda.org/dask/notebooks
     • Anaconda - continuum.io/downloads