
Unlocking the True Value of Big Data with Open Data Science

Big Techday 9
Munich, Germany
June 3, 2016

Kristopher Overholt


Transcript

  1. Unlocking the True Value of Big Data with Open Data Science
     Kristopher Overholt, Solution Architect
     Big Techday 9, Munich, June 3, 2016
  2. Overview
     • Overview of Open Data Science
     • Open Data Science Ecosystem
     • Packaging - Conda
     • Optimization - Numba
     • Visualization - Bokeh and Datashader
     • Parallelization - Dask
     • Example Parallel Workflows with Python
     • Summary
  3. Open Data Science Needs
     Collaboration
     • Iterate on analysis
     • Share discoveries with the team
     • Interact with teams across the globe
     Interactivity
     • Interact with data
     • Build high-performance models
     • Visualize results in context
     Integration
     • Work with open source and legacy data systems
     • Leverage data science languages: Python, R, Matlab, SAS, SPSS, Excel, Java, C, C++, C#, .NET, Fortran, and more
     Predict, Share, and Deploy with Open Data Science
  4. Data science is not just machine learning…
     It also spans Distributed Systems, Business Intelligence, Web, and Scientific Computing / HPC alongside Machine Learning / Statistics.
  5. Data science is interdisciplinary
     • Distributed Systems: Hadoop, Spark
     • Scientific Computing / HPC: GPUs, multi-core
     • Machine Learning / Statistics: classification, deep learning, regression, PCA
     • Web: web crawling, scraping, third-party data and API providers, predictive services and APIs
     • Business Intelligence: data warehouse, querying, reporting
  6. Open Data Science Team: the right technology and tools for the problem
     • Data Scientist: Hadoop / Spark, programming languages, analytic libraries, IDE, notebooks, visualization
     • Business Analyst: spreadsheets, visualization, notebooks, analytic development environment
     • Data Engineer: database / data warehouse, ETL development, programming languages, analytic libraries, IDE, notebooks, visualization
     • DevOps: database / data warehouse, middleware, programming languages
  7. Open Data Science is…
     an inclusive movement that makes the open source tools of data science (data, analytics, and computation) work easily together as a connected ecosystem
  8. Open Data Science means…
     • Availability
     • Innovation
     • Interoperability
     • Transparency
     for everyone on the data science team. Open Data Science is the foundation of modernization.
  9. Open Source Communities Create Powerful Technology for Data Science
     Projects such as Numba, xlwings, Airflow, Blaze, Bokeh, and Lasagne span Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  10. Python is the Common Language
     The same ecosystem (Numba, xlwings, Airflow, Blaze, Bokeh, Lasagne, and more across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC) shares Python as its common language.
  11. Not the Only One…
     SQL also cuts across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  12. Python is also a great glue language
     Python connects these domains and tools, including SQL, across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  13. Anaconda is the Open Data Science Platform, Bringing Technology Together…
     Anaconda spans the full landscape: Numba, Airflow, Blaze, Bokeh, Lasagne, SQL, and more, across Distributed Systems, Business Intelligence, Machine Learning / Statistics, Web, and Scientific Computing / HPC.
  14. Continuum-Supported Foundational Open-Source Components
     • Package Management: Conda, Anaconda
     • Data Analysis: NumPy, SciPy, Pandas
     • Optimization: Numba
     • Visualization: Bokeh, Datashader, matplotlib, HoloViews
     • Parallelization: Dask
  15. Conda
     • Open source, cross-platform (Windows, Mac, Linux) package and environment management
     • Install multiple versions of software packages and their dependencies
     • Easily create multiple environments and switch between them
     • Powered by Python, but can package anything: Python, R, Java, Scala, and more
     • http://conda.pydata.org
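     As a minimal sketch of a typical conda workflow (the environment name and package list here are illustrative, not from the talk):

         conda create --name analytics python=3.5 numpy pandas
         source activate analytics    # "activate analytics" on Windows
         conda install scikit-learn
         conda env list               # list the available environments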
  16. Anaconda
     Leading Open Data Science platform powered by Python
     • 720+ popular packages, optimized and compiled
     • Free for everyone
     • Flexible conda package manager
     • Sandboxed packages and libraries
     • Cross-platform: Windows, Linux, Mac
     • Not just Python: over 230 R packages
  17. conda-forge
     • Community-driven repository and build/CI framework
     • Builds on Windows (AppVeyor), Mac (Travis CI), and Linux (CircleCI)
     • https://conda-forge.github.io

         conda config --add channels conda-forge
         conda install <package-name>
  18. Python for High Performance
     • Easy to use: simple, easy-to-read/write syntax
     • Batteries included: ships with lots of basic functionality
     • Innovations from open source: open access to a huge variety of existing libraries and algorithms
     • Very easy to get high performance when you need it…
  19. Numba
     • Speed up your applications with high-performance Python functions
     • Compiles Python functions to machine code (for CPUs and GPUs)
     • Increases performance by 2-200x, reaching near-C/C++/Fortran speeds
     • Use data, code, and in-notebook profilers to identify bottlenecks
     • numba.pydata.org
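     A minimal sketch of the typical Numba usage pattern; the Monte Carlo function below is an illustrative example, not from the talk:

         import numpy as np
         from numba import jit

         @jit(nopython=True)  # compile to machine code, bypassing the Python interpreter
         def mc_pi(n):
             # Monte Carlo estimate of pi; the loop runs at native speed once compiled
             acc = 0
             for i in range(n):
                 x = np.random.random()
                 y = np.random.random()
                 if x * x + y * y <= 1.0:
                     acc += 1
             return 4.0 * acc / n

         mc_pi(10)                # the first call triggers compilation
         print(mc_pi(10000000))   # later calls run at near-C speed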
  20. Numba: same code, different devices, maximum performance
     • 4 CPU cores, mobile GPU
     • 8 CPU cores, midrange GPU
     • 16 CPU cores per node, high-end GPUs
     • 64 cores, no GPU
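     One hedged illustration of the "same code, different devices" idea is Numba's @vectorize decorator, whose target argument selects the backend; 'cpu', 'parallel', and 'cuda' are real Numba target names, while the kernel below is illustrative:

         import numpy as np
         from numba import vectorize

         # The same scalar kernel compiles for target='cpu' (serial),
         # 'parallel' (multi-core), or 'cuda' (NVIDIA GPU) without code changes.
         @vectorize(['float64(float64, float64)'], target='parallel')
         def rel_diff(x, y):
             return 2.0 * (x - y) / (x + y)

         a = np.random.rand(1000000)
         b = np.random.rand(1000000)
         print(rel_diff(a, b)[:5])  # broadcasts over the arrays in parallel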
  21. Bokeh
     • Interactive visualization framework for web browsers
     • No need to write JavaScript
     • Python, R, Scala, and Lua bindings
     • Easy to embed in web applications
     • Easy to develop interactive applications and dashboards
     • bokeh.pydata.org
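     A minimal sketch of Bokeh's Python API (the sine-wave data is illustrative):

         import numpy as np
         from bokeh.plotting import figure, output_file, show

         x = np.linspace(0, 4 * np.pi, 200)

         p = figure(title="Interactive sine wave")  # pan/zoom tools come for free
         p.line(x, np.sin(x), line_width=2)

         output_file("sine.html")  # standalone HTML; no hand-written JavaScript
         show(p)                   # open the interactive plot in a browser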
  22. Datashader
     • Graphics pipeline system for creating meaningful representations of large amounts of data
     • Handles very large datasets, in and out of core (e.g., billions of data points)
     • datashader.readthedocs.io
     [Example image: NYC census data by race]
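     A minimal sketch of the Datashader pipeline, using random points as a stand-in for a large dataset (the column names and sizes are illustrative):

         import numpy as np
         import pandas as pd
         import datashader as ds
         import datashader.transfer_functions as tf

         df = pd.DataFrame({'x': np.random.randn(1000000),
                            'y': np.random.randn(1000000)})

         canvas = ds.Canvas(plot_width=400, plot_height=400)
         agg = canvas.points(df, 'x', 'y')   # aggregate points onto a fixed grid
         img = tf.shade(agg)                 # map aggregate counts to pixel colors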
  23. Parallelizing the PyData ecosystem… without rewriting everything
     Python's scientific ecosystem spans NumPy, Pandas, scikit-learn, and many, many more.
     Source: "The State of the Stack", Jake VanderPlas, SciPy 2015 (#SciPy2015), https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote
  24. Dask complements the Python ecosystem. It was developed with NumPy, Pandas, and scikit-learn developers.
  25. Overview of Dask
     Dask is a Python parallel computing library that is:
     • Familiar: implements parallel NumPy and Pandas objects
     • Fast: optimized for demanding numerical applications
     • Flexible: supports sophisticated and messy algorithms
     • Scales up: runs resiliently on clusters of hundreds of machines
     • Scales down: pragmatic in a single process on a laptop
     • Interactive: responsive and fast for interactive data science
  26. Spectrum of Parallelization
     From explicit control (fast but hard) to implicit control (restrictive but easy):
     Threads, Processes, MPI, ZeroMQ, Dask, Hadoop, Spark, SQL (Hive, Pig, Impala)
     Dask sits in the middle of this spectrum.
  27. Dask Collections: Familiar Expressions and API
     • Dask array (mimics NumPy): x.T - x.mean(axis=0)
     • Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean()
     • Dask bag (collection of data): b.map(json.loads).foldby(...)
     • Dask delayed (wraps custom code): def load(filename): ..., def clean(data): ..., def analyze(result): ...
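     A brief sketch of how the expressions above fit into runnable code (the file patterns in the comments are illustrative placeholders):

         import json
         import dask.array as da
         import dask.bag as db
         import dask.dataframe as dd
         from dask import delayed

         # Dask array (mimics NumPy): expressions build a task graph lazily
         x = da.random.random((10000, 10000), chunks=(1000, 1000))
         centered = x.T - x.mean(axis=0)
         print(centered.std().compute())     # executes chunk-wise in parallel

         # Dask dataframe (mimics Pandas)
         # df = dd.read_csv('data-*.csv')
         # means = df.groupby(df.index).value.mean().compute()

         # Dask bag: semi-structured records, e.g., newline-delimited JSON
         # b = db.read_text('records-*.json').map(json.loads)

         # Dask delayed: wrap custom Python functions into lazy tasks
         @delayed
         def load(filename):
             with open(filename) as f:
                 return f.read()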
  28. Jupyter
     • Open source, interactive data science and scientific computing across over 40 programming languages
     • The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text
  29. Examples
     1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames: demonstrate Pandas at scale; observe a responsive user interface
     2. Distributed language processing with text data using Dask Bags: explore data using a distributed-memory cluster; interactively query data using libraries from Anaconda
     3. Analyzing global temperature data using Dask Arrays: visualize complex algorithms; learn about Dask collections and tasks
     4. Handling custom code and workflows using Dask Delayed: deal with messy situations; learn about scheduling
  30. Example 1: Distributed DataFrames
     • Built from Pandas DataFrames (e.g., one partition per month: January 2016, February 2016, March 2016, …)
     • Match the Pandas interface
     • Access data from HDFS, S3, local disk, etc.
     • Fast, low latency
     • Responsive user interface
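     A minimal sketch of this workflow (the file pattern and column names are assumptions about the NYC taxi CSV layout):

         import dask.dataframe as dd

         # One CSV file per month, on local disk, HDFS, or S3
         df = dd.read_csv('nyc-taxi-2016-*.csv')

         # Same expressions as Pandas, evaluated lazily across partitions
         df['tip_fraction'] = df.tip_amount / df.fare_amount
         by_passengers = df.groupby('passenger_count').tip_fraction.mean()

         print(by_passengers.compute())  # triggers the parallel computation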
  31. Example 2: Natural Language Processing
     • Distributed natural language processing with text data stored in HDFS
     • Handles standard computations
     • Looks like other parallel frameworks (Spark, Hive, etc.)
     • Access data from HDFS, S3, local disk, etc.
     • Handles the common case
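     A minimal sketch of a distributed word count with Dask Bags (the HDFS path and record layout are illustrative):

         import json
         import dask.bag as db

         # Newline-delimited JSON records stored in HDFS (or S3, or local disk)
         b = db.read_text('hdfs:///data/records-*.json').map(json.loads)

         # Top ten most frequent words across all documents
         top_words = (b.map(lambda d: d['text'].lower().split())
                       .flatten()
                       .frequencies()
                       .topk(10, key=lambda kv: kv[1]))

         print(top_words.compute())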
  32. Example 3: Distributed Arrays
     • Built from NumPy n-dimensional arrays (a Dask array is a grid of NumPy arrays)
     • Matches the NumPy interface (a subset)
     • Solve medium-to-large problems
     • Supports complex algorithms
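     A minimal sketch with random data standing in for the temperature measurements (the array shape is illustrative):

         import dask.array as da

         # e.g., one year of daily global temperature grids
         temps = da.random.random((365, 720, 1440), chunks=(10, 720, 1440))

         climatology = temps.mean(axis=0)   # mean over the time axis
         anomalies = temps - climatology    # same expressions as NumPy

         print(anomalies.std().compute())   # streams through the chunks in parallel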
  33. Example 4: Custom Workflows
     • Manually handle functions to support messy situations
     • A life saver when collections aren't flexible enough
     • Combine futures with collections for the best of both worlds
     • The scheduler provides resilient and elastic execution
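     A minimal sketch of dask.delayed (the pipeline steps and filenames are hypothetical placeholders):

         from dask import delayed

         @delayed
         def load(filename):
             return open(filename).read()

         @delayed
         def clean(data):
             return data.strip().lower()

         @delayed
         def analyze(results):
             return sum(len(r) for r in results)

         # Ordinary-looking Python builds a task graph instead of executing eagerly
         cleaned = [clean(load(fn)) for fn in ['a.txt', 'b.txt', 'c.txt']]
         total = analyze(cleaned)

         print(total.compute())  # the scheduler runs independent tasks in parallel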
  34. Scale Up and Scale Out with Python
     • Scale up (bigger nodes): large-memory and multi-core (CPU and GPU) machines
     • Scale out (more nodes): multiple nodes in a cluster
     • Best of both: e.g., a GPU cluster
  35. Python and Big Data
     • Interact with data in HDFS and Amazon S3 natively from Python
     • Distributed computations without the JVM and Python/Java serialization
     • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
     • Interactive, distributed computing with in-memory persistence/caching
     • Leverage Python & R with Spark (PySpark & SparkR)
     Bottom line: 10-100x faster performance for high-performance, interactive, and batch processing, with native read & write and the Python & R ecosystem (NumPy, Pandas, and 720+ packages)
     [Architecture diagram labels: Batch Processing, Interactive Processing, HDFS, YARN, JVM, Ibis, Impala, PySpark & SparkR, MPI]
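     A minimal sketch of this native access from Python, assuming a running Dask distributed cluster (the scheduler address, bucket, and column name are illustrative):

         import dask.dataframe as dd
         from dask.distributed import Client

         client = Client('scheduler-address:8786')  # connect to the cluster

         # Read directly from S3 (or 'hdfs://...') with no JVM in the path
         df = dd.read_csv('s3://my-bucket/events-*.csv')
         df = client.persist(df)   # cache partitions in distributed memory

         print(df.value.mean().compute())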
  36. Open Data Science Stack
     • Application: Jupyter/IPython Notebook
     • Analytics: pandas, NumPy, SciPy, Numba, NLTK, scikit-learn, scikit-image, and more from Anaconda
     • Parallel computation: Dask, Spark, Hive, Impala
     • Data and resource management: HDFS, YARN, SGE, Slurm, or other distributed systems
     • Server: bare-metal or cloud-based cluster
  37. Continuum-Supported Foundational Open-Source Components
     • Package Management: Conda, Anaconda
     • Data Analysis: NumPy, SciPy, Pandas
     • Optimization: Numba
     • Visualization: Bokeh, Datashader, matplotlib, HoloViews
     • Parallelization: Dask
  38. Additional Resources
     • Packaging - Conda - conda.pydata.org
     • Optimization - Numba - numba.pydata.org
     • Visualization - Bokeh - bokeh.pydata.org
     • Visualization - Datashader - datashader.readthedocs.io
     • Parallelization - Dask - dask.pydata.org
     • Notebooks - anaconda.org/dask/notebooks
     • Anaconda - continuum.io/downloads