How Apache Arrow and Parquet boost cross-language interoperability

PyData Paris 2016 talk about the importance and recent developments on the Python side of Apache Arrow and Apache Parquet.

Uwe L. Korn

June 14, 2016

Transcript

  1. Uwe L. Korn
    PyData Paris 14th June 2016
    How Apache Arrow and Parquet
    boost cross-language interop

  2. About me
    • Data Scientist at Blue Yonder (@BlueYonderTech)
    • We optimize Replenishment and Pricing for the Retail
    industry with Predictive Analytics
    • Contributor to Apache {Arrow, Parquet}
    • Work in Python, Cython, C++11 and SQL

  3. Agenda
    The Problem
    Arrow
    Parquet
    Outlook

  4. Why is columnar better?
    Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )

  5. Different Systems - Varying
    Python Support
    • Various levels of Python Support
    • Built in Python
    • Python API
    • No Python at all
    • Each tool/algorithm works on
    columnar data
    • Separate conversion routines for
    each pair
    • causes overhead
    • there’s no one-size-fits-all solution
    Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )

  6. Apache Arrow
    • Specification for in-memory
    columnar data layout
    • No overhead for cross-system /
    cross-language communication
    • Designed for efficiency (exploit
    SIMD, cache locality, ..)
    • Supports nested data structures
    Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
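
To make the idea of a shared in-memory columnar layout more concrete, here is a minimal pure-Python sketch of how a nullable string column can be stored as three contiguous buffers: a validity bitmap, int32 offsets, and the UTF-8 data. It loosely follows the layout described in the Arrow specification; the function names `encode_string_column` and `decode_string_column` are hypothetical and are not part of any Arrow library.

```python
import struct


def encode_string_column(values):
    """Lay out a list of str/None as (validity bitmap, int32 offsets, UTF-8 data)."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            # Bit i of the bitmap marks slot i as valid (least-significant-bit order).
            bitmap[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # Null slots get a zero-length range in the offsets buffer.
        offsets.append(len(data))
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(bitmap), offsets_buf, bytes(data)


def decode_string_column(length, bitmap, offsets_buf, data):
    """Rebuild the Python list from the three buffers."""
    offsets = struct.unpack("<%di" % (length + 1), offsets_buf)
    out = []
    for i in range(length):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


column = ["joe", None, None, "mark"]
bitmap, offsets_buf, data = encode_string_column(column)
roundtrip = decode_string_column(len(column), bitmap, offsets_buf, data)
```

Because every system that understands this layout can read the same buffers directly, nothing has to be re-encoded when the column crosses a language boundary.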

  7. Apache Arrow - The Impact
    • An example: Retrieve a dataset from an MPP database
    and analyze it in Pandas
    • Run a query in the DB
    • Pass it in columnar form to the DB driver
    • The ODBC layer transforms it into row-wise form
    • Pandas makes it columnar again
    • Ugly real-life solution: export as CSV, bypass ODBC
    • In future: Use Arrow as interface between the DB and
    Pandas
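
The round trip described on this slide can be sketched in a few lines of pure Python. The data and the helper names `columns_to_rows` and `rows_to_columns` are made up for illustration; the point is only that the columnar result is pivoted to rows by the driver and then pivoted back by the consumer, touching every value twice.

```python
def columns_to_rows(columns):
    """Pivot a dict of column lists into a list of row tuples (driver side)."""
    names = list(columns)
    return list(zip(*(columns[name] for name in names)))


def rows_to_columns(rows, names):
    """Pivot row tuples back into a dict of column lists (analysis side)."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


names = ["store", "price"]
db_columns = {"store": [1, 2, 3], "price": [9.5, 3.0, 7.25]}

# What a row-oriented driver hands over ...
driver_rows = columns_to_rows(db_columns)
# ... and what the data frame library has to rebuild from it.
frame_columns = rows_to_columns(driver_rows, names)
```

With a shared columnar interface, both pivots disappear and the consumer can work on the columns the database already produced.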

  8. Apache Arrow
    • Top-level Apache project from the beginning
    • Not only a specification: also includes C++ / Java /
    Python / .. code.
    • Arrow structures / classes
    • RPC (upcoming) & IPC (alpha) support
    • Conversion code for Parquet, Pandas, ..
    • Combined effort from developers of over 13 major OSS
    projects
    • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
    • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md

  9. Arrow in Action: Feather
    • Language-agnostic file format for
    binary data frame storage
    • Read performance close to raw
    disk I/O
    • by Wes McKinney (Python) and
    Hadley Wickham (R)
    • Julia Support in progress
    Arrow Arrays
    Feather Metadata
    (flatbuffers)

  10. Apache Parquet

  11. Apache Parquet
    • Binary file format for nested columnar data
    • Inspired by the Google Dremel paper
    • space and query efficient
    • multiple encodings
    • predicate pushdown
    • column-wise compression
    • many tools use Parquet as the default input format
    • very popular in the JVM/Hadoop-based world
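
As a rough illustration of why column-wise encodings save space, the following sketch dictionary-encodes a column and then run-length encodes the resulting indices. This is a simplified stand-in, not Parquet's actual on-disk encoding, and all function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into (index, count) pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Expand the runs and look the indices up in the dictionary."""
    return [dictionary[index] for index, count in runs for _ in range(count)]


column = ["FR", "FR", "FR", "DE", "DE", "FR"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
```

Values of one column tend to repeat and share a type, which is why encodings like these work much better per column than per row.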

  12. The Basics
    • 1 File, includes metadata
    • Several row groups
    • all with the same number of column chunks
    • n pages per column chunk
    • Benefits:
    • pre-partitioned for fast distributed access
    • statistics in the metadata for predicate pushdown
    Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
    File
    Row Group
    Column Chunk
    Page
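
The role of the per-row-group statistics can be sketched as follows. This is a toy model in pure Python under simplifying assumptions (a single integer column, min/max statistics only); the helper names `write_row_groups` and `scan_greater_than` are hypothetical and do not correspond to a real Parquet API.

```python
def write_row_groups(values, rows_per_group):
    """Split a column into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            # The statistics prove that no row can match, so skip the group.
            continue
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


prices = [1, 4, 2, 3, 10, 12, 11, 15, 5, 6, 20, 7]
groups = write_row_groups(prices, rows_per_group=4)
matches, groups_read = scan_greater_than(groups, threshold=9)
```

Here the first row group is never touched because its maximum is below the threshold; a reader only needs the metadata to decide that.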

  13. Using Parquet in Python
    • You can use it already today with Python:
    • sqlContext.read.parquet("..").toPandas()
    • Needs to pass through Spark, very slow
    • Native Python support on its way:
    • Parquet I/O to Arrow
    • Arrow provides NumPy conversion

  14. State of Arrow & Parquet
    Arrow
    in-memory spec for columnar data
    • Java (beta)
    • C++ (in progress)
    • Python (in progress)
    • Planned:
    • Julia
    • R
    Parquet
    columnar on-disk storage
    • Java (mature)
    • C++ (in progress)
    • Python (in progress)
    • Planned:
    • Julia
    • R

  15. Upcoming
    • Parquet <-Arrow-> Pandas
    • IPC on its way
    • alpha implementation using memory mapped files
    • JVM <-> native with shared reference counting
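
The idea behind IPC over memory-mapped files can be sketched with the Python standard library alone. This is not Arrow's IPC format, only an illustration of the principle: one side writes a column as a raw buffer of doubles, the other maps the file and reads the values in place. The helper names and the file name are hypothetical, and both sides run in the same process here for simplicity.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Producer: dump a column of float64 values as one contiguous buffer."""
    with open(path, "wb") as f:
        f.write(array("d", values).tobytes())


def sum_mapped_column(path):
    """Consumer: map the file and read the doubles without copying the buffer."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the mapped bytes as float64; both views are released
            # on exit so the mapping can be closed cleanly.
            with memoryview(mapped) as raw, raw.cast("d") as view:
                return sum(view)


values = [1.5, 2.5, 4.0]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_column(path, values)
    total = sum_mapped_column(path)
```

Because the layout on both sides is identical, the consumer needs no parsing step; it only needs to know the type and length of the buffer.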

  16. Get Involved!
    [email protected] & [email protected]
    • https://apachearrowslackin.herokuapp.com/
    • https://arrow.apache.org/
    • https://parquet.apache.org/
    • @ApacheArrow & @ApacheParquet

  17. Questions ?!
