How Apache Arrow and Parquet boost cross-language interoperability

PyData Paris 2016 talk about the importance of Apache Arrow and Apache Parquet and about recent developments on their Python side.

Uwe L. Korn

June 14, 2016

Transcript

  1. Uwe L. Korn, PyData Paris, 14th June 2016: How Apache Arrow and Parquet boost cross-language interop
  2. About me • Data Scientist at Blue Yonder (@BlueYonderTech) • We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics • Contributor to Apache {Arrow, Parquet} • Work in Python, Cython, C++11 and SQL
  3. Different Systems - Varying Python Support • Various levels of Python support: • built in Python • a Python API • no Python at all • Each tool/algorithm works on columnar data • Separate conversion routines for each pair cause overhead • there’s no one-size-fits-all solution Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
  4. Apache Arrow • Specification for in-memory columnar data layout • No overhead for cross-system / cross-language communication • Designed for efficiency (exploits SIMD, cache locality, ...) • Supports nested data structures (see the Python sketch below) Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
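To make the columnar layout concrete, here is a minimal sketch of building Arrow data from Python. It uses the pyarrow bindings that grew out of this work; the exact calls shown are the API that later stabilized, so treat them as an assumption relative to the state of the code in 2016.

```python
import pyarrow as pa

# One nullable, columnar array: values live in a contiguous buffer,
# nulls are tracked in a separate validity bitmap.
prices = pa.array([9.99, 12.50, None, 7.25])

# Several arrays form a table; the data stays columnar in memory,
# so any Arrow-aware system can consume the same buffers.
table = pa.Table.from_arrays(
    [pa.array(["a", "b", "c", "d"]), prices],
    names=["sku", "price"],
)
print(table.schema)
```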
  5. Apache Arrow - The Impact • An example: retrieve a dataset from an MPP database and analyze it in Pandas • Run a query in the DB • Pass it in columnar form to the DB driver • The ODBC layer transforms it into row-wise form • Pandas makes it columnar again • Ugly real-life workaround: export as CSV, bypass ODBC • In future: use Arrow as the interface between the DB and Pandas (see the sketch below)
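A hedged illustration of the difference, not from the slides: the database and driver are stubbed out with literal values, because the talk does not name an Arrow-capable driver, and the pyarrow calls are the later, stable API. The point is only where the row-wise detour disappears.

```python
import pandas as pd
import pyarrow as pa

# Row-wise hand-off, as with today's ODBC layer: every value passes
# through Python objects while rows are reassembled into columns.
rows = [("a", 9.99), ("b", 12.50), ("c", 7.25)]
df_rows = pd.DataFrame.from_records(rows, columns=["sku", "price"])

# Columnar hand-off via Arrow: whole column buffers cross the boundary
# and pandas converts them in one step instead of rebuilding columns.
table = pa.Table.from_arrays(
    [pa.array(["a", "b", "c"]), pa.array([9.99, 12.50, 7.25])],
    names=["sku", "price"],
)
df_cols = table.to_pandas()
```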
  6. Apache Arrow • Top-level Apache project from the beginning • Not only a specification: also includes C++ / Java / Python / .. code • Arrow structures / classes • RPC (upcoming) & IPC (alpha) support • Conversion code for Parquet, Pandas, .. • Combined effort from developers of over 13 major OSS projects • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, .. • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
  7. Arrow in Action: Feather • Language-agnostic file format for binary data frame storage • Read performance close to raw disk I/O • By Wes McKinney (Python) and Hadley Wickham (R) • Julia support in progress (see the sketch below) [Diagram: Arrow arrays + Feather metadata (flatbuffers)]
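A minimal sketch of round-tripping a DataFrame with the feather-format Python package from that period; the write_dataframe/read_dataframe names are from that early package and may differ from later releases.

```python
import pandas as pd
import feather

df = pd.DataFrame({"sku": ["a", "b", "c"], "price": [9.99, 12.50, 7.25]})

# Each column is written as an Arrow array; the schema and column
# offsets go into the flatbuffers-encoded Feather metadata.
feather.write_dataframe(df, "sales.feather")

# Reading restores a pandas DataFrame; the same file can be read
# from R with feather::read_feather().
df2 = feather.read_dataframe("sales.feather")
```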
  8. Apache Parquet • Binary file format for nested columnar data • Inspired by the Google Dremel paper • space- and query-efficient • multiple encodings • predicate pushdown • column-wise compression • many tools use Parquet as the default input format • very popular in the JVM/Hadoop-based world
  9. The Basics • 1 file, includes metadata • Several row groups • all with the same number of column chunks • n pages per column chunk • Benefits: • pre-partitioned for fast distributed access • statistics in the metadata for predicate pushdown (see the sketch below) • Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet [Diagram: File > Row Group > Column Chunk > Page]
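To make the file layout concrete, a short sketch that inspects this hierarchy. It uses the pyarrow.parquet API that eventually emerged from this effort, so the exact names are an assumption relative to the talk's timeframe, and "sales.parquet" is a placeholder file.

```python
import pyarrow.parquet as pq

# A single Parquet file carries its own metadata footer.
pf = pq.ParquetFile("sales.parquet")
meta = pf.metadata
print(meta.num_row_groups, "row groups,", meta.num_columns, "columns")

# Each row group holds one chunk per column; the per-chunk statistics
# (min/max, null count) are what predicate pushdown evaluates.
chunk = meta.row_group(0).column(0)
print(chunk.path_in_schema, chunk.statistics)
```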
  10. Using Parquet in Python • You can use it already today with Python: • sqlContext.read.parquet("..").toPandas() • Needs to pass through Spark, very slow • Native Python support on its way: • Parquet I/O to Arrow • Arrow provides NumPy conversion (see the sketch below)
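A hedged sketch of what the native route looks like once the Parquet-to-Arrow reader lands; pyarrow.parquet.read_table is the API that later shipped and is an assumption with respect to the in-progress state described here.

```python
import pyarrow.parquet as pq

# Native path, no JVM/Spark round trip:
# Parquet file -> Arrow table -> pandas DataFrame backed by NumPy arrays.
table = pq.read_table("sales.parquet", columns=["sku", "price"])
df = table.to_pandas()
```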
  11. State of Arrow & Parquet • Arrow, in-memory spec for columnar data: Java (beta), C++ (in progress), Python (in progress); planned: Julia, R • Parquet, columnar on-disk storage: Java (mature), C++ (in progress), Python (in progress); planned: Julia, R
  12. Upcoming • Parquet <-Arrow-> Pandas • IPC on its way • alpha implementation using memory-mapped files • JVM <-> native with shared reference counting (see the sketch below)
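As a rough illustration of memory-mapped IPC, not the alpha implementation mentioned on the slide: this sketch uses the file/stream APIs that later stabilized in pyarrow, so the exact names are an assumption.

```python
import pyarrow as pa

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Write record batches in the Arrow IPC file format.
with pa.OSFile("batches.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Another process can memory-map the same file and read the batches
# without copying them onto its own heap.
with pa.memory_map("batches.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```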