How Apache Arrow and Parquet boost cross-language interoperability

PyData Paris 2016 talk on why Apache Arrow and Apache Parquet matter and on recent developments in their Python support.

Uwe L. Korn

June 14, 2016

Transcript

  1. Uwe L. Korn, PyData Paris, 14th June 2016: How Apache Arrow and Parquet boost cross-language interop
  2. About me
    • Data Scientist at Blue Yonder (@BlueYonderTech)
    • We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics
    • Contributor to Apache {Arrow, Parquet}
    • Work in Python, Cython, C++11 and SQL
  3. Agenda: The Problem, Arrow, Parquet, Outlook

  4. Why is columnar better? Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )

  5. Different Systems - Varying Python Support
    • Various levels of Python support
    • Built in Python
    • Python API
    • No Python at all
    • Each tool/algorithm works on columnar data
    • Separate conversion routines for each pair
    • causes overhead
    • there’s no one-size-fits-all solution
    Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
  6. Apache Arrow
    • Specification for in-memory columnar data layout
    • No overhead for cross-system / cross-language communication
    • Designed for efficiency (exploit SIMD, cache locality, ..)
    • Supports nested data structures
    Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
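
As a rough illustration of what an in-memory columnar layout means, the sketch below packs a column of optional integers into one contiguous values buffer plus a validity bitmap (bit set = value present, least-significant bit first). The helper names `to_columnar` and `is_valid` are hypothetical, and this is a simplified toy, not the Arrow specification itself, which also covers padding, alignment and nested types.

```python
from array import array


def to_columnar(values):
    """Pack optional ints into (validity bitmap, values buffer, null count)."""
    # Contiguous 64-bit values; null slots hold a placeholder 0.
    data = array("q", (0 if v is None else v for v in values))
    # One validity bit per slot, least-significant bit first.
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    null_count = sum(v is None for v in values)
    return bitmap, data, null_count


def is_valid(bitmap, i):
    """Return True if slot i holds a value rather than a null."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


bitmap, data, null_count = to_columnar([1, None, 3, 4])
```

Because the values sit next to each other in one buffer, a scan over the column touches memory sequentially, which is what makes SIMD and cache-friendly processing possible.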
  7. Apache Arrow - The Impact
    • An example: Retrieve a dataset from an MPP database and analyze it in Pandas
    • Run a query in the DB
    • Pass it in columnar form to the DB driver
    • The ODBC layer transforms it into row-wise form
    • Pandas makes it columnar again
    • Ugly real-life solution: export as CSV, bypass ODBC
    • In the future: Use Arrow as the interface between the DB and Pandas
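
The round trip described on this slide can be sketched in plain Python. The data and the helper names `rows_to_columns` and `columns_to_rows` are made up for illustration; real drivers do this work in native code, but every value is still touched and copied in each direction.

```python
# Hypothetical result set as a row-oriented driver would hand it over.
rows = [(1, "apple", 0.5), (2, "pear", 0.75), (3, "plum", 0.25)]
names = ["id", "product", "price"]


def rows_to_columns(rows, names):
    """Rebuild one list per column from row tuples (what the consumer has to do)."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


def columns_to_rows(columns):
    """Turn per-column lists back into row tuples (what the row-wise layer does)."""
    return list(zip(*columns.values()))


columns = rows_to_columns(rows, names)
```

With a shared columnar format on both sides, neither conversion would be needed.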
  8. Apache Arrow
    • Top-level Apache project from the beginning
    • Not only a specification: also includes C++ / Java / Python / .. code.
    • Arrow structures / classes
    • RPC (upcoming) & IPC (alpha) support
    • Conversion code for Parquet, Pandas, ..
    • Combined effort from developers of over 13 major OSS projects
    • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
    • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
  9. Arrow in Action: Feather
    • Language-agnostic file format for binary data frame storage
    • Read performance close to raw disk I/O
    • by Wes McKinney (Python) and Hadley Wickham (R)
    • Julia support in progress
    Diagram labels: Arrow Arrays, Feather Metadata (flatbuffers)
  10. Apache Parquet

  11. Apache Parquet
    • Binary file format for nested columnar data
    • Inspired by the Google Dremel paper
    • space and query efficient
    • multiple encodings
    • predicate pushdown
    • column-wise compression
    • many tools use Parquet as the default input format
    • very popular in the JVM/Hadoop-based world
  12. The Basics
    • 1 File, includes metadata
    • Several row groups
    • all with the same number of column chunks
    • n pages per column chunk
    • Benefits:
    • pre-partitioned for fast distributed access
    • statistics in the metadata for predicate pushdown
    Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
    Diagram labels: File, Row Group, Column Chunk, Page
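
To make "statistics in the metadata for predicate pushdown" concrete, here is a minimal sketch of how a reader could skip row groups using per-group min/max values. The `RowGroup` class and `scan_greater_than` function are hypothetical and model a single integer column only; the actual Parquet metadata is richer than this.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for one row group of a single integer column."""
    min_value: int
    max_value: int
    values: list


def make_group(values):
    """Build a row group and record its min/max statistics."""
    return RowGroup(min(values), max(values), values)


def scan_greater_than(row_groups, threshold):
    """Return values > threshold, decoding only groups that may contain matches."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # the statistics prove there is no match: skip the group
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [make_group([1, 5, 3]), make_group([10, 12, 11]), make_group([7, 20, 2])]
matches, groups_read = scan_greater_than(row_groups, 9)
```

In this example the first row group is never decoded, because its maximum already rules out a match.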
  13. Using Parquet in Python
    • You can already use it today with Python:
    • sqlContext.read.parquet("..").toPandas()
    • Needs to pass through Spark, very slow
    • Native Python support on its way:
    • Parquet I/O to Arrow
    • Arrow provides NumPy conversion
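
For readers coming to this later: the native path sketched on this slide is what the `pyarrow` package offers today. The example below is a small sketch that assumes a present-day `pyarrow` installation, which postdates the talk; the column names and the `roundtrip_demo` helper are made up for illustration.

```python
import os
import tempfile


def roundtrip_demo():
    """Write a small table to Parquet and read a single column back."""
    # Imported here so the module still loads where pyarrow is not installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"store": [1, 2, 3], "sales": [10.0, 12.5, 9.0]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")
        pq.write_table(table, path)
        # Only the requested column is read from the file.
        result = pq.read_table(path, columns=["sales"])
        return result.to_pydict()
```

With pandas installed, calling `to_pandas()` on the table returned by `read_table` yields a DataFrame instead of a plain dictionary.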
  14. State of Arrow & Parquet
    Arrow: in-memory spec for columnar data
    • Java (beta)
    • C++ (in progress)
    • Python (in progress)
    • Planned: Julia, R
    Parquet: columnar on-disk storage
    • Java (mature)
    • C++ (in progress)
    • Python (in progress)
    • Planned: Julia, R
  15. Upcoming
    • Parquet <-Arrow-> Pandas
    • IPC on its way
    • alpha implementation using memory mapped files
    • JVM <-> native with shared reference counting
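
The idea behind IPC over memory-mapped files can be illustrated with the standard library alone: one side writes a column as raw bytes, the other maps the file and views it as typed values without copying them into Python objects first. This is a generic sketch with a made-up file name, not Arrow's IPC format, which additionally carries metadata describing the buffers.

```python
import mmap
import os
import tempfile
from array import array

values = array("d", [1.5, 2.5, 4.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")

    # "Producer": write the column as one contiguous buffer of doubles.
    with open(path, "wb") as f:
        f.write(values.tobytes())

    # "Consumer": map the file and reinterpret the bytes in place.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            with memoryview(mapped) as raw, raw.cast("d") as view:
                total = sum(view)
                count = len(view)
```

Both sides only need to agree on the byte layout, which is exactly the role a shared columnar specification plays.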
  16. Get Involved!
    • dev@arrow.apache.org & dev@parquet.apache.org
    • https://apachearrowslackin.herokuapp.com/
    • https://arrow.apache.org/
    • https://parquet.apache.org/
    • @ApacheArrow & @ApacheParquet
  17. Questions ?!