How Apache Arrow and Parquet boost cross-language interoperability

PyData Paris 2016 talk about the importance of Apache Arrow and Apache Parquet and about recent developments on their Python side.

Uwe L. Korn

June 14, 2016

Transcript

  1. Uwe L. Korn, PyData Paris, 14th June 2016: How Apache Arrow and Parquet boost cross-language interop
  2. About me • Data Scientist at Blue Yonder (@BlueYonderTech) • We optimize Replenishment and Pricing for the Retail industry with Predictive Analytics • Contributor to Apache {Arrow, Parquet} • Work in Python, Cython, C++11 and SQL
  3. Different Systems - Varying Python Support • Various levels of Python support: • built in Python • a Python API • no Python at all • Each tool/algorithm works on columnar data • Separate conversion routines for each pair cause overhead • there’s no one-size-fits-all solution Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
  4. Apache Arrow • Specification for in-memory columnar data layout • No overhead for cross-system / cross-language communication • Designed for efficiency (exploits SIMD, cache locality, ...) • Supports nested data structures (see the Python sketch below) Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
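To make the columnar layout concrete, here is a minimal sketch of building Arrow data from Python. It uses the pyarrow bindings that grew out of this work; the exact calls shown are the API that later stabilized, so treat them as an assumption relative to the state of the code in 2016.

```python
import pyarrow as pa

# One nullable, columnar array: values live in a contiguous buffer,
# nulls are tracked in a separate validity bitmap.
prices = pa.array([9.99, 12.50, None, 7.25])

# Several arrays form a table; the data stays columnar in memory,
# so any Arrow-aware system can consume the same buffers.
table = pa.Table.from_arrays(
    [pa.array(["a", "b", "c", "d"]), prices],
    names=["sku", "price"],
)
print(table.schema)
```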
  5. Apache Arrow - The Impact • An example: retrieve a dataset from an MPP database and analyze it in Pandas • Run a query in the DB • Pass it in columnar form to the DB driver • The ODBC layer transforms it into row-wise form • Pandas makes it columnar again • Ugly real-life workaround: export as CSV, bypass ODBC • In future: use Arrow as the interface between the DB and Pandas (see the sketch below)
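A hedged illustration of the difference, not from the slides: the database and driver are stubbed out with literal values, because the talk does not name an Arrow-capable driver, and the pyarrow calls are the later, stable API. The point is only where the row-wise detour disappears.

```python
import pandas as pd
import pyarrow as pa

# Row-wise hand-off, as with today's ODBC layer: every value passes
# through Python objects while rows are reassembled into columns.
rows = [("a", 9.99), ("b", 12.50), ("c", 7.25)]
df_rows = pd.DataFrame.from_records(rows, columns=["sku", "price"])

# Columnar hand-off via Arrow: whole column buffers cross the boundary
# and pandas converts them in one step instead of rebuilding columns.
table = pa.Table.from_arrays(
    [pa.array(["a", "b", "c"]), pa.array([9.99, 12.50, 7.25])],
    names=["sku", "price"],
)
df_cols = table.to_pandas()
```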
  6. Apache Arrow • Top-level Apache project from the beginning • Not only a specification: also includes C++ / Java / Python / .. code • Arrow structures / classes • RPC (upcoming) & IPC (alpha) support • Conversion code for Parquet, Pandas, .. • Combined effort from developers of over 13 major OSS projects • Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, .. • Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
  7. Arrow in Action: Feather • Language-agnostic file format for binary data frame storage • Read performance close to raw disk I/O • By Wes McKinney (Python) and Hadley Wickham (R) • Julia support in progress (see the sketch below) [Diagram: Arrow arrays + Feather metadata (flatbuffers)]
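A minimal sketch of round-tripping a DataFrame with the feather-format Python package from that period; the write_dataframe/read_dataframe names are from that early package and may differ from later releases.

```python
import pandas as pd
import feather

df = pd.DataFrame({"sku": ["a", "b", "c"], "price": [9.99, 12.50, 7.25]})

# Each column is written as an Arrow array; the schema and column
# offsets go into the flatbuffers-encoded Feather metadata.
feather.write_dataframe(df, "sales.feather")

# Reading restores a pandas DataFrame; the same file can be read
# from R with feather::read_feather().
df2 = feather.read_dataframe("sales.feather")
```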
  8. Apache Parquet • Binary file format for nested columnar data • Inspired by the Google Dremel paper • space- and query-efficient • multiple encodings • predicate pushdown • column-wise compression • many tools use Parquet as the default input format • very popular in the JVM/Hadoop-based world
  9. The Basics • 1 file, includes metadata • Several row groups • all with the same number of column chunks • n pages per column chunk • Benefits: • pre-partitioned for fast distributed access • statistics in the metadata for predicate pushdown (see the sketch below) • Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet [Diagram: File > Row Group > Column Chunk > Page]
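To make the file layout concrete, a short sketch that inspects this hierarchy. It uses the pyarrow.parquet API that eventually emerged from this effort, so the exact names are an assumption relative to the talk's timeframe, and "sales.parquet" is a placeholder file.

```python
import pyarrow.parquet as pq

# A single Parquet file carries its own metadata footer.
pf = pq.ParquetFile("sales.parquet")
meta = pf.metadata
print(meta.num_row_groups, "row groups,", meta.num_columns, "columns")

# Each row group holds one chunk per column; the per-chunk statistics
# (min/max, null count) are what predicate pushdown evaluates.
chunk = meta.row_group(0).column(0)
print(chunk.path_in_schema, chunk.statistics)
```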
  10. Using Parquet in Python • You can use it already today with Python: • sqlContext.read.parquet("..").toPandas() • Needs to pass through Spark, very slow • Native Python support on its way: • Parquet I/O to Arrow • Arrow provides NumPy conversion (see the sketch below)
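A hedged sketch of what the native route looks like once the Parquet-to-Arrow reader lands; pyarrow.parquet.read_table is the API that later shipped and is an assumption with respect to the in-progress state described here.

```python
import pyarrow.parquet as pq

# Native path, no JVM/Spark round trip:
# Parquet file -> Arrow table -> pandas DataFrame backed by NumPy arrays.
table = pq.read_table("sales.parquet", columns=["sku", "price"])
df = table.to_pandas()
```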
  11. State of Arrow & Parquet • Arrow, in-memory spec for columnar data: Java (beta), C++ (in progress), Python (in progress); planned: Julia, R • Parquet, columnar on-disk storage: Java (mature), C++ (in progress), Python (in progress); planned: Julia, R
  12. Upcoming • Parquet <-Arrow-> Pandas • IPC on its way • alpha implementation using memory-mapped files • JVM <-> native with shared reference counting (see the sketch below)
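As a rough illustration of memory-mapped IPC, not the alpha implementation mentioned on the slide: this sketch uses the file/stream APIs that later stabilized in pyarrow, so the exact names are an assumption.

```python
import pyarrow as pa

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Write record batches in the Arrow IPC file format.
with pa.OSFile("batches.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Another process can memory-map the same file and read the batches
# without copying them onto its own heap.
with pa.memory_map("batches.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```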