ApacheCon Europe Big Data 2016 – Parquet Format in Practice & Detail

ApacheCon Europe Big Data 2016 – Parquet Format in Practice & Detail

Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it has the capabilities to push-down analytical queries on the data to the I/O layer to avoid the loading of nonrelevant data chunks. With various Java and a C++ implementation, Parquet is also the perfect choice to exchange data between different technology stacks.

As part of this talk, a general introduction to the format and its techniques will be given. Their benefits and some of the inner workings will be explained to give a better understanding how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats will be shown.

D6fcc16462fbe93673342da3ff5d8121?s=128

Uwe L. Korn

November 16, 2016
Tweet