Intro to Parquet (June 2015)

Sam BESSALAH @samklr http://parquet.apache.org

Typical Data workflow

Multiple Data Formats

Big Data Data Format Zoo - Sequence Files

these formats provide

Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
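To make predicate and projection pushdown concrete, here is a minimal pure-Python sketch. It is not Parquet's reader API: `RowGroup`, `scan`, and the sample columns are hypothetical names invented for illustration. The assumption is that each row group keeps min/max statistics per column, so a reader can skip whole groups and touch only the requested columns.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """A horizontal slice of a table, stored column by column (toy model)."""
    columns: dict                      # column name -> list of values
    stats: dict = field(init=False)    # column name -> (min, max)

    def __post_init__(self):
        # Statistics are computed once at write time, not at query time.
        self.stats = {name: (min(v), max(v)) for name, v in self.columns.items()}


def scan(row_groups, select, where_column, lower_bound):
    """Return the `select` columns for rows where where_column >= lower_bound."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, col_max = group.stats[where_column]
        if col_max < lower_bound:
            continue  # predicate pushdown: statistics rule out the whole group
        groups_read += 1
        # Projection pushdown: only the requested columns are touched.
        needed = {name: group.columns[name] for name in select}
        for i, value in enumerate(group.columns[where_column]):
            if value >= lower_bound:
                rows.append(tuple(needed[name][i] for name in select))
    return rows, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "bytes": [10, 20, 30]}),
    RowGroup({"ts": [4, 5, 6], "user": ["d", "e", "f"], "bytes": [40, 50, 60]}),
]
rows, groups_read = scan(groups, select=["user"], where_column="ts", lower_bound=5)
print(rows, groups_read)
```

The first row group is never read because its maximum `ts` is below the bound, and the `bytes` column is never touched in either group.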

Columnar Storage 101

Columnar Storage 101
Advantages:
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
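The advantages above can be sketched in a few lines of pure Python. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration, and run-length encoding stands in for the encodings a columnar format might apply; this is a toy model, not Parquet's on-disk layout.

```python
from itertools import groupby

# Row-oriented records: (day, country, clicks). Sample data is invented.
rows = [
    ("2015-06-01", "FR", 3),
    ("2015-06-01", "FR", 5),
    ("2015-06-01", "US", 2),
    ("2015-06-02", "US", 7),
]


def to_columns(rows, names):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["day", "country", "clicks"])

# Values of one column sit together, so repeats compress well.
encoded_day = run_length_encode(columns["day"])
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column reads that column only.
total_clicks = sum(columns["clicks"])

print(encoded_day, encoded_country, total_clicks)
```

Storing similar values next to each other is what makes cheap encodings effective, and the sum over `clicks` never looks at `day` or `country`.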

Columnar Storage 101

Parquet Model

Example Parquet Schema

Definition and Repetition Levels
- Definition level: stores the level at which the field is null, i.e. how many optional or repeated fields on the column's path are actually defined.
- Repetition level: stores the level at which a new list starts in the column values.
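A small worked example may help here. The sketch below assumes a hypothetical schema of two nested repeated fields (a list of lists of integers), so both levels range from 0 to 2. The functions `shred` and `assemble` are illustrative names, not part of any Parquet library.

```python
def shred(records):
    """Flatten list-of-lists records into (repetition, definition, value) triples."""
    out = []
    for record in records:
        if not record:                        # outer list empty: nothing is defined
            out.append((0, 0, None))
            continue
        for i, inner in enumerate(record):
            r_outer = 0 if i == 0 else 1      # 0 = new record, 1 = new outer entry
            if not inner:                     # outer entry defined, inner list empty
                out.append((r_outer, 1, None))
                continue
            for j, value in enumerate(inner):
                r = r_outer if j == 0 else 2  # 2 = next value in the same inner list
                out.append((r, 2, value))
    return out


def assemble(triples):
    """Rebuild the nested records from the triples produced by shred()."""
    records = []
    for r, d, value in triples:
        if r == 0:
            records.append([])                # repetition 0 starts a new record
        if d >= 1 and r <= 1:
            records[-1].append([])            # a new inner list starts here
        if d == 2:
            records[-1][-1].append(value)     # fully defined: a real value
    return records


records = [[[1, 2], [3]], [], [[]], [[], [4]]]
triples = shred(records)
for triple in triples:
    print(triple)
```

Only the triples are stored in the column, yet `assemble(triples)` recovers the original nesting, including the empty lists.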

Numbers
Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/

Simple benchmark with Hive

Disk Space usage on HDFS with 128 MB blocks

Slides shamelessly cloned from Julien Le Dem (@J_), lead of
the Apache Parquet project

BACKUP SLIDES
