Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Intro to Parquet (June 2015)

Intro to Parquet (June 2015)

Sam Bessalah

April 06, 2016
Tweet

More Decks by Sam Bessalah

Other Decks in Technology

Transcript

  1. Binary, columnar storage format for big data analytics workloads, inspired

    by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
  2. Columnar Storage 101 Advantages : - Limits I/O to the

    data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
  3. Definition and Repetition Levels Definition Level : Stores the level

    for which the field is null Repetition Level : Store levels when new lists are starting in column values.
  4. Numbers Example: Appnexus 2 MM Logs of Ads impressions 270

    TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/