Intro to Parquet (June 2015)

Intro to Parquet (June 2015)

50c1b0fe4cdb0e8e7992d6872cf6cfd7?s=128

Sam Bessalah

April 06, 2016
Tweet

Transcript

  1. 9.
  2. 10.

    Binary, columnar storage format for big data analytics workloads, inspired

    by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
  3. 14.

    Columnar Storage 101 Advantages : - Limits I/O to the

    data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
  4. 16.
  5. 19.
  6. 20.
  7. 21.

    Definition and Repetition Levels Definition Level : Stores the level

    for which the field is null Repetition Level : Store levels when new lists are starting in column values.
  8. 22.
  9. 23.
  10. 24.
  11. 25.
  12. 26.
  13. 27.
  14. 28.

    Numbers Example: Appnexus 2 MM Logs of Ads impressions 270

    TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
  15. 30.
  16. 31.
  17. 33.
  18. 34.
  19. 35.
  20. 36.
  21. 37.
  22. 38.
  23. 41.
  24. 42.
  25. 43.
  26. 44.
  27. 45.
  28. 46.
  29. 47.
  30. 48.
  31. 49.
  32. 50.
  33. 51.
  34. 52.