Intro to Parquet (June 2015)

Sam Bessalah

April 06, 2016

Transcript

  1. Sam BESSALAH @samklr http://parquet.apache.org

  2. Typical Data workflow

  3. Typical Data workflow

  4. Typical Data workflow

  5. Typical Data workflow

  6. Multiple Data Formats

  7. Big Data Data Format Zoo - Sequence Files

  8. these formats provide

  9. None
  10. Binary, columnar storage format for big data analytics workloads, inspired

    by the Google Dremel paper. - Language independent - Processing framework independent - Formally specified - More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown - Awesome performance
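
The predicate and projection pushdown mentioned on this slide can be illustrated with a small, library-free sketch. It is not Parquet's actual reader: `ColumnChunk`, `make_row_groups`, and `scan` are hypothetical names, and the sample rows are invented. It only assumes what the format is known to provide, namely data split into row groups with per-column min/max statistics.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column inside one row group (hypothetical structure)."""
    values: list

    @property
    def max(self):
        # Stands in for the max statistic a columnar file keeps per chunk.
        return max(self.values)


def make_row_groups(rows, columns, group_size):
    """Split row-oriented records into row groups of per-column chunks."""
    groups = []
    for start in range(0, len(rows), group_size):
        batch = rows[start:start + group_size]
        groups.append({c: ColumnChunk([r[c] for r in batch]) for c in columns})
    return groups


def scan(groups, projection, filter_column, lower_bound):
    """Return projected columns for rows where filter_column >= lower_bound.

    Projection pushdown: only the filter column and the projected columns
    are touched. Predicate pushdown: a row group whose max statistic is
    below the bound is skipped without reading its values.
    """
    result, groups_read = [], 0
    for group in groups:
        chunk = group[filter_column]
        if chunk.max < lower_bound:
            continue  # whole row group eliminated by statistics
        groups_read += 1
        for i, value in enumerate(chunk.values):
            if value >= lower_bound:
                result.append({c: group[c].values[i] for c in projection})
    return result, groups_read


# Invented sample data: 8 rows, stored as 2 row groups of 4 rows each.
rows = [
    {"id": i, "country": "FR" if i % 2 else "US", "clicks": i * 10}
    for i in range(8)
]
groups = make_row_groups(rows, ["id", "country", "clicks"], group_size=4)
matches, groups_read = scan(groups, ["id"], "clicks", 60)
print(matches, groups_read)
```

With this data the first row group (clicks 0 to 30) is skipped entirely, and the `country` column is never read.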
  11. Columnar Storage 101

  12. Columnar Storage 101

  13. Columnar Storage 101

  14. Columnar Storage 101 Advantages: - Limits I/O to only the

    data needed - Large space savings: better compression and faster, low-overhead encodings - Enables vectorized execution engines
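
A minimal sketch of why a columnar layout helps with I/O and compression, assuming invented sample rows and hypothetical helper names (`to_columns`, `rle_encode`, `rle_decode`). Run-length encoding is used here as one simple example of a low-overhead encoding; it is not a description of Parquet's on-disk format.

```python
from itertools import groupby

# Invented sample data in row-oriented form: (date, country, clicks).
rows = [
    ("2015-06-01", "FR", 3),
    ("2015-06-01", "FR", 1),
    ("2015-06-01", "FR", 4),
    ("2015-06-01", "US", 1),
    ("2015-06-01", "US", 5),
    ("2015-06-01", "US", 9),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(rows, ["date", "country", "clicks"])

# Limited I/O: an aggregate over one column touches only that column.
total_clicks = sum(columns["clicks"])

# Better compression: values of one column sit together, so runs are long.
date_runs = rle_encode(columns["date"])
country_runs = rle_encode(columns["country"])

print(total_clicks, date_runs, country_runs)
```

In the row layout the same `date` string is interleaved with other fields six times; in the column layout it collapses to a single run.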
  15. Columnar Storage 101

  16. None
  17. Parquet Model

  18. Example Parquet Schema

  19. None
  20. None
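  21. Definition and Repetition Levels Definition Level: records how many optional

    or repeated fields along a column's path are actually defined, which identifies the level at which a value is null. Repetition Level: records, for each value in a column, at which repeated field in the path a new list starts.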
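
The two levels can be made concrete with a small sketch in the style of the Dremel paper. It assumes a hypothetical two-level schema, `repeated group names { repeated string codes; }`, so the column `names.codes` has a maximum repetition level of 2 and a maximum definition level of 2. The functions `shred` and `assemble` and the sample records are illustrative, not Parquet's implementation.

```python
def shred(records):
    """Flatten records into (value, repetition_level, definition_level) triples.

    Each record is a list of names; each name is a list of code strings.
    """
    triples = []
    for names in records:
        if not names:
            # Nothing on the path is defined: definition level 0.
            triples.append((None, 0, 0))
            continue
        for i, codes in enumerate(names):
            # 0 starts a new record, 1 starts a new entry in `names`.
            name_level = 0 if i == 0 else 1
            if not codes:
                # `names` is defined but `codes` is empty: definition level 1.
                triples.append((None, name_level, 1))
                continue
            for j, code in enumerate(codes):
                # 2 means the value continues the current `codes` list.
                repetition = name_level if j == 0 else 2
                triples.append((code, repetition, 2))
    return triples


def assemble(triples):
    """Rebuild the nested records from the flat triples."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            records.append([])       # a new record starts
        if definition == 0:
            continue                 # record has no names at all
        if repetition <= 1:
            records[-1].append([])   # a new name starts
        if definition == 2:
            records[-1][-1].append(value)
    return records


# Invented sample records covering the full, empty, and partly empty cases.
records = [
    [["a", "b"], ["c"]],
    [],
    [[], ["d"]],
]
triples = shred(records)
for triple in triples:
    print(triple)
```

The flat column of triples is enough to rebuild the nesting: `assemble(triples)` returns the original `records`, including the empty record and the empty `codes` list.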
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. Numbers Example: AppNexus - 2 MM logs of ad impressions - 270

    TB of log data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
  29. Simple benchmark with Hive

  30. None
  31. None
  32. Disk Space usage on HDFS with 128 MB blocks

  33. None
  34. None
  35. None
  36. None
  37. None
  38. None
  39. Slides shamelessly cloned from Julien Le Dem (@J_), lead of

    the Apache Parquet project
  40. BACKUP SLIDES

  41. None
  42. None
  43. None
  44. None
  45. None
  46. None
  47. None
  48. None
  49. None
  50. None
  51. None
  52. None