Intro to Parquet (June 2015)

Sam Bessalah

April 06, 2016

Transcript

  1. Sam BESSALAH
    @samklr
    http://parquet.apache.org

  2. Typical Data workflow

  3. Typical Data workflow

  4. Typical Data workflow

  5. Typical Data workflow

  6. Multiple Data Formats

  7. Big Data Data Format Zoo
    - Sequence Files

  8. these formats provide

  10. Binary, columnar storage format for big data analytics workloads, inspired by
    the Google Dremel paper.
    - Language independent
    - Processing framework independent
    - Formally specified
    - More than columnar storage: dynamic partitioning, automatic predicate
    and projection pushdown
    - Awesome performance
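
To make predicate and projection pushdown concrete, here is a minimal pure-Python sketch. It is not Parquet's actual reader or file layout: the row-group dictionaries, `make_row_group`, `scan`, and the sample columns are all hypothetical names chosen for illustration. It only assumes that each row group stores its columns separately along with per-column min/max statistics.

```python
# Toy model, not the Parquet API: a "file" is a list of row groups, and each
# row group keeps its columns separately together with min/max statistics.
COLUMNS = ["user", "country", "clicks"]  # hypothetical schema


def make_row_group(rows, columns):
    data = {c: [row[c] for row in rows] for c in columns}
    stats = {c: (min(values), max(values)) for c, values in data.items()}
    return {"data": data, "stats": stats}


def scan(row_groups, select, column, lower_bound):
    """Return the selected columns for rows where row[column] >= lower_bound."""
    result = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group["stats"][column]
        if col_max < lower_bound:
            # Predicate pushdown: the statistics prove no row can match,
            # so the whole row group is skipped without reading its data.
            continue
        groups_read += 1
        values = group["data"][column]
        keep = [i for i, v in enumerate(values) if v >= lower_bound]
        # Projection pushdown: only the selected columns are materialized.
        for i in keep:
            result.append({c: group["data"][c][i] for c in select})
    return result, groups_read


rows = [
    {"user": "u%d" % i, "country": "FR" if i % 2 else "US", "clicks": i}
    for i in range(8)
]
row_groups = [make_row_group(rows[i:i + 4], COLUMNS) for i in range(0, 8, 4)]

matches, groups_read = scan(row_groups, select=["user"], column="clicks", lower_bound=6)
print(matches, groups_read)
```

In this toy data set the first row group holds clicks 0 to 3, so the filter `clicks >= 6` skips it entirely and only the `user` column of the second group is returned.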

  11. Columnar Storage 101

  12. Columnar Storage 101

  13. Columnar Storage 101

  14. Columnar Storage 101
    Advantages:
    - Limits I/O to only the data needed
    - Big space savings: better compression and fast, low-overhead
    encodings
    - Enables vectorized execution engines
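
The advantages above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how Parquet encodes data on disk: the sample rows and the `run_length_encode` helper are hypothetical. It shows that once values of one column sit next to each other, repeated values collapse under run-length encoding, and an aggregate only needs to touch one column.

```python
from itertools import groupby

# Hypothetical sample data: (country, clicks) records in row layout.
rows = [("FR", 1), ("FR", 2), ("FR", 3), ("US", 4), ("US", 5)]

# Column layout: each field is stored contiguously.
countries = [country for country, _ in rows]
clicks = [n for _, n in rows]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Similar values are adjacent in a column, so they encode compactly.
encoded = run_length_encode(countries)

# An aggregate over one column never reads the other columns.
total_clicks = sum(clicks)

print(encoded, total_clicks)
```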

  15. Columnar Storage 101

  17. Parquet Model

  18. Example Parquet Schema

  21. Definition and Repetition Levels
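    Definition level:
    Records how many optional or repeated fields along the column's path are
    defined, which identifies the level at which a value is null.
    Repetition level:
    Records at which repeated field along the path a new list starts in the
    column values.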
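
A small worked example may help here. The sketch below assumes a hypothetical Dremel-style schema with a repeated group `contacts` containing an optional string `phone`, so the column `contacts.phone` has a maximum repetition level of 1 and a maximum definition level of 2. The function `shred_contacts_phone` is written for this one path only; it is not a general record shredder and not Parquet's implementation.

```python
def shred_contacts_phone(records):
    """Return (value, repetition_level, definition_level) triples
    for the hypothetical column path contacts.phone."""
    triples = []
    for record in records:
        contacts = record.get("contacts") or []
        if not contacts:
            # Nothing on the path is defined: d = 0. A placeholder entry
            # is still emitted so the record boundary is preserved (r = 0).
            triples.append((None, 0, 0))
            continue
        for i, contact in enumerate(contacts):
            # r = 0 starts a new record; r = 1 continues the contacts list.
            r = 0 if i == 0 else 1
            phone = contact.get("phone")
            # d = 2: phone is present. d = 1: contacts is defined, phone is null.
            d = 2 if phone is not None else 1
            triples.append((phone, r, d))
    return triples


records = [
    {"contacts": [{"phone": "555-1234"}, {"phone": None}]},
    {"contacts": []},
    {"contacts": [{"phone": "555-9876"}]},
]

triples = shred_contacts_phone(records)
for triple in triples:
    print(triple)
```

Reading the triples back, a repetition level of 0 marks the start of each record, and a definition level below the maximum of 2 tells the reader where along the path the null occurred.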

  28. Numbers
    Example: AppNexus
    2 MM logs of ad impressions
    270 TB of Log Data in Protobuf on HDFS
    http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/

  29. Simple benchmark with Hive

  32. Disk Space usage on HDFS with 128 MB blocks

  39. Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project

  40. BACKUP SLIDES
