Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Parquet - from a file format to a system archit...

Parquet - from a file format to a system architecture

Norbert Orzechowicz

October 27, 2024
Tweet

More Decks by Norbert Orzechowicz

Other Decks in Programming

Transcript

  1. Key Goals • Reduced I/O (Input/Output) for analytical queries •

    Optimized data compression to save storage space • Efficient data encoding to save storage space • Support for complex data structures (nested fields) • Strict schema with schema evolution support https://flow-php.com
  2. CSV Advantages • Simple, human-readable, easy to generate and parse

    • Supported almost universally across programming languages and tools • Quickly importable to any spreadsheet application • Simple to modify with just a text editor Disadvantages • Inefficient in storage and querying • No support for nested data structures or metadata • No support for metadata or schema • No standard for escaping or formatting • Data integrity issues https://flow-php.com
  3. XML Advantages • Human-readable, text-based and can be easily read

    by humans and machines alike • Supports complex and nested data • Validation and schema enforcement, through XSD and DTD Disadvantages • Extremely large file sizes due to heavy tagging, leading to inefficient storage. • Complexity increases significantly for deeply nested structures, making it harder to manage and debug • Not designed for efficient querying and processing of large datasets in analytical systems https://flow-php.com
  4. JSON Advantages • Supports nested structures, arrays, and complex types

    natively • Native support in most modern programming languages • Easily readable and writable for humans, making debugging and manual inspection straightforward • Universally accepted in web APIs, configuration files, and data interchange formats Disadvantages • Lack of compression and indexing • No strict schema enforcement • JSON's structure isn't well-suited for analytical queries and partial data access https://flow-php.com
  5. Avro Advantages • Designed for efficient binary serialization • Support

    for schema and schema evolution • Support for nested structures • Good for streaming use cases Disadvantages • Not human-readable, due to it’s binary nature • Not as performant for analytical queries since it requires scanning entire rows, even if only a few columns are needed https://flow-php.com
  6. Parquet Advantages • Excellent compression, reducing storage costs • Handles

    nested structures and data types, including maps, lists and structures • Strict schema and analytical metadata • Supports evolution • Optimized for Parallel Processing Disadvantages • Not human-readable, due to it’s binary nature • Not easily editable • Not as performant for real-time streaming or transactional systems where row based approach is needed https://flow-php.com
  7. Key Differences Format Type Schema Nested Structures Compression Encoding CSV

    Text / Row No No No No JSON Text / Row Partial Yes No No XML Text / Row Yes Yes No No Avro Binary / Row Yes Yes Yes Yes Parquet Binary / Column Yes Yes Yes Yes https://flow-php.com
  8. Users File Schema • id – int32 • name –

    string • email – string • active – bool https://flow-php.com
  9. Rows id name email active 1 user-01 user- [email protected] true

    2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false https://flow-php.com
  10. Rows id name email active 1 user-01 user- [email protected] true

    2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false Columns id 1 2 3 4 5 name user-01 user-02 user-03 user-04 user-05 email user- [email protected] user- [email protected] user- [email protected] user- [email protected] user- [email protected] active true true true true false https://flow-php.com
  11. File Footer • schema • row group metadata • column

    metadata • key-value metadata • created by / version / total rows • encryption metadata • footer length • magic number https://flow-php.com
  12. Metadata in real world Using metadata in systems architecture can

    significantly improve data accessibility and processing. Metadata Storage: Data Storage: https://flow-php.com
  13. Benefits • cross platform data structures • 50-70% size reduction

    • allows for additional encoding (like RLE/BitPack) • support for multiple protocols like binary/compact/json https://flow-php.com
  14. Physical Types • BOOLEAN • INT32 • INT64 • INT96

    (deprecated) • FLOAT • DOUBLE • BYTE_ARRAY • FIXED_LEN_BYTE_ARRAY Logical Types • STRING • MAP • LIST • ENUM • DECIMAL • DATE • TIME • TIMESTAMP • INTEGER • BSON • JSON • UUID • FLOAT16 https://flow-php.com
  15. Logical types are physical types with metadata Logical • STRING

    • DATE • TIMESTAMP • UUID Physical • BYTE_ARRAY • INT32 • INT64 (used to be INT96) • FIXED_LEN_BYTE_ARRAY https://flow-php.com
  16. Types are used by two types of Columns • Flat

    Column • All physical types • Logical types except Map and List • Nested Column • Map • List • Structure* • Nested columns are collection of flat / nested columns https://flow-php.com
  17. But wait a moment can’t we do the same in

    JSON or XML? https://flow-php.com
  18. It’s year 2010 "Dremel: Interactive Analysis of Web-Scale Datasets” was

    presented by Google at 36th International Conference on Very Large Data Bases (VLDB) https://flow-php.com
  19. Dremel is a scalable, interactive ad-hoc query system for analysis

    of read-only nested data https://flow-php.com
  20. Dremel Shred • Column – Int32 • Rows – [1,

    2, 3, null, 4, 5] https://flow-php.com
  21. Dremel Shred • Column – Int32 • Rows – [1,

    2, 3, null, 4, 5] What are Repetition and Definition Levels? https://flow-php.com
  22. Dremel Shred • Column – Int32 • Rows – [1,

    2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com
  23. Dremel Shred • Column – Int32 • Rows – [1,

    2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com
  24. Dremel Shred • Column – Int32 • Rows – [1,

    2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com
  25. Dremel Shred • Column – list<Int32> • Rows – [[1,

    2, 3], [1, 4, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [2, 2, 2, 2, 2, 2] • Values: [1, 2, 3, 1, 4, 5] https://flow-php.com
  26. Dremel Shred • Column – list<Int32> • Rows – [[1,

    2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com
  27. Where parquet stores repetitions definitions and values • row group

    • column chunk • page (header & data) • repetitions • definitions • values https://flow-php.com
  28. Dremel Shred • Column – list<Int32> • Rows – [[1,

    2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com
  29. For highly repetitive data, the Run-Length Encoding (RLE) / Bit-

    Packing - Hybrid is extremely effective. https://flow-php.com
  30. Plain • BOOLEAN - 1 bit per value. 0 is

    false; 1 is true. • INT32 - 4 bytes per value. Stored as little-endian. • INT64 - 8 bytes per value. Stored as little-endian. • FLOAT - 4 bytes per value. IEEE. Stored as little-endian. • DOUBLE - 8 bytes per value. IEEE. Stored as little-endian. • BYTE_ARRAY - 4 byte length stored as little endian, followed by bytes. • FIXED_LEN_BYTE_ARRAY - Just the bytes. Note: Default encoding https://flow-php.com
  31. Plain Dictionary Dictionary encoding. The values in the dictionary are

    encoded in the plain type. Note: Deprecated • in a data page use RLE_DICTIONARY instead • in a Dictionary page use PLAIN instead https://flow-php.com
  32. RLE Group packed run length encoding. Usable for definition/repetition levels

    encoding and Booleans (on one bit: 0 is false; 1 is true.) https://flow-php.com
  33. Bit Packed Bit packed encoding. This can only be used

    if the data has a known max width. Usable for definition/repetition levels encoding https://flow-php.com
  34. Delta Binary Packed Encoding for byte arrays to separate the

    length values and the data. The lengths are encoded using DELTA_BINARY_PACKED https://flow-php.com
  35. Delta Byte Array Incremental-encoded byte array. Prefix lengths are encoded

    using DELTA_BINARY_PACKED. Suffixes are stored as delta length byte arrays. https://flow-php.com
  36. RLE Dictionary Dictionary encoding: the ids are encoded using the

    RLE encoding row value 1 apple 2 orange 3 banana 4 apple 5 orange 6 banana 7 orange Raw Values Dictionary index value 0 apple 1 orange 2 banana Indices row value 1 0 2 1 3 2 4 0 5 1 6 2 7 1 https://flow-php.com
  37. Byte Stream Split Encoding for fixed-width data (FLOAT, DOUBLE, INT32,

    INT64, FIXED_LEN_BYTE_ARRAY). K byte-streams are created where K is the size in bytes of the data type. The individual bytes of a value are scattered to the corresponding stream and the streams are concatenated. This itself does not reduce the size of the data but can lead to better compression afterwards. Note: Added in 2.8 for FLOAT and DOUBLE. Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11. https://flow-php.com
  38. Why parquet is so efficient? • Flat storage for nested

    structures • Advanced encodings system chosen by the data type • Compression applied on each page https://flow-php.com
  39. Flat columnar storage is easier to query but can we

    optimize it even more? https://flow-php.com
  40. Row Group - configuration • Most parquet libraries will let

    developer to define the size of the Row Group. • Size of the Row Group defines how much data writer will hold in memory before dumping to a file • Size of the Row Group also defines how much data reader will put into the memory at once row group size rows 1 256Mb 1 000 000 2 256Mb 1 000 000 3 256Mb 1 000 000 4 128Mb 500 000 Parquet File https://flow-php.com
  41. Row group needs to be written at once but we

    can select columns to read https://flow-php.com
  42. Row Group Row Group column: id id name email active

    1 user-01 user- [email protected] true 2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false column: name column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data For simplicity let’s look at it from a rows point of view https://flow-php.com
  43. Read only selected columns Row Group column: id column: name

    column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data https://flow-php.com
  44. Parquet Metadata Structure • Row Group • Column Chunk (one

    chunk per column) • Metadata for column in a row group https://flow-php.com
  45. Column Chunk Statistics • number of nulls • unique values

    count • max value • min value https://flow-php.com
  46. How to query parquet file Parquet File Metadata Row Group

    1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 https://flow-php.com
  47. How to query parquet file Parquet File Metadata Row Group

    1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 This is a technique known from RDBMs world Partition Pruning Filter out out unnecessary data partitions based on their statistics https://flow-php.com
  48. Row Group Row Group Immutability - Can’t be “modified” Row

    Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com
  49. Row Group Row Group Immutability To modify parquet single cell,

    we need to rewrite the entire file: 1. localize row group Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com
  50. Row Group Immutability To modify parquet single cell, we need

    to rewrite the entire file: 1. localize row group 2. load row group to memory and modify column: id column: name column: email column: active page page 0 page page https://flow-php.com
  51. Immutability To modify parquet single cell, we need to rewrite

    the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file Row Group column: id column: name column: email column: active page page 0 page page https://flow-php.com
  52. Row Group Immutability To modify parquet single cell, we need

    to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com
  53. Row Group Row Group Immutability To modify parquet single cell,

    we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com
  54. Row Group Row Group Immutability To modify parquet single cell,

    we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata 6. dump metadata to file footer Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com