Slide 1

Slide 1 text

Parquet from a file format to a system architecture https://flow-php.com

Slide 2

Slide 2 text

Parquet Overview https://flow-php.com

Slide 3

Slide 3 text

Developed in 2013 by Twitter and Cloudera https://flow-php.com

Slide 4

Slide 4 text

Key Goals • Reduced I/O (Input/Output) for analytical queries • Optimized data compression to save storage space • Efficient data encoding to save storage space • Support for complex data structures (nested fields) • Strict schema with schema evolution support https://flow-php.com

Slide 5

Slide 5 text

Widely adopted in modern big data stacks https://flow-php.com

Slide 6

Slide 6 text

Comparison with other file formats https://flow-php.com

Slide 7

Slide 7 text

CSV Advantages • Simple, human-readable, easy to generate and parse • Supported almost universally across programming languages and tools • Quickly importable to any spreadsheet application • Simple to modify with just a text editor Disadvantages • Inefficient in storage and querying • No support for nested data structures or metadata • No support for metadata or schema • No standard for escaping or formatting • Data integrity issues https://flow-php.com

Slide 8

Slide 8 text

XML Advantages • Human-readable, text-based and can be easily read by humans and machines alike • Supports complex and nested data • Validation and schema enforcement, through XSD and DTD Disadvantages • Extremely large file sizes due to heavy tagging, leading to inefficient storage. • Complexity increases significantly for deeply nested structures, making it harder to manage and debug • Not designed for efficient querying and processing of large datasets in analytical systems https://flow-php.com

Slide 9

Slide 9 text

JSON Advantages • Supports nested structures, arrays, and complex types natively • Native support in most modern programming languages • Easily readable and writable for humans, making debugging and manual inspection straightforward • Universally accepted in web APIs, configuration files, and data interchange formats Disadvantages • Lack of compression and indexing • No strict schema enforcement • JSON's structure isn't well-suited for analytical queries and partial data access https://flow-php.com

Slide 10

Slide 10 text

Avro Advantages • Designed for efficient binary serialization • Support for schema and schema evolution • Support for nested structures • Good for streaming use cases Disadvantages • Not human-readable, due to it’s binary nature • Not as performant for analytical queries since it requires scanning entire rows, even if only a few columns are needed https://flow-php.com

Slide 11

Slide 11 text

Parquet Advantages • Excellent compression, reducing storage costs • Handles nested structures and data types, including maps, lists and structures • Strict schema and analytical metadata • Supports evolution • Optimized for Parallel Processing Disadvantages • Not human-readable, due to it’s binary nature • Not easily editable • Not as performant for real-time streaming or transactional systems where row based approach is needed https://flow-php.com

Slide 12

Slide 12 text

Key Differences Format Type Schema Nested Structures Compression Encoding CSV Text / Row No No No No JSON Text / Row Partial Yes No No XML Text / Row Yes Yes No No Avro Binary / Row Yes Yes Yes Yes Parquet Binary / Column Yes Yes Yes Yes https://flow-php.com

Slide 13

Slide 13 text

Row vs Colum formats https://flow-php.com

Slide 14

Slide 14 text

Users File Schema • id – int32 • name – string • email – string • active – bool https://flow-php.com

Slide 15

Slide 15 text

Rows id name email active 1 user-01 user- [email protected] true 2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false https://flow-php.com

Slide 16

Slide 16 text

Rows id name email active 1 user-01 user- [email protected] true 2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false Columns id 1 2 3 4 5 name user-01 user-02 user-03 user-04 user-05 email user- [email protected] user- [email protected] user- [email protected] user- [email protected] user- [email protected] active true true true true false https://flow-php.com

Slide 17

Slide 17 text

Parquet Building Blocks https://flow-php.com

Slide 18

Slide 18 text

File Footer • schema • row group metadata • column metadata • key-value metadata • created by / version / total rows • encryption metadata • footer length • magic number https://flow-php.com

Slide 19

Slide 19 text

File Metadata https://flow-php.com

Slide 20

Slide 20 text

Page Header Metadata https://flow-php.com

Slide 21

Slide 21 text

File Content • row groups • columns • pages https://flow-php.com

Slide 22

Slide 22 text

Parquet File Structure https://flow-php.com

Slide 23

Slide 23 text

Metadata is like a book index https://flow-php.com

Slide 24

Slide 24 text

Metadata in real world Using metadata in systems architecture can significantly improve data accessibility and processing. Metadata Storage: Data Storage: https://flow-php.com

Slide 25

Slide 25 text

File Metadata https://flow-php.com

Slide 26

Slide 26 text

Binary Serialization with Thrift https://flow-php.com

Slide 27

Slide 27 text

Thrift definition to PHP https://flow-php.com

Slide 28

Slide 28 text

Reading Thrift https://flow-php.com

Slide 29

Slide 29 text

Benefits • cross platform data structures • 50-70% size reduction • allows for additional encoding (like RLE/BitPack) • support for multiple protocols like binary/compact/json https://flow-php.com

Slide 30

Slide 30 text

Apache Thrift Alternatives • gRPC • Protobuf • MessagePack • FlatBuffers https://flow-php.com

Slide 31

Slide 31 text

Parquet Column Types https://flow-php.com

Slide 32

Slide 32 text

Physical Types • BOOLEAN • INT32 • INT64 • INT96 (deprecated) • FLOAT • DOUBLE • BYTE_ARRAY • FIXED_LEN_BYTE_ARRAY Logical Types • STRING • MAP • LIST • ENUM • DECIMAL • DATE • TIME • TIMESTAMP • INTEGER • BSON • JSON • UUID • FLOAT16 https://flow-php.com

Slide 33

Slide 33 text

Logical types are physical types with metadata Logical • STRING • DATE • TIMESTAMP • UUID Physical • BYTE_ARRAY • INT32 • INT64 (used to be INT96) • FIXED_LEN_BYTE_ARRAY https://flow-php.com

Slide 34

Slide 34 text

Types are used by two types of Columns • Flat Column • All physical types • Logical types except Map and List • Nested Column • Map • List • Structure* • Nested columns are collection of flat / nested columns https://flow-php.com

Slide 35

Slide 35 text

Advanced Structures https://flow-php.com

Slide 36

Slide 36 text

Structure – collection of columns https://flow-php.com

Slide 37

Slide 37 text

Parquet column Repetition https://flow-php.com

Slide 38

Slide 38 text

Repetition Explained https://flow-php.com

Slide 39

Slide 39 text

Examples https://flow-php.com

Slide 40

Slide 40 text

Parquet Schema https://flow-php.com

Slide 41

Slide 41 text

Schema DDL (Data Definition Language) https://flow-php.com

Slide 42

Slide 42 text

But wait a moment can’t we do the same in JSON or XML? https://flow-php.com

Slide 43

Slide 43 text

Well, yes, but also no... https://flow-php.com

Slide 44

Slide 44 text

Parquet is more efficient thanks to how the data is stored https://flow-php.com

Slide 45

Slide 45 text

It’s year 2010 "Dremel: Interactive Analysis of Web-Scale Datasets” was presented by Google at 36th International Conference on Very Large Data Bases (VLDB) https://flow-php.com

Slide 46

Slide 46 text

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data https://flow-php.com

Slide 47

Slide 47 text

Orders Schema DDL https://flow-php.com

Slide 48

Slide 48 text

JSON Order https://flow-php.com

Slide 49

Slide 49 text

XML Order https://flow-php.com

Slide 50

Slide 50 text

Parquet Order Preview generated with: flow-php/cli https://flow-php.com

Slide 51

Slide 51 text

Parquet uses the columnar storage approach that was first introduced by Google Dremel https://flow-php.com

Slide 52

Slide 52 text

Address – Structure{street:string, city:string, zip: string, country:string } Json Parquet https://flow-php.com

Slide 53

Slide 53 text

Notes – List Json Parquet https://flow-php.com

Slide 54

Slide 54 text

Items – List Json Parquet https://flow-php.com

Slide 55

Slide 55 text

Dremel Shred • Column – Int32 • Rows – [1, 2, 3, null, 4, 5] https://flow-php.com

Slide 56

Slide 56 text

Dremel Shred • Column – Int32 • Rows – [1, 2, 3, null, 4, 5] What are Repetition and Definition Levels? https://flow-php.com

Slide 57

Slide 57 text

Repetitions and Definition levels are taken from file schema https://flow-php.com

Slide 58

Slide 58 text

Parquet Order Preview generated with: flow-php/cli https://flow-php.com

Slide 59

Slide 59 text

Dremel Shred • Column – Int32 • Rows – [1, 2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com

Slide 60

Slide 60 text

Dremel Shred • Column – Int32 • Rows – [1, 2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com

Slide 61

Slide 61 text

Dremel Shred • Column – Int32 • Rows – [1, 2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com

Slide 62

Slide 62 text

Dremel Shred • Column – list • Rows – [[1, 2, 3], [1, 4, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [2, 2, 2, 2, 2, 2] • Values: [1, 2, 3, 1, 4, 5] https://flow-php.com

Slide 63

Slide 63 text

Dremel Shred • Column – list • Rows – [[1, 2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com

Slide 64

Slide 64 text

Where parquet stores repetitions definitions and values • row group • column chunk • page (header & data) • repetitions • definitions • values https://flow-php.com

Slide 65

Slide 65 text

Repetitions and Definitions are encoded Run Length Encoding / Bit-Packing Hybrid https://flow-php.com

Slide 66

Slide 66 text

Dremel Shred • Column – list • Rows – [[1, 2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com

Slide 67

Slide 67 text

Run Length Encoding / Bit-Packing - Hybrid https://flow-php.com

Slide 68

Slide 68 text

For highly repetitive data, the Run-Length Encoding (RLE) / Bit- Packing - Hybrid is extremely effective. https://flow-php.com

Slide 69

Slide 69 text

Encoding Visualization Before After https://flow-php.com

Slide 70

Slide 70 text

RLE Bit Packing Before After https://flow-php.com

Slide 71

Slide 71 text

Values are also encoded encoding algorithm is used based on a column type https://flow-php.com

Slide 72

Slide 72 text

Parquet supported encodings https://flow-php.com

Slide 73

Slide 73 text

Plain • BOOLEAN - 1 bit per value. 0 is false; 1 is true. • INT32 - 4 bytes per value. Stored as little-endian. • INT64 - 8 bytes per value. Stored as little-endian. • FLOAT - 4 bytes per value. IEEE. Stored as little-endian. • DOUBLE - 8 bytes per value. IEEE. Stored as little-endian. • BYTE_ARRAY - 4 byte length stored as little endian, followed by bytes. • FIXED_LEN_BYTE_ARRAY - Just the bytes. Note: Default encoding https://flow-php.com

Slide 74

Slide 74 text

Plain Dictionary Dictionary encoding. The values in the dictionary are encoded in the plain type. Note: Deprecated • in a data page use RLE_DICTIONARY instead • in a Dictionary page use PLAIN instead https://flow-php.com

Slide 75

Slide 75 text

RLE Group packed run length encoding. Usable for definition/repetition levels encoding and Booleans (on one bit: 0 is false; 1 is true.) https://flow-php.com

Slide 76

Slide 76 text

Bit Packed Bit packed encoding. This can only be used if the data has a known max width. Usable for definition/repetition levels encoding https://flow-php.com

Slide 77

Slide 77 text

Delta Binary Packed Encoding for byte arrays to separate the length values and the data. The lengths are encoded using DELTA_BINARY_PACKED https://flow-php.com

Slide 78

Slide 78 text

Delta Byte Array Incremental-encoded byte array. Prefix lengths are encoded using DELTA_BINARY_PACKED. Suffixes are stored as delta length byte arrays. https://flow-php.com

Slide 79

Slide 79 text

RLE Dictionary Dictionary encoding: the ids are encoded using the RLE encoding https://flow-php.com

Slide 80

Slide 80 text

RLE Dictionary Dictionary encoding: the ids are encoded using the RLE encoding row value 1 apple 2 orange 3 banana 4 apple 5 orange 6 banana 7 orange Raw Values Dictionary index value 0 apple 1 orange 2 banana Indices row value 1 0 2 1 3 2 4 0 5 1 6 2 7 1 https://flow-php.com

Slide 81

Slide 81 text

Byte Stream Split Encoding for fixed-width data (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). K byte-streams are created where K is the size in bytes of the data type. The individual bytes of a value are scattered to the corresponding stream and the streams are concatenated. This itself does not reduce the size of the data but can lead to better compression afterwards. Note: Added in 2.8 for FLOAT and DOUBLE. Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11. https://flow-php.com

Slide 82

Slide 82 text

Pages are also compressed each page is compressed independently https://flow-php.com

Slide 83

Slide 83 text

Compression Algorithms https://flow-php.com

Slide 84

Slide 84 text

Oversimplified comparison https://flow-php.com

Slide 85

Slide 85 text

Google Snappy is the most popular due to speed https://flow-php.com

Slide 86

Slide 86 text

Why parquet is so efficient? • Flat storage for nested structures • Advanced encodings system chosen by the data type • Compression applied on each page https://flow-php.com

Slide 87

Slide 87 text

Repetitive data can be encoded and then compressed reducing storage space https://flow-php.com

Slide 88

Slide 88 text

Flat columnar storage is easier to query but can we optimize it even more? https://flow-php.com

Slide 89

Slide 89 text

Row Group - configuration • Most parquet libraries will let developer to define the size of the Row Group. • Size of the Row Group defines how much data writer will hold in memory before dumping to a file • Size of the Row Group also defines how much data reader will put into the memory at once row group size rows 1 256Mb 1 000 000 2 256Mb 1 000 000 3 256Mb 1 000 000 4 128Mb 500 000 Parquet File https://flow-php.com

Slide 90

Slide 90 text

Bigger Row Group higher memory consumption lower I/O operations number https://flow-php.com

Slide 91

Slide 91 text

Smaller Row Group lower memory consumption higher I/O operations number https://flow-php.com

Slide 92

Slide 92 text

Row group needs to be written at once but we can select columns to read https://flow-php.com

Slide 93

Slide 93 text

Row Group Row Group column: id id name email active 1 user-01 user- [email protected] true 2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false column: name column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data For simplicity let’s look at it from a rows point of view https://flow-php.com

Slide 94

Slide 94 text

Parquet is ideal for parallel reading https://flow-php.com

Slide 95

Slide 95 text

Read only selected columns Row Group column: id column: name column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data https://flow-php.com

Slide 96

Slide 96 text

That’s not even all! remember column metadata? https://flow-php.com

Slide 97

Slide 97 text

Parquet Metadata Structure • Row Group • Column Chunk (one chunk per column) • Metadata for column in a row group https://flow-php.com

Slide 98

Slide 98 text

Column Chunk Statistics • number of nulls • unique values count • max value • min value https://flow-php.com

Slide 99

Slide 99 text

How can we benefit from that? https://flow-php.com

Slide 100

Slide 100 text

How to query parquet file Parquet File Metadata Row Group 1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 https://flow-php.com

Slide 101

Slide 101 text

How to query parquet file Parquet File Metadata Row Group 1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 This is a technique known from RDBMs world Partition Pruning Filter out out unnecessary data partitions based on their statistics https://flow-php.com

Slide 102

Slide 102 text

Can it get better? https://flow-php.com

Slide 103

Slide 103 text

Remember flat columnar storage? https://flow-php.com

Slide 104

Slide 104 text

Parquet can filter out row groups based on nested structures properties https://flow-php.com

Slide 105

Slide 105 text

What about the downsides? https://flow-php.com

Slide 106

Slide 106 text

All parquet viewers sucks! https://flow-php.com

Slide 107

Slide 107 text

Row Group Row Group Immutability - Can’t be “modified” Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 108

Slide 108 text

Row Group Row Group Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 109

Slide 109 text

Row Group Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 110

Slide 110 text

Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file Row Group column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 111

Slide 111 text

Row Group Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 112

Slide 112 text

Row Group Row Group Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 113

Slide 113 text

Row Group Row Group Immutability To modify parquet single cell, we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata 6. dump metadata to file footer Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Slide 114

Slide 114 text

Thank you! https://flow-php.com

Slide 115

Slide 115 text

https://flow-php.com Norbert Orzechowicz WHOAMI Software architect Consultant Open-source enthusiast