Parquet
from a file format to a system architecture
https://flow-php.com
Slide 2
Slide 2 text
Parquet Overview
https://flow-php.com
Slide 3
Slide 3 text
Developed in 2013
by Twitter and Cloudera
https://flow-php.com
Slide 4
Slide 4 text
Key Goals
• Reduced I/O (Input/Output) for analytical queries
• Optimized data compression to save storage space
• Efficient data encoding to save storage space
• Support for complex data structures (nested fields)
• Strict schema with schema evolution support
https://flow-php.com
Slide 5
Slide 5 text
Widely adopted in modern big
data stacks
https://flow-php.com
Slide 6
Slide 6 text
Comparison with other file
formats
https://flow-php.com
Slide 7
Slide 7 text
CSV
Advantages
• Simple, human-readable, easy to generate
and parse
• Supported almost universally across
programming languages and tools
• Quickly importable to any spreadsheet
application
• Simple to modify with just a text editor
Disadvantages
• Inefficient in storage and querying
• No support for nested data structures or
metadata
• No support for metadata or schema
• No standard for escaping or formatting
• Data integrity issues
https://flow-php.com
Slide 8
Slide 8 text
XML
Advantages
• Human-readable, text-based and can be
easily read by humans and machines alike
• Supports complex and nested data
• Validation and schema enforcement, through
XSD and DTD
Disadvantages
• Extremely large file sizes due to heavy
tagging, leading to inefficient storage.
• Complexity increases significantly for deeply
nested structures, making it harder to
manage and debug
• Not designed for efficient querying and
processing of large datasets in analytical
systems
https://flow-php.com
Slide 9
Slide 9 text
JSON
Advantages
• Supports nested structures, arrays, and
complex types natively
• Native support in most modern programming
languages
• Easily readable and writable for humans,
making debugging and manual inspection
straightforward
• Universally accepted in web APIs,
configuration files, and data interchange
formats
Disadvantages
• Lack of compression and indexing
• No strict schema enforcement
• JSON's structure isn't well-suited for analytical
queries and partial data access
https://flow-php.com
Slide 10
Slide 10 text
Avro
Advantages
• Designed for efficient binary serialization
• Support for schema and schema evolution
• Support for nested structures
• Good for streaming use cases
Disadvantages
• Not human-readable, due to it’s binary nature
• Not as performant for analytical queries since
it requires scanning entire rows, even if only a
few columns are needed
https://flow-php.com
Slide 11
Slide 11 text
Parquet
Advantages
• Excellent compression, reducing storage costs
• Handles nested structures and data types,
including maps, lists and structures
• Strict schema and analytical metadata
• Supports evolution
• Optimized for Parallel Processing
Disadvantages
• Not human-readable, due to it’s binary nature
• Not easily editable
• Not as performant for real-time streaming or
transactional systems where row based
approach is needed
https://flow-php.com
Slide 12
Slide 12 text
Key Differences
Format Type Schema Nested
Structures
Compression Encoding
CSV Text / Row No No No No
JSON Text / Row Partial Yes No No
XML Text / Row Yes Yes No No
Avro Binary / Row Yes Yes Yes Yes
Parquet Binary / Column Yes Yes Yes Yes
https://flow-php.com
Slide 13
Slide 13 text
Row vs Colum
formats
https://flow-php.com
Slide 14
Slide 14 text
Users File Schema
• id – int32
• name – string
• email – string
• active – bool
https://flow-php.com
File Footer
• schema
• row group metadata
• column metadata
• key-value metadata
• created by / version / total rows
• encryption metadata
• footer length
• magic number
https://flow-php.com
Slide 19
Slide 19 text
File Metadata
https://flow-php.com
Slide 20
Slide 20 text
Page Header Metadata
https://flow-php.com
Slide 21
Slide 21 text
File Content
• row groups
• columns
• pages
https://flow-php.com
Slide 22
Slide 22 text
Parquet File Structure
https://flow-php.com
Slide 23
Slide 23 text
Metadata is like a book index
https://flow-php.com
Slide 24
Slide 24 text
Metadata in real world
Using metadata in systems architecture can
significantly improve data accessibility and
processing.
Metadata Storage:
Data Storage:
https://flow-php.com
Slide 25
Slide 25 text
File Metadata
https://flow-php.com
Slide 26
Slide 26 text
Binary Serialization with Thrift
https://flow-php.com
Slide 27
Slide 27 text
Thrift definition to PHP
https://flow-php.com
Slide 28
Slide 28 text
Reading Thrift
https://flow-php.com
Slide 29
Slide 29 text
Benefits
• cross platform data structures
• 50-70% size reduction
• allows for additional encoding (like RLE/BitPack)
• support for multiple protocols like binary/compact/json
https://flow-php.com
Logical types are physical types with metadata
Logical
• STRING
• DATE
• TIMESTAMP
• UUID
Physical
• BYTE_ARRAY
• INT32
• INT64 (used to be INT96)
• FIXED_LEN_BYTE_ARRAY
https://flow-php.com
Slide 34
Slide 34 text
Types are used by two types of Columns
• Flat Column
• All physical types
• Logical types except Map and List
• Nested Column
• Map
• List
• Structure*
• Nested columns are collection of
flat / nested columns
https://flow-php.com
Slide 35
Slide 35 text
Advanced Structures
https://flow-php.com
Slide 36
Slide 36 text
Structure – collection of columns
https://flow-php.com
But wait a moment
can’t we do the same in JSON or XML?
https://flow-php.com
Slide 43
Slide 43 text
Well, yes, but also no...
https://flow-php.com
Slide 44
Slide 44 text
Parquet is more efficient
thanks to how the data is
stored
https://flow-php.com
Slide 45
Slide 45 text
It’s year 2010
"Dremel: Interactive Analysis of Web-Scale Datasets”
was presented by Google at
36th International Conference on Very Large Data Bases (VLDB)
https://flow-php.com
Slide 46
Slide 46 text
Dremel is a scalable, interactive
ad-hoc query system for analysis
of read-only nested data
https://flow-php.com
Slide 47
Slide 47 text
Orders Schema DDL
https://flow-php.com
Slide 48
Slide 48 text
JSON Order
https://flow-php.com
Slide 49
Slide 49 text
XML Order
https://flow-php.com
Slide 50
Slide 50 text
Parquet Order
Preview generated with:
flow-php/cli
https://flow-php.com
Slide 51
Slide 51 text
Parquet uses the columnar storage
approach that was first introduced
by Google Dremel
https://flow-php.com
Run Length Encoding / Bit-Packing - Hybrid
https://flow-php.com
Slide 68
Slide 68 text
For highly repetitive data, the
Run-Length Encoding (RLE) / Bit-
Packing - Hybrid is extremely
effective.
https://flow-php.com
Slide 69
Slide 69 text
Encoding Visualization
Before After
https://flow-php.com
Slide 70
Slide 70 text
RLE Bit Packing
Before After
https://flow-php.com
Slide 71
Slide 71 text
Values are also encoded
encoding algorithm is used based on a column type
https://flow-php.com
Slide 72
Slide 72 text
Parquet supported encodings
https://flow-php.com
Slide 73
Slide 73 text
Plain • BOOLEAN - 1 bit per value. 0 is false; 1 is true.
• INT32 - 4 bytes per value. Stored as little-endian.
• INT64 - 8 bytes per value. Stored as little-endian.
• FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
• DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
• BYTE_ARRAY - 4 byte length stored as little endian, followed
by bytes.
• FIXED_LEN_BYTE_ARRAY - Just the bytes.
Note: Default encoding
https://flow-php.com
Slide 74
Slide 74 text
Plain Dictionary Dictionary encoding. The values in
the dictionary are encoded in the
plain type.
Note: Deprecated
• in a data page use RLE_DICTIONARY
instead
• in a Dictionary page use PLAIN instead
https://flow-php.com
Slide 75
Slide 75 text
RLE Group packed run length encoding.
Usable for definition/repetition levels encoding
and Booleans (on one bit: 0 is false; 1 is true.)
https://flow-php.com
Slide 76
Slide 76 text
Bit Packed Bit packed encoding.
This can only be used if the data has a known
max width.
Usable for definition/repetition levels encoding
https://flow-php.com
Slide 77
Slide 77 text
Delta Binary Packed Encoding for byte arrays to separate
the length values and the data.
The lengths are encoded using
DELTA_BINARY_PACKED
https://flow-php.com
Slide 78
Slide 78 text
Delta Byte Array Incremental-encoded byte array.
Prefix lengths are encoded using
DELTA_BINARY_PACKED.
Suffixes are stored as delta length byte arrays.
https://flow-php.com
Slide 79
Slide 79 text
RLE Dictionary Dictionary encoding: the ids are
encoded using the RLE encoding
https://flow-php.com
Slide 80
Slide 80 text
RLE Dictionary Dictionary encoding: the ids are
encoded using the RLE encoding
row value
1 apple
2 orange
3 banana
4 apple
5 orange
6 banana
7 orange
Raw Values Dictionary
index value
0 apple
1 orange
2 banana
Indices
row value
1 0
2 1
3 2
4 0
5 1
6 2
7 1
https://flow-php.com
Slide 81
Slide 81 text
Byte Stream Split Encoding for fixed-width data (FLOAT, DOUBLE, INT32, INT64,
FIXED_LEN_BYTE_ARRAY).
K byte-streams are created where K is the size in bytes of the
data type.
The individual bytes of a value are scattered to the
corresponding stream and the streams are concatenated.
This itself does not reduce the size of the data but can lead to
better compression afterwards.
Note: Added in 2.8 for FLOAT and DOUBLE.
Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY
added in 2.11.
https://flow-php.com
Slide 82
Slide 82 text
Pages are also compressed
each page is compressed independently
https://flow-php.com
Slide 83
Slide 83 text
Compression Algorithms
https://flow-php.com
Slide 84
Slide 84 text
Oversimplified comparison
https://flow-php.com
Slide 85
Slide 85 text
Google Snappy is the most popular
due to speed
https://flow-php.com
Slide 86
Slide 86 text
Why parquet is so efficient?
• Flat storage for nested structures
• Advanced encodings system chosen by the data type
• Compression applied on each page
https://flow-php.com
Slide 87
Slide 87 text
Repetitive data can be encoded and
then compressed
reducing storage space
https://flow-php.com
Slide 88
Slide 88 text
Flat columnar storage is easier to query
but can we optimize it even more?
https://flow-php.com
Slide 89
Slide 89 text
Row Group - configuration
• Most parquet libraries will let
developer to define the size of the
Row Group.
• Size of the Row Group defines how
much data writer will hold
in memory before dumping to a file
• Size of the Row Group also defines
how much data reader will put into
the memory at once
row group size rows
1 256Mb 1 000 000
2 256Mb 1 000 000
3 256Mb 1 000 000
4 128Mb 500 000
Parquet File
https://flow-php.com
Slide 90
Slide 90 text
Bigger Row Group
higher memory consumption
lower I/O operations number
https://flow-php.com
Slide 91
Slide 91 text
Smaller Row Group
lower memory consumption
higher I/O operations number
https://flow-php.com
Slide 92
Slide 92 text
Row group needs to be written at once
but we can select columns to read
https://flow-php.com
Slide 93
Slide 93 text
Row Group Row Group
column: id
id name email active
1 user-01 user-
[email protected]
true
2 user-02 user-
[email protected]
true
3 user-03 user-
[email protected]
true
4 user-04 user-
[email protected]
true
5 user-05 user-
[email protected]
false
column: name column: email column: active
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
For simplicity let’s look at it from a rows point of view
https://flow-php.com
Slide 94
Slide 94 text
Parquet is ideal for parallel reading
https://flow-php.com
Slide 95
Slide 95 text
Read only selected
columns
Row Group
column: id column: name column: email column: active
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
Page 0
===========
header
repetitions
definitions
data
Page 1
===========
header
repetitions
definitions
data
Page ... n
===========
header
repetitions
definitions
data
https://flow-php.com
Slide 96
Slide 96 text
That’s not even all!
remember column metadata?
https://flow-php.com
Slide 97
Slide 97 text
Parquet Metadata
Structure
• Row Group
• Column Chunk (one chunk per column)
• Metadata for column in a row group
https://flow-php.com
Slide 98
Slide 98 text
Column Chunk Statistics
• number of nulls
• unique values count
• max value
• min value
https://flow-php.com
Slide 99
Slide 99 text
How can we benefit from that?
https://flow-php.com
Slide 100
Slide 100 text
How to query parquet file Parquet File Metadata
Row Group 1 Metadata
Column “date” Statistics
=======
min: 2024-01-01
max: 2024-02-01
Row Group 2 Metadata
Column “date” Statistics
=======
min: 2024-03-01
max: 2024-04-01
Row Group 3 Metadata
Column “date” Statistics
=======
min: 2024-05-01
max: 2024-06-01
https://flow-php.com
Slide 101
Slide 101 text
How to query parquet file Parquet File Metadata
Row Group 1 Metadata
Column “date” Statistics
=======
min: 2024-01-01
max: 2024-02-01
Row Group 2 Metadata
Column “date” Statistics
=======
min: 2024-03-01
max: 2024-04-01
Row Group 3 Metadata
Column “date” Statistics
=======
min: 2024-05-01
max: 2024-06-01
This is a technique known from RDBMs world
Partition Pruning
Filter out out unnecessary data partitions
based on their statistics
https://flow-php.com
Parquet can filter out row
groups based on nested
structures properties
https://flow-php.com
Slide 105
Slide 105 text
What about the downsides?
https://flow-php.com
Slide 106
Slide 106 text
All parquet viewers sucks!
https://flow-php.com
Slide 107
Slide 107 text
Row Group Row Group
Immutability
- Can’t be “modified”
Row Group
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 108
Slide 108 text
Row Group Row Group
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
Row Group
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 109
Slide 109 text
Row Group
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
2. load row group to memory and modify
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 110
Slide 110 text
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
2. load row group to memory and modify
3. keep moving row groups to the new file
Row Group
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 111
Slide 111 text
Row Group
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
2. load row group to memory and modify
3. keep moving row groups to the new file
4. dump modified row group from memory
Row Group
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 112
Slide 112 text
Row Group Row Group
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
2. load row group to memory and modify
3. keep moving row groups to the new file
4. dump modified row group from memory
5. recalculate file metadata
Row Group
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com
Slide 113
Slide 113 text
Row Group Row Group
Immutability
To modify parquet single cell, we need to
rewrite the entire file:
1. localize row group
2. load row group to memory and modify
3. keep moving row groups to the new file
4. dump modified row group from memory
5. recalculate file metadata
6. dump metadata to file footer
Row Group
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
column: id column: name column: email column: active
page page 0 page page
https://flow-php.com