Parquet - from a file format to a system architecture

Parquet from a file format to a system architecture https://flow-php.com

Parquet Overview https://flow-php.com

Developed in 2013 by Twitter and Cloudera https://flow-php.com

Key Goals • Reduced I/O (Input/Output) for analytical queries •
Optimized data compression to save storage space • Efficient data encoding to save storage space • Support for complex data structures (nested fields) • Strict schema with schema evolution support https://flow-php.com

Widely adopted in modern big data stacks https://flow-php.com

Comparison with other file formats https://flow-php.com

CSV Advantages • Simple, human-readable, easy to generate and parse
• Supported almost universally across programming languages and tools • Quickly importable to any spreadsheet application • Simple to modify with just a text editor Disadvantages • Inefficient in storage and querying • No support for nested data structures or metadata • No support for metadata or schema • No standard for escaping or formatting • Data integrity issues https://flow-php.com

XML Advantages • Human-readable, text-based and can be easily read
by humans and machines alike • Supports complex and nested data • Validation and schema enforcement, through XSD and DTD Disadvantages • Extremely large file sizes due to heavy tagging, leading to inefficient storage. • Complexity increases significantly for deeply nested structures, making it harder to manage and debug • Not designed for efficient querying and processing of large datasets in analytical systems https://flow-php.com

JSON Advantages • Supports nested structures, arrays, and complex types
natively • Native support in most modern programming languages • Easily readable and writable for humans, making debugging and manual inspection straightforward • Universally accepted in web APIs, configuration files, and data interchange formats Disadvantages • Lack of compression and indexing • No strict schema enforcement • JSON's structure isn't well-suited for analytical queries and partial data access https://flow-php.com

Avro Advantages • Designed for efficient binary serialization • Support
for schema and schema evolution • Support for nested structures • Good for streaming use cases Disadvantages • Not human-readable, due to it’s binary nature • Not as performant for analytical queries since it requires scanning entire rows, even if only a few columns are needed https://flow-php.com

Parquet Advantages • Excellent compression, reducing storage costs • Handles
nested structures and data types, including maps, lists and structures • Strict schema and analytical metadata • Supports evolution • Optimized for Parallel Processing Disadvantages • Not human-readable, due to it’s binary nature • Not easily editable • Not as performant for real-time streaming or transactional systems where row based approach is needed https://flow-php.com

Key Differences Format Type Schema Nested Structures Compression Encoding CSV
Text / Row No No No No JSON Text / Row Partial Yes No No XML Text / Row Yes Yes No No Avro Binary / Row Yes Yes Yes Yes Parquet Binary / Column Yes Yes Yes Yes https://flow-php.com

Row vs Colum formats https://flow-php.com

Users File Schema • id – int32 • name –
string • email – string • active – bool https://flow-php.com

Rows id name email active 1 user-01 user- [email protected] true
2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false https://flow-php.com

Rows id name email active 1 user-01 user- [email protected] true
2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false Columns id 1 2 3 4 5 name user-01 user-02 user-03 user-04 user-05 email user- [email protected] user- [email protected] user- [email protected] user- [email protected] user- [email protected] active true true true true false https://flow-php.com

Parquet Building Blocks https://flow-php.com

File Footer • schema • row group metadata • column
metadata • key-value metadata • created by / version / total rows • encryption metadata • footer length • magic number https://flow-php.com

File Metadata https://flow-php.com

Page Header Metadata https://flow-php.com

File Content • row groups • columns • pages https://flow-php.com

Parquet File Structure https://flow-php.com

Metadata is like a book index https://flow-php.com

Metadata in real world Using metadata in systems architecture can
significantly improve data accessibility and processing. Metadata Storage: Data Storage: https://flow-php.com

File Metadata https://flow-php.com

Binary Serialization with Thrift https://flow-php.com

Thrift definition to PHP https://flow-php.com

Reading Thrift https://flow-php.com

Benefits • cross platform data structures • 50-70% size reduction
• allows for additional encoding (like RLE/BitPack) • support for multiple protocols like binary/compact/json https://flow-php.com

Apache Thrift Alternatives • gRPC • Protobuf • MessagePack •
FlatBuffers https://flow-php.com

Parquet Column Types https://flow-php.com

Physical Types • BOOLEAN • INT32 • INT64 • INT96
(deprecated) • FLOAT • DOUBLE • BYTE_ARRAY • FIXED_LEN_BYTE_ARRAY Logical Types • STRING • MAP • LIST • ENUM • DECIMAL • DATE • TIME • TIMESTAMP • INTEGER • BSON • JSON • UUID • FLOAT16 https://flow-php.com

Logical types are physical types with metadata Logical • STRING
• DATE • TIMESTAMP • UUID Physical • BYTE_ARRAY • INT32 • INT64 (used to be INT96) • FIXED_LEN_BYTE_ARRAY https://flow-php.com

Types are used by two types of Columns • Flat
Column • All physical types • Logical types except Map and List • Nested Column • Map • List • Structure* • Nested columns are collection of flat / nested columns https://flow-php.com

Advanced Structures https://flow-php.com

Structure – collection of columns https://flow-php.com

Parquet column Repetition https://flow-php.com

Repetition Explained https://flow-php.com

Examples https://flow-php.com

Parquet Schema https://flow-php.com

Schema DDL (Data Definition Language) https://flow-php.com

But wait a moment can’t we do the same in
JSON or XML? https://flow-php.com

Well, yes, but also no... https://flow-php.com

Parquet is more efficient thanks to how the data is
stored https://flow-php.com

It’s year 2010 "Dremel: Interactive Analysis of Web-Scale Datasets” was
presented by Google at 36th International Conference on Very Large Data Bases (VLDB) https://flow-php.com

Dremel is a scalable, interactive ad-hoc query system for analysis
of read-only nested data https://flow-php.com

Orders Schema DDL https://flow-php.com

JSON Order https://flow-php.com

XML Order https://flow-php.com

Parquet Order Preview generated with: flow-php/cli https://flow-php.com

Parquet uses the columnar storage approach that was first introduced
by Google Dremel https://flow-php.com

Address – Structure{street:string, city:string, zip: string, country:string } Json Parquet
https://flow-php.com

Notes – List<string> Json Parquet https://flow-php.com

Items – List<Structure{sku:string, quantity:int, price: float}> Json Parquet https://flow-php.com

Dremel Shred • Column – Int32 • Rows – [1,
2, 3, null, 4, 5] https://flow-php.com

2, 3, null, 4, 5] What are Repetition and Definition Levels? https://flow-php.com

Repetitions and Definition levels are taken from file schema https://flow-php.com

Parquet Order Preview generated with: flow-php/cli https://flow-php.com

2, 3, null, 4, 5] • Repetitions: [0, 0, 0, 0, 0, 0] • Definitions: [1, 1, 1, 0, 1, 1] • Values: [1, 2, 3, 4, 5] https://flow-php.com

Dremel Shred • Column – list<Int32> • Rows – [[1,
2, 3], [1, 4, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [2, 2, 2, 2, 2, 2] • Values: [1, 2, 3, 1, 4, 5] https://flow-php.com

2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com

Where parquet stores repetitions definitions and values • row group
• column chunk • page (header & data) • repetitions • definitions • values https://flow-php.com

Repetitions and Definitions are encoded Run Length Encoding / Bit-Packing
Hybrid https://flow-php.com

2, 3], [1, null, 5]] • Repetitions: [0, 1, 1, 0, 1, 1] • Definitions: [3, 3, 3, 3, 2, 3] • Values: [1, 2, 3, 1, 5] https://flow-php.com

Run Length Encoding / Bit-Packing - Hybrid https://flow-php.com

For highly repetitive data, the Run-Length Encoding (RLE) / Bit-
Packing - Hybrid is extremely effective. https://flow-php.com

Encoding Visualization Before After https://flow-php.com

RLE Bit Packing Before After https://flow-php.com

Values are also encoded encoding algorithm is used based on
a column type https://flow-php.com

Parquet supported encodings https://flow-php.com

Plain • BOOLEAN - 1 bit per value. 0 is
false; 1 is true. • INT32 - 4 bytes per value. Stored as little-endian. • INT64 - 8 bytes per value. Stored as little-endian. • FLOAT - 4 bytes per value. IEEE. Stored as little-endian. • DOUBLE - 8 bytes per value. IEEE. Stored as little-endian. • BYTE_ARRAY - 4 byte length stored as little endian, followed by bytes. • FIXED_LEN_BYTE_ARRAY - Just the bytes. Note: Default encoding https://flow-php.com

Plain Dictionary Dictionary encoding. The values in the dictionary are
encoded in the plain type. Note: Deprecated • in a data page use RLE_DICTIONARY instead • in a Dictionary page use PLAIN instead https://flow-php.com

RLE Group packed run length encoding. Usable for definition/repetition levels
encoding and Booleans (on one bit: 0 is false; 1 is true.) https://flow-php.com

Bit Packed Bit packed encoding. This can only be used
if the data has a known max width. Usable for definition/repetition levels encoding https://flow-php.com

Delta Binary Packed Encoding for byte arrays to separate the
length values and the data. The lengths are encoded using DELTA_BINARY_PACKED https://flow-php.com

Delta Byte Array Incremental-encoded byte array. Prefix lengths are encoded
using DELTA_BINARY_PACKED. Suffixes are stored as delta length byte arrays. https://flow-php.com

RLE Dictionary Dictionary encoding: the ids are encoded using the
RLE encoding https://flow-php.com

RLE Dictionary Dictionary encoding: the ids are encoded using the
RLE encoding row value 1 apple 2 orange 3 banana 4 apple 5 orange 6 banana 7 orange Raw Values Dictionary index value 0 apple 1 orange 2 banana Indices row value 1 0 2 1 3 2 4 0 5 1 6 2 7 1 https://flow-php.com

Byte Stream Split Encoding for fixed-width data (FLOAT, DOUBLE, INT32,
INT64, FIXED_LEN_BYTE_ARRAY). K byte-streams are created where K is the size in bytes of the data type. The individual bytes of a value are scattered to the corresponding stream and the streams are concatenated. This itself does not reduce the size of the data but can lead to better compression afterwards. Note: Added in 2.8 for FLOAT and DOUBLE. Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11. https://flow-php.com

Pages are also compressed each page is compressed independently https://flow-php.com

Compression Algorithms https://flow-php.com

Oversimplified comparison https://flow-php.com

Google Snappy is the most popular due to speed https://flow-php.com

Why parquet is so efficient? • Flat storage for nested
structures • Advanced encodings system chosen by the data type • Compression applied on each page https://flow-php.com

Repetitive data can be encoded and then compressed reducing storage
space https://flow-php.com

Flat columnar storage is easier to query but can we
optimize it even more? https://flow-php.com

Row Group - configuration • Most parquet libraries will let
developer to define the size of the Row Group. • Size of the Row Group defines how much data writer will hold in memory before dumping to a file • Size of the Row Group also defines how much data reader will put into the memory at once row group size rows 1 256Mb 1 000 000 2 256Mb 1 000 000 3 256Mb 1 000 000 4 128Mb 500 000 Parquet File https://flow-php.com

Bigger Row Group higher memory consumption lower I/O operations number

Smaller Row Group lower memory consumption higher I/O operations number

Row group needs to be written at once but we
can select columns to read https://flow-php.com

Row Group Row Group column: id id name email active
1 user-01 user- [email protected] true 2 user-02 user- [email protected] true 3 user-03 user- [email protected] true 4 user-04 user- [email protected] true 5 user-05 user- [email protected] false column: name column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data For simplicity let’s look at it from a rows point of view https://flow-php.com

Parquet is ideal for parallel reading https://flow-php.com

Read only selected columns Row Group column: id column: name
column: email column: active Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data Page 0 =========== header repetitions definitions data Page 1 =========== header repetitions definitions data Page ... n =========== header repetitions definitions data https://flow-php.com

That’s not even all! remember column metadata? https://flow-php.com

Parquet Metadata Structure • Row Group • Column Chunk (one
chunk per column) • Metadata for column in a row group https://flow-php.com

Column Chunk Statistics • number of nulls • unique values
count • max value • min value https://flow-php.com

How can we benefit from that? https://flow-php.com

How to query parquet file Parquet File Metadata Row Group
1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 https://flow-php.com

How to query parquet file Parquet File Metadata Row Group
1 Metadata Column “date” Statistics ======= min: 2024-01-01 max: 2024-02-01 Row Group 2 Metadata Column “date” Statistics ======= min: 2024-03-01 max: 2024-04-01 Row Group 3 Metadata Column “date” Statistics ======= min: 2024-05-01 max: 2024-06-01 This is a technique known from RDBMs world Partition Pruning Filter out out unnecessary data partitions based on their statistics https://flow-php.com

Can it get better? https://flow-php.com

Remember flat columnar storage? https://flow-php.com

Parquet can filter out row groups based on nested structures
properties https://flow-php.com

What about the downsides? https://flow-php.com

All parquet viewers sucks! https://flow-php.com

Row Group Row Group Immutability - Can’t be “modified” Row
Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Row Group Row Group Immutability To modify parquet single cell,
we need to rewrite the entire file: 1. localize row group Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Row Group Immutability To modify parquet single cell, we need
to rewrite the entire file: 1. localize row group 2. load row group to memory and modify column: id column: name column: email column: active page page 0 page page https://flow-php.com

Immutability To modify parquet single cell, we need to rewrite
the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file Row Group column: id column: name column: email column: active page page 0 page page https://flow-php.com

Row Group Immutability To modify parquet single cell, we need
to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

we need to rewrite the entire file: 1. localize row group 2. load row group to memory and modify 3. keep moving row groups to the new file 4. dump modified row group from memory 5. recalculate file metadata 6. dump metadata to file footer Row Group column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page column: id column: name column: email column: active page page 0 page page https://flow-php.com

Thank you! https://flow-php.com

https://flow-php.com Norbert Orzechowicz WHOAMI Software architect Consultant Open-source enthusiast

Parquet - from a file format to a system archit...

Parquet - from a file format to a system architecture

More Decks by Norbert Orzechowicz

Other Decks in Programming

Featured

Transcript