• Optimized data compression to save storage space
• Efficient data encoding to save storage space
• Support for complex data structures (nested fields)
• Strict schema with schema evolution support

https://flow-php.com
• Supported almost universally across programming languages and tools
• Quickly importable into any spreadsheet application
• Simple to modify with just a text editor

Disadvantages
• Inefficient in storage and querying
• No support for nested data structures
• No support for metadata or schema
• No standard for escaping or formatting
• Data integrity issues
by humans and machines alike
• Supports complex and nested data
• Validation and schema enforcement through XSD and DTD

Disadvantages
• Extremely large file sizes due to heavy tagging, leading to inefficient storage
• Complexity increases significantly for deeply nested structures, making files harder to manage and debug
• Not designed for efficient querying and processing of large datasets in analytical systems
natively
• Native support in most modern programming languages
• Easily readable and writable for humans, making debugging and manual inspection straightforward
• Universally accepted in web APIs, configuration files, and data interchange formats

Disadvantages
• Lack of compression and indexing
• No strict schema enforcement
• JSON's structure isn't well suited to analytical queries and partial data access
for schema and schema evolution
• Support for nested structures
• Good for streaming use cases

Disadvantages
• Not human-readable, due to its binary nature
• Less performant for analytical queries, since entire rows must be scanned even if only a few columns are needed
nested structures and data types, including maps, lists and structures
• Strict schema and analytical metadata
• Supports schema evolution
• Optimized for parallel processing

Disadvantages
• Not human-readable, due to its binary nature
• Not easily editable
• Less performant for real-time streaming or transactional systems, where a row-based approach is needed
         Text / Row        No       No   No   No
JSON     Text / Row        Partial  Yes  No   No
XML      Text / Row        Yes      Yes  No   No
Avro     Binary / Row      Yes      Yes  Yes  Yes
Parquet  Binary / Column   Yes      Yes  Yes  Yes
false; 1 is true.
• INT32 - 4 bytes per value. Stored as little-endian.
• INT64 - 8 bytes per value. Stored as little-endian.
• FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
• DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
• BYTE_ARRAY - 4-byte length stored as little-endian, followed by the bytes.
• FIXED_LEN_BYTE_ARRAY - Just the bytes.

Note: Default encoding
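The byte layouts listed above can be reproduced with Python's standard `struct` module. This is only a minimal sketch of the per-value layout, not a Parquet writer; the helper names (`plain_int32`, `plain_byte_array`, and so on) are made up for illustration.

```python
import struct

# Hypothetical helpers showing the per-value byte layout described above.
# "<" selects little-endian byte order in the struct format string.

def plain_int32(value: int) -> bytes:
    return struct.pack("<i", value)      # 4 bytes, little-endian

def plain_int64(value: int) -> bytes:
    return struct.pack("<q", value)      # 8 bytes, little-endian

def plain_float(value: float) -> bytes:
    return struct.pack("<f", value)      # 4 bytes, IEEE single precision

def plain_double(value: float) -> bytes:
    return struct.pack("<d", value)      # 8 bytes, IEEE double precision

def plain_byte_array(value: bytes) -> bytes:
    # 4-byte little-endian length prefix, followed by the raw bytes
    return struct.pack("<I", len(value)) + value

def plain_fixed_len_byte_array(value: bytes, length: int) -> bytes:
    # The length comes from the schema, so only the bytes are written
    if len(value) != length:
        raise ValueError("value does not match the declared fixed length")
    return value

if __name__ == "__main__":
    print(plain_int32(1).hex())
    print(plain_byte_array(b"abc").hex())
```

Because FIXED_LEN_BYTE_ARRAY carries no length prefix, a reader can only decode it when it knows the declared length from the schema, which is why the sketch takes `length` as a parameter.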
INT64, FIXED_LEN_BYTE_ARRAY). K byte streams are created, where K is the size in bytes of the data type. The individual bytes of each value are scattered to the corresponding streams, and the streams are then concatenated. This does not itself reduce the size of the data, but it can lead to better compression afterwards.

Note: Added in 2.8 for FLOAT and DOUBLE. Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY was added in 2.11.
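The scatter-and-concatenate step described above can be sketched in a few lines of Python. The function names `byte_stream_split` and `byte_stream_join` are illustrative assumptions, and the input is assumed to be values already serialized to fixed-width little-endian bytes.

```python
import struct

def byte_stream_split(values: list, width: int) -> bytes:
    """Scatter byte k of every value into stream k, then concatenate the streams."""
    streams = [bytearray() for _ in range(width)]   # K streams, K = type width
    for value in values:
        if len(value) != width:
            raise ValueError("every value must be exactly `width` bytes long")
        for k in range(width):
            streams[k].append(value[k])
    return b"".join(bytes(s) for s in streams)

def byte_stream_join(data: bytes, width: int) -> list:
    """Inverse operation: rebuild the original fixed-width values."""
    count = len(data) // width                      # number of encoded values
    return [bytes(data[k * count + i] for k in range(width)) for i in range(count)]

if __name__ == "__main__":
    floats = [struct.pack("<f", x) for x in (1.0, 2.0)]
    encoded = byte_stream_split(floats, 4)
    print(encoded.hex())
    print(byte_stream_join(encoded, 4) == floats)
```

The output has exactly the same length as the input. The benefit comes from grouping similar bytes (for floats, typically the sign and exponent bytes) next to each other, which tends to give a general-purpose compressor longer repeated runs to work with.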
developer to define the size of the Row Group.
• The Row Group size defines how much data the writer holds in memory before flushing it to the file
• The Row Group size also defines how much data the reader loads into memory at once

Parquet File
row group   size     rows
1           256 MB   1 000 000
2           256 MB   1 000 000
3           256 MB   1 000 000
4           128 MB     500 000
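The buffering behaviour behind the table above can be illustrated with a small generator. This is a simplified sketch that limits row groups by row count rather than by bytes, and `split_into_row_groups` is a hypothetical name, not part of any Parquet library.

```python
def split_into_row_groups(rows, rows_per_group):
    """Buffer rows in memory and emit one row group whenever the buffer is full."""
    buffer = []
    for row in rows:
        buffer.append(row)                 # writer holds the rows in memory
        if len(buffer) == rows_per_group:
            yield buffer                   # flush a full row group
            buffer = []
    if buffer:
        yield buffer                       # last, partially filled row group

if __name__ == "__main__":
    # 35 rows with 10 rows per group mirrors the 3 full + 1 half-size layout above
    print([len(group) for group in split_into_row_groups(range(35), 10)])
```

The memory held by the writer never exceeds one row group, which is why a larger row group size trades memory for fewer, larger groups.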
Row Group   Metadata: Column “date” Statistics
1           min: 2024-01-01   max: 2024-02-01
2           min: 2024-03-01   max: 2024-04-01
3           min: 2024-05-01   max: 2024-06-01

Partition Pruning
• Filter out unnecessary data partitions based on their statistics
• This is a technique known from the RDBMS world
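Pruning on min/max statistics comes down to an interval-overlap check. The sketch below uses the three row groups from the slide; the dictionary layout, the `prune_row_groups` name, and the query ranges are assumptions made for illustration.

```python
from datetime import date

# Statistics for the "date" column, as shown for the three row groups above
ROW_GROUPS = [
    {"id": 1, "min": date(2024, 1, 1), "max": date(2024, 2, 1)},
    {"id": 2, "min": date(2024, 3, 1), "max": date(2024, 4, 1)},
    {"id": 3, "min": date(2024, 5, 1), "max": date(2024, 6, 1)},
]

def prune_row_groups(row_groups, start, end):
    """Return ids of row groups whose [min, max] range overlaps [start, end]."""
    return [
        rg["id"]
        for rg in row_groups
        if rg["max"] >= start and rg["min"] <= end   # ranges overlap
    ]

if __name__ == "__main__":
    # Hypothetical query: date BETWEEN 2024-03-15 AND 2024-03-20
    print(prune_row_groups(ROW_GROUPS, date(2024, 3, 15), date(2024, 3, 20)))
```

Only the row groups returned by the check need to be read. Statistics can rule a row group out, but they cannot confirm a match, so the surviving groups still have to be scanned and filtered row by row.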
[Diagram: three row groups, each holding the columns id, name, email, and active, with every column split into pages]
we need to rewrite the entire file:
1. locate the row group
2. load the row group into memory and modify it
3. keep moving the other row groups to the new file
4. dump the modified row group from memory
5. recalculate the file metadata
6. dump the metadata to the file footer
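The six steps can be walked through on a toy in-memory model. This is a sketch under heavy assumptions: the "file" is a plain dictionary with a list of row groups and a footer, and `rewrite_with_update` and `build_footer` are hypothetical names, not a real Parquet API.

```python
def build_footer(row_groups):
    """Step 5: recalculate per-row-group and file-level metadata."""
    return {
        "num_rows": sum(len(group) for group in row_groups),
        "row_groups": [
            {
                "num_rows": len(group),
                "min_id": min(row["id"] for row in group),
                "max_id": max(row["id"] for row in group),
            }
            for group in row_groups
        ],
    }

def rewrite_with_update(old_file, row_id, changes):
    """Produce a new file in which the row with `row_id` has `changes` applied."""
    new_groups = []
    for group in old_file["row_groups"]:
        if any(row["id"] == row_id for row in group):        # step 1: locate
            modified = [                                      # step 2: modify in memory
                {**row, **changes} if row["id"] == row_id else row
                for row in group
            ]
            new_groups.append(modified)                       # step 4: dump modified group
        else:
            new_groups.append(group)                          # step 3: move untouched group
    # steps 5 and 6: recalculate the metadata and write it as the footer
    return {"row_groups": new_groups, "footer": build_footer(new_groups)}

if __name__ == "__main__":
    old = {"row_groups": [
        [{"id": 1, "active": True}, {"id": 2, "active": True}],
        [{"id": 3, "active": True}, {"id": 4, "active": True}],
    ]}
    new = rewrite_with_update(old, 3, {"active": False})
    print(new["row_groups"][1][0], new["footer"]["num_rows"])
```

Even in this toy model, changing a single value produces a completely new file object, which reflects why the format is described as not easily editable.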