About me
• Uwe Korn
https://mastodon.social/@xhochy / @xhochy
• CTO at Data Science startup QuantCo
• Previously worked as a Data Engineer
• A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
• PyData Südwest Co-Organizer
Apache Parquet
1. Data Frame storage? CSV? Why?
2. Use Parquet
Photo by Hansjörg Keller on Unsplash
Apache Parquet
1. Columnar, on-disk storage format
2. Started in 2012 by Cloudera and Twitter
3. Later, it became Apache Parquet
4. Fall 2016 brought full Python & C++ Support
5. State-of-the-art since the Hadoop era, still going strong
Clear benefits
1. Columnar makes vectorized operations fast
2. Efficient encodings and compression make it small
3. Predicate-pushdown brings computation to the I/O layer
4. Language-independent and widespread; common exchange format
Data Types?
• Well, actually…
• …it doesn’t save much on disk.
• By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips:
Saves 963 bytes 😥 of 20.6 MiB
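For reference, such a lossless downcast can be sketched with pandas; the columns here are illustrative stand-ins for the taxi data:

```python
import pandas as pd

# Illustrative stand-ins for taxi columns that default to int64/float64.
df = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 4],
        "trip_distance": [1.5, 3.2, 0.8, 10.1],
    }
)

# Downcast each numeric column to the smallest type that holds it losslessly.
df["passenger_count"] = pd.to_numeric(df["passenger_count"], downcast="unsigned")
df["trip_distance"] = pd.to_numeric(df["trip_distance"], downcast="float")

print(df.dtypes)  # passenger_count: uint8, trip_distance: float32
```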
Compression
Photo by cafeconcetto on Unsplash
Compression Algorithm
• Datasets:
• New York Yellow Taxi Trips 2021-01
• New York Yellow Taxi Trips 2021-01 with a custom prediction
• gov.uk (House) Price Paid dataset
• COVID-19 Epidemiology
• Time measurements: Pick the median of five runs
Compression Level
1. For Brotli, ZStandard and GZIP, we can tune the level
2. Snappy and “none” have a fixed compression level.
GZIP
Brotli
ZStandard
ZStandard 🔬
ZStandard & Brotli 🔬
Compression
1. Let’s stick with ZStandard for now, as it seems a good tradeoff between speed and size.
2. In some cases (e.g. slow network drives), it might also be worth considering Brotli…
• …but Brotli is relatively slow to decompress.
RowGroup size
1. If you plan to partially access the data, RowGroups are the common place to filter.
2. If you want to read the whole data, fewer are better.
3. Compression & encoding also work better with larger RowGroups.
Single RowGroup
Encodings
1. https://parquet.apache.org/docs/file-format/data-pages/encodings/
2. We have been using RLE_DICTIONARY for all columns
3. DELTA_* encodings not implemented in pyarrow
4. Byte Stream Split a recent addition
Dictionary Encoding
RLE Encoding
Byte Stream Split Encoding
Encodings
1. Byte Stream Split is sometimes faster than dictionary encoding, but not significantly
2. For high-entropy columns, BSS shines
Hand-Crafted Delta
1. Let’s take the timestamps in NYC Taxi Trip
2. Sort by pickup date
3. Compute a delta column for both dates
4. 17.5% saving on the whole file.
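A sketch of the idea with a toy frame (the real dataset calls these columns tpep_pickup_datetime / tpep_dropoff_datetime):

```python
import pandas as pd

# Toy trips standing in for a month of taxi data.
df = pd.DataFrame(
    {
        "pickup": pd.to_datetime(
            ["2021-01-01 00:05", "2021-01-01 00:01", "2021-01-01 00:09"]
        ),
        "dropoff": pd.to_datetime(
            ["2021-01-01 00:20", "2021-01-01 00:11", "2021-01-01 00:30"]
        ),
    }
)

# Sort by pickup, then store small deltas instead of full timestamps;
# many small, similar deltas encode and compress much better.
df = df.sort_values("pickup").reset_index(drop=True)
df["pickup_delta"] = df["pickup"].diff().fillna(pd.Timedelta(0))
df["dropoff_delta"] = df["dropoff"] - df["pickup"]

# Lossless reconstruction: first pickup plus the cumulative sum of deltas.
restored = df["pickup_delta"].cumsum() + df["pickup"].iloc[0]
print(restored.equals(df["pickup"]))  # True
```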
Order your data
1. With our hand-crafted delta, it was worth sorting the data
2. Sorting can help in general, but in our tests it only paid off for the Price Paid dataset, where it saved 25%; all other datasets actually got larger
Summary
1. Adjusting your data types is helpful in-memory, but has no significant effect on-disk
2. Store high-entropy floats as Byte Stream Split encoded columns
3. Check whether sorting has an effect
4. Delta Encoding in Parquet would be useful; use handcrafted for now
5. Zstd on level 3/4 seems like a good default compression setting
Cost Function for compression
What do we get?
1. Run once with the default settings
2. Test all compression settings, but also…
1. … use hand-crafted delta.
2. … use Byte Stream Split on predictions.
Code example available at
https://github.com/xhochy/pyconde24-parquet