About me
• Uwe Korn
https://mastodon.social/@xhochy / @xhochy
• CTO at Data Science startup QuantCo
• Previously worked as a Data Engineer
• A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
• PyData Südwest Co-Organizer
Apache Parquet
1. Data Frame storage? CSV? Why?
2. Use Parquet
Photo by Hansjörg Keller on Unsplash
Apache Parquet
1. Columnar, on-disk storage format
2. Started in 2012 by Cloudera and Twitter
3. Later, it became Apache Parquet
4. Fall 2016 brought full Python & C++ Support
5. State-of-the-art since the Hadoop era, still going strong
Clear benefits
1. Columnar makes vectorized operations fast
2. Efficient encodings and compression make it small
3. Predicate-pushdown brings computation to the I/O layer
4. Language-independent and widespread; common exchange format
Data Types?
• Well, actually…
• …it doesn’t save much on disk.
• By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips:
Saves 963 bytes 😥 of 20.6 MiB
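For reference, such a lossless downcast can be sketched with pandas; the columns here are illustrative stand-ins for the taxi data:

```python
import pandas as pd

# Illustrative stand-ins for taxi columns that default to int64/float64.
df = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 4],
        "trip_distance": [1.5, 3.2, 0.8, 10.1],
    }
)

# Downcast each numeric column to the smallest type that holds it losslessly.
df["passenger_count"] = pd.to_numeric(df["passenger_count"], downcast="unsigned")
df["trip_distance"] = pd.to_numeric(df["trip_distance"], downcast="float")

print(df.dtypes)  # passenger_count: uint8, trip_distance: float32
```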
Compression
Photo by cafeconcetto on Unsplash
Compression Algorithm
• Datasets:
• New York Yellow Taxi Trips 2021-01
• New York Yellow Taxi Trips 2021-01 with a custom prediction
• gov.uk (House) Price Paid dataset
• COVID-19 Epidemiology
• Time measurements: Pick the median of five runs
Compression Level
1. For Brotli, ZStandard and GZIP, we can tune the level
2. Snappy and “none” have a fixed compression level.
GZIP
Brotli
ZStandard
ZStandard 🔬
ZStandard & Brotli 🔬
Compression
1. Let’s stick with ZStandard for now, as it seems a good tradeoff between speed and size.
2. In some cases (e.g. slow network drives), it might also be worth considering Brotli…
• …but Brotli is relatively slow to decompress.
RowGroup size
1. If you plan to partially access the data, RowGroups are the common place to filter.
2. If you want to read the whole data, fewer are better.
3. Compression & encoding also work better with larger RowGroups.
Single RowGroup
Encodings
1. https://parquet.apache.org/docs/file-format/data-pages/encodings/
2. We have been using RLE_DICTIONARY for all columns
3. DELTA_* encodings not implemented in pyarrow
4. Byte Stream Split a recent addition
Dictionary Encoding
RLE Encoding
Byte Stream Split Encoding
Encodings
1. Byte Stream Split is sometimes faster than dictionary encoding, but not significantly
2. For high-entropy columns, BSS shines
Hand-Crafted Delta
1. Let’s take the timestamps in NYC Taxi Trip
2. Sort by pickup date
3. Compute a delta column for both dates
4. 17.5% saving on the whole file.
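A sketch of the idea with a toy frame (the real dataset calls these columns tpep_pickup_datetime / tpep_dropoff_datetime):

```python
import pandas as pd

# Toy trips standing in for a month of taxi data.
df = pd.DataFrame(
    {
        "pickup": pd.to_datetime(
            ["2021-01-01 00:05", "2021-01-01 00:01", "2021-01-01 00:09"]
        ),
        "dropoff": pd.to_datetime(
            ["2021-01-01 00:20", "2021-01-01 00:11", "2021-01-01 00:30"]
        ),
    }
)

# Sort by pickup, then store small deltas instead of full timestamps;
# many small, similar deltas encode and compress much better.
df = df.sort_values("pickup").reset_index(drop=True)
df["pickup_delta"] = df["pickup"].diff().fillna(pd.Timedelta(0))
df["dropoff_delta"] = df["dropoff"] - df["pickup"]

# Lossless reconstruction: first pickup plus the cumulative sum of deltas.
restored = df["pickup_delta"].cumsum() + df["pickup"].iloc[0]
print(restored.equals(df["pickup"]))  # True
```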
Order your data
1. With our hand-crafted delta, it was worth sorting the data
2. Sorting can help in general, but in our tests it only paid off for the Price Paid dataset, where it saved 25%; all other datasets actually got larger
Summary
1. Adjusting your data types is helpful in-memory, but has no significant effect on-disk
2. Store high-entropy floats as Byte Stream Split encoded columns
3. Check whether sorting has an effect
4. Delta Encoding in Parquet would be useful; use handcrafted for now
5. Zstd on level 3/4 seems like a good default compression setting
Cost Function for compression
What do we get?
1. Run once with the default settings
2. Test all compression settings, but also…
1. … use hand-crafted delta.
2. … use Byte Stream Split on predictions.
Code example available at
https://github.com/xhochy/pyconde24-parquet