Going beyond Apache Parquet's default settings

In the last decade, Apache Parquet has become the standard format for storing tabular data on disk, regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability and, especially, its strong performance with the default settings.

While these default settings and access patterns already provide decent performance, understanding the format in more detail and using recent developments lets you achieve better performance, produce smaller files, and use Parquet's newer partial-reading features to read even smaller subsets of a file for a given query.

This talk aims to provide insight into the Parquet format and its recent developments that are useful for end users' daily workflows. The only prior knowledge required is knowing what a DataFrame / tabular data is.

Uwe L. Korn

April 24, 2024

Transcript

1. About me
   • Uwe Korn (https://mastodon.social/@xhochy / @xhochy)
   • CTO at Data Science startup QuantCo
   • Previously worked as a Data Engineer
   • A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
   • PyData Südwest Co-Organizer

2. Apache Parquet
   1. Columnar, on-disk storage format
   2. Started in 2012 by Cloudera and Twitter
   3. Later, it became Apache Parquet
   4. Fall 2016 brought full Python & C++ support
   5. State-of-the-art since the Hadoop era, still going strong

3. Clear benefits
   1. Columnar makes vectorized operations fast
   2. Efficient encodings and compression make it small
   3. Predicate pushdown brings computation to the I/O layer
   4. Language-independent and widespread; common exchange format

4.–5. Data Types?
   • Well, actually… it doesn't save much on disk.
   • By choosing the optimal types (lossless cast to e.g. float32 or uint8) on a month of New York Taxi trips: saves 963 bytes 😥 of 20.6 MiB

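To try this on your own data, a minimal sketch of the lossless downcasting step could look like the following; the file paths are placeholders for a local copy of the Taxi data, and which columns actually shrink depends on your schema.

```python
import pandas as pd

# Placeholder path to a local copy of one month of NYC Yellow Taxi trips.
df = pd.read_parquet("yellow_tripdata_2021-01.parquet")

optimized = df.copy()
for col in optimized.columns:
    series = optimized[col]
    if pd.api.types.is_float_dtype(series):
        # float64 -> float32 only if the round-trip is lossless.
        round_trip = series.astype("float32").astype("float64")
        if ((round_trip == series) | (round_trip.isna() & series.isna())).all():
            optimized[col] = series.astype("float32")
    elif pd.api.types.is_integer_dtype(series):
        # Shrink integers to the smallest unsigned type that fits (if possible).
        optimized[col] = pd.to_numeric(series, downcast="unsigned")

df.to_parquet("taxi_default_types.parquet")
optimized.to_parquet("taxi_optimized_types.parquet")
# Comparing the two file sizes shows only a tiny saving on disk.
```
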
6.–9. Compression Algorithm
   • Datasets:
      • New York Yellow Taxi Trips 2021-01
      • New York Yellow Taxi Trips 2021-01 with a custom prediction
      • gov.uk (House) Price Paid dataset
      • COVID-19 Epidemiology
   • Time measurements: pick the median of five runs

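A sketch of how such a comparison can be run with pyarrow; the input path is a placeholder and the codec list is restricted to what pyarrow supports out of the box.

```python
import os
import statistics
import time

import pyarrow.parquet as pq

table = pq.read_table("yellow_tripdata_2021-01.parquet")  # placeholder path

for codec in ["none", "snappy", "gzip", "brotli", "zstd", "lz4"]:
    path = f"taxi_{codec}.parquet"
    timings = []
    for _ in range(5):
        start = time.perf_counter()
        pq.write_table(table, path, compression=codec)
        timings.append(time.perf_counter() - start)
    size_mib = os.path.getsize(path) / 2**20
    # Report the median write time of five runs plus the resulting file size.
    print(f"{codec:>8}: {statistics.median(timings):6.2f}s  {size_mib:8.1f} MiB")
```
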
10. Compression Level
   1. For Brotli, ZStandard and GZIP, we can tune the level
   2. Snappy and „none“ have a fixed compression level.

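As a sketch, the level is passed via `compression_level` in pyarrow; the valid range depends on the codec (e.g. 1–22 for ZStandard), and the input path is again a placeholder.

```python
import pyarrow.parquet as pq

table = pq.read_table("yellow_tripdata_2021-01.parquet")  # placeholder path

# Sweep the ZStandard level; GZIP and Brotli take compression_level the same
# way, while Snappy and "none" have no level to tune.
for level in (1, 3, 7, 15, 22):
    pq.write_table(
        table,
        f"taxi_zstd_level{level}.parquet",
        compression="zstd",
        compression_level=level,
    )
```
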
11.–13. Compression
   1. Let's stick for now with ZStandard, as it seems a good tradeoff between speed and size.
   2. In some cases (e.g. slow network drives), it might be worth also considering Brotli…
   3. …but Brotli is relatively slow to decompress.

14. RowGroup size
   1. If you plan to partially access the data, RowGroups are the common place to filter.
   2. If you want to read the whole data, fewer are better.
   3. Compression & encoding also work better with larger RowGroups.

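A sketch of setting the row-group size at write time and filtering on read; the column name in the filter is an assumption about the Taxi schema.

```python
import pyarrow.parquet as pq

table = pq.read_table("yellow_tripdata_2021-01.parquet")  # placeholder path

# Larger row groups compress and encode better; smaller ones make partial
# reads more selective. row_group_size caps the number of rows per group.
pq.write_table(table, "taxi.parquet", row_group_size=256_000, compression="zstd")

# On read, row groups whose min/max statistics cannot match the predicate
# are skipped entirely.
subset = pq.read_table(
    "taxi.parquet",
    filters=[("passenger_count", ">=", 4)],  # assumed column name
)
```
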
15. Encodings
   1. https://parquet.apache.org/docs/file-format/data-pages/encodings/
   2. We have been using RLE_DICTIONARY for all columns
   3. DELTA_* encodings not implemented in pyarrow
   4. Byte Stream Split a recent addition

16. Encodings
   1. Byte Stream Split is sometimes faster than dictionary encoding, but not significantly
   2. For high-entropy columns, BSS shines

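A sketch of enabling BYTE_STREAM_SPLIT for a single float column in pyarrow while keeping dictionary encoding everywhere else; the `prediction` column is an assumption standing in for a high-entropy float column, and newer pyarrow versions alternatively expose this via the `column_encoding` argument.

```python
import pyarrow.parquet as pq

table = pq.read_table("yellow_tripdata_2021-01.parquet")  # placeholder path

# Assumed high-entropy float column, e.g. a model prediction.
bss_columns = ["prediction"]
dict_columns = [name for name in table.column_names if name not in bss_columns]

pq.write_table(
    table,
    "taxi_bss.parquet",
    compression="zstd",
    use_dictionary=dict_columns,        # RLE_DICTIONARY for the other columns
    use_byte_stream_split=bss_columns,  # BYTE_STREAM_SPLIT for the float column
)
```
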
17. Hand-Crafted Delta
   1. Let's take the timestamps in the NYC Taxi Trips data
   2. Sort by pickup date
   3. Compute a delta column for both dates
   4. 17.5% saving on the whole file.

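A minimal sketch of the hand-crafted delta, assuming the standard Yellow Taxi column names: absolute timestamps are replaced by row-to-row differences (the first row keeps its absolute value), so the original column can be rebuilt with a cumulative sum.

```python
import pandas as pd

df = pd.read_parquet("yellow_tripdata_2021-01.parquet")  # placeholder path

# Sort by pickup time so consecutive deltas stay small.
df = df.sort_values("tpep_pickup_datetime")

for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
    ns = df[col].astype("int64")            # nanoseconds since epoch
    # Delta to the previous row; the first row keeps its absolute value.
    df[col + "_delta"] = ns - ns.shift(1, fill_value=0)
    df = df.drop(columns=[col])

df.to_parquet("taxi_delta.parquet", compression="zstd")

# Reading back: reconstruct the timestamps with a cumulative sum.
restored = pd.read_parquet("taxi_delta.parquet")
pickup = pd.to_datetime(restored["tpep_pickup_datetime_delta"].cumsum())
```
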
18. Order your data
   1. With our hand-crafted delta, it was worth sorting the data
   2. Sorting can help in general, but in these tests it only paid off for the Price Paid dataset, where it saved 25%; all other files actually got larger

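A sketch of the sorting experiment on the Price Paid dataset; the column names are assumptions, and as the slide notes, the effect has to be measured per dataset since sorting made the other files larger.

```python
import pandas as pd

df = pd.read_parquet("price_paid.parquet")  # placeholder path

# Sorting groups similar values together, which can help run-length and
# dictionary encoding; compare the resulting file sizes, it can also backfire.
df.sort_values(["postcode", "date_of_transfer"]).to_parquet(
    "price_paid_sorted.parquet", compression="zstd"
)
```
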
19. Summary
   1. Adjusting your data types is helpful in memory, but has no significant effect on disk
   2. Store high-entropy floats as Byte Stream Split encoded columns
   3. Check whether sorting has an effect
   4. Delta encoding in Parquet would be useful; use a hand-crafted version for now
   5. Zstd at level 3/4 seems like a good default compression setting

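Putting the summary together, a sketch of a tuned write with pyarrow; treating every float column as a Byte Stream Split candidate is an assumption you should verify against your own data.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("input.parquet")  # placeholder input

# Float columns as Byte Stream Split candidates, dictionary encoding for the rest.
float_cols = [f.name for f in table.schema if pa.types.is_floating(f.type)]
other_cols = [f.name for f in table.schema if f.name not in float_cols]

pq.write_table(
    table,
    "output.parquet",
    compression="zstd",
    compression_level=3,               # zstd level 3/4 as a default
    use_dictionary=other_cols,
    use_byte_stream_split=float_cols,
)
```
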
20. What do we get?
   1. Run once with the default settings
   2. Test all compression settings, but also…
      1. …use the hand-crafted delta.
      2. …use Byte Stream Split on predictions.