PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet

Slide 1

Slide 1 text

Efficient and portable DataFrame storage with Apache Parquet
Uwe L. Korn, PyData London 2017

Slide 2

Slide 2 text

About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy [email protected]

Slide 3

Slide 3 text

Agenda
• History of Apache Parquet
• The format in detail
• Use it in Python

Slide 4

Slide 4 text

About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
   • often used as the default I/O option

Slide 5

Slide 5 text

Why use Parquet?
1. Columnar format -> vectorized operations
2. Efficient encodings and compression -> small size without the need for a fat CPU
3. Query push-down -> bring computation to the I/O layer
4. Language-independent format -> libs in Java / Scala / C++ / Python / …

Slide 6

Slide 6 text

Who uses Parquet?
• Query Engines
  • Hive
  • Impala
  • Drill
  • Presto
  • …
• Frameworks
  • Spark
  • MapReduce
  • …
• Pandas
• Dask

Slide 7

Slide 7 text

File Structure
[Diagram labels: File, RowGroup, Column Chunks, Page, Statistics]

Slide 8

Slide 8 text

Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
  • NYC TLC Trip Record data for January 2016
  • 1629 MiB as CSV
  • columns: bool(1), datetime(2), float(12), int(4)
  • Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Slide 9

Slide 9 text

Encodings: PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• -> 1499 MiB
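As a rough illustration of PLAIN encoding, the sketch below writes a list of integers as fixed-width, little-endian 32-bit values, which is how 32-bit integers are laid out under PLAIN. The function names and sample values are hypothetical, and only the standard library is used.

```python
import struct

def plain_encode_int32(values):
    """Concatenate each value as a 4-byte little-endian signed integer."""
    return struct.pack(f"<{len(values)}i", *values)

def plain_decode_int32(data):
    """Inverse of plain_encode_int32: read back 4-byte little-endian integers."""
    count = len(data) // 4
    return list(struct.unpack(f"<{count}i", data))

values = [1, 2, 2, 5, 1]
encoded = plain_encode_int32(values)
print(len(encoded))  # 20 bytes: 5 values * 4 bytes each
assert plain_decode_int32(encoded) == values
```

Nothing about the data is exploited here: five small numbers still cost 20 bytes, which is why the size stays close to that of the raw values.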

Slide 10

Slide 10 text

Encodings: RLE & Bit Packing
• Bit packing: only use the necessary bits
• Run-length encoding: 378 times "12"
• Hybrid: dynamically choose the best
• Used for definition & repetition levels
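The two ideas on this slide can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the hybrid encoding with its headers as Parquet writes it; the function names are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def bit_width(max_value):
    """Number of bits needed to represent every integer in 0..max_value."""
    return max(1, max_value.bit_length())

def bit_pack(values, width):
    """Pack non-negative integers into bytes, `width` bits each, lowest bits first."""
    buffer = 0
    for position, value in enumerate(values):
        buffer |= value << (position * width)
    total_bits = len(values) * width
    return buffer.to_bytes((total_bits + 7) // 8, "little")

# Run-length encoding: a long run collapses into a single pair.
runs = run_length_encode([12] * 378 + [7, 7, 3])
print(runs)  # [(12, 378), (7, 2), (3, 1)]

# Bit packing: values 0..7 need only 3 bits each.
small = [0, 1, 2, 3, 4, 5, 6, 7]
width = bit_width(max(small))
packed = bit_pack(small, width)
print(width, len(packed))  # 3 3
```

Run-length encoding pays off for long runs of one value, bit packing for short runs of small values, which is why a hybrid that switches between them is attractive.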

Slide 11

Slide 11 text

Encodings: Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• Every value is assigned a code
• Dictionary: store a map of code -> value
• Data: store only codes, use RLE on that
• -> 329 MiB (22%)
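A minimal sketch of dictionary encoding, assuming a small column of repeated strings (the values and function names are made up for illustration). Each distinct value receives an integer code, the data becomes a list of codes, and the codes are then run-length encoded.

```python
from itertools import groupby

def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value gets the next integer code."""
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    dictionary = list(code_of)  # position in this list is the code
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary to recover the original values."""
    return [dictionary[code] for code in codes]

payment_types = ["card", "card", "card", "cash", "cash", "card", "dispute"]
dictionary, codes = dictionary_encode(payment_types)
print(dictionary)  # ['card', 'cash', 'dispute']
print(codes)       # [0, 0, 0, 1, 1, 0, 2]

# Run-length encode the codes ("use RLE on that").
code_runs = [(code, len(list(group))) for code, group in groupby(codes)]
print(code_runs)   # [(0, 3), (1, 2), (0, 1), (2, 1)]

assert dictionary_decode(dictionary, codes) == payment_types
```

The gain grows with repetition: the fewer distinct values a column has, the smaller the dictionary and the narrower the codes.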

Slide 12

Slide 12 text

Compression
1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, with less CPU cost
4. LZO, Snappy, GZIP, Brotli -> If in doubt: use Snappy
5. GZIP: 174 MiB (11%), Snappy: 216 MiB (14%)
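To get a feel for how encoding and compression combine, the sketch below compresses the same hypothetical column twice with zlib from the standard library (DEFLATE, the algorithm behind GZIP): once in a PLAIN-style 4-byte layout and once as one-byte dictionary codes. Snappy, LZO and Brotli are not in the standard library and are left out. The data is synthetic, so the printed sizes say nothing about the figures on the slide.

```python
import random
import struct
import zlib

random.seed(0)
# Hypothetical low-cardinality column: 10,000 values drawn from four integers.
distinct = [100000, 250000, 400000, 999999]
column = [random.choice(distinct) for _ in range(10_000)]

# PLAIN-style layout: 4 bytes per value.
plain = struct.pack(f"<{len(column)}i", *column)

# Dictionary-style layout: one byte-sized code per value
# (the four-entry dictionary itself is ignored in this comparison).
codes = bytes(distinct.index(value) for value in column)

for label, payload in [("plain", plain), ("dictionary codes", codes)]:
    compressed = zlib.compress(payload, 6)
    assert zlib.decompress(compressed) == payload  # compression is lossless
    print(f"{label}: {len(payload)} bytes raw, {len(compressed)} bytes compressed")
```

The compressor sees only bytes, so it works on any input; the encoding step is what brings knowledge about the column into play before the compressor runs.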

Slide 13

Slide 13 text

Query pushdown
1. Only load used data
   1. skip columns that are not needed
   2. skip (chunks of) rows that are not relevant
2. Saves I/O load as the data is not transferred
3. Saves CPU as the data is not decoded
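The skipping logic can be sketched with a toy model. The `RowGroup` class and `scan` function below are hypothetical stand-ins, not a real Parquet reader: each row group keeps min/max statistics per column, and a scan consults them to skip whole row groups and reads only the requested columns.

```python
from dataclasses import dataclass, field

@dataclass
class RowGroup:
    """Toy row group: column name -> list of values, plus min/max statistics."""
    columns: dict
    stats: dict = field(init=False)

    def __post_init__(self):
        # Statistics are computed once at "write" time, not during the scan.
        self.stats = {name: (min(v), max(v)) for name, v in self.columns.items()}

def scan(row_groups, wanted_columns, filter_column, lower_bound):
    """Return rows with filter_column >= lower_bound and the number of skipped groups."""
    rows = []
    skipped = 0
    for group in row_groups:
        _, group_max = group.stats[filter_column]
        if group_max < lower_bound:
            skipped += 1  # no row can match: skip the whole row group
            continue
        for i, value in enumerate(group.columns[filter_column]):
            if value >= lower_bound:
                # Only the requested columns are touched.
                rows.append({name: group.columns[name][i] for name in wanted_columns})
    return rows, skipped

row_groups = [
    RowGroup({"distance": [0.5, 1.2, 2.0], "fare": [4.0, 6.5, 9.0], "tip": [0.0, 1.0, 2.0]}),
    RowGroup({"distance": [8.1, 15.3, 3.3], "fare": [25.0, 48.0, 12.5], "tip": [5.0, 9.0, 2.0]}),
]
rows, skipped = scan(row_groups, ["fare"], "distance", 5.0)
print(rows)     # [{'fare': 25.0}, {'fare': 48.0}]
print(skipped)  # 1
```

The first row group is never read because its maximum distance is below the bound, and the `tip` column is never read at all.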

Slide 14

Slide 14 text

Benchmarks (size)

Slide 15

Slide 15 text

Benchmarks (time)

Slide 16

Slide 16 text

Benchmarks (size vs time)

Slide 17

Slide 17 text

Read & Write Parquet
https://arrow.apache.org/docs/python/parquet.html
Alternative implementation: https://fastparquet.readthedocs.io/en/latest/

Slide 18

Slide 18 text

Apache Arrow?
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
• This brought Parquet to Pandas without any Python code in parquet-cpp
Just released 0.3

Slide 19

Slide 19 text

Get Involved!

Apache Arrow: Cross-language DataFrame library
• Website: https://arrow.apache.org/
• ML: [email protected]
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub mirror: https://github.com/apache/arrow

Apache Parquet: Famous columnar file format
• Website: https://parquet.apache.org/
• ML: [email protected]
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ GitHub mirror: https://github.com/apache/parquet-cpp

Slide 20

Slide 20 text

Blue Yonder: Best decisions, delivered daily
Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0
Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360
Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA