PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet


Apache Parquet is the most widely used columnar data format in the big data processing space and recently gained Pandas support. It leverages various techniques to store data in a CPU- and I/O-efficient way and provides capabilities to push down queries to the I/O layer. This talk shows how to use it from Python, details the structure of the format and presents its portable usage with other tools.


Uwe L. Korn

May 07, 2017

Transcript

  1. Efficient and portable DataFrame storage with Apache Parquet. Uwe L. Korn, PyData London 2017
  2. About me • Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Work in Python, Cython, C++11 and SQL • Heavy Pandas user • xhochy • uwe@apache.org
  3. Agenda • History of Apache Parquet • The format in detail • Use it in Python
  4. About Parquet 1. Columnar on-disk storage format 2. Started in fall 2012 by Cloudera & Twitter 3. July 2013: 1.0 release 4. 2015: top-level Apache project 5. Fall 2016: Python & C++ support 6. State-of-the-art format in the Hadoop ecosystem, often used as the default I/O option
  5. Why use Parquet? 1. Columnar format —> vectorized operations 2. Efficient encodings and compressions —> small size without the need for a fat CPU 3. Query push-down —> bring computation to the I/O layer 4. Language-independent format —> libs in Java / Scala / C++ / Python / …
  6. Who uses Parquet? • Query engines: Hive, Impala, Drill, Presto, … • Frameworks: Spark, MapReduce, … • Pandas • Dask
  7. File structure: File —> RowGroups —> Column Chunks —> Pages, plus Statistics

  8. Encodings • Know the data • Exploit the knowledge • Cheaper than universal compression • Example dataset: NYC TLC Trip Record data for January 2016 • 1629 MiB as CSV • columns: bool(1), datetime(2), float(12), int(4) • Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
  9. Encodings — PLAIN • Simply write the binary representation to disk • Simple to read & write • Performance limited by I/O throughput • —> 1499 MiB
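The PLAIN idea can be sketched in a few lines of Python (a toy illustration of the concept, not Parquet's actual page layout): values are written as their raw fixed-width binary representation, back to back.

```python
import struct

# Toy PLAIN encoding of three 64-bit floats: just their little-endian
# binary representations concatenated, with no further transformation.
values = [7.5, 12.0, 30.1]
plain = b"".join(struct.pack("<d", v) for v in values)

# Reading is equally direct: fixed-width slices at computed offsets,
# so throughput is limited mainly by I/O, not CPU.
decoded = [struct.unpack_from("<d", plain, 8 * i)[0] for i in range(len(values))]
```

Because no size reduction happens, the on-disk footprint stays close to the in-memory one, which is why the example dataset only shrinks from 1629 MiB (CSV) to 1499 MiB.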
  10. Encodings — RLE & Bit Packing • bit-packing: only use the necessary bits • run-length encoding: „12“ repeated 378 times is stored once with its count • hybrid: dynamically choose the better of the two • used for definition & repetition levels
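Run-length encoding on its own can be sketched as follows (a toy illustration; Parquet's real wire format is a hybrid of RLE and bit-packed runs chosen dynamically):

```python
def rle_encode(values):
    """Collapse runs of repeated values into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [(count, value) for count, value in runs]

def rle_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out

# The slide's example: 378 repetitions of 12 collapse into one pair.
encoded = rle_encode([12] * 378)
```

A run of 378 identical values becomes a single `(378, 12)` pair instead of 378 stored values, which is why RLE is so effective on the low-cardinality definition and repetition levels.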
  11. Encodings — Dictionary • PLAIN_DICTIONARY / RLE_DICTIONARY • every value is assigned a code • dictionary: store a map of code —> value • data: store only the codes, use RLE on them • —> 329 MiB (22%)
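The two-step scheme on this slide can be sketched in plain Python (an illustration of the idea, not Parquet's binary dictionary-page format): build a dictionary of distinct values, then store the data as small integer codes, which RLE can then compress further.

```python
def dict_encode(values):
    """Assign each distinct value a small integer code.

    Returns (dictionary, codes): the dictionary maps code -> value,
    the data itself is stored only as the list of codes.
    """
    code_for = {}
    dictionary = []
    codes = []
    for v in values:
        if v not in code_for:
            code_for[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_for[v])
    return dictionary, codes

# Repetitive string columns shrink dramatically: each value is stored
# once in the dictionary, the column itself becomes a stream of codes.
cities = ["London", "Paris", "London", "London", "Berlin"]
dictionary, codes = dict_encode(cities)
```

This is the same trick Pandas users know from `Categorical` columns; on the example dataset it brings the size down to 329 MiB, 22% of the CSV.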
  12. Compression 1. Shrinks the data independent of its content 2. More CPU-intensive than encoding 3. Encoding + compression performs better than compression alone, at a lower CPU cost 4. LZO, Snappy, GZIP, Brotli —> if in doubt: use Snappy 5. GZIP: 174 MiB (11%), Snappy: 216 MiB (14%)
  13. Query pushdown 1. Only load the data that is used 1. skip columns that are not needed 2. skip (chunks of) rows that are not relevant 2. Saves I/O load, as the data is not transferred 3. Saves CPU, as the data is not decoded
  14. Benchmarks (size)

  15. Benchmarks (time)

  16. Benchmarks (size vs time)

  17. Read & Write Parquet: https://arrow.apache.org/docs/python/parquet.html • Alternative implementation: https://fastparquet.readthedocs.io/en/latest/

  18. Apache Arrow? • Specification for an in-memory columnar data layout • No overhead for cross-system communication • Designed for efficiency (exploit SIMD, cache locality, …) • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM • This brought Parquet to Pandas without any Python code in parquet-cpp • Just released: 0.3
  19. Get involved! • Apache Arrow (cross-language DataFrame library) • Website: https://arrow.apache.org/ • ML: dev@arrow.apache.org • Issues & tasks: https://issues.apache.org/jira/browse/ARROW • Slack: https://apachearrowslackin.herokuapp.com/ • GitHub mirror: https://github.com/apache/arrow • Apache Parquet (famous columnar file format) • Website: https://parquet.apache.org/ • ML: dev@parquet.apache.org • Issues & tasks: https://issues.apache.org/jira/browse/PARQUET • Slack: https://parquet-slack-invite.herokuapp.com/ • C++ GitHub mirror: https://github.com/apache/parquet-cpp
  20. Blue Yonder: Best decisions, delivered daily • Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0 • Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360 • Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA