
PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet

Apache Parquet is the most widely used columnar data format in the big data processing space and recently gained Pandas support. It leverages various techniques to store data in a CPU- and I/O-efficient way and provides the capability to push queries down to the I/O layer. This talk shows how to use it in Python, details its structure, and presents its portable usage with other tools.

Uwe L. Korn

May 07, 2017

Transcript

  1. 1
    Efficient and portable DataFrame
    storage with Apache Parquet
    Uwe L. Korn, PyData London 2017


  2. 2
    • Data Scientist at Blue Yonder
    (@BlueYonderTech)
    • Apache {Arrow, Parquet} PMC
    • Work in Python, Cython, C++11 and SQL
    • Heavy Pandas User
    About me
    xhochy
    [email protected]


  3. 3
    Agenda
    • History of Apache Parquet
    • The format in detail
    • Use it in Python


  4. 4
    About Parquet
    1. Columnar on-disk storage format
    2. Started in fall 2012 by Cloudera & Twitter
    3. July 2013: 1.0 release
    4. 2015: top-level Apache project
    5. Fall 2016: Python & C++ support
    6. State of the art format in the Hadoop ecosystem
    • often used as the default I/O option


  5. 5
    Why use Parquet?
    1. Columnar format

    —> vectorized operations
    2. Efficient encodings and compressions

    —> small size without the need for a fat CPU
    3. Query push-down

    —> bring computation to the I/O layer
    4. Language independent format

    —> libs in Java / Scala / C++ / Python /…


  6. 6
    Who uses Parquet?
    • Query Engines
    • Hive
    • Impala
    • Drill
    • Presto
    • …
    • Frameworks
    • Spark
    • MapReduce
    • …
    • Pandas
    • Dask


  7. File Structure
    File
      RowGroup
        Column Chunks
          Page
    Statistics (per ColumnChunk and per Page)
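    A minimal sketch of walking this hierarchy with pyarrow; the file name is
    hypothetical and the statistics accessor assumes a reasonably recent pyarrow:

      import pyarrow.parquet as pq

      pf = pq.ParquetFile("example.parquet")   # hypothetical file
      meta = pf.metadata                       # file-level metadata
      print(meta.num_rows, meta.num_row_groups, meta.num_columns)

      rg = meta.row_group(0)                   # first RowGroup
      col = rg.column(0)                       # first ColumnChunk in that group
      print(col.compression, col.encodings)
      print(col.statistics)                    # min/max/null_count per chunk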


  8. Encodings
    • Know the data
    • Exploit the knowledge
    • Cheaper than universal compression
    • Example dataset:
    • NYC TLC Trip Record data for January 2016
    • 1629 MiB as CSV
    • columns: bool(1), datetime(2), float(12), int(4)
    • Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
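    A sketch of loading that dataset with pandas; the local file name and the
    datetime column names are assumptions based on the public TLC schema:

      import pandas as pd

      # Hypothetical local copy of the January 2016 yellow taxi CSV (~1629 MiB).
      df = pd.read_csv(
          "yellow_tripdata_2016-01.csv",
          parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
      )
      print(df.dtypes.value_counts())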


  9. Encodings — PLAIN
    • Simply write the binary representation to disk
    • Simple to read & write
    • Performance limited by I/O throughput
    • —> 1499 MiB
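    A sketch of forcing roughly PLAIN storage with pyarrow by disabling
    dictionary encoding and compression (file name hypothetical, df taken from
    the dataset sketch above):

      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.Table.from_pandas(df)
      pq.write_table(table, "trips_plain.parquet",
                     use_dictionary=False, compression="NONE")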


  10. Encodings — RLE & Bit Packing
    • bit-packing: only use the necessary bits per value
    • Run-length encoding: store "378 times 12" instead of 378 repeated values
    • hybrid: dynamically choose the best
    • Used for Definition & Repetition levels
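    An illustrative pure-Python sketch of the run-length idea only, not
    Parquet's actual hybrid RLE/bit-packing implementation:

      def run_length_encode(values):
          """Collapse runs of equal values into (count, value) pairs."""
          runs = []
          for v in values:
              if runs and runs[-1][1] == v:
                  runs[-1][0] += 1
              else:
                  runs.append([1, v])
          return [(count, value) for count, value in runs]

      # 378 repetitions of 12 collapse into a single pair.
      print(run_length_encode([12] * 378))   # [(378, 12)]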


  11. Encodings — Dictionary
    • PLAIN_DICTIONARY / RLE_DICTIONARY
    • every value is assigned a code
    • Dictionary: store a map of code —> value
    • Data: store only codes, use RLE on that
    • —> 329 MiB (22%)
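    Dictionary encoding is the same idea pandas uses for Categorical data: a
    small dictionary of distinct values plus integer codes per row. A sketch
    with made-up values:

      import pandas as pd

      s = pd.Series(["cash", "card", "card", "cash", "card"])
      cat = s.astype("category")
      print(list(cat.cat.categories))   # dictionary: ['card', 'cash']
      print(list(cat.cat.codes))        # stored codes: [1, 0, 0, 1, 0]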


  12. Compression
    1. Shrink data size independent of its content
    2. More CPU intensive than encoding
    3. encoding + compression performs better than compression alone, at lower CPU cost
    4. LZO, Snappy, GZIP, Brotli

    —> If in doubt: use Snappy
    5. GZIP: 174 MiB (11%)

    Snappy: 216 MiB (14 %)
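    With pyarrow the codec is chosen per write call via the compression
    parameter (file names hypothetical, table from the earlier sketch):

      import pyarrow.parquet as pq

      pq.write_table(table, "trips_snappy.parquet", compression="SNAPPY")
      pq.write_table(table, "trips_gzip.parquet", compression="GZIP")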


  13. Query pushdown
    1. Only load the data that is used
       • skip columns that are not needed
       • skip (chunks of) rows that are not relevant
    2. saves I/O load as the data is not transferred
    3. saves CPU as the data is not decoded
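    A sketch of both forms of pushdown with pyarrow: column selection via the
    columns argument and coarse row skipping by reading individual RowGroups
    (file and column names hypothetical; evaluating predicates against the
    statistics is left to the caller here):

      import pyarrow.parquet as pq

      # Column pushdown: only the listed columns are read and decoded.
      table = pq.read_table("trips.parquet",
                            columns=["passenger_count", "fare_amount"])

      # Row pushdown (coarse-grained): read RowGroups one at a time and skip
      # those whose statistics show they cannot contain relevant rows.
      pf = pq.ParquetFile("trips.parquet")
      for i in range(pf.num_row_groups):
          chunk = pf.read_row_group(i, columns=["fare_amount"])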


  14. Benchmarks (size)


  15. Benchmarks (time)


  16. Benchmarks (size vs time)


  17. Read & Write Parquet
    17
    https://arrow.apache.org/docs/python/parquet.html
    Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
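    A minimal sketch of the pandas round trip via pyarrow, following the
    documentation linked above (file name hypothetical):

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

      # Write: pandas -> Arrow Table -> Parquet file
      pq.write_table(pa.Table.from_pandas(df), "example.parquet")

      # Read: Parquet file -> Arrow Table -> pandas
      df_roundtrip = pq.read_table("example.parquet").to_pandas()

    fastparquet offers an equivalent round trip with fastparquet.write and
    ParquetFile(...).to_pandas().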


  18. 18
    Apache Arrow?
    • Specification for in-memory columnar data layout
    • No overhead for cross-system communication
    • Designed for efficiency (exploit SIMD, cache locality, …)
    • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
    • This brought Parquet to Pandas without any Python code in parquet-cpp
    Just released 0.3
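    A sketch of the conversion between pandas and the Arrow columnar layout
    that the Parquet support builds on:

      import pandas as pd
      import pyarrow as pa

      df = pd.DataFrame({"ints": [1, 2, 3], "strs": ["a", "b", "c"]})

      table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar Table
      print(table.schema)                # Arrow types inferred from the frame
      df_back = table.to_pandas()        # Arrow -> pandas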


  19. Get Involved!
    19
    Apache Arrow: Cross-language DataFrame library
    • Website: https://arrow.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
    • Slack: https://apachearrowslackin.herokuapp.com/
    • GitHub mirror: https://github.com/apache/arrow
    Apache Parquet: Famous columnar file format
    • Website: https://parquet.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
    • Slack: https://parquet-slack-invite.herokuapp.com/
    • C++ GitHub mirror: https://github.com/apache/parquet-cpp


  20. 20
    Blue Yonder: Best decisions, delivered daily
    Blue Yonder GmbH
    Ohiostraße 8
    76149 Karlsruhe
    Germany
    +49 721 383117 0
    Blue Yonder Software Limited
    19 Eastbourne Terrace
    London, W2 6LG
    United Kingdom
    +44 20 3626 0360
    Blue Yonder Analytics, Inc.
    5048 Tennyson Parkway
    Suite 250
    Plano, Texas 75024
    USA
