PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet

Uwe L. Korn

October 25, 2017

Transcript

  1. Connecting PyData to other Big Data
     Landscapes using Arrow and Parquet
     Uwe L. Korn, PyCon.DE 2017

  2. About me
     • Data Scientist & Architect at Blue Yonder
       (@BlueYonderTech)
     • Apache {Arrow, Parquet} PMC
     • Work in Python, Cython, C++11 and SQL
     • Heavy Pandas user
     xhochy
     [email protected]

  3. Python is a good companion for a Data Scientist
    …but there are other ecosystems out there.

  4. Why do I care?
     • Large set of files on a distributed filesystem
     • Non-uniform schema
     • Execute a query
     • Only a subset is interesting
     …and none of this happens in Python

  5. All are amazing, but…
     How do I get my data out of Python and back in again?
     Use Parquet!
     …but there was no fast Parquet access from Python two years ago.

  6. A general problem
     • Great interoperability inside an ecosystem
     • Often based on a common backend (e.g. NumPy)
     • Poor integration with other systems
     • CSV is your only resort
     • "We need to talk!"
     • Memory copy runs at about 10 GiB/s
     • (De-)serialisation comes on top

  7. Columnar Data
     Image source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)

  8. Apache Parquet

  9. About Parquet
     1. Columnar on-disk storage format
     2. Started in fall 2012 by Cloudera & Twitter
     3. July 2013: 1.0 release
     4. 2015: graduated to a top-level Apache project
     5. Fall 2016: Python & C++ support
     6. State-of-the-art format in the Hadoop ecosystem
        • often used as the default I/O option

  10. Why use Parquet?
      1. Columnar format
         —> vectorized operations
      2. Efficient encodings and compressions
         —> small size without the need for a fat CPU
      3. Predicate push-down
         —> bring computation to the I/O layer
      4. Language-independent format
         —> libs in Java / Scala / C++ / Python / …

  11. Compression
      1. Shrinks data size independent of its content
      2. More CPU-intensive than encoding
      3. Encoding + compression performs better than
         compression alone, at less CPU cost
      4. LZO, Snappy, GZIP, Brotli
         —> If in doubt: use Snappy
      5. GZIP: 174 MiB (11 %)
         Snappy: 216 MiB (14 %)
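
      The codec is chosen when writing a file. A minimal pyarrow sketch
      (the file names and toy DataFrame are made up for illustration):

        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        df = pd.DataFrame({"product": ["apple", "banana"], "price": [1.20, 0.55]})
        table = pa.Table.from_pandas(df)

        # Encodings (e.g. dictionary encoding) are applied automatically;
        # the compression codec is set per file via the `compression` argument.
        pq.write_table(table, "sales.snappy.parquet", compression="snappy")
        pq.write_table(table, "sales.gz.parquet", compression="gzip")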

  12. Predicate pushdown
      1. Only load the data that is used
         • skip columns that are not needed
         • skip (chunks of) rows that are not relevant
      2. Saves I/O load as the data is not transferred
      3. Saves CPU as the data is not decoded
      Which products are sold in $?
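
      In pyarrow this looks roughly as follows. Column pruning via
      `columns=` is long-standing API; the `filters=` argument for
      row-group skipping is an assumption that depends on your pyarrow
      version, and the file and its `currency` column are placeholders:

        import pyarrow.parquet as pq

        # Column pruning: only the listed columns are read from disk.
        table = pq.read_table("sales.snappy.parquet", columns=["product", "currency"])

        # Row-group skipping via statistics: newer pyarrow versions
        # accept `filters` (availability depends on the version you run).
        table = pq.read_table(
            "sales.snappy.parquet",
            columns=["product"],
            filters=[("currency", "=", "$")],
        )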

  13. File Structure
      • File
        • RowGroup
          • Column Chunk
            • Page
      • Statistics (stored per Column Chunk)
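
      The hierarchy can be inspected from Python; a small sketch using
      pyarrow's metadata API (the file name is a placeholder):

        import pyarrow.parquet as pq

        pf = pq.ParquetFile("sales.snappy.parquet")
        meta = pf.metadata
        print(meta.num_row_groups, meta.num_rows)

        # Drill down: RowGroup -> Column Chunk -> Statistics
        rg = meta.row_group(0)
        col = rg.column(0)
        print(col.path_in_schema, col.statistics)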

  14. Read & Write Parquet
      https://arrow.apache.org/docs/python/parquet.html
      Alternative implementation: https://fastparquet.readthedocs.io/en/latest/
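
      Following the linked pyarrow documentation, a basic round trip
      looks like this (toy data for illustration):

        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        df = pd.DataFrame({"a": [1, 2, 3]})

        # Pandas -> Arrow -> Parquet
        table = pa.Table.from_pandas(df)
        pq.write_table(table, "example.parquet")

        # Parquet -> Arrow -> Pandas
        df_roundtrip = pq.read_table("example.parquet").to_pandas()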

  15. Read & Write Parquet
      Pandas 0.21 will bring
      pd.read_parquet(…)
      df.to_parquet(…)
      http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
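
      With pandas 0.21 the explicit detour through pyarrow becomes
      optional; a Parquet engine still has to be installed:

        import pandas as pd

        df = pd.DataFrame({"a": [1, 2, 3]})

        # Requires pandas >= 0.21 plus a Parquet engine
        # (pyarrow or fastparquet).
        df.to_parquet("example.parquet", engine="pyarrow")
        df2 = pd.read_parquet("example.parquet", engine="pyarrow")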

  16. Save in one, load in another ecosystem
      …but always persist the intermediate.

  17. Zero-Copy DataFrames

  18. 2.57 s
      Converting 1 million longs (8 MiB)
      from Spark to PySpark

  19. Apache Arrow
      • Specification for an in-memory columnar data layout
      • No overhead for cross-system communication
      • Designed for efficiency (exploits SIMD, cache locality, …)
      • Exchange data without conversion between Python, C++, C (GLib),
        Ruby, Lua, R, JavaScript and the JVM
      • This brought Parquet to Pandas without any Python code in
        parquet-cpp
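
      From Python the entry point is pyarrow; a small sketch of moving a
      DataFrame into Arrow's standardized layout and back:

        import pandas as pd
        import pyarrow as pa

        df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

        # Pandas -> Arrow: columns become Arrow arrays in the common layout.
        table = pa.Table.from_pandas(df)
        print(table.schema)

        # Arrow -> Pandas: for numeric columns without nulls this
        # can avoid copies.
        df_back = table.to_pandas()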

  20. Dissecting Arrow C++
      • General zero-copy memory management
        • jemalloc as the base allocator
      • Columnar memory format & metadata
        • Schema & DataType
        • Columns & Table
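
      These C++ building blocks surface almost one-to-one in pyarrow; a
      sketch of constructing typed arrays, a schema and a table:

        import pyarrow as pa

        # Arrays, DataTypes and Schemas are thin wrappers around Arrow C++.
        arr = pa.array([1, 2, None, 4], type=pa.int64())
        schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

        batch = pa.RecordBatch.from_arrays(
            [pa.array([1, 2]), pa.array(["a", "b"])],
            names=["id", "name"],
        )
        table = pa.Table.from_batches([batch])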

  21. Dissecting Arrow C++
      • Structured data IPC (inter-process communication)
        • used in Spark for JVM <-> Python
        • future extensions include: gRPC backend, shared-memory
          communication, …
      • Columnar in-memory analytics
        • be the backbone of Pandas 2.0
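
      A minimal sketch of the IPC stream format in pyarrow, using an
      in-memory buffer in place of a real socket or shared-memory segment:

        import pyarrow as pa

        table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

        # Serialize to the Arrow IPC stream format into a buffer.
        sink = pa.BufferOutputStream()
        with pa.RecordBatchStreamWriter(sink, table.schema) as writer:
            writer.write_table(table)
        buf = sink.getvalue()

        # The receiving side reads the same bytes back without re-encoding.
        reader = pa.ipc.open_stream(buf)
        roundtrip = reader.read_all()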

  22. 0.05 s
      Converting 1 million longs
      from Spark to PySpark with Arrow
      https://github.com/apache/spark/pull/15821#issuecomment-282175163
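
      Once the linked work is in your Spark version (an assumption; it
      landed around Spark 2.3), the switch is a single config flag:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
        # Assumption: Spark >= 2.3; the flag was later renamed to
        # spark.sql.execution.arrow.pyspark.enabled.
        spark.conf.set("spark.sql.execution.arrow.enabled", "true")

        df = spark.range(1000 * 1000)  # 1 million longs
        pdf = df.toPandas()            # transferred via Arrow's columnar format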

  23. Apache Arrow – Real life improvement
      Real life example: retrieve a dataset from an MPP database
      and analyze it in Pandas.
      1. Run a query in the DB
      2. Pass the result in columnar form to the DB driver
      3. The ODBC layer transforms it into row-wise form
      4. Pandas makes it columnar again
      Ugly real-life solution: export as CSV, bypass ODBC

  24. Apache Arrow – Real life improvement
      Better solution: Turbodbc with Arrow support
      1. Retrieve columnar results
      2. Pass them in a columnar fashion to Pandas
      More systems in the future (without the ODBC overhead)
      See also Michael's talk tomorrow: Turbodbc: Turbocharged database
      access for data scientists
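
      A sketch of the Arrow path in turbodbc; "my_dsn" and the query are
      placeholders, and a turbodbc build with Arrow support is assumed:

        from turbodbc import connect

        connection = connect(dsn="my_dsn")
        cursor = connection.cursor()
        cursor.execute("SELECT product, price FROM sales")

        # The result arrives as a pyarrow.Table, with no row-wise detour.
        table = cursor.fetchallarrow()
        df = table.to_pandas()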

  25. Ray

  26. GPU Open Analytics Initiative
      https://blogs.nvidia.com/blog/2017/09/22/gpu-data-frame/

  27. Get Involved!
      Apache Arrow – Cross-language DataFrame library
      • Website: https://arrow.apache.org/
      • ML: [email protected]
      • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
      • Slack: https://apachearrowslackin.herokuapp.com/
      • GitHub: https://github.com/apache/arrow
      Apache Parquet – Famous columnar file format
      • Website: https://parquet.apache.org/
      • ML: [email protected]
      • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
      • Slack: https://parquet-slack-invite.herokuapp.com/
      • GitHub: https://github.com/apache/parquet-cpp

  28. Blue Yonder – Best decisions, delivered daily
      Blue Yonder GmbH
      Ohiostraße 8
      76149 Karlsruhe
      Germany
      +49 721 383117 0

      Blue Yonder Software Limited
      19 Eastbourne Terrace
      London, W2 6LG
      United Kingdom
      +44 20 3626 0360

      Blue Yonder Analytics, Inc.
      5048 Tennyson Parkway
      Suite 250
      Plano, Texas 75024
      USA
