Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

Apache Arrow's promise was to reduce the (serialization & copy) overhead of working with columnar data between different systems. Using the latest Pandas release and Arrow's ability to share memory between the JVM and Python as ingredients, we demonstrate that Arrow can fulfill this bold statement. The performance benefits of this will be shown using a typical data engineering use-case that produces data in the JVM and then passes it on to a Python-based machine learning model.

Uwe L. Korn

October 25, 2018

Transcript

  1. 1
    Fulfilling Apache Arrow's Promises:
    Pandas on JVM memory without a copy
    PyCon.DE Karlsruhe 2018
    Uwe L. Korn

  2. 2
    About me
    • Senior Data Scientist at Blue Yonder (@BlueYonderTech)
    • Apache {Arrow, Parquet} PMC
    • Data Engineer and Architect with heavy focus around Pandas
    xhochy
    [email protected]

  3. 3
    What’s Apache Arrow?
    • Published in February 2016
    • Specification for in-memory columnar data layout
    • No overhead for cross-system communication
    • Designed for efficiency (exploit SIMD, cache locality, ..)
    • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
    • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

  4. 4
    February 2016: Birth of Apache Arrow
    Just a goal…

  5. 5
    Data Science Workflow in 2018
    [Diagram: SQL Engine → pre-processing with pandas → Python machine learning model → probability density function (PDF)]

  6. 6
    Looks simple?
    • It isn’t.
    • “Data” is a very heterogeneous landscape
    • Most common setup:
    • Java/Scala, i.e. JVM, for data processing
    • Python for machine learning

  7. 7
    Data Science Workflow in 2018
    [Diagram: SQL Engine → JDBC Driver → (JDBC rows) → JayDeBeApi → (Python rows) → pre-processing with pandas → Python machine learning model]

  8. 8
    org.apache.arrow.adapter.jdbc
    • Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
    • Do conversion of rows to columns in the JVM
    • Data is stored “off-heap”, i.e.:
    • not managed by the JVM
    • native memory layout, same as in pyarrow
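
The row-to-column conversion named on this slide can be sketched in plain Python. This is not the Java adapter's API: the in-memory SQLite table is only a stand-in for a JDBC source, and `rows_to_columns` is a hypothetical helper written for this illustration. It pivots row tuples into one contiguous buffer per column and tracks nulls separately.

```python
import sqlite3
from array import array

# Stand-in for a JDBC source: an in-memory SQLite table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 9.5), (2, None), (3, 12.0)],
)


def rows_to_columns(cursor):
    """Hypothetical helper: pivot row tuples into per-column buffers plus validity flags."""
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    columns = {}
    for idx, name in enumerate(names):
        values = [row[idx] for row in rows]
        valid = [v is not None for v in values]
        # Pick a 64-bit float or 64-bit integer buffer depending on the values seen.
        typecode = "d" if any(isinstance(v, float) for v in values) else "q"
        filler = 0.0 if typecode == "d" else 0
        # Nulls get a placeholder in the data buffer; validity is tracked separately.
        data = array(typecode, [filler if v is None else v for v in values])
        columns[name] = (data, valid)
    return columns


cursor = conn.execute("SELECT item_id, price FROM sales ORDER BY item_id")
columns = rows_to_columns(cursor)
```

Each column ends up as a typed, contiguous buffer, which is the property that makes a columnar layout cheap to hand to another runtime.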

  9. 9
    Workflow in 2018 with Arrow
    [Diagram: SQL Engine → JDBC Driver → (JDBC rows) → org.apache.arrow.adapter.jdbc → (Arrow) → ? → pre-processing with pandas → Python machine learning model]

  10. 10
    So we’re done? No.
    • We still only have Arrow data in the JVM
    • Arrow and Pandas have a slightly different memory layout
    • We have this today in PySpark
    • It’s fast
    • Still involves a copy over the network
    • Arrow → pandas conversion is tuned but still a copy

  11. 11
    pyarrow.jvm
    • Access Arrow data created in the JVM from Python
    • Involves no copy of the data
    • Translation of the helper objects
    • Actually passes memory addresses around
    No copy between the JVM and Python!
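
The idea of passing memory addresses instead of data can be illustrated with the standard library alone. The sketch below does not use `pyarrow.jvm` or a JVM: a `ctypes` array plays the role of the off-heap buffer, and the "consumer" rebuilds a view from nothing but the integer address and the element count.

```python
import ctypes

# "Producer": allocate a native buffer of four 64-bit integers.
n = 4
producer_buf = (ctypes.c_int64 * n)(10, 20, 30, 40)

# Only this integer (plus the length) is handed to the consumer.
address = ctypes.addressof(producer_buf)

# "Consumer": rebuild a view from the bare address; no bytes are copied.
# The producer must keep the buffer alive for as long as the view is used.
consumer_view = (ctypes.c_int64 * n).from_address(address)

# A write through the producer handle is visible through the consumer view,
# which shows that both refer to the same memory.
producer_buf[0] = 99
first_value = consumer_view[0]
same_memory = ctypes.addressof(consumer_view) == address
```

The lifetime caveat in the comment is the real cost of this approach: whoever receives an address must be sure the owner has not freed the memory.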

  12. NumPy & the BlockManager
    Photo by Susan Holt Simpson on Unsplash

  13. 13
    Pandas Shortcomings
    • Limited to NumPy data types, otherwise object
    • Columns are not separate, grouped by type
    • Nullability is not type-safe (yet)
    → Arrow memory does not match Pandas memory
    → Copy
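
One concrete source of the mismatch is null handling. The Arrow format records nulls in a separate validity bitmap (one bit per value, least-significant bit first, 1 meaning valid), whereas NumPy-backed pandas columns have no such bitmap. A minimal stdlib sketch of that bitmap follows; `build_validity_bitmap` and `is_valid` are hypothetical helper names, not pyarrow functions.

```python
def build_validity_bitmap(values):
    """Pack one bit per value (1 = valid, 0 = null), least-significant bit first."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if the i-th value is marked valid in the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [1, None, 3, 4, None, None, 7, 8, 9]
bitmap = build_validity_bitmap(values)
null_positions = [i for i in range(len(values)) if not is_valid(bitmap, i)]
```

Because the null information lives next to the data rather than inside it, an integer column can stay an integer column even when values are missing.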

  14. 14
    Pandas ExtensionArrays
    • Introduced new interfaces in 0.23
    • ExtensionDtype
    • What type of scalars?
    • ExtensionArray
    • Implement basic array ops
    • Pandas provides algorithms on top
    • Still experimental; wait for 0.24

  15. 15 Photo by Niklas Tidbury on Unsplash

  16. 16
    fletcher
    • https://github.com/xhochy/fletcher
    • Implements Extension{Array,Dtype} with Apache Arrow as storage
    • Uses Numba to implement the necessary analytics on top
    • Needs {pandas, Arrow, …} master
    No copy between Apache Arrow and pandas!

  17. 17
    Workflow in 2018 with Arrow
    [Diagram: SQL Engine → JDBC Driver → (JDBC rows) → org.apache.arrow.adapter.jdbc → (Arrow) → pyarrow.jvm / fletcher → pre-processing with pandas → Python machine learning model]

  18. 18
    ???
    Does it work?

  19. 19
    Does it work?

  20. 20
    Does it work?

  21. 21
    Make your best decision today.
    blueyonder.ai/en/careers
    Blue Yonder Analytics, Inc.
    5048 Tennyson Parkway, Suite 250
    Plano, Texas 75024
    USA

  22. 22
    Get Involved!
    Apache Arrow
    Cross language DataFrame library
    • Website: https://arrow.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
    • Slack: https://apachearrowslackin.herokuapp.com/
    • Github mirror: https://github.com/apache/arrow
    Apache Parquet
    Famous columnar file format
    • Website: https://parquet.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
    • Slack: https://parquet-slack-invite.herokuapp.com/
    • C++ Github mirror: https://github.com/apache/parquet-cpp
