Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

Apache Arrow's promise was to reduce the (serialization & copy) overhead of working with columnar data between different systems. Using the latest Pandas release and Arrow's ability to share memory between the JVM and Python as ingredients, we demonstrate that Arrow can fulfill this bold statement. The performance benefits of this will be shown using a typical data engineering use-case that produces data in the JVM and then passes it on to a Python-based machine learning model.

Uwe L. Korn

October 25, 2018

Transcript

  1. Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

    PyCon.DE Karlsruhe 2018, Uwe L. Korn
  2. About me

    • Senior Data Scientist at Blue Yonder (@BlueYonderTech)
    • Apache {Arrow, Parquet} PMC
    • Data Engineer and Architect with heavy focus around Pandas
    • xhochy [email protected]
  3. What’s Apache Arrow?

    • Published in February 2016
    • Specification for in-memory columnar data layout
    • No overhead for cross-system communication
    • Designed for efficiency (exploit SIMD, cache locality, ..)
    • Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
    • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
  4. Data Science Workflow in 2018

    Diagram labels: SQL Engine, pre-processing with pandas, Python machine learning model, probability density function (PDF)
  5. Looks simple?

    • It isn’t.
    • “Data” is a very heterogeneous landscape
    • Most common setup:
      • Java/Scala, i.e. JVM, for data processing
      • Python for machine learning
  6. Data Science Workflow in 2018

    Diagram labels: SQL Engine, JDBC Driver, JDBC ROWS, JayDeBeApi, PYTHON ROWS, pre-processing with pandas, Python machine learning model
  7. org.apache.arrow.adapter.jdbc

    • Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
    • Do conversion of rows to columns in the JVM
    • Data is stored “off-heap”, i.e.:
      • not managed by the JVM
      • native memory layout, same as in pyarrow
  8. Workflow in 2018 with Arrow

    Diagram labels: SQL Engine, JDBC Driver, JDBC ROWS, org.apache.arrow.adapter.jdbc, ARROW, ?, pre-processing with pandas, Python machine learning model
  9. So we’re done? No.

    • We still only have Arrow data in the JVM
    • Arrow and Pandas have a slightly different memory layout
    • We have this today in PySpark
      • It’s fast
      • Still involves a copy over the network
      • Arrow → pandas conversion is tuned but still a copy
  10. pyarrow.jvm

    • Access Arrow data created in the JVM from Python
    • Involves no copy of the data
    • Translation of the helper objects
    • Actually passes memory addresses around

    No copy between the JVM and Python!
  11. Pandas Shortcomings

    • Limited to NumPy data types, otherwise object
    • Columns are not separate, grouped by type
    • Nullability is not type-safe (yet)

    → Arrow memory does not match Pandas memory → Copy
  12. Pandas ExtensionArrays

    • Introduced new interfaces in 0.23
      • ExtensionDtype
        • What type of scalars?
      • ExtensionArray
        • Implement basic array ops
        • Pandas provides algorithms on top
    • Still experimental, wait for 0.24
  13. fletcher

    • https://github.com/xhochy/fletcher
    • Implements Extension{Array,Dtype} with Apache Arrow as storage
    • Uses Numba to implement the necessary analytics on top
    • Needs {pandas, Arrow, …} master

    No copy between Apache Arrow and pandas!
  14. Workflow in 2018 with Arrow

    Diagram labels: SQL Engine, JDBC Driver, JDBC ROWS, org.apache.arrow.adapter.jdbc, ARROW, pyarrow.jvm / fletcher, pre-processing with pandas, Python machine learning model
  15. Make your best decision today.

    blueyonder.ai/en/careers
    Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA
  16. Get Involved!

    Apache Arrow: Cross language DataFrame library
    • Website: https://arrow.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
    • Slack: https://apachearrowslackin.herokuapp.com/
    • Github mirror: https://github.com/apache/arrow

    Apache Parquet: Famous columnar file format
    • Website: https://parquet.apache.org/
    • ML: [email protected]
    • Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
    • Slack: https://parquet-slack-invite.herokuapp.com/
    • C++ Github mirror: https://github.com/apache/parquet-cpp
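
The "What’s Apache Arrow?" slide describes Arrow as a specification for an in-memory columnar data layout. As a rough illustration of what such a layout looks like for one nullable 32-bit integer column, the sketch below packs a Python list into a validity bitmap (one bit per value, least-significant bit first, 1 meaning "valid") and a contiguous values buffer. The function names are hypothetical and not part of any Arrow library, and the sketch ignores the buffer padding and alignment that real Arrow buffers use.

```python
import struct


def to_columnar_int32(values):
    """Encode a list of optional ints as (validity_bitmap, values_buffer)."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Bit i lives in byte i // 8, counted from the least-significant bit.
            bitmap[i // 8] |= 1 << (i % 8)
    # A null slot still occupies 4 bytes; its content is arbitrary (0 here).
    data = struct.pack(f"<{n}i", *(0 if v is None else v for v in values))
    return bytes(bitmap), data


def from_columnar_int32(bitmap, data, length):
    """Decode the two buffers back into a list of optional ints."""
    raw = struct.unpack(f"<{length}i", data)
    return [
        raw[i] if (bitmap[i // 8] >> (i % 8)) & 1 else None
        for i in range(length)
    ]


bitmap, data = to_columnar_int32([1, None, 3, 4])
restored = from_columnar_int32(bitmap, data, 4)
```

Because nullability lives in a separate bitmap, the values buffer keeps its integer type even when entries are missing.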
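
The org.apache.arrow.adapter.jdbc slide says the conversion of rows to columns happens in the JVM. The real adapter is Java code that fills off-heap Arrow vectors; the Python sketch below only illustrates the transposition step itself, with a hypothetical helper name and made-up sample rows.

```python
def rows_to_columns(column_names, rows):
    """Transpose row tuples (as a database cursor yields them) into columns."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


# Made-up sample data standing in for a JDBC result set.
rows = [(1, "a", 0.5), (2, "b", None), (3, "c", 1.5)]
columns = rows_to_columns(["id", "label", "score"], rows)
```

After the transposition each column holds values of a single type next to each other, which is the shape a columnar format stores.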
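
The pyarrow.jvm slide states that memory addresses are passed around instead of data. The sketch below shows that idea in isolation with the standard library's `ctypes`, without a JVM or pyarrow: a "producer" hands over only an address and a length, and a "consumer" builds a view on the same memory. The producer and consumer names are illustrative, and this is not how pyarrow.jvm is implemented internally.

```python
import ctypes

# "Producer": owns a buffer of four int32 values.
producer = (ctypes.c_int32 * 4)(10, 20, 30, 40)
address = ctypes.addressof(producer)
length = len(producer)

# "Consumer": receives only (address, length) and views the same memory.
consumer = (ctypes.c_int32 * length).from_address(address)

# A write through one handle is visible through the other: nothing was copied.
producer[0] = 99
first = consumer[0]
```

The view is only valid while the producer's buffer is alive, so any real implementation of this pattern has to keep the owning object referenced for as long as the view is in use.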
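
The "Pandas Shortcomings" slide notes that nullability is not type-safe. A small sketch of that point, assuming pandas is installed and using its default NumPy-backed dtypes: an integer column that gains a missing value is silently stored as floating point.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

# The first Series keeps an integer dtype; the second falls back to a
# floating-point dtype so that the missing value can be stored as NaN.
print(complete.dtype, with_missing.dtype)
```

An Arrow integer column with nulls stays an integer column, so the two memory representations differ and a conversion has to copy.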