Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

Slide 1

Slide 1 text

1 Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy PyCon.DE Karlsruhe 2018 Uwe L. Korn

Slide 2

Slide 2 text

2 • Senior Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Data Engineer and Architect with heavy focus around Pandas About me xhochy [email protected]

Slide 3

Slide 3 text

3 What’s Apache Arrow? • Published in February 2016 • Specification for in-memory columnar data layout • No overhead for cross-system communication • Designed for eﬃciency (exploit SIMD, cache locality, ..) • Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

Slide 4

Slide 4 text

4 February 2016: Birth of Apache Arrow Just a goal…

Slide 5

Slide 5 text

5 Data Science Workflow in 2018 Python machine learning model pre-processing with pandas probability density function (PDF) SQL Engine

Slide 6

Slide 6 text

6 Looks simple? • It isn’t. • „Data“ is very heterogeneous landscape • Most common setup: • Java/Scala, i.e. JVM, for data processing • Python for machine learning

Slide 7

Slide 7 text

7 Data Science Workflow in 2018 Python machine learning model pre-processing with pandas SQL Engine JDBC Driver JayDeBeApi P Y T H O N R O W S J D B C R O W S

Slide 8

Slide 8 text

8 org.apache.arrow.adapter.jdbc • Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot • Do conversion of rows to columns in the JVM • Data is stored „oﬀ-heap“, i.e: • not managed by the JVM • native memorly layout, same as in pyarrow

Slide 9

Slide 9 text

9 Workflow in 2018 with Arrow Python machine learning model pre-processing with pandas SQL Engine JDBC Driver org.apache. arrow.adapter. jdbc A R R O W J D B C R O W S ?

Slide 10

Slide 10 text

10 So we’re done? No. • We still only have Arrow data in the JVM • Arrow and Pandas have a slightly diﬀerent memory layout • We have this today in PySpark • It’s fast • Still involves a copy over the network • Arrow → pandas conversion is tuned but still a copy

Slide 11

Slide 11 text

11 pyarrow.jvm • Access Arrow data created in the JVM from Python • Involves no copy of the data • Translation of the helper objects • Actually passes memory addresses around No copy between the JVM and Python!

Slide 12

Slide 12 text

NumPy & the BlockManager Photo by Susan Holt Simpson on Unsplash

Slide 13

Slide 13 text

13 Pandas Shortcomings • Limited to NumPy data types, otherwise object • Columns are not separate, grouped by type • Nullability is not type-safe (yet) —> Arrow memory does not match Pandas memory —> Copy

Slide 14

Slide 14 text

14 Pandas ExtensionArrays • Introduced new interfaces in 0.23 • ExtensionDtype • What type of scalars? • ExtensionArray • Implement basic array ops • Pandas provides algorithms on top • Still, experimental, wait for 0.24

Slide 15

Slide 15 text

15 Photo by Niklas Tidbury on Unsplash

Slide 16

Slide 16 text

16 fletcher • https://github.com/xhochy/fletcher • Implements Extension{Array,Dtype} with Apache Arrow as storage • Uses Numba to implement the necessary analytic on top • Needs {pandas, Arrow, …} master No copy between Apache Arrow and pandas!

Slide 17

Slide 17 text

17 Workflow in 2018 with Arrow Python machine learning model pre-processing with pandas SQL Engine JDBC Driver org.apache. arrow.adapter. jdbc A R R O W J D B C R O W S pyarrow.jvm  / fletcher

Slide 18

Slide 18 text

18 ??? Does it work?

Slide 19

Slide 19 text

19 Does it work?

Slide 20

Slide 20 text

20 Does it work?

Slide 21

Slide 21 text

Make your best decision today. blueyonder.ai/en/careers Blue Yonder Analytics, Inc. 5048 Tennyson Parkway Suite 250 Plano, Texas 75024 USA 21

Slide 22

Slide 22 text

Cross language DataFrame library • Website: https://arrow.apache.org/ • ML: [email protected] • Issues & Tasks: https://issues.apache.org/jira/ browse/ARROW • Slack: https:// apachearrowslackin.herokuapp.com/ • Github mirror: https://github.com/apache/ arrow Apache Arrow Apache Parquet Famous columnar file format • Website: https://parquet.apache.org/ • ML: [email protected] • Issues & Tasks: https://issues.apache.org/jira/ browse/PARQUET • Slack: https://parquet-slack- invite.herokuapp.com/ • C++ Github mirror: https://github.com/ apache/parquet-cpp 22 Get Involved!