Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

1 Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
PyCon.DE Karlsruhe 2018, Uwe L. Korn

2 About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
• xhochy
• [email protected]

3 What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

4 February 2016: Birth of Apache Arrow Just a goal…

5 Data Science Workflow in 2018
[Diagram labels: SQL Engine; pre-processing with pandas; Python machine learning model; probability density function (PDF)]

6 Looks simple?
• It isn’t.
• "Data" is a very heterogeneous landscape
• Most common setup:
  • Java/Scala, i.e. the JVM, for data processing
  • Python for machine learning

7 Data Science Workflow in 2018
[Diagram labels: SQL Engine; JDBC Driver; JayDeBeApi; pre-processing with pandas; Python machine learning model. Data crosses the boundaries as JDBC rows and Python rows.]

8 org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do the conversion of rows to columns in the JVM
• Data is stored "off-heap", i.e.:
  • not managed by the JVM
  • native memory layout, same as in pyarrow
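The row-to-column conversion mentioned on this slide can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not the adapter's code: the helper `rows_to_columns` and the sample rows are hypothetical, and the real adapter writes into off-heap Arrow vectors rather than Python lists.

```python
from typing import Any, Dict, List, Sequence, Tuple


def rows_to_columns(
    names: Sequence[str], rows: Sequence[Tuple[Any, ...]]
) -> Dict[str, List[Any]]:
    """Pivot row-oriented records (as a database cursor yields them)
    into one list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError("row width does not match the number of column names")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# Hypothetical result set: each tuple is one row.
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, None, 2.5)]
columns = rows_to_columns(["id", "label", "score"], rows)
print(columns["id"])
```

Doing this pivot once, close to the JDBC driver, means every later consumer sees contiguous per-column data instead of re-interpreting each row.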

9 Workflow in 2018 with Arrow
[Diagram labels: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pre-processing with pandas; Python machine learning model. Data crosses the boundaries as JDBC rows, then as Arrow; the step from Arrow into pandas is marked "?".]

10 So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
  • It’s fast
  • Still involves a copy over the network
  • Arrow → pandas conversion is tuned but still a copy

11 pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
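What "passing memory addresses around" means can be shown with the standard library alone. The sketch below is an analogy, not how pyarrow.jvm is implemented: an `array.array` stands in for the off-heap buffer owned by the producer, and `ctypes` stands in for the consumer that receives only an address and a length.

```python
import array
import ctypes

# Producer side: a contiguous buffer of 64-bit signed integers.
# (In the talk's setting this would be off-heap Arrow memory created in the JVM.)
producer = array.array("q", [10, 20, 30, 40])
address, length = producer.buffer_info()  # (memory address, number of items)

# Consumer side: only the address and the length cross the boundary.
# from_address wraps the existing memory; it does not copy it.
view = (ctypes.c_int64 * length).from_address(address)

print(list(view))

# A write through the producer is visible through the view,
# which shows that both names refer to the same memory.
producer[0] = 99
print(view[0])
```

The view is only valid while the producer keeps the buffer alive and does not resize it. Any scheme that shares addresses instead of copying has to handle that ownership question.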

12 NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash

13 Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy
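The nullability mismatch can be sketched in pure Python. Arrow keeps a separate validity bitmap (one bit per value, least-significant bit first) next to the data buffer, while a NumPy-backed integer column has no slot for "missing" and falls back to floating point with NaN. The helpers `pack_validity` and `is_valid` below are hypothetical names used for illustration only.

```python
import math
from typing import List, Optional


def pack_validity(values: List[Optional[int]]) -> bytearray:
    """Build an Arrow-style validity bitmap: bit i is 1 when values[i] is not null."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


def is_valid(bitmap: bytearray, i: int) -> bool:
    """Return True when bit i of the bitmap is set."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


values = [1, None, 3]

# Arrow-style: integers stay integers, nulls live in the bitmap.
bitmap = pack_validity(values)
data = [v if v is not None else 0 for v in values]  # null slots hold an arbitrary value

# NumPy-style sentinel: the column must become floating point to hold NaN,
# so every value is rewritten, i.e. the data is copied and converted.
as_float = [float(v) if is_valid(bitmap, i) else math.nan for i, v in enumerate(data)]

print(data, bin(bitmap[0]), as_float)
```

Because the two representations differ, handing an integer column with nulls from Arrow to NumPy-backed pandas cannot be done by sharing the buffer.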

14 Pandas ExtensionArrays
• Introduced new interfaces in 0.23
• ExtensionDtype
  • What type of scalars?
• ExtensionArray
  • Implement basic array ops
  • Pandas provides algorithms on top
• Still experimental; wait for 0.24

15 Photo by Niklas Tidbury on Unsplash

16 fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!

17 Workflow in 2018 with Arrow
[Diagram labels: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pyarrow.jvm / fletcher; pre-processing with pandas; Python machine learning model. Data crosses the boundaries as JDBC rows, then as Arrow.]

18 ??? Does it work?

19 Does it work?

20 Does it work?

21 Make your best decision today. blueyonder.ai/en/careers
Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA

22 Get Involved!

Apache Arrow: Cross language DataFrame library
• Website: https://arrow.apache.org/
• ML: [email protected]
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• Github mirror: https://github.com/apache/arrow

Apache Parquet: Famous columnar file format
• Website: https://parquet.apache.org/
• ML: [email protected]
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ Github mirror: https://github.com/apache/parquet-cpp
