Extending Pandas using Apache Arrow and Numba

1 PyData Berlin 2018 Uwe L. Korn Extending Pandas using Apache Arrow and Numba

3 PyData Berlin 2018 Uwe L. Korn Strings, Strings, please give me Strings!

4 About me • Senior Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Data Engineer and Architect with heavy focus around Pandas • xhochy • mail@uwekorn.com

5 Agenda 1. Shortcomings of Pandas 2. ExtensionArrays 3. Arrow for storage 4. Numba for compute 5. All the stuff

6 Pandas Series • Payload stored in a numpy.ndarray • Index for data alignment • Rich analytical API • Accessors like .dt or .str

7 Shortcomings • Limited to NumPy data types, otherwise object • NumPy’s focus is numerical data and tensors • Pandas performs well when NumPy performs well • Most prominent gaps: • no native variable-length strings • integers are non-nullable
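The two gaps named on this slide can be reproduced in a few lines. This is a minimal sketch, not taken from the talk; the variable names are illustrative, and it assumes NumPy's default string handling (the fixed-width "U" dtype).

```python
import numpy as np
import pandas as pd

# Strings: NumPy's default string dtype is fixed-width, so every element
# is padded to the length of the longest string (4 bytes per character).
fixed = np.array(["a", "bbbb"])

# As soon as a missing value is mixed in, NumPy falls back to an array of
# pointers to Python objects.
boxed = np.array(["a", None])

# Integers: there is no bit pattern for "missing", so introducing NaN
# forces the whole array to float64.
ints = np.array([1, 2, 3])
with_missing = np.array([1, 2, np.nan])

# Pandas inherits this behaviour for plain integer input with a missing value.
series = pd.Series([1, 2, None])

print(fixed.dtype, boxed.dtype, ints.dtype, with_missing.dtype, series.dtype)
```

The object fallback is what the following slides refer to: each element becomes a separately allocated Python object instead of a value in one contiguous buffer.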

8 What’s the problem?

9 What’s the problem?

10 Why are objects bad? Source: Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016 https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html

11 Extending Pandas (0.23+) • Two new interfaces: • ExtensionDtype • What type of scalars? • ExtensionArray • Implement basic array ops • Pandas provides algorithms on top

12 10x !!1

13 Extending Pandas (0.23+) • _from_sequence • _from_factorized • __getitem__ • __len__ • dtype • nbytes • isna • copy • _concat_same_type https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
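To make the method list concrete, here is a minimal sketch of an ExtensionDtype/ExtensionArray pair. `TextDtype` and `TextArray` are hypothetical names invented for this illustration; they store values in a plain Python list rather than in Arrow memory, so this is not how fletcher is implemented. The sketch assumes a recent pandas release and also defines `take`, which the pandas documentation lists among the required methods.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype, take


class TextDtype(ExtensionDtype):
    """Hypothetical dtype: scalars are str, missing values are None."""

    name = "text"
    type = str
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return TextArray


class TextArray(ExtensionArray):
    """Hypothetical array backed by a Python list (illustration only)."""

    def __init__(self, values):
        self._data = list(values)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]  # scalar access
        # slices, integer arrays and boolean masks return a new array
        return type(self)(np.asarray(self._data, dtype=object)[item])

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return TextDtype()

    @property
    def nbytes(self):
        # payload size only: UTF-8 bytes of the non-missing strings
        return sum(len(s.encode("utf-8")) for s in self._data if s is not None)

    def isna(self):
        return np.array([s is None for s in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        data = np.asarray(self._data, dtype=object)
        result = take(data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(result)

    def copy(self):
        return type(self)(self._data)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls([s for arr in to_concat for s in arr._data])


arr = TextArray(["a", None, "ccc"])
series = pd.Series(arr)  # pandas wraps the array and builds on its methods
```

With only these building blocks in place, pandas operations such as `series.isna()` dispatch to the array's own `isna` implementation.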

14 Apache Arrow • Specification for in-memory columnar data layout • No overhead for cross-system communication • Designed for efficiency (exploit SIMD, cache locality, …) • Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

15 Nice properties • More native datatypes: string, date, nullable int, list of X, … • Everything is nullable • Memory can be chunked • Zero-copy to other ecosystems like Java / R • Highly efficient I/O

16 Not so nice properties • Still a young project • Not many analytics on top (yet!) • Core is in modern C++ • Extremely fast but hard to extend in Python

17 Writing Algorithms in Python is easy! but slow

18 Photo by Matthew Brodeur on Unsplash

19 Fast for-loops with Numba

20 Anatomy of an Arrow StringArray • 3 memory buffers: • bitmap to indicate valid (non-null) entries • uint32 array of offsets: "where does the string start?" • uint8 array of characters (UTF-8 encoded) • int64 offset • allows zero-copy slicing
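The layout described on this slide can be sketched in pure Python. This is an illustration of the three buffers and the array-level offset, not code from the talk or from Arrow itself; `build_string_buffers` and `get_string` are hypothetical helper names. The offsets are packed as unsigned 32-bit integers to follow the slide; the Arrow format itself specifies signed 32-bit offsets, which are bit-identical for valid non-negative values.

```python
import struct


def build_string_buffers(values):
    """Encode a list of str/None into (validity, offsets, data) buffers."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per entry
    offsets = [0]                                 # n + 1 entries
    data = bytearray()                            # concatenated UTF-8 bytes
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # least-significant bit first
            data += value.encode("utf-8")
        offsets.append(len(data))                 # null entries have length 0
    packed_offsets = struct.pack(f"<{len(offsets)}I", *offsets)
    return bytes(validity), packed_offsets, bytes(data)


def get_string(buffers, i, offset=0):
    """Read entry i of an array that starts `offset` entries into the buffers."""
    validity, offsets, data = buffers
    j = i + offset
    if not (validity[j // 8] >> (j % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2I", offsets, 4 * j)
    return data[start:end].decode("utf-8")


buffers = build_string_buffers(["foo", None, "bär"])

# A "slice" starting at entry 1 reuses the same buffers and only changes
# the array-level offset, so no bytes are copied.
sliced = [get_string(buffers, i, offset=1) for i in range(2)]
```

Because each string is located through two adjacent offsets, strings of any length sit back to back in one contiguous byte buffer, which is what makes tight loops over the data practical.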

21 Numba @jitclass

22 Numba @jitclass

23 Photo by Niklas Tidbury on Unsplash

24 Fletcher https://github.com/xhochy/fletcher • Implements Extension{Array,Dtype} with Apache Arrow as storage • Uses Numba to implement the necessary analytics on top

25 Demo

26 Fletcher Demo

27 Fletcher Demo

28 Fletcher Demo

29 Fletcher Demo

30 ExtensionArray Implementations • https://github.com/ContinuumIO/cyberpandas IPArray (PR) • https://github.com/geopandas/geopandas GeometryArray (WIP) • https://github.com/xhochy/fletcher Apache Arrow + Numba backed Arrays

31 Photo by Israel Sundseth on Unsplash pip install fletcher

32 By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons • 24. - 26. October + 2 days of sprints (27/28.10.) • ZKM Karlsruhe, DE • Call for Participation opens next week.

33 I’m Uwe Korn Twitter: @xhochy https://github.com/xhochy Thank you!
