Extending Pandas using Apache Arrow and Numba

Slide 1

Slide 1 text

Extending Pandas using Apache Arrow and Numba
Uwe L. Korn, PyData Berlin 2018

Slide 3

Slide 3 text

Strings, Strings, please give me Strings!

Slide 4

Slide 4 text

About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
• xhochy

Slide 5

Slide 5 text

Agenda
1. Shortcomings of Pandas
2. ExtensionArrays
3. Arrow for storage
4. Numba for compute
5. All the stuff

Slide 6

Slide 6 text

Pandas Series
• Payload stored in a numpy.ndarray
• Index for data alignment
• Rich analytical API
• Accessors like .dt or .str
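
A short sketch of these points, assuming pandas and NumPy are installed. The series names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Payload: the values of a numeric Series are backed by a numpy.ndarray.
prices = pd.Series([1.5, 2.5], index=["a", "b"])
payload = prices.to_numpy()

# Index: arithmetic aligns on labels, not on position.
other = pd.Series([10.0, 20.0], index=["b", "c"])
total = prices + other  # only the label "b" exists on both sides

# Accessors: .str exposes vectorised string methods.
names = pd.Series(["arrow", "numba"])
upper = names.str.upper()
```

Labels that appear on only one side ("a" and "c") come out as NaN in `total`.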

Slide 7

Slide 7 text

Shortcomings
• Limited to NumPy data types, otherwise object
• NumPy’s focus is numerical data and tensors
• Pandas performs well when NumPy performs well
• Most popular:
  • no native variable-length strings
  • integers are non-nullable
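
The two most popular complaints can be reproduced in a few lines. This is a minimal sketch assuming the default dtype inference of NumPy and pandas; the sample values are arbitrary.

```python
import numpy as np
import pandas as pd

# NumPy's native string dtype is fixed-width: every slot is as wide
# as the longest value in the array.
fixed = np.array(["a", "pandas"])

# Variable-length strings need dtype=object, i.e. an array of pointers
# to ordinary Python str objects.
boxed = np.array(["a", "pandas"], dtype=object)

# Integers stay integers only while nothing is missing.
ints = pd.Series([1, 2, 3])

# By default a missing value is stored as NaN, which forces float64.
with_missing = pd.Series([1, 2, None])
```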

Slide 8

Slide 8 text

What’s the problem?

Slide 9

Slide 9 text

What’s the problem?

Slide 10

Slide 10 text

Why are objects bad?
Source: Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
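
One way to make the cost of objects concrete is to compare memory use. The sketch below is a rough estimate under the assumption of a 64-bit CPython build; it only counts bytes and does not measure the additional cost of pointer chasing during computation.

```python
import sys

import numpy as np

n = 1000
as_int64 = np.arange(n, dtype=np.int64)
as_object = as_int64.astype(object)

# int64: 8 bytes per value, stored contiguously in one buffer.
packed_bytes = as_int64.nbytes

# object: one pointer per slot in the array, plus a full Python int
# object (type pointer, reference count, digits) per value.
boxed_bytes = as_object.nbytes + sum(sys.getsizeof(v) for v in as_object)
```

On a typical build `boxed_bytes` is several times `packed_bytes`.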

Slide 11

Slide 11 text

Extending Pandas (0.23+)
• Two new interfaces:
  • ExtensionDtype
    • What type of scalars?
  • ExtensionArray
    • Implement basic array ops
• Pandas provides algorithms on top

Slide 12

Slide 12 text

10x !!1

Slide 13

Slide 13 text

Extending Pandas (0.23+)
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• dtype
• nbytes
• isna
• copy
• _concat_same_type
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
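
To show how small this interface is, here is a toy implementation of exactly the methods listed above, backed by a plain Python list. `ToyStringDtype` and `ToyStringArray` are hypothetical names invented for this sketch and are not part of pandas or fletcher. Newer pandas versions expect further methods (for example `take`) before such an array is fully usable inside a Series, so treat this as an outline of the interface rather than a complete extension type.

```python
import numpy as np
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class ToyStringDtype(ExtensionDtype):
    """Hypothetical dtype: describes what kind of scalars the array holds."""

    name = "toy_string"
    type = str
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return ToyStringArray


class ToyStringArray(ExtensionArray):
    """Hypothetical array of str/None values stored in a Python list."""

    def __init__(self, values):
        self._data = list(values)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]  # scalar access
        if isinstance(item, slice):
            return type(self)(self._data[item])
        item = np.asarray(item)
        if item.dtype == bool:  # boolean mask
            return type(self)([v for v, keep in zip(self._data, item) if keep])
        return type(self)([self._data[i] for i in item])  # integer positions

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return ToyStringDtype()

    @property
    def nbytes(self):
        # Size of the UTF-8 payload only; list overhead is ignored.
        return sum(len(v.encode("utf-8")) for v in self._data if v is not None)

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def copy(self):
        return type(self)(self._data)

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls([v for arr in to_concat for v in arr._data])


arr = ToyStringArray(["a", None, "ccc"])
joined = ToyStringArray._concat_same_type([arr, arr])
```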

Slide 14

Slide 14 text

Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

Slide 15

Slide 15 text

Nice properties
• More native datatypes: string, date, nullable int, list of X, …
• Everything is nullable
• Memory can be chunked
• Zero-copy to other ecosystems like Java / R
• Highly efficient I/O

Slide 16

Slide 16 text

Not so nice properties
• Still a young project
• Not much analytics on top (yet!)
• Core is in modern C++
• Extremely fast but hard to extend in Python

Slide 17

Slide 17 text

Writing algorithms in Python is easy, but slow

Slide 18

Slide 18 text

Photo by Matthew Brodeur on Unsplash

Slide 19

Slide 19 text

Fast for-loops with Numba
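
As an illustration of the idea, the sketch below compiles a plain for-loop with Numba's `njit` decorator. The function `string_lengths` and its inputs are made up for this example. The fallback decorator is an assumption of this sketch so that it also runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, run the loop as plain Python.
    def njit(func):
        return func


@njit
def string_lengths(offsets, out):
    """Write the byte length of each entry, given n + 1 start offsets."""
    for i in range(len(out)):
        out[i] = offsets[i + 1] - offsets[i]


offsets = np.array([0, 5, 5, 10], dtype=np.int32)
lengths = np.empty(len(offsets) - 1, dtype=np.int32)
string_lengths(offsets, lengths)
```

The loop body is ordinary Python, yet with Numba it is compiled to machine code on the first call.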

Slide 20

Slide 20 text

Anatomy of an Arrow StringArray
• 3 memory buffers
  • bitmap to indicate valid (non-null) entries
  • int32 array of offsets: "where does the string start"
  • uint8 array of characters (UTF-8 encoded)
• int64 offset
  • allows zero-copy slicing
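
The layout can be mimicked in pure Python to see how the pieces fit together. The helpers `build_string_buffers` and `get_string` are hypothetical names for this sketch, and the offsets are kept in a plain Python list rather than a packed 32-bit buffer. Bits in the validity bitmap are numbered from the least significant bit, as in the Arrow format.

```python
def build_string_buffers(values):
    """Encode a list of str/None as (validity bitmap, offsets, character data)."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark entry i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))  # a null entry has zero length
    return bytes(bitmap), offsets, bytes(data)


def get_string(buffers, i, offset=0):
    """Read entry i of a view that starts `offset` entries into the buffers."""
    bitmap, offsets, data = buffers
    j = offset + i
    if not (bitmap[j // 8] >> (j % 8)) & 1:
        return None
    return data[offsets[j]:offsets[j + 1]].decode("utf-8")


buffers = build_string_buffers(["Arrow", None, "Numba"])
bitmap, offsets, data = buffers

# A slice is just the same three buffers with a different array offset.
first_of_slice = get_string(buffers, 0, offset=2)
```

Because a slice only changes the array-level offset (and length), no characters are copied.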

Slide 21

Slide 21 text

Numba @jitclass

Slide 22

Slide 22 text

Numba @jitclass

Slide 23

Slide 23 text

Photo by Niklas Tidbury on Unsplash

Slide 24

Slide 24 text

Fletcher
https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top

Slide 25

Slide 25 text

Demo

Slide 26

Slide 26 text

Fletcher Demo

Slide 27

Slide 27 text

Fletcher Demo

Slide 28

Slide 28 text

Fletcher Demo

Slide 29

Slide 29 text

Fletcher Demo

Slide 30

Slide 30 text

ExtensionArray Implementations
• https://github.com/ContinuumIO/cyberpandas : IPArray (PR)
• https://github.com/geopandas/geopandas : GeometryArray (WIP)
• https://github.com/xhochy/fletcher : Apache Arrow + Numba backed Arrays

Slide 31

Slide 31 text

pip install fletcher
Photo by Israel Sundseth on Unsplash

Slide 32

Slide 32 text

Karlsruhe
• 24. - 26. October + 2 days of sprints (27/28.10.)
• ZKM Karlsruhe, DE
• Call for Participation opens next week.
Photo by JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Slide 33

Slide 33 text

Thank you!
I’m Uwe Korn
Twitter: @xhochy
https://github.com/xhochy