Extending Pandas using Apache Arrow and Numba

Pandas 0.23 introduced the ability to extend it with custom dtypes. Using Apache Arrow for in-memory storage and Numba for fast, vectorized computations on these memory regions, it is possible to extend Pandas in pure Python while achieving the same performance as the built-in types. In the talk we implement a native string type as an example.

Uwe L. Korn

July 08, 2018
Transcript

  1. 1
    PyData Berlin 2018
    Uwe L. Korn
    Extending Pandas
    using Apache Arrow and Numba

  2. 2
    PyData Berlin 2018
    Uwe L. Korn
    Extending Pandas
    using Apache Arrow and Numba

  3. 3
    PyData Berlin 2018
    Uwe L. Korn
    Strings, Strings, please give me Strings!

  4. 4
    • Senior Data Scientist at Blue Yonder
    (@BlueYonderTech)
    • Apache {Arrow, Parquet} PMC
    • Data Engineer and Architect with a heavy
    focus on Pandas
    About me
    xhochy
    [email protected]

  5. 5
    1. Shortcomings of Pandas
    2. ExtensionArrays
    3. Arrow for storage
    4. Numba for compute
    5. All the stuff
    Agenda

  6. 6
    Pandas Series
    • Payload stored in a numpy.ndarray
    • Index for data alignment
    • Rich analytical API
    • Accessors like .dt or .str

  7. 7
    Shortcomings
    • Limited to NumPy data types, otherwise object
    • NumPy’s focus is numerical data and tensors
    • Pandas performs well when NumPy performs well
    • Most popular:
    • no native variable-length strings
    • integers are non-nullable
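
The two most popular shortcomings can be reproduced with NumPy alone. The snippet below is a minimal sketch, not taken from the slides; the variable names are illustrative. It shows that integers with a missing value are stored as floats, and that strings are stored either as fixed-width unicode or as generic Python objects.

```python
import numpy as np

# Plain integers get a native integer dtype.
ints = np.array([1, 2, 3])

# There is no "missing" marker for integers, so NaN forces float64.
with_missing = np.array([1, 2, np.nan])

# Strings are stored fixed-width: every entry takes the space of the longest one.
fixed_width = np.array(["a", "bcd"])

# Variable-length strings (and missing values) only fit into the object dtype.
as_objects = np.array(["a", "bcd", None], dtype=object)

print(ints.dtype, with_missing.dtype, fixed_width.dtype, as_objects.dtype)
```

The fixed-width array reserves 3 characters of 4 bytes each for every entry, which is why it is a poor fit for columns with strongly varying string lengths.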

  8. 8
    What’s the problem?

  9. 9
    What’s the problem?

  10. 10
    Why are objects bad?
    Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
    https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
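
The referenced chapter explains that every Python object carries its own header in addition to its payload. A rough way to see the cost, sketched here with the standard library only (the numbers depend on the interpreter, so none are hard-coded), is to compare one boxed Python integer with one slot in a packed 64-bit array:

```python
import sys
from array import array

values = list(range(1000, 2000))

# A packed buffer of signed 64-bit integers: 8 bytes per value, stored contiguously.
packed = array("q", values)

# Each list entry is a pointer to a separate, full Python int object.
bytes_per_object = sys.getsizeof(values[0])

print("bytes per Python int object:", bytes_per_object)
print("bytes per packed int64:     ", packed.itemsize)
```

An object column additionally pays for the pointer to each object and loses memory locality, because the objects may be scattered across the heap.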

  11. 11
    Extending Pandas (0.23+)
    • Two new interfaces:
    • ExtensionDtype
    • What type of scalars?
    • ExtensionArray
    • Implement basic array ops
    • Pandas provides algorithms on top

  12. 13
    Extending Pandas (0.23+)
    • _from_sequence
    • _from_factorized
    • __getitem__
    • __len__
    • dtype
    • nbytes
    • isna
    • copy
    • _concat_same_type
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
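
To make the list above concrete, here is a minimal sketch of an extension type. It is not the implementation from the talk: `ToyStringDtype` and `ToyStringArray` are hypothetical names, and the storage is a plain NumPy object array rather than Arrow. `take` is included in addition to the methods on the slide because the pandas documentation also lists it as required.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype, take


class ToyStringDtype(ExtensionDtype):
    """Hypothetical dtype: answers 'what type of scalars?'."""

    name = "toy_string"
    type = str
    na_value = None

    @classmethod
    def construct_array_type(cls):
        return ToyStringArray


class ToyStringArray(ExtensionArray):
    """Hypothetical array: implements only the basic array operations."""

    def __init__(self, values):
        # Assumption: object storage keeps the sketch short; Arrow would go here.
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]          # scalar access
        return type(self)(self._data[item])  # slice / mask / index array

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return ToyStringDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        result = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))


arr = ToyStringArray._from_sequence(["Strings", None, "please"])
series = pd.Series(arr)  # pandas wraps the array and provides algorithms on top
print(series.dtype, len(series))
```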

  13. 14
    Apache Arrow
    • Specification for in-memory columnar data layout
    • No overhead for cross-system communication
    • Designed for efficiency (exploits SIMD, cache locality, …)
    • Exchange data without conversion between Python, C++, C (GLib),
    Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM
    • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

  14. 15
    Nice properties
    • More native datatypes: string, date, nullable int, list of X, …
    • Everything is nullable
    • Memory can be chunked
    • Zero-copy to other ecosystems like Java / R
    • Highly efficient I/O

  15. 16
    Not so nice properties
    • Still a young project
    • Not much analytics on top (yet!)
    • Core is in modern C++
    • Extremely fast but hard to extend in Python

  16. 17
    Writing Algorithms in Python is easy!
    but slow

  17. 18 Photo by Matthew Brodeur on Unsplash

  18. 19
    Fast for-loops with Numba
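
The code shown on this slide is not part of the transcript. As a stand-in, the sketch below shows the general idea: an explicit Python loop over raw buffers that Numba compiles to machine code. The function `string_lengths` and its buffer arguments are hypothetical, and the fallback decorator is an assumption added so that the example still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, run the same loop as plain Python.
    def njit(func):
        return func


@njit
def string_lengths(offsets, bitmap, out):
    """Byte length of each string; -1 marks a null entry."""
    for i in range(len(out)):
        # Bit i of the bitmap says whether entry i is valid (non-null).
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out[i] = offsets[i + 1] - offsets[i]
        else:
            out[i] = -1
    return out


offsets = np.array([0, 7, 7, 13], dtype=np.int32)  # three strings, the middle one null
bitmap = np.array([0b101], dtype=np.uint8)         # entries 0 and 2 are valid
lengths = string_lengths(offsets, bitmap, np.empty(3, dtype=np.int64))
print(lengths)
```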

  19. 20
    Anatomy of an Arrow StringArray
    • 3 memory buffers
    • bitmap to indicate valid (non-null) entries
    • int32 array of offsets: "where does the string start"
    • uint8 array of characters (UTF-8 encoded)
    • int64 offset
    • allows zero-copy slicing
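
The layout can be modelled in a few lines of plain Python. This is a sketch for illustration only, using signed 32-bit offsets as the Arrow columnar format specifies; the helper names `build_string_array` and `get_string` are made up and are not part of any Arrow API.

```python
import struct


def build_string_array(values):
    """Encode a list of str/None into (validity bitmap, offsets, data) buffers."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark entry i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))           # nulls repeat the previous offset
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(bitmap), offsets_buf, bytes(data)


def get_string(buffers, i, offset=0):
    """Read entry i of a view that starts `offset` entries into the buffers."""
    bitmap, offsets_buf, data = buffers
    j = offset + i
    if not (bitmap[j // 8] >> (j % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * j)
    return data[start:end].decode("utf-8")


buffers = build_string_array(["Strings", None, "please", "naïve"])
print(get_string(buffers, 0), get_string(buffers, 1), get_string(buffers, 3))

# A "slice" starting at entry 2 reuses the same buffers with a different offset.
print(get_string(buffers, 0, offset=2))
```

Slicing never touches the three buffers; only the logical offset (and length) of the view changes, which is what makes it zero-copy.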

  20. 21
    Numba @jitclass

  21. 22
    Numba @jitclass

  22. 23 Photo by Niklas Tidbury on Unsplash

  23. 24
    Fletcher
    https://github.com/xhochy/fletcher
    • Implements Extension{Array,Dtype} with Apache Arrow as storage
    • Uses Numba to implement the necessary analytics on top

  24. 26
    Fletcher Demo

  25. 27
    Fletcher Demo

  26. 28
    Fletcher Demo

  27. 29
    Fletcher Demo

  28. 30
    ExtensionArray Implementations
    https://github.com/ContinuumIO/cyberpandas
    IPArray
    (PR) https://github.com/geopandas/geopandas
    GeometryArray
    (WIP) https://github.com/xhochy/fletcher
    Apache Arrow + Numba backed Arrays

  29. 31 Photo by Israel Sundseth on Unsplash
    pip install fletcher

  30. 32
    By JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
    24-26 October
    + 2 days of sprints (27-28 October)
    ZKM Karlsruhe, DE
    Karlsruhe
    Call for Participation opens next week.

  31. 33
    I’m Uwe Korn
    Twitter: @xhochy
    https://github.com/xhochy
    Thank you!
