
PyData Sofia May 2024 - Intro to Apache Arrow


Exploring the tech that powers the modern data (science) stack

Uwe L. Korn

May 27, 2024


Transcript

  1. Apache Arrow
     Exploring the tech that powers the modern data (science) stack
     Uwe Korn, QuantCo, May 2024
  2. About me
     • Uwe Korn
       https://mastodon.social/@xhochy / @xhochy
       https://www.linkedin.com/in/uwekorn/
     • CTO at Data Science startup QuantCo
     • Previously worked as a Data Engineer
     • A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
     • PyData Südwest Co-Organizer
  3. Agenda
     1. Why do we need this?
     2. What is it?
     3. What’s its impact?
  4. Why do we need this?
     • Different ecosystems
     • PyData / R space
     • Java/Scala “Big Data”
     • SQL databases
     • Different technologies
     • Pandas / SQLite
  5. Why solve it?
     • We build pipelines to move data
     • We want to use all tools we can leverage
     • Avoid working on converters or waiting for the data to be converted
  6. Introducing Apache Arrow
     • Columnar representation of data in main memory
     • Provides libraries to access the data structures
     • Building blocks for various ecosystems to use them
     • Implements adapters for existing structures
  7. All the languages!
     1. “Pure” implementations in C++, Java, Go, JavaScript, C#, Rust, Julia, Swift, C (nanoarrow)
     2. Wrappers on top of them in Python, R, Ruby, C/GLib, Matlab
  8. There is a social component
     1. A standard is only as good as its usage
     2. Different communities came together to form Arrow
     3. Nowadays even more use it to connect
  9. Arrow Basics
     1. Array: a sequence of values of the same type in contiguous buffers
     2. ChunkedArray: a sequence of arrays of the same type
     3. Table: an ordered dictionary of ChunkedArrays of the same length
  10. Arrow Basics: valid masks
      1. Track null_count per Array
      2. Each array has a buffer of bits indicating whether a value is valid, i.e. non-null
  11. Arrow Basics: int array
      Python array: [1, null, 2, 4, 8]
      Length: 5, Null count: 1

      Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63  |
      |--------------------------|-------------|
      | 00011101                 | 0 (padding) |

      Value buffer:

      | Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
      |-----------|-------------|------------|-------------|-------------|-----------------------|
      | 1         | unspecified | 2          | 4           | 8           | unspecified (padding) |
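The layout on the slide above can be reproduced by hand with only the standard library. This is an illustrative sketch, not how Arrow libraries build buffers internally; the helper `pad_to_64` is a hypothetical name, and null slots are filled with 0 here although their content is unspecified.

```python
import struct

values = [1, None, 2, 4, 8]


def pad_to_64(buf: bytes) -> bytes:
    """Pad a buffer with zero bytes to a multiple of 64 bytes."""
    return buf + bytes(-len(buf) % 64)


# Validity bitmap: bit i (least-significant bit first) is 1 when slot i is non-null.
bitmap = 0
for i, v in enumerate(values):
    if v is not None:
        bitmap |= 1 << i
validity = pad_to_64(bitmap.to_bytes((len(values) + 7) // 8, "little"))

# Value buffer: one little-endian int32 per slot, including the null slot.
raw_values = b"".join(struct.pack("<i", 0 if v is None else v) for v in values)
value_buffer = pad_to_64(raw_values)

print(format(validity[0], "08b"))  # 00011101
print(len(raw_values), len(value_buffer))  # 20 64
```

Because every slot has a fixed width of 4 bytes, the value at index `i` always starts at byte `4 * i`, which is what makes random access cheap.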
  12. Arrow Basics: string array
      Python array: ['joe', null, null, 'mark']
      Length: 4, Null count: 2

      Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63  |
      |--------------------------|-------------|
      | 00001001                 | 0 (padding) |

      Offsets buffer:

      | Bytes 0-19    | Bytes 20-63           |
      |---------------|-----------------------|
      | 0, 3, 3, 3, 7 | unspecified (padding) |

      Value buffer:

      | Bytes 0-6 | Bytes 7-63            |
      |-----------|-----------------------|
      | joemark   | unspecified (padding) |
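To see how the three buffers on the slide above fit together, the following standard-library sketch decodes them back into Python strings. The function name `read_strings` is hypothetical and padding is omitted for brevity; offsets are assumed to be little-endian int32 values, matching the 20 bytes shown for five offsets.

```python
import struct

# Buffers as shown on the slide, without the padding bytes.
validity = bytes([0b00001001])
offsets = struct.pack("<5i", 0, 3, 3, 3, 7)
data = b"joemark"


def read_strings(validity: bytes, offsets: bytes, data: bytes, length: int) -> list:
    """Decode a variable-length string array from its three buffers."""
    out = []
    for i in range(length):
        # Bit i of the bitmap (least-significant bit first) marks a non-null slot.
        if not (validity[i // 8] >> (i % 8)) & 1:
            out.append(None)
            continue
        # Slot i spans data[offsets[i]:offsets[i + 1]].
        start, end = struct.unpack_from("<2i", offsets, 4 * i)
        out.append(data[start:end].decode("utf-8"))
    return out


decoded = read_strings(validity, offsets, data, 4)
print(decoded)  # ['joe', None, None, 'mark']
```

Null slots repeat the previous offset, so they take no space in the value buffer.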
  13. Impact!
      Arrow is now used in all “edges” where data passes through:
      • Databases, either in clients or in UDFs
      • Data Engineering tooling
      • Machine Learning libraries
      • Dashboarding and BI applications
  14. Parquet
      1. This was the first exposure of Arrow to the Python world
      2. End-users only see pandas.read_parquet
      3. Actually, it is:
         A. C++ Parquet->Arrow reader
         B. C++ Pandas<->Arrow adapter
         C. Small Python shim to connect both and give a nice API
  15. DuckDB Interop
      1. Load data in Arrow
      2. Process in DuckDB
      3. Convert back to Arrow
      4. Hand over to another tool
      All the above happened without any serialization overhead
  16. Fast Database Access
      Nowadays, you get even more speed with
      • ADBC (Arrow Database Connectivity)
      • arrow-odbc
  17. Should you use Arrow?
      1. Actually, no.
      2. Not directly, but make sure it is used in the backend.
      3. If you need performance and the current data exchange is slow, then dive deeper.
      4. Use it directly if you want to write high-performance, framework-agnostic code.