Slide 1

Apache Arrow
Exploring the tech that powers the modern data (science) stack
Uwe Korn – QuantCo – May 2024

Slide 2

About me
• Uwe Korn
  https://mastodon.social/@xhochy / @xhochy
  https://www.linkedin.com/in/uwekorn/
• CTO at the Data Science startup QuantCo
• Previously worked as a Data Engineer
• A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
• Co-organizer of PyData Südwest

Slide 3

Agenda
1. Why do we need this?
2. What is it?
3. What’s its impact?

Slide 4

Why do we need this?
• Different ecosystems
  • PyData / R space
  • Java/Scala “Big Data”
  • SQL databases
• Different technologies
  • Pandas / SQLite

Slide 5

Why solve it?
• We build pipelines to move data
• We want to leverage every tool available
• Avoid writing converters, or waiting for data to be converted

Slide 6

No content

Slide 7

No content

Slide 8

Introducing Apache Arrow
• A columnar representation of data in main memory
• Provides libraries to access the data structures
• Building blocks for various ecosystems to use
• Implements adapters for existing structures

Slide 9

Columnar?
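The slide’s diagram is not in the text, but the row-vs-columnar contrast it introduces can be sketched in plain Python (an illustrative example with made-up field names; this is not Arrow code):

```python
# Row-oriented storage: one object per record. Scanning a single
# column means hopping across every record object in memory.
rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 4.50},
    {"id": 3, "price": 1.25},
]

# Column-oriented storage (Arrow's model): one contiguous sequence
# per field. Scanning a column reads one sequence front to back.
columns = {
    "id": [1, 2, 3],
    "price": [9.99, 4.50, 1.25],
}

# Aggregating one column touches only that column's data.
total = sum(columns["price"])
```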

Slide 10

All the languages!
1. “Pure” implementations in C++, Java, Go, JavaScript, C#, Rust, Julia, Swift, C (nanoarrow)
2. Wrappers on top of them in Python, R, Ruby, C/GLib, MATLAB

Slide 11

There is a social component
1. A standard is only as good as its usage
2. Different communities came together to form Arrow
3. Nowadays even more use it to connect

Slide 12

Arrow Basics
1. Array: a sequence of values of the same type in contiguous buffers
2. ChunkedArray: a sequence of Arrays of the same type
3. Table: an ordered collection of named ChunkedArrays of equal length

Slide 13

Arrow Basics: validity masks
1. Each Array tracks a null_count
2. Each Array has a buffer of bits indicating whether a value is valid, i.e. non-null

Slide 14

Arrow Basics: int array

Python array: [1, null, 2, 4, 8]
Length: 5, Null count: 1

Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63  |
|--------------------------|-------------|
| 00011101                 | 0 (padding) |

Value buffer:

| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
|-----------|-------------|------------|-------------|-------------|-----------------------|
| 1         | unspecified | 2          | 4           | 8           | unspecified (padding) |

Slide 15

Arrow Basics: string array

Python array: ['joe', null, null, 'mark']
Length: 4, Null count: 2

Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63  |
|--------------------------|-------------|
| 00001001                 | 0 (padding) |

Offsets buffer:

| Bytes 0-19    | Bytes 20-63           |
|---------------|-----------------------|
| 0, 3, 3, 3, 7 | unspecified (padding) |

Value buffer:

| Bytes 0-6 | Bytes 7-63            |
|-----------|-----------------------|
| joemark   | unspecified (padding) |

Slide 16

Impact!
Arrow is now used at all “edges” where data passes through:
• Databases, either in clients or in UDFs
• Data engineering tooling
• Machine learning libraries
• Dashboarding and BI applications

Slide 17

Examples of Arrow’s massive impact
If it ain’t a 10x-100x+ speedup, it ain’t worth it.

Slide 18

Parquet

Slide 19

Anatomy of a Parquet file

Slide 20

Parquet
1. This was the first exposure of Arrow to the Python world
2. End users only see pandas.read_parquet
3. Under the hood it is:
   A. a C++ Parquet -> Arrow reader
   B. a C++ pandas <-> Arrow adapter
   C. a small Python shim connecting both behind a nice API

Slide 21

No content

Slide 22

DuckDB Interop
1. Load data in Arrow
2. Process it in DuckDB
3. Convert back to Arrow
4. Hand it over to another tool
All of the above happens without any serialization overhead.

Slide 23

DuckDB Interop

Slide 24

Fast Database Access

Slide 25

Fast Database Access
Nowadays, you get even more speed with:
• ADBC – Arrow Database Connectivity
• arrow-odbc

Slide 26

Should you use Arrow?
1. Actually, no.
2. Not directly, but make sure it is used in the backend.
3. If you need performance but the current exchange is slow, dive deeper.
4. If you want to write high-performance, framework-agnostic code.

Slide 27

The ecosystem
…and many more:
https://arrow.apache.org/powered_by/

Slide 28

Questions?
Follow me:
https://mastodon.social/@xhochy / @xhochy
https://www.linkedin.com/in/uwekorn/