The PyArrow revolution in Pandas

Slide 1

Slide 1 text

The PyArrow revolution in Pandas
Reuven M. Lerner • EuroPython 2024
https://LernerPython.com

Slide 2

Slide 2 text

I teach Python and Pandas!
• Corporate training
• Online courses at LernerPython.com
• Newsletters, including Bamboo Weekly (Pandas puzzles on current events)
• YouTube

Slide 3

Slide 3 text

My books

Slide 4

Slide 4 text

DRY: Don’t repeat yourself!
• Do you have the same code on multiple lines?
  • Don’t repeat yourself: Use a loop!
• Do you have the same code in multiple places in a program?
  • Don’t repeat yourself: Use a function!

Slide 5

Slide 5 text

DRY: Don’t repeat yourself!
• Do you have the same code in multiple programs?
  • Don’t repeat yourself: Use a library
  • In Python, we call this a “module” or “package”
• A module helps the future you
• It also helps other people avoid repeating your solution

Slide 6

Slide 6 text

Pandas
• Don’t implement your own data-analysis routines
• If you use Pandas, the hard stuff is done for you
  • Reading data
  • Cleaning data
  • Analyzing data
  • Visualizing data
  • Writing data
• Pandas is extremely convenient, and also popular

Slide 7

Slide 7 text

Pandas used a package, too
• Wes McKinney invented Pandas in 2008
• He built it on top of NumPy
  • Stable
  • Fast
  • Handles 1D and 2D data
  • Numerous data types

Slide 8

Slide 8 text

Pandas and NumPy
• Automatic vs. manual transmission
• Pandas series
  • A wrapper around a 1D NumPy array
• Pandas data frames
  • A wrapper around a 2D NumPy array
  • Or, if you prefer: A dictionary of Pandas series

Slide 9

Slide 9 text

Lots of good news
• NumPy’s storage is in C
  • Much faster than Python
  • Much less memory usage than Python
• Vectorization
• Lots of analysis methods
• Used by many people and projects, so you know it’s stable

Slide 10

Slide 10 text

The bad news
• Storing data in Pandas (via NumPy) uses lots of memory
  • Storage in rows, vs. in columns
  • We store all of the data precisely as it is
  • No compression
  • No use of zero-copy techniques
• Not designed for batch processing or streaming
• No complex data types
  • Strings
  • Dates and times
  • Nested types

Slide 11

Slide 11 text

Memory usage
• Let’s read a 2.2 GB CSV file (NYC parking violations in 2020)
• df.shape
  • (12495734, 43)  # 12.5 million rows
• df.info()
  • Memory usage: 4.0+ GB
• df.info(memory_usage='deep')
  • Memory usage: 15.6 GB
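The gap between the two memory figures comes from string columns: the default report counts only the pointers in an object column, while the "deep" report also counts each Python string. A minimal sketch on made-up data (the column names are borrowed from the talk; the values and sizes are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the parking-violations file.
colors = pd.Series(["BLUE", "WHITE", "BLACK", "GREY"] * 250, dtype=object)
df = pd.DataFrame({"Summons Number": range(1_000), "Vehicle Color": colors})

# Shallow: object columns are counted as 8-byte pointers only.
shallow = int(df.memory_usage().sum())

# Deep: each Python string object behind those pointers is measured too.
deep = int(df.memory_usage(deep=True).sum())

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```

On real data the ratio depends on how many string columns there are and how long the strings are.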

Slide 12

Slide 12 text

Arrow: DRY for data
• Many languages and frameworks work with 2D data
• What if we had a single library that everyone could rely on?
  • Don’t reinvent the wheel; use a single, working data structure
  • Use columns, rather than rows, for fast retrieval
  • Reduce the overhead of exchanging data among systems
  • Take advantage of modern processors
• Arrow was first released in 2016
• Latest version, 13.0.0, was released in August 2023

Slide 13

Slide 13 text

PyArrow
• Python bindings for Arrow
• You can use PyArrow in your programs!
  • Create arrays (1D) and tables (2D)
  • Retrieve particular rows and columns
  • Sorting and grouping

Slide 14

Slide 14 text

Some of Arrow’s data types
• Primitive types
  • Integers (signed and unsigned)
  • Floats
  • Date, time, and timestamp
  • String, binary
  • Dict (like Pandas categories)
  • Map (like Python dicts)
• Complex types
  • Array (like Pandas series)
  • Table (like Pandas data frame)

Slide 15

Slide 15 text

So what?

Slide 16

Slide 16 text

The PyArrow revolution
• Pandas is moving toward PyArrow
  • Some functionality is already here
  • Much more is coming in the future
• Using PyArrow can save you time and memory
• And get ready: It’ll be required in Pandas 3.0

Slide 17

Slide 17 text

PyArrow revolution, part 1: Faster CSV reading/writing

Slide 18

Slide 18 text

We use lots of CSV files
• “read_csv” is very flexible, useful, and popular
• Also: Can be very slow!
  • Speed up by specifying dtypes
  • Speed up by setting low_memory=False
  • … but it’s still really slow
• Solution: Use PyArrow for reading/writing CSV files
  • How? Specify engine='pyarrow'

Slide 19

Slide 19 text

Time comparison
    import time
    import pandas as pd

    # filename is the path to the CSV file, defined elsewhere

    def load_with_time():
        start_time = time.perf_counter()
        df = pd.read_csv(filename, low_memory=False)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'{total_time:0.2f}')

    def pyarrow_load_with_time():
        start_time = time.perf_counter()
        df = pd.read_csv(filename, engine='pyarrow')
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'{total_time:0.2f}')

Slide 20

Slide 20 text

The results?
    load_with_time()
    69.12
    pyarrow_load_with_time()
    13.54

Slide 21

Slide 21 text

Differences
• Good:
  • It’s 5x faster! Does anything else really matter?!?
  • PyArrow reads the whole thing; no more low_memory=False
  • PyArrow (usually) detects datetime columns, so there’s less need for parse_dates
• Bad:
  • Some CSV files are too weird for PyArrow
  • If the file is small, then PyArrow isn’t worthwhile

Slide 22

Slide 22 text

Use this today!
• I now use PyArrow to load CSV files by default
  • It doesn’t always work
  • It usually does, and is way faster

Slide 23

Slide 23 text

PyArrow revolution, part 2: Faster file formats

Slide 24

Slide 24 text

What formats do we use?
• Most data is in:
  • CSV (text-based, slow, poorly specified)
  • Excel (handles dtypes, slow, proprietary)
• Arrow defined two new columnar, binary formats
  • Feather
    • Fast reads and writes
    • No compression
  • Parquet
    • Slower reads and writes
    • Highly compressed

Slide 25

Slide 25 text

Size comparison
• The same data, in three formats:
  • CSV: 2.2G
  • Feather: 1.4G
  • Parquet: 379M
• Not only smaller!
  • Much faster to load
  • Binary format
  • No dtype guessing/hints
  • Other systems/languages support them, too

Slide 26

Slide 26 text

How much faster?
• Let’s load the same data, in three different formats:
  • CSV, Python engine: 55.8 s
  • CSV, PyArrow engine: 11.8 s
  • Feather: 10.6 s
  • Parquet: 9.1 s

Slide 27

Slide 27 text

Store data in Feather/Parquet
• You can use these formats today!
• Do a one-time translation from CSV to Feather/Parquet
• Then read from the binary format

Slide 28

Slide 28 text

PyArrow revolution, part 3: Swapping out NumPy

Slide 29

Slide 29 text

This is big! (Or it will be)
• It’s also experimental
• It’ll eventually be preferred or default
• This will take time!

Slide 30

Slide 30 text

Using PyArrow on the back end
• Choose a PyArrow dtype, rather than one from NumPy
• Usually, that just means putting [pyarrow] after the name
    import numpy as np
    from pandas import Series, DataFrame

    s = Series(np.random.randint(-50, 50, 10),
               index=list('abcdefghij'),
               dtype='int64[pyarrow]')
    df = DataFrame(np.random.randint(-50, 50, [3, 4]),
                   index=list('abc'),
                   columns=list('wxyz'),
                   dtype='int64[pyarrow]')

Slide 31

Slide 31 text

Convert one column
    df['Vehicle Color'].memory_usage(deep=True)
    635123659  # 635M
    df['Vehicle Color'] = df['Vehicle Color'].astype('string[pyarrow]')
    df['Vehicle Color'].memory_usage(deep=True)
    134160082  # 134M
    f'{(134160082 / 635123659):.02%}'
    '21.12%'

Slide 32

Slide 32 text

Use PyArrow when reading a CSV
    df_pa = pd.read_csv(filename,
                        engine='pyarrow',
                        dtype_backend='pyarrow')

Slide 33

Slide 33 text

df.dtypes
    Summons Number                      int64[pyarrow]
    Plate ID                            string[pyarrow]
    Registration State                  string[pyarrow]
    Plate Type                          string[pyarrow]
    Issue Date                          string[pyarrow]
    Violation Code                      int64[pyarrow]
    Vehicle Body Type                   string[pyarrow]
    Vehicle Make                        string[pyarrow]
    Issuing Agency                      string[pyarrow]
    Street Code1                        int64[pyarrow]
    Street Code2                        int64[pyarrow]
    Street Code3                        int64[pyarrow]
    Vehicle Expiration Date             int64[pyarrow]
    Violation Location                  int64[pyarrow]
    Violation Precinct                  int64[pyarrow]
    Issuer Precinct                     int64[pyarrow]
    Issuer Code                         int64[pyarrow]
    Issuer Command                      string[pyarrow]
    Issuer Squad                        string[pyarrow]
    Violation Time                      string[pyarrow]
    Time First Observed                 string[pyarrow]
    Violation County                    string[pyarrow]
    Violation In Front Of Or Opposite   string[pyarrow]
    House Number                        string[pyarrow]
    Street Name                         string[pyarrow]
    Intersecting Street                 string[pyarrow]
    Date First Observed                 int64[pyarrow]
    Law Section                         int64[pyarrow]
    Sub Division                        string[pyarrow]
    Violation Legal Code                string[pyarrow]
    Days Parking In Effect              string[pyarrow]
    From Hours In Effect                string[pyarrow]
    To Hours In Effect                  string[pyarrow]
    Vehicle Color                       string[pyarrow]
    Unregistered Vehicle?               int64[pyarrow]
    Vehicle Year                        int64[pyarrow]
    Meter Number                        string[pyarrow]
    Feet From Curb                      int64[pyarrow]
    Violation Post Code                 string[pyarrow]
    Violation Description               string[pyarrow]
    No Standing or Stopping Violation   null[pyarrow]
    Hydrant Violation                   null[pyarrow]
    Double Parking Violation            null[pyarrow]

Slide 34

Slide 34 text

Or, just use df.info()
    df.info()
    dtypes: int64[pyarrow](15), null[pyarrow](3), string[pyarrow](25)

    (
        df['Hydrant Violation']
        .isna()
        .value_counts(normalize=True)
    )
    Hydrant Violation
    True    1.0
    Name: proportion, dtype: float64

Slide 35

Slide 35 text

And the memory usage?
• Using NumPy: 15.0 GB
• Using PyArrow: 3.7 GB

Slide 36

Slide 36 text

How fast is it?

Slide 37

Slide 37 text

Top 5 values in a column
    %timeit df_np['Vehicle Color'].value_counts().head(5)
    510 ms ± 781 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa['Vehicle Color'].value_counts().head(5)
    188 ms ± 211 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Slide 38

Slide 38 text

Searching in strings with regex=True
    %timeit df_np['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5)
    4.7 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5)
    731 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
• PyArrow is about 6x faster

Slide 39

Slide 39 text

Most common states with blue cars
    %timeit df_np.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5)
    812 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5)
    87.7 ms ± 5.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Slide 40

Slide 40 text

Is PyArrow always faster?
    %timeit df_np['Issue Date'].dt.month.value_counts().head(5)
    249 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa['Issue Date'].dt.month.value_counts().head(5)
    336 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slide 41

Slide 41 text

Most common states in March/July
    %timeit df_np.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5)
    378 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5)
    460 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slide 42

Slide 42 text

Grouping
    %timeit df_np.groupby('Registration State')['Feet From Curb'].mean()
    631 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa.groupby('Registration State')['Feet From Curb'].mean()
    412 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slide 43

Slide 43 text

Retrieve rows with .iloc
    %timeit df_np.iloc[[0, 100, 100_000, -10_000]]
    126 μs ± 855 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    %timeit df_pa.iloc[[0, 100, 100_000, -10_000]]
    816 ms ± 75.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
816 ms == 816,000 µs

Slide 44

Slide 44 text

Retrieve rows with .loc
    %timeit df_np.loc[[0, 100, 100_000]]
    218 μs ± 2.01 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    %timeit df_pa.loc[[0, 100, 100_000]]
    789 ms ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
789 ms == 789,000 µs

Slide 45

Slide 45 text

Joining (self-join)
    %timeit df_np.join(df_np, rsuffix='_r')
    20.3 s ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df_pa.join(df_np, rsuffix='_r')
    10.3 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slide 46

Slide 46 text

Overall

Slide 47

Slide 47 text

New and different behavior

Slide 48

Slide 48 text

Recognize this problem?
    s = Series([10, 20, 30, 40, 50], dtype='int64')
    s.loc[2] = np.nan
    s
    0    10.0
    1    20.0
    2     NaN
    3    40.0
    4    50.0
    dtype: float64

Slide 49

Slide 49 text

Nullable types
    s = Series([10, 20, 30, 40, 50], dtype='int64[pyarrow]')
    s.loc[2] = np.nan
    s
    0      10
    1      20
    2    <NA>
    3      40
    4      50
    dtype: int64[pyarrow]

Slide 50

Slide 50 text

Nullable types: not just ints
    s = Series('hello out there'.split(), dtype='string[pyarrow]')
    s.loc[1] = np.nan
    s
    0    hello
    1     <NA>
    2    there
    dtype: string

Slide 51

Slide 51 text

NumPy 2.0 overflow behavior
    s_np = Series([10, 70, 100], dtype='int8')
    s_np + 100
    0    110
    1    -86
    2    -56
    dtype: int8
    s_np + 1000
    OverflowError: Python integer 1000 out of bounds for int

Slide 52

Slide 52 text

PyArrow overflow behavior
    s_pa = Series([10, 70, 100], dtype='int8[pyarrow]')
    s_pa + 100
    0    110
    1    170
    2    200
    dtype: int64[pyarrow]

Slide 53

Slide 53 text

PyArrow overflow behavior
    s_pa + 1000
    0    1010
    1    1070
    2    1100
    dtype: int64[pyarrow]

Slide 54

Slide 54 text

Different from extension types!
• There is another, separate way to use a backend that isn’t NumPy, namely “extension types.”
  • Their main advantage: They’re nullable
• Otherwise, they have the same issues as NumPy dtypes:
  • Row oriented storage
  • Python strings
  • No compression
  • Not interoperable with other systems

Slide 55

Slide 55 text

Using raw PyArrow
• You can, of course, use PyArrow directly
  • It’s a fast, smart, capable data structure
• If and when you want, you can convert it to a Pandas data frame: t.to_pandas()
• You can also import a data frame into PyArrow: pa.Table.from_pandas(df_pa)
• Also, when our backend uses PyArrow:
  • s.values returns a PyArrow-backed array
  • df_pa['column'].values returns a PyArrow-backed array
  • df_pa.values returns a NumPy array, for compatibility purposes

Slide 56

Slide 56 text

The real Pandas revolution
• Right now, Pandas is a powerful package
• It’s becoming a powerful platform
  • Swappable back ends (NumPy and PyArrow)
• It’s setting the standard for data-analysis APIs
  • Other libraries (e.g., Polars) are partly emulating it
• It’s becoming something that other software can work with
  • Via PyArrow, R and Apache Spark
  • In memory, DuckDB can query Pandas data frames

Slide 57

Slide 57 text

Summary
• PyArrow is revolutionizing Pandas
  • Faster file loading today
  • Faster, more efficient back-end storage tomorrow
  • (Or you can try it today!)
• Pandas is becoming a platform
  • PyArrow is part of that move
  • You’ll be able to choose how much complex efficiency you want vs. simple, inefficient clarity

Slide 58

Slide 58 text

Questions?
• Check out my courses at https://LernerPython.com
• Solve Pandas challenges with real-world data based on current events at https://BambooWeekly.com
• Follow me on YouTube/LinkedIn/X
• Contact me at [email protected]
• Enjoy the rest of the conference!