
The PyArrow revolution in Pandas

Pandas can now take advantage of PyArrow. What does that mean now, and what does it mean for the future? A lot! In this talk, from EuroPython 2024, I outline what PyArrow is, how it's different, and ways in which you can use it today. I also compare the speed of NumPy-based Pandas with PyArrow-based Pandas, and show where and how each backend currently has an advantage.

Reuven M. Lerner

July 13, 2024

Transcript

  1. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Corporate training • Online courses at LernerPython.com • Newsletters, including Bamboo Weekly (Pandas puzzles on current events) • YouTube I teach Python and Pandas! 2
  2. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Do you have the same code on multiple lines? • Don't repeat yourself: Use a loop! • Do you have the same code in multiple places in a program? • Don't repeat yourself: Use a function! DRY: Don’t repeat yourself! 4
  3. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Do you have the same code in multiple programs? • Don’t repeat yourself: Use a library • In Python, we call this a "module" or “package” • A module helps the future you • It also helps other people avoid repeating your solution DRY: Don’t repeat yourself! 5
  4. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Don’t implement your own data-analysis routines • If you use Pandas, the hard stuff is done for you • Reading data • Cleaning data • Analyzing data • Visualizing data • Writing data • Pandas is extremely convenient — and also popular Pandas 6
  5. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Wes McKinney invented Pandas in 2008 • He built it on top of NumPy • Stable • Fast • Handles 1D and 2D data • Numerous data types Pandas used a package, too 7
  6. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Automatic vs. manual transmission • Pandas series • A wrapper around a 1D NumPy array • Pandas data frames • A wrapper around a 2D NumPy array • Or, if you prefer: A dictionary of Pandas series Pandas and NumPy 8
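    A quick sketch (not from the talk) showing the NumPy arrays behind a classic series and data frame:

        import numpy as np
        import pandas as pd

        s = pd.Series([10, 20, 30])
        type(s.to_numpy())       # <class 'numpy.ndarray'>, the 1D array behind the series

        df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
        type(df.to_numpy())      # <class 'numpy.ndarray'>, the frame's values as a 2D array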
  7. • NumPy’s storage is in C • Much faster than Python • Much less memory usage than Python • Vectorization • Lots of analysis methods • Used by many people and projects, so you know it’s stable Lots of good news
  8. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Storing data in Pandas (via NumPy) uses lots of memory • Storage in rows, vs. in columns • We store all of the data precisely as it is • No compression • No use of zero-copy techniques • Not designed for batch processing or streaming • No complex data types • Strings • Dates and times • Nested types The bad news 10
  9. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Let’s read a 2.2 GB CSV file (NYC parking violations in 2020) • df.shape • (12495734, 43) # 12.5 million rows • df.info() • Memory usage: 4.0+ GB • df.info(memory_usage='deep') • Memory usage: 15.6 GB Memory usage 11
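    A minimal, runnable sketch of the commands above; the filename is a hypothetical local copy of the 2020 NYC parking-violations CSV:

        import pandas as pd

        filename = 'nyc_parking_violations_2020.csv'   # hypothetical path to the ~2.2 GB file

        df = pd.read_csv(filename, low_memory=False)
        print(df.shape)                       # (12495734, 43)
        df.info()                             # shallow estimate, e.g. "4.0+ GB"
        df.info(memory_usage='deep')          # true usage, counting Python string objects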
  10. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Many languages and frameworks work with 2D data • What if we had a single library that everyone could rely on? • Don’t reinvent the wheel; use a single, working data structure • Use columns, rather than rows, for fast retrieval • Reduce the overhead of exchanging data among systems • Take advantage of modern processors • Arrow was first released in 2016 • Version 13.0.0 was released in August 2023 Arrow: DRY for data 12
  11. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Python bindings for Arrow • You can use PyArrow in your programs! • Create arrays (1D) and tables (2D) • Retrieve particular rows and columns • Sorting and grouping PyArrow 13
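    A minimal sketch (not from the talk) of those operations with the pyarrow package; the column names here are made up for illustration:

        import pyarrow as pa

        arr = pa.array([10, 20, 30, 40])                      # 1D Arrow array
        tbl = pa.table({'color': ['BLUE', 'RED', 'BLUE', 'GRAY'],
                        'feet_from_curb': [1, 0, 2, 3]})      # 2D Arrow table

        col = tbl.column('color')                             # retrieve one column
        sorted_tbl = tbl.sort_by('color')                     # sort rows by a column
        means = (tbl.group_by('color')                        # group and aggregate
                    .aggregate([('feet_from_curb', 'mean')]))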
  12. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Primitive types • Integers (signed and unsigned) • Floats • Date, time, and timestamp • String, binary • Dict (like Pandas categories) • Map (like Python dicts) • Complex types • Array (like Pandas series) • Table (like Pandas data frame) Some of Arrow’s data types 14
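    A small sketch (not from the talk) constructing a few of these types explicitly; the values are arbitrary:

        import datetime
        import pyarrow as pa

        ints = pa.array([1, 2, None], type=pa.int8())                    # signed 8-bit integers, nullable
        floats = pa.array([1.5, 2.5], type=pa.float64())
        stamps = pa.array([datetime.datetime(2020, 7, 1)],
                          type=pa.timestamp('s'))
        words = pa.array(['hello', 'out', 'there'], type=pa.string())
        colors = pa.array(['BLUE', 'RED', 'BLUE']).dictionary_encode()   # dictionary type, like Pandas categories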
  13. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Pandas is moving toward PyArrow • Some functionality is already here • Much more is coming in the future • Using PyArrow can save you time and memory • And get ready: It’ll be required in Pandas 3.0 The PyArrow revolution 16
  14. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    PyArrow revolution, part 1: Faster CSV reading/writing 17
  15. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • 'read_csv' is very flexible, useful, and popular • Also: Can be very slow! • Speed up by specifying dtypes • Speed up by setting low_memory=False • … but it’s still really slow • Solution: Use PyArrow for reading/writing CSV files • How? Specify engine='pyarrow' We use lots of CSV files 18
  16. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    import time
    import pandas as pd

    # filename refers to the 2.2 GB NYC parking-violations CSV used throughout

    def load_with_time():
        start_time = time.perf_counter()
        df = pd.read_csv(filename, low_memory=False)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'{total_time:0.2f}')

    def pyarrow_load_with_time():
        start_time = time.perf_counter()
        df = pd.read_csv(filename, engine='pyarrow')
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'{total_time:0.2f}')

    Time comparison 19
  17. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    load_with_time()
    69.12

    pyarrow_load_with_time()
    13.54

    The results? 20
  18. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Good: • It’s 5x faster! Does anything else really matter?!? • PyArrow reads the whole thing; no more low_memory=False • PyArrow (usually) detects datetime columns, so there’s less need for parse_dates • Bad: • Some CSV files are too weird for PyArrow • If the file is small, then PyArrow isn’t worthwhile Differences 21
  19. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • I now use PyArrow to load CSV files by default • It doesn’t always work • It usually does, and is way faster Use this today! 22
  20. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    PyArrow revolution, part 2: Faster file formats 23
  21. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Most data is in: • CSV (text-based, slow, poorly specified) • Excel (handles dtypes, slow, proprietary) • Arrow defined two new columnar, binary formats • Feather • Fast reads and writes • No compression • Parquet • Slower reads and writes • Highly compressed What formats do we use? 24
  22. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • The same data, in three formats: • CSV: 2.2G • Feather: 1.4G • Parquet: 379M • Not only smaller! • Much faster to load • Binary format • No dtype guessing/hints • Other systems/languages support them, too Size comparison 25
  23. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Let’s load the same data, in three different formats: • CSV, Python engine: 55.8 s • CSV, PyArrow engine: 11.8 s • Feather: 10.6 s • Parquet: 9.1 s How much faster? 26
  24. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • You can use these formats today! • Do a one-time translation from CSV to Feather/ Parquet • Then read from the binary format Store data in Feather/Parquet 27
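    A minimal sketch of that one-time conversion, with hypothetical filenames; both binary formats rely on the pyarrow package:

        import pandas as pd

        # One-time translation from CSV to the binary formats
        df = pd.read_csv('violations.csv', engine='pyarrow')
        df.to_feather('violations.feather')
        df.to_parquet('violations.parquet')

        # From then on, load the binary files directly
        df = pd.read_feather('violations.feather')
        df = pd.read_parquet('violations.parquet')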
  25. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    28 PyArrow revolution, part 3: Swapping out NumPy
  26. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • It’s also experimental • It’ll eventually be preferred or default • This will take time! This is big! (Or it will be) 29
  27. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Choose a PyArrow dtype, rather than one from NumPy • Usually, that just means putting [pyarrow] after the name

    import numpy as np
    from pandas import Series, DataFrame

    s = Series(np.random.randint(-50, 50, 10),
               index=list('abcdefghij'),
               dtype='int64[pyarrow]')

    df = DataFrame(np.random.randint(-50, 50, [3, 4]),
                   index=list('abc'),
                   columns=list('wxyz'),
                   dtype='int64[pyarrow]')

    Using PyArrow on the back end 30
  28. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    df['Vehicle Color'].memory_usage(deep=True)
    635123659    # 635M

    df['Vehicle Color'] = df['Vehicle Color'].astype('string[pyarrow]')

    df['Vehicle Color'].memory_usage(deep=True)
    134160082    # 134M

    f'{(134160082 / 635123659):.02%}'
    '21.12%'

    Convert one column 31
  29. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    df_pa = pd.read_csv(filename, engine='pyarrow', dtype_backend='pyarrow') Use PyArrow when reading a CSV 32
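    The engine (how the CSV is parsed) and the dtype_backend (how the resulting columns are stored) are independent options. A brief sketch, assuming pandas 2.x, of other ways to get PyArrow-backed dtypes; the Parquet filename is hypothetical, and df_np stands for a frame already loaded with NumPy-backed dtypes (the naming used in the benchmark slides below):

        # Other readers accept dtype_backend as well
        df_pa = pd.read_parquet('violations.parquet', dtype_backend='pyarrow')

        # Or convert a frame that's already loaded with NumPy-backed dtypes
        df_pa = df_np.convert_dtypes(dtype_backend='pyarrow')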
  30. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    Summons Number                        int64[pyarrow]
    Plate ID                             string[pyarrow]
    Registration State                   string[pyarrow]
    Plate Type                           string[pyarrow]
    Issue Date                           string[pyarrow]
    Violation Code                        int64[pyarrow]
    Vehicle Body Type                    string[pyarrow]
    Vehicle Make                         string[pyarrow]
    Issuing Agency                       string[pyarrow]
    Street Code1                          int64[pyarrow]
    Street Code2                          int64[pyarrow]
    Street Code3                          int64[pyarrow]
    Vehicle Expiration Date               int64[pyarrow]
    Violation Location                    int64[pyarrow]
    Violation Precinct                    int64[pyarrow]
    Issuer Precinct                       int64[pyarrow]
    Issuer Code                           int64[pyarrow]
    Issuer Command                       string[pyarrow]
    Issuer Squad                         string[pyarrow]
    Violation Time                       string[pyarrow]
    Time First Observed                  string[pyarrow]
    Violation County                     string[pyarrow]
    Violation In Front Of Or Opposite    string[pyarrow]
    House Number                         string[pyarrow]
    Street Name                          string[pyarrow]
    Intersecting Street                  string[pyarrow]
    Date First Observed                   int64[pyarrow]
    Law Section                           int64[pyarrow]
    Sub Division                         string[pyarrow]
    Violation Legal Code                 string[pyarrow]
    Days Parking In Effect               string[pyarrow]
    From Hours In Effect                 string[pyarrow]
    To Hours In Effect                   string[pyarrow]
    Vehicle Color                        string[pyarrow]
    Unregistered Vehicle?                 int64[pyarrow]
    Vehicle Year                          int64[pyarrow]
    Meter Number                         string[pyarrow]
    Feet From Curb                        int64[pyarrow]
    Violation Post Code                  string[pyarrow]
    Violation Description                string[pyarrow]
    No Standing or Stopping Violation      null[pyarrow]
    Hydrant Violation                      null[pyarrow]
    Double Parking Violation               null[pyarrow]

    df.dtypes 33
  31. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    df.info()
    dtypes: int64[pyarrow](15), null[pyarrow](3), string[pyarrow](25)

    (
        df['Hydrant Violation']
        .isna()
        .value_counts(normalize=True)
    )
    Hydrant Violation
    True    1.0
    Name: proportion, dtype: float64

    Or, just use df.info() 34
  32. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Using NumPy: 15.0 GB • Using PyArrow: 3.7 GB And the memory usage? 35
  33. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np['Vehicle Color'].value_counts().head(5)
    510 ms ± 781 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa['Vehicle Color'].value_counts().head(5)
    188 ms ± 211 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Top 5 values in a column 37
  34. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5)
    4.7 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa['Vehicle Color'].str.contains('[BZ]', regex=True, case=False).value_counts().head(5)
    731 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    • PyArrow is about 6x faster

    Searching in strings with regex=True 38
  35. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5)
    812 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa.loc[lambda df_: df_['Vehicle Color'] == 'BLUE', 'Registration State'].value_counts().head(5)
    87.7 ms ± 5.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Most common states with blue cars 39
  36. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np['Issue Date'].dt.month.value_counts().head(5)
    249 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa['Issue Date'].dt.month.value_counts().head(5)
    336 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Is PyArrow always faster? 40
  37. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5)
    378 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa.loc[lambda df_: df_['Issue Date'].dt.month.isin([3, 7]), 'Registration State'].value_counts().head(5)
    460 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Most common states in March/July 41
  38. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.groupby('Registration State')['Feet From Curb'].mean()
    631 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa.groupby('Registration State')['Feet From Curb'].mean()
    412 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Grouping 42
  39. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.iloc[[0, 100, 100_000, -10_000]]
    126 μs ± 855 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    %timeit df_pa.iloc[[0, 100, 100_000, -10_000]]
    816 ms ± 75.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    816 ms == 816,000 µs

    Retrieve rows with .iloc 43
  40. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.loc[[0, 100, 100_000]]
    218 μs ± 2.01 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    %timeit df_pa.loc[[0, 100, 100_000]]
    789 ms ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    789 ms == 789,000 µs

    Retrieve rows with .loc 44
  41. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    %timeit df_np.join(df_np, rsuffix='_r')
    20.3 s ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit df_pa.join(df_np, rsuffix='_r')
    10.3 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Joining (self-join) 45
  42. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    import numpy as np
    from pandas import Series

    s = Series([10, 20, 30, 40, 50], dtype='int64')
    s.loc[2] = np.nan
    s
    0    10.0
    1    20.0
    2     NaN
    3    40.0
    4    50.0
    dtype: float64

    Recognize this problem? 48
  43. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    s = Series([10, 20, 30, 40, 50], dtype='int64[pyarrow]')
    s.loc[2] = np.nan
    s
    0      10
    1      20
    2    <NA>
    3      40
    4      50
    dtype: int64[pyarrow]

    Nullable types 49
  44. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    s = Series('hello out there'.split(), dtype='string[pyarrow]')
    s.loc[1] = np.nan
    s
    0    hello
    1     <NA>
    2    there
    dtype: string

    Nullable types — not just ints 50
  45. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    s_np = Series([10, 70, 100], dtype='int8')

    s_np + 100
    0    110
    1    -86
    2    -56
    dtype: int8

    s_np + 1000
    OverflowError: Python integer 1000 out of bounds for int

    NumPy 2.0 overflow behavior 51
  46. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    s_pa = Series([10, 70, 100], dtype='int8[pyarrow]')

    s_pa + 100
    0    110
    1    170
    2    200
    dtype: int64[pyarrow]

    PyArrow overflow behavior 52
  47. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    s_pa + 1000
    0    1010
    1    1070
    2    1100
    dtype: int64[pyarrow]

    PyArrow overflow behavior 53
  48. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • There is another, separate way to use a backend that isn't NumPy, namely "extension types." • Their main advantage: They’re nullable • Otherwise, they have the same issues as NumPy dtypes: • Row-oriented storage • Python strings • No compression • Not interoperable with other systems Different from extension types! 54
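    For contrast, a small sketch (not from the talk) of a nullable extension dtype next to its PyArrow equivalent; both display missing values as <NA>:

        import pandas as pd

        s_ext = pd.Series([10, None, 30], dtype='Int64')             # NumPy-backed nullable extension type
        s_arrow = pd.Series([10, None, 30], dtype='int64[pyarrow]')  # Arrow-backed, nullable, columnar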
  49. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • You can, of course, use PyArrow directly • It’s a fast, smart, capable data structure • If and when you want, you can convert it to a Pandas data frame: t.to_pandas() • You can also import a data frame into PyArrow: pa.Table.from_pandas(df_pa) • Also, when our backend uses PyArrow: • s.values returns a PyArrow array • df_pa['column'].values returns a PyArrow array • df_pa.values returns a NumPy array, for compatibility purposes Using raw PyArrow 55
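    A minimal round-trip sketch of those calls, assuming the PyArrow-backed df_pa from the earlier slides:

        import pyarrow as pa

        t = pa.Table.from_pandas(df_pa)      # Pandas data frame -> Arrow table
        df_back = t.to_pandas()              # Arrow table -> Pandas data frame

        df_pa['Vehicle Color'].values        # Arrow-backed array (per the bullet above)
        df_pa.values                         # NumPy array, for compatibility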
  50. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Right now, Pandas is a powerful package • It’s becoming a powerful platform • Swappable back ends (NumPy and PyArrow) • It’s setting the standard for data-analysis APIs • Other libraries (e.g., Polars) are partly emulating it • It’s becoming something that other software can work with • Via PyArrow, it can exchange data with R and Apache Spark • In memory, DuckDB can query Pandas data frames The real Pandas revolution 56
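    For example, a minimal sketch (not from the talk) of DuckDB querying an in-memory Pandas data frame, assuming the duckdb package is installed; the frame here is made up:

        import duckdb
        import pandas as pd

        df = pd.DataFrame({'color': ['BLUE', 'RED', 'BLUE'],
                           'feet_from_curb': [1, 0, 2]})

        # DuckDB can refer to local Pandas data frames by their variable names
        duckdb.sql('SELECT color, AVG(feet_from_curb) FROM df GROUP BY color').df()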
  51. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • PyArrow is revolutionizing Pandas • Faster file loading today • Faster, more efficient back-end storage tomorrow • (Or you can try it today!) • Pandas is becoming a platform • PyArrow is part of that move • You’ll be able to choose how much complex efficiency you want vs. simple, inefficient clarity Summary 57
  52. The PyArrow revolution in Pandas Reuven M. Lerner • https://LernerPython.com

    • Check out my courses at https://LernerPython.com • Solve Pandas challenges with real-world data based on current events at https://BambooWeekly.com • Follow me on YouTube/LinkedIn/X • Contact me at [email protected] • Enjoy the rest of the conference! Questions? 58