Slide 1

Slide 1 text

Handling Large Data with Python
Jill Cates, Senior Data Scientist at Shopify
PyCon Italia | May 2024

Slide 2

Slide 2 text

Defining the Scope

Slide 3

Slide 3 text

What kind of data? In scope for this presentation Data can come in many different forms: • Tabular (dataframes) • Images • N-dimensional arrays • Time-series data • Audio and video • Graphical structures

Slide 4

Slide 4 text

What kind of data? An example of a tabular dataframe

member_id  name   age  height
1          Joe    47   194
2          Cindy  32   168
3          Roman  11   161
4          Lucy   26   173

Rows = samples; Columns = attributes / features

Slide 5

Slide 5 text

What kind of data? Data have two possible states:
• In motion: ETL pipelines; batch or streaming processing. Examples: Apache Spark, Flink, Beam
• At rest: lives in a data warehouse or static files (e.g., csv, tsv, Excel); exploratory data analysis

Slide 7

Slide 7 text

What kind of system? Data operations can happen on two types of systems:
• Single node: data exploration and ML modelling; less overhead
• Distributed: ideal for running tasks in parallel; high overhead

Slide 9

Slide 9 text

Python packages for large data: • Pandas • Polars • DuckDB • Vaex • Dask • Modin

Slide 10

Slide 10 text

Pandas

Slide 11

Slide 11 text

Pandas v1: Overview • Created in 2008 by Wes McKinney • Built around "data frame" and "series" data structures (pandas.DataFrame, pandas.Series) • Represented as NumPy arrays behind the scenes

Slide 12

Slide 12 text

Pandas v1: Overview • Great option for manipulating and analyzing relatively small datasets (rule of thumb: have 5 to 10 times as much RAM as the size of your dataset) • Very strong community; works well with many other libraries (e.g., seaborn, matplotlib, scikit-learn)

Slide 13

Slide 13 text

Pandas v1: Optimization Tips
1) Only load the columns that you care about
2) Specify column data types when loading data
3) Down-cast integers and floats
4) Filter first before merging or aggregating
5) Avoid for loops at all costs
6) Serialize your data

Slide 14

Slide 14 text

Pandas v1: Optimization Tips 1) Reduce memory by only loading the columns that you care about. Reduces dataframe size by 75% (original size: 203.8+ MB).
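A minimal sketch of tip 1, using a hypothetical members.csv with more columns than we need; usecols tells pandas to parse only the listed columns:

    import pandas as pd

    # Load only the columns we care about; the other columns in the
    # file are never materialized in memory.
    df = pd.read_csv(
        "members.csv",                           # hypothetical file
        usecols=["member_id", "age", "height"],
    )
    df.info(memory_usage="deep")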

Slide 15

Slide 15 text

Pandas v1: Optimization Tips 2) Reduce memory by specifying column data types. Column data types affect memory usage!

Slide 16

Slide 16 text

Pandas v1: Optimization Tips 2) Reduce memory by specifying column data types • Convert string datatypes from "object" to "category" or "string" • Category: a string variable with a few possible values. Reduces dataframe size by 20% (original size: 203.8+ MB).
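A sketch of tip 2 with hypothetical column names; the dtype argument declares types at load time instead of accepting pandas' defaults:

    import pandas as pd

    # Declare dtypes up front: "category" for low-cardinality strings,
    # "string" instead of the generic "object" dtype.
    df = pd.read_csv(
        "members.csv",                 # hypothetical file
        dtype={
            "member_id": "int32",
            "name": "string",
            "country": "category",     # few distinct values
        },
    )
    df.info(memory_usage="deep")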

Slide 17

Slide 17 text

Pandas v1: Optimization Tips 3) Reduce memory by down-casting integers and floats. Smaller types take up less space than float64 and int64. Reduces size by 40% (original size: 203 MB).

Slide 18

Slide 18 text

Pandas v1: Optimization Tips 3) Reduce memory by down-casting integers and floats

Type   Min                   Max
int8   -128                  127
int16  -32768                32767
int32  -2147483648           2147483647
int64  -9223372036854775808  9223372036854775807

Type     Min           Max
float32  -3.4028E+38   3.4028E+38
float64  -1.7976E+308  1.7976E+308

Check the ranges with np.iinfo(np.int64) and np.finfo(np.float64).
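One way to apply tip 3 (a sketch, not necessarily how the slides did it) is pd.to_numeric with downcast, which picks the smallest type from the table above that can hold the observed values:

    import pandas as pd

    df = pd.DataFrame({
        "age": [47, 32, 11, 26],
        "height": [194.0, 168.0, 161.0, 173.0],
    })

    # int64 -> int8 here, because every age fits in [-128, 127]
    df["age"] = pd.to_numeric(df["age"], downcast="integer")
    # float64 -> float32
    df["height"] = pd.to_numeric(df["height"], downcast="float")
    print(df.dtypes)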

Slide 19

Slide 19 text

Pandas v1: Optimization Tips 4) Filter first before merging or aggregating • Aggregate first, filter last: 123 ms • Filter first, aggregate last: 58.3 ms • 2.1x speed up!
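A sketch of tip 4 with made-up column names; both lines compute the same number, but the second feeds far fewer rows into the aggregation:

    import pandas as pd

    df = pd.DataFrame({
        "country": ["IT", "IT", "CA", "CA"],
        "height": [194, 168, 161, 173],
    })

    # Aggregate first, filter last: groups *every* country, then keeps one.
    slow = df.groupby("country")["height"].mean().loc["IT"]

    # Filter first, aggregate last: only the rows we need are aggregated.
    fast = df.loc[df["country"] == "IT", "height"].mean()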

Slide 20

Slide 20 text

Pandas v1: Optimization Tips
5) Avoid for loops at all costs. ❌ for index, row in df.iterrows(): ... Use vectorized operations instead!
6) Serialize your data. Alternative file formats to csv include:
• Feather -> df.to_feather()
• Parquet -> df.to_parquet()
• Pickle -> df.to_pickle()
• HDF5 -> df.to_hdf()
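A sketch of tips 5 and 6 together; the vectorized line replaces the iterrows() loop, and to_parquet (which needs pyarrow or fastparquet installed) is one of the serialization options listed above:

    import pandas as pd

    df = pd.DataFrame({"height_cm": [194, 168, 161, 173]})

    # Vectorized: one operation over the whole column,
    # instead of a Python-level loop over rows.
    df["height_m"] = df["height_cm"] / 100

    # Serialize to a columnar binary format instead of csv.
    df.to_parquet("members.parquet")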

Slide 21

Slide 21 text

The motivation behind Pandas v2

Slide 22

Slide 22 text

Pandas >= v2.0: The Apache Arrow Revolution • Released in April 2023 • Supports an Apache Arrow backend, which is very efficient! • Better support for strings • Better support for missing values • Zero-copy data access • Efficient data transfer between different systems
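A minimal sketch of opting into the Arrow backend in pandas >= 2.0 (hypothetical file); dtype_backend="pyarrow" stores columns as Arrow arrays, and engine="pyarrow" additionally uses Arrow's multithreaded csv reader:

    import pandas as pd

    # Dtypes come back as int64[pyarrow], string[pyarrow], etc.
    # instead of NumPy dtypes.
    df = pd.read_csv(
        "members.csv",
        engine="pyarrow",
        dtype_backend="pyarrow",
    )
    print(df.dtypes)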

Slide 23

Slide 23 text

Pandas >= v2.0: The Apache Arrow Revolution. NumPy backend vs. PyArrow backend.

Slide 24

Slide 24 text

Pandas >= v2.0: The Apache Arrow Revolution. PyArrow backend outperforms NumPy backend.

Operation                 NumPy Backend  PyArrow Backend  Speed Up
Read parquet file         141 ms         87 ms            1.6x
Calculate mean (float64)  3.56 ms        1.73 ms          2.1x
Endswith (string)         471 ms         14.1 ms          31.6x
Aggregated count          97.2 ms        39.5 ms          2.5x
Aggregated mean           102 ms         70 ms            1.5x

First three rows from "pandas 2.0 and the Arrow Revolution" by Marc Garcia

Slide 25

Slide 25 text

Polars

Slide 26

Slide 26 text

Polars: Overview • Created in 2020 by Ritchie Vink • Built around "data frames" and "series" (same as Pandas) • Uses Apache Arrow in the backend • Written in Rust • Supports multi-threading • Supports lazy evaluation and query optimization • Streaming API allows for out-of-core processing

Slide 27

Slide 27 text

Polars is very new and very popular!

Slide 28

Slide 28 text

Polars Syntax is very similar to PySpark

Slide 29

Slide 29 text

Polars: Syntax is very similar to PySpark. Supports string datatypes! No more "objects".
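The slide's code isn't reproduced in this transcript, so here is a small hedged example of the expression style with made-up data; note the PySpark-like filter/with_columns/select chain and the real string dtype:

    import polars as pl

    df = pl.DataFrame({
        "name": ["Joe", "Cindy", "Roman", "Lucy"],
        "age": [47, 32, 11, 26],
    })

    out = (
        df.filter(pl.col("age") >= 18)
          .with_columns(pl.col("name").str.to_uppercase().alias("name_upper"))
          .select("name", "name_upper", "age")
    )
    print(df.schema)  # "name" is a proper string column, not "object"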

Slide 30

Slide 30 text

Polars: Eager vs Lazy Evaluation • Eager evaluation: code is evaluated as soon as you run it • Lazy evaluation: each line of code is added to a query plan rather than being evaluated straight away

Slide 32

Slide 32 text

Polars: Eager vs Lazy Evaluation • Polars LazyFrames get evaluated lazily • collect() materializes the LazyFrame into a DataFrame • Query is optimized behind the scenes (e.g., filtering before aggregating) • DataFrames can be converted to LazyFrames by calling the LazyFrame class: pl.LazyFrame(df)
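A sketch of the lazy workflow with a hypothetical file; scan_csv returns a LazyFrame, explain() shows the optimized plan, and collect() materializes the result:

    import polars as pl

    lf = (
        pl.scan_csv("members.csv")      # hypothetical file; nothing is read yet
          .filter(pl.col("age") >= 18)
          .group_by("country")
          .agg(pl.col("height").mean())
    )
    print(lf.explain())   # optimized query plan
    df = lf.collect()     # execute and materialize into a DataFrame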

Slide 33

Slide 33 text

Polars: Streaming API (Lazy Evaluation + Streaming) • Executes the query in batches instead of all at once • Streaming is supported for several operations: filter, slice, head, with_columns, group_by, join, unique, sort, explode, melt • ❗ Still in active development
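A sketch of the streaming API as it stood at the time of the talk, assuming a hypothetical file too large for RAM; the only change from ordinary lazy execution is the collect flag:

    import polars as pl

    df = (
        pl.scan_csv("big_file.csv")
          .group_by("country")
          .agg(pl.col("height").mean())
          .collect(streaming=True)   # execute the plan in batches
    )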

Slide 34

Slide 34 text

DuckDB

Slide 35

Slide 35 text

DuckDB: Overview • In-process online analytical processing (OLAP) database • Created in 2019 by Mühleisen and Raasveldt • Written in C++, with client APIs in several languages • Uses a columnar-vectorized SQL query processing engine • Query optimization behind the scenes • Interoperable with Apache Arrow

Slide 36

Slide 36 text

DuckDB: Overview • Great option for SQL enthusiasts! • General-purpose data types: float -> double, string -> varchar

Slide 37

Slide 37 text

DuckDB: Overview • Uses PostgreSQL syntax • The output is a symbolic representation of the SQL query; it does not store any data
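A minimal illustration of that point; duckdb.sql() returns a relation object, a symbolic query that executes only when you ask for results:

    import duckdb

    # No data is read or stored here: rel is just a query description.
    rel = duckdb.sql("SELECT 40 + 2 AS answer")

    print(rel.fetchall())   # [(42,)] -- execution happens now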

Slide 38

Slide 38 text

DuckDB: Overview • Easily convert to data frames: Pandas DataFrame!

Slide 39

Slide 39 text

DuckDB: Overview • Easily convert to data frames: Polars DataFrame!
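A hedged sketch covering both conversions from the two slides above; the relation materializes into whichever dataframe library you ask for:

    import duckdb

    rel = duckdb.sql("SELECT 1 AS a, 2 AS b")

    pandas_df = rel.df()     # pandas DataFrame
    polars_df = rel.pl()     # Polars DataFrame (needs polars installed)
    arrow_tbl = rel.arrow()  # PyArrow Table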

Slide 40

Slide 40 text

DuckDB: Overview • Reference variables in SQL queries: you can store intermediate queries as variables and reference them in other queries
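A sketch with a hypothetical csv; DuckDB's Python API resolves in-scope variables (relations and dataframes) by name inside later queries:

    import duckdb

    # Store an intermediate query as a Python variable...
    adults = duckdb.sql("SELECT * FROM 'members.csv' WHERE age >= 18")

    # ...and reference it by name in another query.
    by_country = duckdb.sql(
        "SELECT country, avg(height) AS mean_height FROM adults GROUP BY country"
    )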

Slide 41

Slide 41 text

DuckDB: Overview • Calling Python functions inside a SQL query

Slide 42

Slide 42 text

DuckDB: Overview • Calling Python functions inside a SQL query: register a Python user-defined function (UDF), then call the UDF directly inside the SQL query
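A sketch of the UDF flow with a made-up function; create_function registers the Python callable (inferring SQL types from the annotations), after which SQL can call it like any built-in:

    import duckdb

    def bmi(weight_kg: float, height_cm: float) -> float:
        return weight_kg / (height_cm / 100) ** 2

    con = duckdb.connect()
    con.create_function("bmi", bmi)   # register the Python UDF

    print(con.sql("SELECT bmi(70.0, 175.0) AS bmi").fetchall())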

Slide 43

Slide 43 text

Side-by-Side Comparison

Slide 44

Slide 44 text

Side-by-Side Comparison: Pandas v2 vs. Polars vs. DuckDB
[Chart: downloads by continent. *Data was queried on May 15, 2024]

Slide 45

Slide 45 text

Side-by-Side Comparison (*data was queried on May 15, 2024)

                                  Pandas v2  Polars   DuckDB
Year of first release             2008       2020     2019
Latest version* (as of May 2024)  2.2.2      0.20.27  0.10.3
Downloads last month              215.6M     6.9M     3.5M
Total number of stars             42.1K      26.6K    17.2K
Merged PRs last month             273        255      175
Total number of contributors      3.2K       407      316

Slide 46

Slide 46 text

Side-by-Side Comparison

                            Pandas v2    Polars       DuckDB
System type                 Single node  Single node  Single node
Syntax                      Python       Python       SQL
Lazy evaluation             ❌           ✅           ✅
Query optimization          ❌           ✅           ✅
Supports multi-threading    ❌           ✅           ✅
Interoperable with PyArrow  ✅           ✅           ✅

Slide 47

Slide 47 text

Benchmarking: TPC-H benchmark results (from the Polars website). Benchmarking is an art and a science. Proceed with caution!

Slide 48

Slide 48 text

Benchmarking: my very unofficial benchmark test (3.5 GB data, local). Benchmarking is an art and a science. Proceed with caution!

                  Pandas v2  Polars  DuckDB  DuckDB (without materializing to Arrow object)
Aggregated count  1.18 s     487 ms  770 ms  149 µs
Aggregated mean   44.7 s     499 ms  767 ms  148 µs

Slide 49

Slide 49 text

Summary: Which package is best?
• Each package has its advantages and disadvantages
  - Pandas v1 and v2
  - Polars
  - DuckDB
• Things to consider when making a decision:
  - Size of data
  - Learning curve and amount of spare time to learn a new syntax
  - Knowledge base of contributors and peer reviewers
• My recommendation: try them all!

Slide 50

Slide 50 text

Thank you! Jill Cates [email protected] GitHub @topspinj

Slide 51

Slide 51 text

Recommended Resources
• Pandas Docs | Polars Docs | DuckDB Docs
• pandas 2.0 and the Arrow revolution (part I) by Marc Garcia
• Apache Arrow and the "10 Things I Hate About pandas" by Wes McKinney
• DuckDB: Bringing analytical SQL directly to your Python shell (PyData presentation) by Pedro Holanda
• Comprehensive Guide To Optimize Your Pandas Code by Eyal Trabelsi
• Pandas 2.0 and its Ecosystem (Arrow, Polars, DuckDB) by Simon Späti