Data Engineering SG Meetup

Polars DataFrame Library

Sivakumar

April 12, 2025

Transcript

  1. Polars
     ▪ Open-source library for data manipulation
     ▪ Written in Rust
     ▪ Python, R & NodeJS language support
     ▪ Started in 2020, created by Ritchie Vink
  2. Polars – Features
     ▪ Fastest data processing on a single machine
     ▪ Supports both eager (default) & lazy execution
     ▪ Uses the Apache Arrow columnar memory format
     ▪ Multi-threaded
     ▪ Query optimizer
     ▪ Expression API
     ▪ Lazy execution with streaming capabilities (when datasets do not fit in memory, the query engine executes the query in batches)
     ▪ GPU support
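
     A minimal sketch of the eager vs. lazy distinction (not from the deck; the column names and data are illustrative):

     import polars as pl

     df = pl.DataFrame({"city": ["SG", "KL", "SG"], "sales": [10, 20, 30]})

     # Eager (default): each call executes immediately.
     eager = df.filter(pl.col("city") == "SG").select(pl.col("sales").sum())

     # Lazy: build a query plan first; the optimizer runs it only on collect().
     lazy = (
         df.lazy()
         .filter(pl.col("city") == "SG")
         .select(pl.col("sales").sum())
         .collect()
     )
     print(eager.equals(lazy))  # True: both routes produce the same result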
  3. Polars – I/O Formats
     ▪ CSV, Excel, JSON, AVRO, Parquet
     ▪ Databases
     ▪ Hugging Face, Google BigQuery
     ▪ Cloud storage systems
     ▪ DeltaLake, Iceberg
     ▪ OpenOffice
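
     A few of the corresponding I/O calls, as a minimal sketch (the file names are placeholders, not from the deck):

     import polars as pl

     df_csv = pl.read_csv("sales.csv")              # eager read into a DataFrame
     lf_parquet = pl.scan_parquet("sales.parquet")  # lazy scan, returns a LazyFrame
     df_json = pl.read_json("sales.json")

     df_csv.write_parquet("sales_out.parquet")      # write back out as Parquet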
  4. Polars – Data Types
     ▪ Numeric
     ▪ String
     ▪ Temporal
     ▪ Boolean
     ▪ Enums
     ▪ Lists, Arrays
     ▪ Struct
     ▪ Objects
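
     A minimal sketch showing several of these data types in one DataFrame (illustrative data, not from the deck):

     import polars as pl
     from datetime import date

     df_types = pl.DataFrame(
         {
             "id": [1, 2],                                       # Numeric (Int64)
             "name": ["Alice", "Ben"],                           # String
             "joined": [date(2024, 1, 5), date(2024, 3, 9)],     # Temporal (Date)
             "active": [True, False],                            # Boolean
             "tier": ["gold", "silver"],                         # cast to Enum below
             "scores": [[1, 2], [3]],                            # List
         }
     ).with_columns(pl.col("tier").cast(pl.Enum(["silver", "gold"])))
     print(df_types.schema)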
  5. Polars – Interoperability
     ▪ Pandas: convert data to/from pandas DataFrames/Series
     ▪ NumPy: convert data to/from NumPy arrays
     ▪ PyArrow: convert data to/from PyArrow tables/arrays
     ▪ Pydantic: convert data from Pydantic models to Polars
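
     A minimal sketch of these conversions (pandas and pyarrow must be installed; the data is illustrative):

     import polars as pl

     df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

     pdf = df.to_pandas()                  # Polars -> pandas DataFrame
     arr = df.to_numpy()                   # Polars -> NumPy array
     tbl = df.to_arrow()                   # Polars -> PyArrow Table

     df_from_pandas = pl.from_pandas(pdf)  # pandas -> Polars
     df_from_arrow = pl.from_arrow(tbl)    # PyArrow -> Polars
     print(df_from_pandas.equals(df))      # round-trip preserves the data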
  6. Polars – DataFrame
     ▪ 2-dimensional data structure, useful for data manipulation and analytics
     ▪ The default Polars installation supports up to ~4.3 billion rows (2^32) in a DataFrame
     ▪ When installed as polars-u64-idx, it supports a maximum of ~18 quintillion rows (2^64)

     import polars as pl
     import datetime as dt

     df = pl.DataFrame({
         "name": ["Alice Archer", "Ben Brown"],
         "birthdate": [dt.date(1997, 1, 10), dt.date(1985, 2, 15)],
         "weight": [57.9, 72.5],
         "height": [1.56, 1.77],
     })
     print(df)
  7. Polars – Expressions and Contexts
     ▪ An expression is a lazy representation of a data transformation
     ▪ Modular and flexible
     ▪ No computation takes place until the expression is evaluated inside a context
     ▪ Commonly used contexts: select, with_columns, filter, group_by

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     print(bmi_expr)
  8. Polars – select Context
     ▪ Applies expressions over columns
     ▪ May produce new columns from combinations of other columns or literals
     ▪ Each result column must have the same length as the source DataFrame, or be a scalar (which is broadcast)

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.select(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  9. Polars – with_columns Context
     ▪ Creates a new DataFrame containing all columns from the source DataFrame plus the new columns defined by the expressions
     ▪ Can overwrite an existing column's data when an expression reuses its name
     ▪ The result DataFrame must have the same length as the source DataFrame

     import polars as pl

     bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
     result = df.with_columns(
         bmi=bmi_expr,
         avg_bmi=bmi_expr.mean(),
         ideal_max_bmi=25,
     )
     print(result)
  10. Polars – filter Context
     ▪ Filters the rows of a DataFrame
     ▪ Based on one or more expressions
     ▪ The filter expressions must evaluate to the Boolean data type

     import polars as pl
     from datetime import date

     result = df.filter(
         pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
         pl.col("height") > 1.7,
     )
     print(result)
  11. Polars – group_by Context
     ▪ Rows are grouped according to the unique values of the grouping expressions
     ▪ Use agg() to apply aggregating expressions to the groups

     import polars as pl

     result = df.group_by(
         (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
         (pl.col("height") < 1.7).alias("short?"),
     ).agg(pl.col("name"))
     print(result)
  12. Polars – Expression Expansion
     ▪ Shorthand notation for applying the same transformation to multiple columns
     ▪ The expression below is equivalent to the expanded list that follows it

     pl.col("weight", "height").min().name.prefix("min_")

     [
         pl.col("weight").min().alias("min_weight"),
         pl.col("height").min().alias("min_height"),
     ]
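
     For context, a minimal sketch of this expansion used inside a select context (reusing the df from slide 6; not shown on the slide):

     import polars as pl

     # Expands to min_weight and min_height columns in a single expression.
     result = df.select(pl.col("weight", "height").min().name.prefix("min_"))
     print(result)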
  13. Polars – Join
     ▪ Combines columns from one or more DataFrames into a new DataFrame
     ▪ Supports all the common join types (inner, full, left outer, right outer, etc.)

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     df_dept = pl.read_csv("dept.csv")
     result = df_emp.join(df_dept, on="department_id")
     print(result)
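
     To illustrate the join types mentioned above, a minimal sketch using the same placeholder files (the how argument selects the join strategy):

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     df_dept = pl.read_csv("dept.csv")

     inner = df_emp.join(df_dept, on="department_id", how="inner")  # default
     left = df_emp.join(df_dept, on="department_id", how="left")    # keep all employees
     full = df_emp.join(df_dept, on="department_id", how="full")    # keep rows from both sides
     print(inner, left, full)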
  14. Polars – Horizontal Concatenation
     ▪ Concatenates two DataFrames with non-overlapping columns horizontally

     import polars as pl

     df_h1 = pl.DataFrame({
         "l1": [1, 2],
         "l2": [3, 4],
     })
     df_h2 = pl.DataFrame({
         "r1": [5, 6],
         "r2": [7, 8],
         "r3": [9, 10],
     })
     df_horizontal_concat = pl.concat([df_h1, df_h2], how="horizontal")
     print(df_horizontal_concat)
  15. Polars – Vertical Concatenation
     ▪ Concatenates two DataFrames with the same columns vertically

     import polars as pl

     df_v1 = pl.DataFrame({
         "a": [1],
         "b": [3],
     })
     df_v2 = pl.DataFrame({
         "a": [2],
         "b": [4],
     })
     df_vertical_concat = pl.concat([df_v1, df_v2], how="vertical")
     print(df_vertical_concat)
  16. Polars – Lazy API
     ▪ Automatic query optimization: the query planner performs various optimizations (predicate pushdown, projection pushdown, slice pushdown, etc.)
     ▪ pl.scan_*() functions are available for various file types to load files lazily; they return a LazyFrame
     ▪ The query is evaluated only when it is collected using the collect() function
     ▪ Allows users to work with larger-than-memory datasets via the streaming option (call collect() with the argument engine="streaming")
     ▪ Use explain() to see the query plan that Polars will use
     ▪ Call .lazy() to convert a DataFrame to a LazyFrame
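
     A minimal sketch of the Lazy API, assuming a local employees.csv with salary and department_id columns (placeholder names, not from the deck):

     import polars as pl

     lf = (
         pl.scan_csv("employees.csv")          # lazy scan, nothing is read yet
         .filter(pl.col("salary") > 5000)      # candidate for predicate pushdown
         .group_by("department_id")
         .agg(pl.col("salary").mean().alias("avg_salary"))
     )

     print(lf.explain())                       # show the optimized query plan
     df_out = lf.collect()                     # run the query
     # df_out = lf.collect(engine="streaming") # run in batches for larger-than-memory data
     print(df_out)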
  17. Polars – SQL Support
     ▪ Uses a SQLContext object to manage SQL queries
     ▪ Register LazyFrame and DataFrame objects via the global namespace or a dictionary mapping
     ▪ SQL queries are always executed in lazy mode
     ▪ Polars translates SQL queries into expressions, which are then executed

     import polars as pl

     df_emp = pl.read_csv("employees.csv")
     with pl.SQLContext(register_globals=True, eager=True) as ctx:
         df_top = ctx.execute("SELECT * FROM df_emp LIMIT 5")
     print(df_top)
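
     A minimal sketch of the explicit-registration route mentioned above (the table name emp is arbitrary; the file is a placeholder):

     import polars as pl

     df_emp = pl.read_csv("employees.csv")

     ctx = pl.SQLContext()                    # no global registration
     ctx.register("emp", df_emp)              # register under an explicit table name
     result = ctx.execute(
         "SELECT department_id, COUNT(*) AS n FROM emp GROUP BY department_id",
         eager=True,
     )
     print(result)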
  18. Demo Use case #1:
     ▪ Analyzing the Singapore HDB resale dataset (data.gov.sg), ~1 million records from 1990 to date
     ▪ Get the min & max resale values for each flat type
     ▪ Use pivot & join to get min & max resale values for each town & decade (1990-1999, 2000-2009, 2010-2019, 2020-to date) & flat type (3RM, 4RM, 5RM, EXEC & MULTI-GEN)
     ▪ Pipeline: SG HDB Dataset → Filter Records → Data Standardization → Data Aggregation → Pivot & Join → Write output
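
     Not the demo code, but a minimal sketch of the aggregation step only (the file name and the columns flat_type and resale_price are assumed from the public dataset and may differ from the demo):

     import polars as pl

     lf = pl.scan_csv("hdb_resale.csv")   # placeholder file name

     # Min & max resale value per flat type.
     per_flat_type = (
         lf.group_by("flat_type")
         .agg(
             pl.col("resale_price").min().alias("min_resale"),
             pl.col("resale_price").max().alias("max_resale"),
         )
         .collect()
     )
     print(per_flat_type)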
  19. Demo Use case #2:
     ▪ Generated synthetic customer data (14 columns) of various sizes (1 Million, 5 Million, 10 Million, 25 Million & 50 Million rows) using the ficto library
     ▪ Simple data pipeline (Filter, Standardize & Aggregate) implemented in Polars, Pandas & PySpark
     ▪ Customer Data Processing – execution time in seconds (lower is better):

       Dataset Size   Polars   Pandas   PySpark
       1 Million      0.62     1.9      15.9
       5 Million      1.67     8.71     12.82
       10 Million     3.16     17       15.44
       25 Million     7.48     51.25    26.29
       50 Million     51.7     134.39   47.26

     ▪ Environment: Windows 11, i7 processor, 8 cores & 16 GB RAM; Polars v1.27.1, Pandas v2.2.0, PySpark v3.5.0