
Skinny Pandas Riding on a Rocket (PyDataGlobal 2020)

ianozsvald
November 11, 2020


With larger datasets we need to be smarter about how we use Pandas to get results. We’ll look at strategies to shrink our data to get more into RAM, offload computation to tools like Dask or Vaex, store with Parquet or SQLite, make calculations faster and retain debuggability.
https://global.pydata.org/talks/163
For more talks like this see: https://ianozsvald.com/

Transcript

  1. Skinny Pandas Riding on a Rocket
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    (The inaugural!) PyDataGlobal 2020


  2. Today’s goal
    • Pandas – get more into RAM & see what’s slow
    • Dask – work on larger datasets and on multi-core
    • SQLite – a slight segue to an oldie and a goodie
    • Vaex – a Pandas+Dask competitor for certain workflows?
    • 25M row UK Land Registry (house purchase) dataset


  3. Introductions
    • Interim Chief Data Scientist
    • 19+ years’ experience
    • Team coaching & public courses – I’m sharing from my Higher
      Performance Python course
    • “High Performance Python” book – now in its 2nd edition!


  4. Benchmarking caveat
    • Remember – all benchmarks are wrong
    • Your data/OS/libraries will change the results
    • You need to do your own experiments


  5. Pandas background
    • In-memory only
    • Operations can be RAM expensive
    • Very wide & rich API
    • Lots of timeseries tooling
    • Mixed history with NumPy & Python datatypes

  6. A quick look at the data (25M rows)
    25M rows with lots of text columns – circa 10GB in RAM in Pandas
    before we even start any work (the Pickle file we loaded was 4GB).
    https://www.gov.uk/government/collections/price-paid-data

  7. The slice we’ll use (25M rows)

  8. Beware “info” - it underestimates
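    By default df.info() only estimates the memory of object (string)
    columns; passing memory_usage="deep" measures them properly. A
    minimal sketch – the toy column names are mine, not the talk’s:

      import pandas as pd

      df = pd.DataFrame({"price": [250_000, 180_000],
                         "county": ["GREATER LONDON", "KENT"]})

      df.info()                     # object columns reported with a "+" – an under-estimate
      df.info(memory_usage="deep")  # actually measures the strings – the real figure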

  9. Let’s use Category & int32 dtypes
    10x reduction in RAM and a 10x speed-up!
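    A minimal sketch of this shrinking step, assuming illustrative
    column and file names (not the talk’s exact code):

      import pandas as pd

      df = pd.read_pickle("land_registry.pickle")  # hypothetical file name

      # Heavily repeated strings compress massively as Categoricals
      df["county"] = df["county"].astype("category")
      df["property_type"] = df["property_type"].astype("category")

      # Prices fit in 32 bits – half the RAM of the default int64
      df["price"] = df["price"].astype("int32")

      print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB")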

  10. So how many sales do we see?

  11. memory_profiler
    • Useful package to profile RAM usage in Python
    • Can track whole-program execution with “mprof”
    • Can track the cost of a single command with “%memit” – see the
      sketch below
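    Typical usage looks something like this (the profiled expression is
    illustrative):

      # pip install memory_profiler

      # In IPython/Jupyter – peak RAM of a single statement:
      #   %load_ext memory_profiler
      #   %memit df.groupby("county")["price"].mean()

      # From the shell – RAM profile of a whole program, then a plot:
      #   mprof run my_analysis.py
      #   mprof plot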

  12. Note that groupby can be RAM hungry
    Unfortunately the categorical version doesn’t save much RAM.
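    You could confirm this with %memit, along these lines (the column
    names are illustrative):

      # %memit df.groupby("county_str")["price"].mean()  # object-dtype key
      # %memit df.groupby("county_cat")["price"].mean()  # Categorical key –
      #                                                  # peak RAM is similar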

  13. “ipython memory usage” (an “auto-%memit”)
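    Usage is roughly as follows – this is from my memory of the
    package’s README, so treat the exact calls as an assumption:

      # pip install ipython_memory_usage

      # In an IPython session:
      import ipython_memory_usage.ipython_memory_usage as imu
      imu.start_watching_memory()
      # From here on, every statement is followed by an automatic RAM
      # report – much like running %memit on each input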

  14. Install “bottleneck” for faster math
    Bottleneck (and numexpr) are not installed by default and they offer
    free speed-ups, so install them!
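    Once installed, Pandas picks them up automatically – bottleneck for
    NaN-aware reductions, numexpr for eval/query expressions. A sketch
    with a made-up column:

      # pip install bottleneck numexpr
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"price": np.random.randint(50_000, 1_000_000, 1_000_000)})

      # These options default to True once the libraries are installed
      pd.set_option("compute.use_bottleneck", True)
      pd.set_option("compute.use_numexpr", True)

      df["price"].mean()       # NaN-aware reduction – bottleneck accelerates this
      df.eval("price / 1000")  # expression evaluated by numexpr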

  15. Drop to NumPy if you know you can
    Note that NumPy’s mean is not NaN-aware but Pandas’ is, so the
    Pandas version is doing more work. This works only if you know you
    have no NaNs.
    See my PyDataAmsterdam 2020 talk for more details (see my blog).
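    A minimal sketch of the trade-off:

      import numpy as np
      import pandas as pd

      s = pd.Series(np.random.random(10_000_000))

      s.mean()             # NaN-aware – skips missing values, more machinery
      s.to_numpy().mean()  # plain NumPy – faster, but wrong if NaNs are present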


  16. Tips
    • Use Categoricals for speed-ups and to save RAM
    • Try Int32 and Float32 – half the size, sometimes faster
    • Install “bottleneck” and “numexpr” for free speed-ups
    • We can only take this so far – to scale we need newer tools


  17. Dask for larger datasets
    • Rich library wrapping DataFrames, Arrays and arbitrary Python
      functions (very powerful, quite complex)
    • Popular for scaling Pandas
    • Wraps Pandas DataFrames


  18. Dask for larger datasets
    • Bigger than RAM, or “I want to use all my cores”
    • Generally you’ll use Parquet (or CSV, or one of many other choices)
    • The Dashboard gives rich diagnostics
    • Write Pandas-like code
    • Lazy – use “.compute()”, as sketched below
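    A minimal sketch, assuming a directory of Parquet files like the
    one used later in the talk (path and column names are illustrative):

      import dask.dataframe as dd

      ddf = dd.read_parquet("land_registry_parquet/")  # hypothetical path

      # Pandas-like but lazy – this builds a task graph, computes nothing:
      mean_price = ddf.groupby("county")["price"].mean()

      # .compute() runs the graph across the cores and returns a Pandas object
      result = mean_price.compute()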

  19. Resampling across 100 partitions
    Reasonably fast; uses 1GB of RAM and 8 processes.
    100 Parquet files of 20MB each (probably 100MB files would be a
    better size).
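    A hedged sketch of such a resample (the date column name is my
    guess, not the dataset’s exact schema):

      import dask.dataframe as dd

      ddf = dd.read_parquet("land_registry_parquet/")  # hypothetical path
      ddf = ddf.set_index("date_of_transfer")          # resample needs a datetime index

      # Completions per month, computed across all 100 partitions
      per_month = ddf["price"].resample("1M").count().compute()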

  20. memory_profiler’s mprof
    2 processes, 1.2GB total RAM footprint. The groupby is similar to the
    Pandas version plus some preparation (which took 11GB), but takes
    45 seconds.
    Caveat – this run loads all columns to amplify memory usage (which
    we could optimise!) for the demo.

  21. memory_profiler’s mprof
    With 16 processes, 6GB total RAM usage but only 25 seconds – you
    have control over the RAM footprint.
    Note that selecting only the subset of needed columns significantly
    speeds this up.

  22. Time series resampling across 100 partitions
    The Dask Dashboard’s task list serves as a diagnostic.


  23. Top Dask Tips
    • Keep it simple – many moving parts
    • Few processes, lots of RAM per process
    • Always call .compute()
    • ddf.sample(frac=0.01) to subsample makes things faster
    • Check the Array, Bag and Delayed options in Dask too


  24. SQLite – why use “boring SQL”?
    • 20 years old, 2M tests
    • “Most widely deployed db on the planet”
    • “To be supported through to 2050”
    • 1GB max row size, 281TB max db; used by Android, Java, Python,
      Firefox, ...

  25. Query and retrieve a DataFrame
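    For instance, with the standard library’s sqlite3 and
    pd.read_sql_query (file, table and column names are illustrative):

      import sqlite3
      import pandas as pd

      con = sqlite3.connect("land_registry.db")  # hypothetical file name
      df = pd.read_sql_query(
          "SELECT county, AVG(price) AS avg_price FROM sales GROUP BY county",
          con,
      )
      con.close()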


  26. SQLite – simple, portable
    • Circa 90MB SQLite process in RAM
    • Happily reads large (many GB–TB) database files
    • “Fast enough” for many applications & multi-user
    • Sane binary format for data interchange (beats CSV?)
    • Read/write, and could be added to source control


  27. Vaex – a newer DataFrame system
    • A lesser-known Pandas alternative
    • Designed for billions of rows, probably in HDF5 (a single big file)
    • Memory-maps to keep RAM usage low
    • Efficient groupby, faster strings, NumPy focused (also ML,
      visualisation and more), lazy expressions


  28. Vaex – a newer DataFrame system
    • Typically you convert data to HDF5 (with helper functions) – see
      the sketch below
    • Once converted, opening the file and running queries is very quick
    • Results look like Pandas DataFrames but aren’t
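    A minimal conversion sketch, assuming hypothetical file names:

      import pandas as pd
      import vaex

      # One-off conversion from an existing Pandas DataFrame
      df = pd.read_pickle("land_registry.pickle")  # hypothetical file
      vdf = vaex.from_pandas(df)
      vdf.export_hdf5("land_registry.hdf5")

      # Later, opening is near-instant – the file is memory-mapped
      vdf = vaex.open("land_registry.hdf5")
      vdf.head(5)  # looks like a Pandas DataFrame, but isn’t one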

  29. Completions per day

  30. Vaex RAM usage on a groupby
    Using the same groupby operations we had with Pandas (which cost
    up to 1GB) – Vaex is far more efficient with RAM than Pandas or
    Dask, and much faster than Dask!
    Sidenote – string operations (e.g. lower, contains) are much faster
    than in Pandas.
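    In Vaex that groupby looks roughly like this – the column name is
    illustrative and the agg API is my recollection of Vaex’s
    interface, so treat it as an assumption:

      import vaex

      vdf = vaex.open("land_registry.hdf5")  # hypothetical file

      # Lazily evaluated, out-of-core, with a modest RAM footprint
      counts = vdf.groupby(by="county", agg=vaex.agg.count())

      # Vaex string expressions (lower, contains, ...) are also fast:
      vdf["county_lower"] = vdf["county"].str.lower()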


  31. Dask and Vaex?
    • Dask – multi-core, multi-machine & multi-file; Vaex – multi-core,
      single machine and typically a single file
    • Dask allows most of Pandas and supports huge NumPy Arrays, a
      Bag, Delayed functions and lots more
    • Both can handle billions of rows and ML; both have readable code


  32. Summary
    • Stick to Dask if you need Pandas functions
    • Think about Vaex for common DataFrame operations
    • See my blog for my classes + many past talks
    • I’d love a postcard if you learned something new!

  33. File sizes

    Format                                   Size on disk (GB)
    CSV (original)                           4.4
    Pickle (protocol 5)                      3.6
    Parquet, uncompressed (faster to load)   2.2
    Parquet + Snappy (small speed penalty)   1.5
    SQLite                                   4.6
    HDF5 (Vaex)                              4.5
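    Roughly how the Parquet rows above would be produced (not the
    talk’s exact commands; file names are illustrative):

      import pandas as pd

      df = pd.read_pickle("land_registry.pickle")  # hypothetical file

      # Uncompressed Parquet trades disk space for faster loads;
      # Snappy shrinks the file with a small read-speed penalty
      df.to_parquet("land_registry_uncompressed.parquet", compression=None)
      df.to_parquet("land_registry_snappy.parquet", compression="snappy")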


  34. SQLite – why use “boring SQL”?
    • We can export from Pandas with df.to_sql
    • Works via SQLAlchemy
    • We can execute arbitrary SQL, e.g. to “CREATE INDEX” – a sketch
      follows
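    A minimal sketch with SQLAlchemy (file, table and column names are
    illustrative):

      import pandas as pd
      from sqlalchemy import create_engine, text

      engine = create_engine("sqlite:///land_registry.db")  # hypothetical file
      df = pd.read_pickle("land_registry.pickle")           # hypothetical file

      df.to_sql("sales", engine, if_exists="replace", index=False)

      # Arbitrary SQL, e.g. an index to speed up later county queries
      with engine.begin() as con:
          con.execute(text("CREATE INDEX idx_county ON sales (county)"))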