Skinny Pandas Riding on a Rocket (PyDataGlobal 2020)

Slide 1

Slide 1 text

Skinny Pandas Riding on a Rocket @IanOzsvald – ianozsvald.com Ian Ozsvald (The inaugural!) PyDataGlobal 2020

Slide 2

Slide 2 text

•Pandas – get more into RAM & see what’s slow •Dask – work on larger datasets and on multi-core •SQLite – slight segue to and oldie and a goodie •Vaex – a Pandas+Dask competitor for certain workflows? •25M row UK Land Registry (house purchase) dataset Today’s goal By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 3

Slide 3 text

 Interim Chief Data Scientist  19+ years experience  Team coaching & public courses –I’m sharing from my Higher Performance Python course Introductions By [ian]@ianozsvald[.com] Ian Ozsvald 2nd Edition!

Slide 4

Slide 4 text

 Remember – all benchmarks are wrong  Your data/OS/libraries will change the results  You need to do your own experiments Benchmarking caveat By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 5

Slide 5 text

 In-memory only  Operations can be RAM expensive  Very wide & rich API  Lots of timeseries tooling  Mixed history with NumPy & Python datatypes Pandas background By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 6

Slide 6 text

A quick look at the data (25M rows) By [ian]@ianozsvald[.com] Ian Ozsvald 25M rows, lots of text columns, circa 10GB in RAM in Pandas before we even start any work (the Pickle file we loaded was 4GB) https://www.gov.uk/government/collections/price-paid-data

Slide 7

Slide 7 text

The slice we’ll use (25M rows) By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 8

Slide 8 text

Beware “info” - it underestimates By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 9

Slide 9 text

Let’s use Category & int32 dtypes By [ian]@ianozsvald[.com] Ian Ozsvald 10x reduction! 10x speed-up!

Slide 10

Slide 10 text

So how many sales do we see? By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 11

Slide 11 text

•Useful package to profile RAM usage in Python •Can track whole-program execution with “mprof” •Can track cost of single command with “%memit” memory_profiler By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 12

Slide 12 text

Note that groupby can be RAM hungry By [ian]@ianozsvald[.com] Ian Ozsvald Unfortunately the categorical version doesn’t save much RAM

Slide 13

Slide 13 text

“ipython memory usage” (“auto- %memit”) By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 14

Slide 14 text

Install “bottleneck” for faster math By [ian]@ianozsvald[.com] Ian Ozsvald Bottleneck (and numexpr) are not installed by default and they offer free speed- ups, so install them!

Slide 15

Slide 15 text

Drop to NumPy if you know you can By [ian]@ianozsvald[.com] Ian Ozsvald Note that NumPy’s mean is not-NaN-aware but Pandas’ is, so the Pandas version is doing more work. This example works if you know you have no NaNs See my PyDataAmsterdam 2020 talk for more details (see my blog)

Slide 16

Slide 16 text

 Use Categoricals for speed-ups and to save RAM  Try Int32 and Float32 – half the size, sometimes faster  Install “bottleneck” and “numexpr” for free speed-ups  We can only take this so far, to scale we need newer tools Tips By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 17

Slide 17 text

 Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)  Popular for scaling Pandas  Wraps Pandas DataFrames Dask for larger datasets By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 18

Slide 18 text

 Bigger than RAM or “I want to use all my cores”  Generally you’ll use Parquet (or CSV or many choices)  The Dashboard gives rich diagnostics  Write Pandas-like code  Lazy – use “.compute()” Dask for larger datasets By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 19

Slide 19 text

Resampling across 100 partitions By [ian]@ianozsvald[.com] Ian Ozsvald Reasonably fast and uses 1GB of RAM and 8 processes 100 parquet files of 20MB each (probably 100MB would be a better size)

Slide 20

Slide 20 text

memory_profiler’s mprof By [ian]@ianozsvald[.com] Ian Ozsvald 2 processes, 1.2GB total RAM footprint groupby similar to the Pandas version plus some preparation (which took 11GB) but takes 45 seconds Caveat – this one loads all columns to amplify memory usage (which we could optimise!) for this demo

Slide 21

Slide 21 text

memory_profiler’s mprof By [ian]@ianozsvald[.com] Ian Ozsvald With 16 processes, 6GB total RAM usage but only 25 seconds You have control over the RAM footprint Note selecting only the subset of needed columns significantly speeds this up

Slide 22

Slide 22 text

Time series resampling across 100 partitions By [ian]@ianozsvald[.com] Ian Ozsvald Dask Dashboard Task-list as a diagnostics

Slide 23

Slide 23 text

 Keep it simple – many moving parts  Few processes, lots of RAM per process  Always call .compute()  ddf.sample(frac=0.01) to subsample makes things faster  Check Array, Bag and Delayed options in Dask too Top Dask Tips By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 24

Slide 24 text

 20 years old, 2M tests  “Most widely deployed db on the planet”  “To be supported through to 2050”  1GB rows, 281TB max db, Android, Java, Python, Firefox, ... SQLite – why use “boring SQL”? By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 25

Slide 25 text

Query and retrieve a DataFrame By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 26

Slide 26 text

 Circa 90MB SQLite process in RAM  Happily reads large (many GB-TB) database files  “Fast enough” for many applications & multi-user  Sane binary format for data interchange (beats CSV?)  Read/write and could be added to source control SQLite - simple, portable By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 27

Slide 27 text

 A lesser known Pandas alternative  Designed for billions of rows, probably in HDF5 (single big file)  Memory maps to keep RAM usage low  Efficient groupby, faster strings, NumPy focused (also ML, visualisation and more), lazy expressions Vaex – a newer DataFrame system By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 28

Slide 28 text

 Typically you convert data to HDF5 (with helper functions)  Once converted opening the file and running queries is very quick  Results look like Pandas DataFrames but aren’t Vaex – a newer DataFrame system By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 29

Slide 29 text

Completion’s per day By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 30

Slide 30 text

Vaex RAM usage on a groupby By [ian]@ianozsvald[.com] Ian Ozsvald Using the same groupby operations we had with Pandas (which cost up to 1GB) – Vaex is far more efficient with RAM usage than Pandas or Dask and much faster than Dask! Sidenote – string operations (e.g. lower, contains) much faster than Pandas

Slide 31

Slide 31 text

 Dask – multi-core, -machine & -files, Vaex – multicore, single machine and typically single file  Dask allows most of Pandas and supports huge NumPy Arrays, a Bag, Delayed functions and lots more  Both can handle billions of rows and ML. Both have readable code Dask and Vaex? By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 32

Slide 32 text

 Stick to Dask if you need Pandas functions  Think about Vaex common DF operations  See blog for my classes + many past talks  I’d love a postcard if you learned something new! Summary By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 33

Slide 33 text

File sizes By [ian]@ianozsvald[.com] Ian Ozsvald Format Size on Disk (GB) CSV (original) 4.4 Pickle 5 3.6 Parquet uncompressed (faster to load) 2.2 Parquet + Snappy (small speed penalty) 1.5 SQLite 4.6 HDF5 (Vaex) 4.5

Slide 34

Slide 34 text

 We can export from Pandas with df.to_sql  Works via SQLAlchemy  We can execute arbitrary SQL e.g. to “CREATE INDEX” SQLite – why use “boring SQL”? By [ian]@ianozsvald[.com] Ian Ozsvald