
Skinny Pandas Riding on a Rocket (PyDataGlobal 2020)

ianozsvald
November 11, 2020


With larger datasets we need to be smarter about how we use Pandas to get results. We’ll look at strategies to shrink our data to get more into RAM, offload computation to tools like Dask or Vaex, store with Parquet or SQLite, make calculations faster and retain debuggability.
https://global.pydata.org/talks/163
For more talks like this see: https://ianozsvald.com/

Transcript

  1. Skinny Pandas Riding on a Rocket
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    (The inaugural!) PyDataGlobal 2020


  2. Today’s goal
    • Pandas – get more into RAM & see what’s slow
    • Dask – work on larger datasets and on multi-core
    • SQLite – a slight segue to an oldie and a goodie
    • Vaex – a Pandas+Dask competitor for certain workflows?
    • 25M row UK Land Registry (house purchase) dataset


  3. Introductions
    • Interim Chief Data Scientist
    • 19+ years’ experience
    • Team coaching & public courses – I’m sharing from my Higher
      Performance Python course
    • “High Performance Python” book – now in its 2nd edition!


  4. Benchmarking caveat
    • Remember – all benchmarks are wrong
    • Your data/OS/libraries will change the results
    • You need to do your own experiments


  5. Pandas background
    • In-memory only
    • Operations can be RAM expensive
    • Very wide & rich API
    • Lots of timeseries tooling
    • Mixed history with NumPy & Python datatypes

  6. A quick look at the data (25M rows)
    25M rows with lots of text columns – circa 10GB in RAM in Pandas
    before we even start any work (the Pickle file we loaded was 4GB).
    https://www.gov.uk/government/collections/price-paid-data

  7. The slice we’ll use (25M rows)

  8. Beware “info” - it underestimates
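    By default df.info() only estimates the memory of object (string)
    columns; passing memory_usage="deep" measures them properly. A
    minimal sketch – the toy column names are mine, not the talk’s:

      import pandas as pd

      df = pd.DataFrame({"price": [250_000, 180_000],
                         "county": ["GREATER LONDON", "KENT"]})

      df.info()                     # object columns reported with a "+" – an under-estimate
      df.info(memory_usage="deep")  # actually measures the strings – the real figure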

  9. Let’s use Category & int32 dtypes
    10x reduction in RAM and a 10x speed-up!
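    A minimal sketch of this shrinking step, assuming illustrative
    column and file names (not the talk’s exact code):

      import pandas as pd

      df = pd.read_pickle("land_registry.pickle")  # hypothetical file name

      # Heavily repeated strings compress massively as Categoricals
      df["county"] = df["county"].astype("category")
      df["property_type"] = df["property_type"].astype("category")

      # Prices fit in 32 bits – half the RAM of the default int64
      df["price"] = df["price"].astype("int32")

      print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB")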

  10. So how many sales do we see?

  11. memory_profiler
    • Useful package to profile RAM usage in Python
    • Can track whole-program execution with “mprof”
    • Can track the cost of a single command with “%memit” – see the
      sketch below
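    Typical usage looks something like this (the profiled expression is
    illustrative):

      # pip install memory_profiler

      # In IPython/Jupyter – peak RAM of a single statement:
      #   %load_ext memory_profiler
      #   %memit df.groupby("county")["price"].mean()

      # From the shell – RAM profile of a whole program, then a plot:
      #   mprof run my_analysis.py
      #   mprof plot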

  12. Note that groupby can be RAM hungry
    Unfortunately the categorical version doesn’t save much RAM.
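    You could confirm this with %memit, along these lines (the column
    names are illustrative):

      # %memit df.groupby("county_str")["price"].mean()  # object-dtype key
      # %memit df.groupby("county_cat")["price"].mean()  # Categorical key –
      #                                                  # peak RAM is similar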

  13. “ipython memory usage” (an “auto-%memit”)
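    Usage is roughly as follows – this is from my memory of the
    package’s README, so treat the exact calls as an assumption:

      # pip install ipython_memory_usage

      # In an IPython session:
      import ipython_memory_usage.ipython_memory_usage as imu
      imu.start_watching_memory()
      # From here on, every statement is followed by an automatic RAM
      # report – much like running %memit on each input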

  14. Install “bottleneck” for faster math
    Bottleneck (and numexpr) are not installed by default and they offer
    free speed-ups, so install them!
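    Once installed, Pandas picks them up automatically – bottleneck for
    NaN-aware reductions, numexpr for eval/query expressions. A sketch
    with a made-up column:

      # pip install bottleneck numexpr
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"price": np.random.randint(50_000, 1_000_000, 1_000_000)})

      # These options default to True once the libraries are installed
      pd.set_option("compute.use_bottleneck", True)
      pd.set_option("compute.use_numexpr", True)

      df["price"].mean()       # NaN-aware reduction – bottleneck accelerates this
      df.eval("price / 1000")  # expression evaluated by numexpr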

  15. Drop to NumPy if you know you can
    Note that NumPy’s mean is not NaN-aware but Pandas’ is, so the
    Pandas version is doing more work. This works only if you know you
    have no NaNs.
    See my PyDataAmsterdam 2020 talk for more details (see my blog).
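    A minimal sketch of the trade-off:

      import numpy as np
      import pandas as pd

      s = pd.Series(np.random.random(10_000_000))

      s.mean()             # NaN-aware – skips missing values, more machinery
      s.to_numpy().mean()  # plain NumPy – faster, but wrong if NaNs are present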


  16. Tips
    • Use Categoricals for speed-ups and to save RAM
    • Try Int32 and Float32 – half the size, sometimes faster
    • Install “bottleneck” and “numexpr” for free speed-ups
    • We can only take this so far – to scale we need newer tools


  17. Dask for larger datasets
    • Rich library wrapping DataFrames, Arrays and arbitrary Python
      functions (very powerful, quite complex)
    • Popular for scaling Pandas
    • Wraps Pandas DataFrames


  18. Dask for larger datasets
    • Bigger than RAM, or “I want to use all my cores”
    • Generally you’ll use Parquet (or CSV, or one of many other choices)
    • The Dashboard gives rich diagnostics
    • Write Pandas-like code
    • Lazy – use “.compute()”, as sketched below
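    A minimal sketch, assuming a directory of Parquet files like the
    one used later in the talk (path and column names are illustrative):

      import dask.dataframe as dd

      ddf = dd.read_parquet("land_registry_parquet/")  # hypothetical path

      # Pandas-like but lazy – this builds a task graph, computes nothing:
      mean_price = ddf.groupby("county")["price"].mean()

      # .compute() runs the graph across the cores and returns a Pandas object
      result = mean_price.compute()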

  19. Resampling across 100 partitions
    Reasonably fast; uses 1GB of RAM and 8 processes.
    100 Parquet files of 20MB each (probably 100MB files would be a
    better size).
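    A hedged sketch of such a resample (the date column name is my
    guess, not the dataset’s exact schema):

      import dask.dataframe as dd

      ddf = dd.read_parquet("land_registry_parquet/")  # hypothetical path
      ddf = ddf.set_index("date_of_transfer")          # resample needs a datetime index

      # Completions per month, computed across all 100 partitions
      per_month = ddf["price"].resample("1M").count().compute()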

  20. memory_profiler’s mprof
    2 processes, 1.2GB total RAM footprint. The groupby is similar to the
    Pandas version plus some preparation (which took 11GB), but takes
    45 seconds.
    Caveat – this run loads all columns to amplify memory usage (which
    we could optimise!) for the demo.

  21. memory_profiler’s mprof
    With 16 processes, 6GB total RAM usage but only 25 seconds – you
    have control over the RAM footprint.
    Note that selecting only the subset of needed columns significantly
    speeds this up.

  22. Time series resampling across 100 partitions
    The Dask Dashboard’s task list serves as a diagnostic.


  23. Top Dask Tips
    • Keep it simple – many moving parts
    • Few processes, lots of RAM per process
    • Always call .compute()
    • ddf.sample(frac=0.01) to subsample makes things faster
    • Check the Array, Bag and Delayed options in Dask too


  24. SQLite – why use “boring SQL”?
    • 20 years old, 2M tests
    • “Most widely deployed db on the planet”
    • “To be supported through to 2050”
    • 1GB max row size, 281TB max db; used by Android, Java, Python,
      Firefox, ...

  25. Query and retrieve a DataFrame
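    For instance, with the standard library’s sqlite3 and
    pd.read_sql_query (file, table and column names are illustrative):

      import sqlite3
      import pandas as pd

      con = sqlite3.connect("land_registry.db")  # hypothetical file name
      df = pd.read_sql_query(
          "SELECT county, AVG(price) AS avg_price FROM sales GROUP BY county",
          con,
      )
      con.close()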


  26. SQLite – simple, portable
    • Circa 90MB SQLite process in RAM
    • Happily reads large (many GB–TB) database files
    • “Fast enough” for many applications & multi-user
    • Sane binary format for data interchange (beats CSV?)
    • Read/write, and could be added to source control


  27. Vaex – a newer DataFrame system
    • A lesser-known Pandas alternative
    • Designed for billions of rows, probably in HDF5 (a single big file)
    • Memory-maps to keep RAM usage low
    • Efficient groupby, faster strings, NumPy focused (also ML,
      visualisation and more), lazy expressions


  28. Vaex – a newer DataFrame system
    • Typically you convert data to HDF5 (with helper functions) – see
      the sketch below
    • Once converted, opening the file and running queries is very quick
    • Results look like Pandas DataFrames but aren’t
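    A minimal conversion sketch, assuming hypothetical file names:

      import pandas as pd
      import vaex

      # One-off conversion from an existing Pandas DataFrame
      df = pd.read_pickle("land_registry.pickle")  # hypothetical file
      vdf = vaex.from_pandas(df)
      vdf.export_hdf5("land_registry.hdf5")

      # Later, opening is near-instant – the file is memory-mapped
      vdf = vaex.open("land_registry.hdf5")
      vdf.head(5)  # looks like a Pandas DataFrame, but isn’t one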

  29. Completions per day

  30. Vaex RAM usage on a groupby
    Using the same groupby operations we had with Pandas (which cost
    up to 1GB) – Vaex is far more efficient with RAM than Pandas or
    Dask, and much faster than Dask!
    Sidenote – string operations (e.g. lower, contains) are much faster
    than in Pandas.
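    In Vaex that groupby looks roughly like this – the column name is
    illustrative and the agg API is my recollection of Vaex’s
    interface, so treat it as an assumption:

      import vaex

      vdf = vaex.open("land_registry.hdf5")  # hypothetical file

      # Lazily evaluated, out-of-core, with a modest RAM footprint
      counts = vdf.groupby(by="county", agg=vaex.agg.count())

      # Vaex string expressions (lower, contains, ...) are also fast:
      vdf["county_lower"] = vdf["county"].str.lower()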


  31. Dask and Vaex?
    • Dask – multi-core, multi-machine & multi-file; Vaex – multi-core,
      single machine and typically a single file
    • Dask allows most of Pandas and supports huge NumPy Arrays, a
      Bag, Delayed functions and lots more
    • Both can handle billions of rows and ML; both have readable code


  32. Summary
    • Stick to Dask if you need Pandas functions
    • Think about Vaex for common DataFrame operations
    • See my blog for my classes + many past talks
    • I’d love a postcard if you learned something new!

  33. File sizes

    Format                                   Size on disk (GB)
    CSV (original)                           4.4
    Pickle (protocol 5)                      3.6
    Parquet, uncompressed (faster to load)   2.2
    Parquet + Snappy (small speed penalty)   1.5
    SQLite                                   4.6
    HDF5 (Vaex)                              4.5
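    Roughly how the Parquet rows above would be produced (not the
    talk’s exact commands; file names are illustrative):

      import pandas as pd

      df = pd.read_pickle("land_registry.pickle")  # hypothetical file

      # Uncompressed Parquet trades disk space for faster loads;
      # Snappy shrinks the file with a small read-speed penalty
      df.to_parquet("land_registry_uncompressed.parquet", compression=None)
      df.to_parquet("land_registry_snappy.parquet", compression="snappy")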


  34. SQLite – why use “boring SQL”?
    • We can export from Pandas with df.to_sql
    • Works via SQLAlchemy
    • We can execute arbitrary SQL, e.g. to “CREATE INDEX” – a sketch
      follows
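    A minimal sketch with SQLAlchemy (file, table and column names are
    illustrative):

      import pandas as pd
      from sqlalchemy import create_engine, text

      engine = create_engine("sqlite:///land_registry.db")  # hypothetical file
      df = pd.read_pickle("land_registry.pickle")           # hypothetical file

      df.to_sql("sales", engine, if_exists="replace", index=False)

      # Arbitrary SQL, e.g. an index to speed up later county queries
      with engine.begin() as con:
          con.execute(text("CREATE INDEX idx_county ON sales (county)"))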