
Sprinting Pandas at ODSC 2020

ianozsvald
September 19, 2020


Sprinting Pandas
Sometimes our Python Pandas code feels slow, and sometimes we can't fit enough data into RAM. Based on recent updates for the 2nd edition of Ian's High Performance Python book and on his public training classes, come and learn how to fit more data into RAM (reducing your need for other technologies like Spark), how to quickly compile for significant speedups, how to run in parallel, and which libraries you're missing that unlock additional performance benefits. You'll leave with new techniques to make your DataFrames smaller and many ideas for processing your data faster.
This talk is inspired by Ian's work updating his O'Reilly book High Performance Python for its 2nd edition in 2020. Over 10+ years of evolution the Pandas DataFrame library has gained a huge amount of functionality and is used by millions of Pythonistas, but the most obvious way to solve a task isn't always the fastest or the most RAM-efficient. This talk will help any Pandas user (beginner or beyond) process more data faster, making them more effective at their jobs.

See related talks at https://ianozsvald.com


Transcript

  1.  Introductions – Ian Ozsvald, Interim Chief Data Scientist, 19+ years of experience. Team coaching & public courses – I'm sharing from my Higher Performance Python course and the 2nd edition of the book! By [ian]@ianozsvald[.com] Ian Ozsvald
  2.  Today's goal – Pandas: saving RAM to fit in more data; calculating faster by dropping to NumPy; advice for "being highly performant"; and: has Covid-19 affected UK company registrations?
  3.  Categoricals – over 10x speed-up (on this data)!
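The categorical saving is easy to reproduce on toy data (the column values below are illustrative, not from the talk's dataset): a low-cardinality string column stored as `category` keeps one small lookup table plus compact integer codes.

```python
import numpy as np
import pandas as pd

# A low-cardinality string column: a million rows, three unique values
rng = np.random.default_rng(0)
ser = pd.Series(rng.choice(["ACTIVE", "DISSOLVED", "LIQUIDATION"], size=1_000_000))

cat = ser.astype("category")  # int8 codes + a 3-entry lookup table

obj_bytes = ser.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

Comparisons and groupbys then operate on the small integer codes rather than Python string objects, which is where the speed-up on the slide comes from.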
  4.  Make choices to save RAM – even including the index (previously we ignored it) we still save circa 50% RAM, so you can fit in more rows of data.
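A minimal sketch of those dtype choices, assuming columns whose value ranges permit smaller types (always check your data's range first):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.arange(100_000, dtype="int64"),
    "price": np.linspace(0.0, 1.0, 100_000),  # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast where the value range allows it
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # -> int32 here
df["price"] = df["price"].astype("float32")  # half the bytes, less precision

after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```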
  5.  Drop to NumPy if you know you can. Caveat – Pandas' mean is not np.mean; the fair comparison is np.nanmean, which is slower – see my blog or my PyData Amsterdam 2020 talk for details.
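The caveat can be seen directly on toy data: `Series.mean` skips NaN by default, so once you drop to the underlying array the like-for-like NumPy call is `np.nanmean`, not the faster `ndarray.mean`.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(1).standard_normal(1_000_000))
arr = ser.to_numpy()  # the underlying float64 array

pandas_mean = ser.mean()  # NaN-aware by default (skipna=True)
fair = np.nanmean(arr)    # the fair NumPy comparison
fast = arr.mean()         # fastest, but wrong if NaNs are present

print(pandas_mean, fair, fast)
```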
  6.  NumPy vs Pandas overhead (ser.sum()) – 25 files, 83 functions. Very few NumPy calls! Thanks! https://github.com/ianozsvald/callgraph_james_powell
  7.  Overhead with ser.values.sum() – 18 files, 51 functions. Many fewer Pandas calls (but still a lot!)
  8.  Is Pandas unnecessarily slow? No! https://github.com/pandas-dev/pandas/issues/34773 – the truth is a bit complicated!
  9.  Being highly performant – install the optional (but great!) Pandas dependencies bottleneck and numexpr; investigate https://github.com/ianozsvald/dtype_diet and my ipython_memory_usage (on PyPI/Conda). https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
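As a quick illustration of why numexpr is worth installing: `DataFrame.eval` evaluates expressions through numexpr when it is available (avoiding intermediate temporary arrays) and falls back to the plain Python engine when it is not, giving the same answer either way. Toy data below:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(2).random((100_000, 2)),
                  columns=["a", "b"])

# With numexpr installed this runs in one vectorised pass;
# without it, pandas falls back to the Python engine.
result = df.eval("a * b + 1")

print(result.head())
```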
  10. Pure Python is "slow" and expressive – a deliberately poor function; pretend this is clever but slow!
  11. Parallelise with Dask for multi-core – make plain-Python code multi-core. Note I had to drop the text index column due to the speed hit. Data-copy costs can overwhelm any benefits, so (always) profile & time.
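A single-process sketch of the pattern Dask parallelises: split the DataFrame into partitions, apply the function to each partition, and concatenate. `dask.dataframe` runs each per-partition call on a separate core; here the partitions run serially just to show the shape of the computation, and `slow_feature` is a hypothetical stand-in for your expensive function.

```python
import numpy as np
import pandas as pd

def slow_feature(part: pd.DataFrame) -> pd.Series:
    # Stand-in for an expensive per-row computation
    return part["a"] * 2 + part["b"]

df = pd.DataFrame(np.random.default_rng(3).random((10_000, 2)),
                  columns=["a", "b"])

# Split into 4 row-wise partitions (Dask would do this for us)
n_parts = 4
bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
parts = [df.iloc[s:e] for s, e in zip(bounds[:-1], bounds[1:])]

# Apply per partition, then stitch the results back together
result = pd.concat([slow_feature(p) for p in parts])
```

As the slide warns, the cost of copying data to worker processes can overwhelm the benefit, so profile before and after.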
  12. Being highly performant – mistakes slow us down (PAY ATTENTION!): try nullable Int64 & boolean, and the forthcoming Float64; write tests (unit & end-to-end); the docs save time. Lots more material & my newsletter on my blog IanOzsvald.com.
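The nullable dtypes mentioned above avoid the classic silent int-to-float upcast when missing values appear:

```python
import pandas as pd

# Classic behaviour: an integer column with a missing value becomes float64
classic = pd.Series([1, 2, None])
print(classic.dtype)  # float64 - the ints were upcast

# The nullable extension dtype keeps integers and tracks the missing value
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1
```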
  13. Vaex / Modin – Vaex: memory-mapped & lazy computation, with a new RAM-efficient string dtype. Modin sits on Pandas with a new "algebra" for DataFrames – a drop-in replacement, easy to try. See talks on my blog.
  14. When to try Vaex – you have a huge dataset on a single hard drive; memory-mapped files (HDF5) are best; NumPy types and simpler Pandas-like functions. Investment: a similar but different API to Pandas. https://github.com/vaexio/vaex/issues/968
  15. When to try Dask – you want Pandas but ran out of RAM on one machine; you want multi-machine cluster scalability; you want multi-core support for operations like groupby on parallelisable datasets. Investment: a quick start, then a learning curve.
  16. When to try Modin – you want all of Pandas; you have lots of RAM and many CPUs; you're doing groupby operations on many columns. Investment: easy to try. https://github.com/modin-project/modin/issues/1390
  17. Covid-19's effect on the UK economy? A sharp decline in corporate registrations after lockdown – then an apparent surge (perhaps just backed-up paperwork?). Will the recovery "last"? It's all open data – you can do similar things!
  18. Summary – make it right, then make it fast; think about being performant; see my blog for my classes. I'd love a postcard if you learned something new!