
Sprinting Pandas (London Python)

ianozsvald
October 22, 2020

Transcript

  1. Sprinting Pandas (live in London). Ian Ozsvald, @IanOzsvald, ianozsvald.com. London Python, October 2020.
  2. Introductions
    - Interim Chief Data Scientist
    - 19+ years experience
    - Team coaching & public courses: I’m sharing from my Higher Performance Python course
    - 2nd Edition!
    By [ian]@ianozsvald[.com] Ian Ozsvald
  3. Today’s goal
    - Pandas
      - Saving RAM to fit in more data
      - Calculating faster by dropping to NumPy
    - Advice for “being highly performant”
    - Has Covid 19 affected UK Company Registrations?
  4. Strings are expensive and slow

  5. Categoricals are cheap and fast! Circa 1% of previous memory cost.
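
A minimal sketch of the string versus categorical memory comparison, assuming a made-up low-cardinality column (the `statuses` data is hypothetical, not the talk's dataset). The saving you see depends on how many distinct values the column holds, so treat the printed ratio as indicative only.

```python
import pandas as pd

# Hypothetical data: a text column with only three distinct values.
statuses = pd.Series(["Active", "Dissolved", "Liquidation"] * 100_000)

# A categorical stores each distinct string once, plus a small integer code per row.
statuses_cat = statuses.astype("category")

# deep=True includes the bytes held by the string data itself.
bytes_as_str = statuses.memory_usage(deep=True)
bytes_as_cat = statuses_cat.memory_usage(deep=True)

print(f"strings:     {bytes_as_str:,} bytes")
print(f"categorical: {bytes_as_cat:,} bytes")
print(f"ratio:       {bytes_as_cat / bytes_as_str:.1%}")
```
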
  6. Categoricals “.cat” accessor
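
A short sketch of what the `.cat` accessor exposes, using a hypothetical `colours` column. It shows the lookup table of labels, the per-row integer codes, and a rename that only touches the lookup table.

```python
import pandas as pd

# Hypothetical example column stored as a categorical.
colours = pd.Series(["red", "green", "red", "blue"], dtype="category")

categories = list(colours.cat.categories)  # the distinct labels
codes = colours.cat.codes.tolist()         # one small integer per row

# Renaming edits the labels table rather than rewriting every row.
renamed = colours.cat.rename_categories({"red": "crimson"})

print(categories)
print(codes)
print(renamed.tolist())
```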

  7. Categoricals: over 10x speed-up (on this data)!
  8. Categoricals: index queries faster! Circa 500x speed-up!
  9. float64 is default and a bit expensive
  10. float32 “half-price” and a bit faster
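
A sketch of the “half-price” claim, assuming random data in place of the talk's dataset. It compares the bytes used by the values alone (the index is excluded here) and checks how much precision the downcast costs on this particular data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ser64 = pd.Series(rng.random(1_000_000))  # float64 is the default
ser32 = ser64.astype("float32")

# index=False counts only the values, not the index.
bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)

# float32 keeps roughly 7 significant decimal digits, so values shift slightly.
max_abs_error = float((ser64 - ser32.astype("float64")).abs().max())

print(f"float64: {bytes64:,} bytes, float32: {bytes32:,} bytes")
print(f"largest rounding error on this data: {max_abs_error:.2e}")
```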

  11. Make choices to save RAM. Including the index (previously we ignored it) we still save circa 50% RAM, so you can fit in more rows of data.
  12. “dtype_diet” gives you advice

  13. Drop to NumPy if you know you can. Caveat: Pandas mean is not np mean; the fair comparison is to np nanmean, which is slower. See my blog or PyDataAmsterdam 2020 talk for details.
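
The caveat can be seen with a tiny hypothetical Series containing one missing value. This is only a sketch of the behavioural difference, not a benchmark.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])

pandas_mean = ser.mean()                    # skips NaN by default (skipna=True)
numpy_mean = np.mean(ser.to_numpy())        # NaN propagates into the result
numpy_nanmean = np.nanmean(ser.to_numpy())  # skips NaN, so it matches Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```

So `np.mean` is only a safe substitute when you know the data holds no missing values.
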
  14. NumPy vs Pandas overhead (ser.sum()). 25 files, 83 functions. Very few NumPy calls! Thanks! https://github.com/ianozsvald/callgraph_james_powell
  15. Overhead...

  16. Overhead with ser.values.sum(). 18 files, 51 functions. Many fewer Pandas calls (but still a lot!)
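
A rough way to observe this overhead yourself, assuming a small hypothetical Series where call overhead dominates the arithmetic. Timings vary by machine, Pandas version and whether bottleneck is installed, so no particular ratio is claimed.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(1_000, dtype="float64"))

# Both routes give the same answer on data without missing values.
total_pandas = ser.sum()
total_numpy = ser.values.sum()

# Time many repeated calls so the per-call overhead becomes visible.
t_pandas = timeit.timeit(ser.sum, number=2_000)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=2_000)

print(f"ser.sum():        {t_pandas:.4f}s for 2,000 calls")
print(f"ser.values.sum(): {t_numpy:.4f}s for 2,000 calls")
```
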
  17. Is Pandas unnecessarily slow? NO! The truth is a bit complicated: https://github.com/pandas-dev/pandas/issues/34773
  18. Being highly performant
    - Install optional (but great!) Pandas dependencies
      - bottleneck
      - numexpr
    - Investigate https://github.com/ianozsvald/dtype_diet
    - Investigate my ipython_memory_usage (PyPI/Conda)
    https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
  19. Pure Python is “slow” and expressive. Deliberately poor function: pretend this is clever but slow!
  20. Compile to Numba judiciously. Near 10x speed-up!
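
A hedged sketch of the pattern, not the function from the slides: `sum_of_squares` is a hypothetical stand-in for a loop-heavy function. The fallback `njit` is an assumption added so that the example still runs (uncompiled) where Numba is not installed; any speed-up depends on the function and the data.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python if Numba is absent.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # A plain loop: slow in pure Python, a good candidate for compilation.
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(10_000, dtype=np.float64)
result = sum_of_squares(data)  # with Numba, the first call includes compile time
print(result)
```

Time the second and later calls when comparing against pure Python, since the first call pays the compilation cost.
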
  21. Parallelise with Dask for multi-core
    - Make plain-Python code multi-core
    - Note I had to drop text index column due to speed-hit
    - Data copy cost can overwhelm any benefits so (always) profile & time
  22. Being highly performant
    - Mistakes slow us down (PAY ATTENTION!)
      - Try nullable Int64 & boolean, forthcoming Float64
      - Write tests (unit & end-to-end)
      - Lots more material & my newsletter on my blog IanOzsvald.com
      - Time saving docs:
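
A small sketch of the nullable dtypes mentioned above, with hypothetical values. It contrasts the nullable `Int64` and `boolean` extension dtypes with the default inference, which converts an integer column containing a gap to float64.

```python
import pandas as pd

# Nullable extension dtypes keep integers as integers even with missing values.
counts = pd.Series([1, 2, None, 4], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")

total = counts.sum()                   # missing values are skipped
n_missing = int(counts.isna().sum())

# Without an explicit dtype, the gap forces a float64 column.
inferred = pd.Series([1, 2, None, 4])

print(counts.dtype, flags.dtype, inferred.dtype)
print(total, n_missing)
```
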
  23. Vaex / Modin
    - Memory mapped & lazy computation
      - New string dtype (RAM efficient)
    - Modin sits on Pandas, new “algebra” for dfs
      - Drop in replacement, easy to try
    See talks on my blog:
  24. When to try Vaex
    - You have a huge dataset on a single harddrive
    - Memory mapped files (HDF5) are best
    - Numpy types and simpler Pandas-like functions
    - Investment: similar but different API to Pandas
    https://github.com/vaexio/vaex/issues/968
  25. When to try Dask
    - You want Pandas but ran out of RAM on 1 machine
    - You want multi-machine cluster scalability
    - You want multi-core support for operations like groupby on parallelisable datasets
    - Investment: quick start then a learning curve
  26. When to try Modin
    - You want all of Pandas
    - You have lots of RAM and many CPUs
    - You’re doing groupby operations on many columns
    - Investment: easy to try
    https://github.com/modin-project/modin/issues/1390
  27. Covid 19’s effect on UK Economy? Sharp decline in corporate registration after Lockdown, then apparent surge (perhaps just backed-up paperwork?). Will the recovery “last”? All open data, you can do similar things!
  28. Summary
    - Make it right then make it fast
    - Think about being performant
    - See blog for my classes
    - I’d love a postcard if you learned something new!
  29. Be faster by learning new approaches