Making Pandas Fly (PyDataAmsterdam 2020)

ianozsvald
June 18, 2020

Another variant of my recent talks. This one focuses on making Pandas faster by digging into NumPy, using my `dtype_diet` memory-saving tool, and understanding what's going on in some of Pandas' low-level functions. See https://ianozsvald.com/ for more.

Transcript

  1. Making Pandas Fly (live from London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataAmsterdam 2020


  2. Interim Chief Data Scientist

    19+ years experience

    Team coaching & public courses
    – Higher Performance!
    Introductions
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition!


  3. All volunteers – go say thank you in #lobby

    NumFOCUS benefits us all
    Thank the organisers!


  4. Pandas
    – Saving RAM
    – Calculating faster by dropping to Numpy

    Advice for “being highly performant”
    Today’s goal
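
A minimal sketch of the "saving RAM" goal, assuming a toy DataFrame invented here for illustration (the column names and values are not from the talk). It shrinks an integer column with `pd.to_numeric(..., downcast=...)` and converts a repetitive string column to `category`, then compares `memory_usage(deep=True)` before and after. This is plain Pandas, not the `dtype_diet` tool itself.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a small-range integer column and a
# low-cardinality string column.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "count": rng.integers(0, 100, size=n),  # int64 by default
    "city": rng.choice(["London", "Amsterdam", "Paris"], size=n),  # strings
})

before = df.memory_usage(deep=True).sum()

slim = df.copy()
# Values 0..99 fit in the smallest unsigned integer type.
slim["count"] = pd.to_numeric(slim["count"], downcast="unsigned")
# Three distinct strings repeated many times suit a categorical.
slim["city"] = slim["city"].astype("category")

after = slim.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Downcasting is only safe when the value range is known to stay within the smaller type, so it is worth checking the data before and after the conversion.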


  5. Go to Notebook for demo
    Demo

  6. NumPy vs Pandas overhead (ser.sum())
    25 files, 83 functions
    Very few NumPy calls!
    Thanks!

  7. Overhead...

  8. Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas calls (but still a lot!)
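
The `ser.sum()` versus `ser.values.sum()` comparison can be tried with a rough sketch like the one below. The Series size and repeat count are arbitrary choices made here, and the timings will vary by machine and Pandas version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary small Series: with little data, per-call overhead dominates.
ser = pd.Series(np.arange(1_000, dtype=np.float64))

pandas_total = ser.sum()         # goes through Pandas' own machinery
numpy_total = ser.values.sum()   # calls sum on the underlying NumPy array

t_pandas = timeit.timeit(ser.sum, number=1_000)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=1_000)

print(f"ser.sum():        {t_pandas:.4f}s for 1,000 calls")
print(f"ser.values.sum(): {t_numpy:.4f}s for 1,000 calls")
```

The two calls are not interchangeable when data is missing: `ser.sum()` skips NaN by default, while the NumPy sum propagates NaN.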

  9. Is Pandas unnecessarily slow?
    Missing? The bottleneck library! This certainly helps

  10. Is Pandas unnecessarily slow – NO!
    https://github.com/pandas-dev/pandas/issues/34773 - the truth is a bit complicated!


  11. Install optional (but great!) Pandas dependencies
    – bottleneck
    – numexpr

    Investigate https://github.com/ianozsvald/dtype_diet

    Investigate my ipython_memory_usage (PyPI/Conda)
    Being highly performant
    https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
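
One way to check whether the optional dependencies are available, as a sketch: the `is_installed` helper is a name made up here, while `compute.use_bottleneck` and `compute.use_numexpr` are existing Pandas options that report whether Pandas will use each library.

```python
import importlib.util

import pandas as pd


def is_installed(name: str) -> bool:
    """Hypothetical helper: True if the package can be imported."""
    return importlib.util.find_spec(name) is not None


status = {pkg: is_installed(pkg) for pkg in ("bottleneck", "numexpr")}
for pkg, found in status.items():
    print(f"{pkg}: {'installed' if found else 'missing'}")

# Pandas only uses these libraries when they are installed and enabled.
print("use_bottleneck:", pd.get_option("compute.use_bottleneck"))
print("use_numexpr:", pd.get_option("compute.use_numexpr"))
```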


  12. Mistakes slow us down (PAY ATTENTION!)
    – Try nullable Int64 & boolean, forthcoming Float64
    – Write tests (unit & end-to-end)
    – Codify your assumptions – bulwark library
    – https://github.com/ianozsvald/notes_to_self
    Being highly performant
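
A small sketch of the nullable dtypes and of codifying an assumption. The data and the `check_assumptions` function are invented here, and plain `assert` statements stand in for the bulwark library rather than using its API.

```python
import pandas as pd

# Nullable dtypes keep integers and booleans intact when values are missing.
ages = pd.Series([31, None, 47], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")

# Without the nullable dtype, the missing value forces a cast to float64.
legacy = pd.Series([31, None, 47])

print(ages.dtype, flags.dtype, legacy.dtype)


def check_assumptions(ser: pd.Series) -> pd.Series:
    """Hypothetical check: fail loudly if the data is not what we expect."""
    assert ser.dtype == "Int64", f"unexpected dtype {ser.dtype}"
    assert ser.dropna().ge(0).all(), "negative values found"
    return ser


checked = check_assumptions(ages)
print(checked.sum())  # missing values are skipped by default
```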


  13. Make it right then make it fast

    Think about being performant

    See blog for my classes

    I’d love a postcard if you learned something new!
    Summary

  14. Covid 19 UK economic impact?