Deftly Delivering Data Science Projects

ianozsvald
November 11, 2018

Battle-tested observations on how to improve the likelihood that your data science project goes smoothly and is delivered correctly. Given at the inaugural PyDataPrague.

Transcript

  1. [email protected] @IanOzsvald[.com] PyDataPrague 2018-10
    Introductions
    • I’m an engineering data scientist
    • 15+ years experience
    • Team coaching
    • Strategic planning
    • Training
  2. Problems delivering DS projects
    • “Make us more [money|signups|...]” - desire for magic
    • Desire over actual need – vanity projects
    • Lack of technical leadership – poor specs
    • Bad data – lies, mistakes, confusion
    • Lack of client buy-in
  3. Good DS projects
    • Numerate management asking good data driven questions
    • Suitable data
    • Well defined outcomes that are agreed to be achievable
  4. Project Specification
    • You need a clearly defined problem
    • Where are the unknowns?
    • Known unknowns
    • What might kill the project?
    • Propose milestones
    • Where’s your Gold Standard data set?
    • What’s your “definition of done”?
    • Minimal results and great results
    • Appropriate metrics to communicate results
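One way to make the "definition of done" on this slide concrete is to score model output against the Gold Standard set and compare the score with thresholds agreed in advance. The sketch below is only an illustration under stated assumptions: the binary labels, the choice of recall as the headline metric, and the `MINIMAL_RECALL` and `GREAT_RECALL` values are all hypothetical and do not come from the talk.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    assert len(gold) == len(predicted), "gold and predicted must align row for row"
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical thresholds, agreed with the client before modelling starts.
MINIMAL_RECALL = 0.60
GREAT_RECALL = 0.85


def grade_result(recall: float) -> str:
    """Map a recall score onto the agreed outcome bands."""
    if recall >= GREAT_RECALL:
        return "great"
    if recall >= MINIMAL_RECALL:
        return "minimal"
    return "not done"


# Toy Gold Standard labels and model output, invented for this example.
gold_labels = [1, 1, 1, 0, 0, 1, 0, 1]
model_output = [1, 1, 0, 0, 1, 1, 0, 1]

precision, recall = precision_recall(gold_labels, model_output)
print(f"precision={precision:.2f} recall={recall:.2f} -> {grade_result(recall)}")
```

Writing the bands down as code (or simply as numbers in the specification) gives both sides the same test for whether a milestone has been met.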
  5. “Data story”
    • Do you understand your data?
    • Explain your data – what does it say?
    • What’s good and what’s bad?
    • What are the relationships?
    • Where is the signal in the data?
    • Export your Notebook as an HTML artefact
    • Data Story proposed by Bertil (Medium)
  6. Standardised approaches
    • Reduce the mental load for common decisions
    • Cookiecutter (folders)
    • pandas-profiling
    • watermark
    • Anaconda
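The watermark extension named on this slide records interpreter and package versions inside a Notebook so that results can be tied to an environment. The sketch below does not use watermark's own API; it is a standard-library stand-in showing the same idea, and the helper name `environment_report` and the package list are hypothetical.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter and package versions as a dict of strings."""
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Looks up the installed distribution's version without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Example package names; swap in whatever the project actually depends on.
for key, value in environment_report(["pandas", "scikit-learn"]).items():
    print(f"{key}: {value}")
```

Putting a cell like this at the top of every Notebook is one small standard decision that nobody on the team then has to think about again.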
  7. Improving code quality
    • Encode assumptions with asserts
    • Refactor to modules
    • Add unit-tests
    • Visual reports with analyst interpretations
    • Diagnostics e.g. yellowbrick for sklearn
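The first three bullets on this slide can be sketched together: a small function lifted out of a Notebook, with its data assumptions written as asserts and a unit test around it. The function `clean_ages`, the `"age"` field and the 0 to 120 range are hypothetical, chosen only to illustrate the pattern.

```python
import unittest


def clean_ages(rows):
    """Return the rows that have an age, after checking assumptions about the input."""
    # Assumption: every row carries an "age" key, even if its value is missing (None).
    assert all("age" in row for row in rows), "every row needs an 'age' field"
    cleaned = [row for row in rows if row["age"] is not None]
    # Assumption: the remaining ages are plausible human ages.
    assert all(0 <= row["age"] <= 120 for row in cleaned), "age outside 0..120"
    return cleaned


class TestCleanAges(unittest.TestCase):
    def test_drops_missing_ages(self):
        rows = [{"age": 34}, {"age": None}, {"age": 58}]
        self.assertEqual(clean_ages(rows), [{"age": 34}, {"age": 58}])

    def test_implausible_age_fails_loudly(self):
        with self.assertRaises(AssertionError):
            clean_ages([{"age": 430}])


# Run the tests in-process, which also works from inside a Notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleanAges)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that Python skips `assert` statements when run with the `-O` flag, so asserts suit analysis code and development checks; validation that must always run in production is better written as explicit exceptions.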
  8. Improving project quality
    • Code reviews (with a check-list, PEP8)
    • nbdime for diffs
    • “Data Defences” - regular critiques by colleagues on your project
  9. Continuous delivery to client
    • Early deliveries – reports
    • Get to a minimal working delivery as soon as possible (UI? App? Reports?)
    • Consider papermill for deployable Notebooks
  10. Summary
    • Honesty throughout your work
    • Strive to keep improving your technique
    • Keep communicating your results