On the Delivery of Data Science Projects

Talk at PyDataCambridge 2019-05 on the two halves of Data Science delivery: derisking the business side of a project and improving its software engineering side. Contains observations, issues, new tools and process ideas you can take back to your teams.

ianozsvald

May 29, 2019

Transcript

  1. On the Delivery of Data Science Projects
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataCambridge 2019-05


  2. Introductions

    Interim Chief Data Scientist

    19+ years experience

    Quickly build strategic data science plans

    Team coaching & public courses

    By [ian]@ianozsvald[.com] Ian Ozsvald


  3. Data Science shows value when...

    Numerate management ask good data-driven questions

    You have suitable data

    Achievable outcomes are well defined

    Change is enabled by these projects


  4. Common failure points

    Unclear true business need

    No visibility on the data (and its quality)

    Blind belief in 100% success

    No project specification – lacking shared agreement


  5. Checking business need

    What’s the driver? Is there a fire under it?

    Joonatan’s example from PyDataLT – OCR

    Cost/benefit estimate accepting uncertainty
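
One way to make a cost/benefit estimate that accepts uncertainty is to express each guess as a range and simulate many possible outcomes. The sketch below is an illustration only, not something shown in the talk: the function names, the uniform ranges, the success probability and all figures are hypothetical assumptions.

```python
import random


def simulate_net_benefit(benefit_range, cost_range, success_probability,
                         n_trials=10_000, seed=0):
    """Return sorted simulated net benefits for a hypothetical project.

    All inputs are rough guesses; ranges are (low, high) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    outcomes = []
    for _ in range(n_trials):
        cost = rng.uniform(*cost_range)
        # The benefit only arrives if the project succeeds;
        # the cost is spent either way.
        succeeded = rng.random() < success_probability
        benefit = rng.uniform(*benefit_range) if succeeded else 0.0
        outcomes.append(benefit - cost)
    return sorted(outcomes)


def summarise(outcomes):
    """Report the median, a pessimistic 10th percentile and the chance of a loss."""
    n = len(outcomes)
    return {
        "median": outcomes[n // 2],
        "p10": outcomes[n // 10],
        "probability_of_loss": sum(1 for o in outcomes if o < 0) / n,
    }


# Hypothetical figures, for illustration only.
outcomes = simulate_net_benefit(
    benefit_range=(50_000, 200_000),
    cost_range=(30_000, 60_000),
    success_probability=0.6,
)
print(summarise(outcomes))
```

Reporting a pessimistic percentile and the probability of a loss alongside the median keeps the uncertainty visible when the estimate is discussed with the business.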


  6. You need a Project Specification

    States a clearly defined problem

    Guesses at unknowns (and project torpedoes!)

    Proposed milestones and Gold Standard/metrics

    Clear “definition of done”

    Story from 10 years back


  7. “Data Story”

    Do you understand your data?
    – What’s good and bad?
    – What relationships exist?

    Build an exportable Notebook as an HTML artefact

    Read Bertil’s piece on Medium


  8. Continuous delivery to clients

    Easy first deliveries – reports

    Get to a minimal working delivery as soon as possible

    Two tracks? R&D and client integration?


  9. Standardised Approaches

    Reduce mental load for common decisions
    – Cookiecutter data-science
    – Watermark
    – Pandas-profiling / edaviz
    – Anaconda

  10. Code quality
    Attrib: https://devrant.com/rants/347670/code-quality-as-measured-in-wtfs-minute


  11. Continuously improving code quality

    Encode assumptions using asserts (example - yesterday’s client issue)

    Refactor to modules

    Add unit-tests

    Diagnostics e.g. yellowbrick for sklearn
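
As a minimal sketch of encoding assumptions with asserts and then covering them with unit tests, the example below uses only the Python standard library. The function `mean_order_value` and its data are hypothetical and are not taken from the talk or from the client issue it mentions.

```python
import math
import unittest


def mean_order_value(order_values):
    """Return the mean of a list of order values (hypothetical example)."""
    # Assumption 1: we were given at least one order.
    assert len(order_values) > 0, "expected at least one order"
    # Assumption 2: every value is a finite, non-negative number.
    for value in order_values:
        assert math.isfinite(value), f"non-finite order value: {value!r}"
        assert value >= 0, f"negative order value: {value!r}"
    return sum(order_values) / len(order_values)


class TestMeanOrderValue(unittest.TestCase):
    def test_typical_values(self):
        self.assertAlmostEqual(mean_order_value([10.0, 20.0, 30.0]), 20.0)

    def test_negative_value_is_rejected(self):
        with self.assertRaises(AssertionError):
            mean_order_value([10.0, -5.0])

    def test_missing_value_is_rejected(self):
        with self.assertRaises(AssertionError):
            mean_order_value([10.0, float("nan")])
```

Saved in a module, the tests can be run with `python -m unittest`. Note that Python skips `assert` statements when started with the `-O` flag, so asserts suit development-time checks of assumptions rather than validation that must always run.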


  12. Contributing to Open Source gets you

    Exposure to new processes

    Enforced clear communication

    Balanced consumption & contribution

    You’re more visible & valuable


  13. High performance teams

    High test coverage

    Easy roll out & roll back

    Culture of constructive criticism

  14. Resources

    “Successfully Delivering Data Science Projects” & “Software Engineering for Data Scientists” - early July


  15. Summary

    Derisk early and often

    Communicate visually, all the time

    Strive for continuous improvement

    Join my thoughts+jobs list for tips and my training list

    Attend PyDataLondon 2019 July 12-14?