
Building Successful Data Science Projects


This was given at PyDataLondon 2022 and a previous version was the opening keynote for PyDataBudapest 2022.

Abstract:

Your data science projects haven't worked out so well - maybe you didn't have a plan, you suffered from surprising unknowns, you couldn't deliver what was promised or you just failed to ship. I've been there over the last 20 years.

I'll share some painful past experiences and explain how you can avoid these common failures. I've just shipped a solution worth $1 million for a client by following my own advice - you can do the same.

We'll talk about estimating the value of your choices, getting written agreement for a project plan, identifying risks and derisking them, validating results with the business, shipping quickly, identifying bottlenecks and iterating our way to success. You'll leave the talk with new ideas to improve the success of your team and your personal career.

Given at PyDataLondon 2022 to 100+ people: https://london2022.pydata.org/cfp/talk/MKUTFH/

ianozsvald

June 19, 2022

Transcript

  1. Building Successful Data Science Projects
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyData London 2022 Talk

  2. NEW projects – pains & gains
    By [ian]@ianozsvald[.com] Ian Ozsvald



  3. Introductions

    Interim Chief Data Scientist

    20+ years experience

    Coaching & public courses
    – I’m sharing from my Successful Data Science Projects course

    2nd Edition!

  4. Credit – South Park and the Underpants Gnomes

  5. Get Data

    ML! DNN! Big Data!

  6. We fail a lot (anyone else?)

    What if it isn’t about the technology?

    What if it is down to human factors?


  7. Story – Automated price comparison

    Find “cheapest TV” on other sites (famous at the time)

    We agreed the specification verbally

    Sklearn, BoW model, gold validation set – all sensible

    We made great progress - what could go wrong?

    Best price not on Amazon...


  8. Story – Automated price comparison

    The specification changed despite having agreement

    They held back the “hard data” so I could have an easy start – critical facts missing!

    This is not what we discussed, nowhere to pivot to


  9. Solution – write a specification

    What problem needs solving? What examples do you have? What is it worth to the business?

    How would an expert solve this? Do they solve it?

    Get the bosses to agree to your specification

  10. Specification:
    What’s the $value?



  11. Story – insurance and Big DS Projects

    Boss in new department wanted $$$ Big Success

    “Success” was sold to business departments, then the Data Science team were involved after agreement

    Sometimes no data in DB (just paper)

    But – this was also a problem-rich environment


  12. Solution – talk to the client first

    Your client knows more than you do

    What do they need?

    What’s feasible with the data?

    What’s it $worth?

  13. Data Maturity Model
    Reference: https://www.svds.com/thought-leadership/data-maturity-assessment/
    Growing up is Hard!


  14. Story – insurance & low client trust

    ML Project nearly finished...

    Client didn’t trust “ML”

    Colleague drew many diagrams...

    https://stackoverflow.com/questions/40155128/plot-trees-for-a-random-forest-in-python-with-scikit-learn


  15. Story – SHAP to explain predictions

    We found data errors → iteration → build confidence

    The client ultimately agreed “this is useful, I want it” by diagnosing cases they knew personally

    https://towardsdatascience.com/using-model-interpretation-with-shap-to-understand-what-happened-in-the-titanic-1dd42ef41888


  16. Story – VC and Dirty Data

    “We want to investigate our data, please do some magic”

    Agreed a derisking project, identified many issues, proposed next project – a sane start

    Later we built models to prioritise interesting companies

    The data really was awful!


  17. Story – Always have a baseline

    Insurance – the “mean model” beats Random Forest – huge embarrassment

    VC – My Logistic Regression beat the human rules (encoded as derived sklearn estimators)

    https://scikit-learn.org/stable/developers/develop.html
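
The baseline habit from this slide can be sketched with the Python standard library alone. The cost figures, the `mean_absolute_error` helper and the stand-in "model" predictions below are all hypothetical and invented for illustration; none come from the talk. The only point is that a candidate model is scored against a predict-the-mean baseline on the same held-out data before anyone celebrates.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical costs (illustrative numbers only, not from the talk).
train_costs = [100, 120, 90, 110, 105]
test_costs = [95, 115, 100, 108]

# Baseline "mean model": always predict the mean of the training data.
baseline_value = mean(train_costs)
baseline_preds = [baseline_value] * len(test_costs)

# Stand-in for the output of a more complex model (also hypothetical).
model_preds = [130, 80, 125, 90]

baseline_mae = mean_absolute_error(test_costs, baseline_preds)
model_mae = mean_absolute_error(test_costs, model_preds)

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
if model_mae >= baseline_mae:
    print("Model does not beat the mean baseline: investigate before shipping")
```

In scikit-learn the same role can be played by `DummyRegressor(strategy="mean")`, which lets the baseline be evaluated through the same pipeline as the real model.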


  18. Story – VC and “nobody to check the results”

    10/10,000 chance of success

    The junior associates have their own methods

    Client suggested “more advanced methods” but Logistic Regression and GBMs very good (accepting limited signal!)

    My contract got terminated! Problem-poor environment


  19. Solution – get clients involved early

    Deliver early and often to client

    Give them enough so they look cool

    Use simplest models (e.g. linear), make lots of pictures, diagnose problems, figure out the value to them


  20. Story – insurance and “no ML, please write SQL”

    Successful Random Forest model for insurance

    “We can’t deploy Python, please write SQL” - IT

    Colleague had to hand-write SQL rules from RF model – did it ever actually work? Was it right?
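
The talk does not show how the SQL was written. As a rough sketch of an alternative to hand-writing, a single decision tree can be rendered mechanically as a nested SQL `CASE` expression and then checked against the Python predictions, which speaks to the "was it right?" question. The nested-dict tree format, the column names `claim_amount` and `num_visits`, and the `claims` table are all hypothetical. A real Random Forest has many trees whose outputs are combined, which this sketch does not cover.

```python
import sqlite3

# Hypothetical tree: internal nodes test `feature <= threshold`,
# leaves hold the predicted value.
tree = {
    "feature": "claim_amount", "threshold": 1000,
    "left": {"value": 0},
    "right": {
        "feature": "num_visits", "threshold": 3,
        "left": {"value": 0},
        "right": {"value": 1},
    },
}


def predict(node, row):
    """Walk the tree for one row (a dict of feature values)."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


def tree_to_sql(node):
    """Render the tree as a nested SQL CASE expression."""
    if "value" in node:
        return str(node["value"])
    return (
        f"CASE WHEN {node['feature']} <= {node['threshold']} "
        f"THEN {tree_to_sql(node['left'])} "
        f"ELSE {tree_to_sql(node['right'])} END"
    )


sql_expr = tree_to_sql(tree)
print(sql_expr)

# Check the generated SQL against the Python predictions on example rows.
rows = [
    {"claim_amount": 500, "num_visits": 5},
    {"claim_amount": 2000, "num_visits": 2},
    {"claim_amount": 2000, "num_visits": 7},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_amount REAL, num_visits INTEGER)")
conn.executemany("INSERT INTO claims VALUES (:claim_amount, :num_visits)", rows)
sql_preds = [r[0] for r in conn.execute(f"SELECT {sql_expr} FROM claims ORDER BY rowid")]
py_preds = [predict(tree, row) for row in rows]
assert sql_preds == py_preds, "SQL rules disagree with the model"
```

One caveat with this approach: SQL comparisons against `NULL` fall through to the `ELSE` branch, so missing values would need the same treatment on both sides before the two sets of predictions can be expected to match.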


  21. Solution – plan for deployment early

    Operationalization is often hard (especially for v1)

    In your specification think about the client, their needs and how to deploy so they can use the tools

    Sit with the client – how do they work right now?

    A corporate might take 6 months to provision a machine


  22. Story – Making $1M $2M for my client

    Finding insurance fraud and overbilling – really hard!

    Prior fraud project 6 months old & no results

    We derisked projects early – 2+ months of discussion

    Found positive examples, assigned $value, prioritised

    Agreed a delivery schedule for whatever we learn
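
The slide does not say how value was assigned or how candidates were prioritised. One simple way to do it, sketched here under stated assumptions, is to rank candidate projects by estimated value times a rough probability of success, divided by effort. The `Candidate` class, the project names and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    est_value: float   # estimated $ value if the project works (a guess)
    p_success: float   # rough probability of success, between 0 and 1
    weeks: float       # estimated effort in weeks

    @property
    def expected_value_per_week(self) -> float:
        return self.est_value * self.p_success / self.weeks


# Hypothetical candidates (illustrative only).
candidates = [
    Candidate("duplicate invoices", 400_000, 0.8, 4),
    Candidate("anomaly model", 1_000_000, 0.2, 12),
    Candidate("billing percentiles", 300_000, 0.6, 3),
]

ranked = sorted(candidates, key=lambda c: c.expected_value_per_week, reverse=True)
for c in ranked:
    print(f"{c.name:22s} ${c.expected_value_per_week:>10,.0f} per week")
```

With these made-up inputs the large but risky model lands last, which is the kind of conversation the ranking is meant to provoke with the client rather than settle.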


  23. Story – Making $1M $2M for my client

    Mix of better SQL ($0.4M), counting ($0.8M), percentiles ($0.4M), lots of discussion, lots of SQL (problem rich!)

    Isolation Forest + GBM good but rules better for client

    Boss’ boss writing their own BI as they’re so inspired

    New team begging us to start with them
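
The slide credits simple counting and percentile rules over the models, without showing them. A minimal sketch of what a percentile rule might look like is below: flag any provider whose billed amount sits above the 95th percentile of its peers. The provider names, the amounts and the 95% cut-off are all hypothetical choices, not details from the talk.

```python
from statistics import quantiles

# Hypothetical billed amounts per provider for one procedure.
billed = {
    "provider_a": 100, "provider_b": 105, "provider_c": 98,
    "provider_d": 110, "provider_e": 102, "provider_f": 95,
    "provider_g": 107, "provider_h": 101, "provider_i": 99,
    "provider_j": 400,
}

# n=20 gives cut points at 5%, 10%, ..., 95%; the last one is the 95th percentile.
cut_points = quantiles(list(billed.values()), n=20, method="inclusive")
p95 = cut_points[-1]

# Flag providers billing above the 95th percentile for human review.
flagged = sorted(name for name, amount in billed.items() if amount > p95)
print(f"95th percentile: {p95}")
print(f"flagged for review: {flagged}")
```

A rule like this is easy to explain to a fraud team and easy to express in SQL, which may be part of why rules suited the client better than the Isolation Forest and GBM.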


  24. Story – “Fixing business” takes time

    New problem!

    No bandwidth in Fraud team for new results – we swamped them (in a good way)!

    Getting an organisation to move up the Data Maturity Model is hard and just takes time

  25. A colleague’s view


  26. For new projects check you

    Have a clear written specification

    Know the business need & value → metrics

    Know the risks – does it make sense

    Know how to deploy (and deploy early)

  27. Find a Good Puzzle

    Solve the Puzzle


  28. Summary

    Identify failure, iterate to success

    See blog for past talks

    I’d love a postcard if you learned something new!


  29. Story – automated contract recruitment and “new superpowers”

    Need “the face fits” and “relevant skills”

    Similarity tool for company and skills from PDF text

    Client annotated data & scored results from week 1

    “You’ve given us a superpower, we phone the top 10 results, sign a contract, then we’re done for the day”
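
The talk does not say how the similarity tool worked. As an assumption-laden sketch, a bag-of-words cosine similarity over already-extracted text is one simple way to rank candidates against a role. The tokeniser, the job text and the three CV snippets below are hypothetical, and PDF text extraction is out of scope here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count word-like tokens."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical role description and CV snippets.
job = "contract python developer with sql and pandas experience"
cvs = {
    "cv_1": "python developer, five years of pandas and sql",
    "cv_2": "java engineer with spring experience",
    "cv_3": "project manager for retail contract work",
}

job_vec = bag_of_words(job)
scores = {name: cosine_similarity(job_vec, bag_of_words(text)) for name, text in cvs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

A ranked list like this is something a client can annotate and score from week 1, which matches the feedback loop described on the slide.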


  30. Story – recruitment & deployment

    Initial deployment – CSV for similarity results, then Jupyter Notebooks, then microservices + Flask with black-box tests (now I’d use FastAPI + Streamlit or Voila)

    Boss sat next to me and we typed examples together

    Tests caught MongoDB corruption and MySQL “3 byte unicode”
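
The tests themselves are not shown in the talk. Assuming the “3 byte unicode” issue refers to MySQL's legacy `utf8` character set (`utf8mb3`), which stores at most 3 bytes per character, a check along the following lines could pick out test inputs that such a column cannot hold. The helper name and the example strings are hypothetical.

```python
def chars_needing_four_bytes(text):
    """Return the characters whose UTF-8 encoding is 4 bytes long.

    These fall outside the Basic Multilingual Plane (for example most
    emoji) and do not fit in a 3-byte-per-character column.
    """
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# Hypothetical inputs a black-box test might push through the service.
examples = [
    "plain ascii",
    "caf\u00e9",                # 2-byte character
    "\u65e5\u672c\u8a9e",       # 3-byte characters
    "rocket \U0001F680",        # 4-byte character (emoji)
]

risky = [s for s in examples if chars_needing_four_bytes(s)]
print(f"{len(risky)} of {len(examples)} examples need 4-byte UTF-8 support")
```

In a black-box test, strings like these would be sent through the deployed service and read back, with the test failing if the stored text differs from what was sent.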


  31. You’re in charge

    This is your career – you’re in charge

    Identify possible problems

    Make sensible choices

    (accept some failures!)

    Enjoy yourself


  32. A checklist for you

    You should write your own specification

    Identify risks, talk to the experts, get good examples

    Quickly deliver results & iterate

    Deploy often, deploy early (be embarrassed and learn)