Building Successful Data Science Projects

This was given at PyDataLondon 2022 and a previous version was the opening keynote for PyDataBudapest 2022.

Abstract:

Your data science projects haven't worked out so well - maybe you didn't have a plan, you suffered from surprising unknowns, you couldn't deliver what was promised or you just failed to ship. I've been there over the last 20 years.

I'll share some painful past experiences and explain how you can avoid these common failures. I've just shipped a solution worth $1 million for a client by following my own advice - you can do the same.

We'll talk about estimating the value of your choices, getting written agreement for a project plan, identifying risks and derisking them, validating results with the business, shipping quickly, identifying bottlenecks and iterating our way to success. You'll leave the talk with new ideas to improve the success of your team and your personal career.

Given at PyDataLondon 2022 to 100+ people: https://london2022.pydata.org/cfp/talk/MKUTFH/

ianozsvald

June 19, 2022

Transcript

  1. Introductions
     - Interim Chief Data Scientist
     - 20+ years experience
     - Coaching & public courses – I’m sharing from my Successful Data Science Projects course
     - 2nd Edition!
     By [ian]@ianozsvald[.com] Ian Ozsvald
  2. We fail a lot (anyone else?)
     What if it isn’t about the technology? What if it is down to human factors?
  3. Story – Automated price comparison
     - Find “cheapest TV” on other sites (famous at the time)
     - We agreed the specification verbally
     - Sklearn, BoW model, gold validation set – all sensible
     - We made great progress - what could go wrong?
     Best price not on Amazon...
  4. Story – Automated price comparison
     - The specification changed despite having agreement
     - They held back the “hard data” so I could have an easy start – critical facts missing!
     - This is not what we discussed, nowhere to pivot to
  5. Solution – write a specification
     - What problem needs solving? What examples do you have? What is it worth to the business?
     - How would an expert solve this? Do they solve it?
     - Get the bosses to agree to your specification
  6. Story – insurance and Big DS Projects
     - Boss in new department wanted $$$ Big Success
     - “Success” was sold to business departments, then the Data Science team were involved after agreement
     - Sometimes no data in the DB (just paper)
     - But – this was also a problem-rich environment
  7. Solution – talk to the client first
     - Your client knows more than you do
     - What do they need?
     - What’s feasible with the data?
     - What’s it $worth?
  8. Story – insurance & low client trust
     - ML Project nearly finished...
     - Client didn’t trust “ML”
     - Colleague drew many diagrams...
     https://stackoverflow.com/questions/40155128/plot-trees-for-a-random-forest-in-python-with-scikit-learn
  9. Story – SHAP to explain predictions
     - We found data errors → iteration → build confidence
     - The client ultimately agreed “this is useful, I want it” by diagnosing cases they knew personally
     https://towardsdatascience.com/using-model-interpretation-with-shap-to-understand-what-happened-in-the-titanic-1dd42ef41888
  10. Story – VC and Dirty Data
      - “We want to investigate our data, please do some magic”
      - Agreed a derisking project, identified many issues, proposed next project – a sane start
      - Later we built models to prioritise interesting companies
      - The data really was awful!
  11. Story – Always have a baseline
      - Insurance – the “mean model” beats Random Forest – huge embarrassment
      - VC – My Logistic Regression beat the human rules (encoded as derived sklearn estimators)
      https://scikit-learn.org/stable/developers/develop.html
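The baseline point above can be sketched in a few lines. This is a minimal illustration only, not the code from the talk: all numbers are invented, and the "complex model" predictions are simply made up to show the comparison. In scikit-learn the mean model is available as `DummyRegressor(strategy="mean")`; the standard-library version below just keeps the sketch dependency-free.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_baseline(train_targets, n_predictions):
    """The "mean model": always predict the training-set mean."""
    return [mean(train_targets)] * n_predictions


# Hypothetical data, invented purely for illustration.
train_targets = [100.0, 120.0, 80.0, 100.0]
test_targets = [90.0, 110.0, 100.0]
# Pretend these came from a more complex model (e.g. a Random Forest).
complex_model_predictions = [130.0, 70.0, 100.0]

baseline_predictions = mean_baseline(train_targets, len(test_targets))
baseline_error = mean_absolute_error(test_targets, baseline_predictions)
complex_error = mean_absolute_error(test_targets, complex_model_predictions)

# If the complex model cannot beat the baseline, stop and investigate.
complex_beats_baseline = complex_error < baseline_error
```

In this invented case the complex model loses to the mean model, which is the situation worth catching before the client does.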
  12. Story – VC and “nobody to check the results”
      - 10/10,000 chance of success
      - The junior associates have their own methods
      - Client suggested “more advanced methods” but Logistic Regression and GBMs very good (accepting limited signal!)
      - My contract got terminated!
      Problem-poor environment
  13. Solution – get clients involved early
      - Deliver early and often to client
      - Give them enough so they look cool
      - Use simplest models (e.g. linear), make lots of pictures, diagnose problems, figure out the value to them
  14. Story – insurance and “no ML, please write SQL”
      - Successful Random Forest model for insurance
      - “We can’t deploy Python, please write SQL” - IT
      - Colleague had to hand-write SQL rules from RF model – did it ever actually work? Was it right?
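One way to reduce the "was it right?" risk is to generate the SQL mechanically and check it against the Python logic. The sketch below is an assumption-laden illustration, not what the team did: it handles a single decision tree stored as nested dicts, and the feature names, thresholds and the `claims` table are all hypothetical. A Random Forest would need one expression per tree plus an aggregation step.

```python
import sqlite3


def tree_to_sql(node):
    """Recursively render a decision tree as a nested SQL CASE expression."""
    if "value" in node:  # leaf: emit the predicted value
        return str(node["value"])
    left = tree_to_sql(node["left"])    # branch taken when feature <= threshold
    right = tree_to_sql(node["right"])  # branch taken otherwise
    return (
        f"CASE WHEN {node['feature']} <= {node['threshold']} "
        f"THEN {left} ELSE {right} END"
    )


def tree_predict(node, row):
    """Evaluate the same tree in Python, to cross-check the SQL logic."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


# Hypothetical tree; features and thresholds are invented for illustration.
tree = {
    "feature": "claim_amount",
    "threshold": 1000,
    "left": {"value": 0},
    "right": {
        "feature": "customer_age",
        "threshold": 30,
        "left": {"value": 1},
        "right": {"value": 0},
    },
}
rows = [
    {"claim_amount": 500, "customer_age": 25},
    {"claim_amount": 5000, "customer_age": 25},
    {"claim_amount": 5000, "customer_age": 45},
]

sql_expression = tree_to_sql(tree)

# Run the generated SQL in an in-memory SQLite table and compare with Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_amount REAL, customer_age REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (:claim_amount, :customer_age)", rows
)
sql_predictions = [
    r[0] for r in conn.execute(f"SELECT {sql_expression} FROM claims ORDER BY rowid")
]
python_predictions = [tree_predict(tree, row) for row in rows]
conn.close()
```

Comparing `sql_predictions` with `python_predictions` on real historical rows gives a cheap regression test for the hand-off to IT.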
  15. Solution – plan for deployment early
      - Operationalization is often hard (especially for v1)
      - In your specification think about the client, their needs and how to deploy so they can use the tools
      - Sit with the client – how do they work right now?
      - A corporate might take 6 months to provision a machine
  16. Story – Making $1M $2M for my client
      - Finding insurance fraud and overbilling – really hard!
      - Prior fraud project 6 months old & no results
      - We derisked projects early – 2+ months of discussion
      - Found positive examples, assigned $value, prioritised
      - Agreed a delivery schedule for whatever we learn
  17. Story – Making $1M $2M for my client
      - Mix of better SQL ($0.4M), counting ($0.8M), percentiles ($0.4M), lots of discussion, lots of SQL (problem rich!)
      - Isolation Forest + GBM good but rules better for client
      - Boss’ boss writing their own BI as they’re so inspired
      - New team begging us to start with them
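The "percentiles" item above hints at how simple a valuable rule can be. As a minimal sketch, assuming a hypothetical mapping of provider names to total billed amounts (the names, amounts and the 90th-percentile cutoff are all invented, and this is not the client's actual rule), a percentile flag could look like this:

```python
from statistics import quantiles


def flag_high_billers(billed_by_provider, percentile=90):
    """Return providers whose billed amount exceeds the given percentile."""
    amounts = list(billed_by_provider.values())
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cutoff = quantiles(amounts, n=100, method="inclusive")[percentile - 1]
    return sorted(
        name for name, amount in billed_by_provider.items() if amount > cutoff
    )


# Hypothetical billing totals, invented for illustration.
billed_by_provider = {
    "provider_a": 100, "provider_b": 110, "provider_c": 120,
    "provider_d": 130, "provider_e": 140, "provider_f": 150,
    "provider_g": 160, "provider_h": 170, "provider_i": 180,
    "provider_j": 190, "provider_k": 900,
}

flagged = flag_high_billers(billed_by_provider)
```

A rule like this is easy for a fraud team to read and challenge, which may be part of why rules suited the client better than the Isolation Forest and GBM models.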
  18. Story – “Fixing business” takes time
      - New problem!
      - No bandwidth in Fraud team for new results – we swamped them (in a good way)!
      - Getting an organisation to move up the Data Maturity Model is hard and just takes time
  19. For new projects check you
      - Have a clear written specification
      - Know the business need & value → metrics
      - Know the risks – does it make sense
      - Know how to deploy (and deploy early)
  20. Summary
      - Identify failure, iterate to success
      - See blog for past talks
      - I’d love a postcard if you learned something new!
  21. Story – automated contract recruitment and “new superpowers”
      - Need “the face fits” and “relevant skills”
      - Similarity tool for company and skills from PDF text
      - Client annotated data & scored results from week 1
      - “You’ve given us a superpower, we phone the top 10 results, sign a contract, then we’re done for the day”
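A text similarity tool of this kind can start very small. The sketch below is one possible approach, not the system described in the talk: it assumes the PDF text has already been extracted to plain strings, uses a simple bag-of-words with cosine similarity, and the job description and CV texts are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(job_text, cv_texts, top_n=10):
    """Return the names of the top_n CVs most similar to the job text."""
    job = bag_of_words(job_text)
    scored = [
        (cosine_similarity(job, bag_of_words(text)), name)
        for name, text in cv_texts.items()
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical texts, invented for illustration.
job_text = "python data engineer with sql and airflow"
cv_texts = {
    "cv_alice": "python sql airflow data pipelines engineer",
    "cv_bob": "java spring backend developer",
    "cv_carol": "python web developer with django and sql",
}

ranking = rank_candidates(job_text, cv_texts)
```

Showing a ranked list like this to the client in week 1 gives them something concrete to annotate and score, which matches the feedback loop described on the slide.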
  22. Story – recruitment & deployment
      - Initial deployment – CSV for similarity results, then Jupyter Notebooks, then microservices + Flask with black-box tests (now I’d use FastAPI + Streamlit or Voila)
      - Boss sat next to me and we typed examples together
      - Tests caught MongoDB corruption and MySQL “3 byte unicode”
  23. You’re in charge
      - This is your career – you’re in charge
      - Identify possible problems
      - Make sensible choices
      - (accept some failures!)
      - Enjoy yourself
  24. A checklist for you
      - You should write your own specification
      - Identify risks, talk to the experts, get good examples
      - Quickly deliver results & iterate
      - Deploy often, deploy early (be embarrassed and learn)