Slide 1

Slide 1 text

Building Successful Data Science Projects @IanOzsvald – ianozsvald.com Ian Ozsvald PyData London 2022 Talk

Slide 2

Slide 2 text

NEW projects – pains & gains By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 3

Slide 3 text

Introductions
• Interim Chief Data Scientist
• 20+ years experience
• Coaching & public courses – I’m sharing from my Successful Data Science Projects course
2nd Edition!

Slide 4

Slide 4 text

Credit – South Park and the Underpants Gnomes

Slide 5

Slide 5 text

Get Data ML! DNN! Big Data!

Slide 6

Slide 6 text

We fail a lot (anyone else?)
What if it isn’t about the technology?
What if it is down to human factors?

Slide 7

Slide 7 text

Story – Automated price comparison
• Find “cheapest TV” on other sites (famous at the time)
• We agreed the specification verbally
• Sklearn, BoW model, gold validation set – all sensible
• We made great progress - what could go wrong?
Best price not on Amazon...
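A minimal sketch of the kind of setup this slide names: a bag-of-words classifier in scikit-learn, scored against a hand-checked gold set that is kept out of training. The product titles, labels and the choice of LogisticRegression are all assumptions made for illustration; the slide does not say which estimator or data were used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical product titles: label 1 means "this listing is a TV".
train_titles = [
    "samsung 55 inch 4k smart tv",
    "lg oled tv 65 inch",
    "sony bravia led television",
    "apple iphone 13 smartphone",
    "dell xps laptop 13 inch",
    "bosch washing machine 8kg",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Gold validation set: hand-checked examples that are never used for training.
gold_titles = ["panasonic 50 inch smart tv", "hp pavilion laptop"]
gold_labels = [1, 0]

# CountVectorizer builds the bag-of-words features; the classifier is an assumption.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_titles, train_labels)

gold_predictions = model.predict(gold_titles)
gold_accuracy = model.score(gold_titles, gold_labels)
print(f"gold-set accuracy: {gold_accuracy:.2f}")
```

A gold set only protects you if it contains the hard cases the client cares about, which is where this story went wrong.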

Slide 8

Slide 8 text

Story – Automated price comparison
• The specification changed despite having agreement
• They held back the “hard data” so I could have an easy start – critical facts missing!
• This is not what we discussed, nowhere to pivot to

Slide 9

Slide 9 text

Solution – write a specification
• What problem needs solving? What examples do you have? What is it worth to the business?
• How would an expert solve this? Do they solve it?
• Get the bosses to agree to your specification

Slide 10

Slide 10 text

Specification: What’s the $value?

Slide 11

Slide 11 text

Story – insurance and Big DS Projects
• Boss in new department wanted $$$ Big Success
• “Success” was sold to business departments, then the Data Science team was involved after agreement
• Sometimes no data in the DB (just paper)
• But – this was also a problem-rich environment

Slide 12

Slide 12 text

Solution – talk to the client first
• Your client knows more than you do
• What do they need?
• What’s feasible with the data?
• What’s it $worth?

Slide 13

Slide 13 text

Data Maturity Model
Reference: https://www.svds.com/thought-leadership/data-maturity-assessment/
Growing up is Hard!

Slide 14

Slide 14 text

Story – insurance & low client trust
• ML Project nearly finished...
• Client didn’t trust “ML”
• Colleague drew many diagrams...
https://stackoverflow.com/questions/40155128/plot-trees-for-a-random-forest-in-python-with-scikit-learn

Slide 15

Slide 15 text

Story – SHAP to explain predictions
• We found data errors → iteration → build confidence
• The client ultimately agreed “this is useful, I want it” by diagnosing cases they knew personally
https://towardsdatascience.com/using-model-interpretation-with-shap-to-understand-what-happened-in-the-titanic-1dd42ef41888

Slide 16

Slide 16 text

Story – VC and Dirty Data
• “We want to investigate our data, please do some magic”
• Agreed a derisking project, identified many issues, proposed next project – a sane start
• Later we built models to prioritise interesting companies
• The data really was awful!

Slide 17

Slide 17 text

Story – Always have a baseline
• Insurance – the “mean model” beat Random Forest – huge embarrassment
• VC – My Logistic Regression beat the human rules (encoded as derived sklearn estimators)
https://scikit-learn.org/stable/developers/develop.html
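One way to picture “human rules encoded as derived sklearn estimators” is sketched below: a hand-written rule is wrapped in a small estimator class so that it can be scored with the same call as a trivial baseline and a learned model. The rule, the threshold, the feature columns and the data are all made up for illustration and are not the rules from the talk. For a regression problem, DummyRegressor(strategy="mean") would play the role of the “mean model”.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression


class HumanRuleClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical analyst rule: flag a company when growth exceeds a threshold."""

    def __init__(self, growth_threshold=0.5):
        self.growth_threshold = growth_threshold

    def fit(self, X, y):
        # Nothing is learned; record the classes to follow the sklearn convention.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > self.growth_threshold).astype(int)


# Made-up data: column 0 is growth, column 1 is team size.
X = np.array([[0.1, 3], [0.2, 10], [0.6, 4], [0.9, 12], [0.4, 8], [0.7, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

candidates = {
    "most frequent class": DummyClassifier(strategy="most_frequent"),
    "human rule": HumanRuleClassifier(growth_threshold=0.5),
    "logistic regression": LogisticRegression(),
}

# In-sample accuracy keeps the sketch short; a real comparison would use
# held-out data or cross-validation.
scores = {name: est.fit(X, y).score(X, y) for name, est in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

If the learned model cannot beat the first two rows, that is worth knowing before anyone presents it to a client.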

Slide 18

Slide 18 text

Story – VC and “nobody to check the results”
• 10/10,000 chance of success
• The junior associates have their own methods
• Client suggested “more advanced methods” but Logistic Regression and GBMs were very good (accepting limited signal!)
• My contract got terminated!
Problem-poor environment

Slide 19

Slide 19 text

Solution – get clients involved early
• Deliver early and often to client
• Give them enough so they look cool
• Use simplest models (e.g. linear), make lots of pictures, diagnose problems, figure out the value to them

Slide 20

Slide 20 text

Story – insurance and “no ML, please write SQL”
• Successful Random Forest model for insurance
• “We can’t deploy Python, please write SQL” - IT
• Colleague had to hand-write SQL rules from the Random Forest model – did it ever actually work? Was it right?
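The “was it right?” question can at least be tested. The sketch below is not what the colleague did (the slide says the SQL was written by hand); it assumes a single decision tree stored as a hypothetical nested dict, renders it as a SQL CASE expression, and checks the SQL against the Python predictions using an in-memory SQLite table. The table name, column names and thresholds are invented.

```python
import sqlite3

# Hypothetical tree: internal nodes test "feature <= threshold", leaves hold a value.
tree = {
    "feature": "claim_amount", "threshold": 1000,
    "left": {"value": 0},
    "right": {
        "feature": "claims_last_year", "threshold": 2,
        "left": {"value": 0},
        "right": {"value": 1},
    },
}


def tree_to_sql(node):
    """Render the tree as a nested SQL CASE expression."""
    if "value" in node:
        return str(node["value"])
    return (
        f"CASE WHEN {node['feature']} <= {node['threshold']} "
        f"THEN {tree_to_sql(node['left'])} ELSE {tree_to_sql(node['right'])} END"
    )


def predict(node, row):
    """Walk the same tree in Python to get a reference prediction."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


rows = [
    {"claim_id": 1, "claim_amount": 500, "claims_last_year": 5},
    {"claim_id": 2, "claim_amount": 1500, "claims_last_year": 1},
    {"claim_id": 3, "claim_amount": 1500, "claims_last_year": 4},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (claim_id INTEGER, claim_amount REAL, claims_last_year INTEGER)"
)
conn.executemany(
    "INSERT INTO claims VALUES (:claim_id, :claim_amount, :claims_last_year)", rows
)

query = f"SELECT claim_id, {tree_to_sql(tree)} AS flagged FROM claims ORDER BY claim_id"
sql_flags = [flag for _, flag in conn.execute(query)]
python_flags = [predict(tree, row) for row in rows]
print(sql_flags, python_flags)
```

A Random Forest has many trees, so a full translation would need one expression per tree plus a vote; comparing SQL output with model output on the same rows is the part that carries over.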

Slide 21

Slide 21 text

Solution – plan for deployment early
• Operationalization is often hard (especially for v1)
• In your specification think about the client, their needs and how to deploy so they can use the tools
• Sit with the client – how do they work right now?
• A corporate might take 6 months to provision a machine

Slide 22

Slide 22 text

Story – Making $1M $2M for my client
• Finding insurance fraud and overbilling – really hard!
• Prior fraud project 6 months old & no results
• We derisked projects early – 2+ months of discussion
• Found positive examples, assigned $value, prioritised
• Agreed a delivery schedule for whatever we learn

Slide 23

Slide 23 text

Story – Making $1M $2M for my client
• Mix of better SQL ($0.4M), counting ($0.8M), percentiles ($0.4M), lots of discussion, lots of SQL (problem rich!)
• Isolation Forest + GBM good but rules better for client
• Boss’ boss writing their own BI as they’re so inspired
• New team begging us to start with them
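Counting and percentiles need very little machinery. The sketch below uses made-up billing lines and only the Python standard library; the provider names, amounts and the choice of the 90th percentile as a cut-off are assumptions for illustration, not figures from the project.

```python
from collections import Counter
from statistics import quantiles

# Made-up billing lines: (provider, amount billed).
bills = [
    ("A", 100), ("A", 120),
    ("B", 110), ("B", 95),
    ("C", 105), ("C", 900), ("C", 115), ("C", 98), ("C", 102), ("C", 108),
]

# Counting: which providers bill most often?
bills_per_provider = Counter(provider for provider, _ in bills)

# Percentiles: quantiles(..., n=10) returns the 9 decile cut points,
# so the last one is the 90th percentile of the billed amounts.
amounts = [amount for _, amount in bills]
p90 = quantiles(amounts, n=10)[-1]

# Simple rule: anything above the 90th percentile goes to a human for review.
flagged = [(provider, amount) for provider, amount in bills if amount > p90]

print(bills_per_provider.most_common(1))
print(flagged)
```

Rules like this are easy for a client to read and challenge, which fits the slide's point that rules worked better for them than the models.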

Slide 24

Slide 24 text

Story – “Fixing business” takes time
• New problem!
• No bandwidth in Fraud team for new results – we swamped them (in a good way)!
• Getting an organisation to move up the Data Maturity Model is hard and just takes time

Slide 25

Slide 25 text

A colleague’s view

Slide 26

Slide 26 text

For new projects check you:
• Have a clear written specification
• Know the business need & value → metrics
• Know the risks – does it make sense?
• Know how to deploy (and deploy early)

Slide 27

Slide 27 text

Find a Good Puzzle
Solve the Puzzle

Slide 28

Slide 28 text

Summary
• Identify failure, iterate to success
• See blog for past talks
• I’d love a postcard if you learned something new!

Slide 29

Slide 29 text

Story – automated contract recruitment and “new superpowers”
• Need “the face fits” and “relevant skills”
• Similarity tool for company and skills from PDF text
• Client annotated data & scored results from week 1
• “You’ve given us a superpower, we phone the top 10 results, sign a contract, then we’re done for the day”
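A similarity tool of this kind can start very small. The sketch below ranks hypothetical CV texts against a job description using bag-of-words cosine similarity from the standard library. It assumes the PDF text has already been extracted, and the texts and names are invented; the slide does not say which similarity measure the real tool used.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical job description and CV texts.
job = "python data scientist with sql and machine learning"
cvs = {
    "cv_1": "java developer with spring and sql",
    "cv_2": "data scientist python machine learning sql pandas",
    "cv_3": "graphic designer photoshop illustrator",
}

job_vector = bag_of_words(job)
ranked = sorted(
    cvs,
    key=lambda name: cosine_similarity(job_vector, bag_of_words(cvs[name])),
    reverse=True,
)
print(ranked)
```

Handing the client a ranked list like this from week 1 is what lets them annotate and score results early.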

Slide 30

Slide 30 text

Story – recruitment & deployment
• Initial deployment – CSV for similarity results, then Jupyter Notebooks, then microservices + Flask with black-box tests (now I’d use FastAPI + Streamlit or Voila)
• Boss sat next to me and we typed examples together
• Tests caught MongoDB corruption and MySQL “3 byte unicode”
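The “3 byte unicode” problem most likely refers to MySQL's legacy utf8 character set (utf8mb3), which stores at most 3 bytes per character, so characters that need 4 bytes in UTF-8 (emoji, for example) cannot be stored. The sketch below does not talk to MySQL; it is a hypothetical helper for picking the kind of input a black-box round-trip test should include.

```python
def four_byte_characters(text):
    """Return the characters whose UTF-8 encoding is 4 bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


samples = ["café", "great hire 😀"]
for sample in samples:
    # Any character listed here would not fit in a 3-byte-per-character column.
    print(repr(sample), four_byte_characters(sample))
```

A test that writes such a string through the service and reads it back catches the issue without needing to know how the storage layer is configured.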

Slide 31

Slide 31 text

You’re in charge
• This is your career – you’re in charge
• Identify possible problems
• Make sensible choices
• (accept some failures!)
• Enjoy yourself

Slide 32

Slide 32 text

A checklist for you
• You should write your own specification
• Identify risks, talk to the experts, get good examples
• Quickly deliver results & iterate
• Deploy often, deploy early (be embarrassed and learn)