Slide 1

Slide 1 text

Deftly Delivering Data Science Projects PyDataPrague 2018-10 Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald[.com] PyDataPrague 2018-10
Introductions
● I’m an engineering data scientist
● 15+ years experience
● Team coaching
● Strategic planning
● Training

Slide 3

Slide 3 text

Problems delivering DS projects
● What are your experiences?

Slide 4

Slide 4 text

Problems delivering DS projects
● “Make us more [money|signups|...]” - desire for magic
● Desire over actual need – vanity projects
● Lack of technical leadership – poor specs
● Bad data – lies, mistakes, confusion
● Lack of client buy-in

Slide 5

Slide 5 text

Good DS projects
● Numerate management asking good data-driven questions
● Suitable data
● Well-defined outcomes that are agreed to be achievable

Slide 6

Slide 6 text

Learning and applying at...

Slide 7

Slide 7 text

Project Specification
● You need a clearly defined problem
● Where are the unknowns?
● Known unknowns
● What might kill the project?
● Propose milestones
● Where’s your Gold Standard data set?
● What’s your “definition of done”?
● Minimal results and great results
● Appropriate metrics to communicate results
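One way to make the "definition of done" concrete is to score predictions against the Gold Standard data set and compare the agreed metrics with two thresholds, one for a minimal result and one for a great result. The sketch below assumes a binary classification task; the threshold values and the names `MINIMAL`, `GREAT` and `assess` are hypothetical and would be agreed with the client, not taken from the talk.

```python
def precision_recall(gold, predicted):
    """Precision and recall for binary labels, where 1 is the positive class."""
    assert len(gold) == len(predicted), "gold and predicted must align"
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical thresholds, written into the specification up front.
MINIMAL = {"precision": 0.70, "recall": 0.50}
GREAT = {"precision": 0.90, "recall": 0.80}


def assess(gold, predicted):
    """Classify a result as 'great', 'minimal' or 'not done'."""
    precision, recall = precision_recall(gold, predicted)
    scores = {"precision": precision, "recall": recall}
    if all(scores[name] >= GREAT[name] for name in GREAT):
        return "great"
    if all(scores[name] >= MINIMAL[name] for name in MINIMAL):
        return "minimal"
    return "not done"
```

Writing the thresholds down before modelling starts gives both sides the same yardstick when a milestone is reviewed.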

Slide 8

Slide 8 text

“Data story”
● Do you understand your data?
● Explain your data – what does it say?
● What’s good and what’s bad?
● What are the relationships?
● Where is the signal in the data?
● Export your Notebook as an HTML artefact
● Data Story proposed by Bertil (Medium)

Slide 9

Slide 9 text

Standardised approaches
● Reduce the mental load for common decisions
● Cookiecutter (folders)
● pandas-profiling
● watermark
● Anaconda
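The point of a tool like watermark is that every Notebook or report records the environment it was produced in. watermark itself is normally used from inside a Notebook; the sketch below is a hand-rolled standard-library stand-in for the same idea, not watermark's API. The function name `environment_report` and the package list are assumptions for illustration.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, OS and package versions for a run log."""
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing the report.
            report[name] = "not installed"
    return report


# Example: print the report at the top of an analysis script.
for key, value in environment_report(["pandas", "scikit-learn"]).items():
    print(f"{key}: {value}")
```

Putting the same call at the top of every analysis removes one more decision from the project and makes old results easier to reproduce.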

Slide 10

Slide 10 text

Improving code quality
● Encode assumptions with asserts
● Refactor to modules
● Add unit-tests
● Visual reports with analyst interpretations
● Diagnostics e.g. yellowbrick for sklearn
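A minimal sketch of the first three points, assuming a cleaning step that has been refactored out of a Notebook into a function. The function `clean_ages`, the keys `user_id` and `age`, and the 0 to 120 range are all hypothetical; the asserts document what the analyst believes about the data, and the unit tests pin the behaviour down.

```python
import unittest


def clean_ages(rows):
    """Drop rows with a missing age, asserting assumptions about the input.

    `rows` is a list of dicts with the (hypothetical) keys "user_id" and "age".
    """
    # Assumption: an empty extract means something failed upstream.
    assert len(rows) > 0, "expected at least one row"
    # Assumption: each user appears exactly once.
    user_ids = [row["user_id"] for row in rows]
    assert len(user_ids) == len(set(user_ids)), "duplicate user_id found"

    cleaned = [row for row in rows if row["age"] is not None]

    # Assumption: remaining ages are plausible human ages.
    assert all(0 <= row["age"] <= 120 for row in cleaned), "age outside 0-120"
    return cleaned


class TestCleanAges(unittest.TestCase):
    def test_drops_missing_ages(self):
        rows = [{"user_id": 1, "age": 34}, {"user_id": 2, "age": None}]
        self.assertEqual(clean_ages(rows), [{"user_id": 1, "age": 34}])

    def test_duplicate_ids_fail_loudly(self):
        rows = [{"user_id": 1, "age": 34}, {"user_id": 1, "age": 40}]
        with self.assertRaises(AssertionError):
            clean_ages(rows)
```

If this lives in a module, the tests can be run with `python -m unittest`. Note that Python skips `assert` statements when started with `-O`, so checks that must always run in production are better raised as explicit exceptions.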

Slide 11

Slide 11 text

Improving project quality
● Code reviews (with a check-list, PEP8)
● nbdime for diffs
● “Data Defences” - regular critiques by colleagues on your project

Slide 12

Slide 12 text

Continuous delivery to client
● Early deliveries – reports
● Get to a minimal working delivery as soon as possible (UI? App? Reports?)
● Consider papermill for deployable Notebooks
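As a rough sketch of the papermill idea: a template Notebook is executed with a set of parameters and the executed copy is saved as the deliverable. `papermill.execute_notebook` is papermill's entry point for this; the helper names, the parameter names `client` and `report_date`, and any file paths passed in are hypothetical. papermill injects the values as a new cell, conventionally placed after a cell tagged `parameters` in the template.

```python
from datetime import date


def build_parameters(client, report_date):
    """Build the parameters dict that will be injected into the Notebook."""
    return {"client": client, "report_date": report_date.isoformat()}


def run_report(template_path, output_path, client, report_date):
    """Execute the template Notebook and write the executed copy."""
    # Imported here so build_parameters can be used and tested
    # on a machine without papermill installed.
    import papermill as pm

    pm.execute_notebook(
        template_path,
        output_path,
        parameters=build_parameters(client, report_date),
    )
```

Keeping the parameter-building step separate from the execution step means the part most likely to contain mistakes can be unit-tested without running a Notebook.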

Slide 13

Slide 13 text

Summary
● Honesty throughout your work
● Strive to keep improving your technique
● Keep communicating your results