Slide 1

Slide 1 text

Pitfalls in Data Science Projects
Marco Bonzanini
PyData Southampton Meetup

Slide 2

Slide 2 text

© Bonzanini Consulting Ltd — BonzaniniConsulting.com
Nice to meet you
• Dr Marco Bonzanini
• NLP and Data Science stuff
• Consulting, training and coaching on Python + Data Science
• Former Chair @ PyData London

Slide 3

Slide 3 text


Slide 4

Slide 4 text

WHY DO PROJECTS FAIL?

Slide 5

Slide 5 text

• Business and Tech out of sync

Slide 6

Slide 6 text

• Business and Tech out of sync
• Data Quality

Slide 7

Slide 7 text

• Business and Tech out of sync
• Shiny Object Syndrome
• Data Quality

Slide 8

Slide 8 text

• Business and Tech out of sync
• Shiny Object Syndrome
• Data Quality
• No Route to Deployment

Slide 9

Slide 9 text

• Business and Tech out of sync
• Shiny Object Syndrome
• Data Quality
• No Route to Deployment

Slide 10

Slide 10 text

THE DREADED “POC”

Slide 11

Slide 11 text

Expectation: Idea → Prototype → Product

Slide 12

Slide 12 text

Expectation: Idea → Prototype → Product
Reality: ???

Slide 13

Slide 13 text

First Mile ≠ Last Mile

Slide 14

Slide 14 text

R&D Requirements ≠ Production Requirements

Slide 15

Slide 15 text

Pitfalls
• Lack of Planning
• Lack of Communication

Slide 16

Slide 16 text

SHOULD WE WORK ON IT?

Slide 17

Slide 17 text

Impact vs Effort

Slide 18

Slide 18 text

Impact vs Effort
[Chart axes: Effort, Impact]

Slide 19

Slide 19 text

Impact vs Effort
[Chart axes: Effort, Impact]
• High impact, low effort 👍 👍
• Low impact, high effort 👎 👎

Slide 20

Slide 20 text

Impact vs Effort
[Chart axes: Effort, Impact; regions marked ✅ ✅ ✅ ❌ ❌ ❌ ❌ 🤷 ✅ 🤷 🤷]
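One way to make the Impact vs Effort chart concrete is a small triage helper. This is a minimal sketch, not the speaker's method: the 1 to 5 scoring scale, the threshold of 3 and the names `Idea` and `triage` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int  # hypothetical 1-5 score, agreed with stakeholders
    effort: int  # hypothetical 1-5 score, estimated by the team


def triage(idea: Idea, threshold: int = 3) -> str:
    """Place an idea in a quadrant of the impact/effort chart."""
    high_impact = idea.impact >= threshold
    high_effort = idea.effort >= threshold
    if high_impact and not high_effort:
        return "do it"      # high impact, low effort
    if not high_impact and high_effort:
        return "skip"       # low impact, high effort
    return "discuss"        # the remaining quadrants need a conversation


ideas = [
    Idea("churn alerts", impact=5, effort=2),
    Idea("rewrite ETL in a new framework", impact=1, effort=5),
    Idea("recommendation engine", impact=5, effort=5),
]
decisions = {idea.name: triage(idea) for idea in ideas}
```

The two clear-cut quadrants map to a decision directly; everything else is returned as "discuss" because the scores alone do not settle it.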

Slide 21

Slide 21 text

Align with Business
De-risk

Slide 22

Slide 22 text

Align with Business
De-risk
• Business case: PoC vs Proof-of-Value

Slide 23

Slide 23 text

Align with Business
De-risk
• Business case: PoC vs Proof-of-Value
• Stakeholders: buy-in + involvement

Slide 24

Slide 24 text

Align with Business
De-risk
• Business case: PoC vs Proof-of-Value
• Stakeholders: buy-in + involvement
• Data availability: first mile data + additional data

Slide 25

Slide 25 text

Align with Business
De-risk
• Business case: PoC vs Proof-of-Value
• Stakeholders: buy-in + involvement
• Data availability: first mile data + additional data
• Data quality, coverage and volume?

Slide 26

Slide 26 text

Align with Business
De-risk
• Business case: PoC vs Proof-of-Value
• Stakeholders: buy-in + involvement
• Data availability: first mile data + additional data
• Data quality, coverage and volume?
• Deployment, integration, scalability?
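The "data quality, coverage and volume" question can often be answered with a very small profiling script before any modelling starts. The sketch below is an assumption about what such a check might look like: the function name `profile`, the record layout and the field names are hypothetical.

```python
def profile(rows, fields):
    """Report volume (row count) and the fraction of missing values per field."""
    n = len(rows)
    missing = {}
    for field in fields:
        # A value counts as missing if the key is absent, None or an empty string.
        n_missing = sum(1 for row in rows if row.get(field) in (None, ""))
        missing[field] = n_missing / n if n else 0.0
    return {"rows": n, "missing": missing}


# Hypothetical sample of raw records
sample = [
    {"id": 1, "text": "great product"},
    {"id": 2, "text": ""},
    {"id": 3},
]
report = profile(sample, fields=["id", "text"])
```

A report like this is cheap to produce and gives stakeholders something concrete to react to when deciding whether the project is worth pursuing.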

Slide 27

Slide 27 text

TOWARDS THE PROTOTYPE

Slide 28

Slide 28 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint
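The pipeline stages on this slide can be wired together end to end with very little code. The sketch below is illustrative only: every function name, the record layout and the majority-class "model" are assumptions, chosen so that each stage exists in its simplest possible form.

```python
from collections import Counter


def validate(rows):
    """Data validation: drop records with a missing text or label."""
    return [r for r in rows if r.get("text") and r.get("label") is not None]


def process(rows):
    """Data processing: lowercase and tokenise on whitespace."""
    return [(r["text"].lower().split(), r["label"]) for r in rows]


def train(examples):
    """Model training: a baseline that always predicts the most common label."""
    majority = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda tokens: majority


def evaluate(model, examples):
    """Model validation: plain accuracy on the given examples."""
    correct = sum(model(tokens) == label for tokens, label in examples)
    return correct / len(examples)


def serve(model, text):
    """Stand-in for a serving endpoint: raw text in, prediction out."""
    return model(text.lower().split())


# Hypothetical raw data
raw = [
    {"text": "Win money now", "label": "spam"},
    {"text": "Cheap pills", "label": "spam"},
    {"text": "Lunch at noon?", "label": "ham"},
    {"text": "", "label": "spam"},  # fails validation
    {"text": "Free prize", "label": "spam"},
]
examples = process(validate(raw))
model = train(examples)
score = evaluate(model, examples)  # evaluated on training data, for brevity only
prediction = serve(model, "Hello there")
```

The baseline is deliberately naive; the point is that every stage runs, so each one can later be improved in isolation.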

Slide 29

Slide 29 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint
“Let’s get the best accuracy”

Slide 30

Slide 30 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint
“Let’s get the best accuracy”
• Feature engineering
• Model building
• Hyperparameter tuning
• […]

Slide 31

Slide 31 text

Pitfalls
• Over Optimising
• Not Speaking the Language

Slide 32

Slide 32 text

End-to-end Quickly
Iterative Improvement

Slide 33

Slide 33 text

End-to-end Quickly
Iterative Improvement
• Promote early feedback

Slide 34

Slide 34 text

End-to-end Quickly
Iterative Improvement
• Promote early feedback
• Avoid early complexity:
  - Difficult to diagnose
  - Delay feedback
  - Hide bigger risks

Slide 35

Slide 35 text

End-to-end Quickly
Iterative Improvement
• Promote early feedback
• Avoid early complexity:
  - Difficult to diagnose
  - Delay feedback
  - Hide bigger risks
• Optimise for business value, not “accuracy”
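To illustrate "optimise for business value, not accuracy", the sketch below scores two sets of predictions both ways. The monetary figures (`value_tp`, `cost_fp`, `cost_fn`) and the example labels are invented for illustration; in practice they would come from the business case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def business_value(y_true, y_pred, value_tp=10.0, cost_fp=2.0, cost_fn=50.0):
    """Net value of the predictions under hypothetical per-outcome costs."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t and p:
            total += value_tp   # caught a positive case
        elif p and not t:
            total -= cost_fp    # false alarm
        elif t and not p:
            total -= cost_fn    # missed a positive case
    return total


y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10                      # the "accurate" model
cautious = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]       # more false alarms, no misses

acc_a, val_a = accuracy(y_true, always_negative), business_value(y_true, always_negative)
acc_b, val_b = accuracy(y_true, cautious), business_value(y_true, cautious)
```

Under these assumed costs the second model has lower accuracy (0.7 vs 0.8) but higher value (14 vs -100), which is the kind of trade-off worth discussing with stakeholders early.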

Slide 36

Slide 36 text

TOWARDS PRODUCTION

Slide 37

Slide 37 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint

Slide 38

Slide 38 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint

Slide 39

Slide 39 text

Raw Data → Data Validation → Data Processing → Model Training → Model Validation → Deployment → Model Serving Endpoint
What got you here won’t get you there

Slide 40

Slide 40 text

Hidden Technical Debt of Machine Learning Systems, Sculley et al. (2015)

Slide 41

Slide 41 text

Pitfalls
• Research ≠ Engineering
• Not Looking at the Bigger Picture

Slide 42

Slide 42 text

Testing
Packaging

Slide 43

Slide 43 text

Testing
Packaging
• Don’t ignore good software engineering principles

Slide 44

Slide 44 text

Testing
Packaging
• Don’t ignore good software engineering principles
• Testing: unit testing, integration testing

Slide 45

Slide 45 text

Testing
Packaging
• Don’t ignore good software engineering principles
• Testing: unit testing, integration testing
• Code re-usability, DRY (Don’t Repeat Yourself), Single Responsibility Principle

Slide 46

Slide 46 text

Testing
Packaging
• Don’t ignore good software engineering principles
• Testing: unit testing, integration testing
• Code re-usability, DRY (Don’t Repeat Yourself), Single Responsibility Principle
• Ditch the notebooks as soon as you:
  - struggle to test some component
  - would like to “import from another notebook”
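As an illustration of moving logic out of a notebook and into testable code, here is a minimal sketch. The function `clean_text` is a hypothetical example of a preprocessing step; the tests follow the plain-assert style that test runners such as pytest discover, but they also run as ordinary functions.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace, trim the ends and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def test_clean_text_collapses_whitespace():
    assert clean_text("  Hello   World \n") == "hello world"


def test_clean_text_handles_empty_string():
    assert clean_text("") == ""


if __name__ == "__main__":
    # Run the tests directly when no test runner is available.
    test_clean_text_collapses_whitespace()
    test_clean_text_handles_empty_string()
```

Once the function lives in a module, it has a single responsibility, can be imported from anywhere and no longer needs to be copied between notebooks.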

Slide 47

Slide 47 text

Testing
Packaging

Slide 48

Slide 48 text

Testing
Packaging

Slide 49

Slide 49 text

Code Reviews
Integration

Slide 50

Slide 50 text

Code Reviews
Integration
• Code reviews done right: foster collaboration
  - Spot check errors
  - Clarify the why’s
  - Knowledge share

Slide 51

Slide 51 text

Code Reviews
Integration
• Code reviews done right: foster collaboration
  - Spot check errors
  - Clarify the why’s
  - Knowledge share
• Code reviews done wrong: hostile environment
  - Ask open-ended questions
  - Use professional language
  - Nitpicks: labelled as such, kept to a minimum

Slide 52

Slide 52 text

AFTER PRODUCTION

Slide 53

Slide 53 text

• Model Degradation
• Scalability
• Lack of Explainability
• Cost of Maintenance 💩
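Model degradation is easier to catch when it is checked explicitly after deployment. The sketch below is one simple possibility, not a recommendation from the talk: the function name `is_degraded`, the tolerance of 0.05 and the idea of comparing recent labelled outcomes against a baseline accuracy are all assumptions.

```python
def is_degraded(baseline_accuracy, recent_correct, tolerance=0.05):
    """Flag the model when recent accuracy drops below baseline minus tolerance.

    recent_correct is a list of booleans: whether each recent prediction
    turned out to be right once the true outcome was known.
    """
    if not recent_correct:
        return False  # nothing to compare yet
    recent_accuracy = sum(recent_correct) / len(recent_correct)
    return recent_accuracy < baseline_accuracy - tolerance


# Hypothetical monitoring windows
healthy_window = [True] * 9 + [False]        # 90% correct
drifting_window = [True] * 8 + [False] * 2   # 80% correct

alert_healthy = is_degraded(0.9, healthy_window)
alert_drifting = is_degraded(0.9, drifting_window)
```

A check like this only works if true outcomes are collected in production, which is itself a requirement worth planning for before deployment.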

Slide 54

Slide 54 text

SUMMARY

Slide 55

Slide 55 text

The First Mile
• Rarely a technology problem
• Usually a planning + communication problem

Slide 56

Slide 56 text

The First Mile
• Rarely a technology problem
• Usually a planning + communication problem
The Last Mile
• Sometimes a technology problem
• Still a planning + communication problem

Slide 57

Slide 57 text

Thank You
• LinkedIn: https://www.linkedin.com/in/marcobonzanini/
• Blog: marcobonzanini.com
• Newsletter: marcobonzanini.com/newsletter