Introduction to Workflow for Data Science

Slide 1

Slide 1 text

Introduction to Workflow for Data Science
Santiago Lizardo
2019/12/11

Slide 2

Slide 2 text

About the speaker
• Sr. Engineering Manager
• Hobbyist Data Scientist

Slide 3

Slide 3 text

About the talk
• High-level overview of the data science workflow
• Use case in bug triage
• Tooling for machine learning
• Small demo

Slide 4

Slide 4 text

Data science workflow

Slide 5

Slide 5 text

Workflow definition

Slide 6

Slide 6 text

Data science workflow: 10,000-foot view
Diagram stages: Data, Insights, Actions

Slide 7

Slide 7 text

Data science workflow: 10,000-foot view
Diagram stages: Business Question, Data, Insights, Answer, Actions

Slide 8

Slide 8 text

Business question examples
• Predictions
  • Are we going to grow by more than X% next quarter?
  • Is this customer going to leave in the next 5 months?
• Classification
  • Is this a server or a workstation?
  • What type of server is this? (web server, email, file, proxy, …)
• Anomaly detection
  • Is this spike in uploads normal?
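The anomaly-detection question above ("Is this spike in uploads normal?") can be sketched with a simple z-score check. This is only an illustration, not a method taken from the talk: the upload counts, the `is_anomalous` helper, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` sample standard
    deviations away from the mean of `history` (hypothetical helper)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is unusual.
        return new_value != mu
    z_score = abs(new_value - mu) / sigma
    return z_score > threshold


# Made-up daily upload counts, for illustration only.
daily_uploads = [102, 98, 110, 95, 105, 99, 101]
print(is_anomalous(daily_uploads, 104))  # a value close to the usual range
print(is_anomalous(daily_uploads, 400))  # a large spike
```

A fixed z-score threshold is the simplest possible choice; seasonal data (weekday versus weekend uploads, for instance) would usually need a baseline that accounts for that pattern.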

Slide 9

Slide 9 text

Workflow example

Slide 10

Slide 10 text

Workflow example

Slide 11

Slide 11 text

Workflow example

Slide 12

Slide 12 text

Workflow example

Slide 13

Slide 13 text

Workflow example

Slide 14

Slide 14 text

Workflow example

Slide 15

Slide 15 text

Workflow example

Slide 16

Slide 16 text

Workflow example

Slide 17

Slide 17 text

Workflow example

Slide 18

Slide 18 text

Workflow example

Slide 19

Slide 19 text

Industry standards
• Knowledge Discovery in Databases (KDD)
• Analytics Solutions Unified Method for Data Mining (ASUM-DM)
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• Team Data Science Process (TDSP)

Slide 20

Slide 20 text

KDD process

Slide 21

Slide 21 text

CRISP-DM workflow

Slide 22

Slide 22 text

ASUM-DM workflow

Slide 23

Slide 23 text

TDSP workflow

Slide 24

Slide 24 text

DS workflow v. simplified

Slide 25

Slide 25 text

Time spent on each step of the process

Slide 26

Slide 26 text

Time spent on each step of the process

Slide 27

Slide 27 text

Data wrangling
• Data…
  • …sourcing
  • …exploration (EDA)
  • …cleaning
  • …imputation
  • …enrichment (Feature engineering)
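Three of the wrangling steps listed above (cleaning, imputation, enrichment) can be sketched in a few lines. This is a minimal illustration using only the Python standard library rather than pandas; the records, column names, and the choice of median imputation are assumptions made for the example, not details from the talk.

```python
from statistics import median

# Made-up raw records; None and "" stand for missing values.
raw_rows = [
    {"host": " web-01 ", "cpu_cores": "8", "ram_gb": "32"},
    {"host": "DB-01", "cpu_cores": "", "ram_gb": "64"},
    {"host": "web-02", "cpu_cores": "4", "ram_gb": None},
]


def clean(row):
    """Cleaning: normalise text and convert numeric strings to numbers."""
    def to_number(value):
        return float(value) if value not in (None, "") else None

    return {
        "host": row["host"].strip().lower(),
        "cpu_cores": to_number(row["cpu_cores"]),
        "ram_gb": to_number(row["ram_gb"]),
    }


def impute_median(rows, column):
    """Imputation: replace missing values with the column median."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = median(known)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def enrich(row):
    """Enrichment: derive a new feature from existing columns."""
    row["ram_per_core"] = row["ram_gb"] / row["cpu_cores"]
    return row


rows = [clean(r) for r in raw_rows]
for col in ("cpu_cores", "ram_gb"):
    impute_median(rows, col)
rows = [enrich(r) for r in rows]
for r in rows:
    print(r)
```

Median imputation is only one option; whether to fill, drop, or flag missing values depends on the data and the question being asked.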

Slide 28

Slide 28 text

Data wrangling | Data ingestion
• Identify the data source(s)
  • Static dumps (CSV, TSV, Parquet, …)
  • Live databases (SQL, NoSQL, …)
  • Data scraping (Web)
  • Database (SQL)
• Exposing it to the data science project
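As a rough sketch of the ingestion step above, the snippet below reads from a static CSV dump and from a SQL database, then exposes both as one list of records. It uses only the standard library; the in-memory CSV string, the in-memory SQLite database, and the `tickets` table with its columns are stand-ins invented for the example.

```python
import csv
import io
import sqlite3

# Static dump: an in-memory string stands in for a CSV file on disk.
csv_dump = io.StringIO(
    "ticket_id,summary\n1,Login page crashes\n2,Slow report export\n"
)
csv_rows = list(csv.DictReader(csv_dump))

# Live database: an in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dicts
conn.execute("CREATE TABLE tickets (ticket_id INTEGER, summary TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(3, "Email sync fails"), (4, "Proxy timeout")],
)
query = "SELECT ticket_id, summary FROM tickets ORDER BY ticket_id"
sql_rows = [dict(row) for row in conn.execute(query)]
conn.close()

# Expose both sources to the project in one uniform shape.
# CSV values arrive as strings, so ticket_id is converted to int here.
records = [
    {"ticket_id": int(r["ticket_id"]), "summary": r["summary"]} for r in csv_rows
] + sql_rows

for record in records:
    print(record)
```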

Slide 29

Slide 29 text

Data and variable type

Slide 30

Slide 30 text

Model selection
• Model
  • Supervised
  • Unsupervised

Slide 31

Slide 31 text

Model selection
• Model
  • Supervised
    • Classification
    • Regression
  • Unsupervised

Slide 32

Slide 32 text

Model selection
• Model
  • Supervised
  • Unsupervised
    • Clustering
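To make the supervised and unsupervised branches concrete, here is a small standard-library sketch: a nearest-centroid classifier (supervised, classification) next to a two-cluster k-means on the same values (unsupervised, clustering). The RAM figures, labels, and helper names are assumptions made for illustration, loosely based on the "server or workstation" question from earlier in the deck; regression is not shown.

```python
from statistics import mean

# Made-up 1-D feature: RAM in GB for a handful of machines.
ram_gb = [4, 8, 16, 96, 128, 112]
labels = ["workstation", "workstation", "workstation", "server", "server", "server"]


def fit_centroids(values, targets):
    """Supervised: labels are known, so learn one centroid per class."""
    return {
        cls: mean(v for v, t in zip(values, targets) if t == cls)
        for cls in set(targets)
    }


def predict(centroids, value):
    """Classification: pick the class whose centroid is closest."""
    return min(centroids, key=lambda cls: abs(centroids[cls] - value))


def two_means(values, iterations=10):
    """Unsupervised: no labels, so group values into two clusters
    (1-D k-means with k=2; assumes at least two distinct values)."""
    low, high = float(min(values)), float(max(values))
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(cluster_low), mean(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)


centroids = fit_centroids(ram_gb, labels)
print(predict(centroids, 12))   # label for a new machine with 12 GB
print(two_means(ram_gb))        # two groups found without any labels
```

The difference is in the inputs: the classifier needs the `labels` list to learn from, while the clustering function sees only the raw values and leaves it to the analyst to interpret the groups.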

Slide 33

Slide 33 text

Model selection

Slide 34

Slide 34 text

Model validation
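The figure for this slide is not part of the transcript. One common form of validation is a hold-out split: train on part of the data and measure accuracy on the rest. The sketch below assumes that approach; the data, the threshold "model", and the helper names are all invented for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, truths):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


def class_mean(rows, label):
    """Mean RAM of the rows carrying the given label."""
    values = [ram for ram, lab in rows if lab == label]
    return sum(values) / len(values)


# Made-up (ram_gb, label) pairs.
data = [
    (4, "workstation"), (8, "workstation"), (16, "workstation"), (32, "workstation"),
    (64, "server"), (96, "server"), (112, "server"), (128, "server"),
]
train, test = train_test_split(data)

# Toy model: a threshold halfway between the two class means of the training set.
threshold = (class_mean(train, "workstation") + class_mean(train, "server")) / 2
predictions = ["server" if ram > threshold else "workstation" for ram, _ in test]
print(accuracy(predictions, [label for _, label in test]))
```

The key point is that accuracy is computed only on rows the model never saw during training; scoring on the training rows would overstate how well it generalises.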

Slide 35

Slide 35 text

Dissemination
• Understandable by stakeholders
• Digestible
• Concise
• Visual (charts, plots)
• Accurate

Slide 36

Slide 36 text

Chart tools
• Python
  • Matplotlib
  • Seaborn
  • Plotly (interactive)

Slide 37

Slide 37 text

Charts

Slide 38

Slide 38 text

USE CASE

Slide 39

Slide 39 text

Use case: Triage
• Every morning new bugs are reviewed
• Engineering team is identified and assigned
• Done based on description among other attributes

Slide 40

Slide 40 text

Use case: Triage
• Can it be programmed?
  • Too many rules
• Use information about previous 30K tickets to classify new ones
  • Error margin is acceptable

Slide 41

Slide 41 text

Use case: Triage
• Jira ticket attributes to read from:
  • Summary
  • Components
  • Labels
  • Area
  • Repository
• Jira ticket attribute to update:
  • Team
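The attribute lists above map naturally onto a supervised-learning example: the five attributes that are read become the input, and Team becomes the label to predict. The sketch below shows one way to build such a pair from a ticket. The dictionary layout, the sample ticket, and the `to_example` helper are assumptions; a real Jira export may use different column names and value formats.

```python
FEATURE_FIELDS = ["Summary", "Components", "Labels", "Area", "Repository"]
TARGET_FIELD = "Team"


def to_example(ticket):
    """Join the input attributes into one text string and return (text, team)."""
    parts = []
    for field in FEATURE_FIELDS:
        value = ticket.get(field) or ""
        if isinstance(value, (list, tuple)):
            # Multi-valued attributes are flattened into a single string.
            value = " ".join(value)
        parts.append(str(value))
    text = " ".join(p for p in parts if p)
    return text, ticket.get(TARGET_FIELD)


# Made-up ticket, for illustration only.
ticket = {
    "Summary": "Login page crashes on submit",
    "Components": ["auth", "frontend"],
    "Labels": ["regression"],
    "Area": "Web",
    "Repository": "web-portal",
    "Team": "Identity",
}
text, team = to_example(ticket)
print(text)
print(team)
```

Concatenating everything into one string is the simplest choice; keeping categorical attributes such as Area or Repository as separate features is an equally reasonable alternative.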

Slide 42

Slide 42 text

Use case: Triage
• High-level plan
  • Extract data from Jira → CSV
  • Use Python and Pandas to read CSV and clean it up
  • Use NLP to extract the main info from text
    • Remove stopwords, punctuation, etc.
    • Use stemming to get to the root of words
  • Use XGBClassifier (from the xgboost library, which follows the scikit-learn API)
  • Hook Python script to ML algorithm
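The text-preparation steps in the plan (punctuation and stopword removal, then stemming) can be sketched as follows. This is a simplified stand-in using only the standard library: the stopword list is tiny, and `naive_stem` is a crude suffix stripper invented for the example, not a real stemming algorithm such as Porter's. The classifier step is left out because it depends on an external library.

```python
import string
from collections import Counter

# A tiny illustrative stopword list; real projects typically use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "to", "of", "and", "when"}
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stopwords, then stem each token."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in no_punct.split() if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("The login page is crashing when users submit the form!")
bag_of_words = Counter(tokens)  # word counts, a simple numeric representation
print(tokens)
print(bag_of_words)
```

Word counts like `bag_of_words` are one simple way to turn the cleaned text into numbers that a classifier can consume.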

Slide 43

Slide 43 text

TOOLING

Slide 44

Slide 44 text

Tooling: Languages

Slide 45

Slide 45 text

Tooling
• Local notebooks
  • Jupyter notebooks (.ipynb)
• Cloud notebooks
  • Google Colaboratory
  • Amazon SageMaker

Slide 46

Slide 46 text

Tooling: Deployment

Slide 47

Slide 47 text

DEMO

Slide 48

Slide 48 text

Closing thoughts
• Data science is for the many, not the few
• Save data now for what you might want to ask in the future
• Workflows are useful tools in projects
• Data science can be hard, and so can some business requirements

Slide 49

Slide 49 text

Q&A