Introduction to Workflow for Data Science
Santiago Lizardo
2019/12/11
About the speaker
• Sr. Engineering Manager
• Hobbyist Data Scientist
About the talk
• High-level overview of the data science workflow
• Use case in bug triage
• Tooling for machine learning
• Small demo
Data science workflow
Workflow definition
Data science workflow: 10,000-foot view
Data → Insights → Actions
Data science workflow: 10,000-foot view
Business question
Data → Insights → Answer → Actions
Business question examples
• Predictions
  • Are we going to grow by more than X% next quarter?
  • Is this customer going to leave in the next 5 months?
• Classification
  • Is this a server or a workstation?
  • What type of server is this? (web server, email, file, proxy, …)
• Anomaly detection
  • Is this spike in uploads normal?
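To make the anomaly detection question concrete, here is a minimal sketch of one simple approach, a z-score test against recent history. The function name `is_anomalous`, the threshold of 3 standard deviations, and the upload counts are all assumptions made up for illustration; the talk does not prescribe a method.

```python
from statistics import mean, stdev

def is_anomalous(history, new_value, threshold=3.0):
    """Return True when new_value lies more than `threshold` standard
    deviations away from the mean of `history` (a plain z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values are identical: any different value is unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Hypothetical daily upload counts, invented for this example.
daily_uploads = [120, 131, 118, 125, 129, 122, 127]

print(is_anomalous(daily_uploads, 126))  # a day in the usual range
print(is_anomalous(daily_uploads, 480))  # a large spike
```

A z-score assumes roughly stable, bell-shaped data; seasonal or trending series would need a different baseline.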
Workflow example
[Ten-slide figure sequence; no text recovered]
Industry standards
• Knowledge Discovery in Databases (KDD)
• Analytics Solutions Unified Method for Data Mining (ASUM-DM)
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• Team Data Science Process (TDSP)
Data wrangling | Data ingestion
• Identify the data source(s)
• Static dumps (CSV, TSV, Parquet, …)
• Live databases (SQL, NoSQL, …)
• Data scraping (Web)
• Database (SQL)
• Exposing it to the data science project
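As a rough illustration of the ingestion step, the sketch below reads the same records from a static CSV dump and from a live SQL database and exposes both as lists of dictionaries. It uses only the Python standard library; the helper names `rows_from_csv` and `rows_from_sql`, the `tickets` table, and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real server.

```python
import csv
import io
import sqlite3

def rows_from_csv(text_stream):
    """Read a static CSV dump into a list of dicts keyed by column name."""
    return list(csv.DictReader(text_stream))

def rows_from_sql(connection, query):
    """Run a query against a database and return a list of dicts."""
    connection.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in connection.execute(query)]

# Static dump: an in-memory stand-in for a CSV file on disk.
csv_dump = io.StringIO("id,summary\n1,Login page crashes\n2,Slow upload\n")
csv_rows = rows_from_csv(csv_dump)

# Live database: in-memory SQLite stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, summary TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(1, "Login page crashes"), (2, "Slow upload")],
)
sql_rows = rows_from_sql(conn, "SELECT id, summary FROM tickets ORDER BY id")

print(csv_rows)
print(sql_rows)
```

Note that CSV values arrive as strings while the database returns typed values, which is one reason a cleaning step usually follows ingestion.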
Data and variable type
Model selection
• Model
  • Supervised
  • Unsupervised
Model selection
• Model
  • Supervised
    • Classification
    • Regression
  • Unsupervised
Model selection
• Model
  • Supervised
  • Unsupervised
    • Clustering
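The model selection tree on these slides can be written down as a small decision function. This is only a sketch of the branching shown above: the function name `model_family` and its two parameters are assumptions introduced for illustration.

```python
def model_family(has_labels, target_is_categorical=None):
    """Pick a model family from two facts about the data.

    has_labels: True when past examples come with a known answer.
    target_is_categorical: for labeled data, True when the answer is a
    category (e.g. a team name) and False when it is a number.
    """
    if not has_labels:
        return "unsupervised: clustering"
    if target_is_categorical is None:
        raise ValueError("labeled data needs target_is_categorical set")
    if target_is_categorical:
        return "supervised: classification"
    return "supervised: regression"

print(model_family(has_labels=True, target_is_categorical=True))
print(model_family(has_labels=True, target_is_categorical=False))
print(model_family(has_labels=False))
```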
Model selection
Model validation
Dissemination
• Understandable by stakeholders
• Digestible
• Concise
• Visual (charts, plots)
• Accurate
Use case: Triage
• Every morning, new bugs are reviewed
• The responsible engineering team is identified and the bug is assigned to it
• Assignment is based on the description, among other attributes
Use case: Triage
• Can it be programmed?
• Too many rules
• Use information about previous 30K tickets to classify new ones
• Error margin is acceptable
Use case: Triage
• Jira ticket attributes to read from:
  • Summary
  • Components
  • Labels
  • Area
  • Repository
• Jira ticket attribute to update:
  • Team
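One way to picture the split between the attributes that are read and the attribute that is updated is a small record type. The `Ticket` class and the `to_features_and_target` helper below are hypothetical names for illustration, not part of Jira or of the speaker's code; the field types are also assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Ticket:
    # Attributes read as model inputs.
    summary: str
    components: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
    area: str = ""
    repository: str = ""
    # Attribute to update: the prediction target, unknown for new tickets.
    team: Optional[str] = None

def to_features_and_target(ticket):
    """Join the input attributes into one text and return (text, team)."""
    parts = [ticket.summary, *ticket.components, *ticket.labels,
             ticket.area, ticket.repository]
    text = " ".join(p for p in parts if p)
    return text, ticket.team

# Hypothetical ticket, invented for this example.
example = Ticket(
    summary="Login page crashes",
    components=["auth"],
    labels=["regression"],
    area="web",
    repository="frontend",
    team="Identity",
)
print(to_features_and_target(example))
```

Historical tickets carry a known team and serve as training examples; a new ticket has `team=None` until the classifier fills it in.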
Use case: Triage
• High-level plan
  • Extract data from Jira → CSV
  • Use Python and Pandas to read the CSV and clean it up
  • Use NLP to extract the main information from the text
    • Remove stopwords, punctuation, etc.
    • Use stemming to reduce words to their roots
  • Use XGBClassifier (from the xgboost package, which follows the scikit-learn estimator API)
  • Hook the Python script to the ML algorithm
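The text-cleaning and classification steps of the plan can be sketched end to end with the standard library alone. Several things here are assumptions and deliberate simplifications: the stopword list is tiny, `stem` is a crude suffix stripper rather than a real stemming algorithm, and a small Naive Bayes classifier (`NaiveBayesTriage`, a hypothetical name) is swapped in for XGBClassifier so the example has no external dependencies. The training tickets are invented.

```python
import math
import string
from collections import Counter, defaultdict

# Tiny illustrative stopword list; real projects use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "on", "in", "to", "when", "and", "of"}

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation and stopwords, then stem each word."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(w) for w in text.split() if w not in STOPWORDS]

class NaiveBayesTriage:
    """Multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, teams):
        self.word_counts = defaultdict(Counter)
        self.team_counts = Counter(teams)
        for text, team in zip(texts, teams):
            self.word_counts[team].update(preprocess(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = preprocess(text)
        total = sum(self.team_counts.values())

        def score(team):
            counts = self.word_counts[team]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.team_counts[team] / total)
            return log_prob + sum(
                math.log((counts[t] + 1) / denom) for t in tokens
            )

        return max(self.team_counts, key=score)

# Hypothetical historical tickets and the teams they were assigned to.
texts = [
    "Login page crashes when password is empty",
    "Password reset email not sent",
    "Upload of large files is slow",
    "File upload fails on timeout",
]
teams = ["Identity", "Identity", "Storage", "Storage"]

model = NaiveBayesTriage().fit(texts, teams)
print(preprocess("Login page crashes!"))
print(model.predict("Crash on login with wrong password"))
print(model.predict("Uploading a large file times out"))
```

With tens of thousands of real tickets, the same shape applies: the preprocessing output would be turned into numeric features and passed to a stronger classifier such as the one named in the plan.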
TOOLING
Tooling: Languages
Tooling
• Local notebooks
  • Jupyter notebooks (.ipynb)
• Cloud notebooks
  • Google Colaboratory
  • Amazon SageMaker
Tooling: Deployment
DEMO
Closing thoughts
• Data science is for the many, not the few
• Save data now for what you might want to ask in the future
• Workflows are useful tools in projects
• Data science can be hard, and so can some business requirements