Workflow in a team of data scientists (tech talk for colleagues)

Slide 1

Slide 1 text

Workflow in a team of data scientists

Slide 2

Slide 2 text

Understanding of data science process data models

Slide 3

Slide 3 text

Understanding of data science process data models

Slide 4

Slide 4 text

Types of ML

Slide 5

Slide 5 text

Supervised Machine Learning

Slide 6

Slide 6 text

Data Science workflow by CRISP-DM

Slide 7

Slide 7 text

Business understanding
● Define the problem in business terms: state the business question the future model should answer
  Example: detect and prevent fraud
● Define the data science problem
  Example: who is considered a fraudster, how to detect fraud
● Define what is needed to solve the problem: what data to gather and analyze

Slide 8

Slide 8 text

Data Understanding
● Gather the dataset

Slide 9

Slide 9 text

Data Understanding
● EDA (Exploratory Data Analysis)
Objectives:
1. Discover patterns
2. Spot anomalies
3. Frame hypotheses
4. Check assumptions
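To make the "spot anomalies" objective concrete, here is a minimal sketch using only the Python standard library. The column name `amounts`, its values, and the 1.5 x IQR rule are illustrative assumptions, not something prescribed by the slides; in practice a team would more likely explore the data with pandas and plots.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts; the last value is deliberately extreme.
amounts = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 14.7, 250.0]
summary = summarize(amounts)
outliers = iqr_outliers(amounts)
```

A flagged value is only a candidate anomaly: whether it is an error or a real but rare case is a question for the hypothesis and assumption checks.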

Slide 10

Slide 10 text

Data Preparation (50-70% of project time)
● Data Preprocessing (handle missing values, wrong data types, etc.)
● Dataset Labeling (one-class problem)
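The preprocessing bullet (missing values, wrong data types) can be sketched as follows. This is a standard-library illustration under stated assumptions: the `raw_income` column, the list of tokens treated as missing, and median imputation are all hypothetical choices, and other strategies (dropping rows, model-based imputation) may fit a given dataset better.

```python
import statistics

# Assumption: these tokens mean "missing" in the raw export.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}

def to_float(raw):
    """Convert a raw field to float; return None if it is missing or unparsable."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() in MISSING_TOKENS:
        return None
    try:
        return float(text)
    except ValueError:
        return None

def impute_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]

# Hypothetical raw column mixing strings, numbers, and missing markers.
raw_income = ["1200", " 950 ", "", "n/a", 1500, None, "1100"]
parsed = [to_float(v) for v in raw_income]
clean_income = impute_median(parsed)
```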

Slide 11

Slide 11 text

Data Preparation
● Split the dataset into train/test sets (validation set, cross-validation folds)
● Feature Engineering
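The split into train/test sets and folds can be sketched like this. The helper names `split_train_test` and `k_folds`, the 20% test fraction, and the seed are assumptions made for illustration; libraries such as scikit-learn provide their own utilities for the same job.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_folds(rows, k=5):
    """Yield (train, validation) pairs; each row lands in exactly one validation fold."""
    for i in range(k):
        validation = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, validation

rows = list(range(20))  # stand-in for 20 dataset rows
train, test = split_train_test(rows)
folds = list(k_folds(train, k=4))
```

The test set is put aside once and is not touched during tuning; the folds are built from the training part only.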

Slide 12

Slide 12 text

Modeling
● Creating a baseline model
● Choosing algorithms
● Feature selection
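One common way to get a baseline is to always predict the most frequent class; any real model then has to beat that number. The class name `MajorityClassBaseline` and the labels below are hypothetical, and the slides do not say which baseline the team actually uses.

```python
from collections import Counter

class MajorityClassBaseline:
    """Always predicts the most frequent label seen during fit."""

    def fit(self, labels):
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_rows):
        return [self.prediction_] * n_rows

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = fraud, 0 = not fraud.
y_train = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_test = [0, 0, 1, 0, 0]

baseline = MajorityClassBaseline().fit(y_train)
baseline_accuracy = accuracy(y_test, baseline.predict(len(y_test)))
```

On imbalanced data such a baseline can look deceptively accurate while catching no fraud at all, which is one reason to track more than accuracy.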

Slide 13

Slide 13 text

Modeling
● Tune model hyperparameters
  - to achieve higher accuracy
  - to improve model performance
sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
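A simple way to tune hyperparameters is a grid search: try every combination from a small grid and keep the one with the best validation score. The sketch below borrows three parameter names from the signature above (`learning_rate`, `n_estimators`, `max_depth`), but the grid values and the function `toy_validation_score` are hypothetical stand-ins; in a real project that function would train the model and score it on validation data.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # on ties, the first combination found is kept
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(params):
    """Hypothetical stand-in for 'train with params, return validation score'."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - abs(params["max_depth"] - 3) / 10

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [100, 200],
    "max_depth": [2, 3, 5],
}
best_params, best_score = grid_search(param_grid, toy_validation_score)
```

The grid grows multiplicatively (here 3 x 2 x 3 = 18 runs), so it is usually kept small or replaced by a random search.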

Slide 14

Slide 14 text

Modeling
● Tune model hyperparameters
  - to avoid overfitting
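Overfitting shows up as a gap between the score on the training data and the score on held-out data. The sketch below uses a tiny k-nearest-neighbours classifier as a stand-in (it is not the gradient boosting model from the previous slide), with made-up one-dimensional data that contains two noisy labels. The hyperparameter is `k`, the number of neighbours.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Accuracy of a k-NN model built on train when evaluated on data."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Hypothetical data: label is 1 for x > 5, with noisy labels at x=4 and x=8.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
validation = [(1.5, 0), (3.6, 0), (4.4, 0), (6.5, 1), (7.6, 1), (8.4, 1)]

# For each k: (training accuracy, validation accuracy)
report = {k: (accuracy(train, train, k), accuracy(train, validation, k))
          for k in (1, 3)}
```

With `k=1` the model memorises the training set, noise included, and does much worse on the validation points; `k=3` gives up some training accuracy and generalises better on this toy data. The hyperparameter is therefore chosen by the validation score, not the training score.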

Slide 15

Slide 15 text

Evaluation
- Model performance metrics (AUC, Gini, F1, confusion matrix, etc.)
- Business metrics (profits, approval rate, default rate, etc.)
- Evaluate whether the business goals are achieved
Some models may not reach the deployment stage after evaluation.
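The model performance metrics listed above can be computed by hand on a small example. This is a sketch with made-up labels and scores and an assumed decision threshold of 0.5; the Gini coefficient is taken in its usual scoring sense, Gini = 2 * AUC - 1.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    return 2 * auc(y_true, scores) - 1

# Hypothetical labels and model scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.5, 0.1, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

matrix = confusion_matrix(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc_value = auc(y_true, scores)
gini_value = gini(y_true, scores)
```

AUC and Gini depend only on how the scores rank the cases, while F1 and the confusion matrix change with the chosen threshold.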

Slide 16

Slide 16 text

Deployment
[Diagram: data about an application, OneKarma, Scoring service API, trained model, model score, fetch data]

Slide 17

Slide 17 text

[Diagram: Request JSON -> trained model -> Response JSON]
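The request/response exchange with the trained model might look roughly like the sketch below. Everything specific here is an assumption for illustration: the field names `application_id`, `features`, and `score`, the `income` feature, and the function `toy_model_score`, which merely stands in for the trained model. The real scoring service API is not described in the slides.

```python
import json

def toy_model_score(features):
    """Hypothetical stand-in for the trained model; returns a score in [0, 1]."""
    return round(min(1.0, max(0.0, features["income"] / 5000)), 3)

def handle_request(request_json):
    """Parse a request JSON, score it, and build the response JSON."""
    payload = json.loads(request_json)
    score = toy_model_score(payload["features"])
    return json.dumps({"application_id": payload["application_id"], "score": score})

request_json = json.dumps({"application_id": "A-1", "features": {"income": 2500}})
response_json = handle_request(request_json)
response = json.loads(response_json)
```

Keeping the model behind a JSON interface like this lets the development and QA teams integrate and test the service without knowing anything about the model internals.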

Slide 18

Slide 18 text

Time frame for each stage
Business understanding - 1 week
Data understanding - 3 weeks
Data preparation - 5 weeks
Modeling - 2 weeks
Evaluation - 1 week
Deployment - 1 week
Full model development process - ~13 weeks

Slide 19

Slide 19 text

Data scientists Data engineers Data scientists Business side Data scientists Data engineers Data scientists Data scientists Data engineers Development team QA team Data scientists Business side

Slide 20

Slide 20 text

Thank you for your attention.