ML pipelines quality assurance in production

This talk discusses questions such as: How do you avoid failing in production with your 99%-accuracy model? Which metrics should be checked, and when? How do you monitor live predictions? What should you do when your model's accuracy degrades? Using live examples, we cover the main quality assurance steps that can be applied alongside model development and deployment, and we present practices for unit testing and monitoring results in production.

Alex Tselikov

March 09, 2019

Transcript

  1. A Day In The Life Of A Data Scientist: Part 1
     1. Data preparation: 40%
        • Getting the data from different data sources
        • Making sure the data is correct
        • EDA, checking hypotheses, building prototypes
     2. Meetings: 30%
        • Clarifying requirements
        • “Translating” strange business needs to the data scientists in the team
        • Explaining models’ results
        • Explaining why Big Data is not the answer to life, the universe and everything

  2. A Day In The Life Of A Data Scientist: Part 2
     3. Learning & community: 20%
        • Reading white papers and DS blogs, courses, conferences
        • Code review with coworkers, discussions about the results of experiments
     4. Model building: 10%
        • Designing predictive models
        • Deploying models into production and checking results

  3. And Then …
     • Train
     • Test
     • Deploy
     • Fail
        • Why?
        • How to detect?
        • How to prevent?
     GINI – accuracy measure in machine learning

  4. Use Case: Credit Scoring Example
     Classical supervised ML task: a representation of the creditworthiness of an individual, used to:
        • approve credit
        • set a credit limit on credit/store cards
        • pre-approve additional credit for an existing customer

     Score (target) | Age | Monthly Income | Open Credit Lines | Times 90 Days Late | Real Estate Loans
     1              | 45  | 9120           | 13                | 0                  | 6
     0              | 40  | 2600           | 4                 | 0                  | 0
     0              | 38  | 3042           | 2                 | 1                  | 0
     0              | 30  | 3300           | 5                 | 0                  | 0
     1              | 49  | 6358           | 7                 | 0                  | 1
     0              | 74  | 3500           | 3                 | 0                  | 1

  5. Technical Metrics
     Measures to evaluate the performance of a binary classifier.
     ROC AUC
        • Range: from 0.5 (random guessing) to 1 (perfect separation)
        • Meaning: the right ordering of customers
     GINI
        • Standard measure for credit scoring: Gini = 2 * AUC - 1
        • Higher Gini means more predictive power
        • Meaning: separation of the score distributions

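     A minimal Python sketch (not from the original slides; the labels and scores are made up) showing how
     Gini relates to ROC AUC:

        import numpy as np
        from sklearn.metrics import roc_auc_score

        # Hypothetical data: 1 = customer went 90+ days late, score = predicted probability of that
        y_true  = np.array([1, 0, 0, 0, 1, 0])
        y_score = np.array([0.81, 0.32, 0.45, 0.28, 0.77, 0.35])

        auc = roc_auc_score(y_true, y_score)
        gini = 2 * auc - 1                     # Gini coefficient used in credit scoring
        print(f"AUC = {auc:.2f}, Gini = {gini:.2f}")
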
  6. Business Metrics
     Cumulative lift: the customer base, ordered by decreasing probability of being late in repaying.
        • Picking a random 10% of customers, we should get 10% of the positive responses
        • Picking the top 10% of customers based on the ML model, we should get ~37% of the positive responses
     Why is it important? Direct influence on the business!

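     A minimal sketch (an assumed implementation, not code from the deck) of cumulative lift at the top X%
     of the scored customer base:

        import numpy as np
        import pandas as pd

        def cumulative_lift(y_true, y_score, top_fraction=0.10):
            """Response rate in the top-scored bucket divided by the overall response rate."""
            df = pd.DataFrame({"y": y_true, "score": y_score}).sort_values("score", ascending=False)
            top_n = max(1, int(len(df) * top_fraction))
            top_rate = df["y"].head(top_n).mean()    # positive rate among the top-scored customers
            base_rate = df["y"].mean()               # positive rate when picking customers at random
            return top_rate / base_rate

        # A lift of ~3.7 at 10% corresponds to the slide's "~37% vs 10%" example:
        # lift_10 = cumulative_lift(y_val, model.predict_proba(X_val)[:, 1], top_fraction=0.10)
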
  7. Business Metrics: Direct Optimization
     What if the errors are not all equal? For example, a False Negative might cost €1k while a False
     Positive might cost €10k. While the default loss function treats all errors equally, with a custom
     objective function we can try to take this into account. XGBoost example (sketched below):

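     The code from this slide is not in the transcript; the following is only a sketch of one way to build a
     cost-weighted logistic objective for XGBoost, using the slide's €1k/€10k figures as relative weights:

        import numpy as np
        import xgboost as xgb

        FN_COST, FP_COST = 1.0, 10.0                     # assumed relative costs of the two error types

        def weighted_logloss(preds, dtrain):
            """Custom objective: logistic loss with class-dependent misclassification costs."""
            y = dtrain.get_label()
            p = 1.0 / (1.0 + np.exp(-preds))             # sigmoid of the raw margin
            w = np.where(y == 1, FN_COST, FP_COST)       # cost of getting this sample wrong
            grad = w * (p - y)                           # first-order gradient of the weighted loss
            hess = w * p * (1.0 - p)                     # second-order gradient
            return grad, hess

        # dtrain = xgb.DMatrix(X_train, label=y_train)
        # booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain, num_boost_round=200, obj=weighted_logloss)
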
  8. What Can Go Wrong: Biased Data
     • The train/test data is not a random sample from the general population
     • Sometimes a random sample is not really possible due to the business use case (costly)
     • You could simply be given some train/test data without access to production
     • As a result:
        • Different feature distributions
        • Wrong class imbalance
        • Difference in model performance
     Value counts for the most important features:

                    device_type         lte_device          class_memo_type
                    Value     Count     Value     Count     Value     Count
     train+test     -99.0     214.714   -99.0     214.839   nan       268.062
                    1.0        57.275   nan        53.358   -99        24.912
                    0.0         4.161   0.0        32.645   DSSN       13.020
     production     1         609.136   0         341.169   -99       259.697
                    0          63.862   1         331.054   DSSN      192.203
                    -99            173  -99            948  IDEV       82.996

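     A small helper (hypothetical, not from the deck) for producing this kind of comparison on a regular
     basis; the feature name follows the table above:

        import pandas as pd

        def compare_value_counts(train_df, prod_df, feature, top_k=5):
            """Side-by-side normalized value counts of one feature in train+test vs. production,
            to spot different encodings or missing-value conventions."""
            t = train_df[feature].value_counts(normalize=True, dropna=False).head(top_k)
            p = prod_df[feature].value_counts(normalize=True, dropna=False).head(top_k)
            return pd.concat({"train_test": t, "production": p}, axis=1)

        # print(compare_value_counts(train, production, "device_type"))
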
  9. What Can Go Wrong: A Shift In Population
     Population shift is a normal process over time. But how do we catch significant changes?
     Population stability: shows whether a scorecard has changed over a period of time.
     Approach: compare the histogram distributions of the score results and of the important features.

  10. Population Stability Index (PSI)
     Check the prediction (score) distribution on a monthly/daily basis:

     Model                            RecCount     score_max  score_min  score_avg  PSI
     m5_bank1_ours_rand_forest_v1_    50.470.957   0,82       0,12       0,42       0,001921
     m6_bank1_comp_rand_forest_v1_    164.239.489  0,79       0,18       0,47       0,010019
     m11_bank2_ours_xgb_v3_           50.470.957   0,99       0,14       0,94       0,001081
     m13_bank3_ours_rand_forest_v1_   50.470.957   0,79       0,11       0,46       0,001081
     m14_bank3_comp_rand_forest_v2_   164.239.489  0,77       0,24       0,53       0,005805

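     The PSI code itself is not shown in the deck; a minimal sketch of the usual formulation,
     PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%), could look like this:

        import numpy as np

        def psi(expected, actual, n_bins=10):
            """PSI between a baseline score distribution (e.g. training or last month) and the current one.
            Common rule of thumb: < 0.1 stable, 0.1-0.25 worth investigating, > 0.25 significant shift."""
            expected, actual = np.asarray(expected, float), np.asarray(actual, float)
            edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))   # decile bins from the baseline
            expected = np.clip(expected, edges[0], edges[-1])              # force values into the bin range
            actual = np.clip(actual, edges[0], edges[-1])
            exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
            act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
            exp_pct, act_pct = np.clip(exp_pct, 1e-6, None), np.clip(act_pct, 1e-6, None)
            return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
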
  11. Kaggle Trick: Adversarial Validation
     • Check the degree of similarity between train and test in terms of feature distributions
     • Combine the train and test sets and evaluate a binary classification task (ROC AUC)
     • Use ‘istrain’ as the target variable and fit a simple classification model (see the sketch below)
     • Low AUC is good: there is no significant difference between the train and test feature distributions
     • High AUC means you are dealing with completely different datasets
     • Extend this idea to check production data on a regular basis

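     A minimal sketch of the idea (the model choice and alert threshold are assumptions, not from the deck):

        import numpy as np
        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def adversarial_auc(train_df, test_df):
            """Fit a classifier to tell train rows from test rows.
            AUC near 0.5 -> similar feature distributions; AUC near 1.0 -> very different datasets."""
            X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
            y = np.r_[np.ones(len(train_df)), np.zeros(len(test_df))]    # the 'istrain' target
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

        # Run the same check against production data on a schedule (numeric features only):
        # auc = adversarial_auc(train[features], production_sample[features])
        # if auc > 0.8: ...   # raise an alert: the feature distributions have drifted
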
  12. What Can Go Wrong: Overfitting
     You need to be careful with:
     • Kaggle tricks:
        • Target encoding
        • Ensembling
     • Proper validation
     • BayesOpt/HyperOpt (tuning is sketched below)

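     The BayesOpt/HyperOpt snippet from the slide is missing from the transcript; this is only a sketch of
     hyperparameter search scored by cross-validation (the model, search space and data are placeholders):

        from hyperopt import fmin, tpe, hp
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=0)

        def objective(params):
            model = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                           max_depth=int(params["max_depth"]),
                                           random_state=0)
            # score by cross-validated AUC so the search does not overfit one lucky split
            auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
            return -auc                                   # hyperopt minimizes the returned value

        space = {"n_estimators": hp.quniform("n_estimators", 100, 500, 50),
                 "max_depth": hp.quniform("max_depth", 3, 10, 1)}
        best = fmin(objective, space, algo=tpe.suggest, max_evals=30)
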
  13. What Can Go Wrong: Target Encoding
     • Transforming categorical features into numerical features based on the mean of the response
     • One of the most powerful Kaggle tricks for classical ML competitions
     • But it is very easy to overfit, so use K-fold regularization for mean encodings
     Smoothed encoding for a category: (p_c * n_c + α * p_global) / (n_c + α), where
        • p_c is the target mean for the category
        • n_c is the number of samples in the category
        • p_global is the global target mean
        • α is a regularisation parameter

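     A sketch of out-of-fold mean encoding with the smoothing above (the column names and α are placeholders):

        import numpy as np
        import pandas as pd
        from sklearn.model_selection import KFold

        def kfold_target_encode(df, cat_col, target_col, alpha=10.0, n_splits=5, seed=0):
            """Smoothed mean (target) encoding computed out-of-fold to limit overfitting:
            each row is encoded with statistics taken from the other folds only."""
            p_global = df[target_col].mean()
            encoded = pd.Series(np.nan, index=df.index)
            for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
                stats = df.iloc[fit_idx].groupby(cat_col)[target_col].agg(["mean", "count"])
                # smoothed encoding: (p_c * n_c + alpha * p_global) / (n_c + alpha)
                smooth = (stats["mean"] * stats["count"] + alpha * p_global) / (stats["count"] + alpha)
                encoded.iloc[enc_idx] = df.iloc[enc_idx][cat_col].map(smooth).values
            return encoded.fillna(p_global)   # categories unseen in the fit folds get the global mean

        # train["device_type_te"] = kfold_target_encode(train, "device_type", "target")
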
  14. What Can Go Wrong: Ensembles vs Bagging
     • Ensembles: hard to build, maintain, validate and deploy
     • Bagging (decreases variance): averaging the same algorithm trained with different random states

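     A minimal sketch of bagging by random seed (the base model is an assumption; any estimator with a
     random_state parameter works the same way):

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier

        def bagged_predict_proba(X_train, y_train, X_new, n_models=5):
            """Train the same algorithm several times with different random_state values
            and average the predicted probabilities to reduce variance."""
            preds = []
            for seed in range(n_models):
                model = GradientBoostingClassifier(random_state=seed, subsample=0.8)
                model.fit(X_train, y_train)
                preds.append(model.predict_proba(X_new)[:, 1])
            return np.mean(preds, axis=0)      # averaged score, more stable than any single model
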
  15. What Can Go Wrong: Too Long Model Execution
     • Some pandas functions are not really scalable (an example follows below)
     • Clustering and dimension reduction:
        • t-SNE
        • DBSCAN

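     The pandas snippet from this slide is not in the transcript; a hypothetical illustration of the kind of
     pattern that does not scale, a row-wise apply versus the vectorized equivalent:

        import time
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"monthly_income": np.random.randint(1_000, 10_000, 200_000),
                           "open_credit_lines": np.random.randint(0, 15, 200_000)})

        t0 = time.time()
        slow = df.apply(lambda r: r["monthly_income"] / (r["open_credit_lines"] + 1), axis=1)
        print("row-wise apply:", round(time.time() - t0, 2), "s")      # a Python loop under the hood

        t0 = time.time()
        fast = df["monthly_income"] / (df["open_credit_lines"] + 1)    # vectorized, far faster
        print("vectorized:   ", round(time.time() - t0, 2), "s")
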
  16. Key Take-aways
     • Business metrics are important
     • You have to be involved at all stages, from getting the data to checking results in production
     • Monitor your model’s results and data distribution on a regular basis