ML pipelines quality assurance in production

This talk discusses questions such as: How do you avoid failing in production with your 99%-accuracy model? Which metrics should be checked, and when? How do you monitor live predictions? What should you do when your model's accuracy degrades? Using live examples, we cover the main quality assurance steps that can be applied alongside model development and deployment, and we present practices for unit testing and monitoring results in production.

Alex Tselikov

March 09, 2019

Transcript

  1. A Day In The Life Of A Data Scientist: Part 1
     1. Data preparation: 40%
        • Getting the data from different data sources
        • Making sure the data is correct
        • EDA, checking hypotheses, building prototypes
     2. Meetings: 30%
        • Clarifying requirements
        • “Translating” strange business needs to the data scientists in the team
        • Explaining models’ results
        • Explaining why Big Data is not the answer to life, the universe and everything

  2. A Day In The Life Of A Data Scientist: Part 2
     3. Learning & community: 20%
        • Reading white papers and DS blogs, courses, conferences
        • Code review with coworkers, discussions about the results of experiments
     4. Model building: 10%
        • Designing predictive models
        • Deploying models into production and checking results

  3. And Then …
     • Train
     • Test
     • Deploy
     • Fail
        • Why?
        • How to detect?
        • How to prevent?
     GINI – accuracy measure in machine learning

  4. Use Case: Credit Scoring Example
     Classical supervised ML task: a representation of the creditworthiness of an individual, used to:
        • approve credit
        • set a credit limit on credit/store cards
        • pre-approve additional credit for an existing customer

     Score (target) | Age | Monthly Income | Open Credit Lines | Times 90 Days Late | Real Estate Loans
     1              | 45  | 9120           | 13                | 0                  | 6
     0              | 40  | 2600           | 4                 | 0                  | 0
     0              | 38  | 3042           | 2                 | 1                  | 0
     0              | 30  | 3300           | 5                 | 0                  | 0
     1              | 49  | 6358           | 7                 | 0                  | 1
     0              | 74  | 3500           | 3                 | 0                  | 1

  5. Technical Metrics
     Measures to evaluate the performance of a binary classifier.
     ROC AUC
        • Range: from 0.5 (random guessing) to 1 (perfect separation)
        • Meaning: the right ordering of customers
     GINI
        • Standard measure for credit scoring: Gini = 2 * AUC - 1
        • Higher Gini means more predictive power
        • Meaning: separation of the score distributions

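     A minimal Python sketch (not from the original slides; the labels and scores are made up) showing how
     Gini relates to ROC AUC:

        import numpy as np
        from sklearn.metrics import roc_auc_score

        # Hypothetical data: 1 = customer went 90+ days late, score = predicted probability of that
        y_true  = np.array([1, 0, 0, 0, 1, 0])
        y_score = np.array([0.81, 0.32, 0.45, 0.28, 0.77, 0.35])

        auc = roc_auc_score(y_true, y_score)
        gini = 2 * auc - 1                     # Gini coefficient used in credit scoring
        print(f"AUC = {auc:.2f}, Gini = {gini:.2f}")
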
  6. Business Metrics
     Cumulative lift: the customer base, ordered by decreasing probability of being late in repaying.
        • Picking a random 10% of customers, we should get 10% of the positive responses
        • Picking the top 10% of customers based on the ML model, we should get ~37% of the positive responses
     Why is it important? Direct influence on the business!

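     A minimal sketch (an assumed implementation, not code from the deck) of cumulative lift at the top X%
     of the scored customer base:

        import numpy as np
        import pandas as pd

        def cumulative_lift(y_true, y_score, top_fraction=0.10):
            """Response rate in the top-scored bucket divided by the overall response rate."""
            df = pd.DataFrame({"y": y_true, "score": y_score}).sort_values("score", ascending=False)
            top_n = max(1, int(len(df) * top_fraction))
            top_rate = df["y"].head(top_n).mean()    # positive rate among the top-scored customers
            base_rate = df["y"].mean()               # positive rate when picking customers at random
            return top_rate / base_rate

        # A lift of ~3.7 at 10% corresponds to the slide's "~37% vs 10%" example:
        # lift_10 = cumulative_lift(y_val, model.predict_proba(X_val)[:, 1], top_fraction=0.10)
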
  7. Business Metrics: Direct Optimization
     What if the errors are not all equal? For example, a False Negative might cost €1k while a False
     Positive might cost €10k. While the default loss function treats all errors equally, with a custom
     objective function we can try to take this into account. XGBoost example (sketched below):

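     The code from this slide is not in the transcript; the following is only a sketch of one way to build a
     cost-weighted logistic objective for XGBoost, using the slide's €1k/€10k figures as relative weights:

        import numpy as np
        import xgboost as xgb

        FN_COST, FP_COST = 1.0, 10.0                     # assumed relative costs of the two error types

        def weighted_logloss(preds, dtrain):
            """Custom objective: logistic loss with class-dependent misclassification costs."""
            y = dtrain.get_label()
            p = 1.0 / (1.0 + np.exp(-preds))             # sigmoid of the raw margin
            w = np.where(y == 1, FN_COST, FP_COST)       # cost of getting this sample wrong
            grad = w * (p - y)                           # first-order gradient of the weighted loss
            hess = w * p * (1.0 - p)                     # second-order gradient
            return grad, hess

        # dtrain = xgb.DMatrix(X_train, label=y_train)
        # booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain, num_boost_round=200, obj=weighted_logloss)
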
  8. What Can Go Wrong: Biased Data
     • The train/test data is not a random sample from the general population
     • Sometimes a random sample is not really possible due to the business use case (costly)
     • You could simply be given some train/test data without access to production
     • As a result:
        • Different feature distributions
        • Wrong class imbalance
        • Difference in model performance
     Value counts for the most important features:

                    device_type         lte_device          class_memo_type
                    Value     Count     Value     Count     Value     Count
     train+test     -99.0     214.714   -99.0     214.839   nan       268.062
                    1.0        57.275   nan        53.358   -99        24.912
                    0.0         4.161   0.0        32.645   DSSN       13.020
     production     1         609.136   0         341.169   -99       259.697
                    0          63.862   1         331.054   DSSN      192.203
                    -99            173  -99            948  IDEV       82.996

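     A small helper (hypothetical, not from the deck) for producing this kind of comparison on a regular
     basis; the feature name follows the table above:

        import pandas as pd

        def compare_value_counts(train_df, prod_df, feature, top_k=5):
            """Side-by-side normalized value counts of one feature in train+test vs. production,
            to spot different encodings or missing-value conventions."""
            t = train_df[feature].value_counts(normalize=True, dropna=False).head(top_k)
            p = prod_df[feature].value_counts(normalize=True, dropna=False).head(top_k)
            return pd.concat({"train_test": t, "production": p}, axis=1)

        # print(compare_value_counts(train, production, "device_type"))
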
  9. What Can Go Wrong: A Shift In Population
     Population shift is a normal process over time. But how do we catch significant changes?
     Population stability: shows whether a scorecard has changed over a period of time.
     Approach: compare the histogram distributions of the score results and of the important features.

  10. Population Stability Index (PSI)
     Check the prediction (score) distribution on a monthly/daily basis:

     Model                            RecCount     score_max  score_min  score_avg  PSI
     m5_bank1_ours_rand_forest_v1_    50.470.957   0,82       0,12       0,42       0,001921
     m6_bank1_comp_rand_forest_v1_    164.239.489  0,79       0,18       0,47       0,010019
     m11_bank2_ours_xgb_v3_           50.470.957   0,99       0,14       0,94       0,001081
     m13_bank3_ours_rand_forest_v1_   50.470.957   0,79       0,11       0,46       0,001081
     m14_bank3_comp_rand_forest_v2_   164.239.489  0,77       0,24       0,53       0,005805

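     The PSI code itself is not shown in the deck; a minimal sketch of the usual formulation,
     PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%), could look like this:

        import numpy as np

        def psi(expected, actual, n_bins=10):
            """PSI between a baseline score distribution (e.g. training or last month) and the current one.
            Common rule of thumb: < 0.1 stable, 0.1-0.25 worth investigating, > 0.25 significant shift."""
            expected, actual = np.asarray(expected, float), np.asarray(actual, float)
            edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))   # decile bins from the baseline
            expected = np.clip(expected, edges[0], edges[-1])              # force values into the bin range
            actual = np.clip(actual, edges[0], edges[-1])
            exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
            act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
            exp_pct, act_pct = np.clip(exp_pct, 1e-6, None), np.clip(act_pct, 1e-6, None)
            return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
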
  11. Kaggle Trick: Adversarial Validation
     • Check the degree of similarity between train and test in terms of feature distributions
     • Combine the train and test sets and evaluate a binary classification task (ROC AUC)
     • Use ‘istrain’ as the target variable and fit a simple classification model (see the sketch below)
     • Low AUC is good: there is no significant difference between the train and test feature distributions
     • High AUC means you are dealing with completely different datasets
     • Extend this idea to check production data on a regular basis

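     A minimal sketch of the idea (the model choice and alert threshold are assumptions, not from the deck):

        import numpy as np
        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def adversarial_auc(train_df, test_df):
            """Fit a classifier to tell train rows from test rows.
            AUC near 0.5 -> similar feature distributions; AUC near 1.0 -> very different datasets."""
            X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
            y = np.r_[np.ones(len(train_df)), np.zeros(len(test_df))]    # the 'istrain' target
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

        # Run the same check against production data on a schedule (numeric features only):
        # auc = adversarial_auc(train[features], production_sample[features])
        # if auc > 0.8: ...   # raise an alert: the feature distributions have drifted
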
  12. What Can Go Wrong: Overfitting
     You need to be careful with:
     • Kaggle tricks:
        • Target encoding
        • Ensembling
     • Proper validation
     • BayesOpt/HyperOpt (tuning is sketched below)

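     The BayesOpt/HyperOpt snippet from the slide is missing from the transcript; this is only a sketch of
     hyperparameter search scored by cross-validation (the model, search space and data are placeholders):

        from hyperopt import fmin, tpe, hp
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        X_train, y_train = make_classification(n_samples=2000, n_features=20, random_state=0)

        def objective(params):
            model = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                           max_depth=int(params["max_depth"]),
                                           random_state=0)
            # score by cross-validated AUC so the search does not overfit one lucky split
            auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
            return -auc                                   # hyperopt minimizes the returned value

        space = {"n_estimators": hp.quniform("n_estimators", 100, 500, 50),
                 "max_depth": hp.quniform("max_depth", 3, 10, 1)}
        best = fmin(objective, space, algo=tpe.suggest, max_evals=30)
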
  13. What Can Go Wrong: Target Encoding
     • Transforming categorical features into numerical features based on the mean of the response
     • One of the most powerful Kaggle tricks for classical ML competitions
     • But it is very easy to overfit, so use K-fold regularization for mean encodings
     Smoothed encoding for a category: (p_c * n_c + α * p_global) / (n_c + α), where
        • p_c is the target mean for the category
        • n_c is the number of samples in the category
        • p_global is the global target mean
        • α is a regularisation parameter

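     A sketch of out-of-fold mean encoding with the smoothing above (the column names and α are placeholders):

        import numpy as np
        import pandas as pd
        from sklearn.model_selection import KFold

        def kfold_target_encode(df, cat_col, target_col, alpha=10.0, n_splits=5, seed=0):
            """Smoothed mean (target) encoding computed out-of-fold to limit overfitting:
            each row is encoded with statistics taken from the other folds only."""
            p_global = df[target_col].mean()
            encoded = pd.Series(np.nan, index=df.index)
            for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
                stats = df.iloc[fit_idx].groupby(cat_col)[target_col].agg(["mean", "count"])
                # smoothed encoding: (p_c * n_c + alpha * p_global) / (n_c + alpha)
                smooth = (stats["mean"] * stats["count"] + alpha * p_global) / (stats["count"] + alpha)
                encoded.iloc[enc_idx] = df.iloc[enc_idx][cat_col].map(smooth).values
            return encoded.fillna(p_global)   # categories unseen in the fit folds get the global mean

        # train["device_type_te"] = kfold_target_encode(train, "device_type", "target")
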
  14. What Can Go Wrong: Ensembles vs Bagging
     • Ensembles: hard to build, maintain, validate and deploy
     • Bagging (decreases variance): averaging the same algorithm trained with different random states

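     A minimal sketch of bagging by random seed (the base model is an assumption; any estimator with a
     random_state parameter works the same way):

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier

        def bagged_predict_proba(X_train, y_train, X_new, n_models=5):
            """Train the same algorithm several times with different random_state values
            and average the predicted probabilities to reduce variance."""
            preds = []
            for seed in range(n_models):
                model = GradientBoostingClassifier(random_state=seed, subsample=0.8)
                model.fit(X_train, y_train)
                preds.append(model.predict_proba(X_new)[:, 1])
            return np.mean(preds, axis=0)      # averaged score, more stable than any single model
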
  15. What Can Go Wrong: Too Long Model Execution
     • Some pandas functions are not really scalable (an example follows below)
     • Clustering and dimension reduction:
        • t-SNE
        • DBSCAN

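     The pandas snippet from this slide is not in the transcript; a hypothetical illustration of the kind of
     pattern that does not scale, a row-wise apply versus the vectorized equivalent:

        import time
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"monthly_income": np.random.randint(1_000, 10_000, 200_000),
                           "open_credit_lines": np.random.randint(0, 15, 200_000)})

        t0 = time.time()
        slow = df.apply(lambda r: r["monthly_income"] / (r["open_credit_lines"] + 1), axis=1)
        print("row-wise apply:", round(time.time() - t0, 2), "s")      # a Python loop under the hood

        t0 = time.time()
        fast = df["monthly_income"] / (df["open_credit_lines"] + 1)    # vectorized, far faster
        print("vectorized:   ", round(time.time() - t0, 2), "s")
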
  16. Key Take-aways
     • Business metrics are important
     • You have to be involved at all stages, from getting the data to checking results in production
     • Monitor your model’s results and data distribution on a regular basis