
What is your ML test score?

Using ML in real-world applications and production systems is a complex task involving issues rarely encountered in toy problems, R&D environments, or offline experiments. Testing, monitoring, and logging are key to assessing the decay, current status, and production readiness of ML systems, but how much is enough? It’s difficult to know where to get started, or even who should be responsible for the testing and monitoring. If you’ve heard the phrase “test in production” too often when it comes to ML, perhaps you need to change your strategy.

Tania Allard dives deep into some of the most frequent issues encountered in real-life ML applications and how you can make your systems more robust, and she explores a number of indicators pointing to decay of models or algorithms in production systems. Topics covered include: problems and pitfalls of ML in production; a rubric to test and monitor your ML applications; testing data and features; testing model development; monitoring ML applications; and model decay.

You’ll leave with a clear rubric with actionable tests and examples to ensure the quality of models in production is adequate. Engineers, DevOps, and data scientists will gain valuable guidelines to evaluate and improve the quality of their ML models before anything reaches the production stage.

Tania Allard

July 18, 2019

Transcript

  1. What is your machine learning test score? Tania Allard, PhD. Developer Advocate @ Microsoft, Google Developer Expert in ML / TensorFlow.
  2. Scoring is also called prediction, and is the process of generating values based on a trained machine learning model, given some new input data. @ixek
  3. Scores may also refer to a quantification of a model's or algorithm's performance on various metrics. @ixek
  4. This is what we are covering: • Machine learning systems validation / quality assurance • How to establish clear testing responsibilities • How to establish a rubric to measure how good we are at testing • We are not covering generic software engineering best practices • Or specific techniques like unit testing, smoke, or pen testing • This is not a technical deep dive into ML testing strategies. @ixek
  5. ML systems are continuously evolving: from collecting and aggregating more data, to retraining models and improving their accuracy. @ixek
  6. We can also get some good laughs... @ixek https://www.reddit.com/r/funny/comments/7r9ptc/i_took_a_few_shots_at_lake_louise_today_and/dsvv1nw/
  7. A high number of false negatives (type-II errors) can lead to havoc, e.g. in the healthcare and financial sectors. @ixek (see the false-negative-rate sketch after the transcript)
  8. Automation bias: “The tendency to disregard or not search for contradictory information in light of a computer-generated solution that is accepted as correct” (Parasuraman & Riley, 1997). @ixek
  9. Quality control and assurance should be performed before consumption by users, to increase the reliability of our systems and reduce their bias. @ixek
  10. Test your features and distributions: do they match your expectations? From the iris data set: is the sepal length consistent? Is the width what you’d expect? @ixek (see the distribution-check sketch after the transcript)
  11. Test your privacy controls across the pipeline. See: Towards the Science of Security and Privacy in Machine Learning, N. Papernot, P. McDaniel, et al. https://pdfs.semanticscholar.org/ebab/687cd1be7d25392c11f89fce6a63bef7219d.pdf @ixek
  12. Test the reproducibility of training: train at least two models on the same data and compare differences in aggregated metrics, sliced metrics, or example-by-example predictions. @ixek (see the reproducibility sketch after the transcript)
  13. Getting your score: 1. Add points for features and data. 2. Add points for development. 3. Add points for infrastructure. Which is your lowest score? @ixek (see the scoring sketch after the transcript)
  14. 0 points: not production ready. 1-2 points: might have reliability holes. 3-4 points: reasonably tested. 5-6 points: good level of testing. 7+ points: very strong levels of automated testing. @ixek