What is your ML test score?

Using ML in real-world applications and production systems is a complex task involving issues rarely encountered in toy problems, R&D environments, or offline cases. Key considerations for assessing the decay, current status, and production readiness of ML systems include testing, monitoring, and logging, but how much is enough? It’s difficult to know where to get started or even to know who should be responsible for the testing and monitoring. If you’ve heard the phrase “test in production” too often when it comes to ML, perhaps you need to change your strategy.

Tania Allard dives deep into some of the most frequent issues encountered in real-life ML applications and how you can make your systems more robust, and she explores a number of indicators pointing to decay of models or algorithms in production systems. Some of the topics covered include problems and pitfalls of ML in production; introducing a rubric to test and monitor your ML applications; and testing data and features, testing your model development, monitoring your ML applications, and model decay.

You’ll leave with a clear rubric with actionable tests and examples to ensure the quality of your model in production is adequate. Engineers, DevOps, and data scientists will gain valuable guidelines to evaluate and improve the quality of their ML models before anything reaches the production stage.


Tania Allard

July 18, 2019


  1. What is your machine learning test score? Tania Allard, PhD, Developer Advocate @ Microsoft, Google Developer Expert - ML / TensorFlow
  2. Let’s avoid disappointment @ixek
  3. Scoring is also called prediction, and is the process of generating values based on a trained machine learning model, given some new input data.
  4. Scores may refer to a quantification of a model or algorithm performance on various metrics.
  5. So what are we talking about?
  6. This is what we are covering:
    • Machine learning systems validation / quality assurance
    • How to establish clear testing responsibilities
    • How to establish a rubric to measure how good we are at testing
    • We are not covering generic software engineering best practices
    • Or specific techniques like unit-testing, smoke or pen testing
    • This is not a technical dive on ML testing strategies
  7. Why do we need testing or quality assurance anyway?
  8. The “subtle” differences between production systems and offline or R&D examples
  9. The (ML) systems are continuously evolving: from collecting and aggregating more data, to retraining models and improving their accuracy
  10. Pet projects can be a bit more forgiving
  11. We can also get some good laughs... https://www.reddit.com/r/funny/comments/7r9ptc/i_took_a_few_shots_at_lake_louise_today_and/dsvv1nw/
  12. A high number of false negatives or type-II errors can lead to havoc (e.g. in the healthcare and financial sectors)
  13. Automation bias: “The tendency to disregard or not search for contradictory information in light of a computer-generated solution that is accepted as correct” (Parasuraman & Riley, 1997)
  14. (image only)
  15. Quality control and assurance should be performed before the consumption by users to increase the reliability and reduce bias in our systems
  16. Where do unit tests fit in software?
  17. (image only)
  18. If only ML looked like this
  19. But they look a bit more like this
  20. So what do we test?
  21. What should we keep an eye on
  22. Who is responsible?
  23. Keeping a score
    • For manual testing: 1 point
    • Automated testing: 1 point
  24. Features and data
  25. Test your features and distributions. Do they match your expectations? From the iris data set: is the sepal length consistent? Is the width what you’d expect?
  26. The cost of each feature
  27. Test the correlation between features and target
  28. https://www.tylervigen.com/spurious-correlations
  29. Test the correlation between features and target
  30. Test your privacy control across the pipeline. “Towards the science of security and privacy in Machine Learning”, N Papernot, P McDaniel et al. https://pdfs.semanticscholar.org/ebab/687cd1be7d25392c11f89fce6a63bef7219d.pdf
  31. Test all code that creates input features (Great Expectations - Python package)
  32. Model development
  33. Best practices
  34. Every piece of code is peer reviewed
  35. Test the impact of each tunable hyperparameter
  36. Test for model staleness
  37. Test against a simpler model
  38. Test for implicit bias
  39. Infrastructure
  40. Integration of the full pipeline: from ingestion through training and serving
  41. Test model quality before serving. Test against known output data
  42. Test how quickly and safely you can roll back
  43. Test the reproducibility of training. Train at least two models on the same data and compare differences in aggregated metrics, sliced metrics or example-by-example predictions.
  44. Adding up
  45. Getting your score
    1. Add points for features and data
    2. Add points for development
    3. Add points for infrastructure
    Which is your lowest score?
  46. 0 points: not production ready; 1-2 points: might have reliability holes; 3-4 points: reasonably tested; 5-6 points: good level of testing; 7+ points: very strong levels of automated testing
  47. Thank you
  48. Rate today’s session Session page on conference website O’Reilly Events
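
Slide 25 asks whether features such as the iris sepal length and width match your expectations. A minimal sketch of such a check follows. The hard-coded rows, the `EXPECTED_RANGES` bounds, and the `check_ranges` helper are illustrative assumptions, not material from the talk; the talk itself points to the Great Expectations package for this kind of test.

```python
# Illustrative rows: (sepal_length_cm, sepal_width_cm). Hard-coded for the
# sketch; in practice these would come from your feature pipeline.
ROWS = [(5.1, 3.5), (4.9, 3.0), (4.7, 3.2), (4.6, 3.1), (5.0, 3.6)]

# Hypothetical expectations: (min, max) bounds agreed on for each feature.
EXPECTED_RANGES = {
    "sepal_length_cm": (4.0, 8.0),
    "sepal_width_cm": (2.0, 4.5),
}


def check_ranges(rows, expected):
    """Return (row_index, feature, value) for every out-of-range value."""
    names = list(expected)
    violations = []
    for i, row in enumerate(rows):
        for name, value in zip(names, row):
            low, high = expected[name]
            if not (low <= value <= high):
                violations.append((i, name, value))
    return violations


violations = check_ranges(ROWS, EXPECTED_RANGES)
# A row with a suspicious sepal length, e.g. caused by a unit mix-up.
bad = check_ranges([(50.1, 3.5)], EXPECTED_RANGES)
```

An empty `violations` list means every value fell inside the agreed bounds; a non-empty list tells you which row and feature to inspect.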
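
Slides 27 and 29 recommend testing the correlation between features and target. One possible sketch, using a hand-written Pearson coefficient and a hypothetical `threshold` for flagging weak features, is shown here. The feature names, the data, and the threshold value are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def weakly_correlated(features, target, threshold=0.1):
    """Names of features whose |correlation| with the target is below threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) < threshold
    ]


# Hypothetical feature columns and target.
features = {"signal": [1, 2, 3, 4, 5], "constant": [7, 7, 7, 7, 7]}
target = [2, 4, 6, 8, 10]
flagged = weakly_correlated(features, target)
```

As the spurious-correlations link on slide 28 suggests, a high coefficient alone does not show that a feature is meaningful, so a check like this is a starting point for review rather than a verdict.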
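
Slide 37 suggests testing against a simpler model. A minimal sketch, assuming a majority-class baseline and a hypothetical minimum accuracy gain `min_gain`, might look like this. The labels and predictions are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for all n examples."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


def beats_baseline(y_true, model_pred, baseline_pred, min_gain=0.05):
    """True if the model improves on the baseline by at least min_gain."""
    gain = accuracy(y_true, model_pred) - accuracy(y_true, baseline_pred)
    return gain >= min_gain


# Hypothetical labels and model output.
y_train = ["ok", "ok", "ok", "fraud"]
y_test = ["ok", "fraud", "ok", "ok", "fraud"]
model_pred = ["ok", "fraud", "ok", "ok", "ok"]
baseline_pred = majority_baseline(y_train, len(y_test))
model_is_worth_it = beats_baseline(y_test, model_pred, baseline_pred)
```

If a complex model cannot clear the baseline by the agreed margin, the extra cost of training and serving it is hard to justify.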
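
Slide 43 recommends training at least two models on the same data and comparing aggregated metrics and example-by-example predictions. The sketch below compares the predictions of two runs; the `compare_runs` helper and the prediction lists are hypothetical, and sliced metrics are left out for brevity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_runs(y_true, preds_a, preds_b):
    """Compare two training runs at the aggregate and per-example level."""
    disagreements = [
        i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b
    ]
    return {
        "accuracy_gap": abs(accuracy(y_true, preds_a) - accuracy(y_true, preds_b)),
        "disagreements": disagreements,
    }


# Hypothetical predictions from two runs trained on the same data.
y_true = [1, 0, 1, 1]
run_a = [1, 0, 1, 0]
run_b = [1, 0, 0, 1]
report = compare_runs(y_true, run_a, run_b)
```

In this made-up case both runs reach the same accuracy while disagreeing on two examples, which is why the aggregated metric alone is not enough to judge reproducibility.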
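
The scoring steps on slides 23, 45, and 46 can be sketched as a small calculation. This assumes that the manual and automated points from slide 23 add up for each test, and that the final score is the lowest of the three section scores, which is one reading of the "lowest score" question on slide 45. The section contents below are hypothetical.

```python
def section_score(tests):
    """Sum points for a section.

    tests is a list of (manual_done, automated_done) booleans, one pair per
    rubric test. Assumption: each flag is worth 1 point and both can count.
    """
    return sum(int(manual) + int(automated) for manual, automated in tests)


def final_score(sections):
    """Assumed reading of slide 45: the final score is the lowest section score."""
    return min(section_score(tests) for tests in sections.values())


def interpret(score):
    """Map a score to the bands listed on slide 46."""
    if score == 0:
        return "not production ready"
    if score <= 2:
        return "might have reliability holes"
    if score <= 4:
        return "reasonably tested"
    if score <= 6:
        return "good level of testing"
    return "very strong levels of automated testing"


# Hypothetical team status: (manual_done, automated_done) per rubric test.
sections = {
    "features_and_data": [(True, True), (True, False), (False, False)],
    "model_development": [(True, False), (False, False)],
    "infrastructure": [(True, True), (True, True)],
}
score = final_score(sections)
verdict = interpret(score)
```

Taking the minimum means a strong infrastructure score cannot hide a weakly tested model development section.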