What is your ML test score?

Using ML in real-world applications and production systems is a complex task, involving issues rarely encountered in toy problems, R&D environments, or offline settings. Key considerations for assessing the decay, current status, and production readiness of ML systems include testing, monitoring, and logging, but how much is enough? It’s difficult to know where to get started, or even who should be responsible for the testing and monitoring. If you’ve heard the phrase “test in production” too often when it comes to ML, perhaps you need to change your strategy.

Tania Allard dives deep into some of the most frequent issues encountered in real-life ML applications and how you can make your systems more robust, and she explores a number of indicators that point to decay of models or algorithms in production systems. Topics covered include the problems and pitfalls of ML in production; a rubric for testing and monitoring your ML applications; testing data and features; testing your model development; monitoring your ML applications; and model decay.

You’ll leave with a clear rubric of actionable tests and examples to ensure the quality of your models in production is adequate. Engineers, DevOps practitioners, and data scientists will gain valuable guidelines to evaluate and improve the quality of their ML models before anything reaches the production stage.

Tania Allard

July 18, 2019

Transcript

  1. What is your
    machine learning
    test score?
    Tania Allard, PhD
    Developer Advocate @
    Microsoft
    Google Developer Expert - ML /
    TensorFlow

  2. 2
    Let’s avoid disappointment
    @ixek

  3. 3
    Scoring is also called prediction, and is the
    process of generating values based on a trained
    machine learning model, given some new input
    data.
    @ixek

  4. 4
    Scores may refer to a quantification of a
    model’s or algorithm’s performance on various
    metrics.
    @ixek

  5. 5
    @ixek
    So what are we
    talking about?

  6. 6
    This is what we are covering:
    @ixek
    ● Machine learning systems validation / quality assurance
    ● How to establish clear testing responsibilities
    ● How to establish a rubric to measure how good we are at testing
    ● We are not covering generic software engineering best practices
    ● Or specific techniques like unit-testing, smoke or pen testing
    ● This is not a deep technical dive into ML testing strategies

  7. 7
    Why do we need testing or quality assurance
    anyway?
    @ixek

  8. 8
    The “subtle” differences
    between production
    systems and offline or
    R&D examples
    @ixek

  9. 9
    The (ML) systems are continuously evolving:
    from collecting and aggregating more data,
    to retraining models and improving their
    accuracy
    @ixek

  10. 10
    @ixek
    Pet projects can
    be a bit more
    forgiving

  11. 11
    We can also get
    some good laughs...
    @ixek
    https://www.reddit.com/r/funny/comments/7r9ptc/i_took_a_few_shots_at_lake_louise_today_and/dsvv1nw/

  12. 12
    A high number of false negatives or type-II
    errors can lead to havoc (e.g. in the healthcare
    and financial sectors)
    @ixek

  13. 13
    @ixek
    Automation bias: “The tendency to disregard or
    not search for contradictory information in
    light of a computer-generated solution that is
    accepted as correct” (Parasuraman & Riley,
    1997)

  14. 14
    @ixek

  15. 15
    Quality control and assurance should
    be performed before consumption by users,
    to increase the reliability and reduce
    bias in our systems
    @ixek

  16. 16
    Where do unit tests fit in software?
    @ixek

  17. 17
    @ixek

  18. 18
    @ixek
    If only ML looked like this

  19. 19
    @ixek
    But they look a bit more like this

  20. 20
    @ixek
    So what do we test?

  21. 21
    @ixek
    What should we
    keep an eye on?

  22. 22
    Who is
    responsible?
    @ixek

  23. 23
    Keeping a score
    @ixek
    For manual testing: 1 point
    For automated testing: 1 point

  24. 24
    @ixek
    Features and data

  25. 25
    @ixek
    Test your features and distributions
    Do they match your expectations?
    From the iris data set: is the sepal
    length consistent? Is the width what
    you’d expect?
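
To make this concrete, here is a minimal pytest-style sketch (not from the slides) of such a check on the iris data; the exact bounds are illustrative assumptions you would replace with values agreed for your own features.

```python
# Minimal sketch: sanity-check feature ranges for the iris data.
# The bounds below are illustrative assumptions, not reference values.
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

def test_sepal_length_within_expected_range():
    assert df["sepal length (cm)"].between(4.0, 8.0).all()

def test_sepal_width_distribution_looks_plausible():
    width = df["sepal width (cm)"]
    assert width.between(1.5, 5.0).all()
    # A gross shift in the mean would point to a data problem upstream.
    assert 2.5 < width.mean() < 3.5
```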

  26. 26
    @ixek
    The cost of each feature

  27. 27
    @ixek
    Test the correlation between features and target
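
A minimal sketch of what such a check could look like, again using the iris data; the 0.1 cutoff is an arbitrary assumption for illustration.

```python
# Minimal sketch: every feature should carry at least some signal about the
# target. The 0.1 threshold is an illustrative assumption.
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

def test_each_feature_correlates_with_the_target():
    correlations = df.corr()["target"].drop("target")
    weak = correlations[correlations.abs() < 0.1]
    assert weak.empty, f"Features with almost no signal: {list(weak.index)}"
```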

  28. 28
    @ixek https://www.tylervigen.com/spurious-correlations

  29. 29
    @ixek
    Test the correlation between features and target

  30. 30
    @ixek
    Test your privacy control across the pipeline
    Towards the science of security and privacy in Machine Learning. N Papernot, P McDaniel et al.
    https://pdfs.semanticscholar.org/ebab/687cd1be7d25392c11f89fce6a63bef7219d.pdf

  31. 31
    @ixek Great Expectations - Python package
    Test all code that creates input features
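
A minimal sketch of how this might look with the Great Expectations package, assuming its classic pandas-backed API; the column names and bounds are placeholders for the output of your own feature code.

```python
# Minimal sketch using Great Expectations' pandas-backed interface
# (classic 0.x-style API). Columns and bounds are placeholders.
import great_expectations as ge
import pandas as pd

features = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})
dataset = ge.from_pandas(features)

dataset.expect_column_values_to_not_be_null("sepal_length")
dataset.expect_column_values_to_be_between("sepal_length", min_value=4.0, max_value=8.0)
dataset.expect_column_values_to_be_in_set(
    "species", ["setosa", "versicolor", "virginica"])

results = dataset.validate()
assert results["success"]
```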

  32. 32
    Model development

  33. 33
    @ixek
    Best practices

  34. 34
    @ixek
    Every piece of code is peer reviewed

  35. 35
    @ixek
    Test the impact of each tunable hyperparameter
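
As a sketch (not from the slides), one way to do this is a small sweep that checks the hyperparameter actually moves the validation metric; the model, values, and margin here are assumptions.

```python
# Minimal sketch: sweep one tunable hyperparameter and check it has a
# measurable effect on the cross-validated score. Values are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def test_regularisation_strength_has_measurable_impact():
    scores = [
        cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5).mean()
        for C in (0.001, 0.1, 10.0)
    ]
    # If every setting scores the same, the hyperparameter is doing nothing
    # (or the sweep range is wrong) and tuning effort is wasted.
    assert max(scores) - min(scores) > 0.01
```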

  36. 36
    @ixek
    Test for model staleness
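
A hedged sketch of a staleness check: compare the deployed model against a fresh retrain on recent data. `load_recent_data`, `load_deployed_model`, and `retrain` are hypothetical helpers standing in for your own pipeline, and the 0.05 margin is an assumption.

```python
# Minimal sketch: if a fresh retrain on recent data beats the deployed model
# by a wide margin, the model in production has gone stale.
# `load_recent_data`, `load_deployed_model` and `retrain` are hypothetical
# stand-ins for your own pipeline; the 0.05 margin is an assumption.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def test_deployed_model_is_not_stale():
    X, y = load_recent_data(days=30)
    X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

    deployed = load_deployed_model()
    candidate = retrain(X_train, y_train)

    deployed_acc = accuracy_score(y_eval, deployed.predict(X_eval))
    candidate_acc = accuracy_score(y_eval, candidate.predict(X_eval))
    assert candidate_acc - deployed_acc < 0.05
```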

  37. 37
    @ixek
    Test against a simpler model
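
A minimal sketch of a baseline comparison using scikit-learn's DummyClassifier; the dataset, model, and required margin are illustrative assumptions.

```python
# Minimal sketch: the trained model should clearly beat a trivial baseline,
# otherwise its extra complexity is not paying for itself.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def test_model_beats_a_trivial_baseline():
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    # The 0.1 required margin over the baseline is an illustrative assumption.
    assert model.score(X_test, y_test) > baseline.score(X_test, y_test) + 0.1
```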

  38. 38
    @ixek
    Test for implicit bias
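
One common way to probe for this is sliced evaluation: compute the same metric per group of a sensitive attribute and flag large gaps. This sketch assumes a fitted `model`, test data, and a `group` array from your own pipeline; the 0.1 maximum gap is an assumption.

```python
# Minimal sketch: sliced evaluation across a sensitive attribute.
# `model`, `X_test`, `y_test` and `group` come from your own pipeline;
# the 0.1 maximum gap is an illustrative assumption.
import numpy as np
from sklearn.metrics import accuracy_score

def check_accuracy_is_comparable_across_groups(model, X_test, y_test, group):
    per_group = {
        g: accuracy_score(y_test[group == g], model.predict(X_test[group == g]))
        for g in np.unique(group)
    }
    gap = max(per_group.values()) - min(per_group.values())
    assert gap < 0.1, f"Accuracy gap across groups too large: {per_group}"
```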

  39. 39
    Infrastructure
    @ixek

  40. 40
    @ixek
    Integration of the full pipeline
    From ingestion through training and serving
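
A hedged sketch of what an end-to-end smoke test could look like; every helper name here (`ingest`, `build_features`, `train`, `save_model`, `load_serving_model`) is a hypothetical placeholder for your own pipeline stages, and `tmp_path` is pytest's temporary-directory fixture.

```python
# Minimal sketch: a smoke test that exercises ingestion, feature building,
# training and serving on a tiny fixture dataset. All helpers are hypothetical
# placeholders for your own pipeline stages.
def test_full_pipeline_runs_end_to_end(tmp_path):
    raw = ingest("tests/fixtures/sample_events.csv")
    features = build_features(raw)
    model = train(features)
    model_path = save_model(model, tmp_path / "model.pkl")

    service = load_serving_model(model_path)
    prediction = service.predict(features.head(1))
    assert prediction is not None
```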

  41. 41
    @ixek
    Test model quality before serving
    Test against known output data
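
A minimal sketch of such a pre-serving gate, assuming a held-out validation set and a file of known-good ("golden") predictions; the 0.9 threshold and the file path are placeholders.

```python
# Minimal sketch: gate a candidate model on a validation set with known labels
# and on a stored set of known-good predictions. Threshold and paths are
# illustrative placeholders.
import json
from sklearn.metrics import accuracy_score

def check_candidate_before_serving(candidate_model, X_val, y_val, X_golden):
    assert accuracy_score(y_val, candidate_model.predict(X_val)) >= 0.9

    with open("tests/fixtures/golden_predictions.json") as f:
        expected = json.load(f)
    assert list(candidate_model.predict(X_golden)) == expected
```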

  42. 42
    @ixek
    Test how quickly and safely you can rollback

  43. 43
    Test the reproducibility of training
    Train at least two models on the same data and
    compare differences in aggregated metrics, sliced
    metrics, or example-by-example predictions.
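
A minimal sketch of a reproducibility check: train the same model twice on the same data with a fixed seed and compare both an aggregate metric and the example-by-example predictions.

```python
# Minimal sketch: two training runs on the same data should agree both on
# aggregate metrics and on example-by-example predictions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

def test_training_is_reproducible():
    model_a = RandomForestClassifier(random_state=42).fit(X, y)
    model_b = RandomForestClassifier(random_state=42).fit(X, y)
    assert model_a.score(X, y) == model_b.score(X, y)
    assert np.array_equal(model_a.predict(X), model_b.predict(X))
```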
    @ixek

  44. 44
    @ixek
    Adding up

  45. 45
    Getting your score
    @ixek
    1. Add points for features and data
    2. Add points for development
    3. Add points for infrastructure
    Which is your lowest score?
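
Reading the slide as "your overall score is your weakest section", the tallying could be sketched like this; the point values below are made-up examples.

```python
# Minimal sketch of the tallying above: one point per manual or automated test
# in each section, with the overall score taken as the weakest section.
# The points below are made-up example values.
section_points = {
    "features_and_data": [1, 1, 0],
    "model_development": [1, 0, 0],
    "infrastructure": [1, 1, 1],
}

section_scores = {name: sum(points) for name, points in section_points.items()}
overall_score = min(section_scores.values())
print(section_scores, "overall score:", overall_score)
```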

  46. 46
    0 points: not production ready
    1-2 points: might have reliability holes
    3-4 points: reasonably tested
    5-6 points: good level of testing
    7+ points: very strong levels of
    automated testing
    @ixek

  47. 47
    Thank you
    @ixek
