Testing Machine Learning Systems in Staging by John Cragg

Shannon
January 23, 2019

Testing the qualitative performance of an ML model is a difficult task, usually done only by measuring fluctuations in key business metrics in production. In this session John discusses his team's attempts to produce a deterministic test of algorithmic performance in Depop's staging environment for their product recommendation system.

Transcript

  1. Product Recommendations: We recommend products to users based on their in-app interactions (likes, saves, messages, comments, and purchases). We expect similar users to like similar products.
  2. So how is it usually done? Unit testing; monitoring the effects on business metrics; validating big data & ML pipelines.
  3. But wait John, that's too late! A/B testing can cause a bad user experience.
  4. What are the differences between staging and production? The size of the data: staging holds ~0.5M products and ~0.5M users; production holds ~93M products and ~12M users.
  5. How did we do it? Similar users should like similar products, so we partitioned users and products into classes. [Slide diagram: user IDs such as 30320 and 4356 assigned to user classes U0-U4.] (A data-generation sketch follows the transcript.)
  6. How did we do it? We created preferences between user and product classes. [Slide table: class U0 likes products in class P0 95% of the time, P1 50%, P2 15%, P3 5%, P4 1%.]
  7. User Product Interactions (Lambda): Interaction events are streamed to the data lake via Kinesis. The user and product IDs are logged in a Slack channel for internal use. [Slide annotation: "This bit!"] (A producer sketch follows the transcript.)
  8. How can we use it? Remember: we want similar users to be recommended similar products.
  9. How can we use it? V1: original version. V2: slightly worse on the test data; this may or may not be okay.
  10. How can we use it? V1: original version. V3: performs very poorly on the test data; we should not continue with this deployment.
  11. How can we use it? V1: original version. V4: all recommendations filtered out; we should not continue with this deployment. (A gating sketch follows the transcript.)
  12. Conclusion: Try to test before production. Create data that reflects production. Sanity-checking models maintains good UX.
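
Slides 5 and 6 describe the synthetic staging data: users and products partitioned into classes, with probabilistic preferences between the classes. The talk doesn't show code, so the following is a minimal sketch of that idea in Python; the class-assignment rule, the random seed, and every row of the preference matrix except U0's are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=42)  # fixed seed keeps the staging fixture deterministic

    N_CLASSES = 5          # user classes U0-U4 and product classes P0-P4
    N_USERS = 500_000      # ~0.5M users, the staging scale from slide 4
    N_PRODUCTS = 500_000   # ~0.5M products

    # PREFS[u, p] = probability that a user in class Uu likes a product in
    # class Pp. Row 0 is the U0 row shown on slide 6; the other rows are
    # made-up rotations of it.
    PREFS = np.array([
        [0.95, 0.50, 0.15, 0.05, 0.01],
        [0.01, 0.95, 0.50, 0.15, 0.05],
        [0.05, 0.01, 0.95, 0.50, 0.15],
        [0.15, 0.05, 0.01, 0.95, 0.50],
        [0.50, 0.15, 0.05, 0.01, 0.95],
    ])

    def user_class(user_id: int) -> int:
        # Assumed rule: round-robin ids into classes (so user 30320 lands in U0).
        return user_id % N_CLASSES

    def product_class(product_id: int) -> int:
        return product_id % N_CLASSES

    def sample_like(user_id: int, product_id: int) -> bool:
        """Draw whether this user likes this product from the class preference."""
        p = PREFS[user_class(user_id), product_class(product_id)]
        return rng.random() < p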
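
Slide 7 says interaction events reach the data lake via Kinesis. Below is a minimal producer sketch using boto3; the stream name "interaction-events" and the event schema are hypothetical, and only the put_record call itself is real boto3 API.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish_interaction(user_id: int, product_id: int, action: str) -> None:
        # Hypothetical event schema for one in-app interaction.
        event = {"user_id": user_id, "product_id": product_id, "action": action}
        kinesis.put_record(
            StreamName="interaction-events",         # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(user_id),               # keeps one user's events on one shard
        )

    # e.g. publish_interaction(30320, 4356, "like")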
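
Slides 9-11 compare candidate model versions against the test data and gate the deployment on the result. One way such a deterministic gate could look, reusing user_class, product_class, and PREFS from the data sketch above (the scoring rule and the 0.05 tolerance are assumptions, not the talk's actual check):

    def preference_score(recommendations: dict[int, list[int]]) -> float:
        # Mean class preference over all recommended (user, product) pairs.
        pairs = [(u, p) for u, prods in recommendations.items() for p in prods]
        if not pairs:  # the V4 case: every recommendation filtered out
            return 0.0
        return sum(PREFS[user_class(u), product_class(p)] for u, p in pairs) / len(pairs)

    def should_deploy(candidate_score: float, baseline_score: float,
                      tolerance: float = 0.05) -> bool:
        # Pass V2-style "slightly worse" candidates; fail V3-style regressions
        # and V4-style empty output.
        return candidate_score > 0.0 and candidate_score >= baseline_score - tolerance

    # Usage sketch (recs_from_model is hypothetical):
    # v1 = preference_score(recs_from_model("v1"))
    # v2 = preference_score(recs_from_model("v2"))
    # deploy = should_deploy(v2, baseline_score=v1)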