Testing Machine Learning Systems in Staging by John Cragg

Shannon
January 23, 2019

Testing the qualitative performance of an ML model is a difficult task, usually done only by measuring fluctuations in key business metrics in production. In this session John discusses his team's attempts to produce a deterministic test of algorithmic performance in Depop's staging environment for their product recommendation system.

Transcript

  1. Product Recommendations: We recommend products to users based on their in-app interactions (likes, saves, messages, comments, and purchases). We expect similar users to like similar products.
  2. So how is it usually done? Unit testing; monitoring the effects on business metrics; validating big data & ML pipelines.
  3. But wait John, that's too late! A/B testing can cause a bad user experience.
  4. What are the differences between staging and production? The size of the data: staging holds ~0.5M products and ~0.5M users; production holds ~93M products and ~12M users.
  5. How did we do it? Similar users should like similar products, so we partitioned users and products into classes. [Slide diagram: user IDs such as 30320 and 4356 assigned to user classes U0-U4.] (A data-generation sketch follows the transcript.)
  6. How did we do it? We created preferences between user and product classes. [Slide table: class U0 likes products in class P0 95% of the time, P1 50%, P2 15%, P3 5%, P4 1%.]
  7. User Product Interactions (Lambda): Interaction events are streamed to the data lake via Kinesis. The user and product IDs are logged in a Slack channel for internal use. [Slide annotation: "This bit!"] (A producer sketch follows the transcript.)
  8. How can we use it? Remember: we want similar users to be recommended similar products.
  9. How can we use it? V1: original version. V2: slightly worse on the test data; this may or may not be okay.
  10. How can we use it? V1: original version. V3: performs very poorly on the test data; we should not continue with this deployment.
  11. How can we use it? V1: original version. V4: all recommendations filtered out; we should not continue with this deployment. (A gating sketch follows the transcript.)
  12. Conclusion: Try to test before production. Create data that reflects production. Sanity-checking models maintains good UX.
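
Slides 5 and 6 describe the synthetic staging data: users and products partitioned into classes, with probabilistic preferences between the classes. The talk doesn't show code, so the following is a minimal sketch of that idea in Python; the class-assignment rule, the random seed, and every row of the preference matrix except U0's are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=42)  # fixed seed keeps the staging fixture deterministic

    N_CLASSES = 5          # user classes U0-U4 and product classes P0-P4
    N_USERS = 500_000      # ~0.5M users, the staging scale from slide 4
    N_PRODUCTS = 500_000   # ~0.5M products

    # PREFS[u, p] = probability that a user in class Uu likes a product in
    # class Pp. Row 0 is the U0 row shown on slide 6; the other rows are
    # made-up rotations of it.
    PREFS = np.array([
        [0.95, 0.50, 0.15, 0.05, 0.01],
        [0.01, 0.95, 0.50, 0.15, 0.05],
        [0.05, 0.01, 0.95, 0.50, 0.15],
        [0.15, 0.05, 0.01, 0.95, 0.50],
        [0.50, 0.15, 0.05, 0.01, 0.95],
    ])

    def user_class(user_id: int) -> int:
        # Assumed rule: round-robin ids into classes (so user 30320 lands in U0).
        return user_id % N_CLASSES

    def product_class(product_id: int) -> int:
        return product_id % N_CLASSES

    def sample_like(user_id: int, product_id: int) -> bool:
        """Draw whether this user likes this product from the class preference."""
        p = PREFS[user_class(user_id), product_class(product_id)]
        return rng.random() < p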
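
Slide 7 says interaction events reach the data lake via Kinesis. Below is a minimal producer sketch using boto3; the stream name "interaction-events" and the event schema are hypothetical, and only the put_record call itself is real boto3 API.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish_interaction(user_id: int, product_id: int, action: str) -> None:
        # Hypothetical event schema for one in-app interaction.
        event = {"user_id": user_id, "product_id": product_id, "action": action}
        kinesis.put_record(
            StreamName="interaction-events",         # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(user_id),               # keeps one user's events on one shard
        )

    # e.g. publish_interaction(30320, 4356, "like")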
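
Slides 9-11 compare candidate model versions against the test data and gate the deployment on the result. One way such a deterministic gate could look, reusing user_class, product_class, and PREFS from the data sketch above (the scoring rule and the 0.05 tolerance are assumptions, not the talk's actual check):

    def preference_score(recommendations: dict[int, list[int]]) -> float:
        # Mean class preference over all recommended (user, product) pairs.
        pairs = [(u, p) for u, prods in recommendations.items() for p in prods]
        if not pairs:  # the V4 case: every recommendation filtered out
            return 0.0
        return sum(PREFS[user_class(u), product_class(p)] for u, p in pairs) / len(pairs)

    def should_deploy(candidate_score: float, baseline_score: float,
                      tolerance: float = 0.05) -> bool:
        # Pass V2-style "slightly worse" candidates; fail V3-style regressions
        # and V4-style empty output.
        return candidate_score > 0.0 and candidate_score >= baseline_score - tolerance

    # Usage sketch (recs_from_model is hypothetical):
    # v1 = preference_score(recs_from_model("v1"))
    # v2 = preference_score(recs_from_model("v2"))
    # deploy = should_deploy(v2, baseline_score=v1)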