Slide 1

Predicting Online Performance of Job Recommender Systems
RecSys 2019, short paper
Indeed, Tokyo, Japan
Adrien M, Tuan A, @masa_kazama, Jialin K

Slide 2

Problem and Motivation
● Online evaluation (A/B testing) is usually the most reliable way to measure the impact of our experiments, but it is a slow process.
● Offline evaluation is faster, but making it reliable is critical because it informs our decision to roll out new improvements to production.

Slide 3

Problem and Motivation
● Which offline evaluation metrics should we monitor to anticipate an impact in production?
● What level of confidence can we have in the offline results?
● How should we decide whether or not to push a new model to production?

Slide 4

Funnel in job recommendation
Typical conversion funnel in job recommendation:
Impression → Click → Apply → Interview → Get a Job
We focus on the first half of the funnel because data in the later stages is very sparse.
apply-rate@10 = (# applies up to rank 10) / (# impressions up to rank 10)
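To make the definition concrete, here is a minimal sketch (not from the paper) of how apply-rate@k could be computed from impression logs; the record fields "rank" and "is_applied" are illustrative assumptions.

# Minimal sketch: apply-rate@k from a list of impression records.
# Each record is assumed to look like {"rank": int, "is_applied": bool};
# the field names are illustrative, not taken from the paper.

def apply_rate_at_k(impressions, k=10):
    """apply-rate@k = (# applies up to rank k) / (# impressions up to rank k)."""
    top_k = [r for r in impressions if r["rank"] <= k]
    if not top_k:
        return 0.0
    applies = sum(1 for r in top_k if r["is_applied"])
    return applies / len(top_k)

# Example usage with toy data.
logs = [
    {"rank": 1, "is_applied": True},
    {"rank": 2, "is_applied": False},
    {"rank": 11, "is_applied": True},  # beyond rank 10, ignored for k=10
]
print(apply_rate_at_k(logs, k=10))  # 0.5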

Slide 5

Dataset
Format: userId, jobId, time, isClicked, isApplied
Volume: 125M interactions, 250M users, 20M jobs
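For illustration only, a minimal sketch of loading interaction logs in this format with pandas; the file name, the absence of a header row, and the column types are assumptions, not details from the paper.

# Minimal sketch (hypothetical file name and types): load interaction logs
# in the format userId, jobId, time, isClicked, isApplied.
import pandas as pd

interactions = pd.read_csv(
    "interactions.csv",                      # hypothetical path
    names=["userId", "jobId", "time", "isClicked", "isApplied"],
    parse_dates=["time"],
)
print(interactions.dtypes)
print(interactions["isApplied"].mean())      # overall apply rate per impression (assuming 0/1 values)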

Slide 6

Recommendation Models
● Word2vec (w2v): an embedding model using negative sampling, which captures the sequence of user actions
● Word2vec (w2vhs): a variant of word2vec using hierarchical softmax
● kNN (knn): an item-based collaborative filtering technique (see the sketch below)
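As one way to read the kNN item above, here is a minimal sketch of item-based collaborative filtering with cosine similarity over a user-item interaction matrix; it is an illustrative assumption about the setup, not the exact model used in the experiments.

# Minimal sketch of item-based kNN collaborative filtering (illustrative):
# score items by cosine similarity between item columns of a sparse
# user-item interaction matrix.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy interaction matrix: rows = users, columns = items, 1 = clicked/applied.
interactions = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]))

item_sim = cosine_similarity(interactions.T)   # (n_items, n_items)
np.fill_diagonal(item_sim, 0.0)                # never recommend the item itself

def recommend(user_idx, k=2):
    """Rank unseen items by summed similarity to the user's interacted items."""
    seen = interactions[user_idx].toarray().ravel()
    scores = item_sim @ seen
    scores[seen > 0] = -np.inf                 # mask items already seen
    return np.argsort(-scores)[:k]

print(recommend(0))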

Slide 7

Word2vec for user action data (item embedding)
● Implicit data
○ Click data
○ Bookmark data
○ Apply data
Ex.
UserID, ItemID, TimeStamp
User1, Item2, 2016/02/12
User1, Item6, 2016/02/17
User1, Item7, 2016/02/19
User2, Item2, 2016/02/12
User2, Item9, 2016/02/17
User2, Item10, 2016/02/19
User2, Item12, 2016/02/20

Slide 8

Ex. Apply data
UserID, ItemID, TimeStamp
User1, Item2, 2016/02/12
User1, Item6, 2016/02/17
User1, Item7, 2016/02/19
User2, Item2, 2016/02/12
User2, Item9, 2016/02/17
User2, Item10, 2016/02/19
User2, Item12, 2016/02/20
Per-user item sequences:
[Item2, Item6, Item7]
[Item2, Item9, Item10, Item12]
...
We consider an ItemID as a word and the items the user acted on as a document, so we can apply word2vec.
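A minimal sketch of this idea with Gensim, assuming Gensim 4.x parameter names; the hyperparameter values are illustrative, not those used in the paper.

# Minimal sketch (Gensim 4.x, illustrative hyperparameters): treat each user's
# item sequence as a "sentence" and train item embeddings with word2vec.
from gensim.models import Word2Vec

sequences = [
    ["Item2", "Item6", "Item7"],
    ["Item2", "Item9", "Item10", "Item12"],
]

# Negative-sampling variant (w2v): hs=0, negative > 0.
w2v = Word2Vec(sequences, vector_size=32, window=5, min_count=1, sg=1,
               hs=0, negative=5, epochs=10)

# Hierarchical-softmax variant (w2vhs): hs=1, negative=0.
w2vhs = Word2Vec(sequences, vector_size=32, window=5, min_count=1, sg=1,
                 hs=1, negative=0, epochs=10)

# Recommend items similar to one the user applied to.
print(w2v.wv.most_similar("Item2", topn=3))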

Slide 9

Metrics
Evaluation metrics
● MAP
● MPR
● Precision@k (p@k)
● NDCG@k
● Recall@k (r@k)
k in (3, 10, 20, 30, 40)
Process
● For two weeks, we run an A/B test with one bucket for each model.
● Daily, we generate new recommendations based on the past data and compare the production performance (apply-rate) with the offline performance (p@k, etc.); a minimal p@k / r@k sketch follows below.
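For reference, a minimal sketch (plain Python, not the paper's evaluation code) of two of the offline metrics above, p@k and r@k, computed from a ranked recommendation list and the set of relevant (e.g. applied) items.

# Minimal sketch of p@k and r@k (illustrative, not the paper's evaluation code).

def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = ranked_items[:k]
    return sum(1 for item in top_k if item in relevant_items) / k

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    if not relevant_items:
        return 0.0
    top_k = ranked_items[:k]
    return sum(1 for item in top_k if item in relevant_items) / len(relevant_items)

ranked = ["Item7", "Item9", "Item2", "Item10"]
relevant = {"Item2", "Item12"}
print(precision_at_k(ranked, relevant, k=3))  # 1/3
print(recall_at_k(ranked, relevant, k=3))     # 1/2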

Slide 10

Results
Cross-model comparison, averaged over days; word2vec (w2v) is used as the baseline.

Model   apply-rate  apply-rate@10  p@10   MAP    MPR    NDCG@10  r@10   r@100
w2v     -           -              -      -      -      -        -      -
knn     +17%        +11%           +9.3%  -54%   +11%   -47%     -38%   +3.2%
w2vhs   +48%        +46%           +90%   +51%   -5.1%  +60%     +70%   +65%

The metrics in bold do not have the expected sign, e.g. online performance increased, but the offline evaluation metric decreased.

Slide 11

Per-metric results (figure)

Slide 12

Conclusion
● We conclude that these offline evaluation metrics are reliable enough to decide not to deploy a new model when its offline performance is significantly negative, and to deploy it when there is a positive impact on the offline metrics.
● We recommend p@k, which showed consistent predictive power, when the recommendation task is focused on precision.

Slide 13

OSS contribution
● Add Recall@k metric to RankingMetrics in Spark (usage sketch below)
○ https://github.com/apache/spark/pull/23881
● Add nmslib indexer to Gensim
○ https://github.com/RaRe-Technologies/gensim/pull/2417
● Write a tutorial for the nmslib indexer in Gensim
○ https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/nmslibtutorial.ipynb
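A minimal usage sketch of Spark's RankingMetrics with toy data, assuming Spark 3.x so that the Recall@k addition from the PR above is available.

# Minimal sketch (PySpark, Spark 3.x assumed so recallAt is available):
# evaluate ranked recommendations against the set of relevant (applied) items.
from pyspark.sql import SparkSession
from pyspark.mllib.evaluation import RankingMetrics

spark = SparkSession.builder.appName("ranking-metrics-sketch").getOrCreate()

# (predicted ranking, ground-truth relevant items) per user, toy data.
prediction_and_labels = spark.sparkContext.parallelize([
    ([1, 2, 3], [2]),
    ([4, 5, 6], [5, 7]),
])

metrics = RankingMetrics(prediction_and_labels)
print(metrics.precisionAt(3))
print(metrics.recallAt(3))          # added by the Spark PR linked above
print(metrics.meanAveragePrecision)
print(metrics.ndcgAt(3))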