systems is very costly. • Good business effects are required. Challenging Issues: Achieving efficient and effective recommender system development is required.
for offline tests • Appropriate handling of data leakage in dataset construction and training • Flexible feature settings • A continuously improved recommender engine served as a baseline • Various offline evaluation metrics that give a multifaceted perspective • Demos specialized for each service, etc.
Dataset constructor / Execute baseline methods / Evaluation / Demo — masala generates the configs for these related tasks of a recommender system development and runs them so that development can start quickly.
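As a rough illustration of this config-driven workflow (a minimal sketch only: the task names, config fields, and functions below are hypothetical and not masala's actual interface):

```python
# Hypothetical sketch of a config-driven pipeline in the spirit of the workflow
# above: generate a config per related task, then run the tasks in order.
# All names (make_config, run_task, the task list) are assumptions for illustration.
from typing import Dict, List

TASKS: List[str] = ["dataset_construction", "baseline", "evaluation", "demo"]

def make_config(task: str, service: str) -> Dict:
    """Generate a minimal config for one task of one service."""
    return {
        "task": task,
        "service": service,
        "features": ["user_id", "item_id", "product_id"],        # example feature setting
        "data_path": f"/data/{service}/interactions.parquet",    # example path
    }

def run_task(config: Dict) -> None:
    """Placeholder runner: a real pipeline would dispatch to each component."""
    print(f"running {config['task']} for {config['service']}")

if __name__ == "__main__":
    configs = [make_config(task, service="official_app") for task in TASKS]
    for config in configs:
        run_task(config)
```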
reasons such as minority › This causes biases, so the recommender system cannot be evaluated precisely. Most recommender system datasets are Missing Not At Random (MNAR). Missing At Random (MAR) is desirable. › User-item interactions are missing at random › This does not cause any bias. Bias must be reduced from the MNAR dataset for precise evaluation.
more the item is recommended in the current system, the less the impact on the evaluation*. Reduces biases caused by the current system. Difficult to apply in some cases. Sampling: an MAR-like dataset is created from the MNAR dataset by weighted sampling proportional to the reciprocal of item frequency** (sketched below). Easy to apply in various cases. Only popularity bias can be reduced. * [T. Schnabel, et al.]: Recommendations as Treatments: Debiasing Learning and Evaluation ** [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
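A minimal sketch of the weighted-sampling idea from Carraro et al., assuming interactions are given as (user, item) pairs; the function and variable names are illustrative only, not masala's implementation:

```python
# Sketch of building a MAR-like test set from an MNAR interaction log by
# sampling each interaction with weight proportional to the reciprocal of
# its item's frequency (popular items are down-weighted).
from collections import Counter
import random

def sample_mar_like(interactions, n_samples, seed=0):
    """interactions: list of (user_id, item_id) pairs observed in the MNAR log."""
    item_freq = Counter(item for _, item in interactions)
    weights = [1.0 / item_freq[item] for _, item in interactions]
    rng = random.Random(seed)
    # Weighted sampling (with replacement for simplicity); Carraro et al.
    # describe more careful schemes, this only conveys the core idea.
    return rng.choices(interactions, weights=weights, k=n_samples)

if __name__ == "__main__":
    log = [("u1", "popular"), ("u2", "popular"), ("u3", "popular"), ("u4", "rare")]
    print(sample_mar_like(log, n_samples=2))
```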
Service | Characteristics | User Sampling | Item Sampling
Official App | Display to users of the app; there are various other frames and differentiation is necessary | Balance | Focus on minor items
SmartCH | Display to users of the app; there is no other frame for the app | Balance | Balance
HomeTab | Display even if the user does not use the app; there is no other frame for the app | Focus on light and cold users | Balance
user behavior types, the more complicated the consideration of data leakage becomes. Data leakage makes the training problem easier and the test problem more difficult. › By specifying the category in the feature settings, like product_id in Figure 2, data leakage is automatically removed in masala (a sketch of the idea follows). Figure 2: Category settings in masala
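The exact config format of masala is not shown here, but the intent can be sketched roughly as follows: once a feature is tied to a category key such as product_id, training features are built only from events before the split point, so nothing from the test period leaks into training. Everything below besides product_id is an assumption:

```python
# Hedged sketch of leakage-safe dataset construction: interactions are split
# by time, and training features are aggregated per category key (here
# "product_id", mirroring the category setting in Figure 2) only from the
# training period.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Event:
    user_id: str
    product_id: str
    timestamp: int

def split_and_featurize(events: List[Event], split_ts: int):
    train = [e for e in events if e.timestamp < split_ts]
    test = [e for e in events if e.timestamp >= split_ts]
    # Features are counted per category key only from the training period,
    # so no information from the test period leaks into training features.
    feature_counts: Dict[str, int] = {}
    for e in train:
        feature_counts[e.product_id] = feature_counts.get(e.product_id, 0) + 1
    return train, test, feature_counts

if __name__ == "__main__":
    events = [Event("u1", "p1", 1), Event("u1", "p2", 5), Event("u2", "p1", 9)]
    print(split_and_featurize(events, split_ts=5))
```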
metrics and lead to improvement.
Metric Type | Metric
Performance | Recall, nDCG, Mean Average Precision, etc.
Aggregate Diversity | The number of unique items recommended (Unique), Aggregate Entropy
Individual Diversity | Intra-List Similarity (ILS), Individual Entropy
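For concreteness, the two aggregate diversity metrics in the table can be computed roughly as follows (a sketch under the usual definitions, using log base 2; masala's exact implementation is not shown here):

```python
# Sketch of the aggregate diversity metrics: the number of unique items
# recommended across all users (Unique) and the entropy of the item
# distribution over all recommendation slots (Aggregate Entropy).
import math
from collections import Counter

def unique_items(rec_lists):
    return len({item for recs in rec_lists for item in recs})

def aggregate_entropy(rec_lists):
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    recs = [["a", "b", "c"], ["a", "b", "d"], ["a", "a", "e"]]
    print(unique_items(recs), aggregate_entropy(recs))
```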
Figure 3: Example of recommendation lists. The left has low individual diversity in genre_id compared with the right (e.g., Horror, Horror, Horror, Horror, Comedy).
› Be able to compare the diversity without the demo
``recommendation/baseline`` of masala can provide individual diversity by any attribute (a sketch follows the table below).
 | Author Intra-List Similarity | Genre Intra-List Similarity | Magazine Intra-List Similarity
Baseline 1 | 4.33 | 90.84 | 2.93
Baseline 2 | 3.95 | 75.48 | 2.67
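One simple way to obtain a per-attribute intra-list similarity like the numbers above is to count, per recommendation list, how many item pairs share the same attribute value and average over lists. This is only a sketch of the general idea, not masala's exact definition or scale:

```python
# Sketch of attribute-based intra-list similarity: for each recommendation
# list, count item pairs that share the same value of a chosen attribute
# (e.g., genre_id), then average over lists. Higher means less diverse.
from itertools import combinations

def intra_list_similarity(rec_lists, item_attrs, attribute):
    """rec_lists: lists of item ids; item_attrs: item id -> {attribute: value}."""
    scores = []
    for recs in rec_lists:
        pairs = list(combinations(recs, 2))
        if not pairs:
            continue
        same = sum(item_attrs[a][attribute] == item_attrs[b][attribute] for a, b in pairs)
        scores.append(same / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    attrs = {"i1": {"genre_id": "horror"}, "i2": {"genre_id": "horror"}, "i3": {"genre_id": "comedy"}}
    print(intra_list_similarity([["i1", "i2", "i3"]], attrs, "genre_id"))
```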
per major/heavy, medium/middle, and minor/light groups › Check validity when emphasizing diversity. Display recommendation outputs in the demo. ``recommendation/baseline`` in masala can provide the demo (a hypothetical sketch of the frequency labeling follows). Figure 5: Demo frequency label configuration in masala
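The frequency labels of Figure 5 can be imagined as a simple bucketing of users (or items) by interaction count so that metrics and demo views can be broken down per group. The thresholds and names below are hypothetical, not masala's actual configuration:

```python
# Hypothetical sketch of frequency labeling: bucket users into heavy/middle/
# light groups by interaction count. Thresholds are illustrative only.
from collections import Counter

def frequency_labels(interactions, heavy_min=100, light_max=10):
    """interactions: list of (user_id, item_id); returns user_id -> label."""
    counts = Counter(user for user, _ in interactions)
    labels = {}
    for user, c in counts.items():
        if c >= heavy_min:
            labels[user] = "heavy"
        elif c <= light_max:
            labels[user] = "light"
        else:
            labels[user] = "middle"
    return labels

if __name__ == "__main__":
    log = [("u1", "i1")] * 120 + [("u2", "i2")] * 30 + [("u3", "i3")] * 3
    print(frequency_labels(log))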
masala provides the parameters that control the tradeoff.
α (≥ 0): smoothing parameter
β (∈ [0, 1]): smoothing parameter
f_i: frequency of item i
|I|: the number of items
w_i = Σ_{j ∈ I} (α + f_j^β) / ( |I| · (α + f_i^β) )
Hard Positive Sampling › Weighted random positive sampling according to w_i (a sketch follows)
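Under that weight definition (note the formula above is reconstructed from a garbled original, so take it as the intended shape: weights inversely proportional to the smoothed item frequency α + f_i^β), a sketch of the weighted positive sampling could look like this; function names are illustrative only:

```python
# Sketch of hard positive sampling: positive examples are drawn with weights
# inversely proportional to the smoothed item frequency (alpha + f_i ** beta),
# normalized by the sum of smoothed frequencies over the number of items.
import random

def positive_sampling_weights(item_freq, alpha=1.0, beta=0.75):
    """item_freq: item id -> frequency f_i; returns item id -> weight w_i."""
    smoothed = {i: alpha + f ** beta for i, f in item_freq.items()}
    total = sum(smoothed.values())
    n_items = len(smoothed)
    return {i: total / (n_items * s) for i, s in smoothed.items()}

def sample_positive(positives, weights, rng=random):
    """Draw one positive item from a user's positive items, weighted by w_i."""
    ws = [weights[i] for i in positives]
    return rng.choices(positives, weights=ws, k=1)[0]

if __name__ == "__main__":
    w = positive_sampling_weights({"popular": 1000, "rare": 5})
    print(w, sample_positive(["popular", "rare"], w))
```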
Displayed in the frame in the home of the LINE app.
 | nDCG | Unique | Aggregate Entropy
Baseline | 0.051 | 2170 | 8.75
Proposal | 0.063 | 16507 | 11.93
› Dataset splitting by using masala
› Proposal model provided in masala, tuned toward diversity by adjusting the positive/negative sampling parameters
Evaluated with a reproducible offline test in advance
features to the model without considering data leakage (20 minutes → 1 minute). Easy to check the behavior of models in the service with the demo (2 hours → 10 minutes). Easy to run in another service by copying the config and replacing only the features and data paths (2 weeks → 2 hours).
efficient & effective recommender system development. › Explained what to be careful about for effective development in dataset splitting, evaluation metrics, and model tuning. › Introduced how to do it efficiently with masala. Conclusion Future work › Investigate the trade-off between performance and diversity in more detail with online testing. › Develop a new ``recommendation/baseline`` to efficiently & effectively develop cross-recommendations.