Toward an efficient & effective recommender system development

Keigo Kubo
Machine Learning Engineer, LINE Machine Learning Solution Team
https://linedevday.linecorp.com/2020/jp/sessions/9641
https://linedevday.linecorp.com/2020/en/sessions/9641

LINE DevDay 2020

November 26, 2020
Transcript

  1. Agenda
     1. Overview of LINE Recommender Systems
     2. masala
     3. Evaluation Dataset Construction
     4. Evaluation Metrics
     5. Model Tuning
     6. Case Study of a Recommender System using masala
     7. Conclusion & Future Work
  2. LINE Services using Recommender Systems
     Sticker, Theme, etc.; Manga; Live; Store; Fortune-telling; Part-time; Delima; etc.
  3. Frames and Features
     Frames: Official App, SmartCH, HomeTab
     Features: Purchase, Click, View, Free or Paid, Wish List, Favorite, Comment, Author, Publisher, etc.
  5. Challenging Issues in LINE Recommender Systems
     • Developing many recommender systems is very costly.
     • Good business effects are required.
     → Efficient and effective recommender system development is required.
  10. To Achieve Effectiveness with masala
     • Bias reduction in the dataset for offline tests
     • Appropriate handling of data leakage in dataset construction and training
     • Flexible feature settings
     • A continuously improved recommender engine served as a baseline
     • Various offline evaluation metrics that give a multifaceted perspective
     • A demo specialized for each service
     • etc.
  11. Config-File-Driven ML Task Collection
     The masala controller reads a config file and executes tasks (Task 1, Task 2, Task 3, ...):
     1. Check each task's config against that task's schema.
     2. Execute the tasks in accordance with the task flow.
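For intuition, a schema-checked, config-driven controller of this shape might look like the sketch below. This is an illustration only: masala is an internal LINE tool, so the task names, schemas, and the `jsonschema` usage here are assumptions, not its actual API.

```python
import json

from jsonschema import validate  # assumed dependency: pip install jsonschema

# Hypothetical task schemas and implementations, one per task type.
TASK_SCHEMAS = {
    "dataset_constructor": {"type": "object", "required": ["input_path"]},
    "train":               {"type": "object", "required": ["model"]},
    "evaluate":            {"type": "object", "required": ["metrics"]},
}
TASK_RUNNERS = {
    "dataset_constructor": lambda cfg: print("constructing dataset:", cfg),
    "train":               lambda cfg: print("training:", cfg),
    "evaluate":            lambda cfg: print("evaluating:", cfg),
}

def run(config_path: str) -> None:
    with open(config_path) as f:
        config = json.load(f)
    # 1. Check the config with the schema per task.
    for task in config["tasks"]:
        validate(instance=task["config"], schema=TASK_SCHEMAS[task["type"]])
    # 2. Execute the tasks in accordance with the task flow (here: list order).
    for task in config["tasks"]:
        TASK_RUNNERS[task["type"]](task["config"])
```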
  12. Composite Task ``recommendation/baseline`` of masala for Efficient & Effective Recommender System Development
     Generates the configs for the related tasks of a recommender system development (dataset constructor, execution of baseline methods, evaluation, and demo) and runs them, so that development can start quickly.
  13. Dataset Splitting in Offline Recommender System Development
     Split user behavior histories (user-item interactions) along the time axis, from past to future: training set, validation set, evaluation set, and prediction set.
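A minimal sketch of such a time-based split, assuming a pandas DataFrame of interactions with a `timestamp` column (the column name and cut points are illustrative, not masala's config):

```python
import pandas as pd

def split_by_time(interactions: pd.DataFrame,
                  train_end: str, valid_end: str, eval_end: str):
    """Split user-item interactions along the time axis, past to future."""
    ts = pd.to_datetime(interactions["timestamp"])
    train_end, valid_end, eval_end = map(pd.Timestamp,
                                         (train_end, valid_end, eval_end))
    train = interactions[ts < train_end]                            # fit models
    valid = interactions[(ts >= train_end) & (ts < valid_end)]      # tune models
    evaluation = interactions[(ts >= valid_end) & (ts < eval_end)]  # offline test
    prediction = interactions[ts >= eval_end]                       # most recent data
    return train, valid, evaluation, prediction
```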
  16. Bias in Dataset
     Most recommender system datasets are Missing Not At Random (MNAR).
     › User-item interactions are missing for systematic reasons, such as an item being minor.
     › This causes biases, so the recommender system cannot be evaluated precisely.
     Missing At Random (MAR) is desirable.
     › User-item interactions are missing randomly.
     › This does not cause any bias.
     → Bias must be reduced from the MNAR dataset for precise evaluation.
  18. Techniques for Bias Reduction

     | Technique | Description | Pros | Cons |
     | Weighting | The more an item is recommended by the current system, the less impact it has on the evaluation*. | Reduces biases of the current system | Difficult to apply in some cases |
     | Sampling | A MAR-like dataset is created from the MNAR dataset by weighted sampling proportional to the reciprocal of item frequency**. | Easy to apply in various cases | Only popularity bias can be reduced |

     * [T. Schnabel, et al.]: Recommendations as Treatments: Debiasing Learning and Evaluation
     ** [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
  21. Characteristics of Display Frames

     | Frame | Characteristics | Better User Sampling | Better Item Sampling |
     | Official App | Displays to users of the app; there are various other frames, so differentiation is necessary | Balance | Focus on minor items |
     | SmartCH | Displays to users of the app; there is no other frame for the app | Balance | Balance |
     | HomeTab | Displays even if the user does not use the app; there is no other frame for the app | Focus on light and cold users | Balance |
  23. Bias Reduction & Consideration of Frame Characteristics by ``recommendation/baseline`` of masala
     › Weighted sampling following [D. Carraro, et al.]*, with smoothing parameters introduced:
       α (≥ 0): smoothing parameter
       β ([0, 1]): smoothing parameter
       γ: parameter balancing the user- and item-side ratios
       f_u, f_i: frequency of user u and item i
       |U|, |I|: the number of users and items

       p̂_u = 1 / |U|,  p̂_i = 1 / |I|
       p_u = (α_u + f_u^β) / Σ_{u'} (α_{u'} + f_{u'}^β)
       p_i = (α_i + f_i^β) / Σ_{i'} (α_{i'} + f_{i'}^β)
       w = (p̂_u / p_u) · (p̂_i / p_i)^γ

     Figure 1: Sampling configuration in masala
     * [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
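A sketch of these weights in NumPy. The formulas follow the reconstruction above; the normalization over the interaction list and the default parameter values are assumptions, not masala's implementation:

```python
import numpy as np

def debias_weights(f_u, f_i, n_users, n_items, alpha=1.0, beta=0.5, gamma=1.0):
    """Per-interaction weight w = (p̂_u / p_u) * (p̂_i / p_i)**gamma.

    f_u, f_i: frequencies of each interaction's user and item.
    For simplicity, p_u and p_i are normalized over the interaction list."""
    smooth_u = alpha + np.asarray(f_u, float) ** beta
    smooth_i = alpha + np.asarray(f_i, float) ** beta
    p_u = smooth_u / smooth_u.sum()        # observed (MNAR) distributions
    p_i = smooth_i / smooth_i.sum()
    p_hat_u, p_hat_i = 1.0 / n_users, 1.0 / n_items  # uniform (MAR-like) targets
    return (p_hat_u / p_u) * (p_hat_i / p_i) ** gamma

def weighted_sample(weights, k, seed=0):
    """Draw a MAR-like evaluation set of k interactions, proportional to w."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights) / np.sum(weights)
    return rng.choice(len(p), size=k, replace=False, p=p)
```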
  25. Consideration of Data Leakage in Dataset Splitting
     › The more user behavior types there are, the more complicated the handling of data leakage becomes. Data leakage makes the training problem artificially easy and the test problem harder.
     Data leakage is automatically removed in masala
     › by specifying the category in the feature settings, like product_id in Figure 2; a sketch of the idea follows below.
     Figure 2: Category settings in masala
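To illustrate the idea (the column names and the exact rule are assumptions; masala's mechanism is config-driven and not public): given a labeled event such as a purchase, feature events that share the configured category key with the label and do not strictly precede it are dropped, so that e.g. the click accompanying a purchase cannot leak into the features.

```python
import pandas as pd

CATEGORY_KEY = "product_id"  # the category specified in the feature settings

def leakage_safe_features(events: pd.DataFrame, label: pd.Series) -> pd.DataFrame:
    """Keep only the feature events that cannot leak the label event.

    Drops events on the label's product_id at or after the label's time."""
    leaking = (events[CATEGORY_KEY] == label[CATEGORY_KEY]) & \
              (events["timestamp"] >= label["timestamp"])
    return events[~leaking]
```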
  26. Quantitative Evaluation
     Find weaknesses in existing recommender systems with various metrics and lead them to improvement.

     | Metric Type | Metric |
     | Performance | Recall, nDCG, Mean Average Precision, etc. |
     | Aggregate Diversity | Number of unique items recommended (Unique), Aggregate Entropy |
     | Individual Diversity | Intra-List Similarity (ILS), Individual Entropy |
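Standard definitions of a few of these metrics, as a sketch (textbook code, not masala's). Here `recs` maps each user to a ranked list of recommended item ids, and `truth` maps each user to the set of relevant items in the evaluation set:

```python
import math
from collections import Counter

def ndcg_at_k(recs, truth, k=10):
    """Mean nDCG@k with binary relevance."""
    scores = []
    for user, items in recs.items():
        rel = truth.get(user, set())
        dcg = sum(1.0 / math.log2(rank + 2)
                  for rank, item in enumerate(items[:k]) if item in rel)
        idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(rel))))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores)

def unique_items(recs, k=10):
    """Aggregate diversity: distinct items recommended across all users."""
    return len({item for items in recs.values() for item in items[:k]})

def aggregate_entropy(recs, k=10):
    """Entropy of the recommended-item distribution (higher = more even)."""
    counts = Counter(item for items in recs.values() for item in items[:k])
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```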
  28. Individual Diversity by Attribute
     ``recommendation/baseline`` of masala can provide individual diversity by any attribute
     › so the diversity can be compared without the demo.
     Figure 3: Example of recommendation lists. The left list (Horror, Horror, Horror, Horror, Comedy) has low individual diversity in genre_id compared with the right.
     Figure 4: Individual diversity configuration in masala

     | | Author Intra-List Similarity | Genre Intra-List Similarity | Magazine Intra-List Similarity |
     | Baseline 1 | 4.33 | 90.84 | 2.93 |
     | Baseline 2 | 3.95 | 75.48 | 2.67 |
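One common way to compute ILS by an attribute, assumed here for illustration (masala's exact similarity function is not public): pairwise similarity is 1 when two items share the attribute value and 0 otherwise, so ILS is the share of same-attribute pairs in the list.

```python
from itertools import combinations

def ils_by_attribute(items, attribute):
    """items: a ranked recommendation list; attribute: item id -> value."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    same = sum(attribute[a] == attribute[b] for a, b in pairs)
    return same / len(pairs)

# The left list from Figure 3: four Horror titles plus one Comedy title
# scores higher (less diverse) than a fully mixed list would.
genre = {1: "Horror", 2: "Horror", 3: "Horror", 4: "Horror", 5: "Comedy"}
print(ils_by_attribute([1, 2, 3, 4, 5], genre))  # 0.6
```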
  30. Qualitative Evaluation
     Display recommendation outputs in a demo; ``recommendation/baseline`` in masala can provide the demo.
     › Compare an existing system and proposals per major/heavy, medium/middle, and minor/light users and items.
     › Check validity when emphasizing diversity.
     Figure 5: Demo frequency label configuration in masala
  32. Controllable Tradeoff between Performance and Diversity (1/2)
     Baseline methods in masala provide parameters that control the tradeoff.

     Hard Positive Sampling
     › Weighted random positive sampling with weight w_p:
       α_p (≥ 0): smoothing parameter
       β_p ([0, 1]): smoothing parameter
       f_i: frequency of item i
       |I|: the number of items

       w_p = Σ_{i'} (α_p + f_{i'}^{β_p}) / (|I| · (α_p + f_i^{β_p}))

     Hard Negative Sampling
     › Weighted random negative sampling with weight w_n, following*:
       α_n (≥ 0): smoothing parameter
       β_n ([0, 1]): smoothing parameter
       f_j: frequency of item j

       w_n = (α_n + f_j^{β_n}) / Σ_{j'} (α_n + f_{j'}^{β_n})

     * [Y. Goldberg, et al.]: word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
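A sketch of both weights in NumPy, following the reconstruction above (the default parameter values are assumptions):

```python
import numpy as np

def hard_positive_weights(f, alpha_p=1.0, beta_p=0.5):
    """w_p(i) = sum_i'(α_p + f_i'^β_p) / (|I| * (α_p + f_i^β_p)).

    Upweights rare items; a larger β_p shifts positive sampling
    further toward minor items."""
    smoothed = alpha_p + np.asarray(f, float) ** beta_p
    return smoothed.sum() / (len(smoothed) * smoothed)

def hard_negative_weights(f, alpha_n=1.0, beta_n=0.75):
    """w_n(j) = (α_n + f_j^β_n) / sum_j'(α_n + f_j'^β_n).

    A smoothed unigram distribution; with α_n = 0 and β_n = 0.75 this
    reduces to word2vec's negative sampling distribution."""
    smoothed = alpha_n + np.asarray(f, float) ** beta_n
    return smoothed / smoothed.sum()

freq = np.array([100, 50, 10, 1])  # item frequencies
rng = np.random.default_rng(0)
negatives = rng.choice(len(freq), size=2, p=hard_negative_weights(freq))
```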
  34. Controllable Tradeoff between Performance and Diversity (2/2)
     [Figure: nDCG (%) vs. Unique as the positive sampling smoothing β_p (toward minor items) and the negative sampling smoothing β_n (toward major items) are varied over 0, 0.25, 0.5, and 0.75.]
  35. Robustness Test
     Build a dataset from the latest data and re-test to check the reproducibility of model performance.

     Old dataset:
     | | nDCG (%) | Unique | Genre Intra-List Similarity |
     | Baseline | 3.22 | 20,443 | 90.84 |
     | Proposal (Diversity) | 3.48 | 21,682 | 75.48 |
     | Proposal (Performance) | 3.98 | 18,304 | 73.77 |

     Latest dataset:
     | | nDCG (%) | Unique | Genre Intra-List Similarity |
     | Baseline | 3.40 | 18,677 | 89.73 |
     | Proposal (Diversity) | 3.73 | 20,419 | 74.79 |
     | Proposal (Performance) | 4.50 | 16,853 | 74.61 |

     The relative performance and diversity between models did not change on the latest dataset. Go to the online test!
  36. LINE Theme Recommendation in HomeTab
     › Multi-service recommendations are displayed in the frame on the home tab of the LINE app.
     › Dataset splitting was done using masala.
     › The proposal model is provided in masala, tuned toward diversity by adjusting the positive/negative sampling parameters.
     Evaluated in advance with a reproducible offline test:

     | | nDCG | Unique | Aggregate Entropy |
     | Baseline | 0.051 | 2,170 | 8.75 |
     | Proposal | 0.063 | 16,507 | 11.93 |
  37. Online Test Performance
     [Figure: Online test results comparing Baseline and Proposal on CTR, unique items in clicks, CV via HomeTab, and unique items in CV via HomeTab.]
  40. Case Study of Efficiency with masala
     • Easy to add new features to a model without considering data leakage (20 minutes → 1 minute).
     • Easy to run in another service by copying the config and replacing only the features and data paths (2 weeks → 2 hours).
     • Easy to check the behavior of models in the service with the demo (2 hours → 10 minutes).
  41. Conclusion & Future Work
     Conclusion
     › Introduced our efforts toward efficient & effective recommender system development.
     › Explained what to be careful about for effective development in dataset splitting, evaluation metrics, and model tuning.
     › Introduced how to do this efficiently with masala.
     Future work
     › Investigate the tradeoff between performance and diversity in more detail with online testing.
     › Develop a new ``recommendation/baseline`` to efficiently & effectively develop cross-recommendations.