Slide 1


No content

Slide 2


Agenda
1. Overview of LINE Recommender Systems
2. masala
3. Evaluation Dataset Construction
4. Evaluation Metrics
5. Model Tuning
6. Case Study of a Recommender System using masala
7. Conclusion & Future Work

Slide 3


Overview of LINE Recommender Systems
Various services, various frames, and various features

Slide 4


LINE Services using Recommender Systems
Sticker, Theme, etc. / Manga / Live / Store / Fortune-telling / Parttime / Delima, etc.

Slide 5


Frames and Features
Frames: Official App, SmartCH, HomeTab
Features: Purchase, Click, View, Free or Paid, Wish List, Favorite, Comment, Author, Publisher, etc.

Slide 6


Challenging Issues in LINE Recommender Systems
• Developing many recommender systems is very costly.
• Good business effects are required.

Slide 7


Challenging Issues in LINE Recommender Systems
• Developing many recommender systems is very costly.
• Good business effects are required.
Efficient and effective recommender system development is required.

Slide 8


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests

Slide 9


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests
• Appropriate handling of data leakage in dataset construction and training

Slide 10


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests
• Appropriate handling of data leakage in dataset construction and training
• Flexible feature settings

Slide 11


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests
• Appropriate handling of data leakage in dataset construction and training
• Flexible feature settings
• A continuously improved recommender engine served as a baseline

Slide 12


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests
• Appropriate handling of data leakage in dataset construction and training
• Flexible feature settings
• A continuously improved recommender engine served as a baseline
• Various offline evaluation metrics that give a multifaceted perspective

Slide 13


To Achieve Effectiveness with masala
• Bias reduction in the dataset for offline tests
• Appropriate handling of data leakage in dataset construction and training
• Flexible feature settings
• A continuously improved recommender engine served as a baseline
• Various offline evaluation metrics that give a multifaceted perspective
• Demos specialized for each service
etc.

Slide 14


masala
A config-file-driven ML task collection

Slide 15


Config-File-Driven ML Task Collection
The masala controller reads a config file and executes tasks (Task 1, Task 2, Task 3, ...):
1. Check the config against the schema of each task
2. Execute the tasks in accordance with the task flow
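To make the flow concrete, here is a minimal sketch of a config-file-driven controller of this shape. The task registry, schema format, and config layout are illustrative assumptions, not masala's actual implementation.

```python
# Hypothetical sketch of a config-file-driven task controller.
# The schema format, task registry, and config layout are illustrative
# assumptions, not masala's actual implementation.
import json
from jsonschema import validate  # pip install jsonschema

TASK_REGISTRY = {}  # task name -> (schema, callable)

def register(name, schema):
    def deco(fn):
        TASK_REGISTRY[name] = (schema, fn)
        return fn
    return deco

@register("dataset_constructor", {"type": "object",
                                  "required": ["input_path", "split_date"]})
def dataset_constructor(cfg):
    print(f"building dataset from {cfg['input_path']}")

def run(config_path):
    with open(config_path) as f:
        config = json.load(f)
    # 1. Check each task's config against its schema.
    for step in config["task_flow"]:
        schema, _ = TASK_REGISTRY[step["task"]]
        validate(instance=step["params"], schema=schema)
    # 2. Execute the tasks in the declared order.
    for step in config["task_flow"]:
        _, fn = TASK_REGISTRY[step["task"]]
        fn(step["params"])
```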

Slide 16


Composite Task ``recommendation/baseline`` of masala for Efficient & Effective Recommender System Development
Generates the configs of the related tasks below for a recommender system development and runs them, so the development can start quickly:
• Dataset constructor
• Execute baseline methods
• Evaluation
• Demo
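As a rough illustration of this fan-out, a composite-task config could look like the following; every key, path, and name here is a hypothetical example, not masala's real config schema.

```python
# Hypothetical example of a composite-task config and the sub-task flow
# it would generate. All keys, paths, and names are illustrative
# assumptions, not masala's real config schema.
baseline_config = {
    "task": "recommendation/baseline",
    "params": {
        "service": "example_service",
        "interactions_path": "path/to/interactions",
    },
}

# The composite task would expand this into one config per related task
# and run them in order, so development starts from a working pipeline.
generated_task_flow = [
    {"task": "dataset_constructor", "params": {}},       # params filled in
    {"task": "execute_baseline_methods", "params": {}},  # from baseline_config
    {"task": "evaluation", "params": {}},
    {"task": "demo", "params": {}},
]
```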

Slide 17


Evaluation Dataset Construction
Overcoming data bias, frame characteristics, and data leakage

Slide 18


Dataset Splitting in Offline Recommender System Development
Split user behavior histories (user-item interactions) along the time axis from past to future into a training set, a validation set, an evaluation set, and a prediction set.
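A chronological split of this kind can be sketched as follows; the `timestamp` column name and the boundary dates are assumptions for illustration.

```python
# Minimal sketch of a chronological dataset split. The column name and
# boundary timestamps are illustrative assumptions.
import pandas as pd

def time_split(interactions: pd.DataFrame,
               train_end: str, valid_end: str, eval_end: str):
    ts = pd.to_datetime(interactions["timestamp"])
    t0, t1, t2 = map(pd.Timestamp, (train_end, valid_end, eval_end))
    train = interactions[ts < t0]
    valid = interactions[(ts >= t0) & (ts < t1)]
    evaln = interactions[(ts >= t1) & (ts < t2)]
    predict = interactions[ts >= t2]  # the "future" prediction set
    return train, valid, evaln, predict
```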

Slide 19


Bias in Dataset
Most recommender system datasets are Missing Not At Random (MNAR).
› User-item interactions are missing for non-random reasons, such as belonging to a minority of users or items.
› This causes biases that prevent precise evaluation of the recommender system.

Slide 20


Bias in Dataset
Most recommender system datasets are Missing Not At Random (MNAR).
› User-item interactions are missing for non-random reasons, such as belonging to a minority of users or items.
› This causes biases that prevent precise evaluation of the recommender system.
Missing At Random (MAR) is desirable.
› User-item interactions are missing randomly.
› This does not cause any bias.

Slide 21


Bias in Dataset
Most recommender system datasets are Missing Not At Random (MNAR).
› User-item interactions are missing for non-random reasons, such as belonging to a minority of users or items.
› This causes biases that prevent precise evaluation of the recommender system.
Missing At Random (MAR) is desirable.
› User-item interactions are missing randomly.
› This does not cause any bias.
Bias must be reduced from the MNAR dataset for precise evaluation.

Slide 22


Techniques for Bias Reduction
Weighting
› Description: The more an item is recommended in the current system, the less impact it has on the evaluation*.
› Pros: Reduces biases of the current system
› Cons: Difficult to apply in some cases
* [T. Schnabel, et al.]: Recommendations as Treatments: Debiasing Learning and Evaluation
** [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach

Slide 23


Techniques for Bias Reduction
Weighting
› Description: The more an item is recommended in the current system, the less impact it has on the evaluation*.
› Pros: Reduces biases of the current system
› Cons: Difficult to apply in some cases
Sampling
› Description: A MAR-like dataset is created from the MNAR dataset by weighted sampling proportional to the reciprocal of item frequency**.
› Pros: Easy to apply in various cases
› Cons: Only popularity bias can be reduced
* [T. Schnabel, et al.]: Recommendations as Treatments: Debiasing Learning and Evaluation
** [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
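A minimal sketch of the sampling technique, assuming interactions in a pandas DataFrame with an `item_id` column: each interaction is drawn with probability proportional to the reciprocal of its item's frequency, down-sampling popular items toward a MAR-like set.

```python
# Minimal sketch of debiasing by weighted sampling: each interaction is
# sampled with probability proportional to the reciprocal of its item's
# frequency, so popular items are down-sampled toward a MAR-like set.
# Column names and the sample size are illustrative assumptions.
import numpy as np
import pandas as pd

def mar_like_sample(interactions: pd.DataFrame, n: int,
                    rng: np.random.Generator) -> pd.DataFrame:
    freq = interactions["item_id"].map(interactions["item_id"].value_counts())
    weights = 1.0 / freq.to_numpy()
    probs = weights / weights.sum()
    idx = rng.choice(len(interactions), size=n, replace=False, p=probs)
    return interactions.iloc[idx]
```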

Slide 24


Characteristics of Display Frames
Official App
› Characteristics: Displayed to users of the app; there are various other frames, and differentiation is necessary
› Better user sampling: Balance
› Better item sampling: Focus on minor items

Slide 25


Characteristics of Display Frames
Official App
› Characteristics: Displayed to users of the app; there are various other frames, and differentiation is necessary
› Better user sampling: Balance
› Better item sampling: Focus on minor items
SmartCH
› Characteristics: Displayed to users of the app; there is no other frame for the app
› Better user sampling: Balance
› Better item sampling: Balance

Slide 26


Characteristics of Display Frames
Official App
› Characteristics: Displayed to users of the app; there are various other frames, and differentiation is necessary
› Better user sampling: Balance
› Better item sampling: Focus on minor items
SmartCH
› Characteristics: Displayed to users of the app; there is no other frame for the app
› Better user sampling: Balance
› Better item sampling: Balance
HomeTab
› Characteristics: Displayed even if the user does not use the app; there is no other frame for the app
› Better user sampling: Focus on light and cold users
› Better item sampling: Balance

Slide 27


Bias Reduction & Consideration of Frame Characteristics by ``recommendation/baseline`` of masala
› Weighted sampling following [D. Carraro, et al.]*, extended with smoothing parameters $\alpha$ ($\geq 0$) and $\beta$ ($\in [0, 1]$):

$$\hat{p}_u = \frac{1}{|U|}, \qquad \hat{p}_i = \frac{1}{|I|}$$

$$p_u = \frac{\alpha + f_u^{\beta}}{\sum_{u'} \left( \alpha + f_{u'}^{\beta} \right)}, \qquad p_i = \frac{\alpha + f_i^{\beta}}{\sum_{i'} \left( \alpha + f_{i'}^{\beta} \right)}$$

$$w_{u,i} = \frac{\hat{p}_u}{p_u} \cdot \frac{\hat{p}_i}{p_i}$$

where $f_u$, $f_i$ are the frequencies of user $u$ and item $i$, and $|U|$, $|I|$ are the numbers of users and items.
* [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach

Slide 28


Bias Reduction & Consideration of Frame Characteristics by masala
› Weighted sampling following [D. Carraro, et al.]* with the smoothing parameters $\alpha$ and $\beta$ defined above.
[Figure 1: Sampling configuration in masala]
* [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
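Under the reconstruction above, the sampling weights can be computed as in the following sketch; variable names are assumptions, and the same $\alpha$, $\beta$ are applied to users and items as on the slide.

```python
# Sketch of the smoothed sampling weight w_{u,i} = (p̂_u / p_u) * (p̂_i / p_i)
# reconstructed from [D. Carraro, et al.] with smoothing parameters
# alpha >= 0 and beta in [0, 1]. Variable names are assumptions.
import numpy as np

def sampling_weights(user_freq: np.ndarray, item_freq: np.ndarray,
                     alpha: float, beta: float) -> np.ndarray:
    """Return a |U| x |I| matrix of weights w_{u,i}."""
    smoothed_u = alpha + user_freq ** beta
    smoothed_i = alpha + item_freq ** beta
    p_u = smoothed_u / smoothed_u.sum()     # empirical (MNAR) user prob.
    p_i = smoothed_i / smoothed_i.sum()     # empirical (MNAR) item prob.
    p_hat_u = 1.0 / len(user_freq)          # uniform (MAR) target prob.
    p_hat_i = 1.0 / len(item_freq)
    return np.outer(p_hat_u / p_u, p_hat_i / p_i)
```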

Slide 29


Consideration of Data Leakage in Dataset Splitting
Data leakage makes the training problem easier and the test problem harder.
› The more user behavior types there are, the more complicated the consideration of data leakage becomes.

Slide 30


Consideration of Data Leakage in Dataset Splitting
Data leakage makes the training problem easier and the test problem harder.
› The more user behavior types there are, the more complicated the consideration of data leakage becomes.
Data leakage is automatically removed in masala
› By specifying the category of each feature in the feature settings, like product_id in Figure 2.
[Figure 2: Category settings in masala]
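The transcript does not show the exact mechanism, so the following is only one plausible reading: a dataset constructor that uses each feature's declared category (product_id, mirroring Figure 2's example) to aggregate strictly pre-cutoff interactions, which keeps future information out of the features.

```python
# Illustrative sketch of leakage-safe feature construction: every feature
# declares the category (key column) it aggregates over, and aggregates
# are computed only from interactions before the training cutoff. This is
# a guess at the idea, not masala's actual mechanism.
import pandas as pd

FEATURE_SETTINGS = [
    {"name": "purchase_count", "category": "product_id", "behavior": "purchase"},
    {"name": "click_count",    "category": "product_id", "behavior": "click"},
]

def build_features(interactions: pd.DataFrame, train_end: pd.Timestamp):
    past = interactions[interactions["timestamp"] < train_end]  # no future rows
    features = {}
    for f in FEATURE_SETTINGS:
        rows = past[past["behavior"] == f["behavior"]]
        features[f["name"]] = rows.groupby(f["category"]).size()
    return pd.DataFrame(features).fillna(0)
```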

Slide 31


Evaluation Metrics
Performance or diversity

Slide 32


Quantitative Evaluation
Find weaknesses in existing recommender systems with various metrics and lead to improvements.

Metric Type          | Metrics
Performance          | Recall, nDCG, Mean Average Precision, etc.
Aggregate Diversity  | The number of unique items recommended (Unique), Aggregate Entropy
Individual Diversity | Intra-List Similarity (ILS), Individual Entropy
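As a small worked example of the aggregate-diversity metrics, the following sketch counts unique recommended items and computes aggregate entropy over item exposure counts; the input format is an assumption.

```python
# Minimal sketch of two aggregate-diversity metrics from the slide: the
# number of unique recommended items ("Unique") and aggregate entropy
# over item exposure counts. The input format is an assumption.
import math
from collections import Counter

def aggregate_diversity(rec_lists: list[list[str]]) -> tuple[int, float]:
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy

# Example: two users' top-3 recommendation lists.
unique, ent = aggregate_diversity([["a", "b", "c"], ["a", "b", "d"]])
```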

Slide 33


Individual Diversity by Attribute
``recommendation/baseline`` of masala can provide individual diversity by any attribute
› Makes it possible to compare diversity without a demo
[Figure 3: Example recommendation lists; the left list (all Horror) has lower individual diversity in genre_id than the right list (including Comedy).]

Slide 34


Individual Diversity by Attribute
``recommendation/baseline`` of masala can provide individual diversity by any attribute
› Makes it possible to compare diversity without a demo

           | Author Intra-List Similarity | Genre Intra-List Similarity | Magazine Intra-List Similarity
Baseline 1 | 4.33                         | 90.84                       | 2.93
Baseline 2 | 3.95                         | 75.48                       | 2.67

[Figure 3: Example recommendation lists; the left list (all Horror) has lower individual diversity in genre_id than the right list (including Comedy).]
[Figure 4: Individual diversity configuration in masala]
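A minimal sketch of intra-list similarity on a single attribute such as genre_id, using the simplest possible pairwise similarity (1 if two items share the attribute value, else 0); the real metric in masala may use a different similarity.

```python
# Sketch of intra-list similarity (ILS) on one attribute (e.g. genre_id):
# the average pairwise similarity inside a recommendation list, here with
# a simple "same attribute value" similarity. The similarity choice and
# input format are assumptions.
from itertools import combinations

def intra_list_similarity(rec_attrs: list[str]) -> float:
    pairs = list(combinations(rec_attrs, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

# An all-Horror list is maximally similar (low diversity); a mixed list
# scores lower, matching the comparison in Figure 3.
print(intra_list_similarity(["Horror"] * 5))               # 1.0
print(intra_list_similarity(["Horror"] * 4 + ["Comedy"]))  # 0.6
```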

Slide 35


Qualitative Evaluation
Display recommendation outputs in a demo; ``recommendation/baseline`` in masala can provide the demo.
› Compare an existing system with proposals per segment (major/heavy, medium/middle, minor/light)
› Check validity when emphasizing diversity

Slide 36


Qualitative Evaluation
Display recommendation outputs in a demo; ``recommendation/baseline`` in masala can provide the demo.
› Compare an existing system with proposals per segment (major/heavy, medium/middle, minor/light)
› Check validity when emphasizing diversity
[Figure 5: Demo frequency label configuration in masala]

Slide 37


Model Tuning
Tuning the tradeoff between performance and diversity

Slide 38


Controllable Tradeoff between Performance and Diversity (1/2)
Baseline methods in masala provide parameters that control the tradeoff.
Hard positive sampling
› Weighted random positive sampling with weights

$$w_i = \frac{\sum_{i'} \left( \alpha_p + f_{i'}^{\beta_p} \right)}{|I| \left( \alpha_p + f_i^{\beta_p} \right)}$$

where $\alpha_p$ ($\geq 0$) and $\beta_p$ ($\in [0, 1]$) are smoothing parameters, $f_i$ is the frequency of item $i$, and $|I|$ is the number of items.

Slide 39


Controllable Tradeoff between Performance and Diversity (1/2)
Baseline methods in masala provide parameters that control the tradeoff.
Hard positive sampling
› Weighted random positive sampling with weights

$$w_i = \frac{\sum_{i'} \left( \alpha_p + f_{i'}^{\beta_p} \right)}{|I| \left( \alpha_p + f_i^{\beta_p} \right)}$$

where $\alpha_p$ ($\geq 0$) and $\beta_p$ ($\in [0, 1]$) are smoothing parameters, $f_i$ is the frequency of item $i$, and $|I|$ is the number of items.
Hard negative sampling
› Weighted random negative sampling following * with probabilities

$$q_j = \frac{\alpha_n + f_j^{\beta_n}}{\sum_{j'} \left( \alpha_n + f_{j'}^{\beta_n} \right)}$$

where $\alpha_n$ ($\geq 0$) and $\beta_n$ ($\in [0, 1]$) are smoothing parameters and $f_j$ is the frequency of item $j$.
* [Y. Goldberg, et al.]: word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
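Both distributions can be sketched as follows; with $\alpha_n = 0$ and $\beta_n = 0.75$ the negative-sampling formula recovers the word2vec heuristic. Function and variable names are assumptions.

```python
# Sketch of the two smoothed sampling distributions on the slide: hard
# positive sampling up-weights rare items, and word2vec-style negative
# sampling ([Y. Goldberg, et al.]) up-weights frequent items, with
# (alpha, beta) controlling the performance/diversity tradeoff.
import numpy as np

def positive_weights(item_freq: np.ndarray, alpha: float, beta: float):
    smoothed = alpha + item_freq ** beta
    w = smoothed.sum() / (len(item_freq) * smoothed)  # rare items weigh more
    return w / w.sum()                                # normalize to probs.

def negative_probs(item_freq: np.ndarray, alpha: float, beta: float):
    smoothed = alpha + item_freq ** beta  # alpha=0, beta=0.75 -> word2vec
    return smoothed / smoothed.sum()
```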

Slide 40


Controllable Tradeoff between Performance and Diversity (2/2)
[Figure: nDCG (1.0%–4.5%) vs. Unique (10,000–30,000) as the positive sampling smoothing parameter β_p varies over {0, 0.25, 0.5, 0.75}; annotated "Minor".]

Slide 41


Controllable Tradeoff between Performance and Diversity (2/2)
[Figure: two panels of nDCG (1.0%–4.5%) vs. Unique (10,000–30,000), one varying the positive sampling smoothing parameter β_p over {0, 0.25, 0.5, 0.75} (annotated "Minor"), the other varying the negative sampling smoothing parameter β_n over the same values (annotated "Major").]

Slide 42


Robustness Test
Build the latest dataset and re-test to check the reproducibility of model performances.

Old dataset:
                     | nDCG (%) | Unique | Genre Intra-List Similarity
Baseline             | 3.22     | 20,443 | 90.84
Proposal Diversity   | 3.48     | 21,682 | 75.48
Proposal Performance | 3.98     | 18,304 | 73.77

Latest dataset:
                     | nDCG (%) | Unique | Genre Intra-List Similarity
Baseline             | 3.40     | 18,677 | 89.73
Proposal Diversity   | 3.73     | 20,419 | 74.79
Proposal Performance | 4.50     | 16,853 | 74.61

The relative performance and diversity between models did not change on the latest dataset. Go to the online test!

Slide 43


Case Study of a Recommender System using masala
LINE theme recommendation in HomeTab

Slide 44


LINE Theme Recommendation in HomeTab
This frame is displayed on the home tab of the LINE app.
› Multi-service recommendations are displayed.
Evaluated with an offline reproducible test in advance:
› Dataset splitting by using masala
› Proposal model provided in masala, tuned toward diversity by adjusting the positive/negative sampling parameters

         | nDCG  | Unique | Aggregate Entropy
Baseline | 0.051 | 2,170  | 8.75
Proposal | 0.063 | 16,507 | 11.93

Slide 45


Online Test Performance
[Figure: four bar charts comparing Baseline and Proposal on CTR, unique items in clicks, CV via HomeTab, and unique items in CV via HomeTab.]

Slide 46


Case Study for Efficiency by masala
Easy to add new features to a model without considering data leakage (20 minutes → 1 minute)

Slide 47


Case Study for Efficiency by masala
Easy to add new features to a model without considering data leakage (20 minutes → 1 minute)
Easy to run in another service by copying the config and replacing only the features and data paths (2 weeks → 2 hours)

Slide 48


Case Study for Efficiency by masala
Easy to add new features to a model without considering data leakage (20 minutes → 1 minute)
Easy to check the behavior of models in the service by demo (2 hours → 10 minutes)
Easy to run in another service by copying the config and replacing only the features and data paths (2 weeks → 2 hours)

Slide 49


Conclusion & Future Work
More efficient and effective

Slide 50


Conclusion & Future Work
Conclusion
› Introduced our efforts toward efficient & effective recommender system development.
› Explained what to be careful about for effective development: dataset splitting, evaluation metrics, and model tuning.
› Introduced how to do it efficiently with masala.
Future work
› Investigate the tradeoff between performance and diversity in more detail with online testing.
› Develop a new ``recommendation/baseline`` to efficiently & effectively develop cross-recommendations.