systems is very costly. • Good business effects are required. Challenging Issues: Achieving efficient and effective recommender system development is required.
for offline tests • Appropriate handling of data leakage in dataset construction and training • Flexible feature settings • A continuously improved recommender engine served as a baseline • Various offline evaluation metrics that give a multifaceted perspective • Demos specialized for each service, etc.
Dataset constructor / Execute baseline methods / Evaluation / Demo — masala generates the configs for these related tasks of a recommender system development and runs them so that development can start quickly.
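As a rough illustration of this config-driven workflow (a minimal sketch only: the task names, config fields, and functions below are hypothetical and not masala's actual interface):

```python
# Hypothetical sketch of a config-driven pipeline in the spirit of the workflow
# above: generate a config per related task, then run the tasks in order.
# All names (make_config, run_task, the task list) are assumptions for illustration.
from typing import Dict, List

TASKS: List[str] = ["dataset_construction", "baseline", "evaluation", "demo"]

def make_config(task: str, service: str) -> Dict:
    """Generate a minimal config for one task of one service."""
    return {
        "task": task,
        "service": service,
        "features": ["user_id", "item_id", "product_id"],        # example feature setting
        "data_path": f"/data/{service}/interactions.parquet",    # example path
    }

def run_task(config: Dict) -> None:
    """Placeholder runner: a real pipeline would dispatch to each component."""
    print(f"running {config['task']} for {config['service']}")

if __name__ == "__main__":
    configs = [make_config(task, service="official_app") for task in TASKS]
    for config in configs:
        run_task(config)
```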
reasons such as minority › This causes biases, so the recommender system cannot be evaluated precisely. Most recommender system datasets are Missing Not At Random (MNAR). Missing At Random (MAR) is desirable. › User-item interactions are missing at random › This does not cause any bias. Bias must be reduced from the MNAR dataset for precise evaluation.
more the item is recommended in the current system, the less the impact on the evaluation*. Reduces biases caused by the current system. Difficult to apply in some cases. Sampling: an MAR-like dataset is created from the MNAR dataset by weighted sampling proportional to the reciprocal of item frequency** (sketched below). Easy to apply in various cases. Only popularity bias can be reduced. * [T. Schnabel, et al.]: Recommendations as Treatments: Debiasing Learning and Evaluation ** [D. Carraro, et al.]: Debiased Offline Evaluation of Recommender Systems: A Weighted-Sampling Approach
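A minimal sketch of the weighted-sampling idea from Carraro et al., assuming interactions are given as (user, item) pairs; the function and variable names are illustrative only, not masala's implementation:

```python
# Sketch of building a MAR-like test set from an MNAR interaction log by
# sampling each interaction with weight proportional to the reciprocal of
# its item's frequency (popular items are down-weighted).
from collections import Counter
import random

def sample_mar_like(interactions, n_samples, seed=0):
    """interactions: list of (user_id, item_id) pairs observed in the MNAR log."""
    item_freq = Counter(item for _, item in interactions)
    weights = [1.0 / item_freq[item] for _, item in interactions]
    rng = random.Random(seed)
    # Weighted sampling (with replacement for simplicity); Carraro et al.
    # describe more careful schemes, this only conveys the core idea.
    return rng.choices(interactions, weights=weights, k=n_samples)

if __name__ == "__main__":
    log = [("u1", "popular"), ("u2", "popular"), ("u3", "popular"), ("u4", "rare")]
    print(sample_mar_like(log, n_samples=2))
```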
Service | Characteristics | User Sampling | Item Sampling
Official App | Display to users of the app; there are various other frames and differentiation is necessary | Balance | Focus on minor items
SmartCH | Display to users of the app; there is no other frame for the app | Balance | Balance
HomeTab | Display even if the user does not use the app; there is no other frame for the app | Focus on light and cold users | Balance
user behavior types, the more complicated the consideration of data leakage becomes. Data leakage makes the training problem easier and the test problem more difficult. › By specifying the category in the feature settings, like product_id in Figure 2, data leakage is automatically removed in masala (a sketch of the idea follows). Figure 2: Category settings in masala
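The exact config format of masala is not shown here, but the intent can be sketched roughly as follows: once a feature is tied to a category key such as product_id, training features are built only from events before the split point, so nothing from the test period leaks into training. Everything below besides product_id is an assumption:

```python
# Hedged sketch of leakage-safe dataset construction: interactions are split
# by time, and training features are aggregated per category key (here
# "product_id", mirroring the category setting in Figure 2) only from the
# training period.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Event:
    user_id: str
    product_id: str
    timestamp: int

def split_and_featurize(events: List[Event], split_ts: int):
    train = [e for e in events if e.timestamp < split_ts]
    test = [e for e in events if e.timestamp >= split_ts]
    # Features are counted per category key only from the training period,
    # so no information from the test period leaks into training features.
    feature_counts: Dict[str, int] = {}
    for e in train:
        feature_counts[e.product_id] = feature_counts.get(e.product_id, 0) + 1
    return train, test, feature_counts

if __name__ == "__main__":
    events = [Event("u1", "p1", 1), Event("u1", "p2", 5), Event("u2", "p1", 9)]
    print(split_and_featurize(events, split_ts=5))
```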
metrics and lead to improvement.
Metric Type | Metric
Performance | Recall, nDCG, Mean Average Precision, etc.
Aggregate Diversity | The number of unique items recommended (Unique), Aggregate Entropy
Individual Diversity | Intra-List Similarity (ILS), Individual Entropy
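For concreteness, the two aggregate diversity metrics in the table can be computed roughly as follows (a sketch under the usual definitions, using log base 2; masala's exact implementation is not shown here):

```python
# Sketch of the aggregate diversity metrics: the number of unique items
# recommended across all users (Unique) and the entropy of the item
# distribution over all recommendation slots (Aggregate Entropy).
import math
from collections import Counter

def unique_items(rec_lists):
    return len({item for recs in rec_lists for item in recs})

def aggregate_entropy(rec_lists):
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    recs = [["a", "b", "c"], ["a", "b", "d"], ["a", "a", "e"]]
    print(unique_items(recs), aggregate_entropy(recs))
```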
Figure 3: Example of recommendation lists. The left has low individual diversity in genre_id compared with the right (e.g., Horror, Horror, Horror, Horror, Comedy).
› Be able to compare the diversity without the demo
``recommendation/baseline`` of masala can provide individual diversity by any attribute (a sketch follows the table below).
 | Author Intra-List Similarity | Genre Intra-List Similarity | Magazine Intra-List Similarity
Baseline 1 | 4.33 | 90.84 | 2.93
Baseline 2 | 3.95 | 75.48 | 2.67
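One simple way to obtain a per-attribute intra-list similarity like the numbers above is to count, per recommendation list, how many item pairs share the same attribute value and average over lists. This is only a sketch of the general idea, not masala's exact definition or scale:

```python
# Sketch of attribute-based intra-list similarity: for each recommendation
# list, count item pairs that share the same value of a chosen attribute
# (e.g., genre_id), then average over lists. Higher means less diverse.
from itertools import combinations

def intra_list_similarity(rec_lists, item_attrs, attribute):
    """rec_lists: lists of item ids; item_attrs: item id -> {attribute: value}."""
    scores = []
    for recs in rec_lists:
        pairs = list(combinations(recs, 2))
        if not pairs:
            continue
        same = sum(item_attrs[a][attribute] == item_attrs[b][attribute] for a, b in pairs)
        scores.append(same / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    attrs = {"i1": {"genre_id": "horror"}, "i2": {"genre_id": "horror"}, "i3": {"genre_id": "comedy"}}
    print(intra_list_similarity([["i1", "i2", "i3"]], attrs, "genre_id"))
```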
per major/heavy, medium/middle, and minor/light groups › Check validity when emphasizing diversity. Display recommendation outputs in the demo. ``recommendation/baseline`` in masala can provide the demo (a hypothetical sketch of the frequency labeling follows). Figure 5: Demo frequency label configuration in masala
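The frequency labels of Figure 5 can be imagined as a simple bucketing of users (or items) by interaction count so that metrics and demo views can be broken down per group. The thresholds and names below are hypothetical, not masala's actual configuration:

```python
# Hypothetical sketch of frequency labeling: bucket users into heavy/middle/
# light groups by interaction count. Thresholds are illustrative only.
from collections import Counter

def frequency_labels(interactions, heavy_min=100, light_max=10):
    """interactions: list of (user_id, item_id); returns user_id -> label."""
    counts = Counter(user for user, _ in interactions)
    labels = {}
    for user, c in counts.items():
        if c >= heavy_min:
            labels[user] = "heavy"
        elif c <= light_max:
            labels[user] = "light"
        else:
            labels[user] = "middle"
    return labels

if __name__ == "__main__":
    log = [("u1", "i1")] * 120 + [("u2", "i2")] * 30 + [("u3", "i3")] * 3
    print(frequency_labels(log))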
masala provides the parameters that control the tradeoff.
α (≥ 0): smoothing parameter
β (∈ [0, 1]): smoothing parameter
f_i: frequency of item i
|I|: the number of items
w_i = Σ_{j ∈ I} (α + f_j^β) / ( |I| · (α + f_i^β) )
Hard Positive Sampling › Weighted random positive sampling according to w_i (a sketch follows)
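Under that weight definition (note the formula above is reconstructed from a garbled original, so take it as the intended shape: weights inversely proportional to the smoothed item frequency α + f_i^β), a sketch of the weighted positive sampling could look like this; function names are illustrative only:

```python
# Sketch of hard positive sampling: positive examples are drawn with weights
# inversely proportional to the smoothed item frequency (alpha + f_i ** beta),
# normalized by the sum of smoothed frequencies over the number of items.
import random

def positive_sampling_weights(item_freq, alpha=1.0, beta=0.75):
    """item_freq: item id -> frequency f_i; returns item id -> weight w_i."""
    smoothed = {i: alpha + f ** beta for i, f in item_freq.items()}
    total = sum(smoothed.values())
    n_items = len(smoothed)
    return {i: total / (n_items * s) for i, s in smoothed.items()}

def sample_positive(positives, weights, rng=random):
    """Draw one positive item from a user's positive items, weighted by w_i."""
    ws = [weights[i] for i in positives]
    return rng.choices(positives, weights=ws, k=1)[0]

if __name__ == "__main__":
    w = positive_sampling_weights({"popular": 1000, "rare": 5})
    print(w, sample_positive(["popular", "rare"], w))
```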
Displayed in the frame in the home of the LINE app.
 | nDCG | Unique | Aggregate Entropy
Baseline | 0.051 | 2170 | 8.75
Proposal | 0.063 | 16507 | 11.93
› Dataset splitting by using masala
› Proposal model provided in masala, tuned toward diversity by adjusting the positive/negative sampling parameters
Evaluated with a reproducible offline test in advance
features to the model without considering data leakage (20 minutes → 1 minute). Easy to check the behavior of models in the service with the demo (2 hours → 10 minutes). Easy to run in another service by copying the config and replacing only the features and data paths (2 weeks → 2 hours).
efficient & effective recommender system development. › Explained what to be careful about for effective development in dataset splitting, evaluation metrics, and model tuning. › Introduced how to do it efficiently with masala. Conclusion Future work › Investigate the trade-off between performance and diversity in more detail with online testing. › Develop a new ``recommendation/baseline`` to efficiently & effectively develop cross-recommendations.