and Off-Policy Evaluation Performance in Coupon Allocation
Naoki Nishimura, Recruit Co., Ltd.
Ken Kobayashi, Science Tokyo
Kazuhide Nakata, Science Tokyo
The Pacific Rim International Conference on Artificial Intelligence (PRICAI), Nov 18-24, 2024 @ Kyoto University
of coupon campaigns
• Encourage customer actions towards services to increase business revenue
• Model development process in coupon campaigns
  • Initial: Data collection through rule-based or uniform random allocation
  • Mid-term: Model training based on collected data, confirming improvement through off-policy evaluation and online testing
  • Later stage: Further model improvements based on data collected during model operation
[Figure: process flow. Data collection (rule-based, uniform random) → Model training → Off-policy evaluation (offline testing) → Determining the ratios of online testing patterns → Online testing → Data collection]
• Model improvement directly leads to revenue growth: actively researched
  • Personalized Promotion Decision Making (Yang et al., 2022)
    • Models both direct and enduring effects of promotions
    • Uses a two-stage framework: response prediction and decision optimization
  • Robust Optimization (Uehara et al., 2024)
    • Handles uncertainty in customer response to coupons
    • Uses robust optimization to maximize sales uplift within budget constraints
determining the ratios phase
• Determining the ratios of online testing patterns: less actively researched
  • The ratios are often set by past convention → Choosing them appropriately may offer as much potential for revenue improvement as improving the model
• Example of mixing patterns for online testing in coupon campaigns
  • Model-based allocation: Customers are allocated coupons based on the best past model
  • Uniform random allocation: Customers are allocated coupons at random to evaluate model performance
of increasing uniform policy mixing ratio
Advantage
• Data collection for future model improvement
• Enables data collection across the entire customer base
Disadvantage
• Short-term revenue loss during the data collection period
• The larger the performance gap between the model-based and uniform policies, the greater the potential damage to short-term revenue
[Figure: coupon allocated and non-allocated customers under the model-based policy and the uniform policy, ordered by the model's allocation priority]
We aim to quantitatively determine the policy mixing ratio, considering the trade-off between short-term revenue and data collection for future model improvements.
(1/3): Naive estimator for evaluation policy (eval policy)
OPE: Evaluating the performance of a new policy's (eval policy's) decisions based on past log data
Naive estimator: The average of rewards over log records where the action selected by the eval policy matches the action selected by the data collection policy
The naive estimator over-weights actions that the data collection policy is likely to select
[Figure: customers allocated coupons in data collection vs. non-allocated customers, overlaid with customers allocated coupons by the eval policy. Customers likely to be selected by the data collection policy are also allocated by the eval policy; customers unlikely to be selected by the data collection policy are not allocated by the eval policy.]
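The naive estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `naive_estimate`, the binary action encoding (1 = coupon allocated, 0 = not allocated), and the toy log are all assumptions introduced here, and the eval policy is assumed to be deterministic.

```python
import numpy as np

def naive_estimate(logged_actions, rewards, eval_actions):
    """Average reward over log records whose logged action equals the eval policy's action.

    Assumption: the eval policy is deterministic, so it selects one action per customer.
    """
    logged_actions = np.asarray(logged_actions)
    rewards = np.asarray(rewards, dtype=float)
    eval_actions = np.asarray(eval_actions)
    match = logged_actions == eval_actions
    if not match.any():
        raise ValueError("no logged action matches the eval policy")
    return float(rewards[match].mean())

# Hypothetical toy log: 1 = coupon allocated, 0 = not allocated.
logged_actions = [1, 1, 0, 0]
rewards = [10.0, 6.0, 2.0, 4.0]
eval_actions = [1, 0, 0, 1]
naive_value = naive_estimate(logged_actions, rewards, eval_actions)
```

Only the matching records contribute, so the estimate is dominated by customers the data collection policy tends to select.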
(2/3): Inverse propensity score (IPS) estimator (1/2)
IPS estimator: The average of rewards weighted by the eval policy's action probabilities multiplied by the inverse of the data collection policy's selection probabilities
The IPS estimator is unbiased when every action the eval policy can select has a non-zero probability under the data collection policy
• Reduces the weights of actions likely to be selected by the data collection policy
• Increases the weights of actions unlikely to be selected by the data collection policy
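As a rough sketch of the IPS weighting, assume each log record stores the probability that the eval policy and the data collection policy each assign to the logged action. The function name `ips_estimate` and the toy numbers are hypothetical.

```python
import numpy as np

def ips_estimate(rewards, eval_probs, logging_probs):
    """Mean of rewards weighted by eval_prob / logging_prob for the logged action."""
    rewards = np.asarray(rewards, dtype=float)
    eval_probs = np.asarray(eval_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    if np.any(logging_probs <= 0):
        # The importance weight is undefined when the logging probability is zero.
        raise ValueError("logging probabilities must be positive")
    weights = eval_probs / logging_probs
    return float(np.mean(weights * rewards))

# Hypothetical toy log collected by a uniform policy (probability 0.5 per action).
rewards = [10.0, 6.0, 2.0, 4.0]
eval_probs = [0.8, 0.2, 0.8, 0.2]     # eval policy's probability of the logged action
logging_probs = [0.5, 0.5, 0.5, 0.5]  # data collection policy's probability
ips_value = ips_estimate(rewards, eval_probs, logging_probs)
```

Records whose action the data collection policy rarely selects receive large weights, which corrects the over-weighting of the naive estimator.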
(2/3): Inverse propensity score (IPS) estimator (2/2)
In practice, allocation is often deterministic, based on priority → The selection probability can be zero when data is collected by a deterministic policy
• Weights cannot be calculated for actions with zero probability under the data collection policy
• The IPS estimator is no longer unbiased and OPE performance※ deteriorates
※ Estimation error, represented by bias and variance
(3/3): Balanced inverse propensity score (BIPS) estimator
BIPS estimator: The average of rewards weighted by the eval policy's action probabilities multiplied by the inverse of the mixing-ratio-weighted average of the selection probabilities of multiple data collection policies
The BIPS estimator is unbiased when at least one of the data collection policies assigns a non-zero probability to each action the eval policy can select
→ OPE performance improves by mixing a deterministic model-based policy with a probabilistic uniform policy
By mixing in a uniform policy, the selection probability of the combined data collection policy becomes positive regardless of the model-based policy
[Figure: model-based policy and uniform policy, each with its mixing ratio]
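A minimal sketch of the BIPS weighting follows, assuming the log stores, for each record, the probability that every data collection policy assigns to the logged action. The function name `bips_estimate`, the array layout, and the toy values are assumptions made for illustration.

```python
import numpy as np

def bips_estimate(rewards, eval_probs, logging_probs_per_policy, mixing_ratios):
    """Balanced IPS: weight each reward by eval_prob / (mixing-ratio-weighted logging prob).

    logging_probs_per_policy has shape (K, n): row k holds policy k's probability
    of the logged action for each of the n records.
    """
    rewards = np.asarray(rewards, dtype=float)
    eval_probs = np.asarray(eval_probs, dtype=float)
    probs = np.asarray(logging_probs_per_policy, dtype=float)
    ratios = np.asarray(mixing_ratios, dtype=float)
    if not np.isclose(ratios.sum(), 1.0) or np.any(ratios < 0):
        raise ValueError("mixing ratios must be non-negative and sum to 1")
    mixed = ratios @ probs  # shape (n,): probability under the mixed policy
    if np.any(mixed <= 0):
        raise ValueError("mixed logging probability must be positive")
    return float(np.mean(eval_probs / mixed * rewards))

# Hypothetical example: deterministic model-based policy mixed with a uniform policy.
rewards = [9.0, 2.0, 18.0, 4.0]
eval_probs = [0.9, 0.5, 0.9, 0.5]
model_based = [1.0, 0.0, 1.0, 0.0]  # deterministic: probability is 0 or 1
uniform = [0.5, 0.5, 0.5, 0.5]
bips_value = bips_estimate(rewards, eval_probs, [model_based, uniform], [0.8, 0.2])
```

Records where the model-based policy has zero probability still get a finite weight, because the uniform policy contributes 0.2 × 0.5 to the mixed probability.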
estimator enables quantitative evaluation of OPE performance under a given mixing ratio of data collection policies
In practice, the mixing ratio is decided based on the trade-off between OPE performance and revenue, not on OPE performance alone
This Research
Enables quantitative determination of the mixing ratio by formulating it as a bi-objective optimization problem that considers both revenue and OPE performance metrics
performance metrics
Revenue metric: The revenue observed under the data collection policy is an unbiased estimator, because it is the actual result obtained by allocating coupons to the target population in the online test
OPE performance metric: In practice, the error between the true value of the evaluation policy and its estimate cannot be calculated
→ Evaluate the variance component using the variance or standard deviation of the BIPS estimator, computed by bootstrap sampling from the log data
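The bootstrap step can be sketched as follows. The slides do not state the number of bootstrap replicates or the resampling details, so `n_boot`, `seed`, and the function name `bootstrap_std` are assumptions; `weights` stands for the per-record BIPS importance weights (eval probability divided by the mixed data collection probability).

```python
import numpy as np

def bootstrap_std(rewards, weights, n_boot=1000, seed=0):
    """Standard deviation of the weighted-mean estimator over bootstrap resamples of the log."""
    rng = np.random.default_rng(seed)
    values = np.asarray(rewards, dtype=float) * np.asarray(weights, dtype=float)
    n = len(values)
    # Each row of idx is one resample of the n log records, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    estimates = values[idx].mean(axis=1)
    return float(estimates.std(ddof=1))

# Hypothetical rewards and importance weights for four log records.
example_std = bootstrap_std([9.0, 2.0, 18.0, 4.0], [1.0, 5.0, 1.0, 5.0])
```

A smaller value indicates a more stable estimate at the given mixing ratio. It captures only the variance component of the error, not the bias.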
formulation considering trade-off between Revenue and OPE Performance
Objectives: Revenue metric, OPE performance metric
Constraints: The sum of the mixing ratios = 1; each mixing ratio is between 0 and 1
Evaluation methods by number of data collection policies
• Two data collection policies → Grid search
• Three or more policies → Approximate the Pareto set using a black-box optimization solver
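One way to write the bi-objective problem described above is sketched below. The notation is an assumption introduced here, not taken from the slides: $\alpha_k$ is the mixing ratio of data collection policy $k$ among $K$ policies, $R(\alpha)$ is the revenue metric, and $E(\alpha)$ is the OPE performance metric (an error, so smaller is better).

```latex
\begin{align}
  \max_{\alpha} \quad & R(\alpha) \\
  \min_{\alpha} \quad & E(\alpha) \\
  \text{s.t.} \quad & \sum_{k=1}^{K} \alpha_k = 1, \\
  & 0 \le \alpha_k \le 1, \quad k = 1, \dots, K.
\end{align}
```

With $K = 2$, the sum constraint leaves a single free variable, which is why a grid search over one ratio suffices in that case.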
the effectiveness of the proposed method in quantitatively evaluating the revenue and OPE performance for each policy mixing ratio
Experiment setting
Data generation: Generate revenue for 10,000 customers with and without coupon allocation
• Generate customer data with 4-dimensional features
• Revenue with coupon non-allocation:
• Revenue with coupon allocation:
Data collection policy
(1) Uniform policy: Allocation with a probability of
(2) Deterministic policy: Allocation with
(3) Deterministic policy: Allocation with
└ Positive correlation with revenue, respectively
Evaluation policy
(1) Positive correlation with the data collection policy: The probability of assignment is 0.8 when otherwise 0.2
(2) Negative correlation with the data collection policy: The probability of assignment is 0.2 when otherwise 0.2
Evaluation metrics
• Revenue metric: Sum of the revenue of the data collection policy for each mixing ratio
• OPE performance metric: Squared error between the BIPS estimator and the true value at each mixing ratio
Optimization software
• Optuna v3.6.0, using the NSGA-II algorithm (1,000 trials)
of policy mixing ratios)
• Revenue: As the mixing ratio of the uniform policy increases, revenue decreases.
• OPE performance: As the mixing ratio of the uniform policy increases, performance improves.
• Impact of similarity between data collection policy and evaluation policy: When the evaluation policy is close to (strongly correlated with) the data collection policy, OPE performance is good even with a small mixing ratio of the uniform policy.
[Figure: revenue and OPE performance (error) for each uniform policy ratio. Panels: eval policy with positive correlation to the data collection policy; eval policy with negative correlation to the data collection policy. The upper left is preferable.]
Practical application: Decision making based on the proposed method
• Pre-assessment phase: Evaluate the performance gap between the eval policy and the best policy
• Pareto front analysis phase: Visualize the trade-off between revenue and OPE performance
• Situational strategies:
  • Small performance gap → Uniform policy mixing rate↑, Reliability↑, Revenue↓
  • Large performance gap → Uniform policy mixing rate↓, Reliability↓, Revenue↑
• Optimal balance: Maximize revenue, subject to variance <= performance gap
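The decision rule above can be sketched for the two-policy grid-search case. The candidate tuples, the function names `pareto_front` and `pick_ratio`, and all numbers are hypothetical; in practice the revenue and error values would come from the metrics computed at each uniform policy ratio.

```python
def pareto_front(points):
    """Keep (ratio, revenue, error) points not dominated by another point.

    A point is dominated if another has revenue >= and error <=, with at least one strict.
    """
    front = []
    for p in points:
        dominated = any(
            q[1] >= p[1] and q[2] <= p[2] and (q[1] > p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

def pick_ratio(points, performance_gap):
    """Maximize revenue among points whose error metric is within the performance gap."""
    feasible = [p for p in points if p[2] <= performance_gap]
    if not feasible:
        return None  # no mixing ratio is reliable enough for this gap
    return max(feasible, key=lambda p: p[1])

# Hypothetical grid: (uniform policy ratio, revenue, OPE error metric).
candidates = [
    (0.0, 100.0, 9.0),
    (0.2, 95.0, 4.0),
    (0.5, 85.0, 2.0),
    (0.8, 70.0, 1.5),
    (1.0, 60.0, 1.6),
]
front = pareto_front(candidates)
chosen = pick_ratio(front, performance_gap=3.0)
```

A smaller performance gap tightens the constraint and pushes the choice towards a larger uniform policy ratio, matching the situational strategies listed above.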
• Formulated the trade-off between short-term revenue and future OPE performance in coupon allocation as a bi-objective optimization problem for determining the mixing ratio between multiple policies
• Demonstrated a method for determining mixing ratios using Pareto optimal solutions of the bi-objective optimization problem
Future work
• Application of more advanced OPE methods than the BIPS estimator
• Application to tasks other than coupon allocation that involve a trade-off between policy exploration and exploitation