Slide 1

Slide 1 text

Keio University Econometrics Workshop
AI Division (AI事業本部), AdEcon Team
Masahiro Kato
October 27, 2020

Slide 2

Slide 2 text

Table of Contents
■ Adaptive Experimental Design and Off-Policy Evaluation from Dependent Samples
  • Kato, Ishihara, Honda, and Narita. Adaptive Experimental Design for Efficient Treatment Effect Estimation. NeurIPS Causal Discovery & Causality-Inspired Machine Learning Workshop 2020. https://arxiv.org/abs/2002.05308 (rejected from the NeurIPS main conference with scores 6/6/6)
  • Kato (2020a). Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples. https://arxiv.org/pdf/2010.03792.pdf

Slide 3

Slide 3 text

Table of Contents
  • Kato (2020b). Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales. https://arxiv.org/pdf/2006.06982.pdf
  • Kato and Kaneko (2020). Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy. https://arxiv.org/abs/2010.13554

Slide 4

Slide 4 text

Table of Contents
■ Off-Policy Evaluation under Covariate Shift
  • Kato*, Uehara*, and Yasui (*equal contribution). Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. NeurIPS 2020 (spotlight). https://arxiv.org/abs/2002.11642

Slide 5

Slide 5 text

Table of Contents
■ Prediction under Delayed Feedback
  • Yasui, Morishita, Fujita, and Shibata. A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback. WWW 2020. https://dl.acm.org/doi/10.1145/3366423.3380032
  • Kato and Yasui (2020). Learning Classifiers under Delayed Feedback with a Time Window Assumption. https://arxiv.org/abs/2009.13092

Slide 6

Slide 6 text

Adaptive Experimental Design and Off-Policy Evaluation from Dependent Samples

Slide 7

Slide 7 text

Motivation
■ Goal: conduct an RCT or A/B test with as small a sample size as possible.
■ Parameter of interest: the Average Treatment Effect (ATE).
■ How can we gain efficiency?
→ Adjust the assignment probability of treatment, a.k.a. the propensity score, to minimize the asymptotic variance of an ATE estimator. Cf. van der Laan (2008) and Hahn, Hirano, and Karlan (2011).
✓ Such a problem setting is called adaptive experimental design.
  • Adaptive experimental design is closely related to off-policy evaluation from dependent samples and the multi-armed bandit problem (best-arm identification).

Slide 8

Slide 8 text

Problem Setting
• For a time series t = 1, 2, …:
  • At period t, we observe an individual with a covariate X_t ∈ 𝒳.
  • For the individual, we assign a treatment A_t ∈ {0, 1} with probability π_t(A_t | X_t).
  • Then, we immediately observe the outcome Y_t.
• Here, we assume the potential outcome framework; that is, Y_t = 1[A_t = 1] Y_t(1) + 1[A_t = 0] Y_t(0) for potential outcomes Y_t(a).
• The data-generating process is summarized as (X_t, A_t, Y_t) ∼ p(x) π_t(a | x) p(y | a, x).
• The ATE is defined as θ_0 = E[Y_t(1)] − E[Y_t(0)].

Slide 9

Slide 9 text

Asymptotic Variance of an ATE Estimator
■ Let us assume π_1 = π_2 = ⋯ = π(a | x).
■ Then, the semiparametric efficiency bound for an ATE estimator is
  E[ Var(Y_t(1) | X_t) / π(1 | X_t) + Var(Y_t(0) | X_t) / (1 − π(1 | X_t)) + ( E[Y_t(1) | X_t] − E[Y_t(0) | X_t] − θ_0 )² ].

Slide 10

Slide 10 text

Adaptive Experimental Design
■ There are ATE estimators achieving the efficiency bound. Therefore, we minimize the efficiency bound by adjusting π(1 | X_t).
= Let us regard the efficiency bound as a function of π(1 | X_t). The minimizer is given as
  π*(1 | X_t) = σ(1 | X_t) / ( σ(1 | X_t) + σ(0 | X_t) ),
  where σ(a | X_t) = √Var(Y_t(a) | X_t) is the conditional standard deviation.
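
As a toy illustration of this allocation rule (not code from the paper), the following sketch computes π*(1 | x) from given conditional standard deviations; the function name and the numbers are hypothetical.

```python
def efficient_assignment_prob(sd1, sd0):
    """Efficient assignment probability pi*(1 | x) = sd1 / (sd1 + sd0),
    where sd1, sd0 are the conditional standard deviations of Y(1), Y(0) given x."""
    return sd1 / (sd1 + sd0)

# If Y(1) is twice as noisy as Y(0) at some x, treat with probability 2/3.
print(efficient_assignment_prob(sd1=2.0, sd0=1.0))  # 0.666...
```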

Slide 11

Slide 11 text

Adaptive Experimental Design
■ Given ATE estimators that achieve the efficiency bound, assigning treatments with the efficient probability π*(1 | X_t) minimizes the asymptotic variance of the ATE estimator.
■ When Var(Y_t(a) | X_t) is known, this strategy is feasible.
→ However, in general, the conditional variance is unknown.
  • During an experiment, we estimate the conditional variance and assign treatments with the estimated efficient probability.
= We adaptively design the experiment.

Slide 12

Slide 12 text

Existing Studies
■ van der Laan (2008); Hahn, Hirano, and Karlan (2011).
  • Both papers first appeared in 2008.
■ van der Laan (2008):
  • Sequentially estimates the conditional variance and constructs an ATE estimator using the martingale property.
■ Hahn, Hirano, and Karlan (2011):
  • Separate the experiment into two stages:
    1. In the first stage, estimate the conditional variance;
    2. In the second stage, assign treatments based on the first-stage estimator.

Slide 13

Slide 13 text

Dependency Problem
■ van der Laan (2008) and Kato, Ishihara, Honda, and Narita (2020) sequentially estimate the conditional variance.
  1. At each period t, using the past samples Ω_{t−1} = {(X_s, A_s, Y_s)}_{s=1}^{t−1}, we construct an estimator of the conditional variance Var(Y_t(a) | X_t) and denote its square root by σ̂_{t−1}(a | X_t).
  2. Then, we assign a treatment with π_t(1 | X_t, Ω_{t−1}) = σ̂_{t−1}(1 | X_t) / ( σ̂_{t−1}(1 | X_t) + σ̂_{t−1}(0 | X_t) ) (see the sketch below).
     π_t(1 | X_t, Ω_{t−1}) determines the probability of assigning A_t = 1 given X_t.
■ Dependency problem:
  • Here, the samples {(X_s, A_s, Y_s)}_{s=1}^{t−1} are not i.i.d.
  • How can we estimate the ATE from these dependent samples?
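
A minimal sketch of this sequential assignment step, assuming a k-nearest-neighbor regressor as a stand-in nonparametric estimator of the conditional moments (the papers do not prescribe this choice; all names are illustrative).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def assignment_prob(x_t, X_past, A_past, Y_past, k=5, clip=0.1):
    """Estimate sd(Y(a) | x_t) for a = 1, 0 from past samples Omega_{t-1}
    and return the plug-in efficient probability of assigning A_t = 1.
    x_t: array of shape (1, d); X_past: shape (t-1, d)."""
    sds = []
    for a in (1, 0):
        idx = np.where(A_past == a)[0]
        if len(idx) < k:                      # too little data yet: randomize 50/50
            return 0.5
        m1 = KNeighborsRegressor(n_neighbors=k).fit(X_past[idx], Y_past[idx])
        m2 = KNeighborsRegressor(n_neighbors=k).fit(X_past[idx], Y_past[idx] ** 2)
        var = max(m2.predict(x_t)[0] - m1.predict(x_t)[0] ** 2, 1e-6)
        sds.append(np.sqrt(var))
    p = sds[0] / (sds[0] + sds[1])
    return float(np.clip(p, clip, 1 - clip))  # keep the probability away from 0 and 1
```

Clipping keeps the assignment probability bounded away from 0 and 1, in line with the boundedness conditions discussed later.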

Slide 14

Slide 14 text

Construction of Estimators under Dependency
■ van der Laan (2008), Hadad et al. (2019), and Kato et al. (2020) construct an estimator using the martingale property.
■ Define
  h_t = 1[A_t = 1]( Y_t − f̂_{t−1}(1, X_t) ) / π_t(1 | X_t, Ω_{t−1}) − 1[A_t = 0]( Y_t − f̂_{t−1}(0, X_t) ) / π_t(0 | X_t, Ω_{t−1}) + f̂_{t−1}(1, X_t) − f̂_{t−1}(0, X_t),
  where f̂_{t−1}(a, X_t) is an estimator of E[Y_t(a) | X_t] using only Ω_{t−1} (see the sketch below).
■ Then, we define an ATE estimator as θ̂_T = (1/T) Σ_{t=1}^T h_t.
■ Owing to the martingale property, we can derive the asymptotic normality even if the samples have dependency.
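
A small sketch of how the score h_t could be computed online; the outcome-model interface f_prev(a, x) is a hypothetical stand-in for any estimator fit only on Ω_{t−1}.

```python
import numpy as np

def a2ipw_score(a_t, y_t, x_t, pi_t1, f_prev):
    """Score h_t built from the current observation and quantities computed
    from Omega_{t-1}: the assignment probability pi_t(1 | X_t, Omega_{t-1})
    and an outcome model f_prev(a, x) estimating E[Y_t(a) | X_t = x]."""
    f1, f0 = f_prev(1, x_t), f_prev(0, x_t)
    ipw1 = (a_t == 1) * (y_t - f1) / pi_t1
    ipw0 = (a_t == 0) * (y_t - f0) / (1.0 - pi_t1)
    return ipw1 - ipw0 + f1 - f0

# The ATE estimate is the running average of the scores:
# theta_hat_T = np.mean([a2ipw_score(...) for t in range(1, T + 1)])
```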

Slide 15

Slide 15 text

Asymptotic Normality
■ Theorem 1 of Kato et al. (2020): Suppose that
  (i) pointwise convergence in probability of f̂_{t−1} and π_t; that is, for all a ∈ {0, 1} and x ∈ 𝒳, f̂_{t−1}(a, x) →p E[Y_t(a) | X_t = x] and π_t(a | x, Ω_{t−1}) →p π(a | x), where π is a time-invariant function;
  (ii) boundedness of the random variables.
Then √T ( θ̂_T − θ_0 ) →d N(0, σ²), where
  σ² = E[ Var(Y_t(1) | X_t) / π(1 | X_t) + Var(Y_t(0) | X_t) / (1 − π(1 | X_t)) + ( E[Y_t(1) | X_t] − E[Y_t(0) | X_t] − θ_0 )² ].
  • As we explain later, the step-wise estimator f̂_{t−1} also solves the non-Donsker problem.

Slide 16

Slide 16 text

Efficient Asymptotic Variance
■ Thus, by setting π_t = π*, we can minimize the asymptotic variance
  σ² = E[ Var(Y_t(1) | X_t) / π*(1 | X_t) + Var(Y_t(0) | X_t) / (1 − π*(1 | X_t)) + ( E[Y_t(1) | X_t] − E[Y_t(0) | X_t] − θ_0 )² ].
  • Hadad et al. (2019) and Kato et al. (2020) derived these results independently, but van der Laan (2008) had already derived a similar result.
  • Hadad et al. (2019) proposed adaptive reweighting to stabilize the estimator.
  • Kato et al. (2020) derived a non-asymptotic deviation bound using a concentration inequality based on the law of the iterated logarithm.
  • Kato (2020a) finds that a doubly robust type estimator works as Hadad et al.'s adaptive weight.

Slide 17

Slide 17 text

van der Laan (2008) and Kato et al. (2020)
■ Kato, Ishihara, Honda, and Narita (2020). Adaptive Experimental Design for Efficient Treatment Effect Estimation.
  • We also proposed a method for adaptive experimental design.
  • As in van der Laan's approach, we sequentially estimate the conditional variance and assign treatments.

Slide 18

Slide 18 text

Our Contributions
Kato, Ishihara, Honda, and Narita (2020) make the following contributions.
  • Summarize ATE estimation from dependent samples.
  • Derive the asymptotic distribution (asymptotic result).
    * Partially overlaps with the result of van der Laan (2008).
  • Derive a regret bound (non-asymptotic result).
  • Show a non-asymptotic deviation bound.
  • Based on the concentration inequality, we develop a method for efficient sequential testing.

Slide 19

Slide 19 text

Sequential Testing
■ Sequential testing is a type of hypothesis testing.
  • In sequential testing, we conduct a hypothesis test at every period.
  • If we reject the null hypothesis, we stop the experiment.
We can expect sequential testing to reduce the sample size when the null hypothesis can be rejected.
■ Conventional methods:
  • Bonferroni (BF) method: known to be too conservative.
  • There are other methods for sequential testing, such as the Benjamini–Hochberg (BH) method.
■ These methods are based on multiple testing.
→ Recently, concentration-inequality-based methods have been proposed.

Slide 20

Slide 20 text

Sequential Testing and Concentration Inequality
■ Balsubramani and Ramdas (2016) proposed using a law of the iterated logarithm (LIL)-based concentration inequality for sequential testing.
■ LIL-based concentration inequalities have recently gathered interest in the bandit community (Jamieson et al. (2014)).
■ We modify the result of Balsubramani and Ramdas (2016) for efficient ATE estimation.
  • Compared with other types of sequential testing such as the Bonferroni method, concentration-inequality-based sequential testing controls the test error more appropriately.

Slide 21

Slide 21 text

Non-asymptotic Bound of an ATE Estimator
■ We obtained the following bound for the proposed ATE estimator θ̂_t, which holds for all t with probability at least 1 − δ:
  t |θ̂_t − θ_0| ≤ C_0 + √( 2 V̂*_t ( log log V̂*_t + log(4/δ) ) ),
  where C_0 is a constant of order log(1/δ), V̂*_t ≈ Σ_{s=1}^t v_s², and v_s = h_s − θ_0.
Based on this confidence bound, we can conduct sequential hypothesis testing because the above bound holds for all t.

Slide 22

Slide 22 text

Procedure of Sequential Testing
■ H_0: θ_0 = θ and H_1: θ_0 ≠ θ.
■ Control the Type I error at δ.
  1. Set q_t = 1.1 [ log(1/δ) + √( 2 Σ_{s=1}^t v_s² · log( (log Σ_{s=1}^t v_s²) / δ ) ) ].
  2. If t |θ̂_t − θ| > q_t, reject the null hypothesis.
  • The coefficient 1.1 comes from the report of Balsubramani and Ramdas (2016).
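
A schematic of the resulting stopping rule. The threshold below mirrors the form on this slide, but the exact constants should be taken from Balsubramani and Ramdas (2016); treat this as an illustrative sketch, not their precise recipe.

```python
import numpy as np

def lil_threshold(v, delta):
    """Illustrative LIL-style threshold q_t as a function of the accumulated
    squared deviations v = sum_s v_s^2; the constants mirror the slide and
    are not meant to be definitive."""
    v = max(v, np.e)                          # avoid log(log(.)) of small values
    return 1.1 * (np.log(1.0 / delta) + np.sqrt(2.0 * v * np.log(np.log(v) / delta)))

def sequential_test(scores, theta_null, delta=0.05):
    """Reject H0: theta_0 = theta_null as soon as |sum_s (h_s - theta_null)|
    exceeds the anytime-valid threshold; return the stopping time or None."""
    s = v = 0.0
    for t, h in enumerate(scores, start=1):
        s += h - theta_null
        v += (h - theta_null) ** 2
        if abs(s) > lil_threshold(v, delta):
            return t                          # stop the experiment and reject H0
    return None                               # H0 was never rejected
```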

Slide 23

Slide 23 text

Adaptive Confidence Interval and Always Valid p-value
■ The confidence interval q_t is also known as an adaptive confidence interval.
■ Recently, two directions have been proposed for sequential testing:
  • Adaptive confidence intervals (Zhao et al. (2016)).
  • Always valid p-values (Johari, Pekelis, and Walsh (2015)).
Both generate the same rejection region in hypothesis testing.

Slide 24

Slide 24 text

Sample Size and Optimal Stopping Time
■ Standard hypothesis testing:
  • We need to predetermine the sample size; that is, the sample size is not a random variable.
■ Sequential hypothesis testing:
  • We do not have to predetermine the sample size.
  • The sample size is also the stopping time, which is a random variable.
The sample-size decision problem in standard hypothesis testing becomes an optimal stopping problem in sequential hypothesis testing.
➢ Advantage: when the null hypothesis can be rejected, we can reject it as soon as possible.
➢ Disadvantage: it is difficult to control the sample size when the null hypothesis cannot be rejected.

Slide 25

Slide 25 text

Experimental Results
  • The left panel is the case where the null hypothesis can be rejected; the right panel is the case where it cannot.
  • For T = 150 and 300, we conduct standard testing at that period.
  • For "LIL" and "BF", we conduct sequential testing with the LIL-based and BF-based methods.
  • "Testing" shows the % of trials in which the hypothesis is rejected.
  • For "LIL" and "BF", we show the stopping time over 1,000 trials.
  • The total sample size is 500; when the experiment does not stop, we report 500.

Slide 26

Slide 26 text

Donsker's Condition
■ In general, in semiparametric inference, the nuisance estimators are required to satisfy some conditions, such as Donsker's condition.
■ Recently, Chernozhukov et al. (2016) proposed cross-fitting for semiparametric inference with non-Donsker-type nuisance estimators.
■ Similar ideas were proposed by Klaassen (1987) and Zheng and van der Laan (2011).
■ van der Laan and Lendle (2014) further established a method for semiparametric inference with non-Donsker-type nuisance estimators in the non-i.i.d. case.
  • The idea also uses martingale properties.

Slide 27

Slide 27 text

Adaptive-Fitting
■ Semiparametric inference with non-Donsker-type nuisance estimators and non-i.i.d. samples is investigated by van der Laan and Lendle (2014).
  • We call the martingale-based method adaptive-fitting.
  • In adaptive-fitting, we use the property that step-wisely constructed nuisance estimators can be regarded as constants in expectations conditioned on past information.
■ The ATE estimator can be generalized into off-policy evaluation (OPE) estimators.
  • In OPE, we want to estimate the policy value from samples generated by bandit algorithms.

Slide 28

Slide 28 text

Example: Doubly Robust Estimator
■ Consider a score function
  h(X_t, A_t, Y_t; f, π) = 1[A_t = 1]( Y_t − f(1, X_t) ) / π(1 | X_t) − 1[A_t = 0]( Y_t − f(0, X_t) ) / π(0 | X_t) + f(1, X_t) − f(0, X_t).
■ Cross-fitting (see the sketch below):
  • Suppose that the samples are i.i.d.
  • Separate the dataset into two subsets, D_1 and D_2.
  • Then, for nuisance estimators f̂_1(a, x) and π̂_1(a | x) constructed from D_1, we define an estimator θ̂_2 = (1/|D_2|) Σ_{t ∈ D_2} h(X_t, A_t, Y_t; f̂_1, π̂_1).
  • In E[ h(X_t, A_t, Y_t; f̂_1, π̂_1) | D_1 ], we can treat f̂_1 and π̂_1 as if they were constants.
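
A compact sketch of two-fold cross-fitting with the score above, using linear/logistic models from scikit-learn as stand-in nuisance estimators (an assumption for illustration, not the method's prescribed choice).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_score(y, a, f1, f0, e1):
    """Doubly robust score h(X, A, Y; f, pi) for the ATE (a is 0/1)."""
    return a * (y - f1) / e1 - (1 - a) * (y - f0) / (1 - e1) + f1 - f0

def cross_fit_ate(X, A, Y):
    """Two-fold cross-fitting: nuisances are fit on one half, scores on the other."""
    n = len(Y)
    fold1, fold2 = np.arange(n)[: n // 2], np.arange(n)[n // 2:]
    scores = []
    for train, test in [(fold1, fold2), (fold2, fold1)]:
        # outcome models for E[Y | A = a, X] and a propensity model for P(A = 1 | X)
        mu1 = LinearRegression().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        mu0 = LinearRegression().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        ps = LogisticRegression().fit(X[train], A[train])
        e1 = np.clip(ps.predict_proba(X[test])[:, 1], 0.05, 0.95)
        scores.append(dr_score(Y[test], A[test],
                               mu1.predict(X[test]), mu0.predict(X[test]), e1))
    return float(np.mean(np.concatenate(scores)))
```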

Slide 29

Slide 29 text

Example: Doubly Robust Estimator
On the other hand,
■ Adaptive-fitting:
  • Suppose that the samples are not i.i.d.
  • Then, for nuisance estimators f̂_{s−1}(a, x) and π̂_{s−1}(a | x) constructed from Ω_{s−1}, we define an estimator θ̈_t = (1/t) Σ_{s=1}^t h(X_s, A_s, Y_s; f̂_{s−1}, π̂_{s−1}).
  • In E[ h(X_s, A_s, Y_s; f̂_{s−1}, π̂_{s−1}) | Ω_{s−1} ], we can treat f̂_{s−1} and π̂_{s−1} as if they were constants.
A more detailed discussion is given in Kato (2020a): https://arxiv.org/pdf/2010.03792.pdf.
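
For contrast, a sketch of adaptive-fitting: at period s the nuisance estimators are trained only on Ω_{s−1}. The helper `fit_nuisances` is hypothetical; any learner that only touches past data fits here.

```python
import numpy as np

def adaptive_fit_ate(X, A, Y, fit_nuisances, burn_in=20):
    """Adaptive-fitting: the score at period s uses nuisance estimators trained
    only on samples 1, ..., s-1 (the past information Omega_{s-1})."""
    scores = []
    for s in range(burn_in, len(Y)):
        # fit_nuisances returns callables f_hat(a, x) and pi_hat(x) -> P(A = 1 | x),
        # both trained on the past data only
        f_hat, pi_hat = fit_nuisances(X[:s], A[:s], Y[:s])
        e1 = float(np.clip(pi_hat(X[s]), 0.05, 0.95))
        f1, f0 = f_hat(1, X[s]), f_hat(0, X[s])
        h_s = A[s] * (Y[s] - f1) / e1 - (1 - A[s]) * (Y[s] - f0) / (1 - e1) + f1 - f0
        scores.append(h_s)
    return float(np.mean(scores))
```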

Slide 30

Slide 30 text

Illustration of Cross-fitting and Adaptive-fitting
This slide illustrates the difference between cross-fitting and adaptive-fitting.
  • The left graph shows cross-fitting; the right graph shows adaptive-fitting.
  • The y-axis represents the samples used to construct the estimator; the x-axis represents the samples used to construct the nuisance estimators.

Slide 31

Slide 31 text

Various OPE Estimators
■ The ATE estimator can be generalized into OPE estimators.
■ We summarize various OPE estimators from dependent samples.
  • In OPE, we want to estimate the policy value E[ Σ_{a∈𝒜} π^e(a | X_t) E[Y_t(a) | X_t] ], where π^e(a | X_t) is known as the evaluation policy and 𝒜 is the action space.
  • We have a dataset {(X_t, A_t, Y_t)}_{t=1}^T with A_t ∼ π_t(a | X_t, Ω_{t−1}), where π_t(a | X_t, Ω_{t−1}) is known as the behavior policy.
  • The samples are not i.i.d.
  • For the following estimators, we assume that the behavior policy π_t(a | x, Ω_{t−1}) converges in probability to a time-invariant policy π(a | x).

Slide 32

Slide 32 text

Inverse Probability Weighting (IPW) Estimator
■ Define ĥ_t^IPW(a) = 1[A_t = a] Y_t / π_t(a | X_t, Ω_{t−1}).
  • An IPW estimator is defined as θ̂_T^IPW = (1/T) Σ_{t=1}^T Σ_{a∈𝒜} π^e(a | X_t) ĥ_t^IPW(a).
  • Note that we use the true behavior policy π_t(a | X_t, Ω_{t−1}).
■ Unlike in the i.i.d. case (e.g., Hirano, Imbens, and Ridder (2003)), it is difficult to replace the true behavior policy with its estimator.
■ In addition, the asymptotic variance is larger than the efficiency bound.
We call this estimator the Adaptive IPW (AdaIPW) estimator (see the sketch below).
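
A minimal sketch of the AdaIPW computation from a log that records, at each period, the reward, the evaluation-policy probability of the chosen action, and the true behavior probability of the chosen action (the array layout is an assumption).

```python
import numpy as np

def adaipw(rewards, pi_eval_chosen, pi_behavior_chosen):
    """AdaIPW estimate of the policy value: the mean over t of
    pi_e(A_t | X_t) / pi_t(A_t | X_t, Omega_{t-1}) * Y_t,
    using the true (recorded) behavior probabilities."""
    w = np.asarray(pi_eval_chosen) / np.asarray(pi_behavior_chosen)
    return float(np.mean(w * np.asarray(rewards)))
```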

Slide 33

Slide 33 text

Augmented IPW (AIPW) Estimator
■ Define ĥ_t^AIPW(a) = 1[A_t = a]( Y_t − f̂_{t−1}(a, X_t) ) / π_t(a | X_t, Ω_{t−1}) + f̂_{t−1}(a, X_t).
  • The AIPW estimator is θ̂_T^AIPW = (1/T) Σ_{t=1}^T Σ_{a∈𝒜} π^e(a | X_t) ĥ_t^AIPW(a).
■ Unlike a doubly robust estimator, we use the true behavior policy in the IPW component.
■ Note that the estimator is unbiased.
We call this estimator the Adaptive AIPW (A2IPW) estimator.

Slide 34

Slide 34 text

Adaptive Reweighting AIPW Estimator
■ Both the AdaIPW and A2IPW estimators use the true behavior policy and are unbiased.
■ Hadad et al. (2019) pointed out that using the true behavior policy makes the estimator unstable when π_t takes an extreme value at some period.
  • Hadad et al. (2019) proposed using adaptive weights for stabilization.
■ The adaptive-weighted A2IPW estimator is defined as
  θ̂_T^AW-A2IPW = Σ_{t=1}^T w(X_t, Ω_{t−1}) Σ_{a∈𝒜} π^e(a | X_t) ĥ_t^AIPW(a) / Σ_{t=1}^T w(X_t, Ω_{t−1}),
  where w(X_t, Ω_{t−1}) is called an adaptive weight.

Slide 35

Slide 35 text

Doubly Robust (DR) Estimator
■ Define ĥ_t^DR(a) = 1[A_t = a]( Y_t − f̂_{t−1}(a, X_t) ) / π̂_{t−1}(a | X_t) + f̂_{t−1}(a, X_t), where π̂_{t−1} is an estimator of π_t constructed from Ω_{t−1}.
  • The DR estimator is θ̂_T^DR = (1/T) Σ_{t=1}^T Σ_{a∈𝒜} π^e(a | X_t) ĥ_t^DR(a).
We call this estimator the Adaptive DR (ADR) estimator.

Slide 36

Slide 36 text

Asymptotic Normality of an ADR Estimator
■ Theorem 1 of Kato (2020a): Suppose that
  (i) pointwise convergence in probability of π_t; that is, for all a ∈ 𝒜 and x ∈ 𝒳, π_t(a | x, Ω_{t−1}) →p π(a | x), where π is a time-invariant function;
  (ii) ‖f̂_{t−1} − f‖₂ ‖π̂_{t−1} − π_t‖₂ = o_p(t^{−1/2});
  (iii) boundedness of the random variables.
Then √T ( θ̂_T^DR − θ_0 ) →d N(0, σ²), where
  σ² = E[ Var(Y_t(1) | X_t) / π(1 | X_t) + Var(Y_t(0) | X_t) / (1 − π(1 | X_t)) + ( E[Y_t(1) | X_t] − E[Y_t(0) | X_t] − θ_0 )² ].

Slide 37

Slide 37 text

Theoretical Comparison
  • Denote naively constructed AIPW and DR estimators as AIPW and DR; that is, estimators without cross-fitting and adaptive-fitting.
  • Denote AIPW and DR estimators with adaptive-fitting as A2IPW and ADR.
  • Denote AIPW and DR estimators with cross-fitting as AIPWCF and DRCF.
  • Denote IPW estimators with the true behavior policy and the estimated behavior policy as IPW and EIPW, respectively.
  • Denote the direct method estimator (1/T) Σ_{t=1}^T Σ_{a∈𝒜} π^e(a | X_t) f̂_T(a, X_t) as DM.

Slide 38

Slide 38 text

Theoretical Comparison
■ In the dependent-sample setting, IPW, A2IPW, and ADR have asymptotic normality.
■ IPW does not achieve the efficiency bound. ⇄ A2IPW and ADR achieve the efficiency bound.
■ A2IPW does not require a specific convergence rate for the nuisance estimators (van der Laan (2008), Hadad et al. (2019), Kato et al. (2020)).
■ ADR requires a specific convergence rate for the nuisance estimators. See van der Laan and Lendle (2014) and Kato (2020a).
  • The cross-fitting of Chernozhukov et al. (2016) also assumes ‖f̂_{t−1} − f‖₂ ‖π̂_{t−1} − π_t‖₂ = o_p(t^{−1/2}).
  • If π̂_{t−1} = π_t (the true behavior policy), this condition holds if ‖f̂_{t−1} − f‖₂ = o_p(1).
■ Unlike in the i.i.d. case, EIPW does not have asymptotic normality.

Slide 39

Slide 39 text

Empirical Comparison
■ Hadad et al. (2019) proposed adaptive reweighting for the A2IPW because the true behavior policy causes instability.
■ However, the ADR may absorb such instability because it replaces the true behavior policy with its estimator.
  Ex. Suppose that π_1(1 | x) = 0.5, π_2(1 | x) = 0.1, π_3(1 | x) = 0.95 for x ∈ 𝒳. Although the behavior policy is unstable, the estimators π̂_1, π̂_2, π̂_3 may not be so unstable if we estimate them from the observed data.
■ This result is similar to findings on the IPW estimator with an estimated propensity score.

Slide 40

Slide 40 text

Empirical Comparison
  • The left graph shows the estimation error of OPE from samples generated by the LinUCB algorithm.
  • The right graph shows the estimation error of OPE from samples generated by the LinTS algorithm.

Slide 41

Slide 41 text

Other Strategies for Asymptotic Normality
■ Why do we assume that π_t(a | x, Ω_{t−1}) converges in probability to a time-invariant policy π(a | x)?
→ To use the central limit theorem for martingales, which requires convergence of the variance (see Hall and Heyde, 1980).
■ Are there any strategies that do not assume convergence of the policy?
  • Standardization of martingales (Luedtke and van der Laan (2016), Kato (2020b)).
  • Batch-based approach (Kato and Kaneko (2020)).
A more detailed discussion is given in Luedtke and van der Laan (2016).

Slide 42

Slide 42 text

On-going: Semiparametric Best Arm Identification
■ We are now tackling an extension of Kato et al. (2020) to best-arm identification.
■ Kaufmann, Cappé, and Garivier (2016) showed the following:
  • Given two Gaussian arms with different means and variances,
  • denote the variances by σ_1² and σ_2².
  • The optimal strategy for finding the best arm is to pull the arms in the ratio σ_1 / (σ_1 + σ_2) : σ_2 / (σ_1 + σ_2).
The optimality is defined for a non-asymptotic bandit setting.
■ We want to extend the proposed method for (asymptotic) experimental design to (non-asymptotic) best-arm identification.
  • How can we bridge the asymptotic and non-asymptotic settings?

Slide 43

Slide 43 text

Off-policy Evaluation under Covariate Shift

Slide 44

Slide 44 text

OPE
■ Goal: estimate the policy value of some evaluation policy π^e from historical data obtained by a logging policy π^b.
  • X: state (e.g., age, height, etc. for people in the US)
  • A: action (therapy)
  • Y: reward (recovery rate from COVID-19)
■ In the standard setting, the target of estimation is E_{(X,A,Y) ∼ p(x) π^e(a|x) p(y|a,x)}[Y].

Slide 45

Slide 45 text

OPE Under a Covariate Shift
■ Historical data {(X_i, A_i, Y_i)}_{i=1}^{n_hist}:
  • X_i: state (e.g., age, BMI, etc. for people in the US), with X_i ∼ p(x).
■ Evaluation data {X_j}_{j=1}^{n_eval}:
  • X_j: state (e.g., age, BMI, etc. for people in Japan), with X_j ∼ q(x).
■ The target of estimation is E_{(X,A,Y) ∼ q(x) π^e(a|x) p(y|a,x)}[Y].
There is a distribution shift between the historical and evaluation data. Such a situation is called covariate shift (Shimodaira (2000)).

Slide 46

Slide 46 text

Naïve Estimator
■ We have to correct the difference in policy densities with π^e / π^b.
■ We have to correct the difference in covariate densities with q(x) / p(x).
  • This density ratio is unknown.
■ A naïve importance sampling estimator is
  (1/n_hist) Σ_{i=1}^{n_hist} Σ_{a∈𝒜} ( π^e(a | X_i) / π^b(a | X_i) ) ( q(X_i) / p(X_i) ) 1[A_i = a] Y_i.
This estimator is consistent. But can we do better?
The above estimator is mentioned by Hirano, Imbens, and Ridder (2003).
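
A sketch of the naïve weighted estimator, assuming the covariate density ratio q(x)/p(x) has been estimated in a separate step and is passed in per sample (names are illustrative).

```python
import numpy as np

def naive_covariate_shift_ope(rewards, pi_eval_chosen, pi_behavior_chosen, density_ratio):
    """Naive importance-sampling estimate of the policy value under covariate shift:
    the weight corrects both the policy mismatch pi_e / pi_b and the covariate
    mismatch q(x) / p(x)."""
    w = (np.asarray(pi_eval_chosen) / np.asarray(pi_behavior_chosen)) * np.asarray(density_ratio)
    return float(np.mean(w * np.asarray(rewards)))
```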

Slide 47

Slide 47 text

Efficient OPE under Covariate Shift
Kato*, Uehara*, and Yasui (*equal contribution). NeurIPS 2020 (spotlight). Off-Policy Evaluation and Learning for External Validity under a Covariate Shift.
■ Contributions
  Q. What is the lower bound for estimating the policy value?
  A. We derive the lower bound following semiparametric theory.
  Q. Can we achieve this lower bound?
  A. The doubly robust estimator can achieve it under mild nonparametric rate conditions.
  Q. Policy learning?
  A. Use a doubly robust score.

Slide 48

Slide 48 text

Naïve Covariate Shift
■ Recently, nonparametric density ratio estimation methods have been proposed. See Sugiyama, Suzuki, and Kanamori (2012).
■ However, when using the estimated density ratio as an importance weight for covariate shift, the approximated sample average does not have asymptotic normality (√n-consistency).
  Ex. For θ̂ = Ê_{X_i ∼ p(x)}[ Y_i (q̂/p̂)(X_i) ], where Y_i ∈ ℝ and Ê_{X_i ∼ p(x)} is a sample approximation using X_i ∼ p(x), it is difficult to show that √n ( θ̂ − E_{X ∼ q(x)}[Y] ) →d N(0, σ²) when (q̂/p̂)(x) does not satisfy Donsker's condition.

Slide 49

Slide 49 text

Proposed Estimator
■ We use cross-fitting and separate the samples into two datasets.
  • Separate {(X_i, A_i, Y_i)}_{i=1}^{n_hist} into D_1^hist and D_2^hist with sample sizes n_1^hist and n_2^hist.
  • Separate {X_j}_{j=1}^{n_eval} into D_1^eval and D_2^eval with sample sizes n_1^eval and n_2^eval.
■ Doubly robust estimator under a covariate shift:
  θ̂ = (1/n_1^hist) Σ_{i ∈ D_1^hist} Σ_{a∈𝒜} ŵ_{D_2^hist, D_2^eval}(a | X_i) 1[A_i = a] ( Y_i − f̂_{D_2^hist}(a, X_i) ) + (1/n_1^eval) Σ_{j ∈ D_1^eval} Σ_{a∈𝒜} π^e(a | X_j) f̂_{D_2^hist}(a, X_j),
  where ŵ_{D_2^hist, D_2^eval}(a | X_i) and f̂_{D_2^hist}(a, X_i) are estimators of ( q(X_i) / p(X_i) ) ( π^e(a | X_i) / π^b(a | X_i) ) and E[Y(a) | X_i], constructed from D_2^hist and D_2^eval.
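
A sketch of one cross-fitting fold of this estimator. The combined weight ŵ(a, x) and the outcome regression f̂(a, x) are assumed to have been fit on the other fold (D_2^hist, D_2^eval); their interfaces are hypothetical.

```python
import numpy as np

def dr_covariate_shift_fold(X_hist, A_hist, Y_hist, X_eval, actions, pi_eval, w_hat, f_hat):
    """One fold of the doubly robust estimator under covariate shift:
    a weighted-residual term on historical data D_1^hist plus a direct
    (outcome-model) term averaged over evaluation covariates D_1^eval."""
    resid = np.mean([sum(w_hat(a, x) * float(a == a_obs) * (y - f_hat(a, x)) for a in actions)
                     for x, a_obs, y in zip(X_hist, A_hist, Y_hist)])
    direct = np.mean([sum(pi_eval(a, x) * f_hat(a, x) for a in actions) for x in X_eval])
    return float(resid + direct)
```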

Slide 50

Slide 50 text

Theoretical Properties
■ Has asymptotic normality.
■ Achieves the semiparametric lower bound.
  • The efficiency bound is calculated based on the results of van der Vaart (1998) and Wooldridge (2001). (We skip the details here.)
■ Has double robustness:
  • If either ŵ(a | X_i) or f̂(a, x) is consistent, the proposed estimator is also consistent.
■ The doubly robust estimator is useful for policy learning.

Slide 51

Slide 51 text

Prediction under Delayed Feedback

Slide 52

Slide 52 text

Problem Setting
■ For a time series t = 1, 2, …, we have a dataset {(X_t, Ỹ_t)}_{t=1}^T.
■ X_t is a feature, and Ỹ_t ∈ {−1, +1} is a binary label.
■ The label Ỹ_t may not be the true label and can be flipped;
→ some time after the observation, the first observed label can be flipped into the true binary label Y_t.

Slide 53

Slide 53 text

Example: Advertisement Optimization
■ We want to predict whether a user buys an item (Y_t = +1) from the user's feature X_t when showing an advertisement.
■ At first, we cannot observe a response; therefore, we regard the label as negative, Ỹ_t = −1.
→ If the user buys the advertised item some time after the impression, the label turns positive, Ỹ_t = Y_t = +1.

Slide 54

Slide 54 text

Two Approaches
■ There are two approaches to this problem.
  1. Assume that old samples have true labels.
     • Parametric approach: survival analysis models (Chapelle (KDD 2014)). The performance is very poor.
     • Nonparametric (?) approach: use only old samples.
       → By additionally assuming stationarity of the time series: importance weighting (Yasui et al. (WWW 2020)); unbiased risk approach (Kato and Yasui (2020)).
       The convergence rate depends only on the sample size of the old samples. (See the sketch after this list.)
  2. Maximize the worst-case performance under an adversarial attack: Cesa-Bianchi, Gentile, and Mansour (2019).
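
A generic sketch of the importance-weighting idea in approach 1, assuming per-sample weights w_i have already been produced by a separate feedback-shift-correction / density-ratio step; this is only a schematic, not the exact weighting scheme of Yasui et al. (2020) or the unbiased risk of Kato and Yasui (2020).

```python
import numpy as np

def weighted_logistic_loss(theta, X, y, w):
    """Importance-weighted logistic loss for labels y in {-1, +1}: training on
    old (fully matured) samples reweighted by w approximates the risk under
    the label distribution we actually care about."""
    margins = y * (X @ theta)
    return float(np.mean(w * np.log1p(np.exp(-margins))))

# Any off-the-shelf optimizer can minimize it, e.g.:
# from scipy.optimize import minimize
# res = minimize(weighted_logistic_loss, x0=np.zeros(X.shape[1]), args=(X, y, w))
```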

Slide 55

Slide 55 text

Machine Learning Perspective
■ Prediction under delayed feedback is a special case of learning with noisy labels and of learning from positive and unlabeled (PU) data.
■ As a special case of learning with noisy labels, du Plessis et al. and subsequent work proposed methods for PU learning.
■ Therefore, the methods for delayed feedback are similar to those for learning with noisy labels and PU learning.
  Ex. Yasui et al. (2020) is similar to Liu and Tao (2015); Kato and Yasui (2020) is similar to du Plessis et al. (2015).

Slide 56

Slide 56 text

Reference

Slide 57

Slide 57 text

Reference of Adaptive Experimental Design
• Klaassen, C. A. J. Consistent estimation of the influence function of locally asymptotically linear estimators. Annals of Statistics, 1987.
• van der Laan, M. J. The construction and analysis of adaptive group sequential designs. 2008.
• van der Laan, M. J. and Lendle, S. D. Online targeted learning. 2014.
• Luedtke, A. R. and van der Laan, M. J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 2016.
• Zheng, W. and van der Laan, M. J. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning: Causal Inference for Observational and Experimental Data, Springer Series in Statistics. 2011.
• Johari, R., Pekelis, L., and Walsh, D. J. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint arXiv:1512.04922, 2015.
• Balsubramani, A. and Ramdas, A. Sequential nonparametric testing with the law of the iterated logarithm. In UAI, 2016.

Slide 58

Slide 58 text

Reference of Adaptive Experimental Design
• Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
• Hahn, J., Hirano, K., and Karlan, D. Adaptive experimental design using the propensity score. Journal of Business and Economic Statistics, 29(1):96–108, 2011.
• Zhao, S., Zhou, E., Sabharwal, A., and Ermon, S. Adaptive concentration inequalities for sequential decision problems. In NeurIPS, pp. 1343–1351. Curran Associates, Inc., 2016.
• Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68, 2018.
• Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. JMLR, 2016.

Slide 59

Slide 59 text

Reference of Delayed Feedback
• Chapelle, O. Modeling delayed feedback in display advertising. In KDD, 2014.
• Yasui, S., Morishita, G., Fujita, K., and Shibata, M. A feedback shift correction in predicting conversion rates under delayed feedback. In WWW, 2020.
• du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
• Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Delay and cooperation in nonstochastic bandits. JMLR, 2019.

Slide 60

Slide 60 text

Reference of OPE under Covariate Shift
• Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
• Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. 2012.
• Hirano, K., Imbens, G., and Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71:1161–1189, 2003.
• Wooldridge, J. M. Asymptotic properties of weighted M-estimators for standard stratified samples. Econometric Theory, 2001.
• van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.