
Keio Econometrics Workshop 10/27

MasaKat0
October 27, 2020


  1. Table of Contents
■ Adaptive Experimental Design and Off-Policy Evaluation from Dependent Samples
• Kato, Ishihara, Honda, and Narita. Adaptive Experimental Design for Efficient Treatment Effect Estimation. NeurIPS Causal Discovery & Causality-Inspired Machine Learning Workshop 2020; rejected from the NeurIPS main conference [scores 6/6/6]. https://arxiv.org/abs/2002.05308
• Kato (2020a). Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples. https://arxiv.org/pdf/2010.03792.pdf
  2. Table of Contents
• Kato (2020b). Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales. https://arxiv.org/pdf/2006.06982.pdf
• Kato and Kaneko (2020). Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy. https://arxiv.org/abs/2010.13554
  3. Table of Contents
■ Off-Policy Evaluation under Covariate Shift
• Kato*, Uehara*, and Yasui (*equal contribution). Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. NeurIPS 2020 (spotlight). https://arxiv.org/abs/2002.11642
  4. Table of Contents
■ Prediction under Delayed Feedback
• Yasui, Morishita, Fujita, and Shibata. A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback. WWW 2020. https://dl.acm.org/doi/10.1145/3366423.3380032
• Kato and Yasui (2020). Learning Classifiers under Delayed Feedback with a Time Window Assumption. https://arxiv.org/abs/2009.13092
  5. Motivation
■ Goal: conduct an RCT or A/B test with as small a sample size as possible.
■ Parameter of interest: the Average Treatment Effect (ATE).
■ How can we gain efficiency? → Adjust the assignment probability of treatment, a.k.a. the propensity score, to minimize the asymptotic variance of an estimator of the ATE. Cf. van der Laan (2008) and Hahn, Hirano, and Karlan (2011).
✓ Such a problem setting is called adaptive experimental design.
• Adaptive experimental design has a close relationship with off-policy evaluation from dependent samples and the multi-armed bandit problem (best arm identification).
  6. Problem Setting
• For a time series $t = 1, 2, \ldots$:
• At period $t$, we observe an individual with a covariate $X_t \in \mathcal{X}$.
• For the individual, we assign a treatment $A_t \in \{0, 1\}$ with probability $\pi_t(A_t \mid X_t)$.
• Then, we immediately observe the outcome $Y_t$. Here, we assume a potential outcome framework; that is, $Y_t = \mathbb{1}[A_t = 1]\, Y_t(1) + \mathbb{1}[A_t = 0]\, Y_t(0)$ for potential outcomes $Y_t(a)$.
• The data-generating process is summarized as $(X_t, A_t, Y_t) \sim p(x)\, \pi_t(a \mid x)\, p(y \mid a, x)$. The ATE is defined as $\theta_0 = \mathbb{E}[Y_t(1)] - \mathbb{E}[Y_t(0)]$.
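To make the setting concrete, here is a minimal simulation sketch of one round of this data-generating process. The uniform covariate distribution, linear outcome means, and noise scales are illustrative assumptions of mine, not taken from the slides:

```python
import numpy as np

def simulate_round(rng, pi):
    """Draw one (X_t, A_t, Y_t) triple; pi maps x to P(A_t = 1 | X_t = x)."""
    x = rng.uniform(-1.0, 1.0)            # covariate X_t ~ p(x)
    a = int(rng.random() < pi(x))         # treatment A_t ~ pi_t(. | x)
    y1 = 1.0 + x + rng.normal(0.0, 2.0)   # potential outcome Y_t(1), Var = 4
    y0 = x + rng.normal(0.0, 1.0)         # potential outcome Y_t(0), Var = 1
    y = a * y1 + (1 - a) * y0             # observed outcome; true ATE = 1
    return x, a, y

rng = np.random.default_rng(0)
sample = [simulate_round(rng, lambda x: 0.5) for _ in range(5)]
```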
  7. Asymptotic Variance of an ATE Estimator
■ Let us assume $\pi_1 = \pi_2 = \cdots = \pi(a \mid x)$.
■ Then, the semiparametric efficiency bound of an ATE estimator is
$\mathbb{E}\left[\frac{\mathrm{Var}(Y_t(1) \mid X_t)}{\pi(1 \mid X_t)} + \frac{\mathrm{Var}(Y_t(0) \mid X_t)}{1 - \pi(1 \mid X_t)} + \big(\mathbb{E}[Y_t(1) \mid X_t] - \mathbb{E}[Y_t(0) \mid X_t] - \theta_0\big)^2\right]$.
  8. Adaptive Experimental Design
■ There are ATE estimators achieving the efficiency bound. Therefore, we minimize the efficiency bound by adjusting $\pi(1 \mid X_t)$.
= Let us regard the efficiency bound as a function of $\pi(1 \mid X_t)$. The minimizer is given as
$\pi^*(1 \mid X_t) = \frac{\sigma(1 \mid X_t)}{\sigma(1 \mid X_t) + \sigma(0 \mid X_t)}$,
where $\sigma(a \mid x) = \sqrt{\mathrm{Var}(Y_t(a) \mid X_t = x)}$.
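As a quick numerical illustration, a small sketch of this Neyman-allocation rule; the function name and the example variances are my own:

```python
import numpy as np

def efficient_propensity(sigma1, sigma0):
    """pi*(1|x) = sigma(1|x) / (sigma(1|x) + sigma(0|x)):
    assign the noisier arm proportionally more often."""
    return sigma1 / (sigma1 + sigma0)

# If Var(Y(1)|x) = 4 and Var(Y(0)|x) = 1, treatment 1 is assigned
# with probability 2/3 at every x.
print(efficient_propensity(np.sqrt(4.0), np.sqrt(1.0)))  # 0.666...
```

When the two conditional standard deviations are equal, this recovers the balanced design $\pi^*(1 \mid x) = 1/2$.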
  9. Adaptive Experimental Design
■ Given ATE estimators that achieve the efficiency bound, assigning a treatment with the efficient probability $\pi^*(1 \mid X_t)$ minimizes the asymptotic variance of an ATE estimator.
■ When $\sigma(a \mid X_t)$ is given, this strategy is feasible. → However, in general, the conditional variance is unknown.
• During an experiment, we estimate the conditional variance and assign a treatment with the estimated efficient probability.
= We adaptively design the experiment.
  10. Existing Studies
■ van der Laan (2008) and Hahn, Hirano, and Karlan (2011).
• Both papers were first released in 2008.
■ van der Laan (2008):
• Sequentially estimate the conditional variance and construct an ATE estimator using the martingale property.
■ Hahn, Hirano, and Karlan (2011):
• Separate an experiment into two stages: 1. at the first stage, estimate the conditional variance; 2. at the second stage, assign a treatment based on the estimate from the first stage.
  11. Dependency Problem
■ van der Laan (2008) and Kato, Ishihara, Honda, and Narita (2020) sequentially estimate the conditional variance.
1. At each period $t$, using the samples $\Omega_{t-1} = \{(X_s, A_s, Y_s)\}_{s=1}^{t-1}$, we construct an estimator $\hat{\sigma}_{t-1}(a \mid X_t)$ of the conditional standard deviation $\sigma(a \mid X_t)$.
2. Then, we assign a treatment with $\pi_t(1 \mid X_t, \Omega_{t-1}) = \frac{\hat{\sigma}_{t-1}(1 \mid X_t)}{\hat{\sigma}_{t-1}(1 \mid X_t) + \hat{\sigma}_{t-1}(0 \mid X_t)}$; this determines the assignment probability at period $t$.
■ Dependency problem:
• Here, the samples $\{(X_s, A_s, Y_s)\}_{s=1}^{t-1}$ are not i.i.d.
• How can we estimate the ATE from the dependent samples?
  12. Construction of Estimators under Dependency
■ van der Laan (2008), Hadad et al. (2019), and Kato et al. (2020) construct an estimator using the martingale property.
■ Define
$h_t = \frac{\mathbb{1}[A_t = 1]\big(Y_t - \hat{f}_{t-1}(1, X_t)\big)}{\pi_t(1 \mid X_t, \Omega_{t-1})} - \frac{\mathbb{1}[A_t = 0]\big(Y_t - \hat{f}_{t-1}(0, X_t)\big)}{\pi_t(0 \mid X_t, \Omega_{t-1})} + \hat{f}_{t-1}(1, X_t) - \hat{f}_{t-1}(0, X_t)$,
where $\hat{f}_{t-1}(a, X_t)$ is an estimator of $\mathbb{E}[Y_t(a) \mid X_t]$ using only $\Omega_{t-1}$.
■ Then, we define an ATE estimator as $\hat{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} h_t$.
■ Owing to the martingale property, we can derive asymptotic normality even if the samples have dependency.
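Putting the assignment rule and this score together, here is a minimal end-to-end sketch of the adaptive experiment. The DGP is the illustrative one from above; the nuisance estimators ignore the covariate (arm-wise means and standard deviations) purely for brevity, and the clipping of the assignment probability is my own simplification:

```python
import numpy as np

rng = np.random.default_rng(1)
T, scores, hist = 5000, [], {0: [], 1: []}   # hist plays the role of Omega_{t-1}

for t in range(T):
    x = rng.uniform(-1.0, 1.0)
    # nuisance estimates built from Omega_{t-1} only
    f1 = np.mean(hist[1]) if hist[1] else 0.0
    f0 = np.mean(hist[0]) if hist[0] else 0.0
    s1 = np.std(hist[1]) if len(hist[1]) > 1 else 1.0
    s0 = np.std(hist[0]) if len(hist[0]) > 1 else 1.0
    p1 = np.clip(s1 / (s1 + s0), 0.1, 0.9)   # estimated efficient probability
    a = int(rng.random() < p1)
    y = (1.0 + x + rng.normal(0, 2.0)) if a == 1 else (x + rng.normal(0, 1.0))
    hist[a].append(y)
    # A2IPW score h_t from the slide above
    h = a * (y - f1) / p1 - (1 - a) * (y - f0) / (1 - p1) + f1 - f0
    scores.append(h)

print(np.mean(scores))   # ATE estimate; the true ATE in this DGP is 1
```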
  13. Asymptotic Normality
■ Theorem 1 of Kato et al. (2020): Suppose that (i) pointwise convergence in probability of $\hat{f}_{t-1}$ and $\hat{\sigma}_{t-1}$; that is, for all $a \in \{0, 1\}$ and $x \in \mathcal{X}$, $\hat{f}_{t-1}(a, x) \xrightarrow{p} \mathbb{E}[Y_t(a) \mid X_t = x]$ and $\hat{\sigma}_{t-1}(a \mid x) \xrightarrow{p} \sigma(a \mid x)$, where $\sigma$ is a time-invariant function; and (ii) boundedness of the random variables. Then
$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$,
where $\sigma^2 = \mathbb{E}\left[\frac{\mathrm{Var}(Y_t(1) \mid X_t)}{\pi(1 \mid X_t)} + \frac{\mathrm{Var}(Y_t(0) \mid X_t)}{1 - \pi(1 \mid X_t)} + \big(\mathbb{E}[Y_t(1) \mid X_t] - \mathbb{E}[Y_t(0) \mid X_t] - \theta_0\big)^2\right]$.
• As we explain later, the step-wise estimator $\hat{f}_{t-1}$ also solves the non-Donsker problem.
  14. Efficient Asymptotic Variance
■ Thus, by setting $\pi_t = \pi^*$, we can minimize the asymptotic variance
$\sigma^2 = \mathbb{E}\left[\frac{\mathrm{Var}(Y_t(1) \mid X_t)}{\pi^*(1 \mid X_t)} + \frac{\mathrm{Var}(Y_t(0) \mid X_t)}{1 - \pi^*(1 \mid X_t)} + \big(\mathbb{E}[Y_t(1) \mid X_t] - \mathbb{E}[Y_t(0) \mid X_t] - \theta_0\big)^2\right]$.
• Hadad et al. (2019) and Kato et al. (2020) derived these results independently, but van der Laan (2008) had already derived a similar result.
• Hadad et al. (2019) proposed adaptive reweighting to stabilize the estimator.
• Kato et al. (2020) derived a non-asymptotic deviation bound using a concentration inequality based on the law of the iterated logarithm.
• Kato (2020a) finds that a doubly robust type estimator works as Hadad's adaptive weight.
  15. van der Laan (2008) and Kato et al. (2020)
■ Kato, Ishihara, Honda, and Narita (2020). Adaptive Experimental Design for Efficient Treatment Effect Estimation.
• We also proposed a method for adaptive experimental design.
• As in van der Laan's approach, we sequentially estimate the conditional variance and assign a treatment.
  16. Our Contributions
Kato, Ishihara, Honda, and Narita (2020) make the following contributions.
• Summarize ATE estimation from dependent samples.
• Derive the asymptotic distribution (asymptotic result); this partially overlaps with the result of van der Laan (2008).
• Derive a regret bound (non-asymptotic result).
• Show a non-asymptotic deviation bound.
• Based on the concentration inequality, develop a method for efficient sequential testing.
  17. Sequential Testing
■ Sequential testing is a type of hypothesis testing.
• In sequential testing, we conduct a hypothesis test at every period.
• If we reject the null hypothesis, we stop the experiment. We can expect sequential testing to reduce the sample size when the null hypothesis can be rejected.
■ Conventional methods:
• The Bonferroni (BF) method, known to be too conservative. There are other methods for sequential testing, such as the BH method.
■ These methods are based on multiple testing. → Recently, methods based on concentration inequalities have been proposed.
  18. Sequential Testing and Concentration Inequality
■ Balsubramani and Ramdas (2016) proposed using a law of the iterated logarithm (LIL)-based concentration inequality for sequential testing.
■ LIL-based concentration inequalities have recently gathered interest in the bandit community (Jamieson et al. (2014)).
■ We modify the result of Balsubramani and Ramdas (2016) for efficient ATE estimation.
• Compared with other types of sequential testing such as the Bonferroni method, concentration-inequality-based sequential testing controls the error of the test more appropriately.
  19. Non-asymptotic Bound of an ATE Estimator
■ We obtained the following bound for our proposed ATE estimator $\hat{\theta}_t$, which holds for all $t$ with probability $\geq 1 - \delta$:
$\big|\hat{\theta}_t - \theta_0\big| \leq \frac{1}{t}\left(C_0 + \sqrt{2 V^*_t \left(\log\log V^*_t + \log\frac{4}{\delta}\right)}\right)$,
where $C_0$ is a constant with $C_0 \approx \log\frac{1}{\delta}$, $V^*_t \approx \sum_{s=1}^{t} v_s^2$, and $v_s = h_s - \theta_0$.
Based on this confidence bound, we can conduct sequential hypothesis testing because the bound holds for all $t$.
  20. Procedure of Sequential Testing
■ $\mathcal{H}_0: \theta_0 = 0$ and $\mathcal{H}_1: \theta_0 \neq 0$.
■ Control the Type I error at $\delta$.
1. Set $q_t = 1.1\sqrt{\log\frac{1}{\delta} + 2\sum_{s=1}^{t} v_s^2\, \log\left(\frac{\log \sum_{s=1}^{t} v_s^2}{\delta}\right)}$.
2. If $\big|t\,\hat{\theta}_t\big| > q_t$, reject the null hypothesis.
• The coefficient 1.1 comes from the report of Balsubramani and Ramdas (2016).
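A minimal sketch of this stopping rule, using the threshold in the reconstructed form above (the exact constants should be checked against Balsubramani and Ramdas (2016); the score sequences here are synthetic):

```python
import numpy as np

def lil_threshold(v2_sum, delta):
    """q_t = 1.1 * sqrt(log(1/delta) + 2 * V_t * log(log(V_t)/delta))."""
    v2_sum = max(v2_sum, np.e)  # keep log(log(.)) well defined early on
    return 1.1 * np.sqrt(np.log(1.0 / delta)
                         + 2.0 * v2_sum * np.log(np.log(v2_sum) / delta))

def sequential_test(scores, delta=0.05):
    """Test H0: theta_0 = 0; return the stopping time, or None if never rejected."""
    s_t, v2_sum = 0.0, 0.0
    for t, h in enumerate(scores, start=1):
        s_t += h                       # t * theta_hat_t = running sum of scores
        v2_sum += h ** 2               # proxy for sum_s v_s^2 under H0
        if abs(s_t) > lil_threshold(v2_sum, delta):
            return t
    return None

rng = np.random.default_rng(2)
print(sequential_test(rng.normal(0.3, 1.0, 2000)))  # true effect: stops early
print(sequential_test(rng.normal(0.0, 1.0, 2000)))  # null: usually None
```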
  21. Adaptive Confidence Interval and Always-Valid p-value
■ The confidence interval $q_t$ is also known as an adaptive confidence interval.
■ Recently, two directions have been proposed for sequential testing:
• adaptive confidence intervals (Zhao et al. (2016));
• always-valid p-values (Johari, Pekelis, and Walsh (2015)).
Both generate the same rejection region in hypothesis testing.
  22. Sample Size and Optimal Stopping Time
■ Standard hypothesis testing:
• We need to predetermine the sample size; that is, the sample size is not a random variable.
■ Sequential hypothesis testing:
• We do not have to predetermine the sample size.
• The sample size is also the stopping time, which is a random variable. The sample-size decision problem of standard hypothesis testing becomes an optimal stopping problem in sequential hypothesis testing.
➢ Advantage: when the null hypothesis can be rejected, we can reject it as soon as possible.
➢ Disadvantage: it is difficult to control the sample size when the null hypothesis cannot be rejected.
  23. Experimental Results
• The left panel is the case where the null hypothesis can be rejected; the right panel is the case where it cannot.
• In "T = 150" and "T = 300", we conduct a standard test at that period.
• In "LIL" and "BF", we conduct sequential testing by the LIL-based and BF-based methods.
• "Testing" shows the percentage of trials in which the null hypothesis is rejected.
• For "LIL" and "BF", we show the stopping time over 1000 trials.
• The total sample size is 500; when the experiment does not stop, we report 500.
  24. Donsker's Condition
In general, in semiparametric inference, the nuisance estimators are required to satisfy some conditions, such as Donsker's condition.
■ Recently, Chernozhukov et al. (2018) proposed cross-fitting for semiparametric inference with non-Donsker-type nuisance estimators.
■ Similar ideas have been proposed by Klaassen (1987) and Zheng and van der Laan (2011).
■ van der Laan and Lendle (2014) further established a method for semiparametric inference with non-Donsker-type nuisance estimators in the non-i.i.d. case.
• The idea is, again, to use martingale properties.
  25. Adaptive-Fitting
■ Semiparametric inference with non-Donsker-type nuisance estimators and non-i.i.d. samples is investigated by van der Laan and Lendle (2014).
• We call the martingale-based method adaptive-fitting.
• In adaptive-fitting, we use the property that the step-wise constructed nuisance estimators can be regarded as constants in an expectation conditioned on past information.
■ The ATE estimator can be generalized into off-policy evaluation (OPE) estimators.
• In OPE, we want to estimate the policy value from samples generated by bandit algorithms.
  26. Example: Doubly Robust Estimator
■ Consider a score function
$h(X_t, A_t, Y_t; f, \pi) = \frac{\mathbb{1}[A_t = 1]\big(Y_t - f(1, X_t)\big)}{\pi(1 \mid X_t)} - \frac{\mathbb{1}[A_t = 0]\big(Y_t - f(0, X_t)\big)}{\pi(0 \mid X_t)} + f(1, X_t) - f(0, X_t)$.
■ Cross-fitting:
• Suppose that the samples are i.i.d.
• Separate the dataset into two subsets, $\mathcal{D}_1$ and $\mathcal{D}_2$.
• Then, for nuisance estimators $\hat{f}_1(a, x)$ and $\hat{\pi}_1(a \mid x)$ constructed from $\mathcal{D}_1$, we define an estimator $\tilde{\theta}_2 = \frac{1}{|\mathcal{D}_2|}\sum_{t \in \mathcal{D}_2} h(X_t, A_t, Y_t; \hat{f}_1, \hat{\pi}_1)$.
• In $\mathbb{E}\big[h(X_t, A_t, Y_t; \hat{f}_1, \hat{\pi}_1) \mid \mathcal{D}_1\big]$, we can treat $\hat{f}_1$ and $\hat{\pi}_1$ as if they were constants.
  27. Example: Doubly Robust Estimator
On the other hand,
■ Adaptive-fitting:
• Suppose that the samples are not i.i.d.
• Then, for nuisance estimators $\hat{f}_{s-1}(a, x)$ and $\hat{\pi}_{s-1}(a \mid x)$ constructed from $\Omega_{s-1}$, we define an estimator $\ddot{\theta}_t = \frac{1}{t}\sum_{s=1}^{t} h(X_s, A_s, Y_s; \hat{f}_{s-1}, \hat{\pi}_{s-1})$.
• In $\mathbb{E}\big[h(X_s, A_s, Y_s; \hat{f}_{s-1}, \hat{\pi}_{s-1}) \mid \Omega_{s-1}\big]$, we can treat $\hat{f}_{s-1}$ and $\hat{\pi}_{s-1}$ as if they were constants.
A more detailed discussion is given in Kato (2020a): https://arxiv.org/pdf/2010.03792.pdf.
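The contrast between the two fitting schemes can be written compactly. Below is a hedged sketch: `fit_nuisances` is a hypothetical helper (here a crude constant-in-x fit), and the refitting at every step in `adaptive_fitting` is written for clarity, not efficiency:

```python
import numpy as np

def dr_score(x, a, y, f, p1):
    """Doubly robust score h(X, A, Y; f, pi) from the slide above."""
    return (a * (y - f(1, x)) / p1(x)
            - (1 - a) * (y - f(0, x)) / (1.0 - p1(x))
            + f(1, x) - f(0, x))

def fit_nuisances(subset):
    """Hypothetical helper: constant-in-x outcome means and propensity."""
    y1 = [y for _, a, y in subset if a == 1] or [0.0]
    y0 = [y for _, a, y in subset if a == 0] or [0.0]
    p1 = min(0.9, max(0.1, np.mean([a for _, a, _ in subset])))
    return (lambda arm, x: np.mean(y1) if arm == 1 else np.mean(y0),
            lambda x: p1)

def cross_fitting(data):
    """i.i.d. case: nuisances from D_1, score averaged over D_2."""
    d1, d2 = data[:len(data) // 2], data[len(data) // 2:]
    f_hat, p_hat = fit_nuisances(d1)
    return np.mean([dr_score(x, a, y, f_hat, p_hat) for x, a, y in d2])

def adaptive_fitting(data):
    """Dependent case: at each s, nuisances use only the past Omega_{s-1}."""
    scores = []
    for s in range(1, len(data)):
        f_hat, p_hat = fit_nuisances(data[:s])
        x, a, y = data[s]
        scores.append(dr_score(x, a, y, f_hat, p_hat))
    return np.mean(scores)
```

In both functions the conditioning set ($\mathcal{D}_1$ or $\Omega_{s-1}$) fully determines the nuisance estimators, which is exactly why they can be treated as constants inside the conditional expectation.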
  28. Illustration of Cross-fitting and Adaptive-fitting
This slide illustrates the difference between cross-fitting and adaptive-fitting.
• The left graph shows cross-fitting; the right graph shows adaptive-fitting.
• The y-axis indicates the samples used to construct the estimator; the x-axis indicates the samples used to construct the nuisance estimators.
  29. Various OPE Estimators
■ The ATE estimator can be generalized into OPE estimators.
■ We summarize various OPE estimators from dependent samples.
• In OPE, we want to estimate $\mathbb{E}\big[\sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \mathbb{E}[Y_t(a) \mid X_t]\big]$, where $\pi^{e}$ is known as the evaluation policy and $\mathcal{A}$ is an action space.
• We have a dataset $\{(X_t, A_t, Y_t)\}_{t=1}^{T}$ with $A_t \sim \pi_t(a \mid X_t, \Omega_{t-1})$, where $\pi_t(a \mid X_t, \Omega_{t-1})$ is known as the behavior policy.
• The samples are not i.i.d.
• For the following estimators, we assume that the behavior policy $\pi_t(a \mid x, \Omega_{t-1})$ converges to a time-invariant policy $\pi(a \mid x)$ in probability.
  30. Inverse Probability Weighting (IPW) Estimator
■ Define $\hat{h}^{\mathrm{IPW}}_t(a) = \frac{\mathbb{1}[A_t = a]\, Y_t}{\pi_t(a \mid X_t, \Omega_{t-1})}$.
• An IPW estimator is defined as $\hat{\theta}^{\mathrm{IPW}}_T = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \hat{h}^{\mathrm{IPW}}_t(a)$.
• Note that we use the true behavior policy $\pi_t(a \mid X_t, \Omega_{t-1})$.
■ Unlike in the i.i.d. case (cf. Hirano, Imbens, and Ridder (2003)), it is difficult to replace the true behavior policy with its estimator.
■ In addition, the asymptotic variance is larger than the efficiency bound. We call this estimator the Adaptive IPW (AdaIPW) estimator.
  31. Augmented IPW (AIPW) Estimator
■ Define $\hat{h}^{\mathrm{AIPW}}_t(a) = \frac{\mathbb{1}[A_t = a]\big(Y_t - \hat{f}_{t-1}(a, X_t)\big)}{\pi_t(a \mid X_t, \Omega_{t-1})} + \hat{f}_{t-1}(a, X_t)$.
• An AIPW estimator is $\hat{\theta}^{\mathrm{AIPW}}_T = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \hat{h}^{\mathrm{AIPW}}_t(a)$.
■ Unlike a doubly robust estimator, we use the true behavior policy in the IPW component.
■ Note that the estimator is unbiased. We call this estimator the Adaptive AIPW (A2IPW) estimator.
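A minimal sketch of this A2IPW policy-value estimator over a logged dataset. The data layout and function signatures are my own assumptions: each log entry carries the true behavior-policy probabilities at that period, and `f_hat(t, a, x)` stands for an outcome-regression estimate built only from $\Omega_{t-1}$:

```python
import numpy as np

def a2ipw_ope(logs, pi_e, f_hat):
    """A2IPW estimate of the value of an evaluation policy pi_e.

    logs: iterable of (x, a_t, y, pi_b) where pi_b maps each action a to the
          TRUE behavior-policy probability pi_t(a | x, Omega_{t-1}) at time t.
    pi_e(a, x): evaluation-policy probability (known).
    f_hat(t, a, x): estimate of E[Y_t(a) | X_t = x] using only Omega_{t-1}.
    """
    values = []
    for t, (x, a_t, y, pi_b) in enumerate(logs, start=1):
        v = sum(pi_e(a, x) * ((a == a_t) * (y - f_hat(t, a, x)) / pi_b[a]
                              + f_hat(t, a, x))
                for a in pi_b)   # sum over the action space
        values.append(v)
    return np.mean(values)
```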
  32. Adaptive Reweighting AIPW Estimator
■ Both the AdaIPW and A2IPW estimators use the true behavior policy and are unbiased.
■ Hadad et al. (2019) pointed out that using the true behavior policy makes the estimator unstable when $\pi_t$ takes an extreme value at some period.
• Hadad et al. (2019) proposed using adaptive weights for stabilization.
■ An adaptively weighted A2IPW estimator is defined as
$\hat{\theta}^{\mathrm{AWA2IPW}}_T = \frac{\sum_{t=1}^{T} w(X_t, \Omega_{t-1}) \sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \hat{h}^{\mathrm{AIPW}}_t(a)}{\sum_{t=1}^{T} w(X_t, \Omega_{t-1})}$,
where $w(X_t, \Omega_{t-1})$ is called an adaptive weight.
  33. Doubly Robust (DR) Estimator
■ Define $\hat{h}^{\mathrm{DR}}_t(a) = \frac{\mathbb{1}[A_t = a]\big(Y_t - \hat{f}_{t-1}(a, X_t)\big)}{\hat{\pi}_{t-1}(a \mid X_t)} + \hat{f}_{t-1}(a, X_t)$, where $\hat{\pi}_{t-1}$ is an estimator of $\pi_t$ constructed from $\Omega_{t-1}$.
• A DR estimator is $\hat{\theta}^{\mathrm{DR}}_T = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \hat{h}^{\mathrm{DR}}_t(a)$. We call this estimator the Adaptive DR (ADR) estimator.
  34. Asymptotic Normality of an ADR Estimator
■ Theorem 1 of Kato (2020a): Suppose that (i) pointwise convergence in probability of $\hat{\pi}_{t-1}$; that is, for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$, $\hat{\pi}_{t-1}(a \mid x, \Omega_{t-1}) \xrightarrow{p} \pi(a \mid x)$, where $\pi$ is a time-invariant function; (ii) $\|\hat{f}_{t-1} - f\|_2\, \|\hat{\pi}_{t-1} - \pi_t\|_2 = o_p(t^{-1/2})$; and (iii) boundedness of the random variables. Then
$\sqrt{T}\big(\hat{\theta}^{\mathrm{DR}}_T - \theta_0\big) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$,
where $\sigma^2 = \mathbb{E}\left[\frac{\mathrm{Var}(Y_t(1) \mid X_t)}{\pi(1 \mid X_t)} + \frac{\mathrm{Var}(Y_t(0) \mid X_t)}{1 - \pi(1 \mid X_t)} + \big(\mathbb{E}[Y_t(1) \mid X_t] - \mathbb{E}[Y_t(0) \mid X_t] - \theta_0\big)^2\right]$.
  35. Theoretical Comparison
• Denote naively constructed AIPW and DR estimators, i.e., estimators without cross-fitting or adaptive-fitting, as AIPW and DR.
• Denote AIPW and DR estimators with adaptive-fitting as A2IPW and ADR.
• Denote AIPW and DR estimators with cross-fitting as AIPWCF and DRCF.
• Denote IPW estimators with the true and the estimated behavior policy as IPW and EIPW, respectively.
• Denote the direct method estimator $\frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{e}(a \mid X_t)\, \hat{f}_T(a, X_t)$ as DM.
  36. Theoretical Comparison
■ In the dependent setting, IPW, A2IPW, and ADR have asymptotic normality.
■ IPW does not achieve the efficiency bound. ⇄ A2IPW and ADR achieve the efficiency bound.
■ A2IPW does not require a specific convergence rate for the nuisance estimators (van der Laan (2008), Hadad et al. (2019), Kato et al. (2020)).
■ ADR requires a specific convergence rate for the nuisance estimators; see van der Laan and Lendle (2014) and Kato (2020a).
• The cross-fitting of Chernozhukov et al. (2018) also assumes $\|\hat{f}_{t-1} - f\|_2\, \|\hat{\pi}_{t-1} - \pi_t\|_2 = o_p(t^{-1/2})$.
– If $\hat{\pi}_{t-1} = \pi_t$, this condition holds whenever $\|\hat{f}_{t-1} - f\|_2 = o_p(1)$.
■ Unlike in the i.i.d. case, EIPW does not have asymptotic normality.
  37. Empirical Comparison
■ Hadad et al. (2019) proposed adaptive reweighting for the A2IPW because the true behavior policy causes instability.
■ However, the ADR may absorb such instability because it replaces the true behavior policy with its estimator.
Ex. Suppose that $\pi_1(1 \mid x) = 0.5$, $\pi_2(1 \mid x) = 0.1$, and $\pi_3(1 \mid x) = 0.95$ for $x \in \mathcal{X}$. Although the behavior policy is unstable, the estimators $\hat{\pi}_0, \hat{\pi}_1, \hat{\pi}_2$ may not be as unstable if we estimate them from $\Omega_1$ and $\Omega_2$.
■ This result is similar to findings on IPW estimators with an estimated propensity score.
  38. Empirical Comparison
• The left graph shows the estimation error of OPE from the LinUCB algorithm.
• The right graph shows the estimation error of OPE from the LinTS algorithm.
  39. Other Strategies for Asymptotic Normality
■ Why do we assume that $\pi_t(a \mid x, \Omega_{t-1})$ converges to a time-invariant policy $\pi(a \mid x)$ in probability? → To use the central limit theorem for martingales, which requires convergence of the variance (see Hall and Heyde, 1980).
■ Are there strategies that do not assume convergence of the policy?
• Standardization of martingales (Luedtke and van der Laan (2016), Kato (2020b)).
• Batch-based approach (Kato and Kaneko (2020)).
A more detailed discussion is given in Luedtke and van der Laan (2016).
  40. On-going: Semiparametric Best Arm Identification
■ We are now tackling an extension of Kato et al. (2020) to best arm identification.
■ Kaufmann, Cappé, and Garivier (2016) showed the following:
• Given two Gaussian arms with different means and variances,
• denote the variances by $\sigma_1^2$ and $\sigma_2^2$.
• The optimal strategy for finding the best arm pulls the arms in the ratio $\frac{\sigma_1}{\sigma_1 + \sigma_2} : \frac{\sigma_2}{\sigma_1 + \sigma_2}$. The optimality is defined for a non-asymptotic bandit setting.
■ We want to extend the proposed method for (asymptotic) experimental design to (non-asymptotic) best arm identification.
• How can we bridge the asymptotic setting and the non-asymptotic setting?
  41. OPE
■ Goal: estimate the policy value of an evaluation policy $\pi^{e}$ from historical data obtained by a logging policy $\pi^{b}$.
$x$: state (age, height, etc., for people in the US); $a$: action (therapy); $y$: reward (recovery rate from COVID-19).
■ In the standard setting, the target of estimation is $\mathbb{E}_{x \sim p(x),\, a \sim \pi^{e}(a \mid x),\, y \sim p(y \mid a, x)}[y]$.
  42. OPE under a Covariate Shift
■ Historical data $\{(x_i, a_i, y_i)\}_{i=1}^{n_{\mathrm{hst}}}$ with $x_i \sim p(x)$:
• $x_i$: state, e.g., age, BMI, etc., for people in the US.
■ Evaluation data $\{x_i\}_{i=1}^{n_{\mathrm{evl}}}$ with $x_i \sim q(x)$:
• $x_i$: state, e.g., age, BMI, etc., for people in Japan.
■ The target of estimation is $\mathbb{E}_{x \sim q(x),\, a \sim \pi^{e}(a \mid x),\, y \sim p(y \mid a, x)}[y]$.
There is a distribution shift between the historical and evaluation data. Such a situation is called covariate shift (Shimodaira (2000)).
  43. Naive Estimator
■ We have to correct the difference in policy densities with $\frac{\pi^{e}(a \mid x)}{\pi^{b}(a \mid x)}$.
■ We have to correct the difference in covariate densities with $\frac{q(x)}{p(x)}$.
• This density ratio is unknown.
■ A naive importance sampling estimator is
$\frac{1}{n_{\mathrm{hst}}} \sum_{i=1}^{n_{\mathrm{hst}}} \sum_{a \in \mathcal{A}} \frac{\pi^{e}(a \mid x_i)}{\pi^{b}(a \mid x_i)}\, \frac{q(x_i)}{p(x_i)}\, y_i\, \mathbb{1}[a_i = a]$.
This estimator is consistent. But can we do better? The above estimator is mentioned in Hirano, Imbens, and Ridder (2003).
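For concreteness, a direct transcription of this naive estimator. The function signatures are my own; in practice `density_ratio` is unknown and must itself be estimated, e.g., by the density ratio estimation methods in Sugiyama, Suzuki, and Kanamori (2012):

```python
import numpy as np

def naive_is_estimator(hist, pi_e, pi_b, density_ratio):
    """Naive importance sampling estimate of E_{x~q, a~pi_e}[y].

    hist: list of (x, a, y) logged under pi_b with x ~ p(x).
    pi_e(a, x), pi_b(a, x): policy densities (pi_b assumed known here).
    density_ratio(x): r(x) = q(x) / p(x), correcting the covariate shift.
    """
    terms = [pi_e(a, x) / pi_b(a, x) * density_ratio(x) * y
             for x, a, y in hist]
    return np.mean(terms)
```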
  44. Efficient OPE under Covariate Shift
Kato*, Uehara*, and Yasui (*equal contribution). Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. NeurIPS 2020 (spotlight).
■ Contributions
Q. What is the lower bound for estimating the policy value? A. We derive the lower bound following semiparametric theory.
Q. Can we achieve this lower bound? A. The doubly robust estimator can achieve it under mild nonparametric rate conditions.
Q. Policy learning? A. Use a doubly robust score.
  45. Naive Covariate Shift
■ Recently, nonparametric density ratio estimation methods have been proposed; see Sugiyama, Suzuki, and Kanamori (2012).
■ However, when using the density ratio as an importance weight for covariate shift, the approximated sample average does not have asymptotic normality ($\sqrt{n}$-consistency).
Ex. For $\hat{\theta} = \hat{\mathbb{E}}_{x_i \sim p(x)}\left[f(x_i)\, \frac{\hat{q}(x_i)}{\hat{p}(x_i)}\right]$, where $f(x_i) \in \mathbb{R}$ and $\hat{\mathbb{E}}_{x_i \sim p(x)}$ is a sample approximation using $x_i \sim p(x)$, it is difficult to show that $\sqrt{n}\big(\hat{\theta} - \mathbb{E}_{x \sim q(x)}[f(x)]\big) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ when $\frac{\hat{q}(x)}{\hat{p}(x)}$ does not satisfy Donsker's condition.
  46. Proposed Estimator
■ We use cross-fitting and separate the samples into two datasets.
• Separate $\{(x_i, a_i, y_i)\}_{i=1}^{n_{\mathrm{hst}}}$ into $\mathcal{D}^{\mathrm{hst}}_1$ and $\mathcal{D}^{\mathrm{hst}}_2$ with sample sizes $n^{\mathrm{hst}}_1$ and $n^{\mathrm{hst}}_2$.
• Separate $\{x_i\}_{i=1}^{n_{\mathrm{evl}}}$ into $\mathcal{D}^{\mathrm{evl}}_1$ and $\mathcal{D}^{\mathrm{evl}}_2$ with sample sizes $n^{\mathrm{evl}}_1$ and $n^{\mathrm{evl}}_2$.
■ Doubly robust estimator under a covariate shift:
$\hat{\theta} = \frac{1}{n^{\mathrm{hst}}_1} \sum_{i \in \mathcal{D}^{\mathrm{hst}}_1} \sum_{a \in \mathcal{A}} \hat{w}_2(a \mid x_i)\, \mathbb{1}[a_i = a]\big(y_i - \hat{f}_2(a, x_i)\big) + \frac{1}{n^{\mathrm{evl}}_1} \sum_{i \in \mathcal{D}^{\mathrm{evl}}_1} \sum_{a \in \mathcal{A}} \pi^{e}(a \mid x_i)\, \hat{f}_2(a, x_i)$,
where $\hat{w}_2(a \mid x_i)$ and $\hat{f}_2(a, x_i)$ are estimators of $\frac{q(x_i)}{p(x_i)} \frac{\pi^{e}(a \mid x_i)}{\pi^{b}(a \mid x_i)}$ and $\mathbb{E}[Y(a) \mid x_i]$ constructed from $\mathcal{D}^{\mathrm{hst}}_2$ and $\mathcal{D}^{\mathrm{evl}}_2$.
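A hedged sketch of one fold of this cross-fitted estimator; `w_hat2` and `f_hat2` stand for the nuisance estimates fitted on the second folds, and everything about their construction is left abstract:

```python
import numpy as np

def dr_covariate_shift(hist1, evl1, w_hat2, f_hat2, pi_e, actions):
    """One-fold DR estimate: score the first folds with nuisances from the second.

    w_hat2(a, x): estimate of (q(x)/p(x)) * (pi_e(a|x)/pi_b(a|x)).
    f_hat2(a, x): estimate of E[Y(a) | X = x].
    """
    # correction term over the historical fold D1_hst
    corr = np.mean([sum(w_hat2(a, x) * float(a_i == a) * (y - f_hat2(a, x))
                        for a in actions)
                    for x, a_i, y in hist1])
    # direct term over the evaluation fold D1_evl
    direct = np.mean([sum(pi_e(a, x) * f_hat2(a, x) for a in actions)
                      for x in evl1])
    return corr + direct
```

The usual cross-fitted construction then swaps the roles of the two folds and averages the two estimates.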
  47. Theoretical Properties
■ Has asymptotic normality.
■ Achieves the semiparametric lower bound.
• The efficiency bound is calculated based on the results of van der Vaart (1998) and Wooldridge (2001). (We skip the details here.)
■ Has double robustness:
• if either $\hat{w}_2(a \mid x)$ or $\hat{f}_2(a, x)$ is consistent, the proposed estimator is also consistent.
■ The doubly robust estimator is useful for policy learning.
  48. Problem Setting
■ For a time series $t = 1, 2, \ldots$, we have a dataset $\{(x_t, \tilde{y}_t)\}_{t=1}^{T}$.
■ $x_t$ is a feature, and $\tilde{y}_t \in \{-1, +1\}$ is a binary label.
■ The label $\tilde{y}_t$ may not be true and can be flipped;
→ some time after the observation, the first observed label can be flipped into the true binary label $y_t$.
  49. Example: Advertisement Optimization
■ We want to predict whether a user buys an item ($y_t = +1$) from the user's features $x_t$ when showing an advertisement.
■ At first, we cannot observe a response; therefore, we regard the label as negative, $\tilde{y}_t = -1$.
→ If the user buys the advertised item some time after the impression, the label turns positive, $\tilde{y}_t = y_t = +1$.
  50. Two Approaches
■ There are two approaches to this problem.
1. Assume that old samples have true labels.
• Parametric approach: survival analysis models (Chapelle (KDD 2014)); the performance is very poor.
• Nonparametric (?) approach: use only old samples. → By additionally assuming stationarity of the time series: importance weighting (Yasui et al. (WWW 2020); sketched below); unbiased risk approach (Kato and Yasui (2020)). The convergence rate depends only on the sample size of the old samples.
2. Maximize the worst-case performance under an adversarial attack (Cesa-Bianchi, Gentile, and Mansour (2019)).
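As a rough illustration of the importance weighting idea, a sketch of a weighted classification loss. This is my own abstraction, not the exact procedure of Yasui et al.: `weight_fn` stands for the correction factor between the observed-label distribution and the true-label distribution, which in their approach is estimated from old samples whose labels are assumed to have settled:

```python
import numpy as np

def importance_weighted_risk(model, recent_data, weight_fn):
    """Importance-weighted logistic risk for delayed feedback.

    recent_data: pairs (x, s) with s in {-1, +1} the OBSERVED label,
    which may still be a provisional negative.
    weight_fn(x, s): hypothetical importance weight correcting the shift
    between observed and true labels (estimated from settled old samples).
    model(x): real-valued score whose sign is the predicted label.
    """
    losses = [weight_fn(x, s) * np.log1p(np.exp(-s * model(x)))
              for x, s in recent_data]
    return np.mean(losses)  # minimize this over the model parameters
```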
  51. Machine Learning Perspective
■ Prediction under delayed feedback is a special case of learning with noisy labels and of learning from positive and unlabeled (PU) data.
■ As a special case of learning with noisy labels, du Plessis et al. and subsequent work proposed methods for PU learning.
■ Therefore, the methods for delayed feedback are similar to those for learning with noisy labels and PU learning.
Ex. Yasui et al. (2020) is similar to Liu and Tao (2015); Kato and Yasui (2020) is similar to du Plessis et al. (2015).
  52. Reference of Adaptive Experimental Design
• Klaassen, C. A. J. Consistent estimation of the influence function of locally asymptotically linear estimators. Annals of Statistics, 1987.
• van der Laan, M. J. The construction and analysis of adaptive group sequential designs. 2008.
• van der Laan, M. J. and Lendle, S. D. Online targeted learning. 2014.
• Luedtke, A. R. and van der Laan, M. J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 2016.
• Zheng, W. and van der Laan, M. J. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning: Causal Inference for Observational and Experimental Data, Springer Series in Statistics. 2011.
• Johari, R., Pekelis, L., and Walsh, D. J. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint arXiv:1512.04922, 2015.
• Balsubramani, A. and Ramdas, A. Sequential nonparametric testing with the law of the iterated logarithm. In UAI, 2016.
  53. Reference of Adaptive Experimental Design
• Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
• Hahn, J., Hirano, K., and Karlan, D. Adaptive experimental design using the propensity score. Journal of Business and Economic Statistics, 29(1):96–108, 2011.
• Zhao, S., Zhou, E., Sabharwal, A., and Ermon, S. Adaptive concentration inequalities for sequential decision problems. In NeurIPS, pp. 1343–1351. Curran Associates, Inc., 2016.
• Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68, 2018.
• Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. JMLR, 2016.
  54. Reference of Delayed Feedback
• Chapelle, O. Modeling delayed feedback in display advertising. In KDD, 2014.
• Yasui, S., Morishita, G., Fujita, K., and Shibata, M. A feedback shift correction in predicting conversion rates under delayed feedback. In WWW, 2020.
• du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
• Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Delay and cooperation in nonstochastic bandits. JMLR, 2019.
  55. Reference of OPE under Covariate Shift
• Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
• Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. 2012.
• Hirano, K., Imbens, G., and Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71:1161–1189, 2003.
• Wooldridge, J. M. Asymptotic properties of weighted M-estimators for standard stratified samples. Econometric Theory, 2001.
• van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.