"Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression" (on arXiv)

■ Problem setting:
• Goal: prediction of the conditional average treatment effect (CATE).
• Interpolating estimators that fit the observations perfectly.
■ Finding:
• In CATE prediction, compared with Bartlett et al. (2020), additional assumptions or modified algorithms are required for benign overfitting.
■ Setting: a binary treatment a ∈ {1, 0}, e.g., a new medicine (a = 1) and a placebo (a = 0).
• For each treatment a ∈ {1, 0}, we define the potential outcome y_a.
• An individual with covariate x ∈ ℝ^p receives a treatment d ∈ {1, 0}.
• Propensity score: the treatment assignment probability p(d = a | x).
• For the assigned treatment, we observe the outcome y = 1[d = 1] y_1 + 1[d = 0] y_0. We cannot observe the outcome of the treatment the individual is not assigned.
■ For n individuals, the observations are given as {(y_i, d_i, x_i)}_{i=1}^n.
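The observational setup above can be sketched in a small simulation. Everything concrete here (the dimensions, the Gaussian covariates, the logistic propensity score, the parameter values) is an illustrative assumption, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5

# Hypothetical linear-model parameters theta*_1 and theta*_0 (for illustration only).
theta1 = rng.normal(size=p)
theta0 = rng.normal(size=p)

x = rng.normal(size=(n, p))              # covariates x_i in R^p
y1 = x @ theta1 + rng.normal(size=n)     # potential outcome y_1
y0 = x @ theta0 + rng.normal(size=n)     # potential outcome y_0

# Propensity score p(d = 1 | x) depending on x (sample selection bias);
# a logistic form is an arbitrary choice for the sketch.
propensity = 1.0 / (1.0 + np.exp(-x[:, 0]))
d = rng.binomial(1, propensity)          # treatment assignment d_i

# Only the outcome of the assigned treatment is observed.
y = np.where(d == 1, y1, y0)
```

The observed data set is then {(y_i, d_i, x_i)}; the counterfactual outcomes remain hidden, which is exactly what makes CATE prediction non-trivial.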
■ We assume a linear regression model for each potential outcome y_a:
y_a = 𝔼[y_a | x] + ε = x^⊤ θ_a^* + ε.
■ The causal effect between treatments is captured by the CATE.
• The CATE is defined as the difference between 𝔼[y_1 | x] and 𝔼[y_0 | x]:
τ(x) = 𝔼[y_1 | x] − 𝔼[y_0 | x] = x^⊤(θ_1^* − θ_0^*) = x^⊤ θ^*, where θ^* = θ_1^* − θ_0^*.
➢ Goal: predict the CATE by using an estimator of θ^*.
■ To evaluate prediction performance, we define the excess risk for an estimator θ̂ as
R(θ̂) = 𝔼_{x,y}[(y_1 − y_0 − x^⊤ θ̂)² − (y_1 − y_0 − x^⊤ θ^*)²].
■ We also consider an overparametrized situation:
n (sample size) ≤ p (the number of parameters).
Ex. Benign overfitting framework by Bartlett et al. (2020).
• Consider a standard regression setting with interpolating estimators.
• R(θ̂) goes to zero under some conditions on the covariance of x.
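Under the linear model above, if the noise is independent of x with mean zero, the cross terms cancel and the excess risk reduces to 𝔼[(x^⊤(θ̂ − θ^*))²], which can be estimated by Monte Carlo. The Gaussian covariate distribution, dimension, and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
theta_star = rng.normal(size=p)  # true CATE parameter theta* = theta*_1 - theta*_0

def excess_risk(theta_hat, theta_star, sigma, n_mc=200_000, rng=rng):
    """Monte Carlo estimate of R(theta_hat) for mean-zero Gaussian x with covariance sigma.

    With noise independent of x, the excess risk reduces to
    E[(x^T (theta_hat - theta*))^2], evaluated over the ORIGINAL distribution of x.
    """
    x = rng.multivariate_normal(np.zeros(len(theta_star)), sigma, size=n_mc)
    diff = x @ (theta_hat - theta_star)
    return float(np.mean(diff ** 2))

sigma = np.eye(p)
# The true parameter attains zero excess risk by definition.
assert excess_risk(theta_star, theta_star, sigma) == 0.0
```

Note the risk is computed over the original covariate distribution, not the selected one; this is the point that makes sample selection bias matter later.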
■ Difficulty in applying the results of Bartlett et al. (2020) to CATE prediction.
■ There is a sample selection bias, i.e., p(d = a | x) depends on the covariates x.
• The distribution of the observed covariates, 1[d = a] x, differs from that of the original covariates, x.
■ Owing to this bias, we need to modify the results of Bartlett et al. (2020) to show benign overfitting in CATE prediction.
• Note that the excess risk is defined over the original distribution of x. (This problem is an instance of distribution shift, or covariate shift.)
■ We consider the following two methods with interpolating estimators.
1. T-Learner: this method consists of two separate linear models.
   i. Estimate linear regression models for 𝔼[y_1 | x] and 𝔼[y_0 | x] separately.
   ii. Predict the CATE by using the difference of the two estimators.
2. Inverse probability weighting (IPW)-Learner: this method utilizes the propensity score p(d = a | x).
   i. Construct an unbiased estimator τ̂(x) of τ(x) as
      τ̂(x) = 1[d = 1] y / p(d = 1 | x) − 1[d = 0] y / p(d = 0 | x).
   ii. Construct a predictor by regressing τ̂(x) on the covariates x.
■ Effective rank (Bartlett et al. (2020)).
• Denote the covariance operator of the covariates x by Σ.
• The eigenvalues λ_j = μ_j(Σ) for j = 1, 2, … are sorted so that λ_1 > λ_2 > ⋯.
• If ∑_{j=1}^∞ λ_j < ∞ and λ_{k+1} > 0 for k ≥ 0, define
r_k(Σ) = (∑_{j>k} λ_j) / λ_{k+1} and R_k(Σ) = (∑_{j>k} λ_j)² / (∑_{j>k} λ_j²).
• Denote the covariance of the observed covariates 1[d = a] x by Σ_a.
The effective rank is a measure of the complexity of the covariances.
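The two effective ranks can be computed directly from a sorted eigenvalue sequence. A minimal sketch (the truncation index k and the example spectrum are arbitrary choices for illustration):

```python
import numpy as np

def effective_ranks(eigvals, k):
    """Effective ranks r_k(Sigma) and R_k(Sigma) from the eigenvalues of Sigma."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # lambda_1 >= lambda_2 >= ...
    tail = lam[k:]                  # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / lam[k]       # r_k = (sum_{j>k} lambda_j) / lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()  # R_k = (sum_{j>k} lambda_j)^2 / sum_{j>k} lambda_j^2
    return r_k, R_k

# Example: a flat spectrum of dimension p gives r_0 = R_0 = p.
r0, R0 = effective_ranks(np.ones(10), k=0)  # both equal 10.0
```

For a flat (identity-like) spectrum both ranks equal the ambient dimension, while a fast-decaying spectrum yields small effective ranks; this is what "complexity of the covariance" means in the conditions for benign overfitting.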
■ T-Learner: define minimum-norm interpolating estimators for y_1 and y_0 as x^⊤ θ̂_1 and x^⊤ θ̂_0, where
θ̂_a = X_a^⊤ (X_a X_a^⊤)^† y_a, a ∈ {1, 0}.
• X_a (y_a) is the covariate matrix (outcome vector) of the individuals with assigned treatment a.
■ We define an estimator of θ^* as the difference of the above estimators:
θ̂_{T-Learner} = θ̂_1 − θ̂_0.
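The minimum-norm interpolator X^⊤(XX^⊤)^† y has a direct expression via the pseudoinverse, so the T-Learner above can be sketched in a few lines (the data shapes used in any example are illustrative):

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm least-squares solution theta_hat = X^T (X X^T)^+ y.

    When n <= p and X X^T is invertible, this fits the observations exactly.
    """
    return X.T @ np.linalg.pinv(X @ X.T) @ y

def t_learner(X, y, d):
    """T-Learner: fit each treatment arm separately, then take the difference."""
    theta1 = min_norm_interpolator(X[d == 1], y[d == 1])  # theta_hat_1
    theta0 = min_norm_interpolator(X[d == 0], y[d == 0])  # theta_hat_0
    return theta1 - theta0                                # theta_hat_{T-Learner}
```

Note that each arm interpolates only its own selected sample X_a, whose covariance is Σ_a rather than Σ; that mismatch is the source of the difficulty discussed next.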
■ The convergence of the T-Learner depends on the existence of sample selection bias.
➢ Case 1: the treatment assignment does not depend on the covariates.
• No sample selection bias (p(d = 1 | x) = p(d = 1)), e.g., RCTs → ‖(Σ − ζ_a^* Σ_a) θ_a^*‖² = 0.
• R(θ̂) goes to zero under the same conditions used in Bartlett et al. (2020).
➢ Case 2: the treatment assignment depends on the covariates.
• The term ‖(Σ − ζ_a^* Σ_a) θ_a^*‖² in the upper bound does not go to zero.
• The convergence of the excess risk R(θ̂) is not guaranteed.
■ IPW-Learner: assume that the propensity score p(d = 1 | x) is known.
■ Obtain an unbiased estimator of the CATE:
τ̂(x) = 1[d = 1] y / p(d = 1 | x) − 1[d = 0] y / p(d = 0 | x).
• This estimator is called an IPW estimator.
■ Regress τ̂(x) on x to estimate θ^* with an interpolating estimator:
θ̂_{IPW-Learner} = X^⊤ (X X^⊤)^† τ̂.
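A sketch of the IPW-Learner, assuming the propensity score is known and supplied as a vector of values p(d = 1 | x_i):

```python
import numpy as np

def ipw_learner(X, y, d, propensity):
    """IPW-Learner: regress the unbiased IPW pseudo-outcome on the covariates.

    tau_hat_i = 1[d_i = 1] y_i / p(d=1|x_i) - 1[d_i = 0] y_i / p(d=0|x_i)
    """
    tau_hat = np.where(d == 1, y / propensity, -y / (1.0 - propensity))
    # Minimum-norm interpolator X^T (X X^T)^+ tau_hat over ALL n samples,
    # so the design covariance is that of the original x (no sample selection).
    return X.T @ np.linalg.pinv(X @ X.T) @ tau_hat
```

Unlike the T-Learner, every individual contributes one pseudo-outcome τ̂(x_i), so the regression sees the original covariate distribution; the importance weights 1/p(d = a | x) are what correct the distribution shift.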
■ Theorem: there exist constants b, c, c_1 > 1 such that if δ < 1 with log(1/δ) < n/c and k^* = min{k ≥ 0 : r_k(Σ) ≥ bn} < n/c_1, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least 1 − δ,
R(θ̂_n^{IPW-Learner}) ≤ c‖θ^*‖² ℬ_{n,δ}(Σ_a) + c log(1/δ) 𝒱_n(Σ).
■ This result follows from the unbiasedness of the IPW estimator τ̂(x).
■ Under appropriate conditions on the covariance operator, the prediction risk goes to zero, regardless of the treatment assignment rule p(d = 1 | x).
Upper bounds of T-Learner (Theorem 5.3 of Kato and Imaizumi (2022)).
■ Summary: benign overfitting in CATE prediction under sample selection bias (distribution shift of the covariates).
✓ T-Learner: the distribution shift affects the upper bound.
→ Benign overfitting does not occur under the change of the covariance.
✓ IPW-Learner: we correct the distribution shift by the importance weight.
→ Benign overfitting occurs as in the setting of Bartlett et al. (2020).
? Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).
Thank you! (mkato.csecon@gmail.com)
References
• Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), "Benign overfitting in linear regression," Proceedings of the National Academy of Sciences, 117, 30063–30070.
• Rubin, D. B. (1974), "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology.
• Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
• Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences.
• Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), "Covariate Shift in High-Dimensional Random Feature Regression."
• — (2021b), "Overparameterization Improves Robustness to Covariate Shift in High Dimensions," in Conference on Neural Information Processing Systems.