"Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression" (on arXiv)

■ Problem setting:
• Goal: prediction of the conditional average treatment effect (CATE).
• Interpolating estimators that fit the observations perfectly.
■ Finding:
• In CATE prediction, compared with Bartlett et al. (2020), additional assumptions or modified algorithms are required for benign overfitting.
■ Setting: a binary treatment a ∈ {1, 0}, e.g., a new medicine (a = 1) and a placebo (a = 0).
• For each treatment a ∈ {1, 0}, we define the potential outcome y_a.
• An individual with covariate x ∈ ℝ^p receives a treatment d ∈ {1, 0}.
• Propensity score: the treatment assignment probability p(d = a | x).
• For the assigned treatment, we observe the outcome y = 1[d = 1] y_1 + 1[d = 0] y_0. We cannot observe the outcome of the treatment the individual is not assigned.
■ For n individuals, the observations are given as {(y_i, d_i, x_i)}_{i=1}^n.
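The observational setup above can be sketched in a small simulation. Everything concrete here (the dimensions, the Gaussian covariates, the logistic propensity score, the parameter values) is an illustrative assumption, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5

# Hypothetical linear-model parameters theta*_1 and theta*_0 (for illustration only).
theta1 = rng.normal(size=p)
theta0 = rng.normal(size=p)

x = rng.normal(size=(n, p))              # covariates x_i in R^p
y1 = x @ theta1 + rng.normal(size=n)     # potential outcome y_1
y0 = x @ theta0 + rng.normal(size=n)     # potential outcome y_0

# Propensity score p(d = 1 | x) depending on x (sample selection bias);
# a logistic form is an arbitrary choice for the sketch.
propensity = 1.0 / (1.0 + np.exp(-x[:, 0]))
d = rng.binomial(1, propensity)          # treatment assignment d_i

# Only the outcome of the assigned treatment is observed.
y = np.where(d == 1, y1, y0)
```

The observed data set is then {(y_i, d_i, x_i)}; the counterfactual outcomes remain hidden, which is exactly what makes CATE prediction non-trivial.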
■ We assume a linear regression model for each potential outcome y_a:
y_a = 𝔼[y_a | x] + ε = x^⊤ θ_a^* + ε.
■ The causal effect between treatments is captured by the CATE.
• The CATE is defined as the difference between 𝔼[y_1 | x] and 𝔼[y_0 | x]:
τ(x) = 𝔼[y_1 | x] − 𝔼[y_0 | x] = x^⊤(θ_1^* − θ_0^*) = x^⊤ θ^*, where θ^* = θ_1^* − θ_0^*.
➢ Goal: predict the CATE by using an estimator of θ^*.
■ To evaluate prediction performance, we define the excess risk for an estimator θ̂ as
R(θ̂) = 𝔼_{x,y}[(y_1 − y_0 − x^⊤ θ̂)² − (y_1 − y_0 − x^⊤ θ^*)²].
■ We also consider an overparametrized situation:
n (sample size) ≤ p (the number of parameters).
Ex. Benign overfitting framework by Bartlett et al. (2020).
• Consider a standard regression setting with interpolating estimators.
• R(θ̂) goes to zero under some conditions on the covariance of x.
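Under the linear model above, if the noise is independent of x with mean zero, the cross terms cancel and the excess risk reduces to 𝔼[(x^⊤(θ̂ − θ^*))²], which can be estimated by Monte Carlo. The Gaussian covariate distribution, dimension, and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
theta_star = rng.normal(size=p)  # true CATE parameter theta* = theta*_1 - theta*_0

def excess_risk(theta_hat, theta_star, sigma, n_mc=200_000, rng=rng):
    """Monte Carlo estimate of R(theta_hat) for mean-zero Gaussian x with covariance sigma.

    With noise independent of x, the excess risk reduces to
    E[(x^T (theta_hat - theta*))^2], evaluated over the ORIGINAL distribution of x.
    """
    x = rng.multivariate_normal(np.zeros(len(theta_star)), sigma, size=n_mc)
    diff = x @ (theta_hat - theta_star)
    return float(np.mean(diff ** 2))

sigma = np.eye(p)
# The true parameter attains zero excess risk by definition.
assert excess_risk(theta_star, theta_star, sigma) == 0.0
```

Note the risk is computed over the original covariate distribution, not the selected one; this is the point that makes sample selection bias matter later.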
■ Difficulty in applying the results of Bartlett et al. (2020) to CATE prediction.
■ There is a sample selection bias, i.e., p(d = a | x) depends on the covariates x.
• The distribution of the observed covariates, 1[d = a] x, differs from that of the original covariates, x.
■ Owing to this bias, we need to modify the results of Bartlett et al. (2020) to show benign overfitting in CATE prediction.
• Note that the excess risk is defined over the original distribution of x. (This problem is an instance of distribution shift, or covariate shift.)
■ We consider the following two methods with interpolating estimators.
1. T-Learner: this method consists of two separate linear models.
   i. Estimate linear regression models for 𝔼[y_1 | x] and 𝔼[y_0 | x] separately.
   ii. Predict the CATE by using the difference of the two estimators.
2. Inverse probability weighting (IPW)-Learner: this method utilizes the propensity score p(d = a | x).
   i. Construct an unbiased estimator τ̂(x) of τ(x) as
      τ̂(x) = 1[d = 1] y / p(d = 1 | x) − 1[d = 0] y / p(d = 0 | x).
   ii. Construct a predictor by regressing τ̂(x) on the covariates x.
■ Effective rank (Bartlett et al. (2020)).
• Denote the covariance operator of the covariates x by Σ.
• The eigenvalues λ_j = μ_j(Σ) for j = 1, 2, … are sorted so that λ_1 > λ_2 > ⋯.
• If ∑_{j=1}^∞ λ_j < ∞ and λ_{k+1} > 0 for k ≥ 0, define
r_k(Σ) = (∑_{j>k} λ_j) / λ_{k+1} and R_k(Σ) = (∑_{j>k} λ_j)² / (∑_{j>k} λ_j²).
• Denote the covariance of the observed covariates 1[d = a] x by Σ_a.
The effective rank is a measure of the complexity of the covariances.
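The two effective ranks can be computed directly from a sorted eigenvalue sequence. A minimal sketch (the truncation index k and the example spectrum are arbitrary choices for illustration):

```python
import numpy as np

def effective_ranks(eigvals, k):
    """Effective ranks r_k(Sigma) and R_k(Sigma) from the eigenvalues of Sigma."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # lambda_1 >= lambda_2 >= ...
    tail = lam[k:]                  # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / lam[k]       # r_k = (sum_{j>k} lambda_j) / lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()  # R_k = (sum_{j>k} lambda_j)^2 / sum_{j>k} lambda_j^2
    return r_k, R_k

# Example: a flat spectrum of dimension p gives r_0 = R_0 = p.
r0, R0 = effective_ranks(np.ones(10), k=0)  # both equal 10.0
```

For a flat (identity-like) spectrum both ranks equal the ambient dimension, while a fast-decaying spectrum yields small effective ranks; this is what "complexity of the covariance" means in the conditions for benign overfitting.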
■ T-Learner: define minimum-norm interpolating estimators for y_1 and y_0 as x^⊤ θ̂_1 and x^⊤ θ̂_0, where
θ̂_a = X_a^⊤ (X_a X_a^⊤)^† y_a, a ∈ {1, 0}.
• X_a (y_a) is the covariate matrix (outcome vector) of the individuals with assigned treatment a.
■ We define an estimator of θ^* as the difference of the above estimators:
θ̂_{T-Learner} = θ̂_1 − θ̂_0.
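The minimum-norm interpolator X^⊤(XX^⊤)^† y has a direct expression via the pseudoinverse, so the T-Learner above can be sketched in a few lines (the data shapes used in any example are illustrative):

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm least-squares solution theta_hat = X^T (X X^T)^+ y.

    When n <= p and X X^T is invertible, this fits the observations exactly.
    """
    return X.T @ np.linalg.pinv(X @ X.T) @ y

def t_learner(X, y, d):
    """T-Learner: fit each treatment arm separately, then take the difference."""
    theta1 = min_norm_interpolator(X[d == 1], y[d == 1])  # theta_hat_1
    theta0 = min_norm_interpolator(X[d == 0], y[d == 0])  # theta_hat_0
    return theta1 - theta0                                # theta_hat_{T-Learner}
```

Note that each arm interpolates only its own selected sample X_a, whose covariance is Σ_a rather than Σ; that mismatch is the source of the difficulty discussed next.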
■ The convergence of the T-Learner depends on the existence of sample selection bias.
➢ Case 1: the treatment assignment does not depend on the covariates.
• No sample selection bias (p(d = 1 | x) = p(d = 1)), e.g., RCTs → ‖(Σ − ζ_a^* Σ_a) θ_a^*‖² = 0.
• R(θ̂) goes to zero under the same conditions used in Bartlett et al. (2020).
➢ Case 2: the treatment assignment depends on the covariates.
• The term ‖(Σ − ζ_a^* Σ_a) θ_a^*‖² in the upper bound does not go to zero.
• The convergence of the excess risk R(θ̂) is not guaranteed.
■ IPW-Learner: assume that the propensity score p(d = 1 | x) is known.
■ Obtain an unbiased estimator of the CATE:
τ̂(x) = 1[d = 1] y / p(d = 1 | x) − 1[d = 0] y / p(d = 0 | x).
• This estimator is called an IPW estimator.
■ Regress τ̂(x) on x to estimate θ^* with an interpolating estimator:
θ̂_{IPW-Learner} = X^⊤ (X X^⊤)^† τ̂.
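A sketch of the IPW-Learner, assuming the propensity score is known and supplied as a vector of values p(d = 1 | x_i):

```python
import numpy as np

def ipw_learner(X, y, d, propensity):
    """IPW-Learner: regress the unbiased IPW pseudo-outcome on the covariates.

    tau_hat_i = 1[d_i = 1] y_i / p(d=1|x_i) - 1[d_i = 0] y_i / p(d=0|x_i)
    """
    tau_hat = np.where(d == 1, y / propensity, -y / (1.0 - propensity))
    # Minimum-norm interpolator X^T (X X^T)^+ tau_hat over ALL n samples,
    # so the design covariance is that of the original x (no sample selection).
    return X.T @ np.linalg.pinv(X @ X.T) @ tau_hat
```

Unlike the T-Learner, every individual contributes one pseudo-outcome τ̂(x_i), so the regression sees the original covariate distribution; the importance weights 1/p(d = a | x) are what correct the distribution shift.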
■ Theorem: there exist constants b, c, c_1 > 1 such that if δ < 1 with log(1/δ) < n/c and k^* = min{k ≥ 0 : r_k(Σ) ≥ bn} < n/c_1, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least 1 − δ,
R(θ̂_n^{IPW-Learner}) ≤ c‖θ^*‖² ℬ_{n,δ}(Σ_a) + c log(1/δ) 𝒱_n(Σ).
■ This result follows from the unbiasedness of the IPW estimator τ̂(x).
■ Under appropriate conditions on the covariance operator, the prediction risk goes to zero, regardless of the treatment assignment rule p(d = 1 | x).
Upper bounds of T-Learner (Theorem 5.3 of Kato and Imaizumi (2022)).
■ Summary: benign overfitting in CATE prediction under sample selection bias (distribution shift of the covariates).
✓ T-Learner: the distribution shift affects the upper bound.
→ Benign overfitting does not occur under the change of the covariance.
✓ IPW-Learner: we correct the distribution shift by the importance weight.
→ Benign overfitting occurs as in the setting of Bartlett et al. (2020).
? Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).
Thank you! (mkato.csecon@gmail.com)
References
• Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), "Benign overfitting in linear regression," Proceedings of the National Academy of Sciences, 117, 30063–30070.
• Rubin, D. B. (1974), "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology.
• Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
• Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences.
• Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), "Covariate Shift in High-Dimensional Random Feature Regression."
• — (2021b), "Overparameterization Improves Robustness to Covariate Shift in High Dimensions," in Conference on Neural Information Processing Systems.