
# Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

TOPML 2022.

April 06, 2022

## Transcript

1. ### Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

    TOPML Workshop, April 5th, 2022. Masahiro Kato (The University of Tokyo / CyberAgent, Inc. AILab) and Masaaki Imaizumi (The University of Tokyo).
2. ### Introduction

    - Kato and Imaizumi (2022), "Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression," on arXiv.
    - Problem setting:
      - Goal: prediction of the conditional average treatment effect (CATE).
      - Interpolating estimators that fit the observations perfectly.
    - Finding:
      - In CATE prediction, compared with Bartlett et al. (2020), additional assumptions or modified algorithms are required for benign overfitting.
3. ### CATE Prediction

    - Potential outcome framework (Rubin (1974)).
      - Binary treatment $a \in \{1, 0\}$, e.g., a new medicine ($a = 1$) and a placebo ($a = 0$).
      - For each treatment $a \in \{1, 0\}$, we define the potential outcome $y_a$.
      - An individual with covariate $x \in \mathbb{R}^p$ gets a treatment $d \in \{1, 0\}$. Propensity score: the treatment assignment probability $p(d = a \mid x)$.
      - For the assigned treatment, we observe the outcome $y = \mathbb{1}[d = 1]\, y_1 + \mathbb{1}[d = 0]\, y_0$. We cannot observe the outcome of the treatment the individual is not assigned to.
    - For $n$ individuals, the observations are given as $\{(y_i, d_i, x_i)\}_{i=1}^{n}$.
4. ### Linear Regression Model of the CATE

    - Assume a linear regression model for each potential outcome $y_a$:
      $$y_a = \mathbb{E}[y_a \mid x] + \varepsilon = x^\top \theta_a^* + \varepsilon.$$
    - The causal effect between treatments is captured by the CATE.
      - The CATE is defined as the difference of $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$:
        $$\tau(x) = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = x^\top (\theta_1^* - \theta_0^*) = x^\top \theta^*, \qquad \theta^* = \theta_1^* - \theta_0^*.$$
    - Goal: predict the CATE by using an estimator of $\theta^*$.
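The linear potential-outcome model above can be simulated in a few lines. This is my own illustrative sketch, not from the talk; the dimensions, coefficient scales, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # overparametrized: p > n (arbitrary sizes)

theta1 = rng.normal(size=p) / np.sqrt(p)   # hypothetical true parameter theta_1*
theta0 = rng.normal(size=p) / np.sqrt(p)   # hypothetical true parameter theta_0*
theta_star = theta1 - theta0               # CATE parameter theta*

X = rng.normal(size=(n, p))                # covariates x_i
eps1 = rng.normal(scale=0.1, size=n)
eps0 = rng.normal(scale=0.1, size=n)

y1 = X @ theta1 + eps1                     # potential outcomes under a = 1
y0 = X @ theta0 + eps0                     # potential outcomes under a = 0
tau = X @ theta_star                       # true CATE tau(x_i) = x_i^T theta*

d = rng.binomial(1, 0.5, size=n)           # treatment assignment
y = np.where(d == 1, y1, y0)               # only the assigned outcome is observed
```

Note that only `y`, `d`, and `X` would be available to an analyst; `tau` is the unobserved target of prediction.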
5. ### Excess Risk and Benign Overfitting

    - To evaluate the performance, we define the excess risk for an estimator $\hat{\theta}$ as
      $$R(\hat{\theta}) = \mathbb{E}_{x, y}\left[ \left( (y_1 - y_0) - x^\top \hat{\theta} \right)^2 - \left( (y_1 - y_0) - x^\top \theta^* \right)^2 \right].$$
    - We also consider an overparametrized situation: $n$ (sample size) $\le p$ (the number of parameters).
    - Ex. Benign overfitting framework by Bartlett et al. (2020).
      - Consider a standard regression setting with interpolating estimators.
      - $R(\hat{\theta})$ goes to zero under some conditions on the covariance of $x$.
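When the noise in $y_1 - y_0$ is mean-zero and independent of $x$, the excess risk above reduces to the quadratic form $(\hat{\theta} - \theta^*)^\top \Sigma (\hat{\theta} - \theta^*)$, since the cross term vanishes. The following sketch (my own check, with an arbitrary covariance and estimator) verifies this identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
Sigma = np.diag(np.linspace(1.0, 0.2, p))               # hypothetical covariance of x
theta_star = rng.normal(size=p)
theta_hat = theta_star + rng.normal(scale=0.3, size=p)  # some estimator

# Closed form: R(theta_hat) = (theta_hat - theta*)^T Sigma (theta_hat - theta*)
delta = theta_hat - theta_star
closed_form = delta @ Sigma @ delta

# Monte Carlo evaluation of the excess-risk definition
m = 500_000
x = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
noise = rng.normal(scale=0.5, size=m)       # noise in y1 - y0, independent of x
diff = x @ theta_star + noise               # samples of (y1 - y0)
mc = np.mean((diff - x @ theta_hat) ** 2 - (diff - x @ theta_star) ** 2)

print(closed_form, mc)                      # the two values should roughly agree
```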
6. ### Excess Risk and Benign Overfitting

    - Apply the framework of Bartlett et al. (2020) to CATE prediction.
    - There is a sample selection bias, i.e., $p(d = a \mid x)$ depends on the covariates $x$.
      - The distribution of the observed covariates, $\mathbb{1}[d = a]\, x$, changes from that of the original covariates, $x$.
    - Owing to this bias, we need to add some modifications to the results of Bartlett et al. (2020) to show benign overfitting in CATE prediction.
      - Note that the excess risk is defined over the original distribution of $x$. (This problem is an instance of distribution shift, or covariate shift.)
7. ### T-Learner and IPW-Learner

    Consider two CATE prediction methods using interpolating estimators.

    1. T-Learner: this method consists of two separate linear models.
       1. Estimate linear regression models for $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$ separately.
       2. Predict the CATE by using the difference of the two estimators.
    2. Inverse probability weighting (IPW)-Learner: this method utilizes the propensity score $p(d = a \mid x)$.
       1. Construct an unbiased estimator $\hat{\tau}(x)$ of $\tau(x)$ as
          $$\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}.$$
       2. Construct a predictor by regressing $\hat{\tau}(x)$ on the covariates $x$.
8. ### Effective Rank

    - Show upper bounds using the effective rank (Bartlett et al. (2020)).
      - Denote the covariance operator of the covariates $x$ by $\Sigma$.
      - The eigenvalues are $\lambda_j = \mu_j(\Sigma)$ for $j = 1, 2, \ldots$, ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots$.
      - If $\sum_{j=1}^{\infty} \lambda_j < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define
        $$r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} \quad \text{and} \quad R_k(\Sigma) = \frac{\left( \sum_{j > k} \lambda_j \right)^2}{\sum_{j > k} \lambda_j^2}.$$
      - Denote the covariance of the observed covariates $\mathbb{1}[d = a]\, x$ by $\Sigma_a$.
    - The effective rank is a measure of the complexity of the covariances.
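The two effective ranks are straightforward to compute from a spectrum. A minimal sketch (my own helper, with an arbitrary example spectrum):

```python
import numpy as np

def effective_ranks(eigvals, k):
    """Effective ranks r_k and R_k of Bartlett et al. (2020), from eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # lambda_1 >= lambda_2 >= ...
    tail = lam[k:]                        # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / lam[k]             # (sum_{j>k} lambda_j) / lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

# Example: a slowly decaying spectrum (arbitrary illustration)
lam = 1.0 / np.arange(1, 101) ** 1.1
print(effective_ranks(lam, k=0))
```

As a sanity check, a flat spectrum with $d$ equal eigenvalues gives $r_0 = R_0 = d$, matching the intuition that the effective rank counts "how many directions matter".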
9. ### Interpolating Estimator with T-Learner

    - The T-Learner.
    - Define interpolating estimators for $y_1$ and $y_0$ as $x^\top \hat{\theta}_1$ and $x^\top \hat{\theta}_0$, where
      $$\hat{\theta}_a = X_a^\top \left( X_a X_a^\top \right)^{\dagger} y_a, \qquad a \in \{1, 0\}.$$
      - $X_a$ ($y_a$) is the covariate matrix (outcome vector) of the individuals with assigned treatment $a$.
    - We define an estimator of $\theta^*$ as the difference of the above estimators:
      $$\hat{\theta}^{\text{T-Learner}} = \hat{\theta}_1 - \hat{\theta}_0.$$
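The minimum-norm interpolator in each arm can be written directly with a pseudo-inverse. A self-contained sketch (my own illustration; sizes and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 120                         # overparametrized setting (arbitrary sizes)
X = rng.normal(size=(n, p))
theta1 = rng.normal(size=p) / np.sqrt(p)
theta0 = rng.normal(size=p) / np.sqrt(p)
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, X @ theta1, X @ theta0) + rng.normal(scale=0.1, size=n)

# Split the sample by treatment and fit the minimum-norm interpolator in each arm
X1, y1 = X[d == 1], y[d == 1]
X0, y0 = X[d == 0], y[d == 0]
theta1_hat = X1.T @ np.linalg.pinv(X1 @ X1.T) @ y1
theta0_hat = X0.T @ np.linalg.pinv(X0 @ X0.T) @ y0
theta_T = theta1_hat - theta0_hat      # T-Learner estimate of theta*

# Each arm interpolates its own training data exactly
print(np.allclose(X1 @ theta1_hat, y1), np.allclose(X0 @ theta0_hat, y0))
```

Since $p$ exceeds the number of rows in each arm, $X_a X_a^\top$ is invertible almost surely and the fit passes through every observed point, i.e., the estimator overfits by construction.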
10. ### Upper Bound of the T-Learner

    Upper bound for the T-Learner (Theorem 4.3 of Kato and Imaizumi (2022)):

    There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
    $$R\big(\hat{\theta}_n^{\text{T-Learner}}\big) \le \sum_{a \in \{1, 0\}} \left\{ c \|\theta_a^*\|^2 \mathcal{B}_{n,\delta}(\Sigma_a) + \big\| (\Sigma - \zeta_a^* \Sigma_a)\, \theta_a^* \big\|^2 \right\} + c \|\theta_1^*\| \|\theta_0^*\|\, \mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta) \left\{ \mathcal{V}_n(\Sigma) + \big( \|\theta_1^*\| + \|\theta_0^*\| \big) \sqrt{\mathcal{V}_n(\Sigma)} \right\},$$
    where $\zeta_a^* = \arg\min_{\zeta \in \mathbb{R}} \|\Sigma - \zeta \Sigma_a\|$, $\mathcal{B}_{n,\delta}(\Sigma) = \|\Sigma\| \max\left\{ \sqrt{\dfrac{r_0(\Sigma)}{n}}, \sqrt{\dfrac{\log(1/\delta)}{n}} \right\}$, and $\mathcal{V}_n(\Sigma) = \dfrac{k^*}{n} + \dfrac{n}{R_{k^*}(\Sigma)}$.

    The term $\| (\Sigma - \zeta_a^* \Sigma_a)\, \theta_a^* \|^2$ captures the sample selection bias (distribution shift).
11. ### Upper Bound of the T-Learner

    - Benign overfitting depends on the existence of sample selection bias.
    - Case 1: the treatment assignment does not depend on the covariates.
      - No sample selection bias ($p(d = 1 \mid x) = p(d = 1)$), e.g., RCTs $\Rightarrow \| (\Sigma - \zeta_a^* \Sigma_a)\, \theta_a^* \|^2 = 0$.
      - $R(\hat{\theta})$ goes to zero under the same conditions used in Bartlett et al. (2020).
    - Case 2: the treatment assignment depends on the covariates.
      - $\| (\Sigma - \zeta_a^* \Sigma_a)\, \theta_a^* \|^2$ in the upper bound does not go to zero.
      - The convergence of the excess risk $R(\hat{\theta})$ is not guaranteed.
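The distinction between the two cases can be seen numerically. Under random assignment, the second moment of the observed covariates $\mathbb{1}[d = 1]\, x$ is proportional to $\Sigma$, so some scalar $\zeta$ makes $\Sigma - \zeta \Sigma_1$ vanish; under covariate-dependent assignment it generally is not. A quick check (my own illustration; the threshold rule is an extreme, deterministic case of covariate-dependent assignment):

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 500_000, 3
x = rng.normal(size=(m, q))            # Sigma = I (standardized covariates)

def observed_cov(d):
    """Second-moment matrix of the observed covariates 1[d = 1] x."""
    xd = x * d[:, None]
    return xd.T @ xd / m

# Case 1: random assignment, p(d = 1 | x) = 0.5  ->  Sigma_1 = 0.5 * Sigma
d_rct = rng.binomial(1, 0.5, size=m)
S1_rct = observed_cov(d_rct)

# Case 2: assignment depends on the first covariate, d = 1[x_1 > 1]
d_bias = (x[:, 0] > 1.0).astype(int)
S1_bias = observed_cov(d_bias)

print(np.round(S1_rct, 2))    # roughly 0.5 * I: proportional to Sigma
print(np.round(S1_bias, 2))   # first diagonal entry noticeably larger: not proportional
```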
12. ### Interpolating Estimator with IPW-Learner

    - The IPW-Learner.
    - Suppose that the propensity score $p(d = 1 \mid x)$ is known.
    - Obtain an unbiased estimator of the CATE,
      $$\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}.$$
      - This estimator is called an IPW estimator.
    - Regress $\hat{\tau}(x)$ on $x$ to estimate $\theta^*$ with an interpolating estimator:
      $$\hat{\theta}^{\text{IPW-Learner}} = X^\top \left( X X^\top \right)^{\dagger} \hat{\tau}.$$
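The two IPW-Learner steps can be sketched as follows (my own illustration; the propensity-score form and all sizes are arbitrary assumptions, chosen so that overlap holds):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 120                         # overparametrized setting (arbitrary sizes)
X = rng.normal(size=(n, p))
theta1 = rng.normal(size=p) / np.sqrt(p)
theta0 = rng.normal(size=p) / np.sqrt(p)

# Known covariate-dependent propensity score, bounded away from 0 and 1
e = 0.25 + 0.5 / (1.0 + np.exp(-X[:, 0]))   # hypothetical p(d = 1 | x) in (0.25, 0.75)
d = rng.binomial(1, e)
y = np.where(d == 1, X @ theta1, X @ theta0) + rng.normal(scale=0.1, size=n)

# Step 1: IPW pseudo-outcome, unbiased for tau(x) = x^T (theta1 - theta0) given x
tau_hat = d * y / e - (1 - d) * y / (1 - e)

# Step 2: minimum-norm interpolating regression of tau_hat on x
theta_ipw = X.T @ np.linalg.pinv(X @ X.T) @ tau_hat

print(np.allclose(X @ theta_ipw, tau_hat))  # interpolates the pseudo-outcomes
```

The reweighting corrects the covariate distribution before the single interpolating regression, which is why the resulting bound does not pick up the selection-bias term of the T-Learner.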
13. ### Upper Bound of the IPW-Learner

    Upper bound for the IPW-Learner (Theorem 5.3 of Kato and Imaizumi (2022)):

    There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
    $$R\big(\hat{\theta}_n^{\text{IPW-Learner}}\big) \le c \|\theta^*\|^2 \mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta)\, \mathcal{V}_n(\Sigma).$$

    - This result is thanks to the unbiasedness of the IPW estimator $\hat{\tau}(x)$.
    - Under appropriate conditions on the covariance operator, the prediction risk goes to zero, regardless of the treatment assignment rule $p(d = 1 \mid x)$.
14. ### Conclusion

    - CATE prediction with an interpolating estimator.
    - Sample selection bias (distribution shift of the covariates).
      - T-Learner: the distribution shift affects the upper bound.
        → Benign overfitting does not occur under the change of the covariance.
      - IPW-Learner: we correct the distribution shift by the importance weight.
        → Benign overfitting occurs, as in the case of Bartlett et al. (2020).
    - Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).

    Thank you! (mkato.csecon@gmail.com)
15. ### Reference

    - Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), "Benign overfitting in linear regression," Proceedings of the National Academy of Sciences, 117, 30063–30070.
    - Rubin, D. B. (1974), "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology.
    - Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
    - Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences.
    - Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), "Covariate Shift in High-Dimensional Random Feature Regression."
    - Tripuraneni, N., Adlam, B., and Pennington, J. (2021b), "Overparameterization Improves Robustness to Covariate Shift in High Dimensions," in Conference on Neural Information Processing Systems.