Slide 1

Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression
TOPML Workshop, April 5th, 2022
Masahiro Kato (The University of Tokyo / CyberAgent, Inc. AI Lab)
Masaaki Imaizumi (The University of Tokyo)

Slide 2

Introduction

- Kato and Imaizumi (2022), "Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression," on arXiv.
- Problem setting:
  - Goal: prediction of the conditional average treatment effect (CATE).
  - We use interpolating estimators that fit the observations perfectly.
- Finding:
  - In CATE prediction, compared with Bartlett et al. (2020), additional assumptions or modified algorithms are required for benign overfitting.

Slide 3

CATE Prediction

- Potential outcome framework (Rubin (1974)).
  - Binary treatment $a \in \{1, 0\}$, e.g., a new medicine ($a = 1$) and a placebo ($a = 0$).
  - For each treatment $a \in \{1, 0\}$, we define the potential outcome $y_a$.
  - An individual with covariates $x \in \mathbb{R}^p$ receives a treatment $d \in \{1, 0\}$.
    Propensity score: the treatment assignment probability $p(d = a \mid x)$.
  - For the assigned treatment, we observe the outcome $y = \mathbb{1}[d = 1]\, y_1 + \mathbb{1}[d = 0]\, y_0$.
    We cannot observe the outcome of the treatment the individual is not assigned.
- For $n$ individuals, the observations are given as $\{(y_i, d_i, x_i)\}_{i=1}^{n}$.
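The observation scheme above can be sketched in a small simulation; the logistic propensity and the Gaussian outcome models below are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5

x = rng.normal(size=(n, p))                      # covariates
propensity = 1.0 / (1.0 + np.exp(-x[:, 0]))      # p(d=1|x): depends on x (selection bias)
d = rng.binomial(1, propensity)                  # assigned treatment

theta1, theta0 = rng.normal(size=p), rng.normal(size=p)
y1 = x @ theta1 + rng.normal(size=n)             # potential outcome under a = 1
y0 = x @ theta0 + rng.normal(size=n)             # potential outcome under a = 0

# Observed outcome: y = 1[d=1] y1 + 1[d=0] y0; the other outcome is never seen.
y = np.where(d == 1, y1, y0)
```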

Slide 4

Linear Regression Model of the CATE

- Assume a linear regression model for each potential outcome $y_a$:
  $y_a = \mathbb{E}[y_a \mid x] + \varepsilon = x^\top \theta_a^* + \varepsilon$.
- The causal effect between treatments is captured by the CATE.
  - The CATE is defined as the difference of $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$:
    $\tau(x) = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = x^\top (\theta_1^* - \theta_0^*) = x^\top \theta^*$, where $\theta^* = \theta_1^* - \theta_0^*$.
- Goal: predict the CATE by using an estimator of $\theta^*$.

Slide 5

Excess Risk and Benign Overfitting

- To evaluate performance, we define the excess risk of an estimator $\hat{\theta}$ as
  $R(\hat{\theta}) = \mathbb{E}_{x, y}\big[ \big(y_1 - y_0 - x^\top \hat{\theta}\big)^2 - \big(y_1 - y_0 - x^\top \theta^*\big)^2 \big]$.
- We also consider an overparametrized situation:
  $n$ (sample size) $\le$ $p$ (number of parameters).
- Ex. The benign overfitting framework of Bartlett et al. (2020):
  - Consider a standard regression setting with interpolating estimators.
  - $R(\hat{\theta})$ goes to zero under some conditions on the covariance of $x$.
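Under the linear model of the previous slide, with $y_1 - y_0 = x^\top \theta^* + \varepsilon$ and mean-zero noise independent of $x$, the excess risk reduces to the quadratic form $(\hat{\theta} - \theta^*)^\top \Sigma\, (\hat{\theta} - \theta^*)$. A minimal sketch checking this identity by Monte Carlo; all parameter values are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
Sigma = np.diag([2.0, 1.0, 0.5, 0.25])          # covariance of x
theta_star = np.array([1.0, -1.0, 0.5, 0.0])    # true CATE parameter
theta_hat = theta_star + np.array([0.1, 0.0, -0.2, 0.3])  # some estimator

# Closed form: (theta_hat - theta*)^T Sigma (theta_hat - theta*).
delta = theta_hat - theta_star
risk_closed = delta @ Sigma @ delta

# Monte Carlo evaluation of the definition of the excess risk.
m = 200_000
x = rng.normal(size=(m, p)) * np.sqrt(np.diag(Sigma))
noise = rng.normal(size=m)
diff = x @ theta_star + noise                    # y1 - y0
risk_mc = np.mean((diff - x @ theta_hat) ** 2 - (diff - x @ theta_star) ** 2)

print(risk_closed, risk_mc)
```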

Slide 6

Excess Risk and Benign Overfitting

- We apply the framework of Bartlett et al. (2020) to CATE prediction.
- There is a sample selection bias, i.e., $p(d = a \mid x)$ depends on the covariates $x$.
  - The distribution of the observed covariates, $\mathbb{1}[d = a]\, x$, differs from that of the original covariates, $x$.
- Owing to this bias, we need to modify the results of Bartlett et al. (2020) to show benign overfitting in CATE prediction.
  - Note that the excess risk is defined over the original distribution of $x$.
    (This problem is an instance of distribution shift, or covariate shift.)

Slide 7

T-Learner and IPW-Learner

- We consider two CATE prediction methods using interpolating estimators.
1. T-Learner: this method consists of two separate linear models.
   i. Estimate linear regression models for $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$ separately.
   ii. Predict the CATE by the difference of the two estimators.
2. Inverse probability weighting (IPW)-Learner: this method utilizes the propensity score $p(d = a \mid x)$.
   i. Construct an unbiased estimator $\hat{\tau}(x)$ of $\tau(x)$ as $\frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}$.
   ii. Construct a predictor by regressing $\hat{\tau}(x)$ on the covariates $x$.

Slide 8

Effective Rank

- We show upper bounds using the effective rank (Bartlett et al. (2020)).
  - Denote the covariance operator of the covariates $x$ by $\Sigma$.
  - Its eigenvalues $\lambda_j = \mu_j(\Sigma)$, $j = 1, 2, \dots$, are ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots$.
  - If $\sum_{j} \lambda_j < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define
    $r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}}$ and $R_k(\Sigma) = \frac{\big(\sum_{j > k} \lambda_j\big)^2}{\sum_{j > k} \lambda_j^2}$.
  - Denote the covariance of the observed covariates $\mathbb{1}[d = a]\, x$ by $\Sigma_a$.
- The effective rank is a measure of the complexity of the covariances.
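The definitions of $r_k(\Sigma)$ and $R_k(\Sigma)$ can be computed directly from a spectrum; a minimal sketch, where the eigenvalue sequences are illustrative examples.

```python
import numpy as np

def effective_ranks(eigvals, k):
    """r_k and R_k of Bartlett et al. (2020) for eigenvalues sorted in decreasing order."""
    lam = np.asarray(eigvals, dtype=float)
    tail = lam[k:]                           # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]               # (sum_{j>k} lambda_j) / lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return float(r_k), float(R_k)

# For the identity covariance in dimension p, both effective ranks equal p.
print(effective_ranks(np.ones(10), 0))       # (10.0, 10.0)

# A fast-decaying spectrum has small effective rank even in high dimension.
print(effective_ranks(2.0 ** -np.arange(100), 0))
```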

Slide 9

Interpolating Estimator with T-Learner

- The T-Learner.
- Define interpolating estimators for $y_1$ and $y_0$ as $x^\top \hat{\theta}_1$ and $x^\top \hat{\theta}_0$, where
  $\hat{\theta}_a = X_a^\top (X_a X_a^\top)^{-1} y_a$, $a \in \{1, 0\}$.
  - $X_a$ ($y_a$) is the covariate matrix (outcome vector) of the individuals with assigned treatment $a$.
- We define an estimator of $\theta^*$ as the difference of the above estimators:
  $\hat{\theta}^{\text{T-Learner}} = \hat{\theta}_1 - \hat{\theta}_0$.
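A minimal sketch of the T-Learner with the minimum-norm interpolating estimator above; the data-generating choices are illustrative assumptions.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm interpolating least squares: theta = X^T (X X^T)^{-1} y, for n <= p."""
    return X.T @ np.linalg.solve(X @ X.T, y)

rng = np.random.default_rng(2)
n, p = 20, 60                                   # overparametrized: n <= p
X = rng.normal(size=(n, p))
theta1_star, theta0_star = rng.normal(size=p), rng.normal(size=p)
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, X @ theta1_star, X @ theta0_star) + 0.1 * rng.normal(size=n)

X1, y1 = X[d == 1], y[d == 1]                   # subsample with treatment 1
X0, y0 = X[d == 0], y[d == 0]                   # subsample with treatment 0
theta_hat = min_norm_interpolator(X1, y1) - min_norm_interpolator(X0, y0)

# Each fitted model interpolates its own subsample exactly (training error zero).
print(np.max(np.abs(X1 @ min_norm_interpolator(X1, y1) - y1)))
```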

Slide 10

Upper Bound of the T-Learner

Upper bound of the T-Learner (Theorem 4.3 of Kato and Imaizumi (2022)):
There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
$$R\big(\hat{\theta}^{\text{T-Learner}}\big) \le \sum_{a \in \{1,0\}} \Big[ c\,\|\theta_a^*\|^2\,\mathcal{B}_{n,\delta}(\Sigma_a) + \big\|(\Sigma - \zeta_a^* \Sigma_a)\,\theta_a^*\big\|^2 \Big] + c\,\|\theta_1^*\|\,\|\theta_0^*\|\,\mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta)\,\big\{ \mathcal{V}_n(\Sigma) + \|\theta_1^*\|\,\|\theta_0^*\|\,\mathcal{V}_n(\Sigma) \big\},$$
where $\zeta_a^* = \arg\min_{\zeta \in \mathbb{R}} \|\Sigma - \zeta \Sigma_a\|$, $\mathcal{B}_{n,\delta}(\Sigma) = \|\Sigma\| \max\Big\{ \sqrt{\tfrac{r_0(\Sigma)}{n}}, \sqrt{\tfrac{\log(1/\delta)}{n}} \Big\}$, and $\mathcal{V}_n(\Sigma) = \tfrac{k^*}{n} + \tfrac{n}{R_{k^*}(\Sigma)}$.

The term $\|(\Sigma - \zeta_a^* \Sigma_a)\,\theta_a^*\|^2$ is due to the sample selection bias (distribution shift).

Slide 11

Upper Bound of the T-Learner

- Benign overfitting depends on the existence of sample selection bias.
- Case 1: the treatment assignment does not depend on the covariates.
  - No sample selection bias ($p(d = 1 \mid x) = p(d = 1)$), e.g., RCTs,
    so $\|(\Sigma - \zeta_a^* \Sigma_a)\,\theta_a^*\|^2 = 0$.
  - $R(\hat{\theta})$ goes to zero under the same conditions as in Bartlett et al. (2020).
- Case 2: the treatment assignment depends on the covariates.
  - The term $\|(\Sigma - \zeta_a^* \Sigma_a)\,\theta_a^*\|^2$ in the upper bound does not go to zero.
  - The convergence of the excess risk $R(\hat{\theta})$ is not guaranteed.

Slide 12

Interpolating Estimator with IPW-Learner

- The IPW-Learner.
- Suppose that the propensity score $p(d = 1 \mid x)$ is known.
- Obtain an unbiased estimator of the CATE:
  $\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}$.
  - This estimator is called an IPW estimator.
- Regress $\hat{\tau}(x)$ on $x$ to estimate $\theta^*$ with an interpolating estimator:
  $\hat{\theta}^{\text{IPW-Learner}} = X^\top (X X^\top)^{-1} \hat{\tau}$.

Slide 13

Upper Bound of the IPW-Learner

Upper bound of the IPW-Learner (Theorem 5.3 of Kato and Imaizumi (2022)):
There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
$$R\big(\hat{\theta}^{\text{IPW-Learner}}\big) \le c\,\|\theta^*\|^2\,\mathcal{B}_{n,\delta}(\Sigma_a) + c \log(1/\delta)\,\mathcal{V}_n(\Sigma).$$

- This result is due to the unbiasedness of the IPW estimator $\hat{\tau}(x)$.
- Under appropriate conditions on the covariance operator, the prediction risk goes to zero regardless of the treatment assignment rule $p(d = 1 \mid x)$.

Slide 14

Conclusion

- CATE prediction with an interpolating estimator.
- Sample selection bias (distribution shift of the covariates):
  - T-Learner: the distribution shift affects the upper bound.
    → Benign overfitting does not occur under the change of the covariance.
  - IPW-Learner: we correct the distribution shift by the importance weight.
    → Benign overfitting occurs, as in the case of Bartlett et al. (2020).
- Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).

Thank you! ([email protected])

Slide 15

Reference

- Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), "Benign overfitting in linear regression," Proceedings of the National Academy of Sciences, 117, 30063–30070.
- Rubin, D. B. (1974), "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology.
- Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
- Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences.
- Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), "Covariate Shift in High-Dimensional Random Feature Regression."
- Tripuraneni, N., Adlam, B., and Pennington, J. (2021b), "Overparameterization Improves Robustness to Covariate Shift in High Dimensions," in Conference on Neural Information Processing Systems.