# Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

TOPML2022. April 06, 2022

## Transcript

1. Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression
TOPML Workshop April 5th 2022
Masahiro Kato
The University of Tokyo / CyberAgent, Inc. AILab
Masaaki Imaizumi
The University of Tokyo

2. Introduction
➢ Kato and Imaizumi (2022), “Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression,” arXiv.
■ Problem setting:
• Goal: prediction of the conditional average treatment effect (CATE).
• Estimators: interpolating estimators that fit the observations perfectly.
■ Finding:
• In CATE prediction, benign overfitting requires additional assumptions or modified algorithms compared with the standard regression setting of Bartlett et al. (2020).

3. CATE Prediction
■ Potential outcome framework (Rubin (1974)).
• Binary treatment $a \in \{1, 0\}$, e.g., a new medicine ($a = 1$) and a placebo ($a = 0$).
• For each treatment $a \in \{1, 0\}$, we define the potential outcome $y_a$.
• An individual with covariate $x \in \mathbb{R}^p$ receives a treatment $d \in \{1, 0\}$.
  Propensity score: the treatment assignment probability $p(d = a \mid x)$.
• For the assigned treatment, we observe the outcome $y = \mathbb{1}[d = 1]\, y_1 + \mathbb{1}[d = 0]\, y_0$.
  We cannot observe the potential outcome of the treatment that was not assigned to the individual.
■ For $n$ individuals, the observations are $\{(y_i, d_i, x_i)\}_{i=1}^{n}$.
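
A minimal simulation sketch of this observation scheme. The logistic propensity score, the Gaussian covariates and noise, and all variable names and sizes are illustrative assumptions that do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5  # hypothetical sample size and covariate dimension

# Hypothetical true parameters of the two potential outcomes.
theta1 = rng.normal(size=p)
theta0 = rng.normal(size=p)

x = rng.normal(size=(n, p))            # covariates x_i
y1 = x @ theta1 + rng.normal(size=n)   # potential outcome under a = 1
y0 = x @ theta0 + rng.normal(size=n)   # potential outcome under a = 0

# Assumed logistic propensity score p(d = 1 | x); the slides do not fix its form.
propensity = 1.0 / (1.0 + np.exp(-x[:, 0]))
d = rng.binomial(1, propensity)        # assigned treatment d_i in {1, 0}

# Only the potential outcome of the assigned treatment is observed.
y = np.where(d == 1, y1, y0)

# The analyst sees (y_i, d_i, x_i) only; y1 and y0 are never jointly available.
observations = list(zip(y, d, x))
```

The unobserved potential outcome is kept in the simulation only so that the true CATE can be compared with a prediction later.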

4. Linear Regression Model of the CATE
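■ Assume a linear regression model for each potential outcome $y_a$:
$$y_a = \mathbb{E}[y_a \mid x] + \varepsilon = x^\top \theta_a^* + \varepsilon.$$
■ The causal effect between treatments is captured by the CATE.
• The CATE is defined as the difference between $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$:
$$\tau(x) = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = x^\top(\theta_1^* - \theta_0^*) = x^\top \theta^*, \qquad \theta^* = \theta_1^* - \theta_0^*.$$
➢ Goal: predict the CATE by using an estimator of $\theta^*$.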

5. Excess Risk and Benign Overfitting
■ To evaluate the performance, we define the excess risk of an estimator $\hat{\theta}$ as
$$R(\hat{\theta}) = \mathbb{E}_{x,y}\left[(y_1 - y_0 - x^\top\hat{\theta})^2 - (y_1 - y_0 - x^\top\theta^*)^2\right].$$
■ We also consider an overparametrized situation:
$$n \ (\text{sample size}) \le p \ (\text{the number of parameters}).$$
Ex. Benign overfitting framework of Bartlett et al. (2020).
• Consider a standard regression setting with interpolating estimators.
• $R(\hat{\theta})$ goes to zero under some conditions on the covariance of $x$.
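
A short worked derivation showing why the excess risk depends on the covariance of $x$. It is a sketch under assumptions not spelled out on the slide: $\hat{\theta}$ is treated as fixed with respect to the fresh pair $(x, y)$, the covariates have mean zero so that $\Sigma = \mathbb{E}[x x^\top]$, and $\mathbb{E}[y_1 - y_0 \mid x] = x^\top\theta^*$ as in the linear model.

```latex
\begin{align*}
u &:= y_1 - y_0 - x^\top \theta^*, \qquad \mathbb{E}[u \mid x] = 0, \\
(y_1 - y_0 - x^\top \hat{\theta})^2
  &= \bigl(u - x^\top(\hat{\theta} - \theta^*)\bigr)^2
   = u^2 - 2u\, x^\top(\hat{\theta} - \theta^*)
     + \bigl(x^\top(\hat{\theta} - \theta^*)\bigr)^2, \\
R(\hat{\theta})
  &= \mathbb{E}_{x}\Bigl[\bigl(x^\top(\hat{\theta} - \theta^*)\bigr)^2\Bigr]
   = (\hat{\theta} - \theta^*)^\top \Sigma\, (\hat{\theta} - \theta^*).
\end{align*}
```

The cross term vanishes because $\mathbb{E}[u \mid x] = 0$, so the excess risk is the squared distance between $\hat{\theta}$ and $\theta^*$ weighted by $\Sigma$.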

6. Excess Risk and Benign Overfitting
■ We apply the framework of Bartlett et al. (2020) to CATE prediction.
■ There is a sample selection bias, i.e., $p(d = a \mid x)$ depends on the covariates $x$.
• The distribution of the observed covariates, $\mathbb{1}[d = a]\,x$, differs from that of the original covariates, $x$.
■ Owing to this bias, the results of Bartlett et al. (2020) must be modified to show benign overfitting in CATE prediction.
• Note that the excess risk is defined over the original distribution of $x$.
(This problem is an instance of distribution shift, or covariate shift.)
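
A small numerical sketch of this shift. It compares the second-moment matrix of the observed covariates $\mathbb{1}[d = 1]\,x$ under a randomized assignment and under an assignment that depends on $x$. The standard Gaussian covariates, the specific assignment rule, and all names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 3  # hypothetical sizes

x = rng.normal(size=(n, p))  # original covariates, Sigma = I by construction


def second_moment(z):
    """Empirical second-moment matrix (1/n) * sum_i z_i z_i^T."""
    return z.T @ z / len(z)


# Case 1: assignment independent of x (as in an RCT), p(d = 1 | x) = 0.5.
d_rct = rng.binomial(1, 0.5, size=n)
sigma1_rct = second_moment(d_rct[:, None] * x)

# Case 2: assumed assignment rule that depends on x through its first coordinate.
propensity = 1.0 / (1.0 + np.exp(-(x[:, 0] ** 2 - 1.0)))
d_obs = rng.binomial(1, propensity)
sigma1_obs = second_moment(d_obs[:, None] * x)

# Ratio of the first two diagonal entries; a value of 1 means the matrix
# is still proportional to Sigma = I along these two coordinates.
ratio_rct = sigma1_rct[0, 0] / sigma1_rct[1, 1]
ratio_obs = sigma1_obs[0, 0] / sigma1_obs[1, 1]
print(ratio_rct, ratio_obs)
```

In the randomized case the observed second moment is a rescaled copy of $\Sigma$, whereas in the covariate-dependent case the first coordinate is over-weighted, so no single rescaling recovers $\Sigma$.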

7. T-Learner and IPW-Learner
■ Consider two CATE prediction methods using interpolating estimators.
1. T-Learner: This method consists of two separate linear models.
   i. Estimate linear regression models for $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$ separately.
   ii. Predict the CATE by using the difference of the two estimators.
2. Inverse probability weighting (IPW)-Learner: This method utilizes the propensity score $p(d = a \mid x)$.
   i. Construct an unbiased estimator $\hat{\tau}(x)$ of $\tau(x)$ as
   $$\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}.$$
   ii. Construct a predictor by regressing $\hat{\tau}(x)$ on the covariates $x$.
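
A brief derivation of why the IPW construction is unbiased for $\tau(x)$. It assumes unconfoundedness, $(y_1, y_0) \perp d \mid x$, and overlap, $p(d = a \mid x) > 0$; both are standard in the potential outcome framework but are assumptions here, since the slide does not state them.

```latex
\begin{align*}
\mathbb{E}\!\left[\frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} \,\middle|\, x\right]
  &= \frac{\mathbb{E}\bigl[\mathbb{1}[d = 1]\, y_1 \mid x\bigr]}{p(d = 1 \mid x)}
   = \frac{p(d = 1 \mid x)\, \mathbb{E}[y_1 \mid x]}{p(d = 1 \mid x)}
   = \mathbb{E}[y_1 \mid x], \\
\mathbb{E}\!\left[\frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)} \,\middle|\, x\right]
  &= \frac{p(d = 0 \mid x)\, \mathbb{E}[y_0 \mid x]}{p(d = 0 \mid x)}
   = \mathbb{E}[y_0 \mid x], \\
\mathbb{E}\bigl[\hat{\tau}(x) \mid x\bigr]
  &= \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = \tau(x).
\end{align*}
```

The first equality uses $\mathbb{1}[d = 1]\, y = \mathbb{1}[d = 1]\, y_1$, and the second uses unconfoundedness to factor the conditional expectation.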

8. Effective Rank
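■ We show upper bounds using the effective rank (Bartlett et al. (2020)).
• Denote the covariance operator of the covariates $x$ by $\Sigma$.
• Denote its eigenvalues by $\lambda_j = \mu_j(\Sigma)$ for $j = 1, 2, \dots$, ordered so that $\lambda_1 > \lambda_2 > \cdots$.
• If $\sum_{j=1}^{\infty} \lambda_j < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define
$$r_k(\Sigma) = \frac{\sum_{j>k} \lambda_j}{\lambda_{k+1}} \quad \text{and} \quad R_k(\Sigma) = \frac{\left(\sum_{j>k} \lambda_j\right)^2}{\sum_{j>k} \lambda_j^2}.$$
• Denote the covariance of the observed covariates $\mathbb{1}[d = a]\,x$ by $\Sigma_a$.
The effective rank is a measure of the complexity of the covariance.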
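
The two effective ranks can be computed directly from a finite list of eigenvalues. The sketch below is illustrative only; the function name `effective_ranks` and the example spectra are hypothetical.

```python
import numpy as np


def effective_ranks(eigvals, k):
    """Return (r_k, R_k) for a finite spectrum.

    r_k = sum_{j>k} lambda_j / lambda_{k+1}
    R_k = (sum_{j>k} lambda_j)^2 / sum_{j>k} lambda_j^2
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing order
    tail = lam[k:]  # lambda_{k+1}, lambda_{k+2}, ... (0-based slicing)
    if tail.size == 0 or tail[0] <= 0:
        raise ValueError("lambda_{k+1} must exist and be positive")
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k


# Hypothetical spectra: a decaying one and an isotropic one.
r0, R0 = effective_ranks([4.0, 2.0, 1.0, 1.0], k=0)
r1, R1 = effective_ranks([4.0, 2.0, 1.0, 1.0], k=1)
r_iso, R_iso = effective_ranks(np.ones(10), k=0)
```

For an isotropic spectrum with ten equal eigenvalues both quantities equal the dimension, which matches the reading of the effective rank as a complexity measure.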

9. Interpolating Estimator with T-Learner
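➢ The T-Learner.
■ Define interpolating estimators for $y_1$ and $y_0$ as $x^\top\hat{\theta}_1$ and $x^\top\hat{\theta}_0$, where
$$\hat{\theta}_a = (X_a^\top X_a)^{\dagger} X_a^\top y_a, \qquad a \in \{1, 0\}.$$
• $X_a$ ($y_a$) is the covariate matrix (outcome vector) of the individuals assigned treatment $a$.
■ We define an estimator of $\theta^*$ as the difference of the above estimators:
$$\hat{\theta}^{\text{T-Learner}} = \hat{\theta}_1 - \hat{\theta}_0.$$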
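
A minimal NumPy sketch of the T-Learner with minimum-norm interpolators. It relies on the identity that `np.linalg.pinv(X) @ y` equals $(X^\top X)^{\dagger} X^\top y$. The function names, the simulated data, and the sizes are hypothetical and only serve to illustrate the estimator.

```python
import numpy as np


def min_norm_interpolator(X, y):
    """Minimum-norm least-squares solution (X^T X)^† X^T y."""
    return np.linalg.pinv(X) @ y


def t_learner(X, d, y):
    """Fit one interpolator per treatment arm and return their difference."""
    theta1_hat = min_norm_interpolator(X[d == 1], y[d == 1])
    theta0_hat = min_norm_interpolator(X[d == 0], y[d == 0])
    return theta1_hat - theta0_hat, theta1_hat, theta0_hat


# Hypothetical overparametrized data: n = 40 samples, p = 200 parameters.
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.normal(size=(n, p))
d = rng.binomial(1, 0.5, size=n)
theta1_true = rng.normal(size=p) / np.sqrt(p)
theta0_true = rng.normal(size=p) / np.sqrt(p)
noise = 0.1 * rng.normal(size=n)
y = np.where(d == 1, X @ theta1_true, X @ theta0_true) + noise

theta_hat, theta1_hat, theta0_hat = t_learner(X, d, y)

# Each arm is fitted perfectly, i.e., the estimators interpolate.
fit_error_1 = np.max(np.abs(X[d == 1] @ theta1_hat - y[d == 1]))
fit_error_0 = np.max(np.abs(X[d == 0] @ theta0_hat - y[d == 0]))
cate_prediction = X @ theta_hat  # predicted CATE at the sample points
```

Because each arm has fewer rows than columns, both fits reproduce their training outcomes up to numerical precision.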

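10. Upper Bound of the T-Learner
Upper bound of the T-Learner (Theorem 4.3 of Kato and Imaizumi (2022)).
There exist $b, c, c_1 > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
$$R(\hat{\theta}_n^{\text{T-Learner}}) \le \sum_{a \in \{1,0\}} \left\{ c\,\|\theta_a^*\|^2\, \mathcal{B}_{n,\delta}(\Sigma_a) + \|\Sigma - \zeta_a^*\Sigma_a\|\,\|\theta_a^*\|^2 \right\} + c\,\|\theta_1^*\|\,\|\theta_0^*\|\,\mathcal{B}_{n,\delta}(\Sigma) + c\log(1/\delta)\left\{ \mathcal{V}_n(\Sigma) + \left(\|\theta_1^*\| + \|\theta_0^*\|\right)\sqrt{\mathcal{V}_n(\Sigma)} \right\},$$
where
$$\zeta_a^* = \arg\min_{\zeta \in \mathbb{R}} \|\Sigma - \zeta\Sigma_a\|, \qquad \mathcal{B}_{n,\delta}(\Sigma) = \|\Sigma\| \max\left\{ \sqrt{\frac{r_0(\Sigma)}{n}},\ \sqrt{\frac{\log(1/\delta)}{n}} \right\}, \qquad \mathcal{V}_n(\Sigma) = \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}.$$
The term $\|\Sigma - \zeta_a^*\Sigma_a\|\,\|\theta_a^*\|^2$ reflects the sample selection bias (distribution shift).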

11. Upper Bound of the T-Learner
■ Benign overfitting depends on the existence of sample selection bias.
➢ Case 1: the treatment assignment does not depend on the covariates.
• No sample selection bias ($p(d = 1 \mid x) = p(d = 1)$), e.g., randomized controlled trials (RCTs)
  → $\|\Sigma - \zeta_a^*\Sigma_a\|\,\|\theta_a^*\|^2 = 0$.
• $R(\hat{\theta})$ goes to zero under the same conditions as in Bartlett et al. (2020).
➢ Case 2: the treatment assignment depends on the covariates.
• The term $\|\Sigma - \zeta_a^*\Sigma_a\|\,\|\theta_a^*\|^2$ in the upper bound does not go to zero.
• The convergence of the excess risk $R(\hat{\theta})$ is not guaranteed.
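
A one-step calculation of why the selection-bias term vanishes in Case 1. It is a sketch assuming mean-zero covariates, so that $\Sigma = \mathbb{E}[x x^\top]$ and $\Sigma_a = \mathbb{E}[\mathbb{1}[d = a]\, x x^\top]$, and assuming $p(d = a) > 0$.

```latex
\begin{align*}
\Sigma_a
  &= \mathbb{E}\bigl[\mathbb{1}[d = a]\, x x^\top\bigr]
   = p(d = a)\, \mathbb{E}[x x^\top]
   = p(d = a)\, \Sigma
   \qquad \text{(since $d$ is independent of $x$)}, \\
\zeta &= \frac{1}{p(d = a)}
   \;\Longrightarrow\;
   \Sigma - \zeta\, \Sigma_a = 0 .
\end{align*}
```

When the assignment depends on $x$, $\Sigma_a$ is generally not a scalar multiple of $\Sigma$, so no choice of $\zeta$ removes the difference.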

12. Interpolating Estimator with IPW-Learner
➢ The IPW-Learner.
■ Suppose that the propensity score $p(d = 1 \mid x)$ is known.
■ Obtain an unbiased estimator of the CATE,
$$\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}.$$
• This estimator is called an IPW estimator.
■ Regress $\hat{\tau}(x)$ on $x$ to estimate $\theta^*$ with an interpolating estimator:
$$\hat{\theta}^{\text{IPW-Learner}} = (X^\top X)^{\dagger} X^\top \hat{\tau}.$$
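
A minimal NumPy sketch of the IPW-Learner with a known propensity score. The function names, the logistic propensity, and the simulated data are hypothetical choices made for illustration, not part of the slides.

```python
import numpy as np


def ipw_pseudo_outcome(y, d, e):
    """IPW estimator 1[d=1] y / p(d=1|x) - 1[d=0] y / p(d=0|x), with e = p(d=1|x)."""
    return d * y / e - (1 - d) * y / (1 - e)


def ipw_learner(X, d, y, e):
    """Regress the IPW pseudo-outcome on X with the minimum-norm interpolator."""
    tau_hat = ipw_pseudo_outcome(y, d, e)
    theta_hat = np.linalg.pinv(X) @ tau_hat  # equals (X^T X)^† X^T tau_hat
    return theta_hat, tau_hat


# Hypothetical overparametrized data with a known, covariate-dependent propensity.
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.normal(size=(n, p))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))  # assumed known p(d = 1 | x)
d = rng.binomial(1, e)
theta1_true = rng.normal(size=p) / np.sqrt(p)
theta0_true = rng.normal(size=p) / np.sqrt(p)
noise = 0.1 * rng.normal(size=n)
y = np.where(d == 1, X @ theta1_true, X @ theta0_true) + noise

theta_hat, tau_hat = ipw_learner(X, d, y, e)
cate_prediction = X @ theta_hat  # interpolates the pseudo-outcomes tau_hat
```

All $n$ samples enter a single regression here, in contrast to the T-Learner, which splits the sample by treatment arm.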

13. Upper Bound of the IPW-Learner
Upper bound of the IPW-Learner (Theorem 5.3 of Kato and Imaizumi (2022)).
There exist $b, c, c_1 > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
$$R(\hat{\theta}_n^{\text{IPW-Learner}}) \le c\,\|\theta^*\|^2\, \mathcal{B}_{n,\delta}(\Sigma) + c\log(1/\delta)\, \mathcal{V}_n(\Sigma).$$
■ This result follows from the unbiasedness of the IPW estimator $\hat{\tau}(x)$.
■ Under appropriate conditions on the covariance operator, the prediction risk goes to zero regardless of the treatment assignment rule $p(d = 1 \mid x)$.

14. Conclusion
■ CATE prediction with an interpolating estimator.
➢ Sample selection bias (distribution shift of the covariates).
✓ T-Learner: the distribution shift affects the upper bound.
→ Benign overfitting is not guaranteed because of the change in the covariance.
✓ IPW-Learner: we correct the distribution shift by the importance weight.
→ Benign overfitting occurs as in the case of Bartlett et al. (2020).
? Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).
Thank you!

15. Reference
• Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign overfitting in linear regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070.
• Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology.
• Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
• Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for estimating heterogeneous treatment effects using machine learning,” Proceedings of the National Academy of Sciences.
• Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), “Covariate Shift in High-Dimensional Random Feature Regression.”
• Tripuraneni, N., Adlam, B., and Pennington, J. (2021b), “Overparameterization Improves Robustness to Covariate Shift in High Dimensions,” in Conference on Neural Information Processing Systems.