Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

TOPML 2022.

MasaKat0

April 06, 2022

Transcript

  1. Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression. TOPML Workshop, April 5th, 2022. Masahiro Kato (The University of Tokyo / CyberAgent, Inc. AILab) and Masaaki Imaizumi (The University of Tokyo).

  2. Introduction
     ► Kato and Imaizumi (2022), "Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression," on arXiv.
     ■ Problem setting:
       • Goal: prediction of the conditional average treatment effect (CATE).
       • Interpolating estimators that fit the observations perfectly.
     ■ Finding:
       • In CATE prediction, compared with Bartlett et al. (2020), additional assumptions or modified algorithms are required for benign overfitting.

  3. CATE Prediction
     ■ Potential outcome framework (Rubin (1974)).
       • Binary treatment $a \in \{1,0\}$, e.g., new medicine ($a = 1$) and placebo ($a = 0$).
       • For each treatment $a \in \{1,0\}$, we define the potential outcome $y_a$.
       • An individual with covariate $x \in \mathbb{R}^p$ gets a treatment $d \in \{1,0\}$. Propensity score: the treatment assignment probability $p(d = a \mid x)$.
       • For the assigned treatment, we observe the outcome $y = \mathbb{1}[d = 1]\,y_1 + \mathbb{1}[d = 0]\,y_0$. We cannot observe the outcome of the treatment that the individual is not assigned.
     ■ For $n$ individuals, the observations are given as $\{(y_i, d_i, x_i)\}_{i=1}^n$.

  4. Linear Regression Model of the CATE
     ■ Assume a linear regression model for each potential outcome $y_a$: $y_a = \mathbb{E}[y_a \mid x] + \varepsilon = x^\top \theta_a^* + \varepsilon$.
     ■ The causal effect between treatments is captured by the CATE.
       • The CATE is defined as the difference of $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$: $\tau(x) = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = x^\top(\theta_1^* - \theta_0^*) = x^\top \theta^*$, where $\theta^* = \theta_1^* - \theta_0^*$.
     ► Goal: predict the CATE by using an estimator of $\theta^*$ (a simulation sketch of this setting follows below).

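To make the setting concrete, the following minimal NumPy sketch simulates observations $\{(y_i, d_i, x_i)\}_{i=1}^n$ from the linear potential-outcome models above. The logistic propensity, the noise level, and the spectrum $\lambda_j = j^{-2}$ are illustrative assumptions, not choices from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 400  # overparametrized regime: n <= p

# True parameters of the two potential-outcome models (arbitrary choices).
theta1 = rng.normal(size=p) / np.sqrt(p)
theta0 = rng.normal(size=p) / np.sqrt(p)
theta_star = theta1 - theta0  # CATE parameter: tau(x) = x @ theta_star

# Covariates x ~ N(0, Sigma) with a decaying spectrum lambda_j = j^{-2}.
lam = 1.0 / np.arange(1, p + 1) ** 2
X = rng.normal(size=(n, p)) * np.sqrt(lam)

# Covariate-dependent treatment assignment (sample selection bias).
e = 1.0 / (1.0 + np.exp(-X @ np.ones(p)))  # propensity p(d=1|x)
d = rng.binomial(1, e)

# Potential outcomes and the observed outcome y = 1[d=1] y1 + 1[d=0] y0.
y1 = X @ theta1 + 0.1 * rng.normal(size=n)
y0 = X @ theta0 + 0.1 * rng.normal(size=n)
y = np.where(d == 1, y1, y0)
```
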
  5. Excess Risk and Benign Overfitting
     ■ To evaluate the performance, we define the excess risk for an estimator $\hat\theta$ as $R(\hat\theta) = \mathbb{E}_{x,y}\big[\big((y_1 - y_0) - x^\top \hat\theta\big)^2 - \big((y_1 - y_0) - x^\top \theta^*\big)^2\big]$ (computed in the sketch below).
     ■ We also consider an overparametrized situation: $n$ (sample size) $\le p$ (the number of parameters).
     Ex. Benign overfitting framework by Bartlett et al. (2020).
       • Consider a standard regression setting with interpolating estimators.
       • $R(\hat\theta)$ goes to zero under some conditions on the covariance of $x$.

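Expanding the squares and using $\mathbb{E}[y_1 - y_0 \mid x] = x^\top \theta^*$, the noise terms cancel and the excess risk reduces to the quadratic form $(\hat\theta - \theta^*)^\top \Sigma\, (\hat\theta - \theta^*)$ for zero-mean covariates. A small helper evaluating this (an assumed sketch, not code from the paper):

```python
import numpy as np

def excess_risk(theta_hat, theta_star, Sigma):
    """R(theta_hat) = (theta_hat - theta_star)' Sigma (theta_hat - theta_star).

    For zero-mean x this equals
    E[((y1 - y0) - x'theta_hat)^2 - ((y1 - y0) - x'theta_star)^2],
    since x'theta_star is the conditional mean of y1 - y0 given x.
    """
    err = theta_hat - theta_star
    return float(err @ Sigma @ err)

# Example with a diagonal covariance and a slightly perturbed estimator.
p = 400
Sigma = np.diag(1.0 / np.arange(1, p + 1) ** 2)
theta_star = np.ones(p) / np.sqrt(p)
print(excess_risk(theta_star + 0.01, theta_star, Sigma))
```
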
  6. Excess Risk and Benign Overfitting
     ■ Apply the framework of Bartlett et al. (2020) to CATE prediction.
     ■ There is a sample selection bias, i.e., $p(d = a \mid x)$ depends on the covariates $x$.
       • The distribution of the observed covariates, $\mathbb{1}[d = a]\,x$, changes from that of the original covariates, $x$.
     ■ Owing to this bias, we need to add some modifications to the results of Bartlett et al. (2020) to show benign overfitting in CATE prediction.
       • Note that the excess risk is defined over the original distribution of $x$. (This problem is an instance of distribution shift, or covariate shift.)

  7. T-Learner and IPW-Learner
     ■ Consider two CATE prediction methods using interpolating estimators.
     1. T-Learner: This method consists of two separate linear models.
        i. Estimate linear regression models for $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$ separately.
        ii. Predict the CATE by using the difference of the two estimators.
     2. Inverse probability weighting (IPW)-Learner: This method utilizes the propensity score $p(d = a \mid x)$.
        i. Construct an unbiased estimator $\hat\tau(x)$ of $\tau(x)$ as $\hat\tau(x) = \frac{\mathbb{1}[d = 1]\,y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\,y}{p(d = 0 \mid x)}$ (a numerical check of this unbiasedness follows below).
        ii. Construct a predictor by regressing $\hat\tau(x)$ on the covariates $x$.

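To see why step 2.i gives an unbiased signal for the CATE, note that $\mathbb{E}[\hat\tau(x) \mid x] = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = \tau(x)$. A minimal Monte Carlo check for one fixed individual (the means and propensity are arbitrary assumed values):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000  # Monte Carlo draws of (d, y) for one fixed individual

# Fixed individual: conditional means and propensity (arbitrary values).
mu1, mu0, e = 2.0, 0.5, 0.3  # E[y1|x], E[y0|x], p(d=1|x)
tau = mu1 - mu0              # true CATE at this x

d = rng.binomial(1, e, size=m)
y = np.where(d == 1, mu1, mu0) + rng.normal(size=m)  # observed outcome

# IPW pseudo-outcome: 1[d=1] y / e - 1[d=0] y / (1 - e).
tau_hat = d * y / e - (1 - d) * y / (1 - e)
print(tau_hat.mean(), "vs true CATE", tau)  # close to tau for large m
```
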
  8. Effective Rank
     ■ Show upper bounds using the effective ranks (Bartlett et al. (2020)).
       • Denote the covariance operator of the covariates $x$ by $\Sigma$.
       • The eigenvalues are $\lambda_j = \mu_j(\Sigma)$ for $j = 1, 2, \dots$ such that $\lambda_1 > \lambda_2 > \cdots$.
       • If $\sum_{j=1}^{\infty} \lambda_j < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define $r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}}$ and $R_k(\Sigma) = \frac{\big(\sum_{j > k} \lambda_j\big)^2}{\sum_{j > k} \lambda_j^2}$.
       • Denote the covariance of the observed covariates $\mathbb{1}[d = a]\,x$ by $\Sigma_a$.
     The effective rank is a measure of the complexity of the covariances (a small helper computing $r_k$ and $R_k$ follows below).

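A small helper for the two effective ranks, assuming the eigenvalues are supplied as an array (the demo spectrum is an arbitrary slowly decaying choice):

```python
import numpy as np

def effective_ranks(lam, k):
    """Effective ranks of a covariance with eigenvalues lam (decreasing).

    r_k = (sum_{j>k} lam_j) / lam_{k+1}
    R_k = (sum_{j>k} lam_j)^2 / sum_{j>k} lam_j^2
    With 0-based indexing, "j > k" is lam[k:] and lam_{k+1} is lam[k].
    """
    lam = np.sort(lam)[::-1]
    tail = lam[k:]
    return tail.sum() / lam[k], tail.sum() ** 2 / (tail ** 2).sum()

# Example: a slowly decaying (truncated) spectrum.
p = 5000
j = np.arange(1, p + 1)
lam = 1.0 / (j * np.log(j + 1) ** 2)
print(effective_ranks(lam, k=0))
print(effective_ranks(lam, k=100))
```
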
  9. Interpolating Estimator with T-Learner
     ► The T-Learner.
     ■ Define interpolating estimators for $y_1$ and $y_0$ as $x^\top \hat\theta_1$ and $x^\top \hat\theta_0$, where $\hat\theta_a = X_a^\top (X_a X_a^\top)^\dagger y_a$, $a \in \{1,0\}$.
       • $X_a$ ($y_a$) is the covariate matrix (outcome vector) of the individuals with assigned treatment $a$.
     ■ We define an estimator of $\theta^*$ as the difference of the above estimators: $\hat\theta_{\text{T-Learner}} = \hat\theta_1 - \hat\theta_0$ (sketched below).

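A minimal sketch of the T-Learner with the minimum-norm interpolator; the data here are placeholders just to exercise the code, not the paper's experiments:

```python
import numpy as np

def min_norm_ols(X, y):
    """Minimum-norm interpolating least squares: theta = X' (X X')^+ y.

    When X has at most p rows and full row rank, X @ theta reproduces y
    exactly, i.e., the estimator interpolates the training data.
    """
    return X.T @ np.linalg.pinv(X @ X.T) @ y

# T-Learner: fit each treatment arm separately, then take the difference.
rng = np.random.default_rng(0)
n, p = 100, 400
X = rng.normal(size=(n, p)) * np.sqrt(1.0 / np.arange(1, p + 1) ** 2)
d = rng.binomial(1, 0.5, size=n)
y = rng.normal(size=n)  # placeholder outcomes for the demo

theta1_hat = min_norm_ols(X[d == 1], y[d == 1])
theta0_hat = min_norm_ols(X[d == 0], y[d == 0])
theta_t_learner = theta1_hat - theta0_hat

# Each arm's fit interpolates its own observations.
assert np.allclose(X[d == 1] @ theta1_hat, y[d == 1])
```
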
  10. Upper Bound of the T-Learner (Theorem 4.3 of Kato and Imaizumi (2022)). There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$,
      $R(\hat\theta_{\text{T-Learner}}) \le \sum_{a \in \{1,0\}} \Big\{ c\,\|\theta_a^*\|^2\, \mathcal{B}_{n,\delta}(\Sigma_a) + \big\|(\Sigma - \tau_a^* \Sigma_a)\,\theta_a^*\big\|^2 \Big\} + c\,\|\theta_1^*\|\,\|\theta_0^*\|\, \mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta)\, \Big\{ \mathcal{V}_n(\Sigma_1) + \big(\|\theta_1^*\| + \|\theta_0^*\|\big)\, \mathcal{V}_n(\Sigma_0) \Big\}$,
      where $\tau_a^* = \arg\min_{\tau \in \mathbb{R}} \|\Sigma - \tau \Sigma_a\|$, $\mathcal{B}_{n,\delta}(\Sigma) = \|\Sigma\| \max\big\{ \sqrt{r_0(\Sigma)/n},\ \log(1/\delta)/n \big\}$, and $\mathcal{V}_n(\Sigma) = \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}$.
      The $\|(\Sigma - \tau_a^* \Sigma_a)\,\theta_a^*\|^2$ term is due to the sample selection bias (distribution shift). The factors $\mathcal{B}_{n,\delta}$ and $\mathcal{V}_n$ are computed in the sketch below.

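The bias and variance factors of the bound can be computed directly from a spectrum. The helpers below follow the definitions as reconstructed above (the exact constants and forms in the paper may differ); the demo spectrum and the choice b = 2 are assumptions:

```python
import numpy as np

def B_term(lam, n, delta):
    """B_{n,delta}(Sigma) = ||Sigma|| * max{sqrt(r_0/n), log(1/delta)/n},
    for eigenvalues lam given in decreasing order."""
    lam = np.sort(lam)[::-1]
    r0 = lam.sum() / lam[0]
    return lam[0] * max(np.sqrt(r0 / n), np.log(1 / delta) / n)

def V_term(lam, n, b=2.0):
    """V_n(Sigma) = k*/n + n / R_{k*}(Sigma), with
    k* = min{k >= 0 : r_k(Sigma) >= b n} (inf if no such k exists)."""
    lam = np.sort(lam)[::-1]
    tails = np.cumsum(lam[::-1])[::-1]  # tails[k] = sum_{j>k} lam_j
    ok = tails / lam >= b * n           # r_k(Sigma) >= b n ?
    if not ok.any():
        return np.inf
    k = int(np.argmax(ok))              # smallest qualifying k
    tail = lam[k:]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return k / n + n / R_k

lam = 1.0 / np.arange(1, 20001) ** 1.01  # heavy-tailed demo spectrum
print(B_term(lam, n=100, delta=0.05), V_term(lam, n=100))
```
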
  11. Upper Bound of the T-Learner
      ■ Benign overfitting depends on the existence of sample selection bias.
      ► Case 1: the treatment assignment does not depend on the covariates.
        • No sample selection bias ($p(d = 1 \mid x) = p(d = 1)$), e.g., RCTs → $\|(\Sigma - \tau_a^* \Sigma_a)\,\theta_a^*\|^2 = 0$ (checked numerically below).
        • $R(\hat\theta)$ goes to zero under the same conditions as used in Bartlett et al. (2020).
      ► Case 2: the treatment assignment depends on the covariates.
        • $\|(\Sigma - \tau_a^* \Sigma_a)\,\theta_a^*\|^2$ in the upper bound does not go to zero.
        • The convergence of the excess risk $R(\hat\theta)$ is not guaranteed.

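The Case 1 claim can be checked directly: under random assignment, $\Sigma_a = \mathbb{E}[\mathbb{1}[d = a]\,xx^\top] = p(d = a)\,\Sigma$, so $\tau_a^* = 1/p(d = a)$ gives $\Sigma - \tau_a^* \Sigma_a = 0$. A quick numerical illustration under an assumed low-dimensional setup (the propensities and spectrum are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
m, p = 500_000, 5
X = rng.normal(size=(m, p)) * np.sqrt(1.0 / np.arange(1, p + 1))

def selection_bias_term(d, theta):
    """Estimate ||(Sigma - tau_a^* Sigma_a) theta||^2 for a = 1 from samples,
    with tau_a^* = argmin_tau ||Sigma - tau Sigma_a|| (operator norm)."""
    Sigma = X.T @ X / m
    Sigma_a = X[d == 1].T @ X[d == 1] / m  # covariance of 1[d=1] x
    res = minimize_scalar(lambda t: np.linalg.norm(Sigma - t * Sigma_a, 2),
                          bounds=(0.0, 100.0), method="bounded")
    return np.linalg.norm((Sigma - res.x * Sigma_a) @ theta) ** 2

theta = np.ones(p)
d_rct = rng.binomial(1, 0.3, size=m)                          # random assignment
d_bias = rng.binomial(1, 1.0 / (1.0 + np.exp(-3 * X[:, 0])))  # depends on x
print(selection_bias_term(d_rct, theta))   # ~ 0 (up to Monte Carlo error)
print(selection_bias_term(d_bias, theta))  # bounded away from 0
```
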
  12. Interpolating Estimator with IPW-Learner
      ► The IPW-Learner.
      ■ Suppose that the propensity score $p(d = 1 \mid x)$ is known.
      ■ Obtain an unbiased estimator of the CATE, $\hat\tau(x) = \frac{\mathbb{1}[d = 1]\,y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\,y}{p(d = 0 \mid x)}$.
        • This estimator is called an IPW estimator.
      ■ Regress $\hat\tau(x)$ on $x$ to estimate $\theta^*$ with an interpolating estimator: $\hat\theta_{\text{IPW-Learner}} = X^\top (X X^\top)^\dagger \hat\tau$ (sketched below).

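A minimal sketch of the IPW-Learner under an assumed logistic propensity (the data-generating choices are illustrative, not from the paper):

```python
import numpy as np

def ipw_learner(X, y, d, e):
    """IPW-Learner: build the IPW pseudo-outcome with the known propensity e(x),
    then fit the minimum-norm interpolator theta = X' (X X')^+ tau_hat."""
    tau_hat = d * y / e - (1 - d) * y / (1 - e)
    return X.T @ np.linalg.pinv(X @ X.T) @ tau_hat

rng = np.random.default_rng(3)
n, p = 100, 400
X = rng.normal(size=(n, p)) * np.sqrt(1.0 / np.arange(1, p + 1) ** 2)
theta1 = rng.normal(size=p) / np.sqrt(p)
theta0 = rng.normal(size=p) / np.sqrt(p)
e = 1.0 / (1.0 + np.exp(-X[:, 0]))  # known propensity p(d=1|x)
d = rng.binomial(1, e)
y = np.where(d == 1, X @ theta1, X @ theta0) + 0.1 * rng.normal(size=n)

theta_hat = ipw_learner(X, y, d, e)
# The fit interpolates the pseudo-outcomes exactly (n <= p).
assert np.allclose(X @ theta_hat, d * y / e - (1 - d) * y / (1 - e))
```
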
  13. Upper Bound of the IPW-Learner (Theorem 5.3 of Kato and Imaizumi (2022)). There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the excess risk of the predictor satisfies, with probability at least $1 - \delta$, $R(\hat\theta_{\text{IPW-Learner}}) \le c\,\|\theta^*\|^2\, \mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta)\, \mathcal{V}_n(\Sigma)$.
      ■ This result owes to the unbiasedness of the IPW estimator $\hat\tau(x)$.
      ■ Under appropriate conditions on the covariance operator, the prediction risk goes to zero regardless of the treatment assignment rule $p(d = 1 \mid x)$ (a single-draw comparison of the two learners follows below).

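The contrast between the two theorems can be eyeballed in one simulated draw. This is only an illustration under assumed parameters, not a verification of the bounds, and the ordering of the two risks in any single draw is random:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2000
lam = 1.0 / np.arange(1, p + 1) ** 2        # spectrum of a diagonal Sigma
X = rng.normal(size=(n, p)) * np.sqrt(lam)
theta1, theta0 = rng.normal(size=p), rng.normal(size=p)
theta_star = theta1 - theta0

e = 1.0 / (1.0 + np.exp(-5 * X[:, 0]))      # covariate-dependent propensity
d = rng.binomial(1, e)
y = np.where(d == 1, X @ theta1, X @ theta0) + 0.1 * rng.normal(size=n)

min_norm = lambda A, b: A.T @ np.linalg.pinv(A @ A.T) @ b

# T-Learner: separate minimum-norm fits per arm.
theta_t = min_norm(X[d == 1], y[d == 1]) - min_norm(X[d == 0], y[d == 0])
# IPW-Learner: minimum-norm fit to the IPW pseudo-outcome (known propensity).
theta_ipw = min_norm(X, d * y / e - (1 - d) * y / (1 - e))

# Excess risk (theta - theta*)' Sigma (theta - theta*) for diagonal Sigma.
excess = lambda th: float(((th - theta_star) ** 2 * lam).sum())
print("T-Learner:", excess(theta_t), "IPW-Learner:", excess(theta_ipw))
```
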
  14. Conclusion
      ■ CATE prediction with an interpolating estimator.
      ► Sample selection bias (distribution shift of the covariates).
        ✓ T-Learner: the distribution shift affects the upper bound. → Benign overfitting does not occur because of the change of the covariance.
        ✓ IPW-Learner: we correct the distribution shift by the importance weight. → Benign overfitting occurs, as in the setting of Bartlett et al. (2020).
        ? Conditions for benign overfitting of the T-Learner (sup-norm convergence?).
      Thank you! ([email protected])

  15. References
      • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), "Benign overfitting in linear regression," Proceedings of the National Academy of Sciences, 117, 30063-30070.
      • Rubin, D. B. (1974), "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology.
      • Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
      • Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), "Metalearners for estimating heterogeneous treatment effects using machine learning," Proceedings of the National Academy of Sciences.
      • Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), "Covariate Shift in High-Dimensional Random Feature Regression."
      • Tripuraneni, N., Adlam, B., and Pennington, J. (2021b), "Overparameterization Improves Robustness to Covariate Shift in High Dimensions," in Conference on Neural Information Processing Systems.