Benign Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression

MasaKat0
April 06, 2022

TOPML2022.

Transcript

  1. Benign Overfitting in Conditional Average Treatment
    Effect Prediction with Linear Regression
    TOPML Workshop, April 5th, 2022
    Masahiro Kato
    The University of Tokyo / CyberAgent, Inc. AILab
    Masaaki Imaizumi
    The University of Tokyo

  2. Introduction
    ➢ Kato and Imaizumi (2022), “Benign-Overfitting in Conditional Average
    Treatment Effect Prediction with Linear Regression,” on arXiv.
    ■ Problem setting:
    • Goal: prediction of the conditional average treatment effect (CATE).
    • Estimators: interpolating estimators, which fit the observations perfectly.
    ■ Finding:
    • Compared with the standard regression setting of Bartlett et al. (2020),
    CATE prediction requires additional assumptions or modified algorithms for
    benign overfitting.

  3. CATE Prediction
    ■ Potential outcome framework (Rubin (1974)).
    • Binary treatment $a \in \{1, 0\}$, e.g., new medicine ($a = 1$) and placebo ($a = 0$).
    • For each treatment $a \in \{1, 0\}$, we define the potential outcome $y_a$.
    • An individual with covariate $x \in \mathbb{R}^p$ receives a treatment $d \in \{1, 0\}$.
    Propensity score: the treatment assignment probability $p(d = a \mid x)$.
    • For the assigned treatment, we observe the outcome $y = \mathbb{1}[d = 1] y_1 + \mathbb{1}[d = 0] y_0$.
    We cannot observe the potential outcome of the treatment that the individual is not assigned.
    ■ For $n$ individuals, the observations are given as $\{(y_i, d_i, x_i)\}_{i=1}^{n}$.

  4. Linear Regression Model of the CATE
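    ■ Assume a linear regression model for each potential outcome $y_a$:
    $y_a = \mathbb{E}[y_a \mid x] + \varepsilon = x^\top \theta_a^* + \varepsilon$.
    ■ The causal effect between treatments is captured by the CATE.
    • The CATE is defined as the difference between $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$:
    $\tau(x) = \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = x^\top(\theta_1^* - \theta_0^*) = x^\top \theta^*$, where $\theta^* = \theta_1^* - \theta_0^*$.
    ➢ Goal: predict the CATE by using an estimator of $\theta^*$.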
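
A minimal sketch of how observations could arise under this linear potential-outcome model. The Gaussian covariates, the unit-variance noise, the logistic propensity, and the helper name `simulate_observations` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def simulate_observations(n, p, theta1, theta0, propensity, rng):
    """Draw (y, d, x) under the linear potential-outcome model.

    `propensity` is a hypothetical callable returning p(d = 1 | x) for each row of x.
    """
    x = rng.standard_normal((n, p))            # covariates (Gaussian by assumption)
    y1 = x @ theta1 + rng.standard_normal(n)   # potential outcome under a = 1
    y0 = x @ theta0 + rng.standard_normal(n)   # potential outcome under a = 0
    d = rng.binomial(1, propensity(x))         # assigned treatment
    y = np.where(d == 1, y1, y0)               # only the assigned outcome is observed
    return y, d, x, y1, y0

rng = np.random.default_rng(0)
n, p = 50, 200                                 # overparametrized: n <= p
theta1 = rng.standard_normal(p) / np.sqrt(p)
theta0 = rng.standard_normal(p) / np.sqrt(p)
logistic = lambda x: 1.0 / (1.0 + np.exp(-x[:, 0]))  # assignment depends on x
y, d, x, y1, y0 = simulate_observations(n, p, theta1, theta0, logistic, rng)
cate = x @ (theta1 - theta0)                   # tau(x) = x^T theta*, never observed directly
```

In real data only `y`, `d`, and `x` are available; `y1`, `y0`, and `cate` are returned here only because the simulation knows them.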

  5. Excess Risk and Benign Overfitting
    ■ To evaluate the performance, we define the excess risk of an estimator $\hat{\theta}$ as
    $R(\hat{\theta}) = \mathbb{E}_{x,y}\big[(y_1 - y_0 - x^\top \hat{\theta})^2 - (y_1 - y_0 - x^\top \theta^*)^2\big]$.
    ■ We also consider an overparametrized situation:
    $n$ (sample size) $\le$ $p$ (the number of parameters).
    Ex. Benign overfitting framework by Bartlett et al. (2020).
    • Consider a standard regression setting with interpolating estimators.
    • $R(\hat{\theta})$ goes to zero under some conditions on the covariance of $x$.
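
As a worked step, the excess risk can be rewritten as a quadratic form in the estimation error. The derivation below is a sketch under assumptions that are not spelled out on the slide: $\hat{\theta}$ is treated as fixed given the training data, $\Sigma = \mathbb{E}[x x^\top]$ (mean-zero covariates), and $u = y_1 - y_0 - x^\top \theta^*$ satisfies $\mathbb{E}[u \mid x] = 0$, which follows from the linear model for $\mathbb{E}[y_a \mid x]$.

```latex
% Assumptions: \hat{\theta} fixed given the training sample,
% \Sigma = E[x x^\top], and u = y_1 - y_0 - x^\top \theta^* with E[u | x] = 0.
\begin{align*}
R(\hat{\theta})
&= \mathbb{E}_{x,y}\big[(u - x^\top(\hat{\theta} - \theta^*))^2 - u^2\big] \\
&= \mathbb{E}_{x}\big[(x^\top(\hat{\theta} - \theta^*))^2\big]
   - 2\,\mathbb{E}_{x,y}\big[u\, x^\top(\hat{\theta} - \theta^*)\big] \\
&= (\hat{\theta} - \theta^*)^\top \Sigma\, (\hat{\theta} - \theta^*).
\end{align*}
```

The cross term vanishes because $\mathbb{E}[u \mid x] = 0$, so the excess risk depends on the data only through the covariance of the original covariates $x$.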

  6. Excess Risk and Benign Overfitting
    ■ Apply the framework of Bartlett et al. (2020) to CATE prediction.
    ■ There is a sample selection bias, i.e., $p(d = a \mid x)$ depends on the covariates $x$.
    • The distribution of the observed covariates, $\mathbb{1}[d = a]x$, differs from that of
    the original covariates, $x$.
    ■ Owing to this bias, we need to modify the results of
    Bartlett et al. (2020) to show benign overfitting in CATE prediction.
    • Note that the excess risk is defined over the original distribution of $x$.
    (This problem is an instance of distribution shift or covariate shift.)

  7. T-Learner and IPW-Learner
    ■ Consider two CATE prediction methods using interpolating estimators.
    1. T-Learner: This method consists of two separate linear models.
    i. Estimate linear regression models for $\mathbb{E}[y_1 \mid x]$ and $\mathbb{E}[y_0 \mid x]$ separately.
    ii. Predict the CATE by using the difference of the two estimators.
    2. Inverse probability weight (IPW)-Learner:
    This method utilizes the propensity score $p(d = a \mid x)$.
    i. Construct an unbiased estimator $\hat{\tau}(x)$ of $\tau(x)$ as
    $\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}$.
    ii. Construct a predictor by regressing $\hat{\tau}(x)$ on the covariates $x$.
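
The unbiasedness of the IPW estimator can be checked in a few lines. This sketch assumes unconfoundedness, i.e., $d$ is independent of $(y_1, y_0)$ given $x$, and $0 < p(d = 1 \mid x) < 1$; neither condition is stated explicitly on the slide, although both are standard in the potential outcome framework.

```latex
% Assumptions: d is independent of (y_1, y_0) given x, and 0 < p(d = 1 | x) < 1.
% Uses 1[d = a] y = 1[d = a] y_a, from y = 1[d = 1] y_1 + 1[d = 0] y_0.
\begin{align*}
\mathbb{E}[\hat{\tau}(x) \mid x]
&= \frac{\mathbb{E}\big[\mathbb{1}[d = 1]\, y_1 \mid x\big]}{p(d = 1 \mid x)}
 - \frac{\mathbb{E}\big[\mathbb{1}[d = 0]\, y_0 \mid x\big]}{p(d = 0 \mid x)} \\
&= \frac{p(d = 1 \mid x)\, \mathbb{E}[y_1 \mid x]}{p(d = 1 \mid x)}
 - \frac{p(d = 0 \mid x)\, \mathbb{E}[y_0 \mid x]}{p(d = 0 \mid x)} \\
&= \mathbb{E}[y_1 \mid x] - \mathbb{E}[y_0 \mid x] = \tau(x).
\end{align*}
```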

  8. Effective Rank
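    ■ Show upper bounds using the effective rank (Bartlett et al. (2020)).
    • Denote the covariance operator of the covariates $x$ by $\Sigma$.
    • Denote its eigenvalues by $\lambda_j = \mu_j(\Sigma)$ for $j = 1, 2, \dots$, ordered such that $\lambda_1 > \lambda_2 > \cdots$.
    • If $\sum_{j}^{\infty} \lambda_j < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define
    $r_k(\Sigma) = \frac{\sum_{j>k} \lambda_j}{\lambda_{k+1}}$ and $R_k(\Sigma) = \frac{\left(\sum_{j>k} \lambda_j\right)^2}{\sum_{j>k} \lambda_j^2}$.
    • Denote the covariance of the observed covariates $\mathbb{1}[d = a]x$ by $\Sigma_a$.
    The effective rank is a measure of the complexity of the covariances.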
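
The two effective ranks can be computed directly from the eigenvalues of a finite-dimensional covariance matrix. The following is a minimal sketch; the function name `effective_ranks` and the two example covariances are illustrative assumptions.

```python
import numpy as np

def effective_ranks(sigma, k):
    """Return (r_k, R_k) for a symmetric positive semidefinite matrix `sigma`."""
    lam = np.sort(np.linalg.eigvalsh(sigma))[::-1]  # eigenvalues in descending order
    tail = lam[k:]                                   # lambda_{k+1}, lambda_{k+2}, ...
    if tail.size == 0 or tail[0] <= 0:
        raise ValueError("lambda_{k+1} must be positive")
    r_k = tail.sum() / tail[0]                       # sum_{j>k} lambda_j / lambda_{k+1}
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)        # (sum_{j>k} lambda_j)^2 / sum_{j>k} lambda_j^2
    return r_k, R_k

# Flat spectrum: both effective ranks equal the dimension.
r0_iso, R0_iso = effective_ranks(np.eye(5), k=0)

# Fast-decaying spectrum: the effective ranks are much smaller than the dimension.
decay = np.diag(1.0 / np.arange(1, 6) ** 2)
r0_dec, R0_dec = effective_ranks(decay, k=0)
```

A flat spectrum gives effective ranks equal to the dimension, while a quickly decaying spectrum gives values close to one.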

  9. Interpolating Estimator with T-Learner
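    ➢ The T-Learner.
    ■ Define interpolating estimators for $y_1$ and $y_0$ as $x^\top \hat{\theta}_1$ and $x^\top \hat{\theta}_0$, where
    $\hat{\theta}_a = (X_a^\top X_a)^{\dagger} X_a^\top y_a$, $a \in \{1, 0\}$.
    • $X_a$ ($y_a$) is the covariate matrix (outcome vector) of the individuals assigned treatment $a$.
    ■ We define an estimator of $\theta^*$ as the difference of the above estimators:
    $\hat{\theta}^{\text{T-Learner}} = \hat{\theta}_1 - \hat{\theta}_0$.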
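
The T-Learner with minimum-norm interpolators can be sketched with NumPy's pseudoinverse, using the identity $X^{\dagger} = (X^\top X)^{\dagger} X^\top$. The helper names `min_norm_fit` and `t_learner` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def min_norm_fit(X, y):
    """Minimum-norm least-squares solution (X^T X)^+ X^T y, computed as pinv(X) @ y."""
    return np.linalg.pinv(X) @ y

def t_learner(x, y, d):
    """Fit one interpolator per treatment arm and return their difference."""
    theta1 = min_norm_fit(x[d == 1], y[d == 1])  # uses only individuals with d = 1
    theta0 = min_norm_fit(x[d == 0], y[d == 0])  # uses only individuals with d = 0
    return theta1 - theta0, theta1, theta0

rng = np.random.default_rng(0)
n, p = 40, 300                       # overparametrized: n <= p
x = rng.standard_normal((n, p))
d = np.arange(n) % 2                 # alternating assignment so both arms are non-empty
y = rng.standard_normal(n)           # placeholder outcomes for the sketch
theta_t, theta1, theta0 = t_learner(x, y, d)
```

Because each arm has fewer observations than parameters, each fitted model reproduces its own arm's outcomes exactly, which is what "interpolating" means here.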

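  10. Upper Bound of the T-Learner
    Upper bound of the T-Learner (Theorem 4.3 of Kato and Imaizumi (2022)).
    There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and
    $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the
    excess risk of the predictor satisfies, with probability at least $1 - \delta$,
    $R(\hat{\theta}_n^{\text{T-Learner}}) \le \sum_{a \in \{1,0\}} \big\{ c \|\theta_a^*\|^2 \mathcal{B}_{n,\delta}(\Sigma_a) + \|(\Sigma - \zeta_a^* \Sigma_a)\theta_a^*\|^2 \big\} + c \|\theta_1^*\| \|\theta_1^*\| \mathcal{B}_{n,\delta}(\Sigma) + c \log(1/\delta) \big\{ \mathcal{V}_n(\Sigma) + \|\theta_1^*\| + \|\theta_1^*\| \mathcal{V}_n(\Sigma) \big\}$,
    where $\zeta_a^* = \arg\min_{\zeta \in \mathbb{R}} \|\Sigma - \zeta \Sigma_a\|$,
    $\mathcal{B}_{n,\delta}(\Sigma) = \|\Sigma\| \max\big\{ \sqrt{r_0(\Sigma)/n}, \sqrt{\log(1/\delta)/n} \big\}$, and
    $\mathcal{V}_n(\Sigma) = \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}$.
    • The term $\|(\Sigma - \zeta_a^* \Sigma_a)\theta_a^*\|^2$ comes from the sample selection bias (distribution shift).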

  11. Upper Bound of the T-Learner
    ■ Benign overfitting depends on the existence of sample selection bias.
    ➢ Case 1: the treatment assignment does not depend on the covariates.
    • No sample selection bias ($p(d = 1 \mid x) = p(d = 1)$), e.g., RCTs
    → $\|(\Sigma - \zeta_a^* \Sigma_a)\theta_a^*\|^2 = 0$.
    • $R(\hat{\theta})$ goes to zero under the same conditions used in Bartlett et al. (2020).
    ➢ Case 2: the treatment assignment depends on the covariates.
    • $\|(\Sigma - \zeta_a^* \Sigma_a)\theta_a^*\|^2$ in the upper bound does not go to zero.
    • The convergence of the excess risk $R(\hat{\theta})$ is not guaranteed.
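
The two cases can be illustrated numerically by checking how well a rescaled $\Sigma_1$ matches $\Sigma$. The sketch below rests on assumptions not fixed by the slides: the Frobenius norm for $\|\Sigma - \zeta \Sigma_a\|$, sample covariances in place of population ones, Gaussian covariates, and a hypothetical propensity that depends on $x_1^2$.

```python
import numpy as np

def relative_mismatch(x, d):
    """min over zeta of ||Sigma - zeta * Sigma_1||_F / ||Sigma||_F, from samples."""
    n = x.shape[0]
    sigma = x.T @ x / n                    # covariance of the original covariates
    xd = x * d[:, None]                    # observed covariates 1[d = 1] x
    sigma1 = xd.T @ xd / n                 # covariance of the observed covariates
    # Closed-form least-squares scalar: <Sigma, Sigma_1>_F / ||Sigma_1||_F^2.
    zeta = np.sum(sigma * sigma1) / np.sum(sigma1 * sigma1)
    return np.linalg.norm(sigma - zeta * sigma1) / np.linalg.norm(sigma)

rng = np.random.default_rng(0)
n, p = 100_000, 3
x = rng.standard_normal((n, p))

# Case 1: assignment independent of x (as in an RCT).
d_rct = rng.binomial(1, 0.5, size=n)

# Case 2: assignment depends on x through a hypothetical propensity.
prop = 1.0 / (1.0 + np.exp(-2.0 * (x[:, 0] ** 2 - 1.0)))
d_dep = rng.binomial(1, prop)

gap_rct = relative_mismatch(x, d_rct)      # close to zero up to sampling noise
gap_dep = relative_mismatch(x, d_dep)      # stays away from zero
```

Under random assignment $\Sigma_1$ is proportional to $\Sigma$, so a single scalar $\zeta$ removes the mismatch; when the propensity reweights some directions of $x$ more than others, no scalar can.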

  12. Interpolating Estimator with IPW-Learner
    ➢ The IPW-Learner.
    ■ Suppose that the propensity score $p(d = 1 \mid x)$ is known.
    ■ Obtain an unbiased estimator of the CATE,
    $\hat{\tau}(x) = \frac{\mathbb{1}[d = 1]\, y}{p(d = 1 \mid x)} - \frac{\mathbb{1}[d = 0]\, y}{p(d = 0 \mid x)}$.
    • This estimator is called an IPW estimator.
    ■ Regress $\hat{\tau}(x)$ on $x$ to estimate $\theta^*$ with an interpolating estimator:
    $\hat{\theta}^{\text{IPW-Learner}} = (X^\top X)^{\dagger} X^\top \hat{\tau}$.
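
A minimal sketch of the IPW-Learner with a known propensity score follows. The helper names `ipw_pseudo_outcome` and `ipw_learner`, the logistic propensity, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ipw_pseudo_outcome(y, d, p1):
    """IPW estimate of the CATE per individual; p1 is the known p(d = 1 | x)."""
    return d * y / p1 - (1 - d) * y / (1.0 - p1)

def ipw_learner(x, y, d, p1):
    """Regress the IPW pseudo-outcome on x with the minimum-norm interpolator."""
    tau_hat = ipw_pseudo_outcome(y, d, p1)
    theta = np.linalg.pinv(x) @ tau_hat    # equals (X^T X)^+ X^T tau_hat
    return theta, tau_hat

rng = np.random.default_rng(0)
n, p = 40, 300                             # overparametrized: n <= p
x = rng.standard_normal((n, p))
theta1 = rng.standard_normal(p) / np.sqrt(p)
theta0 = rng.standard_normal(p) / np.sqrt(p)
p1 = 1.0 / (1.0 + np.exp(-x[:, 0]))        # known propensity (assumed logistic)
d = rng.binomial(1, p1)
noise = rng.standard_normal(n)
y = np.where(d == 1, x @ theta1, x @ theta0) + noise
theta_ipw, tau_hat = ipw_learner(x, y, d, p1)
```

Unlike the T-Learner, this regression uses all $n$ rows of $X$, so the design matrix follows the original covariate distribution rather than the distribution within one treatment arm.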

  13. Upper Bound of the IPW-Learner
    Upper bound of the IPW-Learner (Theorem 5.3 of Kato and Imaizumi (2022)).
    There exist $b, c > 1$ such that if $\delta < 1$ with $\log(1/\delta) < n/c$ and
    $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\} < n/c_1$, then under some regularity conditions, the
    excess risk of the predictor satisfies, with probability at least $1 - \delta$,
    $R(\hat{\theta}_n^{\text{IPW-Learner}}) \le c \|\theta^*\|^2 \mathcal{B}_{n,\delta}(\Sigma_a) + c \log(1/\delta)\, \mathcal{V}_n(\Sigma)$.
    ■ This result follows from the unbiasedness of the IPW estimator $\hat{\tau}(x)$.
    ■ Under appropriate conditions on the covariance operator, the prediction risk
    goes to zero, regardless of the treatment assignment rule $p(d = 1 \mid x)$.

  14. Conclusion
    ■ CATE prediction with an interpolating estimator.
    ➢ Sample selection bias (distribution shift of the covariates).
    ✓ T-Learner: the distribution shift affects the upper bound.
    → Benign overfitting does not occur because of the change in the covariance.
    ✓ IPW-Learner: we correct the distribution shift by importance weighting.
    → Benign overfitting occurs, as in the case of Bartlett et al. (2020).
    ? Open question: conditions for benign overfitting of the T-Learner (sup-norm convergence?).
    Thank you! ([email protected])

  15. References
    • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign overfitting in linear
    regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070.
    • Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and
    nonrandomized studies,” Journal of Educational Psychology.
    • Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and
    Biomedical Sciences: An Introduction, Cambridge University Press.
    • Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019), “Metalearners for estimating
    heterogeneous treatment effects using machine learning,” Proceedings of the National
    Academy of Sciences.
    • Tripuraneni, N., Adlam, B., and Pennington, J. (2021a), “Covariate Shift in High-Dimensional
    Random Feature Regression.”
    • Tripuraneni, N., Adlam, B., and Pennington, J. (2021b), “Overparameterization Improves
    Robustness to Covariate Shift in High Dimensions,” in Conference on Neural Information
    Processing Systems.
