
Individualizing treatment effects: transportability and model selection

Gael Varoquaux
September 22, 2023

Individualizing treatment effects: transportability and model selection

For efficient interventions, we would like to know the causal effect of the intervention on a given individual: the individual treatment effect. Given the proper set of covariates, such quantity can be computed with machine-learning models: contrasting the predicted outcome for the individual with and without the treatment. I will analyse in detail how to best compute such quantities: what choice of covariates to minimize the variance, how to empirically select the best machine-learning model, and how a good choice of population-level summaries of treatment effect is least sensitive to heterogeneity.


Transcript

  1. Individualizing treatment effects:
    Transportability and model selection
    Gaël Varoquaux

  2. Causal effects on an individual
    Relies on comparing treated vs non-treated outcomes
    In Randomized Controlled Trials:
    treated and non-treated are statistically identical
    On observational data:
    extrapolate across similar treated and non-treated individuals
    Challenge: personalizing this population measure

  3. Automated decisions
    But also policy recommendations
    Many problems are better solved by organizational choices
    than by shiny magic boxes
    This requires solid evidence of causal effects

  4. Individualizing treatment effects
    1 Good choice of predictors for counterfactuals
    2 Good choice of variables to generalize
    3 Good choice of summary measure

  5. 1 Good choice of predictors for
    counterfactuals
    Model selection for causal machine learning
    [Doutreligne and Varoquaux 2023]
    How to select predictive models
    for decision making or causal inference?



  8. Intuitions: Predictors and causal effects
    Prognostic model: predicting a health outcome
    (Figure: predicted outcome as a function of a health covariate)
    Prediction as a function of intervention (treated Y1(x) vs untreated Y0(x))
    For decisions, the individual treatment effect:
    comparing predicted outcomes for the same individual


  12. Intuitions: causal model selection & distribution shift
    (Figure: untreated outcome Y0(x), treated outcome Y1(x), and fitted model
    µ̂a(x), as a function of baseline health)
    Standard cross-validation / predictive accuracy is not good enough:
    must weight equally the errors on treated vs untreated outcomes
    Healthy individuals did not receive the treatment
    The model associates treatment with negative outcomes
    A worse predictor can give better causal inference

  13. Formalism: potential outcomes
    (Figure: outcome vs health covariate)
    Potential outcomes: (Y0(X), Y1(X)) ∼ D⋆ (unobserved)
    Observations: (Y, X, A) ∼ D (outcome, covariate, treatment)
    Estimation goal, the CATE: τ(x) := E_{Y1,Y0∼D⋆}[Y1 − Y0 | X = x]

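The CATE can be estimated by contrasting predicted outcomes, e.g. with a simple T-learner: fit one outcome model per treatment arm and take the difference of predictions for the same individual. A minimal sketch on simulated data (the linear data-generating process and model choice are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))                 # covariate
A = rng.binomial(1, 0.5, size=n)            # randomized treatment
tau = 1.0 + 2.0 * X[:, 0]                   # true CATE (heterogeneous)
Y = X[:, 0] + A * tau + rng.normal(scale=0.1, size=n)

# T-learner: one outcome model per treatment arm
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0])
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1])

def cate(x):
    """Individual treatment effect: contrast of predicted outcomes."""
    return mu1.predict(x) - mu0.predict(x)

print(cate(np.array([[0.0], [1.0]])))       # close to the true effects 1 and 3
```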
  14. Causal risk: oracle vs observable
    Using a predictor ŷ = f(X) to estimate the CATE induces a risk:
    τ-risk(f) = E_{X∼p(X)}[(τ(X) − τ̂_f(X))²]   (oracle: uses τ)
    Standard R² (ML practice):
    µ-risk(f) = E[(Y − f(X; A))²]   (on D not D⋆, Y not (Y0, Y1))
    The challenge: compensate for the difference between D⋆ and D

  15. Causal risk: inverse propensity weighting
    Propensity score: e(x) := P(A = 1 | X = x)
    Adjusted risk – IPW [Wager and Athey 2018]:
    τ-risk⋆_IPW(f) = E[( Y (A − e(X)) / (e(X)(1 − e(X))) − τ_f(X) )²]

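Given (estimated) propensities, this adjusted risk is computable from observed data alone, since Y(A − e)/(e(1 − e)) is a pseudo-outcome for τ(X). A small numpy sketch (the function and array names are mine, not from the slides):

```python
import numpy as np

def tau_risk_ipw(Y, A, e_hat, tau_hat):
    """IPW-adjusted causal risk: squared error between the
    propensity-weighted pseudo-outcome and the candidate CATE."""
    pseudo = Y * (A - e_hat) / (e_hat * (1 - e_hat))
    return np.mean((pseudo - tau_hat) ** 2)

# Toy check: in a balanced RCT (e = 0.5), the pseudo-outcome is ±2Y
Y = np.array([1.0, 0.0, 2.0, 1.0])
A = np.array([1, 0, 1, 0])
e = np.full(4, 0.5)
print(tau_risk_ipw(Y, A, e, np.zeros(4)))   # mean of [4, 0, 16, 4] = 6.0
```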
  16. Causal risk: the R-risk
    Lemma – rewriting of the outcome model (R-decomposition):
    y(a) = m(x) + (a − e(x)) τ(x) + ε(x; a)
    Conditional mean outcome: m(x) := E_{Y∼D}[Y | X = x]
    Propensity score: e(x) := P(A = 1 | X = x)
    Adjusted risk – R-risk [Nie and Wager 2021]:
    R-risk(f) = E_{(Y,X,A)∼D}[((Y − m(X)) − (A − e(X)) τ_f(X))²]
    What’s the price to pay with estimated e and m?

  17. Procedure: selecting predictors for causal inference
    Model-selection procedure:
    1. Estimate m̂ and ê on the train set (with standard ML tools)
    2. On the test set, use the adjusted (“doubly robust”) risk:
    R-risk(f) = E_{(Y,X,A)∼D}[((Y − m̂(X)) − (A − ê(X)) τ_f(X))²]

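This two-step procedure can be sketched with scikit-learn (data generation, learners, and names are illustrative assumptions): nuisances m̂ and ê fitted on the train split, then candidate CATE models ranked by R-risk on a held-out split.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
X = rng.normal(size=(n, 2))
e = 1 / (1 + np.exp(-X[:, 0]))              # confounded treatment assignment
A = rng.binomial(1, e)
tau = X[:, 1]                                # true CATE, driven by X[:, 1]
Y = X[:, 0] + A * tau + 0.1 * rng.normal(size=n)

train, test = train_test_split(np.arange(n), test_size=0.1, random_state=0)

# 1. Nuisances estimated on the train set with standard ML tools
m_hat = GradientBoostingRegressor().fit(X[train], Y[train])
e_hat = GradientBoostingClassifier().fit(X[train], A[train])

def r_risk(tau_pred, idx):
    """R-risk of a candidate CATE model on held-out data."""
    resid_y = Y[idx] - m_hat.predict(X[idx])
    resid_a = A[idx] - e_hat.predict_proba(X[idx])[:, 1]
    return np.mean((resid_y - resid_a * tau_pred) ** 2)

# 2. Rank candidate CATE models on the 10% test split
candidates = {"oracle": tau[test], "zero effect": np.zeros(len(test))}
for name, tau_pred in candidates.items():
    print(name, r_risk(tau_pred, test))
```

In this toy setting the oracle CATE gets a lower R-risk than the "no effect" candidate, which is the ordering the procedure is meant to recover.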
  18. Empirical evaluation
    4 datasets:
    1 simulation, sampling many scenarios
    (response functions, treatment allocations)
    3 canonical real-world datasets
    with simulated counterfactuals

  19. Empirical evaluation: which risk?
    (Figure: relative (metric, risk) agreement, Kendall's τ, compared to the mean
    over all metrics, for the µ-risk, the IPW risks (estimated and oracle ⋆),
    the U-risk, and the R-risk, under strong and weak overlap, on Twins
    (N = 11,984), ACIC 2016 (N = 4,802), Caussim (N = 5,000),
    ACIC 2018 (N = 5,000))
    The R-risk is best: small price paid for estimated e and m,
    doubly robust properties


  22. Empirical evaluation: “details” of the procedure
    Which data to learn the nuisances e, m?
    3 splits: learn f̂ / learn ê and m̂ / compute risk
      Drawback: less data for learning
    or
    2 splits: learn f̂, ê, and m̂ on the same data
      Drawback: correlated errors
    2 splits ✓
    What train/test fraction?
    More data to train, or to test? 10% left-out data for test ✓
    How to choose models for e and m?
    Which learner? Standard model selection ✓

  23. Prediction to support decisions
    When predictors should be causal,
    normal cross-validation is not suited
    Use the R-risk, with nuisances estimated on the train set
    [Doutreligne and Varoquaux 2023]

  24. 2 Good choice of variables to generalize
    From identifiability to a bias-variance tradeoff
    [Colnet... 2023a]
    Reweighting the RCT for generalization:
    finite sample error and variable selection


  25. Transporting causal effects: RCT and target data
    Source data (RCT): estimate causal effects
    Target data: make decisions

  26. Transporting causal effects: principle
    Model treatment heterogeneity
    to apply it on a shifted population
    Review: [Colnet... 2024]

  27. Assumptions needed for identifiability (oracle)
    1. Internal validity of the trial
    2. Overlap between trial and target populations
    3. Transportability / conditional ignorability:
    E_R[Y(1) − Y(0) | X = x] = E_T[Y(1) − Y(0) | X = x]
    (R: trial distribution, T: target distribution)
    The covariates X capture all systematic variations in treatment effect
    Which covariates?
    mod: effect modifiers; shift: shifted between populations
    Necessary: mod ∩ shift, the variables that are both shifted and modify the effect
    In practice, people include more, to be safe

  28. Reweighting estimator (IPSW)
    Re-weight the trial data:
    τ̂_{n,m} = (1/n) Σ_{i∈Trial} ŵ_{n,m}(X_i) · (Y_i A_i − Y_i (1 − A_i))
    (transport weights × trial estimate)
    Consistency of the non-oracle IPSW under assumptions 1, 2, 3,
    for categorical covariates [Colnet... 2023a]

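With categorical covariates, the transport weights are plug-in frequency ratios p̂_T(x)/p̂_R(x). A toy IPSW sketch, where the populations, effect sizes, and the π = 1/2 Horvitz-Thompson form of the trial estimate are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target population: binary covariate x, shifted vs the trial
m_target = 10000
x_t = rng.binomial(1, 0.8, m_target)           # 80% have x = 1

# Trial: different covariate distribution, randomized treatment
n = 5000
x_r = rng.binomial(1, 0.3, n)                  # only 30% have x = 1
a = rng.binomial(1, 0.5, n)                    # randomization π = 1/2
tau_x = np.where(x_r == 1, 2.0, 1.0)           # x modifies the effect
y = a * tau_x + 0.1 * rng.normal(size=n)

# Transport weights: frequency ratios p_T(x) / p_R(x)
p_t = np.array([np.mean(x_t == 0), np.mean(x_t == 1)])
p_r = np.array([np.mean(x_r == 0), np.mean(x_r == 1)])
w = (p_t / p_r)[x_r]

# IPSW: reweighted Horvitz-Thompson trial estimate (π = 1/2)
tau_trial = np.mean((y * a - y * (1 - a)) / 0.5)
tau_ipsw = np.mean(w * (y * a - y * (1 - a)) / 0.5)
print(tau_trial, tau_ipsw)   # trial ATE ≈ 1.3, target ATE ≈ 1.8
```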
  29. Finite-sample error of IPSW
    Quadratic risk of the non-oracle IPSW for n trial and m target samples:
    E[(τ̂_{π,n,m} − τ)²] ≤ 2 V_so / (n + 1) + Var[τ(X)] / m
      + (2 / (m (n + 1))) E_R[ p_T(X)(1 − p_T(X)) / p_R(X)² · V_HT(X) ]
      + 2 (1 − min_x p_R(x))^n E_T[τ(X)²] (1 + 2/m)
    [Colnet... 2023a]

  30. Adding shifted covariates which are not treatment effect modifiers
    Add to the minimal set X a set V of shifted non-modifiers:
    makes estimating the propensity weights harder
    lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
      = ( Σ_{v∈V} p_T(v)² / p_R(v) ) · lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
    The stronger the shift, the bigger the variance inflation
    [Colnet... 2023a]

  31. Adding treatment effect modifiers which are not shifted
    Add to the minimal set X a set V of non-shifted modifiers:
    gives less variance on τ
    lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
      = lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
      − E_R[ p_T(X) / p_R(X) · Var[τ(X, V) | X] ]
    The stronger the variance explained by V,
    the bigger the variance deflation
    [Colnet... 2023a]

  32. Covariate choice
    Types of covariates:
    shift: covariates shifted between populations
    mod: treatment effect modifiers
    To reweight / stratify for transport:
    - Minimal set: shift ∩ mod
    - Reduced variance: mod
    [Colnet... 2023a]

  33. 3 Good choice of summary measure
    Risk ratio, odds ratio, risk difference...
    which causal measure is easier to generalize?
    [Colnet... 2023b]

  34. Summarizing across individuals
    Even individualizing requires local averaging

  35. Summary metrics
    Risk difference: τ_RD := E[Y(1)] − E[Y(0)]
    Risk ratio: τ_RR := E[Y(1)] / E[Y(0)]
    For binary outcomes (in addition to the above):
    Survival ratio: τ_SR := P(Y(1) = 0) / P(Y(0) = 0)
    Odds ratio: τ_OR := (P[Y(1) = 1] / P[Y(1) = 0]) · (P[Y(0) = 1] / P[Y(0) = 0])^{−1}

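All four measures follow from the two potential-outcome risks P(Y(1)=1) and P(Y(0)=1); a small helper makes this concrete (the function name and the input risks are mine, for illustration):

```python
def causal_measures(p1, p0):
    """Summary causal measures from p1 = P(Y(1)=1) and p0 = P(Y(0)=1)."""
    return {
        "RD": p1 - p0,                                # risk difference
        "RR": p1 / p0,                                # risk ratio
        "SR": (1 - p1) / (1 - p0),                    # survival ratio
        "OR": (p1 / (1 - p1)) / (p0 / (1 - p0)),      # odds ratio
    }

print(causal_measures(0.06, 0.10))
```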
  36. Summary metrics and population heterogeneity
    Full population versus subpopulations:

                          τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)    −0.0452   0.6    1.05   0.57
    Moderate: X = 1       −0.006    0.6    1.01   0.6
    Mild: X = 0           −0.080    0.6    1.1    0.545

    Stroke outcome, data from [MacMahon... 1990]
    We want the summary metrics most stable across subgroups


  38. Summary metrics: appealing properties
    Collapsibility:
    τ_RD = p_S(X = 1) · τ_RD(X = 1) + p_S(X = 0) · τ_RD(X = 0)
    (p_S: proportion of individuals with X = 1, resp. X = 0, in P_S)
    The total measure is a weighted combination of sub-population measures
    Logic-respecting:
    min_x τ(x) ≤ τ ≤ max_x τ(x)
    The odds ratio is not logic-respecting (thus not collapsible)

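Both properties can be checked numerically on made-up subgroup risks: the population risk difference equals the weighted combination of subgroup risk differences, while a constant subgroup odds ratio does not even bound the population odds ratio (the numbers below are illustrative, not the slides' data):

```python
# Two equally sized subgroups with distinct baseline risks
p_s = [0.5, 0.5]                   # subgroup proportions
p0 = [0.5, 0.8]                    # P(Y(0)=1) per subgroup
p1 = [0.2, 0.5]                    # P(Y(1)=1) per subgroup

# Population-level risks are mixtures of subgroup risks
P0 = sum(w * p for w, p in zip(p_s, p0))   # 0.65
P1 = sum(w * p for w, p in zip(p_s, p1))   # 0.35

# Risk difference: population value = weighted subgroup values (collapsible)
rd_pop = P1 - P0
rd_mix = sum(w * (a - b) for w, a, b in zip(p_s, p1, p0))
print(rd_pop, rd_mix)              # both −0.3

# Odds ratio: 0.25 in both subgroups, yet ≈ 0.29 in the population,
# outside [min, max] of the subgroup values: not logic-respecting
def odds_ratio(a, b):
    return (a / (1 - a)) / (b / (1 - b))

or_sub = [odds_ratio(a, b) for a, b in zip(p1, p0)]
or_pop = odds_ratio(P1, P0)
print(or_sub, or_pop)
```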
  39. Transporting causal measures across populations
    Transporting a collapsible measure τ from S to T:
    τ_T = E_S[ (p_T(X) / p_S(X)) · g_T(Y(0), X) · τ_S(X) ]
    (re-weighting: population ratio × collapsibility weights × τ)
    X contains all the covariates that are both shifted
    and treatment effect modifiers

  40. Separating baseline from treatment effect
    Directions of heterogeneity?
    Need only covariates that explain τ,
    and not the full variance of Y(0), Y(1)

  41. For binary outcomes
    You only die once
    Waging war makes smoking comparatively less risky
    Probabilities are not additive

  42. For binary outcomes: a decomposition
    P(Y(a) = 1 | X = x) = b(x) + a [ (1 − b(x)) m_b(x) − b(x) m_g(x) ]
    where m_g(x) := P(Y(1) = 0 | Y(0) = 1, X = x)   (good effect)
    and m_b(x) := P(Y(1) = 1 | Y(0) = 0, X = x)   (bad effect)
    (“good”: the untreated has the outcome, the treated does not)
    Decomposed summary metrics:
    τ_RD = E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)]
    τ_NNT = 1 / ( E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)] )
    τ_RR = 1 + E[(1 − b(X)) m_b(X)] / E[b(X)] − E[b(X) m_g(X)] / E[b(X)]
    τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)] + E[b(X) m_g(X)] / E[1 − b(X)]

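The decomposition can be sanity-checked for one stratum x: writing P(Y(1) = 1 | x) by conditioning on Y(0) gives the same value as the formula (the values of b, m_g, m_b below are made up):

```python
# Baseline risk and flip probabilities for one stratum x (made-up values)
b, m_g, m_b = 0.4, 0.5, 0.1

# P(Y(1)=1 | x) from the decomposition
p1 = b + (1 - b) * m_b - b * m_g

# Same quantity from first principles, conditioning on Y(0):
# treated outcome is 1 if (Y(0)=1 and no "good" flip) or (Y(0)=0 and a "bad" flip)
p1_direct = b * (1 - m_g) + (1 - b) * m_b
print(p1, p1_direct)    # identical, 0.26 here
```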
  43. For binary outcomes: a decomposition with monotone effect
    Assuming the treatment is beneficial (∀x, m_b(x) = 0):
    the risk ratio simplifies: τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
    - No general separation of b and m ⌢
    - On covariates X affecting only the baseline level b ⌣:
    τ_RR(x) = 1 − b(x) m_g(x) / b(x) = 1 − m_g(x)
    constant effect on subgroups stratified by covariates affecting only the baseline

  44. For binary outcomes: a decomposition with monotone effect
    Assuming the treatment is beneficial (∀x, m_b(x) = 0):
    the risk ratio simplifies: τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
    Assuming the treatment is harmful (∀x, m_g(x) = 0):
    the survival ratio simplifies: τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)]
    An age-old question [Sheps 1958]:
    Shall We Count the Living or the Dead?

  45. Choice of covariates to transport causal effects
    A good metric separates baseline risk from treatment modulation
    Continuous outcome:
    the risk difference is collapsible and separates out the baseline risk
    ⇒ need only the shifted covariates that modulate the difference
    Binary outcome:
    beneficial effect ⇒ risk ratio
    ⇒ need only the shifted covariates that modulate the benefit
    harmful effect ⇒ survival ratio
    ⇒ need only the shifted covariates that modulate the harm

  46. Choice of covariates to transport causal effects
    A good metric separates baseline risk from treatment modulation
    Important because it defines heterogeneity across individuals:

                          τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)    −0.0452   0.6    1.05   0.57
    Moderate: X = 1       −0.006    0.6    1.01   0.6
    Mild: X = 0           −0.080    0.6    1.1    0.545

  47. The soda team: Machine learning for health and social sciences
    Machine learning for statistics
    Causal inference, biases, missing values
    Health and social sciences
    Epidemiology, education, psychology
    Tabular relational learning
    Relational databases, data lakes
    Data-science software
    scikit-learn, joblib, skrub

  48. Individualizing treatment effects
    Using predictors
    Be careful about model selection [Doutreligne and Varoquaux 2023]:
    - heterogeneous error between treated and non-treated
    - use the R-risk
    (also be careful not to use post-treatment information)
    Stratification in subgroups
    Use all covariates that capture effect heterogeneity,
    but not population heterogeneity [Colnet... 2023a]
    Good metrics separate baseline risk from treatment effect
    Binary outcomes require thinking [Colnet... 2023b]
    @GaelVaroquaux

  49. References I
    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Reweighting the RCT for generalization: finite sample error and variable selection. arXiv:2208.07614, 2023a.
    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize? arXiv:2303.16008, 2023b.
    B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 2024.
    M. Doutreligne and G. Varoquaux. How to select predictive models for decision making or causal inference? 2023. URL https://hal.science/hal-03946902.
    M. MacMahon, R. Peto, R. Collins, J. Godwin, J. Cutler, P. Sorlie, R. Abbott, J. Neaton, A. Dyer, and J. Stamler. Blood pressure, stroke, and coronary heart disease: part 1, prolonged differences in blood pressure: prospective observational studies corrected for the regression dilution bias. The Lancet, 335(8692):765–774, 1990.
    X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

  50. References II
    M. C. Sheps. Shall we count the living or the dead? New England Journal of Medicine, 259(25):1210–1214, 1958. doi: 10.1056/NEJM195812182592505. PMID: 13622912.
    G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
    G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
    S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.