
Individualizing treatment effects: transportability and model selection

Gael Varoquaux
September 22, 2023

Individualizing treatment effects: transportability and model selection

For efficient interventions, we would like to know the causal effect of the intervention on a given individual: the individual treatment effect. Given the proper set of covariates, such quantity can be computed with machine-learning models: contrasting the predicted outcome for the individual with and without the treatment. I will analyse in detail how to best compute such quantities: what choice of covariates to minimize the variance, how to empirically select the best machine-learning model, and how a good choice of population-level summaries of treatment effect is least sensitive to heterogeneity.


Transcript

  1. Individualizing treatment effects:
    Transportability and model selection
    Gaël Varoquaux

  2. Causal effects on an individual
    Relies on comparing treated vs non-treated outcomes
    In Randomized Controlled Trials:
    treated and non-treated are statistically identical
    On observational data:
    extrapolate across similar treated and non-treated individuals
    Challenge: personalizing this population measure

  3. Automated decisions
    But also policy recommendations
    Many problems are better solved by organizational choices
    than by shiny magic boxes
    This requires solid evidence of causal effects

  4. Individualizing treatment effects
    1 Good choice of predictors for counterfactuals
    2 Good choice of variables to generalize
    3 Good choice of summary measure

  5. 1 Good choice of predictors for
    counterfactuals
    Model selection for causal machine learning
    [Doutreligne and Varoquaux 2023]
    How to select predictive models
    for decision making or causal inference?



  8. Intuitions: Predictors and causal effects
    Prognostic model: predicting a health outcome
    (Figure: predicted outcome as a function of a health covariate)
    Prediction as a function of intervention (treated Y1(x) vs untreated Y0(x))
    For decisions, the individual treatment effect:
    comparing predicted outcomes for the same individual


  12. Intuitions: causal model selection & distribution shift
    (Figure: untreated outcome Y0(x), treated outcome Y1(x), and fitted model
    µ̂a(x), as a function of baseline health)
    Standard cross-validation / predictive accuracy is not good enough:
    must weight equally the errors on treated vs untreated outcomes
    Healthy individuals did not receive the treatment
    The model associates treatment with negative outcomes
    A worse predictor can give better causal inference

  13. Formalism: potential outcomes
    (Figure: outcome vs health covariate)
    Potential outcomes: (Y0(X), Y1(X)) ∼ D⋆ (unobserved)
    Observations: (Y, X, A) ∼ D (outcome, covariate, treatment)
    Estimation goal, the CATE: τ(x) := E_{Y1,Y0∼D⋆}[Y1 − Y0 | X = x]

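The CATE can be estimated by contrasting predicted outcomes, e.g. with a simple T-learner: fit one outcome model per treatment arm and take the difference of predictions for the same individual. A minimal sketch on simulated data (the linear data-generating process and model choice are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))                 # covariate
A = rng.binomial(1, 0.5, size=n)            # randomized treatment
tau = 1.0 + 2.0 * X[:, 0]                   # true CATE (heterogeneous)
Y = X[:, 0] + A * tau + rng.normal(scale=0.1, size=n)

# T-learner: one outcome model per treatment arm
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0])
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1])

def cate(x):
    """Individual treatment effect: contrast of predicted outcomes."""
    return mu1.predict(x) - mu0.predict(x)

print(cate(np.array([[0.0], [1.0]])))       # close to the true effects 1 and 3
```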
  14. Causal risk: oracle vs observable
    Using a predictor ŷ = f(X) to estimate the CATE induces a risk:
    τ-risk(f) = E_{X∼p(X)}[(τ(X) − τ̂_f(X))²]   (oracle: uses τ)
    Standard R² (ML practice):
    µ-risk(f) = E[(Y − f(X; A))²]   (on D not D⋆, Y not (Y0, Y1))
    The challenge: compensate for the difference between D⋆ and D

  15. Causal risk: inverse propensity weighting
    Propensity score: e(x) := P(A = 1 | X = x)
    Adjusted risk – IPW [Wager and Athey 2018]:
    τ-risk⋆_IPW(f) = E[( Y (A − e(X)) / (e(X)(1 − e(X))) − τ_f(X) )²]

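Given (estimated) propensities, this adjusted risk is computable from observed data alone, since Y(A − e)/(e(1 − e)) is a pseudo-outcome for τ(X). A small numpy sketch (the function and array names are mine, not from the slides):

```python
import numpy as np

def tau_risk_ipw(Y, A, e_hat, tau_hat):
    """IPW-adjusted causal risk: squared error between the
    propensity-weighted pseudo-outcome and the candidate CATE."""
    pseudo = Y * (A - e_hat) / (e_hat * (1 - e_hat))
    return np.mean((pseudo - tau_hat) ** 2)

# Toy check: in a balanced RCT (e = 0.5), the pseudo-outcome is ±2Y
Y = np.array([1.0, 0.0, 2.0, 1.0])
A = np.array([1, 0, 1, 0])
e = np.full(4, 0.5)
print(tau_risk_ipw(Y, A, e, np.zeros(4)))   # mean of [4, 0, 16, 4] = 6.0
```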
  16. Causal risk: the R-risk
    Lemma – rewriting of the outcome model (R-decomposition):
    y(a) = m(x) + (a − e(x)) τ(x) + ε(x; a)
    Conditional mean outcome: m(x) := E_{Y∼D}[Y | X = x]
    Propensity score: e(x) := P(A = 1 | X = x)
    Adjusted risk – R-risk [Nie and Wager 2021]:
    R-risk(f) = E_{(Y,X,A)∼D}[((Y − m(X)) − (A − e(X)) τ_f(X))²]
    What’s the price to pay with estimated e and m?

  17. Procedure: selecting predictors for causal inference
    Model-selection procedure:
    1. Estimate m̂ and ê on the train set (with standard ML tools)
    2. On the test set, use the adjusted (“doubly robust”) risk:
    R-risk(f) = E_{(Y,X,A)∼D}[((Y − m̂(X)) − (A − ê(X)) τ_f(X))²]

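This two-step procedure can be sketched with scikit-learn (data generation, learners, and names are illustrative assumptions): nuisances m̂ and ê fitted on the train split, then candidate CATE models ranked by R-risk on a held-out split.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
X = rng.normal(size=(n, 2))
e = 1 / (1 + np.exp(-X[:, 0]))              # confounded treatment assignment
A = rng.binomial(1, e)
tau = X[:, 1]                                # true CATE, driven by X[:, 1]
Y = X[:, 0] + A * tau + 0.1 * rng.normal(size=n)

train, test = train_test_split(np.arange(n), test_size=0.1, random_state=0)

# 1. Nuisances estimated on the train set with standard ML tools
m_hat = GradientBoostingRegressor().fit(X[train], Y[train])
e_hat = GradientBoostingClassifier().fit(X[train], A[train])

def r_risk(tau_pred, idx):
    """R-risk of a candidate CATE model on held-out data."""
    resid_y = Y[idx] - m_hat.predict(X[idx])
    resid_a = A[idx] - e_hat.predict_proba(X[idx])[:, 1]
    return np.mean((resid_y - resid_a * tau_pred) ** 2)

# 2. Rank candidate CATE models on the 10% test split
candidates = {"oracle": tau[test], "zero effect": np.zeros(len(test))}
for name, tau_pred in candidates.items():
    print(name, r_risk(tau_pred, test))
```

In this toy setting the oracle CATE gets a lower R-risk than the "no effect" candidate, which is the ordering the procedure is meant to recover.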
  18. Empirical evaluation
    4 datasets:
    1 simulation, sampling many scenarios
    (response functions, treatment allocations)
    3 canonical real-world datasets
    with simulated counterfactuals

  19. Empirical evaluation: which risk?
    (Figure: relative (metric, risk) agreement, Kendall's τ, compared to the mean
    over all metrics, for the µ-risk, the IPW risks (estimated and oracle ⋆),
    the U-risk, and the R-risk, under strong and weak overlap, on Twins
    (N = 11,984), ACIC 2016 (N = 4,802), Caussim (N = 5,000),
    ACIC 2018 (N = 5,000))
    The R-risk is best: small price paid for estimated e and m,
    doubly robust properties


  22. Empirical evaluation: “details” of the procedure
    Which data to learn the nuisances e, m?
    3 splits: learn f̂ / learn ê and m̂ / compute risk
      Drawback: less data for learning
    or
    2 splits: learn f̂, ê, and m̂ on the same data
      Drawback: correlated errors
    2 splits ✓
    What train/test fraction?
    More data to train, or to test? 10% left-out data for test ✓
    How to choose models for e and m?
    Which learner? Standard model selection ✓

  23. Prediction to support decisions
    When predictors should be causal,
    normal cross-validation is not suited
    Use the R-risk, with nuisances estimated on the train set
    [Doutreligne and Varoquaux 2023]

  24. 2 Good choice of variables to generalize
    From identifiability to a bias-variance tradeoff
    [Colnet... 2023a]
    Reweighting the RCT for generalization:
    finite sample error and variable selection


  25. Transporting causal effects: RCT and target data
    Source data (RCT): estimate causal effects
    Target data: make decisions

  26. Transporting causal effects: principle
    Model treatment heterogeneity
    to apply it on a shifted population
    Review: [Colnet... 2024]

  27. Assumptions needed for identifiability (oracle)
    1. Internal validity of the trial
    2. Overlap between trial and target populations
    3. Transportability / conditional ignorability:
    E_R[Y(1) − Y(0) | X = x] = E_T[Y(1) − Y(0) | X = x]
    (R: trial distribution, T: target distribution)
    The covariates X capture all systematic variations in treatment effect
    Which covariates?
    mod: effect modifiers; shift: shifted between populations
    Necessary: mod ∩ shift, the variables that are both shifted and modify the effect
    In practice, people include more, to be safe

  28. Reweighting estimator (IPSW)
    Re-weight the trial data:
    τ̂_{n,m} = (1/n) Σ_{i∈Trial} ŵ_{n,m}(X_i) · (Y_i A_i − Y_i (1 − A_i))
    (transport weights × trial estimate)
    Consistency of the non-oracle IPSW under assumptions 1, 2, 3,
    for categorical covariates [Colnet... 2023a]

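With categorical covariates, the transport weights are plug-in frequency ratios p̂_T(x)/p̂_R(x). A toy IPSW sketch, where the populations, effect sizes, and the π = 1/2 Horvitz-Thompson form of the trial estimate are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target population: binary covariate x, shifted vs the trial
m_target = 10000
x_t = rng.binomial(1, 0.8, m_target)           # 80% have x = 1

# Trial: different covariate distribution, randomized treatment
n = 5000
x_r = rng.binomial(1, 0.3, n)                  # only 30% have x = 1
a = rng.binomial(1, 0.5, n)                    # randomization π = 1/2
tau_x = np.where(x_r == 1, 2.0, 1.0)           # x modifies the effect
y = a * tau_x + 0.1 * rng.normal(size=n)

# Transport weights: frequency ratios p_T(x) / p_R(x)
p_t = np.array([np.mean(x_t == 0), np.mean(x_t == 1)])
p_r = np.array([np.mean(x_r == 0), np.mean(x_r == 1)])
w = (p_t / p_r)[x_r]

# IPSW: reweighted Horvitz-Thompson trial estimate (π = 1/2)
tau_trial = np.mean((y * a - y * (1 - a)) / 0.5)
tau_ipsw = np.mean(w * (y * a - y * (1 - a)) / 0.5)
print(tau_trial, tau_ipsw)   # trial ATE ≈ 1.3, target ATE ≈ 1.8
```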
  29. Finite-sample error of IPSW
    Quadratic risk of the non-oracle IPSW for n trial and m target samples:
    E[(τ̂_{π,n,m} − τ)²] ≤ 2 V_so / (n + 1) + Var[τ(X)] / m
      + (2 / (m (n + 1))) E_R[ p_T(X)(1 − p_T(X)) / p_R(X)² · V_HT(X) ]
      + 2 (1 − min_x p_R(x))^n E_T[τ(X)²] (1 + 2/m)
    [Colnet... 2023a]

  30. Adding shifted covariates which are not treatment effect modifiers
    Add to the minimal set X a set V of shifted non-modifiers:
    makes estimating the propensity weights harder
    lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
      = ( Σ_{v∈V} p_T(v)² / p_R(v) ) · lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
    The stronger the shift, the bigger the variance inflation
    [Colnet... 2023a]

  31. Adding treatment effect modifiers which are not shifted
    Add to the minimal set X a set V of non-shifted modifiers:
    gives less variance on τ
    lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
      = lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
      − E_R[ p_T(X) / p_R(X) · Var[τ(X, V) | X] ]
    The stronger the variance explained by V,
    the bigger the variance deflation
    [Colnet... 2023a]

  32. Covariate choice
    Types of covariates:
    shift: covariates shifted between populations
    mod: treatment effect modifiers
    To reweight / stratify for transport:
    - Minimal set: shift ∩ mod
    - Reduced variance: mod
    [Colnet... 2023a]

  33. 3 Good choice of summary measure
    Risk ratio, odds ratio, risk difference...
    which causal measure is easier to generalize?
    [Colnet... 2023b]

  34. Summarizing across individuals
    Even individualizing requires local averaging

  35. Summary metrics
    Risk difference: τ_RD := E[Y(1)] − E[Y(0)]
    Risk ratio: τ_RR := E[Y(1)] / E[Y(0)]
    For binary outcomes (in addition to the above):
    Survival ratio: τ_SR := P(Y(1) = 0) / P(Y(0) = 0)
    Odds ratio: τ_OR := (P[Y(1) = 1] / P[Y(1) = 0]) · (P[Y(0) = 1] / P[Y(0) = 0])^{−1}

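All four measures follow from the two potential-outcome risks P(Y(1)=1) and P(Y(0)=1); a small helper makes this concrete (the function name and the input risks are mine, for illustration):

```python
def causal_measures(p1, p0):
    """Summary causal measures from p1 = P(Y(1)=1) and p0 = P(Y(0)=1)."""
    return {
        "RD": p1 - p0,                                # risk difference
        "RR": p1 / p0,                                # risk ratio
        "SR": (1 - p1) / (1 - p0),                    # survival ratio
        "OR": (p1 / (1 - p1)) / (p0 / (1 - p0)),      # odds ratio
    }

print(causal_measures(0.06, 0.10))
```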
  36. Summary metrics and population heterogeneity
    Full population versus subpopulations:

                          τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)    −0.0452   0.6    1.05   0.57
    Moderate: X = 1       −0.006    0.6    1.01   0.6
    Mild: X = 0           −0.080    0.6    1.1    0.545

    Stroke outcome, data from [MacMahon... 1990]
    We want the summary metrics most stable across subgroups


  38. Summary metrics: appealing properties
    Collapsibility:
    τ_RD = p_S(X = 1) · τ_RD(X = 1) + p_S(X = 0) · τ_RD(X = 0)
    (p_S: proportion of individuals with X = 1, resp. X = 0, in P_S)
    The total measure is a weighted combination of sub-population measures
    Logic-respecting:
    min_x τ(x) ≤ τ ≤ max_x τ(x)
    The odds ratio is not logic-respecting (thus not collapsible)

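Both properties can be checked numerically on made-up subgroup risks: the population risk difference equals the weighted combination of subgroup risk differences, while a constant subgroup odds ratio does not even bound the population odds ratio (the numbers below are illustrative, not the slides' data):

```python
# Two equally sized subgroups with distinct baseline risks
p_s = [0.5, 0.5]                   # subgroup proportions
p0 = [0.5, 0.8]                    # P(Y(0)=1) per subgroup
p1 = [0.2, 0.5]                    # P(Y(1)=1) per subgroup

# Population-level risks are mixtures of subgroup risks
P0 = sum(w * p for w, p in zip(p_s, p0))   # 0.65
P1 = sum(w * p for w, p in zip(p_s, p1))   # 0.35

# Risk difference: population value = weighted subgroup values (collapsible)
rd_pop = P1 - P0
rd_mix = sum(w * (a - b) for w, a, b in zip(p_s, p1, p0))
print(rd_pop, rd_mix)              # both −0.3

# Odds ratio: 0.25 in both subgroups, yet ≈ 0.29 in the population,
# outside [min, max] of the subgroup values: not logic-respecting
def odds_ratio(a, b):
    return (a / (1 - a)) / (b / (1 - b))

or_sub = [odds_ratio(a, b) for a, b in zip(p1, p0)]
or_pop = odds_ratio(P1, P0)
print(or_sub, or_pop)
```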
  39. Transporting causal measures across populations
    Transporting a collapsible measure τ from S to T:
    τ_T = E_S[ (p_T(X) / p_S(X)) · g_T(Y(0), X) · τ_S(X) ]
    (re-weighting: population ratio × collapsibility weights × τ)
    X contains all the covariates that are both shifted
    and treatment effect modifiers

  40. Separating baseline from treatment effect
    Directions of heterogeneity?
    Need only covariates that explain τ,
    and not the full variance of Y(0), Y(1)

  41. For binary outcomes
    You only die once
    Waging war makes smoking comparatively less risky
    Probabilities are not additive

  42. For binary outcomes: a decomposition
    P(Y(a) = 1 | X = x) = b(x) + a [ (1 − b(x)) m_b(x) − b(x) m_g(x) ]
    where m_g(x) := P(Y(1) = 0 | Y(0) = 1, X = x)   (good effect)
    and m_b(x) := P(Y(1) = 1 | Y(0) = 0, X = x)   (bad effect)
    (“good”: the untreated has the outcome, the treated does not)
    Decomposed summary metrics:
    τ_RD = E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)]
    τ_NNT = 1 / ( E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)] )
    τ_RR = 1 + E[(1 − b(X)) m_b(X)] / E[b(X)] − E[b(X) m_g(X)] / E[b(X)]
    τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)] + E[b(X) m_g(X)] / E[1 − b(X)]

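The decomposition can be sanity-checked for one stratum x: writing P(Y(1) = 1 | x) by conditioning on Y(0) gives the same value as the formula (the values of b, m_g, m_b below are made up):

```python
# Baseline risk and flip probabilities for one stratum x (made-up values)
b, m_g, m_b = 0.4, 0.5, 0.1

# P(Y(1)=1 | x) from the decomposition
p1 = b + (1 - b) * m_b - b * m_g

# Same quantity from first principles, conditioning on Y(0):
# treated outcome is 1 if (Y(0)=1 and no "good" flip) or (Y(0)=0 and a "bad" flip)
p1_direct = b * (1 - m_g) + (1 - b) * m_b
print(p1, p1_direct)    # identical, 0.26 here
```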
  43. For binary outcomes: a decomposition with monotone effect
    Assuming the treatment is beneficial (∀x, m_b(x) = 0):
    the risk ratio simplifies: τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
    - No general separation of b and m ⌢
    - On covariates X affecting only the baseline level b ⌣:
    τ_RR(x) = 1 − b(x) m_g(x) / b(x) = 1 − m_g(x)
    constant effect on subgroups stratified by covariates affecting only the baseline

  44. For binary outcomes: a decomposition with monotone effect
    Assuming the treatment is beneficial (∀x, m_b(x) = 0):
    the risk ratio simplifies: τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
    Assuming the treatment is harmful (∀x, m_g(x) = 0):
    the survival ratio simplifies: τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)]
    An age-old question [Sheps 1958]:
    Shall We Count the Living or the Dead?

  45. Choice of covariates to transport causal effects
    A good metric separates baseline risk from treatment modulation
    Continuous outcome:
    the risk difference is collapsible and separates out the baseline risk
    ⇒ need only the shifted covariates that modulate the difference
    Binary outcome:
    beneficial effect ⇒ risk ratio
    ⇒ need only the shifted covariates that modulate the benefit
    harmful effect ⇒ survival ratio
    ⇒ need only the shifted covariates that modulate the harm

  46. Choice of covariates to transport causal effects
    A good metric separates baseline risk from treatment modulation
    Important because it defines heterogeneity across individuals:

                          τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)    −0.0452   0.6    1.05   0.57
    Moderate: X = 1       −0.006    0.6    1.01   0.6
    Mild: X = 0           −0.080    0.6    1.1    0.545

  47. The soda team: Machine learning for health and social sciences
    Machine learning for statistics
    Causal inference, biases, missing values
    Health and social sciences
    Epidemiology, education, psychology
    Tabular relational learning
    Relational databases, data lakes
    Data-science software
    scikit-learn, joblib, skrub

  48. Individualizing treatment effects
    Using predictors
    Be careful about model selection [Doutreligne and Varoquaux 2023]:
    - heterogeneous error between treated and non-treated
    - use the R-risk
    (also be careful not to use post-treatment information)
    Stratification in subgroups
    Use all covariates that capture effect heterogeneity,
    but not population heterogeneity [Colnet... 2023a]
    Good metrics separate baseline risk from treatment effect
    Binary outcomes require thinking [Colnet... 2023b]
    @GaelVaroquaux

  49. References I
    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Reweighting the RCT for generalization: finite sample error and variable selection. arXiv:2208.07614, 2023a.
    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize? arXiv:2303.16008, 2023b.
    B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 2024.
    M. Doutreligne and G. Varoquaux. How to select predictive models for decision making or causal inference? 2023. URL https://hal.science/hal-03946902.
    M. MacMahon, R. Peto, R. Collins, J. Godwin, J. Cutler, P. Sorlie, R. Abbott, J. Neaton, A. Dyer, and J. Stamler. Blood pressure, stroke, and coronary heart disease: part 1, prolonged differences in blood pressure: prospective observational studies corrected for the regression dilution bias. The Lancet, 335(8692):765–774, 1990.
    X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

  50. References II
    M. C. Sheps. Shall we count the living or the dead? New England Journal of Medicine, 259(25):1210–1214, 1958. doi: 10.1056/NEJM195812182592505. PMID: 13622912.
    G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
    G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
    S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.