Slide 1

Individualizing treatment effects: Transportability and model selection
Gaël Varoquaux

Slide 2

Causal effects on an individual
Relies on comparing the treated outcome vs the non-treated outcome.
In randomized controlled trials: treated and non-treated are statistically identical.
On observational data: extrapolate across similar treated and non-treated individuals.
Challenge: personalizing this population-level measure.
G Varoquaux 1

Slide 3

Automated decisions, but also policy recommendations
Many problems are better solved by organizational choices than by shiny magic boxes.
This requires solid evidence of causal effects.

Slide 4

Individualizing treatment effects
1. Good choice of predictors for counterfactuals
2. Good choice of variables to generalize
3. Good choice of summary measure

Slide 5

1. Good choice of predictors for counterfactuals
Model selection for causal machine learning [Doutreligne and Varoquaux 2023]: How to select predictive models for decision making or causal inference?

Slide 6

Intuitions: predictors and causal effects
Prognostic model: predicting a health outcome (outcome as a function of a health covariate).

Slide 7

Intuitions: predictors and causal effects
Prognostic model: predicting a health outcome (outcome as a function of a health covariate).
Prediction as a function of the intervention: untreated Y0(x) vs treated Y1(x).

Slide 8

Intuitions: predictors and causal effects
Prognostic model: predicting a health outcome (outcome as a function of a health covariate).
Prediction as a function of the intervention: untreated Y0(x) vs treated Y1(x).
For decisions, the individual treatment effect: comparing the predicted outcomes for the same individuals.
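The two-arm contrast above can be sketched in code. A minimal, hypothetical simulation (not the talk's code, and the learner choice is illustrative): fit one outcome model per treatment arm, then compare the two predictions for the same individuals.

```python
# Minimal sketch of individualized effects via two outcome models,
# on hypothetical simulated data (randomized treatment here).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
A = rng.integers(0, 2, size=n)           # binary treatment
tau = 1.0 + X[:, 0]                      # true heterogeneous effect
Y = X[:, 1] + A * tau + rng.normal(scale=0.1, size=n)

# One prediction function per intervention: untreated Y0(x), treated Y1(x)
mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0])
mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1])

# Individual treatment effect: contrast predictions for the SAME individuals
tau_hat = mu1.predict(X) - mu0.predict(X)
```

Here the treatment is randomized, so the contrast recovers the effect; the next slides show why, on observational data, model selection must change.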

Slide 9

Intuitions: causal model selection and distribution shift
[Figure: untreated outcome Y0(x) and treated outcome Y1(x) as a function of baseline health]
Healthy individuals did not receive the treatment.

Slide 10

Intuitions: causal model selection and distribution shift
[Figure: untreated outcome Y0(x), treated outcome Y1(x), and fitted models µ̂a(x), as a function of baseline health]
Healthy individuals did not receive the treatment.
The model associates treatment with negative outcomes.

Slide 11

Intuitions: causal model selection and distribution shift
[Figure: untreated outcome Y0(x), treated outcome Y1(x), and fitted models µ̂a(x), as a function of baseline health]
Healthy individuals did not receive the treatment.
The model associates treatment with negative outcomes.
A worse predictor gives better causal inference.

Slide 12

Intuitions: causal model selection and distribution shift
[Figure: untreated outcome Y0(x), treated outcome Y1(x), and fitted models µ̂a(x), as a function of baseline health]
Healthy individuals did not receive the treatment.
The model associates treatment with negative outcomes.
A worse predictor gives better causal inference.
Standard cross-validation / predictive accuracy is not a good criterion: errors on treated vs untreated outcomes must be weighted equally.

Slide 13

Formalism: potential outcomes
Potential outcomes: (Y0(X), Y1(X)) ∼ D⋆ (unobserved)
Observations: (Y, X, A) ∼ D (outcome, covariate, treatment)
Estimation goal, the CATE: τ(x) := E_{Y1,Y0∼D⋆}[Y1 − Y0 | X = x]

Slide 14

Causal risk: oracle vs observable
Using a predictor ŷ = f(X) to estimate the CATE induces a risk:
τ-risk(f) = E_{X∼p(X)}[(τ(X) − τ̂_f(X))²]   (oracle: uses τ)
Standard R² (ML practice): µ-risk(f) = E[(Y − f(X; A))²]   (on D, not D⋆; Y, not (Y0, Y1))
The challenge: compensate for the difference between D⋆ and D.

Slide 15

Causal risk: inverse propensity weighting
Propensity score: e(x) := P(A = 1 | X = x)
Adjusted risk (IPW) [Wager and Athey 2018]:
τ-risk⋆_IPW(f) = E[( Y (A − e(X)) / (e(X)(1 − e(X))) − τ_f(X) )²]

Slide 16

Causal risk: the R-risk
Lemma (R-decomposition), rewriting the outcome model:
y(a) = m(x) + (a − e(x)) τ(x) + ε(x; a)
Conditional mean outcome: m(x) := E_{Y∼D}[Y | X = x]
Propensity score: e(x) := P(A = 1 | X = x)
Adjusted risk, the R-risk [Nie and Wager 2021]:
R-risk(f) = E_{(Y,X,A)∼D}[( (Y − m(X)) − (A − e(X)) τ_f(X) )²]
What is the price to pay with estimated e and m?

Slide 17

Procedure: selecting predictors for causal inference
Model-selection procedure:
1. Fit m̂ and ê on the train set (with standard ML tools).
2. On the test set, use the adjusted ("doubly robust") risk:
R-risk(f) = E_{(Y,X,A)∼D}[( (Y − m̂(X)) − (A − ê(X)) τ_f(X) )²]
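The two-step procedure above can be sketched on hypothetical simulated data (learners and names are illustrative, not the paper's code): nuisances m̂ and ê are fit on the train set with standard ML tools, then candidate effect models are ranked by their R-risk on the held-out set.

```python
# Sketch of R-risk model selection for causal inference
# on hypothetical confounded simulated data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
e = 1 / (1 + np.exp(-X[:, 0]))           # confounded treatment assignment
A = rng.binomial(1, e)
tau = X[:, 1]                            # true CATE
Y = X[:, 0] + A * tau + rng.normal(scale=0.1, size=n)

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.1, random_state=0)

# 1. Nuisances on the train set, with standard ML tools
m_hat = GradientBoostingRegressor().fit(X[idx_tr], Y[idx_tr])
e_hat = LogisticRegression().fit(X[idx_tr], A[idx_tr])

def r_risk(tau_f, idx):
    """R-risk: mean of ((Y - m_hat(X)) - (A - e_hat(X)) * tau_f(X))**2."""
    res_y = Y[idx] - m_hat.predict(X[idx])
    res_a = A[idx] - e_hat.predict_proba(X[idx])[:, 1]
    return np.mean((res_y - res_a * tau_f(X[idx])) ** 2)

# 2. Rank candidate effect models on the 10% held-out test set
candidates = {"zero-effect": lambda x: 0.0 * x[:, 1],
              "true-shape": lambda x: x[:, 1]}
scores = {name: r_risk(f, idx_te) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

With reasonable nuisance fits, the candidate matching the true effect shape gets the lower R-risk.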

Slide 18

Empirical evaluation: 4 datasets
1 simulation, sampling many scenarios (response functions, treatment allocations)
3 canonical real-world datasets with simulated counterfactuals

Slide 19

Empirical evaluation: which risk?
[Figure: relative agreement (Kendall's τ with the oracle risk, compared to the mean over all metrics) for µ-risk, IPW risks, U-risk, and R-risk (feasible and oracle ⋆ variants), under strong and weak overlap, on Twins (N = 11,984), ACIC 2016 (N = 4,802), Caussim (N = 5,000), and ACIC 2018 (N = 5,000)]
The R-risk is best: a small price is paid for estimated e and m (doubly robust properties).

Slide 20

Empirical evaluation: "details" of the procedure
Which data to learn the nuisances e, m?
- 3 splits: learn f̂ / learn ê and m̂ / compute the risk. Drawback: less data for learning.
- 2 splits: learn f̂, ê, and m̂ on the same data. Drawback: correlated errors.
2 splits ✓

Slide 21

Empirical evaluation: "details" of the procedure
Which data to learn the nuisances e, m?
- 3 splits: learn f̂ / learn ê and m̂ / compute the risk. Drawback: less data for learning.
- 2 splits: learn f̂, ê, and m̂ on the same data. Drawback: correlated errors.
2 splits ✓
What train/test fraction? More data to train, or to test? 10% left-out data for test ✓

Slide 22

Empirical evaluation: "details" of the procedure
Which data to learn the nuisances e, m?
- 3 splits: learn f̂ / learn ê and m̂ / compute the risk. Drawback: less data for learning.
- 2 splits: learn f̂, ê, and m̂ on the same data. Drawback: correlated errors.
2 splits ✓
What train/test fraction? More data to train, or to test? 10% left-out data for test ✓
How to choose models for e and m? Which learner? Standard model selection ✓

Slide 23

Prediction to support decisions
When predictors should be causal, normal cross-validation is not suited.
Use the R-risk, with nuisances estimated on the train set [Doutreligne and Varoquaux 2023].

Slide 24

2. Good choice of variables to generalize
From identifiability to a bias-variance tradeoff [Colnet et al. 2023a]: Reweighting the RCT for generalization: finite sample error and variable selection

Slide 25

Transporting causal effects: RCT and target data
Source data (RCT): estimate causal effects. Target data: make decisions.

Slide 26

Transporting causal effects: principle
Model the treatment heterogeneity, to apply it on a shifted population.
Review: [Colnet et al. 2024]

Slide 27

Assumptions needed for identifiability (oracle)
1. Internal validity of the trial
2. Overlap between trial and target population
3. Transportability / conditional ignorability:
E_R[Y(1) − Y(0) | X = x] = E_T[Y(1) − Y(0) | X = x]
(R: trial distribution; T: target distribution)
The covariates X must capture all systematic variations in treatment effect.
Which covariates? mod: effect modifiers; shift: shifted between populations.
Necessary: the mod ∩ shift variables, those that are both shifted and modify the effect. In practice, people include more, to be safe.

Slide 28

Reweighting estimator (IPSW)
Re-weight the trial data:
τ̂_{n,m} = (1/n) Σ_{i ∈ Trial} ŵ_{n,m}(X_i) · [Y_i A_i − Y_i (1 − A_i)]
where the ŵ_{n,m} are the transport weights and the bracketed term is the trial estimate.
Consistency of the non-oracle IPSW holds under assumptions 1, 2, 3, for categorical covariates [Colnet et al. 2023a].
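A minimal sketch of IPSW with one binary (categorical) covariate, on hypothetical simulated data: the transport weights are the target-to-trial frequency ratio of each stratum, applied to the Horvitz-Thompson trial term (randomization probability π = 1/2 here).

```python
# IPSW sketch: reweight a randomized trial toward a shifted target
# population, with a single binary effect-modifying covariate.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20_000, 20_000
X_trial = rng.binomial(1, 0.3, size=n)    # 30% X=1 in the trial
X_target = rng.binomial(1, 0.7, size=m)   # 70% X=1 in the target: shifted
tau = lambda x: 1.0 + 2.0 * x             # X modifies the effect

A = rng.binomial(1, 0.5, size=n)          # randomized treatment, pi = 1/2
Y = A * tau(X_trial) + rng.normal(scale=0.1, size=n)

# Transport weights: target frequency / trial frequency, per stratum
p_T = np.bincount(X_target, minlength=2) / m
p_R = np.bincount(X_trial, minlength=2) / n
w = (p_T / p_R)[X_trial]

# Reweighted Horvitz-Thompson estimate, targeting E_T[tau(X)]
tau_ipsw = np.mean(w * (Y * A / 0.5 - Y * (1 - A) / 0.5))
```

The unweighted trial estimate would be near 1 + 2·0.3 = 1.6, whereas the reweighted one is near the target value 1 + 2·0.7 = 2.4.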

Slide 29

Finite-sample error of IPSW
Quadratic risk of the non-oracle IPSW, for n trial and m target samples:
E[(τ̂_{π,n,m} − τ)²] ≤ 2 V_so / (n + 1) + Var[τ(X)] / m
  + 2 / (m(n + 1)) · E_R[ p_T(X) (1 − p_T(X)) / p_R(X)² · V_HT(X) ]
  + 2 (1 − min_x p_R(x))^n · (1 + 2/m) · E_T[τ(X)²]
[Colnet et al. 2023a]

Slide 30

Adding shifted covariates that are not treatment effect modifiers
Adding V (shifted non-modifiers) to the minimal set X makes estimating the propensity weights harder:
lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
  = Σ_{v∈V} p_T(v)² / p_R(v) · lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
The stronger the shift, the bigger the variance inflation. [Colnet et al. 2023a]

Slide 31

Adding treatment effect modifiers that are not shifted
Adding V (non-shifted modifiers) to the minimal set X gives less variance on τ:
lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X, V)]   (variance with added V)
  = lim_{n→∞} n Var_R[τ̂⋆_{T,n}(X)]   (variance without added V)
  − E_R[ p_T(X) / p_R(X) · Var[τ(X, V) | X] ]
The stronger the variance explained by V, the bigger the variance deflation. [Colnet et al. 2023a]

Slide 32

Covariate choice
Types of covariates: shift (covariates shifted between populations), mod (treatment effect modifiers).
To reweight / stratify for transport:
- Minimal set: shift ∩ mod
- Reduced variance: mod
[Colnet et al. 2023a]

Slide 33

3. Good choice of summary measure
Risk ratio, odds ratio, risk difference... [Colnet et al. 2023b]: Which causal measure is easier to generalize?

Slide 34

Summarizing across individuals
Even individualizing requires local averaging.

Slide 35

Summary metrics
Risk difference: τ_RD := E[Y(1)] − E[Y(0)]
Risk ratio: τ_RR := E[Y(1)] / E[Y(0)]
For binary outcomes (in addition to the above):
Survival ratio: τ_SR := P(Y(1) = 0) / P(Y(0) = 0)
Odds ratio: τ_OR := [P(Y(1) = 1) / P(Y(1) = 0)] / [P(Y(0) = 1) / P(Y(0) = 0)]
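The four definitions above, computed on a pair of hypothetical marginals P(Y(1)=1) = 0.55 and P(Y(0)=1) = 0.35 (illustrative numbers, not from the talk):

```python
# The four summary metrics for binary outcomes, from the two marginals
p1, p0 = 0.55, 0.35            # P(Y(1)=1), P(Y(0)=1): made-up numbers

tau_rd = p1 - p0               # risk difference
tau_rr = p1 / p0               # risk ratio
tau_sr = (1 - p1) / (1 - p0)   # survival ratio
odds = lambda p: p / (1 - p)
tau_or = odds(p1) / odds(p0)   # odds ratio
```

Each metric summarizes the same pair of marginals differently; the following slides ask which one is most stable across subgroups.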

Slide 36

Summary metrics and population heterogeneity
Full population versus subpopulations (stroke outcome, data from [MacMahon et al. 1990]):

                     τ_RD      τ_RR   τ_SR   τ_OR
All patients (P_S)   −0.0452   0.6    1.05   0.57
Moderate: X = 1      −0.006    0.6    1.01   0.6
Mild: X = 0          −0.080    0.6    1.1    0.545

We want the summary metrics that are most stable across subgroups.

Slide 37

Summary metrics: appealing properties
Collapsibility:
τ_RD = p_S(X = 1) · τ_RD(X = 1) + p_S(X = 0) · τ_RD(X = 0)
where p_S(X = x) is the proportion of individuals with X = x in P_S.
The total measure is a weighted combination of the sub-population measures.

Slide 38

Summary metrics: appealing properties
Collapsibility:
τ_RD = p_S(X = 1) · τ_RD(X = 1) + p_S(X = 0) · τ_RD(X = 0)
where p_S(X = x) is the proportion of individuals with X = x in P_S.
The total measure is a weighted combination of the sub-population measures.
Logic-respecting: min_x τ(x) ≤ τ ≤ max_x τ(x)
The odds ratio is not logic-respecting (and thus not collapsible).
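A numeric check with made-up subgroup risks (not the talk's data): the risk difference collapses exactly, while the population odds ratio even falls outside the range of the subgroup odds ratios, so it is not logic-respecting.

```python
# Hypothetical outcome probabilities in two equal-sized subgroups
p1 = {"mild": 0.8, "severe": 0.3}    # P(Y(1)=1 | subgroup)
p0 = {"mild": 0.6, "severe": 0.1}    # P(Y(0)=1 | subgroup)
w = {"mild": 0.5, "severe": 0.5}     # subgroup proportions

odds = lambda p: p / (1 - p)
P1 = sum(w[g] * p1[g] for g in w)    # population P(Y(1)=1)
P0 = sum(w[g] * p0[g] for g in w)    # population P(Y(0)=1)

# Collapsible: population RD equals the weighted subgroup RDs
rd_pop = P1 - P0
rd_mix = sum(w[g] * (p1[g] - p0[g]) for g in w)

# Not logic-respecting: population OR lies below BOTH subgroup ORs
or_pop = odds(P1) / odds(P0)
or_sub = {g: odds(p1[g]) / odds(p0[g]) for g in w}
```

Here rd_pop equals rd_mix (0.2 each), but or_pop ≈ 2.27 while both subgroup odds ratios exceed 2.6.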

Slide 39

Transporting causal measures across populations
Transporting a collapsible measure τ from S to T:
τ_T = E_S[ (p_T(X) / p_S(X)) · g_T(Y(0), X) · τ_S(X) ]
where p_T(X)/p_S(X) is the re-weighting population ratio and g_T gives the collapsibility weights of τ.
X contains all the covariates that are both shifted and treatment effect modifiers.

Slide 40

Separating baseline from treatment effect
Directions of heterogeneity? We need only the covariates that explain τ, not the full variance of Y(0), Y(1).

Slide 41

For binary outcomes
You only die once: waging war makes smoking comparatively less risky. Probabilities are not additive.

Slide 42

For binary outcomes: a decomposition
P(Y(a) = 1 | X = x) = b(x) + a [(1 − b(x)) m_b(x) − b(x) m_g(x)]
where m_g(x) := P(Y(1) = 0 | Y(0) = 1, X = x) (good effect) and m_b(x) := P(Y(1) = 1 | Y(0) = 0, X = x) (bad effect).
"Good": the untreated has the outcome, the treated does not.
Decompose the summary metrics:
τ_RD = E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)]
τ_NNT = 1 / ( E[(1 − b(X)) m_b(X)] − E[b(X) m_g(X)] )
τ_RR = 1 + E[(1 − b(X)) m_b(X)] / E[b(X)] − E[b(X) m_g(X)] / E[b(X)]
τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)] + E[b(X) m_g(X)] / E[1 − b(X)]
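A numeric check of the decomposition with made-up b, m_b, m_g over two strata: building P(Y(1)=1 | X) from the decomposition recovers τ_RD = E[(1 − b) m_b] − E[b · m_g].

```python
# Verify the binary-outcome decomposition on hypothetical numbers
import numpy as np

b  = np.array([0.2, 0.5])   # baseline risk P(Y(0)=1 | X)
mb = np.array([0.1, 0.3])   # bad effect  P(Y(1)=1 | Y(0)=0, X)
mg = np.array([0.4, 0.2])   # good effect P(Y(1)=0 | Y(0)=1, X)
p  = np.array([0.5, 0.5])   # distribution of X

# P(Y(1)=1 | X) implied by the decomposition
p1 = (1 - b) * mb + b * (1 - mg)

tau_rd = np.sum(p * p1) - np.sum(p * b)                   # E[Y(1)] - E[Y(0)]
decomposed = np.sum(p * (1 - b) * mb) - np.sum(p * b * mg)
```

The direct risk difference and its decomposed form agree term by term.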

Slide 43

For binary outcomes: the decomposition with a monotone effect
Assuming the treatment is beneficial (∀x, m_b(x) = 0), the risk ratio simplifies:
τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
- No general separation of b and m ⌢
- On covariates X affecting only the baseline level b ⌣:
τ_RR(x) = 1 − b(x) m_g(x) / b(x) = 1 − m_g(x)
i.e. a constant effect on subgroups stratified by covariates X that affect only the baseline.

Slide 44

For binary outcomes: the decomposition with a monotone effect
Assuming the treatment is beneficial (∀x, m_b(x) = 0), the risk ratio simplifies:
τ_RR = 1 − E[b(X) m_g(X)] / E[b(X)]
Assuming the treatment is harmful (∀x, m_g(x) = 0), the survival ratio simplifies:
τ_SR = 1 − E[(1 − b(X)) m_b(X)] / E[1 − b(X)]
An age-old question [Sheps 1958]: Shall We Count the Living or the Dead?

Slide 45

Choice of covariates to transport causal effects
A good metric separates the baseline risk from the treatment modulation.
Continuous outcome: the risk difference is collapsible and separates out the baseline risk ⇒ only the shifted covariates that modulate the difference are needed.
Binary outcome:
- Beneficial effect ⇒ risk ratio ⇒ only the shifted covariates that modulate the benefit are needed.
- Harmful effect ⇒ survival ratio ⇒ only the shifted covariates that modulate the harm are needed.

Slide 46

Choice of covariates to transport causal effects
A good metric separates the baseline risk from the treatment modulation. This matters because the metric defines heterogeneity across individuals:

                     τ_RD      τ_RR   τ_SR   τ_OR
All patients (P_S)   −0.0452   0.6    1.05   0.57
Moderate: X = 1      −0.006    0.6    1.01   0.6
Mild: X = 0          −0.080    0.6    1.1    0.545

Slide 47

The soda team: machine learning for health and social sciences
Machine learning for statistics: causal inference, biases, missing values
Health and social sciences: epidemiology, education, psychology
Tabular relational learning: relational databases, data lakes
Data-science software: scikit-learn, joblib, skrub

Slide 48

Individualizing treatment effects
Using predictors: be careful about model selection [Doutreligne and Varoquaux 2023]
- errors are heterogeneous between treated and non-treated
- use the R-risk (and be careful not to use post-treatment information)
Stratification in subgroups: use all the covariates that capture effect heterogeneity, but not population heterogeneity [Colnet et al. 2023a]
Good metrics separate the baseline risk from the treatment effect; binary outcomes require thinking [Colnet et al. 2023b]
@GaelVaroquaux

Slide 49

References I
B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Reweighting the RCT for generalization: finite sample error and variable selection. arXiv:2208.07614, 2023a.
B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize? arXiv:2303.16008, 2023b.
B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 2024.
M. Doutreligne and G. Varoquaux. How to select predictive models for decision making or causal inference? 2023. URL https://hal.science/hal-03946902.
S. MacMahon, R. Peto, R. Collins, J. Godwin, J. Cutler, P. Sorlie, R. Abbott, J. Neaton, A. Dyer, and J. Stamler. Blood pressure, stroke, and coronary heart disease: part 1, prolonged differences in blood pressure: prospective observational studies corrected for the regression dilution bias. The Lancet, 335(8692):765–774, 1990.
X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

Slide 50

References II
M. C. Sheps. Shall we count the living or the dead? New England Journal of Medicine, 259(25):1210–1214, 1958. doi: 10.1056/NEJM195812182592505.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.