Individualizing treatment effects: transportability and model selection

Gael Varoquaux
September 22, 2023

For efficient interventions, we would like to know the causal effect of the intervention on a given individual: the individual treatment effect. Given the proper set of covariates, this quantity can be computed with machine-learning models, by contrasting the predicted outcome for the individual with and without the treatment. I will analyse in detail how to best compute such quantities: which covariates to choose to minimize the variance, how to empirically select the best machine-learning model, and which population-level summaries of the treatment effect are least sensitive to heterogeneity.

Transcript

  1. Causal effects on an individual

    Relies on comparing treated outcome vs non-treated outcome.
    In Randomized Control Trials: treated and non-treated are statistically identical.
    On observational data: extrapolate across similar treated and non-treated individuals.
    Challenge: personalizing this population measure.
  2. Automated decisions

    But also policy recommendation.
    Many problems are better solved by organizational choices than by shiny magic boxes.
    Requires solid evidence of causal effects.
  3. Individualizing treatment effects

    1 Good choice of predictors for counterfactuals
    2 Good choice of variables to generalize
    3 Good choice of summary measure
  4. 1 Good choice of predictors for counterfactuals

    Model selection for causal machine learning
    [Doutreligne and Varoquaux 2023] How to select predictive models for decision making or causal inference?
  6. Intuitions: Predictors and causal effects

    Prognostic model: predicting a health outcome.
    [Figure: outcome against a health covariate; prediction as a function of the intervention (untreated $Y_0(x)$ vs treated $Y_1(x)$)]
    For decisions: individual treatment effect, comparing predicted outcomes for the same individuals.
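
The contrast of predicted outcomes can be sketched as follows. This is a minimal illustration on simulated data, not the method of the talk: the data-generating process, the choice of one linear model per treatment arm, and the helper names `fit_line` and `predict_line` are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)        # one health covariate
a = rng.integers(0, 2, size=n)           # randomized treatment indicator (0 or 1)
y0 = 1.0 + 2.0 * x                       # untreated potential outcome Y0(x)
y1 = y0 + 0.5 + x                        # treated potential outcome Y1(x)
y = np.where(a == 1, y1, y0) + rng.normal(0.0, 0.1, size=n)  # observed outcome


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_line(coef, x):
    """Evaluate the fitted line at new covariate values."""
    return coef[0] + coef[1] * x


# One outcome model per treatment arm (an illustrative choice).
coef_untreated = fit_line(x[a == 0], y[a == 0])
coef_treated = fit_line(x[a == 1], y[a == 1])

# Individual treatment effect: contrast both predictions for the same individuals.
x_new = np.array([0.2, 0.8])
ite_hat = predict_line(coef_treated, x_new) - predict_line(coef_untreated, x_new)
ite_true = 0.5 + x_new                   # known only because the data are simulated
print(ite_hat, ite_true)
```

Because treatment is randomized in this simulation, both arms cover the same covariate range, which is exactly what fails on observational data.
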
  10. Intuitions: causal model selection & distribution shift

    [Figure: untreated outcome $Y_0(x)$, treated outcome $Y_1(x)$ and fitted model $\hat{\mu}_a(x)$, plotted against baseline health]
    Healthy individuals did not receive the treatment.
    The model associates treatment to negative outcomes.
    A worse predictor gives better causal inference.
    Standard cross-validation / predictive accuracy is not good: must weight equally errors on treated vs untreated outcome.
  11. Formalism: potential outcomes

    [Figure: outcome against a health covariate]
    Potential outcomes: $(y_0(X), y_1(X)) \sim \mathcal{D}^\star$ (unobserved)
    Observations: $(Y, X, A) \sim \mathcal{D}$ (outcome, covariate, treatment)
    Estimation goal, the CATE: $\tau(x) \overset{\text{def}}{=} \mathbb{E}_{Y_1, Y_0 \sim \mathcal{D}^\star}[Y_1 - Y_0 \mid X = x]$
  12. Causal risk: oracle vs observable

    Using a predictor $\hat{y} = f(X)$ to estimate the CATE induces a risk:
    $\tau\text{-risk}(f) = \mathbb{E}_{X \sim p(X)}\left[(\tau(X) - \hat{\tau}_f(X))^2\right]$ (oracle: uses $\tau$)
    Standard $R^2$ (ML practice): $\mu\text{-risk}(f) = \mathbb{E}\left[(Y - f(X; A))^2\right]$, on $\mathcal{D}$ not $\mathcal{D}^\star$, $Y$ not $(Y_0, Y_1)$
    The challenge: compensate for the difference between $\mathcal{D}^\star$ and $\mathcal{D}$.
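
A toy illustration of why the observable µ-risk cannot replace the oracle τ-risk. The four individuals, their outcomes and the two candidate models below are hypothetical, constructed here so that healthy individuals are untreated and sick individuals are treated; the function names are likewise assumptions.

```python
import numpy as np


def tau_risk(tau_true, mu0_hat, mu1_hat):
    """Oracle risk: squared error between the true and the estimated CATE."""
    tau_hat = mu1_hat - mu0_hat
    return float(np.mean((tau_true - tau_hat) ** 2))


def mu_risk(y, a, mu0_hat, mu1_hat):
    """Observable risk: squared error on the factual (observed) outcome only."""
    factual_prediction = np.where(a == 1, mu1_hat, mu0_hat)
    return float(np.mean((y - factual_prediction) ** 2))


# Two healthy untreated individuals (Y0 = 0, Y1 = 1) and two sick treated ones
# (Y0 = 2, Y1 = 3): the true effect is 1 for everybody.
a = np.array([0, 0, 1, 1])
y = np.array([0.0, 0.0, 3.0, 3.0])
tau_true = np.ones(4)

# Model f ignores baseline health; model g accounts for it.
model_f = (np.zeros(4), np.full(4, 3.0))                                  # (mu0_hat, mu1_hat)
model_g = (np.array([0.0, 0.0, 2.0, 2.0]), np.array([1.0, 1.0, 3.0, 3.0]))

for name, (mu0_hat, mu1_hat) in {"f": model_f, "g": model_g}.items():
    print(name, mu_risk(y, a, mu0_hat, mu1_hat), tau_risk(tau_true, mu0_hat, mu1_hat))
```

Both models fit the observed outcomes perfectly, so the µ-risk cannot tell them apart, yet only model g recovers the treatment effect.
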
  13. Causal risk: inverse propensity weighting

    (Propensity score) $e(x) \overset{\text{def}}{=} \mathbb{P}[A = 1 \mid X = x]$
    Adjusted risk, IPW [Wager and Athey]:
    $\tau\text{-risk}^\star_{IPW}(f) = \mathbb{E}\left[\left(Y \frac{A - e(X)}{e(X)(1 - e(X))} - \tau_f(X)\right)^2\right]$
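
The IPW-adjusted risk can be sketched in a few lines. The simulation (known propensity score, constant effect of 2) and the name `ipw_tau_risk` are assumptions for illustration; with an estimated propensity score the behaviour may differ.

```python
import numpy as np


def ipw_tau_risk(y, a, e, tau_hat):
    """IPW-adjusted risk: squared error between the IPW pseudo-outcome and tau_hat."""
    pseudo_outcome = y * (a - e) / (e * (1.0 - e))
    return float(np.mean((pseudo_outcome - tau_hat) ** 2))


rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(0.0, 1.0, size=n)
e = 0.3 + 0.4 * x                            # known propensity score, within [0.3, 0.7]
a = (rng.uniform(size=n) < e).astype(float)  # treatment depends on the covariate
tau_true = 2.0
y = x + a * tau_true + rng.normal(0.0, 0.5, size=n)

risk_good = ipw_tau_risk(y, a, e, np.full(n, 2.0))  # candidate with the right effect
risk_bad = ipw_tau_risk(y, a, e, np.zeros(n))       # candidate predicting no effect
print(risk_good, risk_bad)
```

The pseudo-outcome is noisy, so both risks are large in absolute value; what matters for model selection is their ordering.
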
  14. Causal risk: the R-risk

    Lemma, rewriting of the outcome model (R-decomposition):
    $y(a) = m(x) + (a - e(x))\,\tau(x) + \varepsilon(x; a)$
    (Conditional mean outcome) $m(x) \overset{\text{def}}{=} \mathbb{E}_{Y \sim \mathcal{D}}[Y \mid X = x]$
    (Propensity score) $e(x) \overset{\text{def}}{=} \mathbb{P}[A = 1 \mid X = x]$
    Adjusted risk, R-risk [Nie and Wager 2021]:
    $R\text{-risk}(f) = \mathbb{E}_{(Y,X,A) \sim \mathcal{D}}\left[\left((Y - m(X)) - (A - e(X))\,\tau_f(X)\right)^2\right]$
    What’s the price to pay with estimated $e$ and $m$?
  15. Procedure: Selecting predictors for causal inference

    Model-selection procedure:
    1. Compute $m$ and $e$ on the train set (with standard ML tools)
    2. On the test set, use the adjusted risk (“doubly robust”):
       $R\text{-risk}(f) = \mathbb{E}_{(Y,X,A) \sim \mathcal{D}}\left[\left((Y - m(X)) - (A - e(X))\,\tau_f(X)\right)^2\right]$
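
The two-step procedure above can be sketched as follows. Several things here are assumptions made for a compact, dependency-light example: the simulated data, the bin-wise average `fit_binned_mean` used as a stand-in for a standard ML learner, and the two fixed candidate effect functions (in practice the candidates would be fitted models).

```python
import numpy as np


def r_risk(y, a, m_hat, e_hat, tau_hat):
    """R-risk: squared residual of the R-decomposition with plugged-in nuisances."""
    residual = (y - m_hat) - (a - e_hat) * tau_hat
    return float(np.mean(residual ** 2))


def fit_binned_mean(x, target, n_bins=10):
    """Stand-in learner: average of `target` within equal-width bins of x in [0, 1)."""
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    totals = np.bincount(bins, weights=target, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    means = totals / np.maximum(counts, 1)

    def predict(x_new):
        new_bins = np.minimum((x_new * n_bins).astype(int), n_bins - 1)
        return means[new_bins]

    return predict


rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(0.0, 1.0, size=n)
a = (rng.uniform(size=n) < 0.2 + 0.6 * x).astype(float)      # confounded treatment
y = 2.0 * x + a * (1.0 + x) + rng.normal(0.0, 0.5, size=n)   # true effect: 1 + x

# 90% of the data to train, 10% left out for the test set.
n_train = int(0.9 * n)
train, test = slice(0, n_train), slice(n_train, n)

# Step 1: nuisances m (mean outcome) and e (propensity) learned on the train set.
m_hat = fit_binned_mean(x[train], y[train])
e_hat = fit_binned_mean(x[train], a[train])

# Step 2: R-risk of each candidate CATE model, computed on the test set.
candidates = {
    "heterogeneous effect": lambda x: 1.0 + x,
    "no effect": lambda x: np.zeros_like(x),
}
risks = {
    name: r_risk(y[test], a[test], m_hat(x[test]), e_hat(x[test]), tau(x[test]))
    for name, tau in candidates.items()
}
selected = min(risks, key=risks.get)
print(risks, selected)
```
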
  16. Empirical evaluation

    4 datasets:
    - 1 simulation: sampling many scenarios (response functions, treatment allocations)
    - 3 canonical real-world datasets with simulated counterfactuals
  17. Empirical evaluation: which risk?

    [Figure: relative Kendall's rank agreement of each risk, compared to the mean over all metrics, in two panels (Strong Overlap, Weak Overlap). Risks compared include IPW, U-risk and R-risk variants, with and without ⋆. Datasets: Twins (N=11 984), ACIC 2016 (N=4 802), Caussim (N=5 000), ACIC 2018 (N=5 000)]
    The R-risk is best: small price paid for estimated $e$ and $m$, doubly robust properties.
  20. Empirical evaluation: “details” of the procedure

    Which data to learn the nuisances $e$, $m$?
    - 3 splits: learn $\hat{f}$ / learn $\hat{e}$ and $\hat{m}$ / compute risk. Drawback: less data for learning.
    - 2 splits: learn $\hat{f}$, $\hat{e}$, and $\hat{m}$ on the same data. Drawback: correlated errors.
    ✓ 2 splits
    What train/test fraction? More data to train, or to test?
    ✓ 10% left-out data for test
    How to choose models for $e$ and $m$? Which learner?
    ✓ Standard model selection
  21. Prediction to support decision

    When predictors should be causal, normal cross-validation is not suited.
    Use the R-risk with nuisances estimated on the train set [Doutreligne and Varoquaux 2023]
  22. 2 Good choice of variables to generalize

    From identifiability to a bias-variance tradeoff
    [Colnet... 2023a] Reweighting the RCT for generalization: finite sample error and variable selection
  23. Transporting causal effects: RCT and target data

    Source data (RCT): estimate causal effects.
    Target data: make decisions.
  24. Transporting causal effects: principle

    Model treatment heterogeneity, to apply on a shifted population.
    Review: [Colnet... 2024]
  25. Assumptions needed for identifiability (oracle)

    1. Internal validity of trial
    2. Overlap between trial and target population
    3. Transportability / Conditional Ignorability:
       $\mathbb{P}_R(Y(1) - Y(0) \mid X = x) = \mathbb{P}_T(Y(1) - Y(0) \mid X = x)$ ($R$: trial distribution, $T$: target distribution)
       The covariates $X$ capture all systematic variations in treatment effect.
    Which covariates?
    - mod: effect modifiers
    - shift: shifted between populations
    Necessary: mod ∩ shift, the variables that are shifted and modify the effect.
    People include more, to be safe.
  26. Reweighting estimator (IPSW)

    Re-weight the trial data:
    $\hat{\tau}_{n,m} = \frac{1}{n} \sum_{i \in \text{Trial}} \hat{w}_{n,m}(X_i)\,\big(Y_i A_i - Y_i (1 - A_i)\big)$
    where $\hat{w}_{n,m}(X_i)$ are the transport weights and $Y_i A_i - Y_i (1 - A_i)$ is the trial estimate.
    Consistency of non-oracle IPSW under assumptions 1, 2, 3, for categorical covariates [Colnet... 2023a]
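
A rough sketch of re-weighting a trial towards a target population with one categorical covariate. The simulated populations and the name `ipsw_estimate` are hypothetical. The sketch also assumes a known trial treatment probability `pi` and divides each arm by it (a Horvitz-Thompson form), a normalization that the slide does not spell out.

```python
import numpy as np


def ipsw_estimate(x_trial, a_trial, y_trial, x_target, pi=0.5):
    """Re-weighted trial estimate for a categorical covariate x."""
    n = len(x_trial)
    # Horvitz-Thompson contribution of each trial individual (assumes known pi).
    ht_terms = y_trial * (a_trial / pi - (1 - a_trial) / (1 - pi))
    weights = np.zeros(n)
    for category in np.unique(x_trial):
        in_category = x_trial == category
        p_trial = in_category.mean()               # estimated from the n trial samples
        p_target = (x_target == category).mean()   # estimated from the m target samples
        weights[in_category] = p_target / p_trial  # transport weight
    return float(np.mean(weights * ht_terms))


rng = np.random.default_rng(0)
n, m = 50000, 50000
x_trial = (rng.uniform(size=n) < 0.3).astype(int)    # 30% with X = 1 in the trial
x_target = (rng.uniform(size=m) < 0.7).astype(int)   # 70% with X = 1 in the target
a_trial = (rng.uniform(size=n) < 0.5).astype(int)    # randomized treatment, pi = 0.5
tau = np.where(x_trial == 1, 3.0, 1.0)               # X shifts and modifies the effect
y_trial = a_trial * tau + rng.normal(0.0, 0.5, size=n)

trial_ate = float(np.mean(y_trial * (a_trial / 0.5 - (1 - a_trial) / 0.5)))
target_ate = ipsw_estimate(x_trial, a_trial, y_trial, x_target)
print(trial_ate, target_ate)
```

In this simulation the average effect is 0.7 · 1 + 0.3 · 3 = 1.6 in the trial population and 0.3 · 1 + 0.7 · 3 = 2.4 in the target population, so the unweighted trial estimate would be misleading for the target.
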
  27. Finite-sample error of IPSW

    Quadratic risk of non-oracle IPSW for $n$ trial and $m$ target samples [Colnet... 2023a]:
    $\mathbb{E}\left[(\hat{\tau}_{\pi,n,m} - \tau)^2\right] \le \frac{2 V_{so}}{n + 1} + \frac{\operatorname{Var}[\tau(X)]}{m} + \frac{2}{m(n + 1)}\, \mathbb{E}_R\left[\frac{p_T(X)(1 - p_T(X))}{p_R(X)^2}\, V_{HT}(X)\right] + 2 \left(1 - \min_x p_R(x)\right)^n \mathbb{E}_T[\tau(X)^2] \left(1 + \frac{2}{m}\right)$
  28. Adding shifted covariates which are not treatment effect modifiers

    Add to the minimal set $X$ a set $V$ (shifted non-modifiers): makes estimating propensity weights harder.
    $\lim_{n \to \infty} n \operatorname{Var}_R\left[\hat{\tau}^*_{T,n}(X, V)\right] = \left(\sum_{v \in \mathcal{V}} \frac{p_T(v)^2}{p_R(v)}\right) \lim_{n \to \infty} n \operatorname{Var}_R\left[\hat{\tau}^*_{T,n}(X)\right]$
    (left: variance with added $V$; right: variance without added $V$)
    The stronger the shift, the bigger the variance inflation. [Colnet... 2023a]
  29. Adding treatment effect modifiers which are not shifted

    Add to the minimal set $X$ a set $V$ (modifiers non-shifted): gives less variance on $\tau$.
    $\lim_{n \to \infty} n \operatorname{Var}_R\left[\hat{\tau}^*_{T,n}(X, V)\right] = \lim_{n \to \infty} n \operatorname{Var}_R\left[\hat{\tau}^*_{T,n}(X)\right] - \mathbb{E}_R\left[\frac{p_T(X)}{p_R(X)} \operatorname{Var}[\tau(X, V) \mid X]\right]$
    (left: variance with added $V$; first term on the right: variance without added $V$)
    The stronger the variance explained by $V$, the bigger the variance deflation. [Colnet... 2023a]
  30. Covariates choice

    Types of covariates:
    - shift: covariates shifted between populations
    - mod: treatment effect modifiers
    To reweigh / stratify for transport:
    - Minimal set: shift ∩ mod
    - Reduced variance: mod
    [Colnet... 2023a]
  31. 3 Good choice of summary measure

    Risk ratio, odds ratio, risk difference...
    [Colnet... 2023b] Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize?
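  32. Summary metrics

    Risk difference: $\tau_{RD} := \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$
    Risk ratio: $\tau_{RR} := \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]}$
    For binary outcomes (in addition to the above):
    Survival ratio: $\tau_{SR} := \frac{\mathbb{P}[Y(1) = 0]}{\mathbb{P}[Y(0) = 0]}$
    Odds ratio: $\tau_{OR} := \frac{\mathbb{P}[Y(1) = 1]}{\mathbb{P}[Y(1) = 0]} \left(\frac{\mathbb{P}[Y(0) = 1]}{\mathbb{P}[Y(0) = 0]}\right)^{-1}$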
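
For a binary outcome, the four measures follow directly from the two risks. A small sketch, where the function name `summary_metrics` and the example risks (12% treated, 20% untreated) are hypothetical values chosen for illustration:

```python
def summary_metrics(p_treated, p_untreated):
    """Causal summary measures for a binary outcome.

    p_treated is P[Y(1) = 1] and p_untreated is P[Y(0) = 1].
    """
    odds_treated = p_treated / (1 - p_treated)
    odds_untreated = p_untreated / (1 - p_untreated)
    return {
        "RD": p_treated - p_untreated,               # risk difference
        "RR": p_treated / p_untreated,               # risk ratio
        "SR": (1 - p_treated) / (1 - p_untreated),   # survival ratio
        "OR": odds_treated / odds_untreated,         # odds ratio
    }


metrics = summary_metrics(p_treated=0.12, p_untreated=0.20)
print(metrics)
```
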
  33. Summary metrics and population heterogeneity

    Full population versus subpopulations:

                            τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)      −0.0452   0.6    1.05   0.57
    Moderate: X = 1         −0.006    0.6    1.01   0.6
    Mild: X = 0             −0.080    0.6    1.1    0.545

    Stroke outcome, data from [MacMahon... 1990]
    Want summary metrics that are most stable across subgroups.
  35. Summary metrics: appealing properties

    Collapsibility: $\tau_{RD} = p_S(X = 1) \cdot \tau_{RD}(X = 1) + p_S(X = 0) \cdot \tau_{RD}(X = 0)$
    where $p_S(X = 1)$ and $p_S(X = 0)$ are the % of individuals with $X = 1$ and $X = 0$ in $P_S$.
    Total measure is a weighted combination of sub-population measures.
    Logic-respecting: $\min_x \tau(x) \le \tau \le \max_x \tau(x)$
    The odds ratio is not logic-respecting (thus not collapsible).
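
A small numeric check of collapsibility for the risk difference. The subgroup risks are hypothetical values chosen so that the subgroup risk differences (−0.006 and −0.080) and risk ratios (0.6) match the stroke table; the 47% share of individuals with X = 1 is likewise an assumption made here, not a figure from the talk.

```python
# (P[Y(0) = 1], P[Y(1) = 1]) in each subgroup: hypothetical values.
risks = {1: (0.015, 0.009), 0: (0.20, 0.12)}
p_x1 = 0.47                                   # assumed share of individuals with X = 1
weights = {1: p_x1, 0: 1 - p_x1}

# Risks in the pooled population.
p0_all = sum(weights[x] * risks[x][0] for x in risks)
p1_all = sum(weights[x] * risks[x][1] for x in risks)


def odds_ratio(p0, p1):
    """Odds of the outcome when treated over odds when untreated."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


# Risk difference: pooled value versus population-share-weighted subgroup values.
rd_sub = {x: risks[x][1] - risks[x][0] for x in risks}
rd_all = p1_all - p0_all
rd_collapsed = sum(weights[x] * rd_sub[x] for x in risks)

# Odds ratio: same comparison, with the same weights.
or_sub = {x: odds_ratio(*risks[x]) for x in risks}
or_all = odds_ratio(p0_all, p1_all)
or_weighted = sum(weights[x] * or_sub[x] for x in risks)

print(rd_all, rd_collapsed)
print(or_all, or_weighted)
```

With these numbers the weighted subgroup risk differences reproduce the pooled risk difference exactly, while the same population-share weights do not reproduce the pooled odds ratio.
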
  36. Transporting causal measures across populations

    Transporting $\tau$, collapsible, from $S$ to $T$:
    $\tau_T = \mathbb{E}_S\left[\frac{p_T(X)}{p_S(X)}\, g_T(Y(0), X)\, \tau_S(X)\right]$
    where $\frac{p_T(X)}{p_S(X)}$ is the re-weighting population ratio and $g_T(Y(0), X)$ are the collapsibility weights of $\tau$.
    $X$ contains all the covariates that are shifted and treatment effect modulators.
  37. Separating baseline from treatment effect

    Directions of heterogeneity?
    Need only covariates that explain $\tau$, and not the full variance of $Y(0)$, $Y(1)$.
  38. For binary outcomes

    You only die once.
    Waging war makes smoking comparatively less risky.
    Probabilities are not additive.
  39. For binary outcomes: a decomposition

    $\mathbb{P}[Y(a) = 1 \mid X = x] = b(x) + a\,\big((1 - b(x))\, m_b(x) - b(x)\, m_g(x)\big)$
    where
    $m_g(x) := \mathbb{P}[Y(1) = 0 \mid Y(0) = 1, X = x]$ (good effect)
    $m_b(x) := \mathbb{P}[Y(1) = 1 \mid Y(0) = 0, X = x]$ (bad effect)
    “good”: the untreated individual has the outcome, the treated one does not.
    Decompose summary metrics:
    $\tau_{RD} = \mathbb{E}[(1 - b(X))\, m_b(X)] - \mathbb{E}[b(X)\, m_g(X)]$
    $\tau_{NNT} = \frac{1}{\mathbb{E}[(1 - b(X))\, m_b(X)] - \mathbb{E}[b(X)\, m_g(X)]}$
    $\tau_{RR} = 1 + \frac{\mathbb{E}[(1 - b(X))\, m_b(X)]}{\mathbb{E}[b(X)]} - \frac{\mathbb{E}[b(X)\, m_g(X)]}{\mathbb{E}[b(X)]}$
    $\tau_{SR} = 1 - \frac{\mathbb{E}[(1 - b(X))\, m_b(X)]}{\mathbb{E}[1 - b(X)]} + \frac{\mathbb{E}[b(X)\, m_g(X)]}{\mathbb{E}[1 - b(X)]}$
  40. For binary outcomes: a decomposition with monotonic effect

    Assuming treatment is beneficial ($\forall x,\ m_b(x) = 0$), the risk ratio simplifies:
    $\tau_{RR} = 1 - \frac{\mathbb{E}[b(X)\, m_g(X)]}{\mathbb{E}[b(X)]}$
    - ☹ No general separation of $b$ and $m$
    - ☺ On covariates $X$ affecting only the baseline level $b$:
      $\tau_{RR}(x) = 1 - \frac{b(x)\, m_g(x)}{b(x)} = 1 - m_g(x)$
      Constant effect on subgroups stratified by covariates $X$ affecting only the baseline.
  41. For binary outcomes: a decomposition with monotonic effect

    Assuming treatment is beneficial ($\forall x,\ m_b(x) = 0$), the risk ratio simplifies:
    $\tau_{RR} = 1 - \frac{\mathbb{E}[b(X)\, m_g(X)]}{\mathbb{E}[b(X)]}$
    Assuming treatment is harmful ($\forall x,\ m_g(x) = 0$), the survival ratio simplifies:
    $\tau_{SR} = 1 - \frac{\mathbb{E}[(1 - b(X))\, m_b(X)]}{\mathbb{E}[1 - b(X)]}$
    An age-old question: [Sheps 1958] Shall We Count the Living or the Dead?
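
The beneficial-treatment case can be checked numerically with the decomposition. In this sketch the baseline risks and the good effect $m_g = 0.4$ are hypothetical values; the covariate changes only the baseline $b$, and the name `treated_risk` is an assumption.

```python
def treated_risk(b, m_b, m_g):
    """P[Y(1) = 1 | X = x] from baseline b, bad effect m_b and good effect m_g."""
    return b + (1 - b) * m_b - b * m_g


m_g = 0.4                                      # good effect, the same in both subgroups
baselines = {"moderate": 0.015, "mild": 0.20}  # baseline risk b(x) differs across subgroups

risk_ratios = {}
risk_differences = {}
for name, b in baselines.items():
    p1 = treated_risk(b, m_b=0.0, m_g=m_g)     # beneficial treatment: no bad effect
    risk_ratios[name] = p1 / b                 # equals 1 - m_g whatever the baseline
    risk_differences[name] = p1 - b            # equals -b * m_g, varies with the baseline

print(risk_ratios)
print(risk_differences)
```

The risk ratio is the same in both subgroups while the risk difference changes with the baseline, which is the separation of baseline and treatment effect that the slide describes.
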
  42. Choice of covariates to transport causal effects

    Good metric separates baseline risk from treatment modulation.
    Continuous outcome:
    - Risk difference is collapsible and separates out baseline risk ⇒ need only shifted covariates that modulate the difference
    Binary outcome:
    - Beneficial effect ⇒ Risk Ratio ⇒ need only shifted covariates that modulate the benefit
    - Harmful effect ⇒ Survival Ratio ⇒ need only shifted covariates that modulate the harm
  43. Choice of covariates to transport causal effects

    Good metric separates baseline risk from treatment modulation.
    Important because it defines heterogeneity across individuals:

                            τ_RD      τ_RR   τ_SR   τ_OR
    All patients (P_S)      −0.0452   0.6    1.05   0.57
    Moderate: X = 1         −0.006    0.6    1.01   0.6
    Mild: X = 0             −0.080    0.6    1.1    0.545
  44. The soda team: Machine learning for health and social sciences

    Machine learning for statistics: causal inference, biases, missing values
    Health and social sciences: epidemiology, education, psychology
    Tabular relational learning: relational databases, data lakes
    Data-science software: scikit-learn, joblib, skrub
  45. Individualizing treatment effects

    Using predictors
    Careful about model selection [Doutreligne and Varoquaux 2023]:
    - heterogeneous error between treated and non-treated
    - use the R-risk
    (also careful not to use post-treatment information)
    Stratification in subgroups
    Use all covariates that capture effect heterogeneity but not population heterogeneity [Colnet... 2023a]
    Good metrics separate baseline risk from treatment effect
    Binary outcomes require thinking [Colnet... 2023b]
    @GaelVaroquaux
  46. References I

    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Reweighting the RCT for generalization: finite sample error and variable selection. arXiv:2208.07614, 2023a.
    B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize? arXiv:2303.16008, 2023b.
    B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 2024.
    M. Doutreligne and G. Varoquaux. How to select predictive models for decision making or causal inference? 2023. URL https://hal.science/hal-03946902.
    S. MacMahon, R. Peto, R. Collins, J. Godwin, J. Cutler, P. Sorlie, R. Abbott, J. Neaton, A. Dyer, and J. Stamler. Blood pressure, stroke, and coronary heart disease: part 1, prolonged differences in blood pressure: prospective observational studies corrected for the regression dilution bias. The Lancet, 335(8692):765–774, 1990.
    X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
  47. References II

    M. C. Sheps. Shall we count the living or the dead? New England Journal of Medicine, 259(25):1210–1214, 1958. doi: 10.1056/NEJM195812182592505. URL https://doi.org/10.1056/NEJM195812182592505. PMID: 13622912.
    G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
    G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
    S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.