Slide 1

Slide 1 text

Modeling Heterogeneous Treatment Effects with R
useR! 2018
Bill Lattner
July 12, 2018

Slide 2

Slide 2 text

Should we rebrand “welfare” as “assistance to the poor”?

Slide 3

Slide 3 text

Welfare - United States
Welfare in the United States refers to an assortment of assistance programs at the federal and state levels.
• cash or wage assistance
• healthcare (Medicaid)
• food (SNAP)
• utilities (natural gas, electricity)

Slide 4

Slide 4 text

Let’s run an experiment!

Slide 5

Slide 5 text

General Social Survey (GSS)
The experimental data we’re looking at today comes from the General Social Survey.
    Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
The survey is typically fielded every two years with a large overlap in questions between years.

Slide 6

Slide 6 text

GSS Framing Experiment
The GSS began an ongoing question framing experiment in 1986.
natfare/natfarey
    We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or the right amount. Are we spending too much, too little, or about the right amount on [TREATMENT]?
Control: welfare
Treatment: assistance to the poor

Slide 7

Slide 7 text

GSS Variables
year                    the survey year
treatment               the question treatment, welfare or assistance
response                response to spending question, 1 if too much
partyid                 party identification of respondent, from democrat to republican
polviews                political views of respondent, from liberal to conservative
age                     age of respondent
educ                    respondent years of education
racial_attitude_index   composite index of negative racial attitudes²
² Green and Kern, “Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees”.

Slide 8

Slide 8 text

Topline

Slide 9

Slide 9 text

Average Treatment Effect (ATE)
The average treatment effect (ATE) tells us the overall effect of the treatment. The ATE is the difference in mean outcomes between the treatment and control groups,
$\mathrm{ATE} = E[y \mid t = \text{treatment}] - E[y \mid t = \text{control}]$.

Slide 10

Slide 10 text

ATE - dplyr
    > library(dplyr); library(tidyr)  # spread() comes from tidyr
    > gss %>%
        group_by(treatment) %>%
        summarize(avg = mean(response)) %>%
        spread(treatment, avg) %>%
        summarize(ate = assistance - welfare)
    # A tibble: 1 x 1
         ate
    1 -0.347

Slide 11

Slide 11 text

ATE - lm
    > lm(response ~ treatment, data = gss)

    Call:
    lm(formula = response ~ treatment, data = gss)

    Coefficients:
            (Intercept)  treatmentassistance
                 0.4550              -0.3467

Slide 12

Slide 12 text

Heterogeneous Treatment Effects

Slide 13

Slide 13 text

Potential Outcomes³
The Neyman-Rubin causal model:
respondent   Y_i(0)       Y_i(1)     treatment
1            ?            too much   assistance
2            too little   ?          welfare
3            too little   ?          welfare
$Y_i(0)$ and $Y_i(1)$ are called potential outcomes. When respondent $i$ is treated, we observe $Y_i(1)$; when they are untreated, we observe $Y_i(0)$.
³ Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies”.
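To make the missing-outcome idea concrete, here is a minimal simulated sketch in base R. The response rates (0.45 and 0.10) and the sample size are made-up assumptions, not GSS values. Both potential outcomes are generated, only one is revealed per respondent, and under random assignment the difference in observed means still lands close to the true average effect.

```r
# Simulated potential outcomes; every number here is an illustrative assumption.
set.seed(1)
n <- 10000
y0 <- rbinom(n, 1, 0.45)  # Y_i(0): response under the control wording
y1 <- rbinom(n, 1, 0.10)  # Y_i(1): response under the treatment wording
t <- rbinom(n, 1, 0.5)    # random assignment to treatment

# We only ever observe one potential outcome per respondent
y_obs <- ifelse(t == 1, y1, y0)

# Known only because this is a simulation
true_ate <- mean(y1 - y0)
# What an analyst can actually compute from observed data
est_ate <- mean(y_obs[t == 1]) - mean(y_obs[t == 0])
```

Comparing `true_ate` and `est_ate` shows why randomization lets us estimate an average effect even though no individual effect is ever observed.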

Slide 14

Slide 14 text

Conditional Average Treatment Effect (CATE)
The average treatment effect is useful: it allows us to compare different treatments for overall effectiveness. But it’s a population average. A more interesting measure is the conditional average treatment effect (CATE),
$\mathrm{CATE}(x) = E[Y(1) - Y(0) \mid X = x]$.
It gives the effect of the treatment for groups with a particular value $x$ of the pre-treatment covariates.
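For a single discrete covariate, the CATE can be estimated as a difference in means within each group. The sketch below uses simulated data in base R. The group labels echo `polviews`, but the base rates and effect sizes are invented for illustration and are not estimates from the GSS.

```r
# Simulated survey experiment; base rates and effects are assumptions.
set.seed(2)
n <- 30000
polviews <- sample(c("liberal", "moderate", "conservative"), n, replace = TRUE)
treatment <- rbinom(n, 1, 0.5)

base_rate <- c(liberal = 0.30, moderate = 0.45, conservative = 0.60)
effect <- c(liberal = -0.20, moderate = -0.30, conservative = -0.45)
p <- base_rate[polviews] + treatment * effect[polviews]
response <- rbinom(n, 1, p)

# Mean response by group (rows) and treatment arm (columns "0" and "1")
means <- tapply(response, list(polviews, treatment), mean)

# CATE for each group: treated mean minus control mean
cate <- means[, "1"] - means[, "0"]
```

This only works when every group has plenty of respondents in both arms, which is the motivation for the model-based approaches that follow.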

Slide 15

Slide 15 text

Example: CATE of Political Views

Slide 16

Slide 16 text

Modeling Approaches

Slide 17

Slide 17 text

Missing Data Problem
respondent   Y_i(0)       Y_i(1)     treatment    age   …   educ
1            ?            too much   assistance   34    …   16
2            too little   ?          welfare      41    …   12
3            too little   ?          welfare      53    …   20
Let’s use machine learning to estimate the missing potential outcomes.

Slide 18

Slide 18 text

Interactive Model
$\hat{\mu} = M(Y \sim (X, T))$
$\mathrm{CATE}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$
• use any ML/statistical model $M(\cdot, \cdot)$
• include the treatment indicator $T$
• include all treatment and covariate interactions
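As one possible instance of the recipe above, the sketch below uses a logistic regression as $M$ with explicit treatment-by-covariate interactions. The data are simulated, and the column names and data-generating process are assumptions chosen for illustration.

```r
# Simulated data; the data-generating process is an illustrative assumption.
set.seed(3)
n <- 5000
d <- data.frame(
  age = runif(n, 18, 80),
  educ = sample(8:20, n, replace = TRUE),
  treatment = rbinom(n, 1, 0.5)
)
# Assumed truth: the treatment effect grows more negative with age
lp <- -0.5 + 0.01 * d$age - d$treatment * (0.5 + 0.03 * d$age)
d$response <- rbinom(n, 1, plogis(lp))

# One model with the treatment indicator and all treatment-covariate interactions
m <- glm(response ~ treatment * (age + educ), data = d, family = binomial())

# Predict every respondent under treatment and under control, then difference
mu1 <- predict(m, transform(d, treatment = 1), type = "response")
mu0 <- predict(m, transform(d, treatment = 0), type = "response")
cate <- mu1 - mu0
```

A flexible learner such as a random forest picks up interactions on its own, while a linear model only sees the ones written into the formula.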

Slide 19

Slide 19 text

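Interactive Model
    library(dplyr)
    library(randomForest)

    # type = "prob" below needs a classification forest, so this assumes
    # gss$response is stored as a factor
    m <- randomForest(response ~ ., data = gss)

    # copies of the data with everyone set to treatment / control
    gss_treated <- gss %>%
      mutate(treatment = factor("assistance", levels = c("welfare", "assistance")))
    gss_control <- gss %>%
      mutate(treatment = factor("welfare", levels = c("welfare", "assistance")))

    # predicted probability of the second response class under each condition
    y_1 <- predict(m, gss_treated, type = "prob")[, 2]
    y_0 <- predict(m, gss_control, type = "prob")[, 2]
    cate <- y_1 - y_0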

Slide 20

Slide 20 text

Model Evaluation

Slide 21

Slide 21 text

Evaluation
Figuring out if we have a decent model is tough. We never observe the same people under both treatment and control, so we can’t use traditional metrics like MSE or accuracy.

Slide 22

Slide 22 text

True vs Predicted CATE Quantiles
• Using a holdout set or cross-validation, get a set of out-of-sample treatment effect scores from a given model.
• Quantile those scores and calculate the observed ATE within each quantile.
• Check that the observed effects are ordered the way the predictions say they should be, and see how they compare to the average prediction in each quantile.
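The three steps above can be sketched in base R. This is a simulated example under assumed data, with a logistic regression standing in for whatever model produced the scores, and quintiles as an arbitrary choice of bin count.

```r
# Simulated data; the data-generating process is an illustrative assumption.
set.seed(4)
n <- 20000
d <- data.frame(age = runif(n, 18, 80), treatment = rbinom(n, 1, 0.5))
lp <- -0.5 + 0.01 * d$age - d$treatment * (0.5 + 0.03 * d$age)
d$response <- rbinom(n, 1, plogis(lp))

# Step 1: fit on a training half, score the holdout half
train <- d[1:10000, ]
test <- d[10001:n, ]
m <- glm(response ~ treatment * age, data = train, family = binomial())
score <- predict(m, transform(test, treatment = 1), type = "response") -
  predict(m, transform(test, treatment = 0), type = "response")

# Step 2: bin the scores into quintiles and compute the observed ATE per bin
bin <- cut(score, quantile(score, 0:5 / 5), include.lowest = TRUE, labels = FALSE)
observed <- sapply(1:5, function(b) {
  in_bin <- bin == b
  mean(test$response[in_bin & test$treatment == 1]) -
    mean(test$response[in_bin & test$treatment == 0])
})

# Step 3: put average prediction and observed effect side by side
predicted <- as.numeric(tapply(score, bin, mean))
comparison <- data.frame(quantile = 1:5, predicted = predicted, observed = observed)
```

If the model ranks respondents well, `observed` should move in the same direction as `predicted` across the rows of `comparison`.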

Slide 23

Slide 23 text

Uplift
• The uplift curve represents the incremental gain from using the model to target effort or outreach.
• Similar to the quantile plot, rank observations by predicted treatment effect and compare to the actual ATE in each group (red line).
• Compare this to randomly ordering observations (blue line).
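One common way to build such a curve is sketched below in base R. The helper `uplift_curve()` is a hypothetical function written for this example, not part of any package, and the data are simulated under assumed effect sizes. Cumulative gain here means the observed ATE among the top-ranked observations multiplied by how many were targeted, which is only one of several definitions in use.

```r
# Simulated data; the effect sizes are illustrative assumptions.
set.seed(5)
n <- 20000
x <- runif(n)
t <- rbinom(n, 1, 0.5)
# Assumed truth: the effect on P(y = 1) rises from 0 to 0.3 with x
y <- rbinom(n, 1, 0.2 + 0.3 * x * t)
score <- x  # stand-in for out-of-sample model predictions

# Hypothetical helper: cumulative incremental gain by targeting depth
uplift_curve <- function(score, y, t, n_groups = 10) {
  ord <- order(score, decreasing = TRUE)
  y <- y[ord]
  t <- t[ord]
  k <- round(length(y) * seq_len(n_groups) / n_groups)
  gain <- sapply(k, function(top) {
    yt <- y[seq_len(top)]
    tt <- t[seq_len(top)]
    # observed ATE among the top-ranked, scaled by the number targeted
    (mean(yt[tt == 1]) - mean(yt[tt == 0])) * top
  })
  # random targeting earns the overall ATE at every depth
  ate <- mean(y[t == 1]) - mean(y[t == 0])
  data.frame(n_targeted = k, model = gain, random = ate * k)
}

curve <- uplift_curve(score, y, t)
```

Plotting `model` and `random` against `n_targeted` gives the two lines described above. They meet at the right edge, where everyone is targeted.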

Slide 24

Slide 24 text

qini⁴
• The qini coefficient is analogous to the area under the ROC curve (AUC) for supervised learning.
• A single metric we can use to compare models fit to the same task.
• Its scale depends on the data, so we can’t use it to compare models in absolute terms.
⁴ Radcliffe and Surry, “Real-world uplift modelling with significance-based uplift trees”.
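A rough sketch of one way to compute a qini-style coefficient in base R follows. The function `qini()` is a hypothetical helper written for this example, the data are simulated, and the unnormalized area used here is an assumption: published definitions differ in how they scale the curve.

```r
# Simulated data; the effect sizes are illustrative assumptions.
set.seed(6)
n <- 20000
x <- runif(n)
t <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.2 + 0.3 * x * t)

# Hypothetical helper: area between the qini curve and the random-targeting line
qini <- function(score, y, t) {
  ord <- order(score, decreasing = TRUE)
  y <- y[ord]
  t <- t[ord]
  n_t <- cumsum(t)            # treated count among the top k
  n_c <- cumsum(1 - t)        # control count among the top k
  r_t <- cumsum(y * t)        # treated responders among the top k
  r_c <- cumsum(y * (1 - t))  # control responders among the top k
  # incremental responders, rescaling controls to the treated group size
  q <- r_t - ifelse(n_c > 0, r_c * n_t / n_c, 0)
  frac <- seq_along(y) / length(y)
  random <- frac * q[length(q)]
  # trapezoid rule over the fraction targeted, starting from the origin
  gap <- c(0, q - random)
  sum((gap[-1] + gap[-length(gap)]) / 2) / length(y)
}

qini_model <- qini(x, y, t)          # informative score
qini_random <- qini(runif(n), y, t)  # uninformative score, should be near zero
```

Because the value depends on sample size and base rates, it is meaningful for ranking models on the same data rather than as an absolute number.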

Slide 25

Slide 25 text

Modeling Approaches, Continued

Slide 26

Slide 26 text

Split Model
$\hat{\mu}_0 = M_0(Y^0 \sim X^0)$
$\hat{\mu}_1 = M_1(Y^1 \sim X^1)$
$\mathrm{CATE}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$
• use two models
• $M_0(\cdot)$ estimated with the control group
• $M_1(\cdot)$ estimated with the treatment group

Slide 27

Slide 27 text

Split Model
    # one forest per experimental arm, fit without the treatment indicator
    m0 <- randomForest(response ~ . - treatment,
                       data = filter(gss, treatment == "welfare"))
    m1 <- randomForest(response ~ . - treatment,
                       data = filter(gss, treatment == "assistance"))

    # m0 predicts the control outcome, m1 the treated outcome
    y_0 <- predict(m0, gss, type = "prob")[, 2]
    y_1 <- predict(m1, gss, type = "prob")[, 2]
    cate <- y_1 - y_0

Slide 28

Slide 28 text

X-Learner⁵
$\hat{\mu}_0(x) = M_1(Y^0 \sim X^0)$
$\hat{\mu}_1(x) = M_2(Y^1 \sim X^1)$
$\tilde{D}^0 = \hat{\mu}_1(X^0) - Y^0$
$\tilde{D}^1 = Y^1 - \hat{\mu}_0(X^1)$
$\mathrm{CATE}_0(x) = M_4(\tilde{D}^0 \sim X^0)$
$\mathrm{CATE}_1(x) = M_3(\tilde{D}^1 \sim X^1)$
$\mathrm{CATE}(x) = g(x)\,\mathrm{CATE}_0(x) + (1 - g(x))\,\mathrm{CATE}_1(x)$
• $M_1$ and $M_2$ estimate the response in the control and treatment groups
• $\tilde{D}^1$ and $\tilde{D}^0$ are the imputed treatment effects
• $g(x)$ is a weighting function, typically the propensity score or treated fraction
⁵ Künzel et al., “Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning”.
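The stages above can be walked through with ordinary linear regressions as the base learners. This is a simulated sketch in base R, with the data-generating process and the constant weight $g$ (the treated fraction) chosen as assumptions for illustration. Any regression method could replace `lm()` at each stage.

```r
# Simulated data; the assumed truth is CATE(x) = 1 + 2x.
set.seed(7)
n <- 4000
x <- runif(n, -1, 1)
t <- rbinom(n, 1, 0.5)
y <- x + t * (1 + 2 * x) + rnorm(n, sd = 0.5)
d <- data.frame(x = x, t = t, y = y)
d0 <- d[d$t == 0, ]  # control group
d1 <- d[d$t == 1, ]  # treatment group

# Stage 1: outcome models, one per arm
mu0 <- lm(y ~ x, data = d0)
mu1 <- lm(y ~ x, data = d1)

# Stage 2: imputed treatment effects using the other arm's model
d0$dtilde <- predict(mu1, d0) - d0$y
d1$dtilde <- d1$y - predict(mu0, d1)

# Stage 3: model the imputed effects within each arm
tau0 <- lm(dtilde ~ x, data = d0)
tau1 <- lm(dtilde ~ x, data = d1)

# Stage 4: weighted combination, with g set to the treated fraction
g <- mean(d$t)
cate <- g * predict(tau0, d) + (1 - g) * predict(tau1, d)
```

With a balanced experiment like this one the weighting matters little. It becomes more important when one arm is much smaller than the other.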

Slide 29

Slide 29 text

Generalized Random Forest (GRF)⁶ ⁷
• CART/RandomForest inspired
• directly estimates CATE
• guarantees for consistency and bias
• proper confidence intervals
• CRAN: grf
⁶ Athey, Tibshirani, and Wager, “Generalized random forests”.
⁷ D’Agostino and Lattner, The power of persuasion modeling.

Slide 30

Slide 30 text

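GRF
    library(grf)

    # covariate matrix without the treatment column
    x <- model.matrix(response ~ . - treatment, data = gss)
    y <- gss$response
    # treatment indicator: 0 = welfare (control), 1 = assistance
    tmt <- ifelse(gss$treatment == "welfare", 0, 1)

    m <- causal_forest(x, y, tmt)
    # out-of-bag CATE estimates along with their variance estimates
    cate <- predict(m, estimate.variance = TRUE)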

Slide 31

Slide 31 text

hete Package

Slide 32

Slide 32 text

hete Package
• interactive, split, and x-learner
• formula interface
• plug in any ML model/estimator
• uplift curve
• plots
• on GitHub

Slide 33

Slide 33 text

hete Package
    m <- hete_single(response ~ year + educ + age | treatment,
                     data = gss, est = random_forest)
    plot(m)
    cate <- predict(m, gss)

Slide 34

Slide 34 text

So, should we rebrand “welfare”?

Slide 35

Slide 35 text

Rebrand? Probably!

Slide 36

Slide 36 text

References i
Angrist, Joshua D. and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, Dec. 2008. ISBN: 0691120358.
Athey, Susan, Julie Tibshirani, and Stefan Wager. “Generalized random forests”. In: arXiv preprint arXiv:1610.01271 (2016).
Chernozhukov, Victor et al. Double machine learning for treatment and causal parameters. Tech. rep. cemmap working paper, Centre for Microdata Methods and Practice, 2016.
D’Agostino, Michelangelo and Bill Lattner. The power of persuasion modeling. Talk at Strata + Hadoop World, San Jose, CA. 2017.

Slide 37

Slide 37 text

References ii
Green, Donald P. and Holger L. Kern. “Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees”. In: Public Opinion Quarterly 76.3 (2012), pp. 491–511.
Imbens, Guido W. and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
Künzel, Sören R. et al. “Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning”. In: arXiv preprint arXiv:1706.03461 (2017).
Radcliffe, Nicholas J. and Patrick D. Surry. “Real-world uplift modelling with significance-based uplift trees”. In: White Paper TR-2011-1, Stochastic Solutions (2011).

Slide 38

Slide 38 text

References iii
Rubin, Donald B. “Estimating causal effects of treatments in randomized and nonrandomized studies”. In: Journal of Educational Psychology 66.5 (1974), p. 688.

Slide 39

Slide 39 text

Thank you!
Twitter: @wlattner
GitHub
Slides
GSS Data
grf Package
hete Package