Randomized experiments have become ubiquitous in many fields. Traditionally, we have focused on reporting the average treatment effect (ATE) from such experiments. With recent advances in machine learning, and the overall scale at which experiments are now conducted, we can broaden our analysis to include heterogeneous treatment effects. This provides a more nuanced view of the effect of a treatment or change on the outcome of interest. Going one step further, we can use models of heterogeneous treatment effects to optimally allocate treatment. In this talk, we will provide a brief overview of heterogeneous treatment effect modeling. We will show how to apply some recently proposed methods using R, and compare the results of each using a question wording experiment from the General Social Survey. Finally, we will conclude with some practical issues in modeling heterogeneous treatment effects, including model selection and obtaining valid confidence intervals.
Modeling Heterogeneous Treatment Effects with R
July 12, 2018
Should we rebrand “welfare” as “assistance to the poor”?
Welfare - United States
Welfare in the United States refers to an assortment of assistance programs at
the Federal and State levels.
• cash or wage assistance
• healthcare (Medicaid)
• food (SNAP)
• utilities (natural gas, electricity)
Let’s run an experiment!
General Social Survey (GSS)
The experimental data we’re looking at today comes from the General Social Survey (GSS).
Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

The survey is typically fielded every two years with a large overlap in questions across waves.
GSS Framing Experiment
The GSS began an ongoing question framing experiment in 1986.
We are faced with many problems in this country, none of which can be solved
easily or inexpensively. I’m going to name some of these problems, and for each
one I’d like you to tell me whether you think we’re spending too much money on
it, too little money, or the right amount.
Are we spending too much, too little, or about the right amount on [TREATMENT].
Treatment: “assistance to the poor” (control: “welfare”)

Variables in the data:

year: the survey year
treatment: the question treatment, welfare or assistance
response: response to the spending question, 1 if “too much”
partyid: party identification of respondent, from Democrat to Republican
polviews: political views of respondent, from liberal to conservative
age: age of respondent
educ: respondent years of education
racial_attitude_index: composite index of negative racial attitudes2
2Green and Kern, “Modeling heterogeneous treatment effects in survey experiments with Bayesian
additive regression trees”.
Average Treatment Effect (ATE)
The average treatment effect (ATE) tells us the overall effect of the treatment. The
ATE is the difference in outcomes between the treatment and control groups,
ATE = E[y | t = treatment] − E[y | t = control].
ATE - dplyr
> gss %>%
    group_by(treatment) %>%
    summarize(avg = mean(response)) %>%
    spread(treatment, avg) %>%
    summarize(ate = assistance - welfare)
# A tibble: 1 x 1
ATE - lm
> lm(response ~ treatment, data = gss)
Heterogeneous Treatment Effects
The Neyman-Rubin causal model:
respondent  Yi(0)       Yi(1)      treatment
1           ?           too much   assistance
2           too little  ?          welfare
3           too little  ?          welfare

Yi(0) and Yi(1) are called potential outcomes. When respondent i is treated, we observe Yi(1); when they are untreated, we observe Yi(0).
3Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies.”
Conditional Average Treatment Effect (CATE)
The average treatment effect is useful: it allows us to compare different treatments for overall effectiveness. But it’s a population average. A more interesting measure is the conditional average treatment effect (CATE),

CATE(x) = E[Y(1) − Y(0) | X = x].

We can see the effect of the treatment on groups with a particular value x of the covariates.
Example: CATE of Political Views
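The original slide shows this example as a plot. The subgroup estimates behind such a plot can be sketched with dplyr, assuming the gss data frame described above (this is a sketch, not part of the original slides):

```r
library(dplyr)
library(tidyr)

# difference in means between question wordings,
# computed within each level of polviews
gss %>%
  group_by(polviews, treatment) %>%
  summarize(avg = mean(response)) %>%
  spread(treatment, avg) %>%
  mutate(cate = assistance - welfare)
```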
Missing Data Problem
respondent  Yi(0)       Yi(1)      treatment   age  …  educ
1           ?           too much   assistance  34   …  16
2           too little  ?          welfare     41   …  12
3           too little  ?          welfare     53   …  20
Let’s use machine learning to estimate the missing potential outcomes.
Modeling Approaches: Single Model

μ̂ = M(Y ∼ (X, T))
CATE(x) = μ̂(x, 1) − μ̂(x, 0)

• use any ML/statistical model M(·, ·)
• include the treatment indicator T
• include all treatment and covariate interactions
library(dplyr)
library(randomForest)

# single model, fit with the treatment indicator included
m <- randomForest(response ~ ., data = gss)

# score everyone as treated and as control
gss_treated <- gss %>%
  mutate(treatment = factor("assistance",
                            levels = c("welfare", "assistance")))
gss_control <- gss %>%
  mutate(treatment = factor("welfare",
                            levels = c("welfare", "assistance")))

# predicted probability of "too much" under each condition
y_1 <- predict(m, gss_treated, type = "prob")[, 2]
y_0 <- predict(m, gss_control, type = "prob")[, 2]
cate <- y_1 - y_0
Figuring out if we have a decent model is tough. We never observe the same
people under both treatment and control, so we can’t use traditional metrics like
MSE or accuracy.
True vs Predicted CATE Quantiles
• Using a holdout set or cross-validation, get a set of out-of-sample treatment effect scores from a given model.
• Quantile those scores and calculate the true ATE within each quantile.
• Check that those predictions order well and see how they compare to the average predictions in each quantile.
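The steps above can be sketched with dplyr, assuming gss carries a column cate_hat of out-of-sample treatment effect scores (a sketch; the column name is illustrative):

```r
library(dplyr)
library(tidyr)

# bin observations by predicted effect
by_decile <- gss %>%
  mutate(decile = ntile(cate_hat, 10))

# true ATE within each score decile
actual <- by_decile %>%
  group_by(decile, treatment) %>%
  summarize(avg = mean(response)) %>%
  spread(treatment, avg) %>%
  mutate(actual_ate = assistance - welfare)

# average predicted effect within each decile, for comparison
predicted <- by_decile %>%
  group_by(decile) %>%
  summarize(predicted_ate = mean(cate_hat))

inner_join(actual, predicted, by = "decile")
```

If the model orders well, actual_ate should increase with decile and track predicted_ate.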
Uplift Curve

• The uplift curve represents the incremental gain from using the model to target effort or outreach.
• Similar to the quantile plot, rank observations by predicted ATE and compare to the actual ATE in each group (red line).
• Compare this to randomly ordering observations (blue line).
Qini Coefficient

• The qini coefficient is analogous to the area under the ROC curve (AUC) for supervised learning.
• A single metric we can use to compare models fit to the same data.
• Scale matters, so we can’t use it to compare models in absolute terms.
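A cumulative uplift curve can be sketched in base R plus dplyr, again assuming a cate_hat column of predicted effects (a sketch; early rows may produce NaN before both groups appear):

```r
library(dplyr)

# rank by predicted effect, then track cumulative incremental gain
uplift <- gss %>%
  arrange(desc(cate_hat)) %>%
  mutate(n = row_number(),
         treated = treatment == "assistance",
         cum_resp_t = cumsum(response * treated),
         cum_resp_c = cumsum(response * !treated),
         cum_n_t = cumsum(treated),
         cum_n_c = cumsum(!treated),
         # estimated incremental responses among the first n contacted
         gain = (cum_resp_t / cum_n_t - cum_resp_c / cum_n_c) * n)

plot(uplift$n, uplift$gain, type = "l",
     xlab = "observations targeted", ylab = "incremental gain")
```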
4Radcliffe and Surry, “Real-world uplift modelling with significance-based uplift trees”.
Modeling Approaches, Continued
μ̂0 = M0(Y0 ∼ X0)
μ̂1 = M1(Y1 ∼ X1)
CATE(x) = μ̂1(x) − μ̂0(x)

• use two models
• M0(·) estimated with the control group
• M1(·) estimated with the treatment group
m0 <- randomForest(response ~ . -treatment,
data = filter(gss, treatment == "welfare"))
m1 <- randomForest(response ~ . -treatment,
data = filter(gss, treatment == "assistance"))
y_0 <- predict(m0, gss, type = "prob")[, 2]
y_1 <- predict(m1, gss, type = "prob")[, 2]
cate <- y_1 - y_0
X-learner

μ̂0(x) = M1(Y0 ∼ X0)
μ̂1(x) = M2(Y1 ∼ X1)

D̃1 = Y1 − μ̂0(X1)
D̃0 = μ̂1(X0) − Y0

CATE1(x) = M3(D̃1 ∼ X1)
CATE0(x) = M4(D̃0 ∼ X0)

CATE(x) = g(x)CATE0(x) + (1 − g(x))CATE1(x)

• M1(·) and M2(·) estimate the response in the control and treatment groups
• D̃1 and D̃0 are the imputed CATEs
• g(x) is a weighting function, typically an estimate of the propensity score
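Following the pattern of the single- and two-model code above, the X-learner stages can be sketched with randomForest (a sketch; in practice packages such as hete, shown below, implement this):

```r
library(dplyr)
library(randomForest)

gss_c <- filter(gss, treatment == "welfare")
gss_t <- filter(gss, treatment == "assistance")

# stage 1: response models fit in each group
m1 <- randomForest(response ~ . -treatment, data = gss_c)
m2 <- randomForest(response ~ . -treatment, data = gss_t)

# stage 2: imputed treatment effects, treating the response
# as 0/1 via the predicted probability of "too much"
y_t <- as.numeric(gss_t$response) - 1
y_c <- as.numeric(gss_c$response) - 1
d1 <- y_t - predict(m1, gss_t, type = "prob")[, 2]
d0 <- predict(m2, gss_c, type = "prob")[, 2] - y_c

# model the imputed effects (regression, not classification)
m3 <- randomForest(d1 ~ . -treatment -response,
                   data = mutate(gss_t, d1 = d1))
m4 <- randomForest(d0 ~ . -treatment -response,
                   data = mutate(gss_c, d0 = d0))

# stage 3: combine; with random assignment the propensity
# score is constant, so use the share treated as g(x)
g <- nrow(gss_t) / nrow(gss)
cate <- g * predict(m4, gss) + (1 - g) * predict(m3, gss)
```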
5Künzel et al., “Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning”.
Generalized Random Forest (GRF) 6
• CART/RandomForest inspired
• directly estimates CATE
• guarantees for consistency and bias
• proper confidence intervals
• CRAN: grf
6Athey, Tibshirani, and Wager, “Generalized random forests”
7D’Agostino and Lattner, The power of persuasion modeling
library(grf)

# design matrix, outcome, and 0/1 treatment indicator
x <- model.matrix(response ~ . -treatment, data = gss)
y <- gss$response
tmt <- ifelse(gss$treatment == "welfare", 0, 1)

m <- causal_forest(x, y, tmt)
cate <- predict(m, estimate.variance = TRUE)
• interactive, split, and x-learner
• formula interface
• plugin any ML model/estimator
• uplift curve
• on GitHub: github.com/wlattner/hete
m <- hete_single(response ~ year + educ + age | treatment,
data = gss, est = random_forest)
cate <- predict(m, gss)
So, should we rebrand “welfare”?
Angrist, Joshua D. and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An
Empiricist’s Companion. Princeton University Press, Dec. 2008. isbn: 0691120358.
Athey, Susan, Julie Tibshirani, and Stefan Wager. “Generalized random forests”. In:
arXiv preprint arXiv:1610.01271 (2016).
Chernozhukov, Victor et al. Double machine learning for treatment and causal
parameters. Tech. rep. cemmap working paper, Centre for Microdata Methods
and Practice, 2016.
D’Agostino, Michelangelo and Bill Lattner. The power of persuasion modeling. Talk
at Strata + Hadoop World, San Jose, CA. 2017.
Green, Donald P and Holger L Kern. “Modeling heterogeneous treatment effects in
survey experiments with Bayesian additive regression trees”. In: Public opinion
quarterly 76.3 (2012), pp. 491–511.
Imbens, Guido W and Donald B Rubin. Causal inference in statistics, social, and
biomedical sciences. Cambridge University Press, 2015.
Künzel, Sören R et al. “Meta-learners for Estimating Heterogeneous Treatment
Effects using Machine Learning”. In: arXiv preprint arXiv:1706.03461 (2017).
Radcliffe, Nicholas J and Patrick D Surry. “Real-world uplift modelling with significance-based uplift trees”. In: White Paper TR-2011-1, Stochastic Solutions (2011).
Rubin, Donald B. “Estimating causal effects of treatments in randomized and nonrandomized studies”. In: Journal of Educational Psychology 66.5 (1974), pp. 688–701.
GSS Data gss.norc.org
grf Package github.com/swager/grf
hete Package github.com/wlattner/hete