Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017

Why big data didn’t end causal inference
Totte Harinen, Data Scientist II, Uber Labs

Structure of talk
1. The case against causal inference in the big data era
2. Reasons why big data did not (and will not) end causal inference
3. A reconciliation between big data and causal inference?
4. Conclusions

Causal inference: Any modelling approach where some parts of the model are assumed to correspond to some aspects of the causal structure of the world.
Big data: Lots of observations and/or variables per observation.

The case against causal inference in the big data era

“Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. […] There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
Chris Anderson, Wired, 2008

Reasons why big data might have ended causal inference
1. Humans are bad at coming up with causal hypotheses
Coming up with hypotheses about the world and then testing them against experimental evidence seems old-fashioned, and there’s a sense that humans are somehow bad at this. The experimental method was certainly developed before powerful computing, so it’s not a crazy idea that it’s due for a revolution just like so many other things have been.
2. Correlational models form a more accurate picture of reality
Anderson refers to the fact that models are abstractions of the underlying reality. He suggests that the correlational approach results in a more accurate picture of how the world is because such an approach is more flexible in its assumptions and can incorporate more complexity.
3. Data analysis just seems to be headed towards the correlational approach
It’s clear that running correlational analyses on big datasets has resulted in progress both in science and business, and this big-data-driven progress has presumably become greater over time. So we could extrapolate that progress is increasingly going to be based on the correlational approach.

And still… causal inference seems to be doing just fine
Science
Randomised controlled trials, mediation analysis and quasi-experiments: 41,400 Google Scholar hits since the beginning of the year
Business and policy
A/B testing in business, incrementality measurement in advertising, experimental methods in public policy
And last but not least…
At Uber, we’re applying causal inference methods to answer questions relevant to our business

Reasons why big data did not (and will not) end causal inference

Three quick considerations
Humans are good at causal hypotheses
This is because during our evolutionary history it’s been useful to be able to answer the question: “If I changed X, what would happen to Y?”
Abstractness is what makes models useful
When we abstract away from the particulars of a situation, we can generalize into other similar contexts.
Correlational approaches don’t give us the counterfactual
Estimating a causal effect requires estimating what would have happened in the absence of the cause.
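The third consideration can be made concrete with a small simulation. This is only a sketch under invented assumptions: the confounder `z` (think of it as prior engagement), the coefficients and the sample size are all hypothetical and not taken from the talk. The treatment has no causal effect on the outcome, yet the observational comparison shows a large gap, while random assignment recovers a difference close to zero.

```python
import random

def diff_in_means(treated, outcome):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    on = [y for t, y in zip(treated, outcome) if t]
    off = [y for t, y in zip(treated, outcome) if not t]
    return sum(on) / len(on) - sum(off) / len(off)

def simulate(n, randomised, seed):
    """Simulate n units in which the treatment has NO causal effect on the outcome."""
    rng = random.Random(seed)
    treated, outcome = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # hypothetical confounder
        if randomised:
            t = rng.random() < 0.5           # coin flip, independent of z
        else:
            t = z + rng.gauss(0, 1) > 0      # high-z units self-select into treatment
        y = 2 * z + rng.gauss(0, 1)          # outcome depends on z only, never on t
        treated.append(t)
        outcome.append(y)
    return diff_in_means(treated, outcome)

observational_gap = simulate(20_000, randomised=False, seed=0)
randomised_gap = simulate(20_000, randomised=True, seed=1)
print(f"observational difference: {observational_gap:.2f}")
print(f"randomised difference:    {randomised_gap:.2f}")
```

The observational gap reflects who ended up treated rather than what the treatment did, which is why it cannot stand in for the counterfactual.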

Technology
Bigger data = better causal inference
Bigger sample sizes enable us to identify smaller causal effects and/or have a greater number of treatment arms in a standard RCT. Participant matching approaches benefit from a larger number of covariates. Time-series-based causal inference methods require multiple observations… All of which were difficult to achieve before the big data era.
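The link between sample size and detectable effect size can be sketched with the textbook normal-approximation formula for comparing two means. The assumptions here are not from the talk: two equally sized arms, a known common standard deviation, and a two-sided test. The function name `n_per_arm` is made up for this illustration.

```python
import math
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.8):
    """Approximate units needed in each arm to detect a difference in means.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Smaller effects need disproportionately more data: halving the effect
# roughly quadruples the required sample.
for effect in (0.2, 0.1, 0.05):
    print(effect, n_per_arm(effect, sd=1.0))
```

Because the required sample grows with the inverse square of the effect, effects that were undetectable with thousands of participants become measurable with millions.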

Example: interrupted time series analysis
[Figure: Email open rate before and after a personalised title]
Interrupted time series analysis is a classic method for inferring the causal impact of an intervention, based on qualitative assumptions of the underlying causal structure. However, until recently, high-quality time series data was hard to get.
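A deliberately simplified version of the idea is sketched below: fit a linear trend to the pre-intervention period, project it forward as the counterfactual, and average the gap to the observed post-intervention values. The open-rate series is synthetic and every number in it is invented. A full interrupted time series analysis would typically also model slope changes, seasonality and autocorrelation, none of which appear here.

```python
def fit_line(ts, ys):
    """Ordinary least squares fit of ys on ts; returns (intercept, slope)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope

def estimated_level_change(series, start):
    """Average gap between observed values and the projected pre-period trend."""
    intercept, slope = fit_line(list(range(start)), series[:start])
    gaps = [series[t] - (intercept + slope * t) for t in range(start, len(series))]
    return sum(gaps) / len(gaps)

# Synthetic daily open rates: a slow upward trend, a small deterministic wobble,
# and a hypothetical lift of 0.05 once the personalised title starts on day 30.
START = 30
open_rate = [
    0.20 + 0.001 * t + 0.005 * (-1) ** t + (0.05 if t >= START else 0.0)
    for t in range(60)
]

effect = estimated_level_change(open_rate, START)
print(f"estimated lift in open rate: {effect:.3f}")
```

The estimate is only as good as the assumption that the pre-period trend would have continued unchanged, which is the qualitative causal assumption the method rests on.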

New methodologies
Causal inference is in its infancy
The formal language to describe causal relationships has been developed fairly recently. This has enabled both the development of better computational methods for causal inference and the clarification of key assumptions.

Example: causal mediation modeling
[Figure: How much of the impact of an email is mediated via click through?]
Causal mediation modeling is designed to reveal the mechanisms through which the impact of an intervention is mediated. Until recently, there weren’t easy-to-implement methods to run mediation models with nonparametric data.
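The question on this slide can be illustrated with the classic linear product-of-coefficients decomposition. Note that this is a simpler technique than the nonparametric methods the slide refers to, and it is shown only to make the quantities concrete. The data are simulated, and the variable names (`sent`, `click`, `spend`) and all coefficients are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def slope(x, y):
    """OLS slope of y on a single predictor x."""
    return cov(x, y) / cov(x, x)

def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 jointly, via the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated experiment: the email is randomised, raises click-through (mediator),
# and affects spend both through clicks and directly.
rng = random.Random(42)
sent = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(20_000)]
click = [0.5 * t + rng.gauss(0, 1) for t in sent]
spend = [0.2 * t + 0.6 * m + rng.gauss(0, 1) for t, m in zip(sent, click)]

a = slope(sent, click)                           # email -> click
direct, b = two_predictor_slopes(sent, click, spend)
indirect = a * b                                 # effect routed through clicks
total = slope(sent, spend)
proportion_mediated = indirect / (indirect + direct)
print(f"indirect={indirect:.2f} direct={direct:.2f} mediated={proportion_mediated:.0%}")
```

In a linear model fitted by least squares the total effect splits exactly into the direct and indirect parts; with nonlinear or nonparametric outcome models that decomposition needs the counterfactual definitions of direct and indirect effects instead.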

A reconciliation between big data and causal inference?

Some general ways in which big data and causal inference complement each other
Big data enables us to do more and better causal inference
Better participant matching, subpopulation analyses, multi-arm trials and the ability to identify smaller effects are some of the benefits to causal inference that arise from the existence of big data.
Big data findings can inspire causal hypotheses
Experiments, quasi-experiments and causal modelling can be used to test hypotheses about patterns that arise from correlational analysis.
Machine learning methods can help us to estimate causal quantities
Exciting developments in machine learning may help us to estimate counterfactuals, such as what the outcome would have looked like for the treated in the absence of the intervention.

Some particularly interesting prospects
Algorithms to identify causal models from observational data
A causal model implies a certain correlation structure between variables, which can be used to identify causal structures from purely observational data.
Variable selection
A simple application is that of identifying predictors, out of hundreds of variables, for compliance rates in randomised trials with noncompliance.
Synthetic control groups
In time series analysis, for example, we need to be able to determine what the outcome variable of a unit of interest would have looked like in the absence of the intervention, which is where it may be helpful to be able to identify units with highly correlated pre-intervention outcomes.
Artificial intelligence to pick out actual causes
Actual causes are those that are judged to be “causally responsible” for an outcome out of a set of counterfactually related events.
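The synthetic control point suggests a simple first step: rank candidate control units by how strongly their pre-intervention outcomes correlate with those of the treated unit. The sketch below does only that ranking. The unit names and numbers are invented, and a real synthetic control method would go on to fit weights over the selected donors rather than stop here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_donors(target_pre, candidates_pre, k):
    """Names of the k candidates most correlated with the target before the intervention."""
    ranked = sorted(
        candidates_pre,
        key=lambda name: pearson(target_pre, candidates_pre[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical weekly outcomes before the intervention.
treated_city = [10, 12, 11, 14, 15, 17]
candidates = {
    "city_a": [20, 24, 22, 28, 30, 34],   # moves in lockstep with the treated city
    "city_b": [5, 5, 6, 5, 6, 5],         # essentially unrelated
    "city_c": [17, 15, 14, 11, 12, 10],   # moves in the opposite direction
}

print(rank_donors(treated_city, candidates, k=2))
```

With many candidate units and long histories, this kind of purely correlational screening is where large datasets feed directly into a causal estimate.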

“We postulate that the major impediment to achieving accelerated learning speeds as well as human level performance can be overcome by […] equipping learning machines with causal reasoning tools. This postulate would have been speculative twenty years ago, prior to the mathematization of counterfactuals. Not so today. Advances in graphical and structural models have made counterfactuals computationally manageable and thus rendered metastatistical learning worthy of serious exploration.”
Judea Pearl, Theoretical Impediments to Machine Learning, 2016

Conclusions

The rumours of the death of causal inference were strongly exaggerated
The arguments probably weren’t very good to begin with, but they did have the merit of drawing our attention to the intersection of these two important fields.
Causal inference is here to stay
If anything, the field has become more active in recent years, thanks to technological and methodological developments. Correlational methods alone don’t answer the question of what would have happened in the absence of an intervention.
The future belongs to both big data and causal inference
When it comes to the relationship between big data and causal inference, perhaps the most exciting recent developments are in the area of combining causal inference methods with big data approaches.

Totte Harinen, Data Scientist II, [email protected]
Thank you
