
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017

Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.


Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 04, 2017


  1. Structure of talk

    1. The case against causal inference in the big data era
    2. Reasons why big data did not (and will not) end causal inference
    3. A reconciliation between big data and causal inference?
    4. Conclusions
  2. Causal inference: Any modelling approach where some parts of the model are assumed to correspond to some aspects of the causal structure of the world.

    Big data: Lots of observations and/or variables per observation.
  3. “Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. […] There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

    Chris Anderson, Wired 2008
  4. Reasons why big data might have ended causal inference

    1. Humans are bad at coming up with causal hypotheses
    Coming up with hypotheses about the world and then testing them against experimental evidence seems old-fashioned, and there’s a sense that humans are somehow bad at this. The experimental method was certainly developed before powerful computing, so it’s not a crazy idea that it’s due for a revolution, just as so many other things have been.

    2. Correlational models form a more accurate picture of reality
    Anderson refers to the fact that models are abstractions of the underlying reality. He suggests that the correlational approach results in a more accurate picture of how the world is because such an approach is more flexible in its assumptions and can incorporate more complexity.

    3. Data analysis just seems to be headed towards the correlational approach
    It’s clear that running correlational analyses on big datasets has resulted in progress both in science and business, and this big-data-driven progress has presumably become greater over time. So we could extrapolate that progress is increasingly going to be based on the correlational approach.
  5. And still… causal inference seems to be doing just fine

    Science
    Randomised controlled trials, mediation analysis and quasi-experiments: 41400 Google Scholar hits since the beginning of the year

    Business and policy
    A/B testing in business, incrementality measurement in advertising, experimental methods in public policy

    And last but not least…
    At Uber, we’re applying causal inference methods to answer questions relevant to our business
  6. Three quick considerations

    Humans are good at causal hypotheses
    This is because during our evolutionary history it’s been useful to be able to answer the question: “If I changed X, what would happen to Y?”

    Abstractness is what makes models useful
    When we abstract away from the particulars of a situation, we can generalize into other similar contexts.

    Correlational approaches don’t give us the counterfactual
    Estimating a causal effect requires estimating what would have happened in the absence of the cause.
  7. Technology

    Bigger data = better causal inference
    Bigger sample sizes enable us to identify smaller causal effects and/or have a greater number of treatment arms in a standard RCT. Participant matching approaches benefit from a larger number of covariates. Time series based causal inference methods require multiple observations… All of which were difficult to achieve before the big data era.
  8. Example: interrupted time series analysis

    Email open rate before and after a personalised title

    Interrupted time series analysis is a classic method for inferring the causal impact of an intervention, based on qualitative assumptions of the underlying causal structure. However, until recently, high quality time series data was hard to get.
  9. New methodologies

    Causal inference is in its infancy
    The formal language to describe causal relationships has been developed fairly recently. This has enabled both the development of better computational methods for causal inference and the clarification of key assumptions.
  10. Example: causal mediation modeling

    How much of the impact of an email is mediated via click-through?

    Causal mediation modeling is designed to reveal the mechanisms through which the impact of an intervention is mediated. Until recently, there weren’t easy-to-implement methods to run mediation models with nonparametric data.
  11. Some general ways in which big data and causal inference complement each other

    Big data enables us to do more and better causal inference
    Better participant matching, subpopulation analyses, multi-arm trials and the ability to identify smaller effects are some of the benefits to causal inference that arise from the existence of big data.

    Big data findings can inspire causal hypotheses
    Experiments, quasi-experiments and causal modelling can be used to test hypotheses about patterns that arise from correlational analysis.

    Machine learning methods can help us to estimate causal quantities
    Exciting developments in machine learning may help us to estimate counterfactuals like what the outcome would have looked like for the treated in the absence of the intervention.
  12. Some particularly interesting prospects

    Algorithms to identify causal models from observational data
    A causal model implies a certain correlation structure between variables, which can be used to identify causal structures from purely observational data.

    Variable selection
    A simple application is that of identifying predictors, out of hundreds of variables, for compliance rates in randomised trials with noncompliance.

    Synthetic control groups
    In time series analysis, for example, we need to be able to determine what the outcome variable of a unit of interest would have looked like in the absence of the intervention, which is where it may be helpful to be able to identify units with highly correlated pre-intervention outcomes.

    Artificial intelligence to pick out actual causes
    Actual causes are those that are judged to be “causally responsible” for an outcome out of a set of counterfactually related events.
  13. “We postulate that the major impediment to achieving accelerated learning speeds as well as human level performance can be overcome by […] equipping learning machines with causal reasoning tools. This postulate would have been speculative twenty years ago, prior to the mathematization of counterfactuals. Not so today. Advances in graphical and structural models have made counterfactuals computationally manageable and thus rendered metastatistical learning worthy of serious exploration.”

    Judea Pearl, Theoretical Impediments to Machine Learning, 2016
  14. Conclusions

    The rumours of the death of causal inference were strongly exaggerated
    The arguments probably weren’t very good to begin with, but they did have the merit of drawing our attention to the intersection of these two important fields.

    Causal inference is here to stay
    If anything, the field has become more active in recent years, thanks to technological and methodological developments. Correlational methods alone don’t answer the question of what would have happened in the absence of an intervention.

    The future belongs to both big data and causal inference
    When it comes to the relationship between big data and causal inference, perhaps the most exciting recent developments are in the area of combining causal inference methods with big data approaches.
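
The interrupted time series example on slide 8 can be illustrated with a segmented regression, one common way to run such an analysis. The slides do not say which estimator the speaker used, so this is only a minimal sketch: the function name `interrupted_time_series`, the simulated open rates and the intervention day are all assumptions made for illustration, and the model ignores autocorrelation and seasonality that a real analysis would need to handle.

```python
import numpy as np

def interrupted_time_series(y, intervention_index):
    """Segmented regression (hypothetical helper, not from the talk):
    y_t = b0 + b1*t + b2*post_t + b3*(t - t0)*post_t + noise."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= intervention_index).astype(float)   # 1 after the intervention
    time_since = (t - intervention_index) * post     # allows a slope change
    X = np.column_stack([np.ones_like(t), t, post, time_since])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["baseline_level", "baseline_trend", "level_change", "trend_change"]
    return dict(zip(names, coef))

# Simulated (not real) daily email open rates; the personalised title
# is assumed to start on day 30 and to lift the open rate by 5 points.
rng = np.random.default_rng(0)
days = np.arange(60)
true_jump = 0.05
open_rate = (0.20 + 0.0005 * days + true_jump * (days >= 30)
             + rng.normal(0, 0.005, size=60))

estimates = interrupted_time_series(open_rate, intervention_index=30)
print(estimates)
```

The `level_change` coefficient is the estimated jump at the intervention. It only has a causal reading under the qualitative assumption the slide mentions: that nothing else affecting open rates changed at the same moment.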
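
The mediation question on slide 10 (how much of an email's impact runs through click-through) can be sketched with the classic linear product-of-coefficients decomposition. This is a deliberately simpler technique than the nonparametric methods the slide alludes to. The helper names, the continuous mediator and the simulated data are assumptions, and the decomposition is only causal if treatment is randomised and the mediator-outcome relationship is unconfounded.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def linear_mediation(treatment, mediator, outcome):
    """Hypothetical helper: split the total effect into direct and indirect parts."""
    a = ols(treatment, mediator)[1]                      # treatment -> mediator
    coef = ols(np.column_stack([treatment, mediator]), outcome)
    direct, b = coef[1], coef[2]                         # direct path, mediator -> outcome
    total = ols(treatment, outcome)[1]
    return {"indirect": a * b, "direct": direct, "total": total}

# Simulated (not real) data: email sent (0/1), click activity, outcome.
rng = np.random.default_rng(42)
n = 5000
email = rng.integers(0, 2, size=n).astype(float)
clicks = 0.5 * email + rng.normal(0, 1, size=n)          # assumed a = 0.5
outcome = 0.3 * email + 2.0 * clicks + rng.normal(0, 1, size=n)  # direct 0.3, b = 2.0

effects = linear_mediation(email, clicks, outcome)
print(effects)
```

In linear least-squares models fitted on the same sample, the total effect equals the direct effect plus the indirect effect, which gives a quick sanity check on the output.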
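
The synthetic control idea on slide 12 (finding units with highly correlated pre-intervention outcomes to stand in for the missing counterfactual) can be sketched as follows. This is a simplified illustration, not the full synthetic control method: it picks the `k` most correlated donor units and fits unconstrained least-squares weights, whereas the standard method constrains weights to be non-negative and sum to one. The function name, the donor pool and all data are assumptions.

```python
import numpy as np

def synthetic_control_effect(treated, donors, t0, k=3):
    """Hypothetical helper: average post-intervention gap between the treated
    unit and a weighted combination of its k most correlated donors."""
    treated = np.asarray(treated, dtype=float)
    donors = np.asarray(donors, dtype=float)             # shape (n_donors, T)
    # Rank donors by correlation with the treated unit before the intervention.
    pre_corr = np.array([np.corrcoef(treated[:t0], d[:t0])[0, 1] for d in donors])
    top = np.argsort(pre_corr)[-k:]
    # Fit weights (plus intercept) on the pre-intervention period only.
    X_pre = np.column_stack([np.ones(t0), donors[top, :t0].T])
    w, *_ = np.linalg.lstsq(X_pre, treated[:t0], rcond=None)
    # Predict what the treated unit would have looked like without the intervention.
    X_post = np.column_stack([np.ones(len(treated) - t0), donors[top, t0:].T])
    counterfactual = X_post @ w
    return float(np.mean(treated[t0:] - counterfactual)), top

# Simulated (not real) data: three donors share a common factor with the
# treated unit, five are pure noise; the intervention adds 2.0 from t0 on.
rng = np.random.default_rng(1)
T, t0, true_effect = 100, 70, 2.0
t = np.arange(T)
factor = 3.0 * np.sin(t / 5.0) + 0.05 * t
related = np.array([l * factor + rng.normal(0, 0.3, T) for l in (0.8, 1.0, 1.2)])
unrelated = rng.normal(0, 1, size=(5, T))
donors = np.vstack([related, unrelated])
treated = factor + rng.normal(0, 0.3, T) + true_effect * (t >= t0)

effect, chosen = synthetic_control_effect(treated, donors, t0)
print(effect, sorted(chosen.tolist()))
```

Selecting donors on pre-intervention correlation is the purely correlational step; the causal reading comes from the assumption that the donors were not themselves affected by the intervention.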