
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017

Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.

https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference

Big Data Spain 2017
November 16th-17th, Kinépolis Madrid

Big Data Spain

December 04, 2017


Transcript

  2. Why big data
    didn’t end causal
    inference
    Totte Harinen
    Data Scientist II
    Uber Labs

  3. Structure of talk
    1. The case against causal inference in the big data era
    2. Reasons why big data did not (and will not) end causal inference
    3. A reconciliation between big data and causal inference?
    4. Conclusions

  4. Causal inference: Any modelling
    approach where some parts of the
    model are assumed to correspond
    to some aspects of the causal
    structure of the world.
    Big data: Lots of observations and/or
    variables per observation.

  5. The case against causal inference in the big data era

  6. “Scientists are trained to recognize that correlation is
    not causation, that no conclusions should be drawn
    simply on the basis of correlation between X and Y (it
    could just be a coincidence). Instead, you must
    understand the underlying mechanisms that connect
    the two. […]
    There is now a better way. Petabytes allow us to say:
    ‘Correlation is enough.’ We can stop looking for
    models. We can analyze the data without hypotheses
    about what it might show. We can throw the numbers
    into the biggest computing clusters the world has ever
    seen and let statistical algorithms find patterns where
    science cannot.”
    Chris Anderson, Wired 2008

  7. 1. Humans are bad at coming up
    with causal hypotheses
    Coming up with hypotheses
    about the world and then testing
    them against experimental
    evidence seems old-fashioned,
    and there’s a sense that humans
    are somehow bad at this. The
    experimental method was
    certainly developed before
    powerful computing, so it’s not a
    crazy idea that it’s due for a
    revolution just like so many other
    things have been.
    2. Correlational models form a
    more accurate picture of reality
    Anderson refers to the fact that
    models are abstractions of the
    underlying reality. He suggests
    that the correlational approach
    results in a more accurate picture
    of how the world is because such
    an approach is more flexible in its
    assumptions and can incorporate
    more complexity.
    3. Data analysis just seems to be
    headed towards the correlational
    approach
    It’s clear that running
    correlational analyses on big
    datasets has resulted in progress
    both in science and business, and
    this big-data-driven progress has presumably grown over time. So we could extrapolate
    that progress is increasingly
    going to be based on the
    correlational approach.
    Reasons why big data might have ended causal inference

  8. And still… causal inference seems
    to be doing just fine
    Science
    Randomised controlled trials,
    mediation analysis and quasi-
    experiments: 41400 Google
    Scholar hits since the
    beginning of the year
    Business and policy
    A/B testing in business,
    incrementality measurement
    in advertising, experimental
    methods in public policy
    And last but not least…
    At Uber, we’re applying causal
    inference methods to answer
    questions relevant to our
    business

  9. Reasons why big data did not (and will not) end causal
    inference

  10. Humans are good at causal
    hypotheses
    This is because during our
    evolutionary history it’s been
    useful to be able to answer
    the question: “If I changed X,
    what would happen to Y?”
    Abstractness is what makes
    models useful
    When we abstract away from
    the particulars of a situation,
    we can generalize into other
    similar contexts.
    Correlational approaches
    don’t give us the
    counterfactual
    Estimating a causal effect
    requires estimating what
    would have happened in the
    absence of the cause.
    Three quick considerations

  11. Bigger data = better causal
    inference
    Bigger sample sizes enable
    us to identify smaller causal
    effects and/or have a
    greater number of
    treatment arms in a
    standard RCT. Participant
    matching approaches
    benefit from a larger
    number of covariates. Time-series-based causal inference methods require
    multiple observations… All
    of which were difficult to
    achieve before the big data
    era.
    Technology
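
To make the sample-size point on this slide concrete, here is a minimal Python sketch (not from the talk) of how the minimum detectable effect of a standard two-arm comparison of means shrinks as the per-arm sample size grows; the significance level, power and sample sizes are illustrative assumptions.

```python
# Minimal sketch (not from the talk): the smallest effect a two-arm test
# can detect shrinks roughly as 1/sqrt(n). All numbers are illustrative.
from scipy.stats import norm

def minimum_detectable_effect(n_per_arm, sigma=1.0, alpha=0.05, power=0.8):
    """Smallest true difference in means a two-sample z-test can detect
    at the given significance level and power, assuming equal variance."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value, two-sided test
    z_beta = norm.ppf(power)            # quantile matching the target power
    return (z_alpha + z_beta) * sigma * (2 / n_per_arm) ** 0.5

for n in (1_000, 100_000, 10_000_000):
    print(f"n per arm = {n:>10,}: MDE ≈ {minimum_detectable_effect(n):.4f} sd")
```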

  12. Email open rate before
    and after a
    personalised title
    Interrupted time series
    analysis is a classic
    method for inferring
    the causal impact of
    an intervention, based
    on qualitative
    assumptions of the
    underlying causal
    structure. However,
    until recently, high-quality time series data was hard to get.
    Example:
    interrupted time
    series analysis
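
As a rough illustration of the interrupted time series method described on this slide, the following sketch runs a segmented regression on simulated daily open rates with a level shift on the day a personalised title is introduced. The data, the change point and the effect size are made up for the example and are not from the talk.

```python
# Illustrative sketch only: segmented-regression interrupted time series
# on simulated daily email open rates, with a title change at day 60.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = np.arange(120)
post = (days >= 60).astype(float)              # 1 after the intervention
time_since = np.where(post == 1, days - 60, 0)

# Simulated open rate: mild upward trend, +3pp level shift after the change.
open_rate = 0.20 + 0.0002 * days + 0.03 * post + rng.normal(0, 0.01, days.size)

# Segmented regression: intercept, pre-trend, level shift, slope change.
X = sm.add_constant(np.column_stack([days, post, time_since]))
fit = sm.OLS(open_rate, X).fit()
print(fit.params)  # params[2] estimates the immediate jump in open rate
```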

  13. Causal inference is in its
    infancy
    The formal language to
    describe causal
    relationships has been
    developed fairly recently.
    This has enabled both the
    development of better
    computational methods
    for causal inference as
    well as the clarification of
    key assumptions.
    New methodologies
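
One way to see what the formal language buys us, as a toy sketch that is not from the slides: when the causal structure is known, adjusting for a confounder (a backdoor adjustment) recovers the true effect, while the raw correlation does not. All numbers are illustrative.

```python
# Toy illustration: Z confounds X -> Y, so the raw X-Y slope is biased,
# while adjusting for Z recovers the true effect of 2.0.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                       # confounder
x = 1.5 * z + rng.normal(size=n)             # treatment driven partly by Z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # outcome; true effect of X is 2.0

# Naive slope of Y on X alone (biased upward by the confounder).
naive = np.polyfit(x, y, 1)[0]

# Adjusted effect: regress Y on both X and Z.
design = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")  # ≈ 3.4 vs ≈ 2.0
```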

  14. How much of the
    impact of an email is
    mediated via click
    through?
    Causal mediation
    modeling is designed
    to reveal the
    mechanisms through
    which the impact of an
    intervention is
    mediated. Until
    recently, there weren’t
    easy-to-implement
    methods to run
    mediation models
    with nonparametric
    data.
    Example: causal
    mediation modeling
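
For intuition only, here is a minimal linear mediation sketch on simulated data: the indirect effect of the email that flows through click-through is the product of the email→click and click→outcome coefficients. The slide refers to more general nonparametric mediation methods; this classic product-of-coefficients version is only the simplest case, and the variable names are illustrative.

```python
# Minimal linear mediation sketch on simulated data (not the nonparametric
# methods mentioned on the slide). True indirect = 0.15, direct = 0.10.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50_000
email = rng.integers(0, 2, n).astype(float)           # randomised treatment
click = 0.3 * email + rng.normal(0, 1, n)             # mediator: click-through
outcome = 0.5 * click + 0.1 * email + rng.normal(0, 1, n)

# a-path: effect of the email on the mediator.
a = sm.OLS(click, sm.add_constant(email)).fit().params[1]
# b-path and direct effect: outcome on mediator and treatment together.
fit = sm.OLS(outcome, sm.add_constant(np.column_stack([click, email]))).fit()
b, direct = fit.params[1], fit.params[2]

print(f"indirect (via click) ≈ {a * b:.3f}, direct ≈ {direct:.3f}")
```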

  15. A reconciliation between big data and causal inference?

  16. Big data enables us to do more and better causal inference
    Better participant matching, subpopulation analyses, multi-arm trials and the ability to identify smaller
    effects are some of the benefits to causal inference that arise from the existence of big data.
    Big data findings can inspire causal hypotheses
    Experiments, quasi-experiments and causal modelling can be used to test hypotheses about patterns that
    arise from correlational analysis.
    Machine learning methods can help us to estimate causal quantities
    Exciting developments in machine learning may help us to estimate counterfactuals, such as what the outcome
    would have looked like for the treated in the absence of the intervention.
    Some general ways in which big data and causal inference complement each other
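
One simple version of the machine learning point above, sketched under illustrative assumptions (this is not Uber's method): fit an outcome model on untreated units only and use its predictions as the counterfactual for the treated units.

```python
# Hedged sketch: estimate the counterfactual for treated units by fitting
# an outcome model on controls and predicting "what if untreated".
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 20_000
X = rng.normal(size=(n, 5))                       # observed covariates
treated = rng.integers(0, 2, n).astype(bool)      # randomised treatment
baseline = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2])
y = baseline + 2.0 * treated + rng.normal(0, 1, n)    # true effect = 2.0

# Fit the untreated-outcome model on controls, predict for the treated.
model = GradientBoostingRegressor().fit(X[~treated], y[~treated])
y0_hat = model.predict(X[treated])

att = (y[treated] - y0_hat).mean()
print(f"estimated effect on the treated ≈ {att:.2f}")  # ≈ 2.0
```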

  17. Algorithms to identify causal models from observational data
    A causal model implies a certain correlation structure between variables, which can be used to identify
    causal structures from purely observational data.
    Variable selection
    A simple application is that of identifying predictors, out of hundreds of variables, for compliance rates in
    randomised trials with noncompliance.
    Synthetic control groups
    In time series analysis, for example, we need to be able to determine what the outcome variable of a unit of
    interest would have looked like in the absence of the intervention, which is where it may be helpful to be
    able to identify units with highly correlated pre-intervention outcomes.
    Artificial intelligence to pick out actual causes
    Actual causes are those that are judged to be “causally responsible” for an outcome out of a set of
    counterfactually related events.
    Some particularly interesting prospects
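
As a rough sketch of the synthetic control idea above: weight the control units (here with non-negative least squares) so that their combined pre-intervention series tracks the treated unit, then read the effect off the post-intervention gap. The data are simulated and the true lift is 5 by construction.

```python
# Rough synthetic-control sketch on simulated data: weight control units
# to match the treated unit's pre-intervention outcomes.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
T_pre, T_post, n_controls = 40, 20, 12
controls = rng.normal(size=(T_pre + T_post, n_controls)).cumsum(axis=0)

# Treated unit follows a mix of two controls, plus a +5 lift after T_pre.
treated = 0.6 * controls[:, 0] + 0.4 * controls[:, 1] + rng.normal(0, 0.1, T_pre + T_post)
treated[T_pre:] += 5.0

# Fit weights on the pre-period only, then project the counterfactual.
weights, _ = nnls(controls[:T_pre], treated[:T_pre])
synthetic = controls @ weights

effect = (treated[T_pre:] - synthetic[T_pre:]).mean()
print(f"estimated post-intervention lift ≈ {effect:.1f}")  # ≈ 5.0
```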

  18. “We postulate that the major impediment to
    achieving accelerated learning speeds as well
    as human level performance can be overcome
    by […] equipping learning machines with causal
    reasoning tools. This postulate would have
    been speculative twenty years ago, prior to the
    mathematization of counterfactuals. Not so
    today. Advances in graphical and structural
    models have made counterfactuals
    computationally manageable and thus
    rendered metastatistical learning worthy of
    serious exploration.”
    Judea Pearl, Theoretical Impediments to Machine
    Learning, 2016

  19. Conclusions

  20. The rumours of the death of
    causal inference were greatly
    exaggerated
    The arguments probably weren’t
    very good to begin with, but they
    did have the merit of drawing our
    attention to the intersection of
    these two important fields.
    Causal inference is here to stay
    If anything, the field has become
    more active in recent years,
    thanks to technological and
    methodological developments.
    Correlational methods alone
    don’t answer the question of
    what would have happened in
    the absence of an intervention.
    The future belongs to both big
    data and causal inference
    When it comes to the relationship
    between big data and causal
    inference, perhaps the most
    exciting recent developments are
    in the area of combining causal
    inference methods with big data
    approaches.
    Conclusions

  21. Totte Harinen
    Data Scientist II

    [email protected]
    Thank you
