Slide 1

Slide 1 text

Causality & Experiments MODELING SOCIAL DATA JAKE HOFMAN COLUMBIA UNIVERSITY

Slide 2

Slide 2 text

Prediction Seeing: Make a forecast, leaving the world as it is vs. Causation Doing: Anticipate what will happen when you make a change in the world

Slide 3

Slide 3 text

Prediction Seeing: Make a forecast, leaving the world as it is (seeing my neighbor with an umbrella might predict rain) vs. Causation Doing: Anticipate what will happen when you make a change in the world (but handing my neighbor an umbrella doesn’t cause rain)

Slide 4

Slide 4 text

“Causes of effects” It’s tempting to ask “what caused Y”, e.g. ◦ What makes an email spam? ◦ What caused my kid to get sick? ◦ Why did the stock market drop? This is “reverse causal inference”, and is generally quite hard John Stuart Mill (1843)

Slide 5

Slide 5 text

“Effects of causes” Alternatively, we can ask “what happens if we do X?”, e.g. ◦ How does education impact future earnings? ◦ What is the effect of advertising on sales? ◦ How does hospitalization affect health? This is “forward causal inference”: still hard, but less contentious! John Stuart Mill (1843)

Slide 6

Slide 6 text

Example: Hospitalization on health What’s wrong with estimating this model from observational data? Health tomorrow Hospital visit today Effect? Arrow means “X causes Y”

Slide 7

Slide 7 text

Confounds The effect and cause might be confounded by a common cause, and be changing together as a result Health tomorrow Hospital visit today Effect? Health today Dashed circle means “unobserved”

Slide 8

Slide 8 text

Confounds If we only get to observe them changing together, we can’t estimate the effect of hospitalization changing alone Health tomorrow Hospital visit today Effect? Health today

Slide 9

Slide 9 text

A counterfactual (what-if) definition What if you would have acted differently? E.g., how does the health of a hospitalized patient compare to their health if they would have stayed home? We only get to observe one of these outcomes, which is the fundamental problem of causal inference How does this differ from an observational estimate?
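To make the what-if definition concrete, here is a minimal Python sketch (with made-up health scores) of the potential-outcomes framing: every patient has an outcome under each action, but only the one matching the action actually taken is ever observed.

```python
# A minimal sketch (made-up numbers) of the counterfactual definition: each
# person has two potential outcomes, but we only ever observe the one that
# matches what they actually did.
patients = [
    # health tomorrow if hospitalized, health tomorrow if stayed home, what they did
    {"y_if_hospital": 45, "y_if_home": 40, "hospitalized": True},   # a sick patient
    {"y_if_hospital": 95, "y_if_home": 90, "hospitalized": False},  # a healthy patient
]

for p in patients:
    observed = p["y_if_hospital"] if p["hospitalized"] else p["y_if_home"]
    counterfactual = p["y_if_home"] if p["hospitalized"] else p["y_if_hospital"]
    # The per-person effect y_if_hospital - y_if_home is never computable from
    # real data, which is the fundamental problem of causal inference.
    print(observed, "(observed) vs.", counterfactual, "(never observed)")
```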

Slide 10

Slide 10 text

Observational estimates Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home The observed difference in health tomorrow is: Δobs = (Sick and went to hospital) – (Healthy and stayed home)

Slide 11

Slide 11 text

Observational estimates Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home The observed difference in health tomorrow is: Δobs = [(Sick and went to hospital) – (Sick if stayed home)] + [(Sick if stayed home) - (Healthy and stayed home)]

Slide 12

Slide 12 text

Selection bias Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home The observed difference in health tomorrow is: Δobs = [(Sick and went to hospital) – (Sick if stayed home)] + [(Sick if stayed home) - (Healthy and stayed home)] Causal effect Selection bias (Baseline difference between those who opted in to the treatment and those who didn’t)

Slide 13

Slide 13 text

Basic identity of causal inference Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home The observed difference in health tomorrow is: Observed difference = Causal effect + Selection bias Selection bias is likely negative here, making the observed difference an underestimate of the causal effect
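A small simulation (with made-up health scores and an assumed treatment effect) illustrates this identity: when the sick self-select into the hospital, the observed difference equals the causal effect plus a large negative selection bias.

```python
# A minimal sketch (made-up numbers) of: observed difference = causal effect + selection bias.
from statistics import mean
import random

random.seed(0)
true_effect = 5  # assumed benefit of a hospital visit for health tomorrow

people = []
for _ in range(100_000):
    sick = random.random() < 0.5
    health_today = 40 if sick else 90
    hospitalized = sick  # self-selection: the sick go to the hospital, the healthy stay home
    health_tomorrow = health_today + (true_effect if hospitalized else 0)
    people.append((hospitalized, health_today, health_tomorrow))

observed_diff = (mean(y for t, _, y in people if t)
                 - mean(y for t, _, y in people if not t))
selection_bias = (mean(b for t, b, _ in people if t)
                  - mean(b for t, b, _ in people if not t))

print(observed_diff)                  # -45: the hospital "looks" harmful
print(true_effect + selection_bias)   # -45 as well: causal effect + selection bias
```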

Slide 14

Slide 14 text

Simpson’s paradox Selection bias can be so large that observational and causal estimates give opposite effects (e.g., observationally it can look like going to the hospital makes you less healthy) http://vudlab.com/simpsons

Slide 15

Slide 15 text

Simpson’s paradox So which is right, the aggregated or the partitioned? It depends on the causal mechanism https://en.wikipedia.org/wiki/Simpson%27s_paradox
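A toy table of hypothetical recovery counts shows how the reversal can happen: within each severity group the hospital does better, but because sicker patients are the ones who go, the aggregate comparison flips.

```python
# A toy example (hypothetical counts) of Simpson's paradox: hospital beats home
# within every severity group, yet loses in aggregate.
data = {
    # group: (hospital recoveries, hospital patients, home recoveries, home patients)
    "sick":    (27, 90, 2, 10),
    "healthy": (19, 20, 81, 90),
}

for group, (hr, hn, sr, sn) in data.items():
    print(group, hr / hn, ">", sr / sn)   # hospital does better within each group

agg_hospital = sum(v[0] for v in data.values()) / sum(v[1] for v in data.values())
agg_home = sum(v[2] for v in data.values()) / sum(v[3] for v in data.values())
print(agg_hospital, "<", agg_home)        # ...but looks worse in aggregate
```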

Slide 16

Slide 16 text

Simpson’s paradox So which is right, the aggregated or the partitioned? It depends on the causal mechanism [Figure: Morgan and Winship (2015), Fig. 4.2, “Applicants to a Hypothetical College”: a simulation of conditional dependence within values of a collider variable, plotting Motivation against SAT for rejected vs. admitted applicants]

Slide 17

Slide 17 text

“To find out what happens when you change something, it is necessary to change it.” -GEORGE BOX

Slide 18

Slide 18 text

Controlled experiments

Slide 19

Slide 19 text

Counterfactuals To isolate the causal effect, we have to change one and only one thing (hospital visits) and compare outcomes: Reality (what happened) vs. Counterfactual (what would have happened)

Slide 20

Slide 20 text

The ideal causal estimate CLONE EACH PERSON SEND ONE COPY TO THE HOSPITAL, MAKE THE OTHER STAY HOME MEASURE THE DIFFERENCE IN HEALTH BETWEEN THE COPIES

Slide 21

Slide 21 text

But this might be confounded for various reasons, e.g., Mark has a different diet than Scott

Slide 22

Slide 22 text

Counterfactuals We never get to observe what would have happened if we did something else, so we have to estimate it: Reality (what happened) vs. Counterfactual (what would have happened)

Slide 23

Slide 23 text

Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry (coin flip: Heads → World 1, Tails → World 2)

Slide 24

Slide 24 text

Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry (coin flip: Heads → World 1, Tails → World 2)

Slide 25

Slide 25 text

Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry (World 1 vs. World 2)

Slide 26

Slide 26 text

Basic identity of causal inference The observed difference is now the causal effect: Observed difference = Causal effect + Selection bias = Causal effect Selection bias is zero, since there’s no difference, on average, between those who were hospitalized and those who weren’t
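Continuing the earlier sketch with the same made-up health scores: if a coin flip rather than sickness decides who is hospitalized, the two groups have the same baseline health on average and the observed difference recovers the assumed effect.

```python
# Same simulated population as before, but treatment is now a coin flip,
# independent of health today, so selection bias averages out to ~0.
from statistics import mean
import random

random.seed(0)
true_effect = 5

people = []
for _ in range(100_000):
    sick = random.random() < 0.5
    health_today = 40 if sick else 90
    hospitalized = random.random() < 0.5   # random assignment, independent of health today
    health_tomorrow = health_today + (true_effect if hospitalized else 0)
    people.append((hospitalized, health_tomorrow))

observed_diff = (mean(y for t, y in people if t)
                 - mean(y for t, y in people if not t))
print(round(observed_diff, 1))             # close to 5: the observed difference is the causal effect
```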

Slide 27

Slide 27 text

Random assignment Random assignment determines the treatment independent of any confounds Coin flip Hospital visit today Health tomorrow Effect? Health today Double lines mean “intervention”

Slide 28

Slide 28 text

Random assignment Dunning (2012)

Slide 29

Slide 29 text

Experiments: Caveats / limitations Random assignment is the “gold standard” for causal inference, but it has some limitations: ◦ Randomization often isn’t feasible and/or ethical ◦ Experiments are costly in terms of time and money ◦ It’s difficult to create convincing parallel worlds ◦ Effects in the lab can differ from real-world effects ◦ Inevitably people deviate from their random assignments

Slide 30

Slide 30 text

Validity of experiments INTERNAL VALIDITY Could anything other than the treatment (i.e. a confound) have produced this outcome? Was the study double-blind? Did doctors give the experimental drug to some especially sick patients (breaking randomization) hoping that it would save them? Or treat patients differently based on whether they got the drug or not? EXTERNAL VALIDITY Do the results of the experiment hold in settings we care about? Would this medication be just as effective outside of a clinical trial, when usage is less rigorously monitored or when tried on a different population of patients? Slide thanks to Andrew Mao

Slide 31

Slide 31 text

Expanding the experiment design space A software-based “virtual lab” with online participants extends physical labs along three axes (complexity/realism, size/scale, duration/participation): • Longer periods of time • Fewer constraints on location • More samples of data • Large-scale social interaction • Realistic vs. abstract, simple tasks • More precise instrumentation Slide thanks to Andrew Mao

Slide 32

Slide 32 text

Kohavi, R., Henne, R. M., and Sommerfield, D. (2007). “Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO.” KDD 2007. http://exp-platform.com/hippo.aspx (slide shows the paper’s first page and excerpts of its practical lessons: mine the data, speed matters, test one factor at a time or not, run continuous A/A tests, automate ramp-up and abort, determine the minimum sample size, assign 50% of users to treatment, and beware of day-of-week effects)

Slide 33

Slide 33 text

Kramer, A. D. I., Guillory, J. E., and Hancock, J. T. (2014). “Experimental evidence of massive-scale emotional contagion through social networks.” PNAS 111(24): 8788–8790. (slide shows the paper’s first page: a massive, N = 689,003 experiment on Facebook in which reducing positive or negative emotional content in the News Feed shifted the emotional content of users’ own posts)

Slide 34

Slide 34 text

Verma, I. M. (2014). “Editorial Expression of Concern and Correction” regarding Kramer et al. (2014). PNAS, 10.1073/pnas.1412469111. (slide shows the editorial: questions were raised about informed consent and the opportunity to opt out, since the experiment was run by Facebook for internal purposes and was not subject to Cornell’s IRB review)

Slide 35

Slide 35 text

Experiments with non-compliance

Slide 36

Slide 36 text

Dunning (2012)

Slide 37

Slide 37 text

Experiments with non-compliance

Slide 38

Slide 38 text

Natural experiments

Slide 39

Slide 39 text

Natural experiments Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: People are randomly exposed to water sources ◦ Instrumental variables: A lottery influences military service ◦ Discontinuities: Star ratings get arbitrarily rounded ◦ Difference in differences: Minimum wage changes in just one state

Slide 40

Slide 40 text

Natural experiments Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: People are randomly exposed to water sources ◦ Instrumental variables: A lottery influences military service ◦ Discontinuities: Star ratings get arbitrarily rounded ◦ Difference in differences: Minimum wage changes in just one state Experiments happen all the time, we just have to notice them

Slide 41

Slide 41 text

As-if random Idea: Nature randomly assigns conditions Example: People are randomly exposed to water sources (Snow, 1854) http://bit.ly/johnsnowmap

Slide 42

Slide 42 text

Instrumental variables Idea: An instrument independently shifts the distribution of a treatment Example: A lottery influences military service (Angrist, 1990) Military service Future earnings Effect? Confounds Lottery
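A hedged sketch of the instrumental-variables logic on simulated data (all numbers invented for illustration): because the lottery shifts service rates but affects earnings only through service, the Wald ratio of the two lottery-group differences recovers the effect despite an unobserved confound.

```python
# A minimal IV / Wald-estimator sketch on simulated data (made-up numbers):
# effect = (earnings gap by lottery) / (service-rate gap by lottery).
from statistics import mean
import random

random.seed(0)
true_effect = -2_000   # assumed causal effect of service on future earnings

rows = []
for _ in range(200_000):
    lottery = random.random() < 0.5                       # the instrument: as-good-as-random
    health = random.gauss(0, 1)                           # unobserved confound
    served = (random.random() < (0.7 if lottery else 0.2)) and health > -1.0
    earnings = 30_000 + 5_000 * health + (true_effect if served else 0) + random.gauss(0, 1_000)
    rows.append((lottery, served, earnings))

dy = mean(y for z, _, y in rows if z) - mean(y for z, _, y in rows if not z)
dd = mean(d for z, d, _ in rows if z) - mean(d for z, d, _ in rows if not z)
print(round(dy / dd))                                     # close to -2000: the Wald / IV estimate
```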

Slide 43

Slide 43 text

Regression discontinuities Idea: Things change around an arbitrarily chosen threshold Example: Star ratings get arbitrarily rounded (Luca, 2011) [Figure 4: Average revenue around discontinuous changes in rating; each restaurant’s log revenue is de-meaned and averaged within bins by how far its rating is from a rounding threshold, with positive distances rounded up and negative distances rounded down] http://bit.ly/yelpstars
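A minimal sketch of the regression-discontinuity comparison on simulated ratings (the rounding rule, effect size, and revenue model are all assumptions for illustration): restaurants just above and just below a rounding threshold are essentially interchangeable, so comparing them in a narrow window estimates the effect of being rounded up.

```python
# A minimal RD sketch on simulated data: compare restaurants just below vs.
# just above a half-star rounding threshold.
from statistics import mean
import random

random.seed(0)
jump = 0.05   # assumed boost to log revenue from being rounded up

restaurants = []
for _ in range(200_000):
    rating = random.uniform(3.0, 4.0)            # continuous average rating
    rounded_up = (rating % 0.5) >= 0.25          # rounds up to the nearest half star
    log_revenue = 0.3 * rating + (jump if rounded_up else 0) + random.gauss(0, 0.1)
    restaurants.append((rating, rounded_up, log_revenue))

# Narrow window around one threshold (3.75, which rounds up to 4 stars).
window = [(u, y) for r, u, y in restaurants if abs(r - 3.75) < 0.01]
rd_estimate = (mean(y for u, y in window if u)
               - mean(y for u, y in window if not u))
print(round(rd_estimate, 3))                     # roughly 0.05
```

A real analysis would also model the trend in the rating on each side of the threshold rather than relying on a narrow window alone.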

Slide 44

Slide 44 text

Difference in differences Idea: Compare differences after a sudden change with trends in a control group Example: Minimum wage changes in just one state (Card & Krueger, 1994) http://stats.stackexchange.com/a/125266
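The difference-in-differences arithmetic itself is tiny; here is a sketch with hypothetical employment numbers (not Card & Krueger’s actual estimates): subtract the control state’s before/after change from the treated state’s before/after change.

```python
# Diff-in-diff on hypothetical average employment per store.
employment = {
    ("NJ", "before"): 20.0, ("NJ", "after"): 21.0,   # NJ raised its minimum wage
    ("PA", "before"): 23.0, ("PA", "after"): 21.5,   # neighboring PA (control) did not
}

change_nj = employment[("NJ", "after")] - employment[("NJ", "before")]
change_pa = employment[("PA", "after")] - employment[("PA", "before")]
print(change_nj - change_pa)   # 2.5: the estimate, assuming parallel trends absent the change
```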

Slide 45

Slide 45 text

Natural experiments: Caveats Natural experiments are great, but: ◦ Good natural experiments are hard to find ◦ They rely on many (untestable) assumptions ◦ The treated population may not be the one of interest

Slide 46

Slide 46 text

Closing thoughts Large-scale observational data is useful for building predictive models of a static world

Slide 47

Slide 47 text

Closing thoughts But without appropriate random variation, it’s hard to predict what happens when you change something in the world

Slide 48

Slide 48 text

Closing thoughts Randomized experiments are like custom-made datasets to answer a specific question

Slide 49

Slide 49 text

Closing thoughts Additional data + algorithms can help us discover and analyze these examples in the wild

Slide 50

Slide 50 text

https://www.xkcd.com/552/ “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’” Causality is tricky!