Modeling Social Data, Lecture 12: Causality and Experiments

Jake Hofman

April 26, 2019

Transcript

  1. Prediction (seeing): make a forecast, leaving the world as it is, vs. Causation (doing): anticipate what will happen when you make a change in the world.
  2. Prediction (seeing): make a forecast, leaving the world as it is (seeing my neighbor with an umbrella might predict rain), vs. Causation (doing): anticipate what will happen when you make a change in the world (but handing my neighbor an umbrella doesn’t cause rain).
  3. “Causes of effects”: It’s tempting to ask “what caused Y?”, e.g.: ◦ What makes an email spam? ◦ What caused my kid to get sick? ◦ Why did the stock market drop? This is “reverse causal inference”, and it is generally quite hard. John Stuart Mill (1843)
  4. “Effects of causes”: Alternatively, we can ask “what happens if we do X?”, e.g.: ◦ How does education impact future earnings? ◦ What is the effect of advertising on sales? ◦ How does hospitalization affect health? This is “forward causal inference”: still hard, but less contentious! John Stuart Mill (1843)
  5. Example: hospitalization and health. What’s wrong with estimating this model from observational data? [Diagram: Hospital visit today → Health tomorrow, labeled “Effect?”; an arrow means “X causes Y”]
  6. Confounds: The effect and cause might be confounded by a common cause, and be changing together as a result. [Diagram: Hospital visit today → Health tomorrow (“Effect?”), with Health today as a common cause of both; a dashed circle means “unobserved”]
  7. Confounds: If we only get to observe them changing together, we can’t estimate the effect of hospitalization changing alone. [Same diagram: Hospital visit today → Health tomorrow (“Effect?”), both driven by unobserved Health today]
  8. A counterfactual (what-if) definition: What if you had acted differently? E.g., how does the health of a hospitalized patient compare to their health if they had stayed home? We only get to observe one of these outcomes, which is the fundamental problem of causal inference. How does this differ from an observational estimate?
  9. Observational estimates: Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home. The observed difference in health tomorrow is: Δobs = (Sick and went to hospital) − (Healthy and stayed home)
  10. Observational estimates: Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home. The observed difference in health tomorrow is: Δobs = [(Sick and went to hospital) − (Sick if stayed home)] + [(Sick if stayed home) − (Healthy and stayed home)]
  11. Selection bias: Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home. The observed difference in health tomorrow is: Δobs = [(Sick and went to hospital) − (Sick if stayed home)] + [(Sick if stayed home) − (Healthy and stayed home)], where the first bracket is the causal effect and the second is the selection bias (the baseline difference between those who opted in to the treatment and those who didn’t).
  12. Basic identity of causal inference: Let’s say all sick people in our dataset went to the hospital today, and healthy people stayed home. The observed difference in health tomorrow is: Observed difference = Causal effect + Selection bias. Selection bias is likely negative here (the sick would have been less healthy tomorrow than the healthy even if they had stayed home), making the observed difference an underestimate of the causal effect.
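
To make the identity above precise, here is a standard potential-outcomes version (a sketch added for reference; the notation Y(1), Y(0) for health tomorrow with and without a hospital visit, and D for whether someone went, is mine and not used on the slides):

```latex
\begin{align*}
\underbrace{\mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0]}_{\text{observed difference}}
  &= \underbrace{\mathbb{E}[Y(1) \mid D=1] - \mathbb{E}[Y(0) \mid D=1]}_{\text{causal effect on the treated}}
   + \underbrace{\mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]}_{\text{selection bias}}
\end{align*}
```

Here “Sick if stayed home” corresponds to E[Y(0) | D = 1], the unobservable counterfactual, which is why the two right-hand terms cannot be separated from observational data alone.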
  13. Simpson’s paradox: Selection bias can be so large that observational and causal estimates give opposite effects (e.g., going to the hospital appears to make you less healthy). http://vudlab.com/simpsons
  14. Simpson’s paradox: So which is right, the aggregated or the partitioned estimate? It depends on the causal mechanism. https://en.wikipedia.org/wiki/Simpson%27s_paradox
  15. Simpson’s paradox: So which is right, the aggregated or the partitioned estimate? It depends on the causal mechanism. [Figure 4.2 from Morgan and Winship (2015), Chapter 4, “Models of Causal Exposure and Identification Criteria”: simulation of conditional dependence within values of a collider variable, plotting Motivation vs. SAT for Admitted and Rejected applicants to a hypothetical college]
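
To see how a baseline difference can flip the sign of the observed comparison, here is a small worked example with hypothetical recovery counts (not from the lecture); only the grouping changes between the two comparisons:

```python
# Hypothetical recovery counts illustrating Simpson's paradox: within each
# severity group the hospital does better, but in aggregate it looks worse.
counts = {
    # condition: (recovered_hospital, total_hospital, recovered_home, total_home)
    "mild":   (19, 20, 564, 600),
    "severe": (240, 400, 15, 30),
}

for name, (rh, th, rm, tm) in counts.items():
    print(f"{name:>7}: hospital {rh / th:.2f} vs home {rm / tm:.2f}")

# Aggregating reverses the comparison, because the hospital mostly treats
# severe cases -- a baseline difference between groups, i.e. selection bias.
rh, th, rm, tm = (sum(v[i] for v in counts.values()) for i in range(4))
print(f"overall: hospital {rh / th:.2f} vs home {rm / tm:.2f}")
```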
  16. “To find out what happens when you change something, it is necessary to change it.” (George Box)
  17. Counterfactuals: To isolate the causal effect, we have to change one and only one thing (hospital visits) and compare outcomes. [Diagram: Reality (what happened) vs. Counterfactual (what would have happened)]
  18. The ideal causal estimate: clone each person, send one copy to the hospital and make the other stay home, then measure the difference in health between the copies.
  19. Counterfactuals: We never get to observe what would have happened if we did something else, so we have to estimate it. [Diagram: Reality (what happened) vs. Counterfactual (what would have happened)]
  20.–22. Random assignment: We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry. [Diagram, built up over three slides: a coin flip (heads / tails) sends each person to World 1 or World 2]
  23. Basic identity of causal inference: The observed difference is now the causal effect: Observed difference = Causal effect + Selection bias = Causal effect. Selection bias is zero, since there’s no difference, on average, between those who were hospitalized and those who weren’t.
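
A minimal simulation sketch of this identity (the data-generating process below is hypothetical, not from the lecture): when sicker people select into hospital visits the observed difference is badly biased, while a coin-flip assignment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical model: health today confounds both hospital visits and
# health tomorrow; a visit adds a true causal effect of +1.
health_today = rng.normal(size=n)
true_effect = 1.0

def observed_difference(visit):
    """Mean health tomorrow for visitors minus non-visitors."""
    health_tomorrow = health_today + true_effect * visit + rng.normal(size=n)
    return health_tomorrow[visit == 1].mean() - health_tomorrow[visit == 0].mean()

# Observational: sicker people (low health today) are the ones who go.
obs_visit = (health_today < -0.5).astype(int)
# Randomized: a coin flip decides who goes, independent of health today.
rct_visit = rng.integers(0, 2, size=n)

print("observational estimate:", observed_difference(obs_visit))  # biased well below 1, here even negative
print("randomized estimate:   ", observed_difference(rct_visit))  # close to the true effect of 1
```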
  24. Random assignment: Random assignment determines the treatment independent of any confounds. [Diagram: Coin flip ⇒ Hospital visit today → Health tomorrow (“Effect?”), with unobserved Health today still affecting Health tomorrow; double lines mean “intervention”]
  25. Experiments: caveats / limitations. Random assignment is the “gold standard” for causal inference, but it has some limitations: ◦ Randomization often isn’t feasible and/or ethical ◦ Experiments are costly in terms of time and money ◦ It’s difficult to create convincing parallel worlds ◦ Effects in the lab can differ from real-world effects ◦ Inevitably people deviate from their random assignments
  26. Validity of experiments. Internal validity: Could anything other than the treatment (i.e., a confound) have produced this outcome? Was the study double-blind? Did doctors give the experimental drug to some especially sick patients (breaking randomization) hoping that it would save them, or treat patients differently based on whether they got the drug or not? External validity: Do the results of the experiment hold in settings we care about? Would this medication be just as effective outside of a clinical trial, when usage is less rigorously monitored or when tried on a different population of patients? (Slide thanks to Andrew Mao)
  27. Expanding the experiment design space (complexity/realism, size/scale, duration/participation): compared to physical labs, a software-based “virtual lab” with online participants offers • longer periods of time • fewer constraints on location • more samples of data • large-scale social interaction • realistic vs. abstract, simple tasks • more precise instrumentation. (Slide thanks to Andrew Mao)
  28. [Paper: Ron Kohavi, Randal M. Henne, and Dan Sommerfield, “Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO”, KDD 2007, http://exp-platform.com/hippo.aspx. A practical guide to online controlled experiments (A/B tests), with lessons on running continuous A/A tests, automating ramp-up and auto-abort, computing minimum sample sizes from power calculations, assigning 50% of users to treatment (a test that sends a fraction p of traffic to treatment takes roughly 1/(4p(1−p)) times as long as a 50%/50% split), and running for whole weeks to capture day-of-week effects.]
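
The running-time rule quoted above is easy to check numerically; the sketch below simply evaluates the paper’s 1/(4p(1−p)) formula (the function name and printed examples are illustrative):

```python
def relative_running_time(treatment_fraction):
    """How much longer an A/B test takes at a p/(1-p) split than at 50%/50%,
    holding power, effect size, and traffic fixed (Kohavi et al., KDD 2007)."""
    p = treatment_fraction
    return 1.0 / (4.0 * p * (1.0 - p))

for p in (0.5, 0.1, 0.01):
    print(f"{p:.0%} in treatment -> {relative_running_time(p):.1f}x the running time")
# 50% -> 1.0x, 10% -> ~2.8x, 1% -> ~25x (matching the paper's 99%/1% example)
```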
  29. [Paper: Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock, “Experimental evidence of massive-scale emotional contagion through social networks”, PNAS 111(24):8788–8790, 2014. A massive (N = 689,003) Facebook experiment that reduced either positive or negative emotional content in users’ News Feeds; when positive expressions were reduced, people produced fewer positive and more negative posts, and vice versa, which the authors present as evidence of emotional contagion without direct interaction or nonverbal cues.]
  30. [PNAS Editorial Expression of Concern and Correction for the paper above (Verma, doi:10.1073/pnas.1412469111): questions were raised about informed consent and the opportunity to opt out. Because the data were collected by Facebook, a private company not bound by the Common Rule, PNAS deemed publication appropriate, but expressed concern that the data collection may not have been fully consistent with those principles. The correction also updates Jamie E. Guillory’s affiliation.]
  31. Natural experiments: Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: people are randomly exposed to water sources ◦ Instrumental variables: a lottery influences military service ◦ Discontinuities: star ratings get arbitrarily rounded ◦ Difference in differences: minimum wage changes in just one state
  32. Natural experiments: Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: people are randomly exposed to water sources ◦ Instrumental variables: a lottery influences military service ◦ Discontinuities: star ratings get arbitrarily rounded ◦ Difference in differences: minimum wage changes in just one state. Experiments happen all the time; we just have to notice them.
  33. As-if random. Idea: nature randomly assigns conditions. Example: people are randomly exposed to water sources (Snow, 1854). http://bit.ly/johnsnowmap
  34. Instrumental variables. Idea: an instrument independently shifts the distribution of a treatment. Example: a lottery influences military service (Angrist, 1990). [Diagram: Lottery ⇒ Military service → Future earnings (“Effect?”), with confounds affecting both military service and future earnings]
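
A minimal sketch of the instrumental-variables logic on simulated data (the model, numbers, and variable names below are hypothetical, not from Angrist’s study): the naive comparison is biased by the confound, while scaling the lottery’s effect on earnings by its effect on service recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical model: an unobserved confound raises both the chance of
# serving and later earnings; the true causal effect of service is -2.
confound = rng.normal(size=n)
lottery = rng.integers(0, 2, size=n)  # instrument: a random draft number
serve = ((0.5 * lottery + confound + rng.normal(size=n)) > 0.5).astype(float)
earnings = -2.0 * serve + 3.0 * confound + rng.normal(size=n)

# Naive observational comparison of those who served vs. those who didn't.
naive = earnings[serve == 1].mean() - earnings[serve == 0].mean()

# Wald / IV estimate: effect of the instrument on the outcome, scaled by its
# effect on the treatment (valid because the lottery is random and only
# affects earnings through service).
itt = earnings[lottery == 1].mean() - earnings[lottery == 0].mean()
first_stage = serve[lottery == 1].mean() - serve[lottery == 0].mean()
print("naive:", round(naive, 2), "IV:", round(itt / first_stage, 2))
# naive is biased upward by the confound (here it even flips sign); IV is ~ -2
```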
  35. Regression discontinuities. Idea: things change around an arbitrarily chosen threshold. Example: star ratings get arbitrarily rounded (Luca, 2011). [Figure 4, “Average Revenue around Discontinuous Changes in Rating”: each restaurant’s log revenue is de-meaned, then normalized log revenues are averaged within bins based on how far the restaurant’s rating is from a rounding threshold in that quarter; all points with a positive (negative) distance from a discontinuity are rounded up (down).] http://bit.ly/yelpstars
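
A minimal regression-discontinuity sketch on simulated data (the model and numbers are hypothetical, not the Yelp data): compare outcomes just above and just below a rounding threshold, where restaurants should otherwise be very similar.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical model: the underlying rating drives revenue smoothly, but the
# *displayed* star rating is rounded, adding a jump of 0.3 at the threshold.
rating = rng.uniform(3.0, 4.0, size=n)        # true rating, threshold at 3.5
rounded_up = (rating >= 3.5).astype(float)    # displayed half-star rounds up
log_revenue = 1.0 * rating + 0.3 * rounded_up + rng.normal(scale=0.5, size=n)

# RD estimate: difference in means within a narrow window around the threshold.
window = 0.05
above = log_revenue[(rating >= 3.5) & (rating < 3.5 + window)].mean()
below = log_revenue[(rating < 3.5) & (rating >= 3.5 - window)].mean()
print("RD estimate of the rounding jump:", round(above - below, 2))
# ~0.3 plus a small bias from the window width; real analyses fit local
# regressions on each side of the threshold instead of simple means.
```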
  36. Difference in differences. Idea: compare differences after a sudden change with trends in a control group. Example: minimum wage changes in just one state (Card & Krueger, 1994). http://stats.stackexchange.com/a/125266
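
A minimal 2x2 difference-in-differences calculation with made-up employment numbers (illustrative only, not the Card & Krueger estimates):

```python
# Made-up average employment per store (illustrative only).
mean_employment = {
    # (state, period): average employment per store
    ("treated_state", "before"): 20.0, ("treated_state", "after"): 21.0,
    ("control_state", "before"): 23.0, ("control_state", "after"): 21.5,
}

change_treated = mean_employment[("treated_state", "after")] - mean_employment[("treated_state", "before")]
change_control = mean_employment[("control_state", "after")] - mean_employment[("control_state", "before")]

# Assuming parallel trends, the control state's change estimates what would
# have happened in the treated state without the minimum-wage increase.
did = change_treated - change_control
print(f"treated change: {change_treated:+.1f}, control change: {change_control:+.1f}, DiD: {did:+.1f}")
```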
  37. Natural experiments: caveats. Natural experiments are great, but: ◦ Good natural experiments are hard to find ◦ They rely on many (untestable) assumptions ◦ The treated population may not be the one of interest
  38. Closing thoughts: Without appropriate random variation, it’s hard to predict what happens when you change something in the world.
  39. Causality is tricky! “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’” (https://www.xkcd.com/552/)