
Your data doesn't mean what you think it does

Keen
June 30, 2015

Zan Armstrong teaches us about visualizing statistical mix effects and Simpson's Paradox.

Transcript

  1. Your data doesn't mean what you think it does: Visualizing Statistical Mix Effects and Simpson's Paradox

    Zan Armstrong, @zanstrong
  2. The median wage increased by 0.9%*

    *In the United States, from 2000 to 2013, inflation adjusted, workers 25+ years old
  3. But we see that the median wage decreased for every education category.

    The median wage increased by 0.9%*

    | Segment | Change in Median Wage (%) |
    |---|---|
    | Overall | +0.9% |
    | No degree | -7.9% |
    | High school | -4.7% |
    | Some college | -7.6% |
    | Bachelor's + | -1.2% |

    *In the United States, from 2000 to 2013, inflation adjusted
  4. How can every group be worse than average?

    | Segment | Change in Median Wage (%) |
    |---|---|
    | Overall | +0.9% |
    | No degree | -7.9% |
    | High school | -4.7% |
    | Some college | -7.6% |
    | Bachelor's + | -1.2% |
  5. Mix Effects: the labor force shifted.

    More people with Bachelor's degrees working. Fewer people without a degree working.

    | Segment | Change in Median Wage (%) | Median Wage, 2013 | Change in Number Employed (%) |
    |---|---|---|---|
    | Overall | +0.9% | $827 | +6.4% |
    | No degree | -7.9% | $472 | -21.3% |
    | High school | -4.7% | $651 | -10.6% |
    | Some college | -7.6% | $748 | +5.4% |
    | Bachelor's + | -1.2% | $1194 | +33.0% |
  6. Mix Effects: the labor force shifted.

    More people with Bachelor's degrees working. Fewer people without a degree working. Higher educated, and higher paid, segments increased in size.

    | Segment | Change in Median Wage (%) | Median Wage, 2013 | Change in Number Employed (%) |
    |---|---|---|---|
    | Overall | +0.9% | $827 | +6.4% |
    | No degree | -7.9% | $472 | -21.3% |
    | High school | -4.7% | $651 | -10.6% |
    | Some college | -7.6% | $748 | +5.4% |
    | Bachelor's + | -1.2% | $1194 | +33.0% |
  7. The size of segments matters

    | Segment | Change in Median Wage (%): "change in value" | Median Wage, 2013: "value" | Change in Number Employed (%): "change in weight/size" |
    |---|---|---|---|
    | Overall | +0.9% | $827 | +6.4% |
    | No degree | -7.9% | $472 | -21.3% |
    | High school | -4.7% | $651 | -10.6% |
    | Some college | -7.6% | $748 | +5.4% |
    | Bachelor's + | -1.2% | $1194 | +33.0% |
  8. Mix Effects and Simpson's Paradox

    Mix effects: Changes in relative sizes of segments affect average.

    Simpson's paradox: Average moves opposite from any segment. Counterintuitive!
  9. These issues are common

    * Berkeley Graduate Admissions and Gender Bias
    * Race and the death penalty in Florida
    * Standardized test scores/education spending
    * Active debate about diagnosis and treatment of meningococcal disease
    * The "hot hand" in basketball; Baseball hitting averages
    * Treatments for kidney stones

    My motivation: analyzing financial data at Google
  10. 6 Variables that (might) matter

    | Variable | Example |
    |---|---|
    | Value before | median wage before, or unemployment rate before |
    | Value after | median wage after, or unemployment rate after |
    | Change or difference in value | change in median wage, or change in unemployment rate |
    | Weight or Size before | percent of total labor force, or number of people in labor force before |
    | Weight/Size after | labor force after |
    | Change or difference in weight/size | change in percent of total workers, or change in number of workers |
  11. Standard vis: only a few variables

    Only two variables at a time: the size of the box and the color of the box.

    Size represents geographic size rather than size metric.

    [Chart annotations: weight before, weight after; value or weight; values or weights; change in value, or change in weight]
  12. Example: maps

    Question: what part of North Carolina has the worst "low birth weight" percentage? (shown in darker colors)

    Tassone, Eric C., Marie Lynn Miranda, and Alan E. Gelfand. "Disaggregated spatial modelling for areal unit categorical data." Journal of the Royal Statistical Society: Series C (Applied Statistics) 59.1 (2010): 175-190.
  13. Tackling the paradox

    Goal: answer the question: "Is the aggregate change representative of the many smaller groups? If not, why not?"
  14. Comet Chart

    Variation of a scatterplot.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate).
  15. Comet Chart

    For one segment, plot the start and end points.

    4 variables: values, weights before and after.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate). Point labels: start, end.
  16. Comet Chart

    Replace the points with a comet.

    6 variables: values, weights before, after, change.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate). Point labels: start, end.

    Danny Holten and Jarke J. van Wijk. 2009. A user study on visualizing directed edges in graphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA, 2299-2308. DOI=10.1145/1518701.1519054 http://doi.acm.org/10.1145/1518701.1519054
  17. Comet Chart

    Add the rest of the segments to the chart.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate).
  18. Comet Chart

    Annotations: change in value, change in weight.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate).
  19. Comet Chart

    Color highlights certain types of data. Ex: blue segments increased in size while orange segments decreased in size.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate). Legend: size increasing, size decreasing.
  20. Comet Chart

    Or, blue segments increased in value while orange segments decreased in value.

    Axes: weight/size (ex: number of people in labor force) and value (ex: unemployment rate). Legend: value increasing, value decreasing.
  21. Pure value changes

    Unemployment rate and size of labor force by county.

    Legend: unemployment increasing, unemployment decreasing. Time labels: Sept 2012, Sept 2013.
  22. Michigan: reveals different trends for different sizes

    Spotting outliers.

    Unemployment rate and size of labor force by county.

    Legend: unemployment increasing, unemployment decreasing. Time labels: Sept 2012, Sept 2013.
  23. Mix effects: under the hood of a small aggregate change

    Fetal Death Rate and number of babies born by birth-weight categories and 10 US states.

    Each comet represents babies born in 10 different US states and birth-weight category (very small, small, etc).

    Axes: Fetal Death Rate: deaths per 1000 born (log) and Number of babies born (log). Legend: increase in number employed, decrease in # workers. Time labels: late 1990's, late 2000's.
  24. Summary

    * Don't jump to action from your dashboard: Check for mix effects first!
    * Visualization can help us see the flow of our data and ask data-driven questions.
    * Try Comet Charts as a way to discover mix in your data!

    Thank you to Martin Wattenberg, Fernanda Viégas and their Google Data Visualization Research Team!
  25. Curious to learn more?

    Paper: http://research.google.com/pubs/pub42901.html

    My projects: blog.zanarmstrong.com/about

    * Which is bigger: Africa or North America?
    * Which is colder: January or June in SF?

    And here.

    Email: [email protected]

    Twitter: @zanstrong
  26. There is no statistical "silver bullet."

    "It cannot be overemphasized that although these paradoxes reveal the perils of using statistical criteria to guide causal analysis, they hold neither the explanations of the phenomenon they depict nor the pointers on how to avoid them." - Arah

    "The inclusion of additional control variables may increase or decrease the bias, and we cannot know for sure which is the case in any particular situation." - Clarke

    Arah, Onyebuchi A. "Emerging Themes in." Emerging Themes in Epidemiology 5 (2008): 5.

    Clarke, Kevin A. "The phantom menace: Omitted variable bias in econometric research." Conflict Management and Peace Science 22.4 (2005): 341-352.
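
The mix effect described on slides 3 to 8 can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not the method from the talk. It assumes a weighted mean instead of the median used on the slides (medians do not split this cleanly), and the two segments and all their numbers are made up for easy arithmetic. The names `Segment`, `weighted_mean` and `decompose` are hypothetical. It splits the change in the overall average into a "within" part (values changing inside each segment) and a "mix" part (segments changing in relative size).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    weight_before: float  # e.g. number of workers at the start
    weight_after: float   # e.g. number of workers at the end
    value_before: float   # e.g. average wage at the start
    value_after: float    # e.g. average wage at the end


def weighted_mean(segments, when):
    """Overall average at 'before' or 'after', weighting each segment by its size."""
    total = sum(getattr(s, f"weight_{when}") for s in segments)
    weighted = sum(
        getattr(s, f"weight_{when}") * getattr(s, f"value_{when}") for s in segments
    )
    return weighted / total


def decompose(segments):
    """Split the change in the weighted mean into (within, mix) parts.

    within: value changes, held at the starting shares.
    mix:    share changes, valued at the ending values.
    The two parts sum exactly to the total change in the weighted mean.
    """
    total_before = sum(s.weight_before for s in segments)
    total_after = sum(s.weight_after for s in segments)
    within = 0.0
    mix = 0.0
    for s in segments:
        share_before = s.weight_before / total_before
        share_after = s.weight_after / total_after
        within += share_before * (s.value_after - s.value_before)
        mix += (share_after - share_before) * s.value_after
    return within, mix


# Made-up data: both segments lose 20 in value, but the higher-paid one grows.
SEGMENTS = [
    Segment("No degree", 60, 40, 500.0, 480.0),
    Segment("Bachelor's +", 40, 60, 1000.0, 980.0),
]

within, mix = decompose(SEGMENTS)
change = weighted_mean(SEGMENTS, "after") - weighted_mean(SEGMENTS, "before")
every_segment_fell = all(s.value_after < s.value_before for s in SEGMENTS)

print(f"within={within:.1f} mix={mix:.1f} total change={change:.1f}")
print(f"every segment fell: {every_segment_fell}, overall rose: {change > 0}")
```

In this toy data the within part is negative and the mix part is positive and larger, so the overall average rises even though every segment falls, which is the pattern the wage slides show.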
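
The comet construction from slides 15 to 20 can also be sketched in code. This is a minimal geometry sketch and not the talk's implementation. It assumes the thin tail sits at the start point and the wide head at the end point, that coordinates are already in plot space (for example after a log transform), and that unchanged segments get a neutral gray; none of these details are stated on the slides. The function names `comet_polygon` and `comet_color` are hypothetical.

```python
import math


def comet_polygon(start, end, head_width):
    """Return the three vertices of a tapered comet from start to end.

    start, end: (weight, value) pairs in plot coordinates.
    The tail tip is at start; the head is a segment of length head_width
    centered on end and perpendicular to the direction of travel.
    """
    x0, y0 = start
    x1, y1 = end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        # No movement: the comet collapses to a single point.
        return [start, end, end]
    # Unit vector perpendicular to the direction of travel.
    px, py = -dy / length, dx / length
    half = head_width / 2
    return [
        (x0, y0),
        (x1 + px * half, y1 + py * half),
        (x1 - px * half, y1 - py * half),
    ]


def comet_color(before, after):
    """Blue when the chosen quantity increased, orange when it decreased."""
    if after > before:
        return "blue"
    if after < before:
        return "orange"
    return "gray"  # assumption: the slides do not say how unchanged segments are colored


# One segment moving right along the weight axis with an unchanged value.
polygon = comet_polygon((0.0, 0.0), (4.0, 0.0), head_width=2.0)
print(polygon)
print(comet_color(before=100, after=120))
```

Passing weights to `comet_color` gives the coloring of slide 19 (size increasing or decreasing), and passing values gives the coloring of slide 20. The returned vertices can be handed to any plotting library that draws filled polygons.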