Causal Thinking for Descriptive Research

Abstract: Causal inference is hard, and everyone knows it. It is less
recognized that descriptive and comparative scholarship also rely upon
causal inference. How data are sampled and curated influences how we
should process the data, in order to accurately describe or compare
the people, times, and places of interest. I'll present some examples
to illustrate the problems that ignoring causal structure can create,
along with some solutions.

Richard McElreath

September 21, 2021

Transcript

  1. Causal Thinking for Descriptive Research

    Richard McElreath, MPI-EVA Leipzig
  2. Light | Dark; Good | Evil; Civilization | Barbarism; Inferential | Descriptive
  3. No causes in; nothing much out

    Tests of non-causal laws require background information too and much of it will necessarily be causal. There is a slogan in the book about attempts to infer causes purely from probabilities: “No causes in; no causes out.” But the stronger conclusion is intended as well: “No causes in; nothing much out.”
    Nancy Cartwright 1989 Nature’s Capacities and Their Measurement
  4. Causal thinking for everyone

    • The principles that license causal inference in experiments are fundamentally the same as those that license description
    • Causal inference depends upon trustworthy description
    • Description depends upon causal information
    • Big data alone just means bigger bias
  5. Some ordinary descriptive terrors

    • Missing values
    • Measurement
    • Relevance
  6. None
  7. None
  8. Where is your god now?

    [Figure: population size against time (year); points mark moralizing gods present, absent, or unknown.]
    Figure 15.7. Missing values in the Moralizing_gods data. The blue points, both open and filled, are observed values for the presence of beliefs about moralizing gods. The x symbols are unknowns, the missing values.
    McElreath 2020 Statistical Rethinking, page 514 (page header: Missing Data and Other Opportunities)
  9. Where is your god now?

    • Missing data are not the real problem.
    • The cause of the missingness is the problem.
    • Data for Hawaii: Captain James Cook (1728–1779)
  10. Missing methods

    • A circus of methods for handling missing values
      • drop all cases with missing values
      • replace missing values with means
      • “multiple imputation”
      • Bayesian imputation
      • prayer
    MISSING DATA ARE A CAUSAL PROBLEM
  11. X Y Drawing our assumptions X Y observed variable unobserved

    variable a causal relationship
  12. Ỵ Y Missing is a mechanism true observed with missing

    values
  13. Ỵ D Y Missing is a mechanism true observed with

    missing values mechanism (a dog)
  14. Ỵ X D Y Benign missingness true observed with missing

    values mechanism (a dog) cause
  15. Ỵ X D Y Benign missingness

    [Figure: scatterplot of Y against X.]
  16. Ỵ X D Y Less benign missingness

  17. Ỵ X D Y Less benign missingness

    [Figure: scatterplot of Y against X.]
    If we can model X → Y, we can recover p(Y)
  18. Ỵ X D Y Utterly evil missingness

  19. Ỵ X D Y Utterly evil missingness

    [Figure: scatterplot of Y against X.]
    If we can model Y → D, there is some hope
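
The three missingness slides can be sketched as a small simulation. This is a minimal sketch, not the talk's own code: it assumes X causes Y linearly, that "benign" means D ignores both X and Y, "less benign" means X causes D, and "utterly evil" means Y causes D. The function names, thresholds, and sample size are hypothetical.

```python
import random
from statistics import fmean


def simulate(mechanism, n=20000, seed=1):
    """Simulate X -> Y and a missingness mechanism D (assumed forms)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)          # X -> Y
        if mechanism == "benign":            # D independent of X and Y
            missing = rng.random() < 0.5
        elif mechanism == "less_benign":     # X -> D
            missing = x > 0
        elif mechanism == "evil":            # Y -> D
            missing = y > 0
        else:
            raise ValueError(mechanism)
        rows.append((x, y, missing))
    return rows


def true_mean(rows):
    """Mean of Y over all cases, including those we would not observe."""
    return fmean(y for _, y, _ in rows)


def complete_case_mean(rows):
    """Mean of Y after dropping every case with a missing Y."""
    return fmean(y for _, y, missing in rows if not missing)


def mean_via_x(rows):
    """Fit Y ~ X on observed cases, then average predictions over all X."""
    obs = [(x, y) for x, y, missing in rows if not missing]
    mx = fmean(x for x, _ in obs)
    my = fmean(y for _, y in obs)
    slope = (sum((x - mx) * (y - my) for x, y in obs)
             / sum((x - mx) ** 2 for x, _ in obs))
    intercept = my - slope * mx
    return intercept + slope * fmean(x for x, _, _ in rows)


for mechanism in ("benign", "less_benign", "evil"):
    rows = simulate(mechanism)
    print(mechanism, round(true_mean(rows), 2),
          round(complete_case_mean(rows), 2), round(mean_via_x(rows), 2))
```

Under these assumptions, dropping incomplete cases is roughly harmless only in the benign case. Modeling X → Y repairs the estimate when X drives the missingness, and neither approach repairs it when Y itself drives the missingness.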
  20. Where is your god now?

    [Figure: population size against time (year); points mark moralizing gods present, absent, or unknown.]
    Figure 15.7. Missing values in the Moralizing_gods data. The blue points, both open and filled, are observed values for the presence of beliefs about moralizing gods. The x symbols are unknowns, the missing values.
    McElreath 2020 Statistical Rethinking, page 514 (page header: Missing Data and Other Opportunities)
  21. [Figure: population size against time (year); points mark moralizing gods present, absent, or unknown; overlaid with a cross-tabulation of gods by literacy.]

            literacy
    gods     0   1 <NA>
      0     16   1    0
      1      9 310    0
      <NA> 442  86    0

    Cropped book text, legible fragments only: “… of missing values are for nonliter…”, “… with smaller polities. This is possibly because …”, “… to be literate. These data are structured by the …”, “… moralizing gods could be common or rare depe…”
  22. Same figure and table as slide 21.
  23. Same figure and table as slide 21.
  24. G̣ D G moralizing gods

  25. G̣ L D G literacy moralizing gods

  26. G̣ L D G P population literacy moralizing gods

  27. G̣ L D G P

    Only large P have L = 1
    G observed only when L = 1
    No info re assoc P and G when L = 0
    No way to reconstruct distribution of G

    R code (truncated on the slide): with( Moralizing_gods , table( gods=moralizin

            literacy
    gods     0   1 <NA>
      0     16   1    0
      1      9 310    0
      <NA> 442  86    0
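
The counts in the table make the point concrete. The sketch below only re-reads those counts; the dictionary layout and function names are illustrative, and it assumes rows are gods and columns are literacy, as the R table header suggests.

```python
# counts[gods][literacy], copied from the cross-tabulation on the slide
counts = {
    "0":  {"0": 16,  "1": 1,   "NA": 0},
    "1":  {"0": 9,   "1": 310, "NA": 0},
    "NA": {"0": 442, "1": 86,  "NA": 0},
}


def missing_share(literacy):
    """Share of cases at this literacy level whose gods value is missing."""
    total = sum(counts[gods][literacy] for gods in counts)
    return counts["NA"][literacy] / total


def observed_presence_share(literacy):
    """Share of observed gods values equal to 1 at this literacy level."""
    observed = counts["0"][literacy] + counts["1"][literacy]
    return counts["1"][literacy] / observed


for literacy in ("0", "1"):
    print("literacy =", literacy,
          "missing:", round(missing_share(literacy), 3),
          "present among observed:", round(observed_presence_share(literacy), 3))
```

For non-literate cases 442 of 467 gods values are missing, against 86 of 397 for literate cases, so the observed values come overwhelmingly from literate societies.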
  28. Missed opportunities

    • Many sources of data have partial missingness
    • Essential to explore causes of the missing values
    • Sometimes missing values are benign
    • Sometimes missing values preclude description (without more causal assumptions)
    • NO CAUSES IN; NO DESCRIPTION OUT
  29. None
  30. Nick Blurton-Jones (right) interviews a Hadza great-grandmother (second from left)

    and her younger kinswoman (second from right) in 1999. PHOTO: ANNETTE WAGNER FROM FILMING OF TINDIGA—THOSE WHO ARE RUNNING AND HADZABE MEANS: US PEOPLE
  31. [Figure: frequency distribution f(x) of ages at death against age, with curves for hunter-gatherers, acculturated hunter-gatherers, forager-horticulturalists, Sweden, and the United States.]

    Figure caption: Modal ages of adult death. NOTE: Frequency distribution of ages at death f(x) for individuals over age […] shows strong peaks for hunter-gatherers, forager-horticulturalists, acculturated hunter-gatherers, Sweden […], and the United States […].
    Gurven & Kaplan 2008 Longevity Among Hunter-Gatherers
  32. Demography not so simple

    • Most humans did not and do not know their birthdays
    • Many records are estimates or simply falsified
    • The direction and magnitude of error changes with age
    • Analogies in many other kinds of data
  33. None
  34. Newman 2020 Supercentenarian and remarkable age records (CC-BY-NC-ND 4.0 International license)
  35. Figure 2. Number and per capita rate of attaining supercentenarian status across US states, relative …

    Newman 2020 Supercentenarian and remarkable age records
  36. Demography not so simple

    • Most humans do not know their birthdays
    • Many records are estimates or simply falsified
    • The direction and magnitude of error changes with age
    • CONCLUSION: Most census records do not describe the target population
    • But there is hope! We can use causal knowledge to refine age/fertility estimates and get better descriptions.
  37. Causal information about age

    • Every human has exactly one biological father and one biological mother
    • Human gestation is about 270 ± 15 days
    • Female fertility tightly bounded between 20 and 45 years in most populations
    • If we know family structure and birth order, we can do a lot with these facts
    • All of this can be made algorithmic, repeatable, auditable
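
One way such facts become algorithmic is interval intersection. The sketch below is only an illustration under stated assumptions: it takes the 20 to 45 year fertility bounds from the slide at face value, works at whole-year resolution (so it ignores gestation), and uses hypothetical function names and birth years.

```python
FERTILE_MIN, FERTILE_MAX = 20, 45  # bounds as stated on the slide


def mother_birth_year_bounds(child_birth_years, reported=None):
    """Intersect the birth-year intervals implied by each child.

    child_birth_years: birth years of the mother's known children.
    reported: optional (low, high) range from a record or interview.
    Returns (low, high), or None if the constraints contradict each other.
    """
    low = max(year - FERTILE_MAX for year in child_birth_years)
    high = min(year - FERTILE_MIN for year in child_birth_years)
    if reported is not None:
        low = max(low, reported[0])
        high = min(high, reported[1])
    if low > high:
        return None  # at least one record must be wrong
    return (low, high)


# Hypothetical family: children born in 1960 and 1975.
print(mother_birth_year_bounds([1960, 1975]))
# The same family with a vague reported birth year range narrows further.
print(mother_birth_year_bounds([1960, 1975], reported=(1920, 1935)))
# Children born 30 years apart cannot both fit the assumed window.
print(mother_birth_year_bounds([1950, 1980]))
```

A `None` result does not say which record is wrong, only that the set of records and the assumed bounds cannot all hold, which is the kind of check that can be rerun and audited.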
  38. Hitting the Target

    • Basic problem: Sample is not the target
    • Post-stratification & Transport: Transparent, principled methods for extrapolating from sample to population
    • Post-strat requires a causal model of the reasons the sample differs from the population
    • NO CAUSES IN; NO DESCRIPTION OUT
  39. Cartoon example A B C D Four age groups:

  40. Cartoon example A B C D Four age groups: Proportions

    of sample: A B C D
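
The cartoon can be worked through with made-up numbers. This is a sketch of plain post-stratification only, without the multilevel regression step. All shares and group means are hypothetical, and it assumes the within-group means are estimated without bias.

```python
# Hypothetical numbers for the four age groups A, B, C, D.
population_share = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
sample_share = {"A": 0.10, "B": 0.20, "C": 0.30, "D": 0.40}
group_mean = {"A": 0.80, "B": 0.60, "C": 0.40, "D": 0.20}  # attitude by group


def weighted_mean(shares, means):
    """Average the group means, weighting each group by its share."""
    return sum(shares[group] * means[group] for group in shares)


raw_estimate = weighted_mean(sample_share, group_mean)
poststratified_estimate = weighted_mean(population_share, group_mean)

print("raw sample mean:", round(raw_estimate, 3))
print("post-stratified:", round(poststratified_estimate, 3))
```

The raw mean weights groups by how often they appear in the sample, while the post-stratified estimate reweights the same group means by their share of the target population.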
  41. Multi-level regression & post-stratification (MRP)

  42. Selection nodes

    Diagram: X (Age), Y (Attitude)

  43. Selection nodes

    Diagram: X (Age), Y (Attitude), S (Selection by Age)
    S: “Sample differs because of differences in what I point to”
  44. Selection ubiquitous

    • Many sources of data are already filtered by selection effects
    • Crime statistics
    • Employment & job performance
    • Health
    • Preservation & curation
  45. Diagram: X (Age), Y (Attitude), S (Selection by Age)

    “Young people don’t answer their phones”
  46. X Y S “Anarchists don’t answer their phones”

    X Y S “Young people don’t answer their phones”
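
The difference between the two diagrams can be checked with exact arithmetic. This is a sketch under assumed numbers: two age groups, a binary attitude standing in for "anarchist", and hypothetical response probabilities. It computes what post-stratifying on age would return in expectation under each selection mechanism.

```python
from fractions import Fraction as F

# Hypothetical population: share of each age group and P(attitude = 1 | age).
pop_share = {"young": F(1, 2), "old": F(1, 2)}
p_attitude = {"young": F(4, 5), "old": F(1, 5)}


def poststratified(response_prob):
    """Expected estimate after post-stratifying on age.

    response_prob(age, attitude) is the chance of answering the phone.
    """
    estimate = F(0)
    for age, share in pop_share.items():
        yes = p_attitude[age] * response_prob(age, 1)
        no = (1 - p_attitude[age]) * response_prob(age, 0)
        estimate += share * yes / (yes + no)  # attitude share among responders
    return estimate


def selection_by_age(age, attitude):
    """Young people don't answer their phones (selection on X)."""
    return F(1, 5) if age == "young" else F(4, 5)


def selection_by_attitude(age, attitude):
    """People with attitude = 1 don't answer their phones (selection on Y)."""
    return F(1, 5) if attitude == 1 else F(4, 5)


truth = sum(pop_share[age] * p_attitude[age] for age in pop_share)
print("truth:", truth)
print("selection by age:", poststratified(selection_by_age))
print("selection by attitude:", poststratified(selection_by_attitude))
```

With these numbers, reweighting by age recovers the population value when age drives selection, and stays well below it when the attitude itself drives selection.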
  47. Diagram: X (Age), Y (Attitude), S (Selection by Age), X̣ (Reported Age)

    “Young people don’t answer their phones and misreport their age”
  48. A Causal Framework for Cross-Cultural Generalizability

    Dominik Deffner1*, Julia M. Rohrer2, Richard McElreath1
    1 Department of Human Behavior, Ecology and Culture, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
    2 Department of Psychology, Leipzig University, Leipzig, Germany
    * Corresponding author, dominik_deffner@eva.mpg.de

    Behavioral researchers increasingly recognize the need for samples from more diverse populations that capture the breadth of human experience. Current attempts to establish generalizability across populations focus on threats to validity and the accumulation of large cross-cultural datasets. For continued progress, more diverse data and lists of things that can go wrong are not sufficient; we also need a framework that lets us determine which inferences can be drawn and how to make informative cross-cultural comparisons. We introduce a formal generative causal modeling framework and outline simple graphical criteria to derive analytic strategies and implied generalization from causal diagrams. Using both simulated and real data, we demonstrate methods to project and compare estimates across populations. We conclude with a discussion of how a formal framework for generalizability can assist researchers in designing maximally informative cross-cultural studies and thus provides a more solid foundation for cumulative and generalizable behavioral research.

    Keywords: Cross-cultural research, generalizability, WEIRD samples problem, causal inference, poststratification.

    1. Introduction
    The behavioral and social sciences have been criticized for relying almost exclusively on WEIRD samples in which most participants are Western, educated, and from industrialized, rich, and democratic countries (Henrich et al. 2010, Apicella et al. 2020, Henrich 2020). Research has established substantial cross-cultural variation in key psychological domains, such as thinking styles (e.g. Masuda and Nisbett 2001, Nisbett and Miyamoto …

    Coming next month to a preprint server near you
  49. Many Qs are really post-strat Qs

    • Justified descriptions require causal information and post-stratification
    • Other tasks are structurally similar
    • Causal effects also require post-stratification, e.g. vaccines
    • Proper time trends account for changes in measurement/population, post-strat correctly for each time period
    • Comparison is post-stratification from one population to another
  50. Honest Methods for Modest Questions
  51. Simple 4-step plan for honest digital scholarship

    • (1) What are we trying to describe?
    • (2) What are the ideal data for doing so?
    • (3) What data do we actually have?
    • (4) What causes the differences between (2) and (3)?
    • (5) [optional] Is there a statistical way to use (3) + (4) to accomplish (1)?