Causal Thinking for Descriptive Research

Abstract: Causal inference is hard, and everyone knows it. It is less
recognized that descriptive and comparative scholarship also rely upon
causal inference. How data are sampled and curated influences how we
should process the data, in order to accurately describe or compare
the people, times, and places of interest. I'll present some examples
to illustrate the problems that ignoring causal structure can create,
along with some solutions.

Richard McElreath

September 21, 2021

Transcript

  1. Causal Thinking for Descriptive Research

    Richard McElreath, MPI-EVA Leipzig
    [Background word cloud, partly legible: Biography, Surveys, Almanacs, Collection, Folktales, Ethnography]
  2. No causes in; nothing much out

    Tests of non-causal laws require background information too and much of it will necessarily be causal. There is a slogan in the book about attempts to infer causes purely from probabilities: “No causes in; no causes out.” But the stronger conclusion is intended as well: “No causes in; nothing much out.”
    Nancy Cartwright 1989, Nature’s Capacities and Their Measurement
  3. Causal thinking for everyone

    • The principles that license causal inference in experiments are fundamentally the same as those that license description
    • Causal inference depends upon trustworthy description
    • Description depends upon causal information
    • Big data alone just means bigger bias
  4. Where is your god now?

    [Figure, under the running header "Missing Data and Other Opportunities": population size plotted against time (year), with points marked as moralizing gods present, absent, or unknown. Caption: Missing values in the Moralizing_gods data. The blue points, both open and filled, are observed values for the presence of beliefs about moralizing gods. The x symbols are unknowns, the missing values.]
    McElreath 2020, Statistical Rethinking, page 514
  5. Where is your god now?

    • Missing data are not the real problem.
    • The cause of the missingness is the problem.
    • Data for Hawaii: Captain James Cook (1728–1779)
  6. Missing methods

    • A circus of methods for handling missing values
    • drop all cases with missing values
    • replace missing values with means
    • “multiple imputation”
    • Bayesian imputation
    • prayer
    MISSING DATA ARE A CAUSAL PROBLEM
  7. Missing is a mechanism

    [DAG with nodes Y (true), D (mechanism: a dog), and Ỵ (observed, with missing values)]
  8. Less benign missingness

    [DAG with nodes X, D, Y, and Ỵ; scatterplot of Y against X]
    If we can model X -> Y, can recover p(Y)
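A minimal simulation sketch of this claim, not code from the talk: when the missingness D depends only on X, dropping incomplete cases biases the mean of Y, while a model of X -> Y fitted to the complete cases can fill in the missing Y. The linear effect of 0.8, the noise level, and the rule "Y is missing whenever X > 0.5" are illustrative assumptions. The sketch checks only the mean of Y; recovering the full p(Y) would also require adding residual variation back to the filled-in values.

```python
import random
import statistics

random.seed(1)

# Assumed generative model (illustrative, not from the talk): X -> Y.
n = 20000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [0.8 * x + random.gauss(0.0, 0.6) for x in xs]

# Assumed missingness mechanism: X -> D. Y is unobserved whenever X > 0.5.
is_observed = [x <= 0.5 for x in xs]


def fit_line(x_vals, y_vals):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = statistics.fmean(x_vals)
    mean_y = statistics.fmean(y_vals)
    sxx = sum((x - mean_x) ** 2 for x in x_vals)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_vals, y_vals))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


obs_x = [x for x, seen in zip(xs, is_observed) if seen]
obs_y = [y for y, seen in zip(ys, is_observed) if seen]

true_mean = statistics.fmean(ys)  # known only because this is a simulation
complete_case_mean = statistics.fmean(obs_y)

# Fit X -> Y on the complete cases, then predict Y where it is missing.
intercept, slope = fit_line(obs_x, obs_y)
filled = [
    y if seen else intercept + slope * x
    for x, y, seen in zip(xs, ys, is_observed)
]
model_based_mean = statistics.fmean(filled)

print(f"true mean of Y:          {true_mean:.3f}")
print(f"complete-case mean of Y: {complete_case_mean:.3f}")
print(f"model-based mean of Y:   {model_based_mean:.3f}")
```

The complete-case mean sits well below the true mean because the cases with large X, and hence large Y, are the ones that went missing; the model-based mean lands close to the truth because selection on X leaves the X -> Y relationship intact.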
  9. Utterly evil missingness

    [DAG with nodes X, D, Y, and Ỵ; scatterplot of Y against X]
    If we can model Y -> D, there is some hope
  10. Where is your god now?

    [Same figure as slide 4: population size against time (year), with moralizing gods present, absent, or unknown.]
    McElreath 2020, Statistical Rethinking, page 514
  11. [Figure as on slide 4: population size against time (year), with moralizing gods present, absent, or unknown.]

    Cross-tabulation of gods by literacy:

              literacy
    gods        0    1  <NA>
      0        16    1     0
      1         9  310     0
      <NA>    442   86     0

    [Cropped book text, legible fragments only: "... of ... missing values are for nonliter... of any kind in most cases. And as you can see in ... with smaller polities. This is possibly because ... to be literate. These data are structured by the ... izing gods and missing values. Beneath that ... moralizing gods could be common or rare, depe..."]
  12. [Repeat of slide 11: the same figure, gods-by-literacy table, and cropped book text.]
  13. [Repeat of slide 11: the same figure, gods-by-literacy table, and cropped book text.]
  14. [DAG with nodes G̣, L, D, G, and P]

    Only large P have L = 1
    G observed only when L = 1
    No information about the association between P and G when L = 0
    No way to reconstruct the distribution of G

    R code (cropped on the slide): with( Moralizing_gods , table( gods=moralizin

              literacy
    gods        0    1  <NA>
      0        16    1     0
      1         9  310     0
      <NA>    442   86     0

    [Cropped book text; only fragments are legible, such as "with smaller polities" and "to be literate".]
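One way to see why the distribution of G cannot be reconstructed is to ask how far the answer could move if the unknown cases were all 0 or all 1. The sketch below computes such worst-case bounds from the gods-by-literacy counts on the slide. The bounding approach is an illustration added here, not a method from the talk, and the function name is hypothetical; the counts assume the table reads with gods in rows (0, 1, NA) and literacy in columns (0, 1).

```python
# Counts transcribed from the slide's table: (gods, literacy) -> number of cases.
# None stands for a missing value of gods (<NA>).
counts = {
    (0, 0): 16, (0, 1): 1,
    (1, 0): 9, (1, 1): 310,
    (None, 0): 442, (None, 1): 86,
}


def bounds_on_share_with_gods(counts, literacy=None):
    """Return (lower, complete_case, upper) for the share of cases with gods = 1.

    If literacy is 0 or 1, only cases with that literacy value are used.
    """
    rows = {k: v for k, v in counts.items() if literacy is None or k[1] == literacy}
    present = sum(v for (g, _), v in rows.items() if g == 1)
    absent = sum(v for (g, _), v in rows.items() if g == 0)
    unknown = sum(v for (g, _), v in rows.items() if g is None)
    total = present + absent + unknown
    lower = present / total              # as if every unknown case were gods = 0
    upper = (present + unknown) / total  # as if every unknown case were gods = 1
    complete_case = present / (present + absent)
    return lower, complete_case, upper


for label, lit in (("all cases", None), ("literacy = 0", 0), ("literacy = 1", 1)):
    lower, cc, upper = bounds_on_share_with_gods(counts, lit)
    print(f"{label:13s} lower={lower:.3f} complete-case={cc:.3f} upper={upper:.3f}")
```

Among the non-literate cases the bounds span almost the whole unit interval, so the complete-case figure there carries very little information without further causal assumptions about why values are missing.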
  15. Missed opportunities

    • Many sources of data have partial missingness
    • Essential to explore causes of the missing values
    • Sometimes missing values are benign
    • Sometimes missing values preclude description (without more causal assumptions)
    • NO CAUSES IN; NO DESCRIPTION OUT
  16. Nick Blurton-Jones (right) interviews a Hadza great-grandmother (second from left)

    and her younger kinswoman (second from right) in 1999. PHOTO: ANNETTE WAGNER FROM FILMING OF TINDIGA—THOSE WHO ARE RUNNING AND HADZABE MEANS: US PEOPLE
  17. [Figure: Modal ages of adult death. Note: Frequency distribution of ages at death, f(x), for individuals over age … shows strong peaks for hunter-gatherers, forager-horticulturalists, acculturated hunter-gatherers, Sweden …, and the United States. Axes: age and f(x). Numeric values were lost in extraction.]

    Gurven & Kaplan 2008, Longevity Among Hunter-Gatherers
  18. Demography not so simple

    • Most humans did not and do not know their birthdays
    • Many records are estimates or simply falsified
    • The direction and magnitude of error change with age
    • Analogies in many other kinds of data
  19. Figure 2. Number and per capita rate of attaining supercentenarian status across US states, relative …

    Newman 2020, Supercentenarian and remarkable age records
  20. Demography not so simple

    • Most humans do not know their birthdays
    • Many records are estimates or simply falsified
    • The direction and magnitude of error change with age
    • CONCLUSION: Most census records do not describe the target population
    • But there is hope! Can use causal knowledge to refine age/fertility estimates, get better descriptions.
  21. Causal information about age

    • Every human has exactly one biological father and one biological mother
    • Human gestation is about 270 ± 15 days
    • Female fertility tightly bounded between 20 and 45 years in most populations
    • If we know family structure and birth order, we can do a lot with these facts
    • All of this can be made algorithmic, repeatable, auditable
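As a hedged illustration of what "algorithmic, repeatable" could look like, the sketch below narrows uncertain birth years by propagating the fertility bounds stated on the slide along mother-child links. It is a toy under stated assumptions, not the author's procedure: the function name and the example family are hypothetical, ages are handled at whole-year resolution (so gestation is ignored), and only mother-child links are used.

```python
MIN_MOTHER_AGE = 20  # fertility bounds as stated on the slide
MAX_MOTHER_AGE = 45


def tighten_birth_years(bounds, mother_of):
    """Narrow (earliest, latest) birth-year intervals using mother-child links.

    bounds:    dict person -> (earliest, latest) birth year, as integers
    mother_of: dict child -> mother
    Returns a new dict; raises ValueError if the records contradict each other.
    """
    bounds = dict(bounds)
    changed = True
    while changed:  # repeat until no interval can be narrowed further
        changed = False
        for child, mother in mother_of.items():
            c_lo, c_hi = bounds[child]
            m_lo, m_hi = bounds[mother]
            # The mother was born 20 to 45 years before the child.
            m_new = (max(m_lo, c_lo - MAX_MOTHER_AGE),
                     min(m_hi, c_hi - MIN_MOTHER_AGE))
            # The child was born 20 to 45 years after the mother.
            c_new = (max(c_lo, m_new[0] + MIN_MOTHER_AGE),
                     min(c_hi, m_new[1] + MAX_MOTHER_AGE))
            for person, new in ((mother, m_new), (child, c_new)):
                if new[0] > new[1]:
                    raise ValueError(f"records for {person} are inconsistent")
                if new != bounds[person]:
                    bounds[person] = new
                    changed = True
    return bounds


# Hypothetical family: the child's birth year is documented, the others are vague.
reported = {
    "child": (1990, 1990),
    "mother": (1900, 2000),
    "grandmother": (1890, 1915),
}
links = {"child": "mother", "mother": "grandmother"}

tightened = tighten_birth_years(reported, links)
for person, (lo, hi) in tightened.items():
    print(f"{person}: born between {lo} and {hi}")
```

In this example one documented birth year plus two links shrinks the mother's interval from a century to 1945-1960, and every step can be rerun and audited.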
  22. Hitting the Target

    • Basic problem: Sample is not the target
    • Post-stratification & Transport: Transparent, principled methods for extrapolating from sample to population
    • Post-strat requires causal model of reasons sample differs from population
    • NO CAUSES IN; NO DESCRIPTION OUT
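A minimal sketch of post-stratification, assuming the sample differs from the target population only through age, so that within each age group the sampled attitudes are representative. The age groups, population shares, and responses below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def poststratify(sample, population_share):
    """Reweight stratum means from the sample by known population shares.

    sample:           iterable of (stratum, value) pairs
    population_share: dict stratum -> share of the target population (sums to 1)
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for stratum, value in sample:
        totals[stratum] += value
        sizes[stratum] += 1
    estimate = 0.0
    for stratum, share in population_share.items():
        if sizes[stratum] == 0:
            raise ValueError(f"no sampled cases in stratum {stratum!r}")
        estimate += share * totals[stratum] / sizes[stratum]
    return estimate


# Invented example: attitude is 1 (agree) or 0 (disagree).
# The young are under-sampled relative to their share of the population.
population_share = {"young": 0.40, "middle": 0.35, "old": 0.25}
sample = (
    [("young", 1)] * 8 + [("young", 0)] * 2
    + [("middle", 1)] * 15 + [("middle", 0)] * 15
    + [("old", 1)] * 12 + [("old", 0)] * 48
)

naive = sum(value for _, value in sample) / len(sample)
adjusted = poststratify(sample, population_share)
print(f"naive sample mean:    {naive:.3f}")
print(f"post-stratified mean: {adjusted:.3f}")
```

The reweighting is only as good as the causal assumption behind it: if selection also depended on attitude itself, the within-group means would be biased and no reweighting by age would repair them.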
  23. Selection nodes

    [DAG with nodes X (Age), Y (Attitude), and S (Selection by Age)]
    S: “Sample differs because of differences in what I point to”
  24. Selection ubiquitous

    • Many sources of data are already filtered by selection effects
    • Crime statistics
    • Employment & job performance
    • Health
    • Preservation & curation
  25. [Two DAGs, each with nodes X, Y, and a selection node S]

    First DAG: “Anarchists don’t answer their phones”
    Second DAG: “Young people don’t answer their phones”
  26. [DAG with nodes X (Age), Y (Attitude), S (Selection by Age), and X̣ (Reported Age)]

    “Young people don’t answer their phones and misreport their age”
  27. [Screenshot of a manuscript first page]

    A Causal Framework for Cross-Cultural Generalizability
    Dominik Deffner (1, corresponding author), Julia M. Rohrer (2), Richard McElreath (1)
    (1) Department of Human Behavior, Ecology and Culture, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
    (2) Department of Psychology, Leipzig University, Leipzig, Germany
    Corresponding author: dominik_deffner@eva.mpg.de

    Behavioral researchers increasingly recognize the need for samples from more diverse populations that capture the breadth of human experience. Current attempts to establish generalizability across populations focus on threats to validity and the accumulation of large cross-cultural datasets. For continued progress, more diverse data and lists of things that can go wrong are not sufficient; we also need a framework that lets us determine which inferences can be drawn and how to make informative cross-cultural comparisons. We introduce a formal generative causal modeling framework and outline simple graphical criteria to derive analytic strategies and implied generalization from causal diagrams. Using both simulated and real data, we demonstrate methods to project and compare estimates across populations. We conclude with a discussion of how a formal framework for generalizability can assist researchers in designing maximally informative cross-cultural studies and thus provides a more solid foundation for cumulative and generalizable behavioral research.

    Keywords: Cross-cultural research, generalizability, WEIRD samples problem, causal inference, poststratification.

    1. Introduction
    The behavioral and social sciences have been criticized for relying almost exclusively on WEIRD samples in which most participants are Western, educated, and from industrialized, rich, and democratic countries (Henrich et al. 2010, Apicella et al. 2020, Henrich 2020). Research has established substantial cross-cultural variation in key psychological domains, such as thinking styles (e.g. Masuda and Nisbett 2001, Nisbett and Miyamoto …

    Coming next month to a preprint server near you
  28. Many Qs are really post-strat Qs

    • Justified descriptions require causal information and post-stratification
    • Other tasks are structurally similar
    • Causal effects also require post-stratification, e.g. vaccines
    • Proper time trends account for changes in measurement/population, post-strat correctly for each time period
    • Comparison is post-stratification from one population to another
  29. Simple 4-step plan for honest digital scholarship

    • (1) What are we trying to describe?
    • (2) What is the ideal data for doing so?
    • (3) What data do we actually have?
    • (4) What causes the differences between (2) and (3)?
    • (5) [optional] Is there a statistical way to use (3) + (4) to accomplish (1)?