
Causal Inference 2022 Week 2

Will Lowe
February 28, 2022

Transcript

  1. CAUSAL INFERENCE: Last week, only more so.
     Will Lowe, Data Science Lab, Hertie School. 2022-02-15

  2. PLAN
     → The connection between potential outcomes and graphs
     → Laziness, impatience and hubris
     → Causal inference and statistics
     → Big data (and other small data stacked in a raincoat)
     → Faithfulness
     “When we understand that brain, we’ll have figured out neuroscience”
     Not Gen. Stanley McChrystal

  3. GRAPHS AND INTERVENTIONS (LAST WEEK)
     Does encouraging children to watch Sesame Street improve reading?
     → Encouraged: E ∈ {0, 1}
     → Watched: W ∈ {0, 1}
     → Letter recognition score: L (continuous)
     → Previous score: P (continuous)
     → Grade G, Likes children’s TV T, ...
     [DAG over E, W, L, P, G, T]
     The causal effect of W is what happens to L if you remove W’s inbound connections and change its value
     → intervention on W makes a new graph
     → in this graph (L(0), L(1)) ⊥⊥ W
     [DAG over E, W, L, P, G, T with the edges into W removed]
     How to make sense of this?

  4. GRAPHS AND POTENTIAL OUTCOMES
     Let’s simplify the second graph and include case-specific noise: (єW, єG, єL)i is a vector of idiosyncratic features of each case, drawn from a population of student TV-watching events.
     [DAG: W → L and G → L, with noise terms єW → W, єG → G, єL → L]
     If watching affects letter recognition, then L ⊥̸⊥ W, because L is a function of W. Specifically, this function:
     L = I[W = 0] L(0) + I[W = 1] L(1)
     So how can it also be true that (L(0), L(1)) ⊥⊥ W?

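     A minimal simulation of this setup, with all functional forms invented: the potential outcomes (L(0), L(1)) are generated before W, W is unconfounded as in the graph above, and L is assembled by the consistency equation.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        G = rng.normal(size=n)
        e_L = rng.normal(size=n)

        # Potential outcomes exist before W is realized
        L0 = 0.5 * G + e_L            # letter recognition if W = 0
        L1 = L0 + 1.0                 # ... if W = 1 (constant unit effect, assumed)

        W = rng.integers(0, 2, n)     # W is unconfounded here, as in the graph
        L = np.where(W == 1, L1, L0)  # consistency: L = I[W=0] L(0) + I[W=1] L(1)

        # (L(0), L(1)) ⊥⊥ W: each potential outcome looks the same in both arms...
        print(L0[W == 1].mean() - L0[W == 0].mean())  # ≈ 0
        print(L1[W == 1].mean() - L1[W == 0].mean())  # ≈ 0
        # ...and yet L ⊥̸⊥ W, because L is a function of W
        print(L[W == 1].mean() - L[W == 0].mean())    # ≈ 1
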
  5. GRAPHS AND SWIGS
     The regular graph defines the no intervention world. The manipulated graph (no arrows into W) defines an intervention world (Richardson & Robins, 2013)
     → Generate W as usual
     → Set W to 1 for everyone, and generate L(1)
     [SWIG: єW → W; the node is split, and L(1) is generated from G and єL under W set to 1]
     (same argument for any other value of W)

  6. GRAPHS AND POTENTIAL OUTCOMES
     [SWIG: W generated as usual; L(1) generated from G and єL under W set to 1]
     → G, W, єG, єW, єL are in the no intervention world
     → L(1) is in the intervention world
     → Consistency requires that P(L | W = 1) = P(L(1) | W = 1)
     Actual W is disconnected from L(1), so L(1) ⊥⊥ W.
     But what if G also causes W?
     [SWIG: as above, plus G → W]
     Actual W is still connected to L(1), via G, so L(1) ⊥̸⊥ W [sad trombone]. But G is observed! [happy cornet]

  7. CONFOUNDING, OBSERVED AND UNOBSERVED
     [SWIG: G → W and G → L(1), with W set to 1 and noise terms єW, єG, єL]
     However,
     → G is a common cause of W and L (hence of L(1))
     → Conditioning on common causes makes variables independent: L(1) ⊥⊥ W | G [Cheery baroque cornet theme]
     This happy solution is not available when we do not observe the confounder, e.g.
     → T: Fondness for children’s TV
     [SWIG: as above, but with unobserved T as the common cause]
     [non-parametric sad trombone]
     By pleasing accident an instrument may help.

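     A sketch of why observing the confounder rescues us, reusing the toy model above but now letting W depend on G. The decile stratification standing in for ‘conditioning on G’ is my own simplification.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        G = rng.normal(size=n)
        L0 = 0.5 * G + rng.normal(size=n)
        L1 = L0 + 1.0                                  # true effect = 1 (assumed)
        W = (0.8 * G + rng.normal(size=n) > 0).astype(int)
        L = np.where(W == 1, L1, L0)

        print(L[W == 1].mean() - L[W == 0].mean())     # ≈ 1.5: effect + confounding

        # Stratify on (a coarsening of) G, compare within strata, then average
        strata = np.digitize(G, np.quantile(G, np.linspace(0.1, 0.9, 9)))
        effects, weights = [], []
        for s in range(10):
            m = strata == s
            effects.append(L[m & (W == 1)].mean() - L[m & (W == 0)].mean())
            weights.append(m.mean())
        print(np.average(effects, weights=weights))    # ≈ 1.0, up to coarsening error
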
  8. MEDIATION
     How might W affect L?
     → Memory improvements M (unlikely, but mnemonic!)
     [DAG: W → M → L and W → L, with noise terms єW, єM, єL]
     It’s still the case that L(1) ⊥⊥ W (phew)
     Notice, the effect of W on L has two ‘paths’: W → M → L and W → L.
     What happens if we condition on M? If all goes amazingly well:
     → Without conditioning on M, we identify the total effect of W
     → Conditioning on M, we identify the direct effect of W: W → L
     → By comparing the two we can figure out the indirect effect of W: W → M → L
     Spoiler: all does not usually go amazingly well
     → See a later week for epic disappointment

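     In the best case (linear effects, randomized W, nothing confounding M and L) the comparison works exactly. A sketch with invented coefficients:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        W = rng.integers(0, 2, n).astype(float)        # randomized
        M = 0.6 * W + rng.normal(size=n)               # W -> M
        L = 0.4 * W + 0.5 * M + rng.normal(size=n)     # W -> L and M -> L

        total = np.polyfit(W, L, 1)[0]                 # L on W alone: 0.4 + 0.6*0.5
        X = np.column_stack([np.ones(n), W, M])        # L on W, controlling for M
        direct = np.linalg.lstsq(X, L, rcond=None)[0][1]
        print(total, direct, total - direct)           # ≈ 0.7, 0.4, 0.3 (indirect)
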
  9. COLLIDER BIAS
     → Note: substantively, this is a bit odd.
     → A real and common example is coming right up
     [DAG: W → C ← L(1), with noise terms єW, єL, єC]
     C is a common effect of W and L (hence of L(1)), so L(1) ⊥⊥ W.
     But condition on the common effect C and L(1) ⊥̸⊥ W | C [sad brass section]
     In fact there is lots more spurious association here – more than is drawn!

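     The mechanism in its simplest form: two independent variables become associated once we condition on their common effect. All numbers invented.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        W = rng.normal(size=n)
        L = rng.normal(size=n)                   # independent of W by construction
        C = W + L + 0.5 * rng.normal(size=n)     # common effect of W and L

        print(np.corrcoef(W, L)[0, 1])           # ≈ 0
        sel = np.abs(C) < 0.5                    # condition on (a band of) C
        print(np.corrcoef(W[sel], L[sel])[0, 1]) # clearly negative: spurious
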
  10. SAMPLE SELECTION IS CONDITIONING TOO
     Research question
     → Is watching more effective for bad letter recognizers?
     Research plan
     → Look at just the bad letter recognizers
     Operationalization
     → define S = 0 if L > τ, and S = 1 otherwise
     → Restrict to cases where S = 1
     → Measure the W-L relationship
     Is this a good plan?
     → Nope, but why?
     [DAG: W → L ← єL, with L → S]
     We could in principle fix our mistake by also conditioning on єL, but we can’t measure that
     → Why would that work in principle?
     We may also be able to rescue this plan as a case-control design (see What If)

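     Because S is a function of L, restricting to S = 1 conditions on a descendant of the outcome, which distorts the W-L comparison even with no confounding at all. A sketch with invented numbers:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        W = rng.integers(0, 2, n)
        L = 1.0 * W + rng.normal(size=n)    # true effect of W on L is 1

        tau = 0.0
        S = L <= tau                        # 'bad letter recognizers'
        print(L[W == 1].mean() - L[W == 0].mean())              # ≈ 1.0
        print(L[S & (W == 1)].mean() - L[S & (W == 0)].mean())  # ≈ 0.27, attenuated
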
  11. NOT ALWAYS BAD
     [DAG over W, L, S, Q, with noise terms єW, єL]
     → Why is this ok? (within limits)
     → What would happen if part of W → L was mediated by M?
     → What would happen if W → L was confounded, e.g. by G?
     → Could we fix that?

  12. GETTING SYSTEMATIC
     Two questions:
     → When can we identify causal effects by conditioning?
     → What should we condition on, if we can?
     One answer:
     → Condition on whatever d-separates W from L
     Task: condition on {Z} that blocks all paths except the one we want
     [DAG over E, W, L, P, G, T from the Sesame Street example]
     → Are W and L d-separated by (T, G)?
     → How about just (G)? or (E, T, G, P)?

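     These checks can be done mechanically, which is exactly what Dagitty automates (next slide). Below is a sketch using networkx; is_d_separator needs networkx 3.3+ (older releases call it d_separated), and the edge list is my reconstruction of the Sesame Street graph, so treat it as illustrative.

        import networkx as nx

        g = nx.DiGraph([
            ("E", "W"),              # encouragement -> watching
            ("T", "W"), ("T", "L"),  # likes children's TV (assumed edges)
            ("G", "W"), ("G", "L"),  # grade (assumed edges)
            ("P", "L"),              # previous score (assumed edge)
            ("W", "L"),              # the effect we want to keep open
        ])

        # Keep W -> L; ask whether every *other* path is blocked: remove the
        # causal edge, then test d-separation in what remains.
        g_bd = g.copy()
        g_bd.remove_edge("W", "L")
        for z in [{"T", "G"}, {"G"}, {"E", "T", "G", "P"}]:
            print(z, nx.is_d_separator(g_bd, {"W"}, {"L"}, z))
        # {T, G}: True; {G}: False (W <- T -> L stays open); {E, T, G, P}: True
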
  13. D-SEPARATION WITHOUT TEARS
     Intuition is essential, but generalized, computationally realized intuition is better
     → Dagitty automates finding d-separating variable sets: http://www.dagitty.net/dags.html
     → Bonus: it enumerates all conditional independencies

  14. CAUSAL INFERENCE AND STATISTICS
     [narrator: they weren’t the same picture]
     Statistics: Using data to learn about a (sometimes implicit) intact generative graph
     → statistical estimators / inference
     → empirical estimands
     → sampling distributions, standard errors, bias, variance, priors & posteriors
     Neural networks, classifiers, random forests, bagging, boosting (Bishop, 2006; Hastie et al., 2009), ‘Bayesian networks’ (Neapolitan, 1990)

  15. CAUSAL INFERENCE AND STATISTICS
     [narrator: they weren’t the same picture]
     Causal inference: Using data to learn about what would happen to a (sometimes implicit) intact generative graph if you manipulated parts of it
     → causal inference
     → causal estimands (‘effects’)
     → experiments, instruments, confounding, bias, mediation, generalization, missing data

  16. TERMINOLOGY: IDENTIFICATION
     Terminology across these domains is a problem. Sometimes it’s clear why the word is the same.
     Statistics: In Y = β0 + X βX γX + є, βX and γX cannot be reduced to a function of observed data (only their product can)
     Causal inference: When E[Y(x)] cannot be reduced to a function of observed data
     → e.g. because of unobserved confounding, colliders

  17. TERMINOLOGY: BIAS
     Statistics: An estimator δ̂ for δ is biased if, for fixed N,
     E_N[δ̂] − δ ≠ 0
     even though it might be consistent, i.e. E_N[δ̂] → δ as N → ∞.
     Bias in asymptotically unbiased estimators like Maximum Likelihood reduces with larger N.
     Importantly, performance, e.g. mean squared error
     MSE = E_N[(δ̂ − δ)²]
     can often be improved by adding bias (it decreases estimator variance).
     Causal inference: An empirical estimand does not identify a causal estimand when
     E[Y | X = x] − E[Y(x)] ≠ 0
     This is a population / mechanism property, so increasing N will not help.
     ‘Bias’: The kinds of bias we run into in policy contexts may involve any combination of
     → statistical bias, e.g. stereotypes
     → causal bias, e.g. institutional prejudice
     → accurate but ‘undesirable’ inferences

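     A sketch of the bias-variance point: shrinking the sample mean toward zero adds bias but cuts variance enough to lower the MSE. The setup and shrinkage factor are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        mu, n, reps = 1.0, 20, 100_000
        x = rng.normal(mu, 2.0, size=(reps, n))

        unbiased = x.mean(axis=1)     # the sample mean
        shrunk = 0.8 * unbiased       # shrink toward zero: biased

        for est in (unbiased, shrunk):
            print(f"bias {est.mean() - mu:+.3f}  MSE {((est - mu) ** 2).mean():.3f}")
        # bias +0.000  MSE 0.200   (unbiased)
        # bias -0.200  MSE 0.168   (biased, but better MSE)
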
  18. FAITHFULNESS
     [DAG: Z → X → Y and Z → Y]
     where
     Z = єZ
     X = γ0 + Z γZ + єX
     Y = β0 + X βX + Z βZ + єY
     But what if βZ = −γZ βX?
     For all parameter combinations like that, which exactly cancel, Z ⊥⊥ Y, despite the presence of a link from Z to Y.
     This is an example of unfaithfulness: an independence relationship in the data not implied by the graph (Zhang & Spirtes, 2008)
     → In theory: This never happens
     → In finite samples: This happens, but only by accident (although a lot of parameter space is nearly unfaithful, Uhler et al., 2013)
     → In practice: More often because we make it happen

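     The cancellation is easy to reproduce. A sketch with invented coefficients; under joint normality the zero correlation below really is independence.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000
        gZ, bX = 1.5, 2.0
        bZ = -gZ * bX                       # the knife-edge cancellation

        Z = rng.normal(size=n)
        X = 0.5 + gZ * Z + rng.normal(size=n)
        Y = 1.0 + bX * X + bZ * Z + rng.normal(size=n)

        print(np.corrcoef(Z, Y)[0, 1])      # ≈ 0, despite the Z -> Y edge
        print(np.corrcoef(Z, X)[0, 1])      # clearly nonzero
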
  19. MENG ON BIG DATA
     So what is Meng’s data quality ρ(R,G), and why might we care?
     [DAG: G → R, with noise terms єG, єR]
     Biased sampling: ρ(R,G) ≠ 0

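     Meng’s point, roughly: for a finite population G_1, ..., G_N with recording indicator R_j, the error of the naive mean of recorded values obeys the identity Ḡ_n − Ḡ_N = ρ(R,G) × sqrt((1 − f)/f) × σ_G, where f = n/N (Meng, 2018). Quality ρ(R,G) multiplies the error, so quantity alone cannot fix it. A sketch that verifies the identity under an invented self-selection model:

        import numpy as np

        rng = np.random.default_rng(1)
        N = 1_000_000
        G = rng.normal(size=N)                          # what we want the mean of
        R = rng.random(N) < 1 / (1 + np.exp(-0.1 * G))  # recording slightly favours high G

        f = R.mean()
        err = G[R].mean() - G.mean()
        rho = np.corrcoef(R, G)[0, 1]                   # data quality
        print(err, rho * np.sqrt((1 - f) / f) * G.std())  # equal: Meng's identity
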
  20. MENG ON BIG DATA
     [Three DAGs for the recording indicator R: simple random sampling (nothing causes R); demographic confounding (D causes both G and R); non-ignorable missingness (G causes R)]
     Mohan and Pearl (2021) is the causal inference story behind the traditional missing data categorizations
     → Missing Completely at Random (MCAR)
     → Missing at Random (MAR)
     → Not Missing at Random (NMAR)

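     A sketch of the three mechanisms and what each does to the naive mean of the recorded values; the functional forms are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000
        D = rng.normal(size=n)                   # demographics, observed
        G = D + rng.normal(size=n)               # quantity of interest; true mean 0

        R_mcar = rng.random(n) < 0.5                    # R depends on nothing
        R_mar = rng.random(n) < 1 / (1 + np.exp(-D))    # R depends on observed D
        R_nmar = rng.random(n) < 1 / (1 + np.exp(-G))   # R depends on G itself

        for name, R in [("MCAR", R_mcar), ("MAR", R_mar), ("NMAR", R_nmar)]:
            print(name, G[R].mean())             # ≈ 0 for MCAR; biased otherwise
        # MAR bias is fixable by conditioning on D (weighting, imputation);
        # NMAR is not, without further assumptions
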
  21. CONSEQUENCES
     Differential bias in the sampling leads to
     → larger measurement errors
     → differential measurement errors
     → bad election predictions
     For data scientists: When can data size offset data quality problems?

  22. FAITHFULNESS
     [DAG: Z → X → Y and Z → Y]
     where
     Z = єZ
     X = γ0 + Z γZ + єX
     Y = β0 + X βX + Z βZ + єY
     But what if βZ = −γZ βX?
     For all parameter combinations like that, which exactly cancel, Z ⊥⊥ Y, despite the presence of a link from Z to Y.
     This is an example of unfaithfulness: an independence relationship in the data not implied by the graph (Zhang & Spirtes, 2008)
     → In theory: This never happens
     → In finite samples: This happens, but only by accident (although a lot of parameter space is nearly unfaithful, Uhler et al., 2013)
     → In practice: More often because we make it happen

  23. PLAN
     → The connection between potential outcomes and graphs
     → Laziness, impatience and hubris
     → Causal inference and statistics
     → Big data (and other small data stacked in a raincoat)
     → Faithfulness
     “When we understand that brain, we’ll have figured out neuroscience”
     Not Gen. Stanley McChrystal

  24. REFERENCES
     Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
     Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Verlag.
     Hünermund, P., Kaminski, J., & Schmitt, C. (2021, May). Causal machine learning and business decision making. SSRN.
     Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726.
     Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116.
     Neapolitan, R. E. (1990). Probabilistic reasoning in expert systems: Theory and algorithms. Wiley.
     Richardson, T. S., & Robins, J. M. (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality.
     Uhler, C., Raskutti, G., Bühlmann, P., & Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2).
     Zhang, J., & Spirtes, P. (2008). Detection of unfaithfulness and robust causal inference. Minds and Machines, 18(2).