
Causal Inference 2022 Week 2

Will Lowe
February 28, 2022

Transcript

  1. CAUSAL INFERENCE: Last week, only more so.
     Will Lowe, Data Science Lab, Hertie School. 2022-02-15

  2. PLAN
     → The connection between potential outcomes and graphs
     → Laziness, impatience and hubris
     → Causal inference and statistics
     → Big data (and other small data stacked in a raincoat)
     → Faithfulness
     “When we understand that brain, we’ll have figured out neuroscience”
     Not Gen. Stanley McChrystal

  3. GRAPHS AND INTERVENTIONS (LAST WEEK)
     Does encouraging children to watch Sesame Street improve reading?
     → Encouraged: E ∈ {0, 1}
     → Watched: W ∈ {0, 1}
     → Letter recognition score: L (continuous)
     → Previous score: P (continuous)
     → Grade G, Likes children’s TV T, ...
     [DAG over E, W, L, P, G, T]
     The causal effect of W is what happens to L if you remove W’s inbound connections and change its value
     → intervention on W makes a new graph
     → in this graph (L(0), L(1)) ⊥⊥ W
     [DAG over E, W, L, P, G, T with the edges into W removed]
     How to make sense of this?

  4. GRAPHS AND POTENTIAL OUTCOMES
     Let’s simplify the second graph and include case-specific noise: (єW, єG, єL)i is a vector of idiosyncratic features of each case, drawn from a population of student TV-watching events.
     [DAG: W → L and G → L, with noise terms єW → W, єG → G, єL → L]
     If watching affects letter recognition, then L ⊥̸⊥ W, because L is a function of W. Specifically, this function:
     L = I[W = 0] L(0) + I[W = 1] L(1)
     So how can it also be true that (L(0), L(1)) ⊥⊥ W?

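     A minimal simulation of this setup, with all functional forms invented: the potential outcomes (L(0), L(1)) are generated before W, W is unconfounded as in the graph above, and L is assembled by the consistency equation.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        G = rng.normal(size=n)
        e_L = rng.normal(size=n)

        # Potential outcomes exist before W is realized
        L0 = 0.5 * G + e_L            # letter recognition if W = 0
        L1 = L0 + 1.0                 # ... if W = 1 (constant unit effect, assumed)

        W = rng.integers(0, 2, n)     # W is unconfounded here, as in the graph
        L = np.where(W == 1, L1, L0)  # consistency: L = I[W=0] L(0) + I[W=1] L(1)

        # (L(0), L(1)) ⊥⊥ W: each potential outcome looks the same in both arms...
        print(L0[W == 1].mean() - L0[W == 0].mean())  # ≈ 0
        print(L1[W == 1].mean() - L1[W == 0].mean())  # ≈ 0
        # ...and yet L ⊥̸⊥ W, because L is a function of W
        print(L[W == 1].mean() - L[W == 0].mean())    # ≈ 1
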
  5. GRAPHS AND SWIGS
     The regular graph defines the no intervention world. The manipulated graph (no arrows into W) defines an intervention world (Richardson & Robins, 2013)
     → Generate W as usual
     → Set W to 1 for everyone, and generate L(1)
     [SWIG: єW → W; the node is split, and L(1) is generated from G and єL under W set to 1]
     (same argument for any other value of W)

  6. GRAPHS AND POTENTIAL OUTCOMES
     [SWIG: W generated as usual; L(1) generated from G and єL under W set to 1]
     → G, W, єG, єW, єL are in the no intervention world
     → L(1) is in the intervention world
     → Consistency requires that P(L | W = 1) = P(L(1) | W = 1)
     Actual W is disconnected from L(1), so L(1) ⊥⊥ W.
     But what if G also causes W?
     [SWIG: as above, plus G → W]
     Actual W is still connected to L(1), via G, so L(1) ⊥̸⊥ W [sad trombone]. But G is observed! [happy cornet]

  7. CONFOUNDING, OBSERVED AND UNOBSERVED
     [SWIG: G → W and G → L(1), with W set to 1 and noise terms єW, єG, єL]
     However,
     → G is a common cause of W and L (hence of L(1))
     → Conditioning on common causes makes variables independent: L(1) ⊥⊥ W | G [Cheery baroque cornet theme]
     This happy solution is not available when we do not observe the confounder, e.g.
     → T: Fondness for children’s TV
     [SWIG: as above, but with unobserved T as the common cause]
     [non-parametric sad trombone]
     By pleasing accident an instrument may help.

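     A sketch of why observing the confounder rescues us, reusing the toy model above but now letting W depend on G. The decile stratification standing in for ‘conditioning on G’ is my own simplification.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        G = rng.normal(size=n)
        L0 = 0.5 * G + rng.normal(size=n)
        L1 = L0 + 1.0                                  # true effect = 1 (assumed)
        W = (0.8 * G + rng.normal(size=n) > 0).astype(int)
        L = np.where(W == 1, L1, L0)

        print(L[W == 1].mean() - L[W == 0].mean())     # ≈ 1.5: effect + confounding

        # Stratify on (a coarsening of) G, compare within strata, then average
        strata = np.digitize(G, np.quantile(G, np.linspace(0.1, 0.9, 9)))
        effects, weights = [], []
        for s in range(10):
            m = strata == s
            effects.append(L[m & (W == 1)].mean() - L[m & (W == 0)].mean())
            weights.append(m.mean())
        print(np.average(effects, weights=weights))    # ≈ 1.0, up to coarsening error
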
  8. MEDIATION
     How might W affect L?
     → Memory improvements M (unlikely, but mnemonic!)
     [DAG: W → M → L and W → L, with noise terms єW, єM, єL]
     It’s still the case that L(1) ⊥⊥ W (phew)
     Notice, the effect of W on L has two ‘paths’: W → M → L and W → L.
     What happens if we condition on M? If all goes amazingly well:
     → Without conditioning on M, we identify the total effect of W
     → Conditioning on M, we identify the direct effect of W: W → L
     → By comparing the two we can figure out the indirect effect of W: W → M → L
     Spoiler: all does not usually go amazingly well
     → See a later week for epic disappointment

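     In the best case (linear effects, randomized W, nothing confounding M and L) the comparison works exactly. A sketch with invented coefficients:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        W = rng.integers(0, 2, n).astype(float)        # randomized
        M = 0.6 * W + rng.normal(size=n)               # W -> M
        L = 0.4 * W + 0.5 * M + rng.normal(size=n)     # W -> L and M -> L

        total = np.polyfit(W, L, 1)[0]                 # L on W alone: 0.4 + 0.6*0.5
        X = np.column_stack([np.ones(n), W, M])        # L on W, controlling for M
        direct = np.linalg.lstsq(X, L, rcond=None)[0][1]
        print(total, direct, total - direct)           # ≈ 0.7, 0.4, 0.3 (indirect)
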
  9. COLLIDER BIAS
     → Note: substantively, this is a bit odd.
     → A real and common example is coming right up
     [DAG: W → C ← L(1), with noise terms єW, єL, єC]
     C is a common effect of W and L (hence of L(1)), so L(1) ⊥⊥ W.
     But condition on the common effect C and L(1) ⊥̸⊥ W | C [sad brass section]
     In fact there is lots more spurious association here – more than is drawn!

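     The mechanism in its simplest form: two independent variables become associated once we condition on their common effect. All numbers invented.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        W = rng.normal(size=n)
        L = rng.normal(size=n)                   # independent of W by construction
        C = W + L + 0.5 * rng.normal(size=n)     # common effect of W and L

        print(np.corrcoef(W, L)[0, 1])           # ≈ 0
        sel = np.abs(C) < 0.5                    # condition on (a band of) C
        print(np.corrcoef(W[sel], L[sel])[0, 1]) # clearly negative: spurious
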
  10. SAMPLE SELECTION IS CONDITIONING TOO
     Research question
     → Is watching more effective for bad letter recognizers?
     Research plan
     → Look at just the bad letter recognizers
     Operationalization
     → define S = 0 if L > τ, and S = 1 otherwise
     → Restrict to cases where S = 1
     → Measure the W-L relationship
     Is this a good plan?
     → Nope, but why?
     [DAG: W → L ← єL, with L → S]
     We could in principle fix our mistake by also conditioning on єL, but we can’t measure that
     → Why would that work in principle?
     We may also be able to rescue this plan as a case-control design (see What If)

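     Because S is a function of L, restricting to S = 1 conditions on a descendant of the outcome, which distorts the W-L comparison even with no confounding at all. A sketch with invented numbers:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        W = rng.integers(0, 2, n)
        L = 1.0 * W + rng.normal(size=n)    # true effect of W on L is 1

        tau = 0.0
        S = L <= tau                        # 'bad letter recognizers'
        print(L[W == 1].mean() - L[W == 0].mean())              # ≈ 1.0
        print(L[S & (W == 1)].mean() - L[S & (W == 0)].mean())  # ≈ 0.27, attenuated
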
  11. NOT ALWAYS BAD
     [DAG over W, L, S, Q, with noise terms єW, єL]
     → Why is this ok? (within limits)
     → What would happen if part of W → L was mediated by M?
     → What would happen if W → L was confounded, e.g. by G?
     → Could we fix that?

  12. GETTING SYSTEMATIC
     Two questions:
     → When can we identify causal effects by conditioning?
     → What should we condition on, if we can?
     One answer:
     → Condition on whatever d-separates W from L
     Task: condition on {Z} that blocks all paths except the one we want
     [DAG over E, W, L, P, G, T from the Sesame Street example]
     → Are W and L d-separated by (T, G)?
     → How about just (G)? or (E, T, G, P)?

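     These checks can be done mechanically, which is exactly what Dagitty automates (next slide). Below is a sketch using networkx; is_d_separator needs networkx 3.3+ (older releases call it d_separated), and the edge list is my reconstruction of the Sesame Street graph, so treat it as illustrative.

        import networkx as nx

        g = nx.DiGraph([
            ("E", "W"),              # encouragement -> watching
            ("T", "W"), ("T", "L"),  # likes children's TV (assumed edges)
            ("G", "W"), ("G", "L"),  # grade (assumed edges)
            ("P", "L"),              # previous score (assumed edge)
            ("W", "L"),              # the effect we want to keep open
        ])

        # Keep W -> L; ask whether every *other* path is blocked: remove the
        # causal edge, then test d-separation in what remains.
        g_bd = g.copy()
        g_bd.remove_edge("W", "L")
        for z in [{"T", "G"}, {"G"}, {"E", "T", "G", "P"}]:
            print(z, nx.is_d_separator(g_bd, {"W"}, {"L"}, z))
        # {T, G}: True; {G}: False (W <- T -> L stays open); {E, T, G, P}: True
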
  13. D-SEPARATION WITHOUT TEARS
     Intuition is essential, but generalized, computationally realized intuition is better
     → Dagitty automates finding d-separating variable sets: http://www.dagitty.net/dags.html
     → Bonus: it enumerates all conditional independencies

  14. CAUSAL INFERENCE AND STATISTICS
     [narrator: they weren’t the same picture]
     Statistics: Using data to learn about a (sometimes implicit) intact generative graph
     → statistical estimators / inference
     → empirical estimands
     → sampling distributions, standard errors, bias, variance, priors & posteriors
     Neural networks, classifiers, random forests, bagging, boosting (Bishop, 2006; Hastie et al., 2009), ‘Bayesian networks’ (Neapolitan, 1990)

  15. CAUSAL INFERENCE AND STATISTICS
     [narrator: they weren’t the same picture]
     Causal inference: Using data to learn about what would happen to a (sometimes implicit) intact generative graph if you manipulated parts of it
     → causal inference
     → causal estimands (‘effects’)
     → experiments, instruments, confounding, bias, mediation, generalization, missing data

  16. TERMINOLOGY: IDENTIFICATION
     Terminology across these domains is a problem. Sometimes it’s clear why the word is the same.
     Statistics: In Y = β0 + X βX γX + є, βX and γX cannot be reduced to a function of observed data (only their product can)
     Causal inference: When E[Y(x)] cannot be reduced to a function of observed data
     → e.g. because of unobserved confounding, colliders

  17. TERMINOLOGY: BIAS
     Statistics: An estimator δ̂ for δ is biased if, for fixed N,
     E_N[δ̂] − δ ≠ 0
     even though it might be consistent, i.e. E_N[δ̂] → δ as N → ∞.
     Bias in asymptotically unbiased estimators like Maximum Likelihood reduces with larger N.
     Importantly, performance, e.g. mean squared error
     MSE = E_N[(δ̂ − δ)²]
     can often be improved by adding bias (it decreases estimator variance).
     Causal inference: An empirical estimand does not identify a causal estimand when
     E[Y | X = x] − E[Y(x)] ≠ 0
     This is a population / mechanism property, so increasing N will not help.
     ‘Bias’: The kinds of bias we run into in policy contexts may involve any combination of
     → statistical bias, e.g. stereotypes
     → causal bias, e.g. institutional prejudice
     → accurate but ‘undesirable’ inferences

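     A sketch of the bias-variance point: shrinking the sample mean toward zero adds bias but cuts variance enough to lower the MSE. The setup and shrinkage factor are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        mu, n, reps = 1.0, 20, 100_000
        x = rng.normal(mu, 2.0, size=(reps, n))

        unbiased = x.mean(axis=1)     # the sample mean
        shrunk = 0.8 * unbiased       # shrink toward zero: biased

        for est in (unbiased, shrunk):
            print(f"bias {est.mean() - mu:+.3f}  MSE {((est - mu) ** 2).mean():.3f}")
        # bias +0.000  MSE 0.200   (unbiased)
        # bias -0.200  MSE 0.168   (biased, but better MSE)
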
  18. FAITHFULNESS
     [DAG: Z → X → Y and Z → Y]
     where
     Z = єZ
     X = γ0 + Z γZ + єX
     Y = β0 + X βX + Z βZ + єY
     But what if βZ = −γZ βX?
     For all parameter combinations like that, which exactly cancel, Z ⊥⊥ Y, despite the presence of a link from Z to Y.
     This is an example of unfaithfulness: an independence relationship in the data not implied by the graph (Zhang & Spirtes, 2008)
     → In theory: This never happens
     → In finite samples: This happens, but only by accident (although a lot of parameter space is nearly unfaithful, Uhler et al., 2013)
     → In practice: More often because we make it happen

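     The cancellation is easy to reproduce. A sketch with invented coefficients; under joint normality the zero correlation below really is independence.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000
        gZ, bX = 1.5, 2.0
        bZ = -gZ * bX                       # the knife-edge cancellation

        Z = rng.normal(size=n)
        X = 0.5 + gZ * Z + rng.normal(size=n)
        Y = 1.0 + bX * X + bZ * Z + rng.normal(size=n)

        print(np.corrcoef(Z, Y)[0, 1])      # ≈ 0, despite the Z -> Y edge
        print(np.corrcoef(Z, X)[0, 1])      # clearly nonzero
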
  19. MENG ON BIG DATA
     So what is Meng’s data quality ρ(R,G), and why might we care?
     [DAG: G → R, with noise terms єG, єR]
     Biased sampling: ρ(R,G) ≠ 0

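     Meng’s point, roughly: for a finite population G_1, ..., G_N with recording indicator R_j, the error of the naive mean of recorded values obeys the identity Ḡ_n − Ḡ_N = ρ(R,G) × sqrt((1 − f)/f) × σ_G, where f = n/N (Meng, 2018). Quality ρ(R,G) multiplies the error, so quantity alone cannot fix it. A sketch that verifies the identity under an invented self-selection model:

        import numpy as np

        rng = np.random.default_rng(1)
        N = 1_000_000
        G = rng.normal(size=N)                          # what we want the mean of
        R = rng.random(N) < 1 / (1 + np.exp(-0.1 * G))  # recording slightly favours high G

        f = R.mean()
        err = G[R].mean() - G.mean()
        rho = np.corrcoef(R, G)[0, 1]                   # data quality
        print(err, rho * np.sqrt((1 - f) / f) * G.std())  # equal: Meng's identity
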
  20. MENG ON BIG DATA
     [Three DAGs for the recording indicator R: simple random sampling (nothing causes R); demographic confounding (D causes both G and R); non-ignorable missingness (G causes R)]
     Mohan and Pearl (2021) is the causal inference story behind the traditional missing data categorizations
     → Missing Completely at Random (MCAR)
     → Missing at Random (MAR)
     → Not Missing at Random (NMAR)

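     A sketch of the three mechanisms and what each does to the naive mean of the recorded values; the functional forms are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1_000_000
        D = rng.normal(size=n)                   # demographics, observed
        G = D + rng.normal(size=n)               # quantity of interest; true mean 0

        R_mcar = rng.random(n) < 0.5                    # R depends on nothing
        R_mar = rng.random(n) < 1 / (1 + np.exp(-D))    # R depends on observed D
        R_nmar = rng.random(n) < 1 / (1 + np.exp(-G))   # R depends on G itself

        for name, R in [("MCAR", R_mcar), ("MAR", R_mar), ("NMAR", R_nmar)]:
            print(name, G[R].mean())             # ≈ 0 for MCAR; biased otherwise
        # MAR bias is fixable by conditioning on D (weighting, imputation);
        # NMAR is not, without further assumptions
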
  21. CONSEQUENCES
     Differential bias in the sampling leads to
     → larger measurement errors
     → differential measurement errors
     → bad election predictions
     For data scientists: When can data size offset data quality problems?

  22. FAITHFULNESS
     [DAG: Z → X → Y and Z → Y]
     where
     Z = єZ
     X = γ0 + Z γZ + єX
     Y = β0 + X βX + Z βZ + єY
     But what if βZ = −γZ βX?
     For all parameter combinations like that, which exactly cancel, Z ⊥⊥ Y, despite the presence of a link from Z to Y.
     This is an example of unfaithfulness: an independence relationship in the data not implied by the graph (Zhang & Spirtes, 2008)
     → In theory: This never happens
     → In finite samples: This happens, but only by accident (although a lot of parameter space is nearly unfaithful, Uhler et al., 2013)
     → In practice: More often because we make it happen

  23. PLAN
     → The connection between potential outcomes and graphs
     → Laziness, impatience and hubris
     → Causal inference and statistics
     → Big data (and other small data stacked in a raincoat)
     → Faithfulness
     “When we understand that brain, we’ll have figured out neuroscience”
     Not Gen. Stanley McChrystal

  24. REFERENCES
     Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
     Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Verlag.
     Hünermund, P., Kaminski, J., & Schmitt, C. (2021, May). Causal machine learning and business decision making. SSRN.
     Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726.
     Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116.
     Neapolitan, R. E. (1990). Probabilistic reasoning in expert systems: Theory and algorithms. Wiley.
     Richardson, T. S., & Robins, J. M. (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality.
     Uhler, C., Raskutti, G., Bühlmann, P., & Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2).
     Zhang, J., & Spirtes, P. (2008). Detection of unfaithfulness and robust causal inference. Minds and Machines, 18(2).