Dealing with Separation in Logistic Regression Models

Carlisle Rainey
December 04, 2014

Slides for a paper available at http://www.carlislerainey.com/papers/separation.pdf

Transcript

  1. Dealing with Separation in Logistic Regression Models

    Carlisle Rainey, Assistant Professor, University at Buffalo, SUNY
    rcrainey@buffalo.edu
    Paper, data, and code at crain.co/research
  2. Dealing with Separation in Logistic Regression Models

  3. The prior matters a lot, so choose a good one.

    43 million times larger
  4. The prior matters a lot, so choose a good one.

    1. in practice 2. in theory 3. concepts 4. software
  5. The Prior Matters in Practice

  6.–8. (image-only slides)
  9. 2 million

  10. 3,000

  11. 100%

  12. 90%

  13. “To expand this program is not unlike adding a thousand

    people to the Titanic.” — July 2012
  14. None
  15. politics vs. need

  16. “Obamacare is going to be horrible for patients. It’s going

    to be horrible for taxpayers. It’s probably the biggest job killer ever.” — October 2010
  17. “Obamacare is going to be horrible for patients. It’s going

    to be horrible for taxpayers. It’s probably the biggest job killer ever.” — October 2010 “While the federal government is committed to paying 100 percent of the cost, I cannot, in good conscience, deny Floridians that need it access to healthcare.” — February 2013
  18. In the tug-of-war between politics and need, which one wins?

  19. Variable                Coefficient   Confidence Interval

    Democratic Governor        -20.35       [-6,340.06; 6,299.36]
    % Uninsured (Std.)           0.92       [-3.46; 5.30]
    % Favorable to ACA           0.01       [-0.17; 0.18]
    GOP Legislature              2.43       [-0.47; 5.33]
    Fiscal Health                0.00       [-0.02; 0.02]
    Medicaid Multiplier         -0.32       [-2.45; 1.80]
    % Non-white                  0.05       [-0.12; 0.21]
    % Metropolitan              -0.08       [-0.17; 0.02]
    Constant                     2.58       [-7.02; 12.18]
  20.             Doesn't Oppose   Opposes
    Republican          14            16
    Democrat            20             0
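    A minimal R sketch (not part of the deck) of how the separation in this table surfaces in practice: glm() fits the model but, because Democrats never oppose, it reports an enormous coefficient and standard error for the party variable and warns that fitted probabilities of 0 or 1 occurred. Variable names are illustrative.

    # recreate the 2 x 2 table: Republicans split 16/14, Democrats never oppose
    oppose   <- c(rep(1, 16), rep(0, 14), rep(0, 20))
    democrat <- c(rep(0, 30), rep(1, 20))
    table(democrat, oppose)

    # maximum likelihood estimation: note the huge estimate and SE for democrat
    fit <- glm(oppose ~ democrat, family = binomial)
    summary(fit)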

  21. (image-only slide)
  22. Variable                Coefficient   Confidence Interval

    Democratic Governor        -26.35       [-126,979.03; 126,926.33]
    % Uninsured (Std.)           0.92       [-3.46; 5.30]
    % Favorable to ACA           0.01       [-0.17; 0.18]
    GOP Legislature              2.43       [-0.47; 5.33]
    Fiscal Health                0.00       [-0.02; 0.02]
    Medicaid Multiplier         -0.32       [-2.45; 1.80]
    % Non-white                  0.05       [-0.12; 0.21]
    % Metropolitan              -0.08       [-0.17; 0.02]
    Constant                     2.58       [-7.02; 12.18]
  23. (the table from slide 22 again, with the Democratic Governor estimate and interval annotated "unreasonable" and "useless") This is a failure of maximum likelihood.
  24. Jeffreys’ Prior Zorn (2005)

  25. (image-only slide)
  26. Cauchy Prior Gelman et al. (2008)

  27. (image-only slide)
  28. The Cauchy prior produces… a confidence interval that is 250%

    wider
  29. (image-only slide)
  30. The Cauchy prior produces… a coefficient estimate that is 50%

    larger
  31. The Cauchy prior produces… a risk-ratio estimate that is 43

    million times larger
  32. Different default priors produce different results.

  33. The Prior Matters in Theory

  34.–49. (sixteen builds of the same slide) For

    1. a monotonic likelihood p(y | β) decreasing in β_s,
    2. a proper prior distribution p(β | σ), and
    3. a large, negative β_s,

    the posterior distribution of β_s is proportional to the prior distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).
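    A small R illustration (not part of the deck) of the theorem's premise, reusing the 2 x 2 data from slide 20: the log-likelihood is monotone in the coefficient on the separating variable and flattens toward a limit, so in that region a proper prior, not the data, determines the shape of the posterior.

    # profile the log-likelihood in beta_s with the intercept held at its conditional MLE
    oppose   <- c(rep(1, 16), rep(0, 14), rep(0, 20))
    democrat <- c(rep(0, 30), rep(1, 20))
    loglik <- function(b_s, b_c = qlogis(16/30)) {
      sum(dbinom(oppose, 1, plogis(b_c + b_s * democrat), log = TRUE))
    }
    b_grid <- seq(-15, 0, by = 0.1)
    plot(b_grid, sapply(b_grid, loglik), type = "l",
         xlab = "coefficient on democrat", ylab = "log-likelihood")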
  50. The prior determines crucial parts of the posterior.

  51. Key Concepts for Choosing a Good Prior

  52. Pr(y_i) = Λ(β_c + β_s s_i + β_1 x_i1 + ⋯ + β_k x_ik)
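    In R, the logistic CDF Λ is plogis(). As a hedged one-line illustration using the slide-19 estimates, an arbitrary standardized uninsured value, and ignoring the remaining covariates for brevity, the fitted probability for a state with a Democratic governor is essentially zero:

    plogis(2.58 + (-20.35) * 1 + 0.92 * 0.5)  # constant + dem_governor + % uninsured (std.)
    #> roughly 3e-08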
  53. Prior Predictive Distribution

    p(y^new) = ∫_{−∞}^{∞} p(y^new | β) p(β) dβ
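    A hedged Monte Carlo sketch (not part of the deck) of this prior predictive distribution: draw a coefficient vector from the prior, then simulate a new outcome vector from the logistic model. The design matrix and Normal(0, 4.5) prior here are illustrative.

    set.seed(1)
    n <- 50; k <- 3
    X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))      # illustrative design matrix
    beta_sims <- matrix(rnorm(1000 * k, 0, 4.5), ncol = k)   # 1,000 draws from the prior
    y_new <- apply(beta_sims, 1, function(b) rbinom(n, 1, plogis(X %*% b)))
    mean(colMeans(y_new))  # implied average Pr(y = 1) under the prior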
  54. ⎛ σ_11  σ_12  σ_13  ⋯  σ_1k ⎞
    ⎜ σ_21  σ_22  σ_23  ⋯  σ_2k ⎟
    ⎜ σ_31  σ_32  σ_33  ⋯  σ_3k ⎟
    ⎜  ⋮     ⋮     ⋮    ⋱   ⋮  ⎟
    ⎝ σ_k1  σ_k2  σ_k3  ⋯  σ_kk ⎠
  55. simplify

  56. We Already Know a Few Things

    β_1 ≈ β̂^mle_1, β_2 ≈ β̂^mle_2, …, β_k ≈ β̂^mle_k, and β_s < 0.
  57.–58. (builds repeating the k × k matrix from slide 54)
  59. Partial Prior Predictive Distribution

    p*(y^new) = ∫_{−∞}^{0} p(y^new | β_s, β̂^mle_{−s}) p(β_s | β_s < 0) dβ_s
  60. 1. Choose a prior distribution p(β_s).

    2. Estimate the model coefficients β̂^mle.
    3. For i in 1 to n_sims, do the following:
       (a) Simulate β̃_s^[i] ~ p(β_s).
       (b) Replace β̂_s^mle in β̂^mle with β̃_s^[i], yielding the vector β̃^[i].
       (c) Calculate and store the quantity of interest q̃^[i] = q(β̃^[i]).
    4. Keep only the simulations in the direction of the separation.
    5. Summarize the simulations q̃ using quantiles, histograms, or density plots.
    6. If the prior is inadequate, then update the prior distribution p(β_s).
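    A hand-rolled R sketch of this algorithm (the separation package's calc_pppd(), shown later, wraps the same workflow, so this version is only illustrative), using the 2 x 2 data from slide 20, a Normal(0, 4.5) prior, and the risk-ratio of opposition as the quantity of interest:

    oppose   <- c(rep(1, 16), rep(0, 14), rep(0, 20))
    democrat <- c(rep(0, 30), rep(1, 20))
    fit <- glm(oppose ~ democrat, family = binomial)  # step 2: MLEs (separation warning expected)
    b_c <- coef(fit)["(Intercept)"]
    prior_sims <- rnorm(10000, 0, 4.5)                # steps 1 and 3a: draws from p(beta_s)
    prior_sims <- prior_sims[prior_sims < 0]          # step 4: keep draws in the direction of separation
    rr <- plogis(b_c) / plogis(b_c + prior_sims)      # steps 3b-c: risk-ratio, Republican vs. Democratic governor
    quantile(rr, c(0.05, 0.5, 0.95))                  # step 5: summarize the simulations
    hist(log10(rr), xlab = "log10(risk-ratio)", main = "Partial prior predictive of the risk-ratio")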
  61. Example: Nuclear Weapons and War

  62.–64. (image-only slides)
  65. The prior matters, so robustness checks are critical.

  66. [Figure: three histograms of simulated risk-ratios (log scale, 1 to 100,000). Panels: Informative Normal(0, 4.5) prior, "1% of simulations"; Skeptical Normal(0, 2) prior, "< 1% of simulations"; Enthusiastic Normal(0, 8) prior, "15% of simulations".]
  67. [Figure: posterior densities for the coefficient of symmetric nuclear dyads (−20 to 0) under the Informative Normal(0, 4.5), Skeptical Normal(0, 2), and Enthusiastic Normal(0, 8) priors, Zorn's default Jeffreys' prior, and Gelman et al.'s default Cauchy(0, 2.5) prior.]
  68. [Figure: posterior distributions of the risk-ratio of war in nonnuclear dyads compared to symmetric nuclear dyads (log scale, 0.1 to 100,000). Values shown: Informative Normal(0, 4.5) prior, 24.5 and 1,986.4; Skeptical Normal(0, 2) prior, 4 and 31.2; Enthusiastic Normal(0, 8) prior, 299.2 and 499,043.2; Zorn's default Jeffreys' prior, 3.4 and 100.2; Gelman et al.'s default Cauchy(0, 2.5) prior, 9.2 and 25,277.4.]
  69. Software for Choosing a Good Prior

  70. separation (on GitHub)

  71. crain.co/example

  72. # install packages
    devtools::install_github("carlislerainey/compactr")
    devtools::install_github("carlislerainey/separation")

    # load packages
    library(separation)
    library(arm)  # for rescale()

    # load and recode data
    data(politics_and_need)
    d <- politics_and_need
    d$dem_governor <- 1 - d$gop_governor
    d$st_percent_uninsured <- rescale(d$percent_uninsured)

    # formula to use throughout
    f <- oppose_expansion ~ dem_governor + percent_favorable_aca + gop_leg +
      st_percent_uninsured + bal2012 + multiplier + percent_nonwhite + percent_metro
  73. Workflow

    1. Calculate the PPPD: calc_pppd()
    2. Simulate from the posterior: sim_post_*()
    3. Calculate quantities of interest: calc_qi()
  74. calc_pppd()

  75. # informative prior
    prior_sims_4.5 <- rnorm(10000, 0, 4.5)
    pppd <- calc_pppd(formula = f, data = d, prior_sims = prior_sims_4.5,
                      sep_var_name = "dem_governor", prior_label = "Normal(0, 4.5)")
  76. plot(pppd)

  77. plot(pppd, log_scale = TRUE)

  78. sim_post_normal() sim_post_gelman() sim_post_jeffreys()

  79. # mcmc estimation
    post <- sim_post_normal(f, d, sep_var = "dem_governor", sd = 4.5,
                            n_sims = 10000, n_burnin = 1000, n_chains = 4)
  80. calc_qi()

  81. # compute quantities of interest
    ## dem_governor
    X_pred_list <- set_at_median(f, d)
    x <- c(0, 1)
    X_pred_list$dem_governor <- x
    qi <- calc_qi(post, X_pred_list, qi_name = "fd")
  82. plot(qi, xlim = c(-1, 1), xlab = "First Difference",
         ylab = "Posterior Density",
         main = "The Effect of Democratic Partisanship on Opposing the Expansion")
  83. ## st_percent_uninsured
    X_pred_list <- set_at_median(f, d)
    x <- seq(min(d$st_percent_uninsured), max(d$st_percent_uninsured), by = 0.1)
    X_pred_list$st_percent_uninsured <- x
    qi <- calc_qi(post, X_pred_list, qi_name = "pr")
  84. plot(qi, x, xlab = "Percent Uninsured (Std.)",
         ylab = "Predicted Probability",
         main = "The Probability of Opposition as the Percent Uninsured (Std.) Varies")
  85. 15 lines

  86. Conclusion

  87. The prior matters a lot, so choose a good one.

  88. The prior matters in practice.

  89. The prior matters in theory.

  90. The partial prior predictive distribution simplifies the choice of prior.

  91. Software makes choosing a prior, estimating the model, and interpreting

    the estimates easy.
  92. What should you do?

    1. Notice the problem and do something.
    2. Recognize that the prior affects the inferences and choose a good one.
    3. Assess the robustness of your conclusions to a range of prior distributions.
  93. Questions?

  94. Appendix

  95. [Figure: posterior median and 90% HPD for the coefficient of symmetric nuclear dyads (−15 to 0) under the Informative Normal(0, 4.5), Skeptical Normal(0, 2), and Enthusiastic Normal(0, 8) priors, Zorn's default Jeffreys' invariant prior, and Gelman et al.'s default Cauchy(0, 2.5) prior.]
  96. [Figure: Pr(RR > 1) under each prior. Informative Normal(0, 4.5): 0.93; Skeptical Normal(0, 2): 0.86; Enthusiastic Normal(0, 8): 0.96; Zorn's default Jeffreys': 0.79; Gelman et al.'s default Cauchy(0, 2.5): 0.90.]
  97. For

    1. a monotonic likelihood p(y | β) decreasing in β_s,
    2. a proper prior distribution p(β | σ), and
    3. a large, negative β_s,

    the posterior distribution of β_s is proportional to the prior distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).
  98.–102. (image-only slides)
  103. Theorem 1. For a monotonic likelihood p(y | β) increasing [decreasing] in β_s, proper prior distribution p(β | σ), and large positive [negative] β_s, the posterior distribution of β_s is proportional to the prior distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).
  104.–106. Proof (spread across three slides). Due to separation, p(y | β) is monotonic increasing in β_s to a limit L, so that

    lim_{β_s → ∞} p(y | β_s) = L.

    By Bayes' rule,

    p(β | y) = p(y | β) p(β | σ) / ∫_{−∞}^{∞} p(y | β) p(β | σ) dβ = p(y | β) p(β | σ) / p(y | σ),

    where p(y | σ) is constant with respect to β. Integrating out the other parameters β_{−s} = ⟨β_cons, β_1, β_2, …, β_k⟩ gives the posterior distribution of β_s,

    p(β_s | y) = [∫_{−∞}^{∞} p(y | β) p(β | σ) dβ_{−s}] / p(y | σ),   (1)

    and the prior distribution of β_s,

    p(β_s | σ) = ∫_{−∞}^{∞} p(β | σ) dβ_{−s}.

    Notice that p(β_s | y) ∝ p(β_s | σ) iff p(β_s | y) / p(β_s | σ) = k, where the constant k ≠ 0. Thus, Theorem 1 implies that

    lim_{β_s → ∞} p(β_s | y) / p(β_s | σ) = k.

    Substituting in Equation 1,

    lim_{β_s → ∞} [∫_{−∞}^{∞} p(y | β) p(β | σ) dβ_{−s}] / [p(y | σ) p(β_s | σ)] = k.

    Multiplying both sides by p(y | σ), which is constant with respect to β,

    lim_{β_s → ∞} [∫_{−∞}^{∞} p(y | β) p(β | σ) dβ_{−s}] / p(β_s | σ) = k p(y | σ).

    Setting ∫_{−∞}^{∞} p(y | β) p(β | σ) dβ_{−s} = p(y | β_s) p(β_s | σ),

    lim_{β_s → ∞} p(y | β_s) p(β_s | σ) / p(β_s | σ) = k p(y | σ).

    Canceling p(β_s | σ) in the numerator and denominator,

    lim_{β_s → ∞} p(y | β_s) = k p(y | σ).