Dealing with Separation in Logistic Regression Models

Carlisle Rainey
December 04, 2014

Slides for a paper available at http://www.carlislerainey.com/papers/separation.pdf

Transcript

  1. Dealing with Separation in
    Logistic Regression Models
    Carlisle Rainey
    Assistant Professor
    University at Buffalo, SUNY
    [email protected]
    paper, data, and code at
    crain.co/research

  2. Dealing with Separation in
    Logistic Regression Models

  3. The prior matters a lot,
    so choose a good one.
    43 million times larger

  4. The prior matters a lot,
    so choose a good one.
    1. in practice
    2. in theory
    3. concepts
    4. software

  5. The Prior Matters
    in Practice

  9. 2 million

  10. 3,000

  11. 100%

  12. 90%

  13. “To expand this program is not
    unlike adding a thousand
    people to the Titanic.”

    — July 2012

  15. politics vs. need

  16. “Obamacare is going to be horrible
    for patients. It’s going to be horrible
    for taxpayers. It’s probably the
    biggest job killer ever.”
    — October 2010

  17. “Obamacare is going to be horrible
    for patients. It’s going to be horrible
    for taxpayers. It’s probably the
    biggest job killer ever.”
    — October 2010
    “While the federal government is committed
    to paying 100 percent of the cost, I cannot,
    in good conscience, deny Floridians that
    need it access to healthcare.”
    — February 2013

  18. In the tug-of-war between politics and need,
    which one wins?

  19. Variable Coefficient Confidence Interval
    Democratic Governor -20.35 [-6,340.06; 6,299.36]
    % Uninsured (Std.) 0.92 [-3.46; 5.30]
    % Favorable to ACA 0.01 [-0.17; 0.18]
    GOP Legislature 2.43 [-0.47; 5.33]
    Fiscal Health 0.00 [-0.02; 0.02]
    Medicaid Multiplier -0.32 [-2.45; 1.80]
    % Non-white 0.05 [-0.12; 0.21]
    % Metropolitan -0.08 [-0.17; 0.02]
    Constant 2.58 [-7.02; 12.18]

  20. Doesn’t Oppose Opposes
    Republican 14 16
    Democrat 20 0

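    A minimal R sketch (not part of the deck) reproduces this table and the
    failure it causes: no state with a Democratic governor opposes the
    expansion, so the likelihood keeps increasing as the coefficient goes to
    negative infinity.

    # toy data matching the 2x2 table above
    oppose_expansion <- c(rep(0, 14), rep(1, 16), rep(0, 20))
    dem_governor     <- c(rep(0, 30), rep(1, 20))
    fit <- glm(oppose_expansion ~ dem_governor,
               family = binomial)  # warns: fitted probabilities 0 or 1 occurred
    summary(fit)  # huge negative coefficient with an enormous standard error
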
  22. Variable Coefficient Confidence Interval
    Democratic Governor -26.35 [-126,979.03; 126,926.33]
    % Uninsured (Std.) 0.92 [-3.46; 5.30]
    % Favorable to ACA 0.01 [-0.17; 0.18]
    GOP Legislature 2.43 [-0.47; 5.33]
    Fiscal Health 0.00 [-0.02; 0.02]
    Medicaid Multiplier -0.32 [-2.45; 1.80]
    % Non-white 0.05 [-0.12; 0.21]
    % Metropolitan -0.08 [-0.17; 0.02]
    Constant 2.58 [-7.02; 12.18]

  23. Variable Coefficient Confidence Interval
    Democratic Governor -26.35 [-126,979.03; 126,926.33]
    % Uninsured (Std.) 0.92 [-3.46; 5.30]
    % Favorable to ACA 0.01 [-0.17; 0.18]
    GOP Legislature 2.43 [-0.47; 5.33]
    Fiscal Health 0.00 [-0.02; 0.02]
    Medicaid Multiplier -0.32 [-2.45; 1.80]
    % Non-white 0.05 [-0.12; 0.21]
    % Metropolitan -0.08 [-0.17; 0.02]
    Constant 2.58 [-7.02; 12.18]
    useless: the confidence interval
    unreasonable: the coefficient
    This is a failure of maximum likelihood.

  24. Jeffreys’ Prior
    Zorn (2005)

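    As a hedged sketch (the deck does not name software for this prior), the
    logistf package implements Firth's penalized likelihood, which is
    equivalent to estimation under Jeffreys' invariant prior:

    # install.packages("logistf")
    library(logistf)
    d_toy <- data.frame(oppose = c(rep(0, 14), rep(1, 16), rep(0, 20)),
                        dem    = c(rep(0, 30), rep(1, 20)))
    fit_jeffreys <- logistf(oppose ~ dem, data = d_toy)  # Firth/Jeffreys logit
    summary(fit_jeffreys)  # finite estimate and interval despite separation
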
  26. Cauchy Prior
    Gelman et al. (2008)

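    Gelman et al.'s default Cauchy(0, 2.5) prior is the default in
    arm::bayesglm(); a sketch on the same toy data as above:

    library(arm)
    d_toy <- data.frame(oppose = c(rep(0, 14), rep(1, 16), rep(0, 20)),
                        dem    = c(rep(0, 30), rep(1, 20)))
    fit_cauchy <- bayesglm(oppose ~ dem, family = binomial, data = d_toy)
    summary(fit_cauchy)  # Cauchy(0, 2.5) prior on coefficients by default
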
  28. The Cauchy prior produces…
    a confidence interval that is
    250% wider

  30. The Cauchy prior produces…
    a coefficient estimate that is
    50% larger

  31. The Cauchy prior produces…
    a risk-ratio estimate that is
    43 million times larger

  32. Different default priors
    produce different results.

  33. The Prior Matters
    in Theory

  34.–49. For
    1. a monotonic likelihood p(y | β) decreasing in β_s,
    2. a proper prior distribution p(β | σ), and
    3. a large, negative β_s,
    the posterior distribution of β_s is proportional to the prior
    distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).

  50. The prior determines
    crucial parts of the posterior.

  51. Key Concepts
    for Choosing a Good Prior

  52. Pr(y_i) = Λ(β_cons + β_s s_i + β_1 x_i1 + ... + β_k x_ik)
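
    Λ is the inverse logit, plogis() in R. Plugging in the constant (2.58) and
    the Democratic governor coefficient (-26.35) from slide 22, and ignoring
    the other covariates for illustration:

    plogis(2.58 - 26.35 * 1)  # Pr(oppose) with a Democratic governor: ~0
    plogis(2.58 - 26.35 * 0)  # Pr(oppose) with a Republican governor: ~0.93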

  53. Prior Predictive Distribution
    p(y^new) = ∫_{−∞}^{∞} p(y^new | β) p(β) dβ

  54. ⎛ σ_11  σ_12  σ_13  …  σ_1k ⎞
      ⎜ σ_21  σ_22  σ_23  …  σ_2k ⎟
      ⎜ σ_31  σ_32  σ_33  …  σ_3k ⎟
      ⎜  ⋮     ⋮     ⋮    ⋱   ⋮  ⎟
      ⎝ σ_k1  σ_k2  σ_k3  …  σ_kk ⎠

  55. simplify

  56. We Already Know a Few Things
    β_1 ≈ β̂_1^mle
    β_2 ≈ β̂_2^mle
    ...
    β_k ≈ β̂_k^mle
    β_s < 0

  57.–58. ⎛ σ_11  σ_12  σ_13  …  σ_1k ⎞
          ⎜ σ_21  σ_22  σ_23  …  σ_2k ⎟
          ⎜ σ_31  σ_32  σ_33  …  σ_3k ⎟
          ⎜  ⋮     ⋮     ⋮    ⋱   ⋮  ⎟
          ⎝ σ_k1  σ_k2  σ_k3  …  σ_kk ⎠

  59. Partial Prior Predictive Distribution
    p*(y^new) = ∫_{−∞}^{0} p(y^new | β_s, β̂_{−s}^mle) p(β_s | β_s ≤ 0) dβ_s

  60. 1. Choose a prior distribution p(β_s).
    2. Estimate the model coefficients β̂^mle.
    3. For i in 1 to n_sims, do the following:
      (a) Simulate β̃_s^[i] ~ p(β_s).
      (b) Replace β̂_s^mle in β̂^mle with β̃_s^[i], yielding the vector β̃^[i].
      (c) Calculate and store the quantity of interest q̃^[i] = q(β̃^[i]).
    4. Keep only the simulations in the direction of the separation.
    5. Summarize the simulations q̃ using quantiles, histograms, or density plots.
    6. If the prior is inadequate, then update the prior distribution p(β_s).
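
    A rough R sketch of these six steps (not the separation package itself);
    it assumes an already-fitted glm `fit` whose separating variable is
    dem_governor and a hypothetical function q() that maps a coefficient
    vector to the quantity of interest:

    n_sims <- 10000
    beta_mle <- coef(fit)                # step 2: MLE of all coefficients
    prior_sims <- rnorm(n_sims, 0, 4.5)  # steps 1 and 3(a): draws from p(beta_s)
    q_sims <- numeric(n_sims)
    for (i in 1:n_sims) {                # step 3
      beta_tilde <- beta_mle
      beta_tilde["dem_governor"] <- prior_sims[i]  # step 3(b)
      q_sims[i] <- q(beta_tilde)         # step 3(c): q() is hypothetical
    }
    q_sims <- q_sims[prior_sims < 0]     # step 4: direction of the separation
    quantile(q_sims, c(0.05, 0.5, 0.95)) # step 5: summarize the simulations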

  61. Example
    Nuclear Weapons and War

  65. The prior matters,
    so robustness checks
    are critical.

  66. [Figure: histograms of simulated risk-ratios (log scale, 1 to 100,000).
    Informative Normal(0, 4.5) Prior: 1% of simulations;
    Skeptical Normal(0, 2) Prior: < 1% of simulations;
    Enthusiastic Normal(0, 8) Prior: 15% of simulations.]

  67. [Figure: posterior densities for the coefficient of symmetric nuclear
    dyads (−20 to 0) under the Informative Normal(0, 4.5), Skeptical
    Normal(0, 2), and Enthusiastic Normal(0, 8) priors, Zorn's Default
    Jeffreys' Prior, and Gelman et al.'s Default Cauchy(0, 2.5) Prior.]

  68. [Figure: posterior distributions of the risk-ratio of war in nonnuclear
    dyads compared to symmetric nuclear dyads (log scale), with summary points
    for each prior. Informative Normal(0, 4.5): 24.5 and 1986.4; Skeptical
    Normal(0, 2): 4 and 31.2; Enthusiastic Normal(0, 8): 299.2 and 499,043.2;
    Zorn's Default Jeffreys' Prior: 3.4 and 100.2; Gelman et al.'s Default
    Cauchy(0, 2.5) Prior: 9.2 and 25,277.4.]

  69. Software
    for Choosing a Good Prior

  70. separation
    (on GitHub)

  71. crain.co/example

  72. # install packages
    devtools::install_github("carlislerainey/compactr")
    devtools::install_github("carlislerainey/separation")
    # load packages
    library(separation)
    library(arm)  # for rescale()
    # load and recode data
    data(politics_and_need)
    d <- politics_and_need
    d$dem_governor <- 1 - d$gop_governor
    d$st_percent_uninsured <- rescale(d$percent_uninsured)
    # formula to use throughout
    f <- oppose_expansion ~ dem_governor + percent_favorable_aca +
      gop_leg + st_percent_uninsured + bal2012 + multiplier +
      percent_nonwhite + percent_metro

  73. Workflow
    1. Calculate the PPPD: calc_pppd()
    2. Simulate from the posterior: sim_post_*()
    3. Calculate quantities of interest: calc_qi()

  74. calc_pppd()

  75. # informative prior
    prior_sims_4.5 <- rnorm(10000, 0, 4.5)
    pppd <- calc_pppd(formula = f, data = d,
                      prior_sims = prior_sims_4.5,
                      sep_var_name = "dem_governor",
                      prior_label = "Normal(0, 4.5)")

  76. plot(pppd)

  77. plot(pppd, log_scale = TRUE)

  78. sim_post_normal()
    sim_post_gelman()
    sim_post_jeffreys()

  79. # mcmc estimation
    post <- sim_post_normal(f, d, sep_var = "dem_governor",
                            sd = 4.5, n_sims = 10000,
                            n_burnin = 1000, n_chains = 4)

  80. calc_qi()

  81. # compute quantities of interest
    ## dem_governor
    X_pred_list <- set_at_median(f, d)
    x <- c(0, 1)
    X_pred_list$dem_governor <- x
    qi <- calc_qi(post, X_pred_list, qi_name = "fd")

  82. plot(qi, xlim = c(-1, 1),
         xlab = "First Difference",
         ylab = "Posterior Density",
         main = "The Effect of Democratic Partisanship on Opposing the Expansion")

  83. ## st_percent_uninsured
    X_pred_list <- set_at_median(f, d)
    x <- seq(min(d$st_percent_uninsured),
             max(d$st_percent_uninsured),
             by = 0.1)
    X_pred_list$st_percent_uninsured <- x
    qi <- calc_qi(post, X_pred_list, qi_name = "pr")

  84. plot(qi, x,
         xlab = "Percent Uninsured (Std.)",
         ylab = "Predicted Probability",
         main = "The Probability of Opposition as the Percent Uninsured (Std.) Varies")

  85. 15 lines

  86. Conclusion

  87. The prior matters a lot,
    so choose a good one.

  88. The prior matters
    in practice.

  89. The prior matters
    in theory.

  90. The partial prior predictive distribution
    simplifies the choice of prior.

  91. Software makes choosing a prior,
    estimating the model, and
    interpreting the estimates easy.

  92. What should you do?
    1. Notice the problem and do something.
    2. Recognize that the prior affects the inferences and choose a good one.
    3. Assess the robustness of your conclusions to a range of prior distributions.

  93. Questions?

  94. Appendix

  95. [Figure: posterior medians and 90% HPD intervals (−15 to 0) for the
    coefficient of symmetric nuclear dyads under the Informative
    Normal(0, 4.5), Skeptical Normal(0, 2), and Enthusiastic Normal(0, 8)
    priors, Zorn's Default Jeffreys' Invariant Prior, and Gelman et al.'s
    Default Cauchy(0, 2.5) Prior.]

  96. Pr(RR > 1)
    Informative Normal(0, 4.5) Prior: 0.93
    Skeptical Normal(0, 2) Prior: 0.86
    Enthusiastic Normal(0, 8) Prior: 0.96
    Zorn's Default Jeffreys' Prior: 0.79
    Gelman et al.'s Default Cauchy(0, 2.5) Prior: 0.90
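
    Given posterior simulations of the risk-ratio (say, a vector rr_sims; the
    name is illustrative, e.g. draws produced by calc_qi()), each Pr(RR > 1)
    above is just the share of draws that exceed 1:

    mean(rr_sims > 1)  # posterior probability that the risk-ratio exceeds 1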

  97. For
    1. a monotonic likelihood p(y | β) decreasing in β_s,
    2. a proper prior distribution p(β | σ), and
    3. a large, negative β_s,
    the posterior distribution of β_s is proportional to the prior
    distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).

  103. Theorem 1. For a monotonic likelihood p(y | β) increasing [decreasing]
    in β_s, proper prior distribution p(β | σ), and large positive [negative]
    β_s, the posterior distribution of β_s is proportional to the prior
    distribution for β_s, so that p(β_s | y) ∝ p(β_s | σ).

  104.–106. Proof. Due to separation, p(y | β) is monotonic increasing in β_s
    to a limit L, so that lim_{β_s → ∞} p(y | β_s) = L. By Bayes' rule,

      p(β | y) = p(y | β) p(β | σ) / ∫ p(y | β) p(β | σ) dβ
               = p(y | β) p(β | σ) / p(y | σ),

    where the denominator p(y | σ) is constant w.r.t. β. Integrating out the
    other parameters β_{−s} = ⟨β_cons, β_1, β_2, ..., β_k⟩ to obtain the
    posterior distribution of β_s,

      p(β_s | y) = ∫ p(y | β) p(β | σ) dβ_{−s} / p(y | σ),    (1)

    and the prior distribution of β_s,

      p(β_s | σ) = ∫ p(β | σ) dβ_{−s}.

    Notice that p(β_s | y) ∝ p(β_s | σ) iff p(β_s | y) / p(β_s | σ) = k, where
    the constant k ≠ 0. Thus, Theorem 1 implies that

      lim_{β_s → ∞} p(β_s | y) / p(β_s | σ) = k.

    Substituting in Equation 1,

      lim_{β_s → ∞} [∫ p(y | β) p(β | σ) dβ_{−s} / p(y | σ)] / p(β_s | σ) = k.

    Multiplying both sides by p(y | σ), which is constant with respect to β_s,

      lim_{β_s → ∞} ∫ p(y | β) p(β | σ) dβ_{−s} / p(β_s | σ) = k p(y | σ).

    Setting ∫ p(y | β) p(β | σ) dβ_{−s} = p(y | β_s) p(β_s | σ),

      lim_{β_s → ∞} p(y | β_s) p(β_s | σ) / p(β_s | σ) = k p(y | σ).

    Canceling p(β_s | σ) in the numerator and denominator,

      lim_{β_s → ∞} p(y | β_s) = k p(y | σ).