Data Science and Decisions 2022: Week 6

Will Lowe
April 20, 2022

Bias, variance, and error
Groupthink and other bad outcomes
Multiverse models in science
Motivating multiverses with Bayes
The limits of model averaging
Deliberation and social choice
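Consider M committee members, each of whom predicts a target Y as
$$\hat{Y}_m = Y + \epsilon_m$$
and compare this to a committee prediction
$$\hat{Y}_{Com} = \frac{1}{M} \sum_{m=1}^{M} \hat{Y}_m$$

Single models
Model m has mean squared error (MSE)
$$E_m = E[(\hat{Y}_m - Y)^2] = E[\epsilon_m^2]$$
The average MSE of M such models is
$$E_{Av} = \frac{1}{M} \sum_{m=1}^{M} E_m = \frac{1}{M} \sum_{m=1}^{M} E[\epsilon_m^2]$$

Committee
$$E_{Com} = E\left[\left(\underbrace{\frac{1}{M} \sum_{m=1}^{M} \hat{Y}_m}_{\hat{Y}_{Com}} - Y\right)^2\right] = E\left[\left(\frac{1}{M} \sum_{m=1}^{M} \epsilon_m\right)^2\right]$$
If model errors are
$$E[\epsilon_m] = 0 \quad \text{(well-specified)}$$
$$E[\epsilon_m \epsilon_{m'}] = 0, \; m \neq m' \quad \text{(uncorrelated)}$$
then the cross terms vanish and
$$E_{Com} = \frac{1}{M^2} \sum_{m=1}^{M} E[\epsilon_m^2] = \frac{1}{M} E_{Av}$$
Simply averaging gives us M times less error!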
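The 1/M result can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the slides: Gaussian errors with a common standard deviation, a uniformly distributed target, and the hypothetical function name `simulate_committee`.

```python
import random

def simulate_committee(n_models=10, n_trials=20000, noise_sd=1.0, seed=0):
    """Estimate the average single-model MSE (E_Av) and the committee MSE (E_Com)."""
    rng = random.Random(seed)
    sum_sq_single = 0.0     # running total of per-trial average squared model error
    sum_sq_committee = 0.0  # running total of squared committee error
    for _ in range(n_trials):
        y = rng.uniform(-1.0, 1.0)  # the target Y (assumed distribution)
        # Each model predicts Y plus its own independent, zero-mean error
        preds = [y + rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        sum_sq_single += sum((p - y) ** 2 for p in preds) / n_models
        committee = sum(preds) / n_models  # the committee prediction
        sum_sq_committee += (committee - y) ** 2
    return sum_sq_single / n_trials, sum_sq_committee / n_trials

e_av, e_com = simulate_committee()
print(f"E_Av = {e_av:.3f}, E_Com = {e_com:.3f}, ratio = {e_av / e_com:.1f}")
```

With ten models the printed ratio should be close to ten, up to Monte Carlo noise.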
Ensembles
Model combination as a model design strategy is known as ensemble learning
Important examples
→ Bagging ('bootstrap aggregating', Breiman)
→ Boosting (Freund & Schapire)
Hint: When someone suggests Deep Learning, try XGBoost first!

Why?
Model combination helps navigate the bias-variance decomposition
→ Individual models can afford to be less biased and quite variable
→ because averaging reduces committee variance
→ The complete model has at least as little bias as its components
Let's briefly review the bias-variance tradeoff...
[Figure: fitted curves for models of degree 0, 1, 2, 3, 6, and 8]
$\mathrm{MSE} = \mathrm{variance} + \mathrm{bias}^2 + \mathrm{noise}$
→ Most flexible model: no error, high variance, low bias
→ Least flexible model: high error, low variance, high bias
→ Intermediate model: lowest error, mid variance, mid bias

[Figure: in-sample and out-of-sample RMSE by degree, 0 to 9]
In-sample error is a good guide to out-of-sample performance... until it isn't
→ This is the bias-variance tradeoff
$E_{Com} \leq E_{Av}$
Whether we get M times less error depends on our assumptions
→ Correct model specifications, e.g. no missing variables / unconfounded
→ Uncorrelated errors, e.g. errors independent in time and group
What if these are not true?
→ Build a better model!
→ Learn the model error correlation matrix and make a weighted average
Perfectly correlated judgment errors too
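To see how correlation erodes the gain from averaging, here is a minimal sketch. It assumes (beyond what the slides state) that every model's error has unit variance and that every pair of errors shares the same correlation rho. Under those assumptions the committee error is 1/M + rho (M - 1)/M, which equals the single-model error when rho = 1. The function names are hypothetical.

```python
import math
import random

def committee_mse(rho, n_models=10, n_trials=20000, seed=1):
    """Monte Carlo committee MSE when every pair of errors has correlation rho."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # error component common to all models
        # Mixing a shared and a private component gives unit variance
        # and pairwise correlation rho (an assumed error structure)
        errors = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_models)
        ]
        total += (sum(errors) / n_models) ** 2
    return total / n_trials

def committee_mse_theory(rho, n_models=10):
    """Closed form for the same setting: 1/M + rho * (M - 1)/M."""
    return 1.0 / n_models + rho * (n_models - 1) / n_models

for rho in (0.0, 0.5, 1.0):
    print(rho, round(committee_mse(rho), 3), round(committee_mse_theory(rho), 3))
```

At rho = 1 the committee does no better than any one of its members, which is the perfectly correlated case.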
to Janis):
→ Group cohesiveness (correlated errors)
→ Group insulation (model misspecification)
→ Leadership style (weighting, correlated errors)
→ Methodical information processing procedures

Increase committee diversity, e.g. board composition
Often thought to work because of
→ increased representation
→ new perspectives
From our perspective, maybe because increased diversity
→ decorrelates errors
→ makes for better specified models (participants)
Note:
→ 'playing devil's advocate' (or WWJD?) fakes actual diversity by anti-correlating errors
different ways to study the same question
Inter-researcher disagreement is often driven by different
→ subsets of data, coding strategies
→ DV, IV, control choices
→ variable transformations, e.g. logs and cut-points
→ statistical assumptions, e.g. fixed vs random effects, interactions, and non-linearities
Identify models with decision makers
How to make a final inference?
→ Refuse!
→ Choose the best: model selection
→ Combine them: model averaging
Multiverse analysis (Steegen et al.)
and are below
→ Did they confirm hypothesis ?
→ Is the probability of hypothesis / ?
How to combine these estimates?
preprocessing choices, of data D where Z is our z-score.

By the Law of Total Probability
$$P(Z \mid D) = \sum_{m=1}^{M} P(Z, M = m \mid D) = \sum_{m=1}^{M} P(Z \mid M = m, D) \, P(M = m \mid D)$$
$P(M = m \mid D)$ is the posterior probability of model m. It is sometimes loosely called a 'Bayes Factor'; strictly, a Bayes factor is the ratio of two models' marginal likelihoods.

If we don't bother to distinguish between how likely each model is in the light of the data then, maybe...
$$P(M = m \mid D) = P(M = m) = \frac{1}{M} \quad \text{(a 'flat' prior)}$$
What if we wanted to do things properly?
Consider a model with parameters β
Bayesian inferences about β should be based on $P(\beta \mid D, M = m)$, which is...
$$\frac{P(D \mid \beta, M = m) \, P(\beta \mid M = m)}{P(D \mid M = m)}$$
The marginal likelihood of the data is the denominator
→ How likely each M makes the data, averaging out uncertainty about β
from which comes the 'Bayes Factor'
$$P(M = m \mid D) = \frac{P(D \mid M = m) \, P(M = m)}{P(D)}$$
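A small numerical sketch of the two weighting schemes follows. The z-scores and log marginal likelihoods are invented purely for illustration, and `posterior_model_probs` is a hypothetical helper, not part of any library. It applies Bayes' rule on the log scale and then compares a flat average with one weighted by P(M = m | D).

```python
import math

def posterior_model_probs(log_marginal_liks, priors=None):
    """P(M = m | D) from log P(D | M = m) and P(M = m), via Bayes' rule."""
    n = len(log_marginal_liks)
    if priors is None:
        priors = [1.0 / n] * n  # flat prior over models
    log_joint = [ll + math.log(p) for ll, p in zip(log_marginal_liks, priors)]
    top = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(lj - top) for lj in log_joint]
    total = sum(weights)  # proportional to P(D)
    return [w / total for w in weights]

# Hypothetical inputs (made up for illustration, not from the lecture):
# three model specifications, each yielding one z-score estimate
z_scores = [2.4, 1.1, 0.3]
log_marginal_liks = [-104.2, -103.1, -106.0]

probs = posterior_model_probs(log_marginal_liks)
z_flat = sum(z_scores) / len(z_scores)               # every model weighted 1/M
z_bma = sum(p * z for p, z in zip(probs, z_scores))  # weighted by P(M = m | D)
print([round(p, 3) for p in probs], round(z_flat, 3), round(z_bma, 3))
```

The weighted value is the mean of the mixture P(Z | D), so it moves towards the estimates of the models that make the data more likely.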
$P(M = m \mid D) = P(M = m) = 1/M$ (though not impossible)
→ If some model specifications are a priori more likely than others, $P(M = m) \neq 1/M$
→ Unless model specification choices are independent of D, i.e. you don't learn anything about them by seeing the data, $P(M = m \mid D) \neq P(M = m)$

Model selection
We can also use it to select a single model (but why?)

Occam's Razor
→ inflexible model: few Ds but more likely
→ flexible model: many Ds less likely
Flexibility and data fit trade off (Bishop)
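A coin-flipping sketch makes the trade-off concrete. The two models are assumptions chosen for illustration: an inflexible 'fair coin' with no free parameters, and a flexible 'unknown bias' model with a uniform prior on the heads probability. Both functions give the marginal likelihood of one particular sequence of flips.

```python
from math import factorial

def marginal_lik_fair(heads, tails):
    """P(D | M = fair coin): no free parameters, theta fixed at 0.5."""
    return 0.5 ** (heads + tails)

def marginal_lik_flexible(heads, tails):
    """P(D | M = unknown bias): theta ~ Uniform(0, 1), integrated out.

    The integral of theta^h (1 - theta)^t over [0, 1] is h! t! / (h + t + 1)!.
    """
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

for heads, tails in [(5, 5), (9, 1)]:
    fair = marginal_lik_fair(heads, tails)
    flex = marginal_lik_flexible(heads, tails)
    preferred = "fair" if fair > flex else "flexible"
    print(heads, tails, fair, flex, preferred)
```

The fair-coin model gives every sequence of ten flips the same probability, so it is favoured on balanced data. The flexible model spreads its probability over many more kinds of data set and is only favoured when the data are lopsided.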
expertise on this kind of data: $P(M = m \mid D)$ for my D
→ Small numbers of experts (but not one)
→ Not large numbers of uninformed
Tetlock and Gardner offer two cognitive styles
→ Hedgehogs (assign high probability to relatively few data sets)
→ Foxes (assign lower probability to a wider range)
We should probably think of these as a continuum
to average D
→ R: rainfall
→ Y: crop yields
→ P: political conflict
→ F: famine
Question:
→ What is the effect of political conflict on famine?
What variables to choose? (List)
In a model of F
→ One expert would condition on Y
→ Another expert would not condition on Y
→ A third expert would only condition on Y if they wanted the 'direct effect' of P
fit their models and we can average the predictions
→ Predictively this might be fine?
→ The marginal effect of P on F is a mixture of very different 'worlds'
We will be working with one expert's model (it's a superset of the other two)
But there's no guarantee that the superset of more experts will make any sense (or even be acyclic!)
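The acyclicity worry is easy to demonstrate. In the sketch below the two edge sets are hypothetical and are not the experts' graphs from the lecture: one assumes poor yields drive conflict (Y → P), the other assumes conflict reduces yields (P → Y). Each graph is a valid DAG on its own, but their union is not.

```python
def has_cycle(edges):
    """Detect a directed cycle with depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {node: 0 for node in graph}  # 0 = unseen, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            # Reaching an in-progress node means we have looped back
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)

# Hypothetical causal graphs over R (rainfall), Y (yields), P (conflict), F (famine)
expert_a = {("R", "Y"), ("Y", "F"), ("P", "F"), ("Y", "P")}
expert_b = {("R", "Y"), ("Y", "F"), ("P", "F"), ("P", "Y")}

print(has_cycle(expert_a), has_cycle(expert_b), has_cycle(expert_a | expert_b))
```

Taking the union of edges is only one possible way to build a 'superset' model; it is used here because it is the simplest.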
Gibbard and Satterthwaite (Gibbard; Satterthwaite) suggest
→ Intuitive criteria for 'voting' cannot be simultaneously satisfied

"[with deliberation] there would not be any need for an aggregating mechanism, since a rational discussion would tend to produce unanimous preferences"
Jon Elster, in Matravers and Pike

Deliberation increases single peakedness, a.k.a. meta-agreement (Dryzek & List)
Subsequent voting will be immune to Arrow-like problems
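One classic aggregation problem, a majority cycle, can be shown in a few lines, along with how single-peaked preferences avoid it. The two voter profiles are textbook-style illustrations rather than data from the lecture, and the helper names are hypothetical. The sketch assumes strict rankings and an odd number of voters, so pairwise votes never tie.

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Map (a, b) to True when a majority of voters rank a above b.

    Assumes strict rankings and an odd number of voters (no ties).
    """
    options = rankings[0]
    wins = {}
    for a, b in combinations(options, 2):
        a_votes = sum(r.index(a) < r.index(b) for r in rankings)
        wins[(a, b)] = a_votes > len(rankings) / 2
        wins[(b, a)] = not wins[(a, b)]
    return wins

def condorcet_winner(rankings):
    """The option that beats every other option pairwise, or None."""
    wins = pairwise_majority(rankings)
    for a in rankings[0]:
        if all(wins[(a, b)] for b in rankings[0] if b != a):
            return a
    return None

# Cyclic profile: A beats B, B beats C, and C beats A
cyclic = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
# Single-peaked profile along the ordering A, B, C
single_peaked = [("A", "B", "C"), ("B", "A", "C"), ("C", "B", "A")]

print(condorcet_winner(cyclic), condorcet_winner(single_peaked))
```

In the single-peaked profile the median voter's favourite wins every pairwise contest, so majority voting gives a stable answer.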
sophisticated respondents have low-D preferences
→ Their views on different policies are not independent
→ Equivalently, their effective D is lower than the number of questions you ask them

Unsophisticated voters have nearly independent preferences
→ Many preference inference methods assume sophistication, e.g. averages of directed responses, scaling models, etc.
→ These will put sophisticated voters in the right place, and unsophisticated voters in the middle anyway despite them being mostly elsewhere
Ideology is a regularizer / dimensionality reducer / preference structurer...
Deliberation increases sophistication!
Rumelhart) with one hidden layer:
A universal approximator, due to the internal non-linearity (Hornik et al.)
A linear combination of J models
$$E[Y \mid X_1 \ldots X_D] = \sum_{j=1}^{J} \beta_j \phi_j(X_1 \ldots X_D)$$
where $\phi_j$ is a non-linear function of a linear combination of input data
$$\phi_j(X_1 \ldots X_D) = \left(1 + \exp\left(-\sum_{d=1}^{D} \beta_{jd} X_d\right)\right)^{-1}$$
That's
→ a model with D × J + J parameters
→ a regression on the output of J logistic regressions on the input data
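The two formulas translate directly into a forward pass. This is a minimal sketch with random, untrained weights and no intercept terms, matching the D × J + J parameter count above. The function names and the sizes D = 4, J = 3 are illustrative assumptions.

```python
import math
import random

def hidden_unit(x, weights):
    """phi_j: a logistic function of a linear combination of the inputs."""
    return 1.0 / (1.0 + math.exp(-sum(w * xd for w, xd in zip(weights, x))))

def predict(x, hidden_weights, output_weights):
    """E[Y | X_1 ... X_D]: a linear combination of the J hidden units."""
    return sum(b * hidden_unit(x, w) for b, w in zip(output_weights, hidden_weights))

def n_parameters(hidden_weights, output_weights):
    """Count the D * J hidden weights plus the J output weights."""
    return sum(len(w) for w in hidden_weights) + len(output_weights)

rng = random.Random(0)
D, J = 4, 3  # illustrative sizes, not from the lecture
hidden_weights = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(J)]  # beta_jd
output_weights = [rng.gauss(0, 1) for _ in range(J)]                      # beta_j
x = [rng.gauss(0, 1) for _ in range(D)]

print(predict(x, hidden_weights, output_weights))
print(n_parameters(hidden_weights, output_weights))
```

Each call to `hidden_unit` has the form of a logistic regression on the inputs, and `predict` is the linear regression on their outputs.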