[Figure: Regularizing priors and out-of-sample deviance. Two panels plot deviance against number of parameters (1 to 5), one for N = 20 and one for N = 100, showing in-sample and out-of-sample points under priors N(0,1), N(0,0.5), and N(0,0.2).]
In theory: "Information criteria"
• "Information", because the use of deviance comes from information-theoretic analysis
• "Criteria", because used to compare models
• Information criteria estimate deviance out of sample
• AIC, DIC, WAIC, many others
Sometimes care about accumulated error over learning, aka prequential error
• Consider the humble wurst
• Grill-only or boil-then-grill?
• Want to consume each wurst
• How to learn and eat well at the same time?
• AIC not the right scenario
DIC: Deviance Information Criterion
• Does not require flat priors
• Does require reasonably Gaussian posterior
• Does require effective parameter count << N
• Computed from posterior samples (sketch below)
• DIC function in rethinking
[Photo: David J. Spiegelhalter (1953–)]
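A minimal sketch of the computation, assuming a matrix of posterior draws and a user-supplied deviance function; both names are hypothetical, and rethinking's DIC() wraps this for fitted models:

```r
# Minimal DIC sketch (hypothetical inputs):
#   samples : matrix of posterior draws, one row per draw
#   dev     : function(theta) returning -2 * log-likelihood of data at theta
dic_sketch <- function(samples, dev) {
  D_bar <- mean(apply(samples, 1, dev))  # average deviance over the posterior
  D_hat <- dev(colMeans(samples))        # deviance at the posterior mean
  pD    <- D_bar - D_hat                 # effective number of parameters
  D_hat + 2 * pD                         # DIC; equivalently D_bar + pD
}
```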
Log pointwise predictive density: the log probability of the data, averaged over the posterior distribution
• Does all calculations pointwise
• For each separable piece of data y_i:
• (1) compute Pr(y_i | θ) for each sample of θ
• (2) average the likelihoods
• Sum all the log average likelihoods
• 1000 observations and 5000 samples => 5 million likelihoods

lppd is the sum of logs of average likelihoods:

$$\mathrm{lppd} = \sum_{i=1}^{N} \log \int \Pr(y_i \mid \theta)\,\Pr(\theta)\,d\theta = \sum_{i=1}^{N} \log \mathrm{E}_{\theta}\big[\Pr(y_i \mid \theta)\big]$$

Estimated from S posterior samples θ_s:

$$\mathrm{lppd} \approx \sum_{i=1}^{N} \log \frac{1}{S} \sum_{s=1}^{S} \Pr(y_i \mid \theta_s)$$

The effective number of parameters is the pointwise variance of the log-likelihood:

$$p_{\mathrm{WAIC}} = \sum_{i=1}^{N} \mathrm{var}_{\theta}\big[\log \Pr(y_i \mid \theta)\big]$$

$$\mathrm{WAIC} = -2\,(\mathrm{lppd} - p_{\mathrm{WAIC}})$$
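A minimal sketch of the pointwise calculation, assuming an S x N matrix of log-likelihoods (rows = posterior samples, columns = observations); the matrix name is hypothetical, and rethinking's WAIC() does this for fitted models:

```r
# Minimal WAIC sketch (hypothetical input):
#   log_lik : S x N matrix of log Pr(y_i | theta_s)
waic_sketch <- function(log_lik) {
  # lppd: log of the average likelihood per observation,
  # computed with log-sum-exp for numerical stability
  lppd <- apply(log_lik, 2, function(ll) {
    m <- max(ll)
    m + log(mean(exp(ll - m)))
  })
  # p_WAIC: variance of the log-likelihood across samples, per observation
  p_waic <- apply(log_lik, 2, var)
  -2 * (sum(lppd) - sum(p_waic))  # WAIC on the deviance scale
}
```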
In practice, DIC and WAIC tend to agree
• When the mean isn't a good summary of the posterior, DIC can go squirrelly: pD < 0
• Mixture models routinely frustrate DIC
• In any event, don't mix criteria: use WAIC for all models, or DIC for all models
• Drawback: WAIC requires separating the data into independent observations
• Time series: is the entire trend for each unit one observation?
• Spatial/network models: all outcomes joint?
There is uncertainty about models, in addition to uncertainty about parameters
• Model averaging: simulate predictions, averaging over uncertainty about models
• Don't average parameters, only predictions
For more than one model, can average the averages
• Do not average parameter estimates, just predictions
• Because parameters in different models live in different small worlds => don't mean the same thing, even if named the same thing
• But predictions reference a common large world
• Compute Akaike weights for each model
• Compute distribution of predictions for each model
• Mix predictions using model weights (sketch after this list)
• Result is one kind of prediction ensemble
• Such ensembles can outperform single-model predictions
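A minimal sketch of the mixing step, assuming a vector of WAIC values and a list of per-model prediction matrices; all names are hypothetical, and rethinking's compare() and ensemble() implement this workflow:

```r
# Akaike weights from WAIC values (hypothetical input vector)
akaike_weights <- function(waic) {
  d <- waic - min(waic)  # differences from the best model
  w <- exp(-0.5 * d)     # relative likelihoods of the models
  w / sum(w)             # normalize so the weights sum to 1
}

# Ensemble prediction: weighted average of per-model predictions
#   preds : hypothetical list of matrices (posterior samples x new cases)
ensemble_mean <- function(preds, waic) {
  w <- akaike_weights(waic)
  Reduce(`+`, Map(`*`, preds, w))  # weight each model's matrix, then sum
}
```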
Every president elected in a year ending in digit "0" died in office
• W. H. Harrison first, "Old Tippecanoe"
• Lincoln, Garfield, McKinley, Harding, F. D. Roosevelt
• J. F. Kennedy last, assassinated in 1963
• Reagan broke the curse!
• Trying all possible models: a formula for overfitting
• Be thoughtful
• Model averaging mitigates the curse
• Admit data exploration
Sometimes theory suggests more complex models than AIC/DIC/WAIC recommend
• If theory says a predictor is important, estimate it
• If you have a theory-motivated model, you want to know what the data say about it
Good reasons to use flat priors?
• If regularizing >> flat, then why ever use flat?
• Lots of sensible answers, but my favorite:
• A flat prior lets you study the likelihood, and often that's the most important thing