AIC/DIC/WAIC as ways to:
• guard against overfitting and underfitting
• explicitly compare models
• introduce regularizing priors as a complementary strategy
• learn how to average predictions across models
• Underfitting: learning too little from the data. Models that are too simple both fit and predict poorly.
• Overfitting: learning too much from the data. Complex models always fit better, but often predict worse.
• Need to find a model that navigates between underfitting and overfitting; see the sketch after this list.
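A minimal Python sketch of this tension (not from the lecture; the data-generating process and degrees are made up): as polynomial degree grows, fit to the training sample always improves, while prediction on a fresh sample from the same process typically gets worse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a noisy linear trend, so the true model has 2 parameters.
n = 20
x = np.linspace(0, 1, n)
y_train = 0.5 + 1.2 * x + rng.normal(0, 0.3, n)
y_test = 0.5 + 1.2 * x + rng.normal(0, 0.3, n)  # fresh sample, same process

for degree in range(1, 6):
    coefs = np.polyfit(x, y_train, degree)      # fit to the training sample
    pred = np.polyval(coefs, x)
    mse_train = np.mean((y_train - pred) ** 2)  # never increases with degree
    mse_test = np.mean((y_test - pred) ** 2)    # typically grows again
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```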
• Information: the reduction in uncertainty derived from learning an outcome.
• How to quantify uncertainty? A measure should be: 1. Continuous 2. Increasing with the number of possible events 3. Additive
• These criteria are intuitive, but their effectiveness is why we keep using them.
• Like Bayes: intuitive, but effectiveness is the reason to use it.
Claude Shannon (1916–2001): uncertainty in a probability distribution is the average (minus) log-probability of an event.

The uncertainty over the four combinations of these events (rain/hot, rain/cold, shine/hot, shine/cold) should be the sum of the separate uncertainties. There is only one function that satisfies these desiderata. This function is usually known as information entropy, and it has a surprisingly simple definition. If there are n different possible events, each event i has probability p_i, and we call the list of probabilities p, then the unique measure of uncertainty we seek is

$$H(p) = -\operatorname{E}\log(p_i) = -\sum_{i=1}^{n} p_i \log(p_i).$$

In plain words: the uncertainty contained in a probability distribution is the average log-probability of an event. An event might refer to a type of weather, like rain or shine, or a particular word, or even a particular nucleotide in a DNA sequence. While it's not worth going into the details of the derivation of H, it is worth pointing out that nothing about this function is arbitrary: every part of it derives from the three desiderata above.
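A small Python sketch (illustrative; the weather probabilities are hypothetical) of computing H(p):

```python
import numpy as np

def entropy(p):
    """Information entropy H(p) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

# Hypothetical weather distribution: rain with p = 0.3, shine with p = 0.7.
print(entropy([0.3, 0.7]))                  # ~0.61 nats
# More possible events -> more uncertainty, as the desiderata require:
print(entropy([0.25, 0.25, 0.25, 0.25]))    # log(4) ~ 1.39 nats
```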
How accurate is q for describing p? • Distance from q to p: divergence.

Suppose, for example, that the true distribution of events is p = {p_1, p_2}, but we believe instead that these events happen with probabilities q = {q_1, q_2}. How much additional uncertainty have we introduced as a consequence of using q to approximate p? The formal answer to this question builds upon H and has a similarly simple formula:

$$D_{\mathrm{KL}}(p, q) = \sum_i p_i \bigl(\log(p_i) - \log(q_i)\bigr).$$

In plain language, the divergence is the average difference in log probability between the target p and the model q. It is just the difference between two entropies: the entropy of the target distribution p and the cross entropy arising from using q to predict p. When p = q, we know the actual probabilities of the events, and the divergence is zero.

Good news, everyone! Distance from q to p is the average difference in log-probability.
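A small Python sketch (illustrative; the target distribution is made up) showing that the divergence is zero when q = p and grows as q drifts away:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * (log(p_i) - log(q_i))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.3, 0.7])              # hypothetical target distribution
for q1 in (0.3, 0.25, 0.1, 0.01):     # candidate models drifting from p
    q = np.array([q1, 1 - q1])
    print(f"q = {q}: D_KL = {kl_divergence(p, q):.3f}")
# q = p gives 0; the divergence grows as q moves away from the target.
```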
• Simulate two samples from the same process, training and testing, each of size N.
• Fit the model to the training sample, get D_train.
• Use the fit to the training sample to compute D_test on the test sample.
• The difference D_test − D_train measures overfitting; a sketch of this procedure follows.
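A minimal Python sketch of this procedure (my own construction, not the lecture's simulation), using a Gaussian linear model where deviance is −2 times the log-likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N = 20

def deviance(y, mu, sigma):
    """Deviance = -2 * total log-likelihood under a Gaussian model."""
    return -2 * np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

# Two samples of size N from the same hypothetical process.
x = rng.normal(size=N)
x_test = rng.normal(size=N)
y_train = 0.15 * x + rng.normal(0, 1, N)
y_test = 0.15 * x_test + rng.normal(0, 1, N)

# Fit intercept + slope by least squares (the MLE for a Gaussian model).
X = np.column_stack([np.ones(N), x])
beta = np.linalg.lstsq(X, y_train, rcond=None)[0]
mu_train = X @ beta
sigma = np.std(y_train - mu_train)        # MLE of sigma

mu_test = np.column_stack([np.ones(N), x_test]) @ beta
d_train = deviance(y_train, mu_train, sigma)
d_test = deviance(y_test, mu_test, sigma)
print(f"D_train = {d_train:.1f}, D_test = {d_test:.1f}, "
      f"overfitting = {d_test - d_train:.1f}")   # usually positive
```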
[Figure: Deviance in and out of sample. In each plot, models with different numbers of predictor variables (1–5) are shown on the horizontal axis, and deviance across thousands of simulations on the vertical, for N = 20 (left) and N = 100 (right); in-sample and out-of-sample deviance are shown with ±1 SD. Slide annotations: "Data generating model", "Deviance overfits".]
[Figure: Regularizing priors and out-of-sample deviance. Models with 1–5 parameters on the horizontal axis and deviance on the vertical, for N = 20 (left) and N = 100 (right); in-sample and out-of-sample deviance shown under priors N(0, 1), N(0, 0.5), and N(0, 0.2).]
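A rough Python sketch of the idea behind the figure (my construction, not its actual simulation): a Gaussian prior N(0, s) on each slope makes the MAP estimate a ridge-style fit with penalty 1/s², and tighter priors can improve out-of-sample deviance in small samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, K = 20, 5                     # small sample, 5 candidate predictors

# Hypothetical process: only the first predictor actually matters.
X = rng.normal(size=(N, K))
X_test = rng.normal(size=(N, K))
y = 0.5 * X[:, 0] + rng.normal(0, 1, N)
y_test = 0.5 * X_test[:, 0] + rng.normal(0, 1, N)

def deviance(yv, mu, sigma):
    return -2 * np.sum(stats.norm.logpdf(yv, loc=mu, scale=sigma))

for s in (1.0, 0.5, 0.2):        # prior sd on the slopes: N(0, s)
    lam = 1.0 / s**2             # MAP under N(0, s) prior = ridge penalty
    beta = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
    sigma = np.std(y - X @ beta)                 # intercept omitted for brevity
    d_out = deviance(y_test, X_test @ beta, sigma)
    print(f"prior N(0,{s}): out-of-sample deviance {d_out:.1f}")
# Tighter priors shrink the spurious slopes and often predict better here.
```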