Statistical Rethinking - Lecture 07

Lecture 07, Model Comparison (1), from Statistical Rethinking: A Bayesian Course with R Examples

Richard McElreath

January 27, 2015

Transcript

  1. Ockham’s Razor? William of Ockham (c.1288–c.1348) Numquam ponenda est pluralitas

    sine necessitate. (Plurality should never be posited without necessity.)
  2. Stargazing • Stargazing: Using asterisks (p < 0.05) to decide

    which variables improve prediction • Arbitrary 5% is arbitrary; doesn’t optimize anything

    Coefficients:
           Estimate     Std. Error   z value       Pr(z)
    a      1.5699e+02   9.3802e-16    1.6736e+17   < 2.2e-16 ***
    b1     1.6540e-01   6.6628e-14    2.4825e+12   < 2.2e-16 ***
    b2    -4.7063e-02   3.2586e-13   -1.4443e+11   < 2.2e-16 ***
    b3     1.9168e-03   5.6805e-11    3.3743e+07   < 2.2e-16 ***
    b4    -1.4002e-05   6.6694e-11   -2.0994e+05   < 2.2e-16 ***
    b5    -4.7965e-07   4.7818e-08   -1.0031e+01   < 2.2e-16 ***
    b6     6.6002e-09   9.5819e-10    6.8882e+00   5.651e-12 ***
    tau    1.2132e-01   5.2829e-20    2.2965e+18   < 2.2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  3. Goals this week • Understand overfitting and underfitting • Learn

    AIC/DIC/WAIC as ways to: • guard against overfitting and underfitting • explicitly compare models • Introduce regularizing priors as complementary strategy • Learn how to average predictions across models
  4. The Problem with Parameters • Underfitting: Learning too little from

    the data. Models that are too simple both fit and predict poorly. • Overfitting: Learning too much from the data. Complex models always fit better, but often predict worse. • Need to find a model that navigates between underfitting and overfitting
  5. The Problem with Parameters

    [Figure 6.2: Average brain volume (cc) against body mass (kg) for hominin species: afarensis, africanus, habilis, boisei, rudolfensis, ergaster, sapiens.] What model best describes the relationship between brain size and body size?
  6. Hominin brains • Simplest model:

    The simplest model that relates brain size to body size is the linear one. It will be the first model we consider:

    $v_i \sim \mathrm{Normal}(\mu_i, \sigma)$
    $\mu_i = \alpha + \beta_1 m_i$

    This is just to say that the average brain volume $v_i$ of species $i$ is a linear function of its body mass $m_i$. Priors are necessarily flat here, since we’re using lm. Now fit this model to the data using lm:

    m6.1 <- lm( brain ~ mass , data=d )

    Instead of pausing to plot the fit model, like we did in previous chapters, let’s focus on the question of how well this model fits the data. The conventional measure of fit in this context is $R^2$, the proportion of variance “explained” by the model. I put “explained” in scare quotes, because explanation implies understanding and that may not be the case with such models. What is really meant here is that the linear model retrodicts some proportion of the total variation in the data it was fit to.

    [Plot panels: (a) R^2 = 0.49, (c) R^2 = 0.68; brain volume (cc) against body mass (kg).]
  7. Hominin brains • Why not parabola?

    The remaining variation is just the variation of the residuals. You can easily compute $R^2$ yourself with:

    1 - var(resid(m6.1))/var(d$brain)
    [1] 0.490158

    Take a look at the bottom of the output from summary(m6.1) and you’ll find the same number, labeled there “Multiple R-squared.”

    Let’s get some other models to compare to the fit of m6.1. We’ll consider five more models, each more complex than the last. Each of these models will just be a polynomial of higher degree. For example, a second-degree polynomial that relates body size to brain size is a parabola. In math form, it is:

    $v_i \sim \mathrm{Normal}(\mu_i, \sigma)$
    $\mu_i = \alpha + \beta_1 m_i + \beta_2 m_i^2$

    This model family adds one more parameter, $\beta_2$, but uses all of the same data as m6.1. Fit this model to the data:

    m6.2 <- lm( brain ~ mass + I(mass^2) , data=d )

    Look back at page … if that I(mass^2) thing confuses you. Now let’s fit the rest of the model families. The models m6.3 through m6.6 are just third-degree, fourth-degree, fifth-degree, and sixth-degree polynomials, built and fit in the same way. Here is the code to fit all of them to the data:

    m6.3 <- lm( brain ~ mass + I(mass^2) + I(mass^3) , data=d )

    [Plot panels: (a) R^2 = 0.49, (b) R^2 = 0.54, (c) R^2 = 0.68, (d) R^2 = 0.81; brain volume (cc) against body mass (kg).]
  8. Hominin brains • Why not higher order polynomials?

    … around wildly in this interval. In Figure 6.3 (f) especially, the swing is so extreme that I had to extend the range of the vertical axis to display the depth at which the predicted mean finally turns back around. At around … kg, the model predicts a negative brain size! The model pays no price (yet) for this absurdity, because there are no cases in the data with body mass near … kg.

    Why does the sixth-degree polynomial fit perfectly? Because it has enough parameters to assign one to each point of data. The model’s equation for the mean has 7 parameters:

    $\mu_i = \alpha + \beta_1 m_i + \beta_2 m_i^2 + \beta_3 m_i^3 + \beta_4 m_i^4 + \beta_5 m_i^5 + \beta_6 m_i^6$

    and there are 7 species to predict brain sizes for. So effectively, this model assigns one parameter to just reiterate each observed brain size. This is a general phenomenon: If you adopt a model family with enough parameters, you can fit the data exactly. But such a model will make rather absurd predictions for yet-to-be-observed cases.

    Rethinking: Model fitting as compression. Another perspective on the absurd model just above is to consider that model fitting can be considered a form of data compression. Parameters summarize relationships among the data. These summaries compress the data into a simpler form, although with loss of information (“lossy” compression).
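The perfect fit of the sixth-degree polynomial can be checked with a minimal base-R sketch. The data values below are an assumption: they are the seven brain volumes and body masses as commonly reproduced from the book, and they do not appear in this transcript. `poly()` (orthogonal polynomials) is swapped in for the raw `I(mass^k)` terms used on the slides, for numerical stability; the fitted values, and so R^2, should be the same.

```r
# Assumed data: seven hominin species (values are not given on the slides).
d <- data.frame(
  species = c("afarensis", "africanus", "habilis", "boisei",
              "rudolfensis", "ergaster", "sapiens"),
  brain = c(438, 452, 612, 521, 752, 871, 1350),        # brain volume (cc)
  mass  = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)   # body mass (kg)
)

# R^2 as computed on the slides: 1 - var(residuals) / var(outcome)
r_squared <- function(fit, y) 1 - var(resid(fit)) / var(y)

# Fit polynomials of degree 1 through 6 and collect R^2 for each.
r2 <- sapply(1:6, function(k) {
  fit <- lm(brain ~ poly(mass, k), data = d)
  r_squared(fit, d$brain)
})
names(r2) <- paste0("degree", 1:6)
print(round(r2, 2))
```

R^2 can only rise as terms are added, and with seven parameters for seven points it reaches 1. That says nothing about how well the curve predicts a new species.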
  9. Figure 6.3

    [Figure 6.3: Polynomial linear models of increasing degree, fit to the hominin data. Each plot shows the predicted mean in black, with an interval of the mean shaded. R^2 is displayed above each plot. (a) First-degree polynomial, R^2 = 0.49. (b) Second degree, R^2 = 0.54. (c) Third degree, R^2 = 0.68. (d) Fourth degree, R^2 = 0.81. (e) Fifth degree, R^2 = 0.99. (f) Sixth degree, R^2 = 1.00. Axes: body mass (kg), brain volume (cc).]
  10. Figure 6.5 • Underfitting: Insensitive to exact data • Overfitting: Very sensitive to exact data

    [Figure 6.5: Underfitting and overfitting as under-sensitivity and over-sensitivity to sample. In both plots, a regression is fit to the seven sets of data made by dropping one row from the original data. (a) An underfit model is insensitive to the sample, changing little as individual points are dropped. (b) An overfit model is sensitive to the sample, changing dramatically. Axes: body mass (kg), brain volume (cc).]
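The drop-one-row exercise in Figure 6.5 can be sketched in base R as follows. This is an illustration under assumptions, not the code behind the figure: the data values are recalled from the book rather than taken from this transcript, the fifth-degree polynomial stands in for "an overfit model", and the prediction point of 50 kg is a hypothetical choice.

```r
# Assumed data (not given on the slides).
d <- data.frame(
  brain = c(438, 452, 612, 521, 752, 871, 1350),        # brain volume (cc)
  mass  = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)   # body mass (kg)
)
new_mass <- data.frame(mass = 50)   # hypothetical body mass to predict at

# Refit a polynomial of the given degree seven times, each time dropping
# one row, and record the predicted brain volume at new_mass.
loo_pred <- function(degree) {
  sapply(seq_len(nrow(d)), function(i) {
    fit <- lm(brain ~ poly(mass, degree), data = d[-i, ])
    unname(predict(fit, newdata = new_mass))
  })
}

pred_linear <- loo_pred(1)   # underfit candidate
pred_fifth  <- loo_pred(5)   # overfit candidate: 6 parameters for 6 points

spread <- c(linear = sd(pred_linear), fifth_degree = sd(pred_fifth))
print(spread)
```

The standard deviation of the seven predictions is a crude measure of sensitivity to the sample: small for the line, large for the high-degree polynomial.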
  11. Importance of being regular • Want the regular features of

    the sample • Strategies • Cross-validation • Regularizing priors (penalized likelihood) • Information criteria • Science! (iterative group learning) • Proper approach depends upon purpose
  12. The road to AIC/DIC/WAIC • What’s a good target? •

    How to measure distance from the target? • How can we estimate that distance? • How can we adjust that estimate to account for overfitting?
  13. How far from truth? • Truth: The real joint probability

    of events • Truth defines probability distribution • Model defines another • Need a way to measure distance of a model from truth • Distance needs to accommodate complexity of prediction task

    “So by rate of correct prediction alone,” the newcomer announces, “I’m the best person for the job.” The newcomer is right. Define hit rate as the average chance of a correct prediction. By that measure, the newcomer scores more hits per day than the current weatherperson. The newcomer wins.

    Costs and benefits. But it’s not hard to find another criterion, other than rate of correct prediction, that makes the newcomer look foolish. Any consideration of costs and benefits will suffice. Suppose for example that you hate getting caught in the rain, but you also hate carrying an umbrella. Define the cost of getting wet and the cost of carrying an umbrella as negative points of happiness. Suppose your chance of carrying an umbrella is equal to the forecast probability of rain. Your job is now to maximize your happiness by choosing a weatherperson.

    [Table: points of happiness per day (rows: Day, Observed, Points) following either the current weatherperson or the newcomer.]
  14. Information theory • Information: Reduction in uncertainty caused by learning

    an outcome. • How to quantify uncertainty? Should be: 1. Continuous 2. Increasing with number of possible events 3. Additive • These criteria intuitive, but effectiveness is why we keep using them • Like Bayes: intuitive, but effectiveness is reason to use
  15. Information entropy • 1948, Claude Shannon (1916–2001) derived information entropy:

    Uncertainty in a probability distribution is average (minus) log-probability of an event.

    The uncertainty over the four combinations of these events (rain/hot, rain/cold, shine/hot, shine/cold) should be the sum of the separate uncertainties.

    There is only one function that satisfies these desiderata. This function is usually known as information entropy, and has a surprisingly simple definition. If there are $n$ different possible events and each event $i$ has probability $p_i$, and we call the list of probabilities $p$, then the unique measure of uncertainty we seek is:

    $H(p) = -\mathrm{E}\,\log(p_i) = -\sum_{i=1}^{n} p_i \log(p_i)$

    In plainer words: the uncertainty contained in a probability distribution is the average log-probability of an event. “Event” might refer to a type of weather, like rain or shine, or even a particular nucleotide in a DNA sequence. While it’s not worth going into the details of the derivation of $H$, it is worth pointing out that nothing about this function is arbitrary. Every part of it derives from the three requirements.
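The entropy formula is short enough to write out directly. The sketch below is a base-R illustration; the function name `entropy` and all probability values are hypothetical choices made for the example, not taken from the slides. It also checks the additivity requirement for two independent event types.

```r
# Information entropy H(p) = -sum(p_i * log(p_i)), using the natural log.
entropy <- function(p) {
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  p <- p[p > 0]            # by convention, 0 * log(0) is treated as 0
  -sum(p * log(p))
}

# Hypothetical weather distributions (rain, shine).
H_weather  <- entropy(c(0.3, 0.7))
H_desert   <- entropy(c(0.01, 0.99))   # nearly certain, so low uncertainty
H_uniform4 <- entropy(rep(1 / 4, 4))   # more possible events, more uncertainty

# Additivity: for independent rain/shine and hot/cold events, the entropy of
# the four joint outcomes equals the sum of the separate entropies.
p_rain <- c(0.3, 0.7)
p_temp <- c(0.6, 0.4)                  # hypothetical hot/cold probabilities
p_joint <- as.vector(outer(p_rain, p_temp))
H_joint <- entropy(p_joint)
H_sum   <- entropy(p_rain) + entropy(p_temp)

print(c(weather = H_weather, desert = H_desert, uniform4 = H_uniform4))
print(c(joint = H_joint, sum_of_parts = H_sum))
```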
  16. Entropy to accuracy • Two probability distributions: p, q •

    How accurate is q, for describing p? • Distance from q to p: Divergence

    Suppose for example that the true distribution of events is $p_1 = 0.3$, $p_2 = 0.7$. If we believe instead that these events happen with probabilities $q_1$, $q_2$, how much additional uncertainty have we introduced, as a consequence of using $q = \{q_1, q_2\}$ to approximate $p = \{p_1, p_2\}$? The formal answer to this question is based upon $H$, and has a similarly simple formula:

    $D_{KL}(p, q) = \sum_i p_i \left( \log(p_i) - \log(q_i) \right)$

    In plainer language, the divergence is the average difference in log probability between the target ($p$) and model ($q$). This divergence is just the difference between two entropies: the entropy of the target distribution $p$ and the entropy arising from using $q$ to predict $p$.

    Good news, everyone! Distance from q to p is the average difference in log-probability.
  17. Entropy to accuracy

    When $p = q$, we know the actual probabilities of the events. In that case:

    $D_{KL}(p, q) = D_{KL}(p, p) = \sum_i p_i \left( \log(p_i) - \log(p_i) \right) = 0$

    There is no additional uncertainty induced when we use a probability distribution to represent itself. That’s somehow a comforting thought. But more importantly, as $q$ grows more different from $p$, the divergence $D_{KL}$ also grows. What divergence can do for us now is help us contrast different approximations.

    [Figure: Divergence of q from p (vertical axis, 0 to 2.5) against q[1] (horizontal axis, 0 to 1), with the point q = p marked.]

    # target distribution
    p <- c(0.3, 0.7)
    # K-L divergence of q from p
    DKL <- function(p, q) sum(p * (log(p) - log(q)))
    # candidate values of q[1]; q[2] is 1 - q[1]
    q1seq <- seq(from = 0.01, to = 0.99, by = 0.01)
    DKLseq <- sapply(q1seq, function(q1) DKL(p, c(q1, 1 - q1)))
    plot(q1seq, DKLseq)
  18. Estimating divergence • How to estimate DKL ? • Don’t

    know p! Don’t need it. Focus on difference between two approximating models:

    The $\log p_i$ term appears in the divergence of both $q$ and $r$. This term has no effect on the distance of $q$ and $r$ from one another. So while we don’t know where $p$ is, we can estimate how far apart $q$ and $r$ are, and which is closer to the target. It’s as if we can’t tell how far any particular archer is from the target, but we can tell which archer is closer and by how much.

    $D_{KL}(p, q) - D_{KL}(p, r) = \left( -\sum_i p_i (\log q_i - \log p_i) \right) - \left( -\sum_i p_i (\log r_i - \log p_i) \right)$
    $= -\sum_i p_i (\log q_i - \log r_i)$
    $= -\left( \mathrm{E}\,\log q_i - \mathrm{E}\,\log r_i \right)$

    All of this also means that all we need to know is a model’s average log probability: $\mathrm{E}\,\log(q_i)$ for $q$ and $\mathrm{E}\,\log(r_i)$ for $r$. These expressions look a lot like log probabilities of outcomes, like the log-likelihoods you’ve been using already.
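A small base-R sketch can confirm that the difference between two divergences needs only each model's average log-probability, and that the average can be estimated from observed events without knowing p. The target `p` matches the slide's example; the two models `q` and `r`, the seed, and the sample size are hypothetical choices for illustration.

```r
set.seed(42)

DKL <- function(p, q) sum(p * (log(p) - log(q)))

p <- c(0.3, 0.7)     # target distribution (unknown in practice)
q <- c(0.25, 0.75)   # hypothetical model q
r <- c(0.5, 0.5)     # hypothetical model r

# Exact difference in divergence, computed with knowledge of p.
diff_exact <- DKL(p, q) - DKL(p, r)

# The same difference from average log-probabilities only.
diff_avg <- -(sum(p * log(q)) - sum(p * log(r)))

# Estimate from data: "nature" draws events from p, and we average each
# model's log-probability of the observed events.
events <- sample(1:2, size = 1e5, replace = TRUE, prob = p)
diff_est <- -(mean(log(q[events])) - mean(log(r[events])))

print(c(exact = diff_exact, from_averages = diff_avg, estimated = diff_est))
```

A negative difference means q is closer to the target than r. The absolute distance of either model from p stays unknown.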
  19. Deviance • Deviance, estimate of relative information divergence:

    Indeed, just summing the log-likelihoods of each case provides an approximation of $\mathrm{E}\,\log(q_i)$. We don’t have to know the $p$ inside the expectation, because nature takes care of presenting the events for us.

    So we can compare the average log probability from each model to get an estimate of the relative distance of each model from the target. This also means that the absolute magnitude of these values will not be interpretable: neither $\mathrm{E}\,\log(q_i)$ nor $\mathrm{E}\,\log(r_i)$ by itself suggests a good or bad model. Only the difference $\mathrm{E}\,\log(q_i) - \mathrm{E}\,\log(r_i)$ informs us about the divergence of each model from the target $p$.

    All of this delivers us to a very common measure of model fit, one that also turns out to be an approximation of K-L divergence: the deviance, which is defined as:

    $D(q) = -2 \sum_i \log(q_i)$

    where $i$ indexes each observation (case) and each $q_i$ is just the likelihood of case $i$. The −2 in front doesn’t do anything important. It’s there for historical reasons.

    Deviance is the sum log-probability of the data, with a minus-two because reasons. Strange, since not average, but rather sum log-prob.

    lppd = log of product of average likelihoods = sum of logs of average likelihoods

    $\mathrm{lppd} = \sum_{i=1}^{N} \log \int \Pr(y_i \mid \theta)\, \Pr(\theta)\, d\theta$
  20. Deviance • Compute it: • Compute log probability of each

    observation • Sum all of these log probabilities • Multiply by –2 • Typical to use MAP estimates for probabilities, but can use entire posterior • Will do so later, when we compute WAIC as estimate of deviance

    You can compute the deviance for any model you’ve fit already in this book, just by using the MAP estimates to compute a log-probability of the observed data for each row. These probabilities are the $q$ values. Then you add these log-probabilities together and multiply by −2. Here’s a quick example, using the hominin brain data again:

    m6.8 <- map(
        alist(
            brain ~ dnorm( mu , sigma ) ,
            mu ~ a + b*mass
        ) ,
        data=d ,
        start=list(a=mean(d$brain),b=0,sigma=sd(d$brain)) ,
        ...
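The three steps (log-probability of each observation, sum, multiply by –2) can be sketched in base R. This is an illustration under assumptions rather than the slide's code: `lm()` with its implied flat priors stands in for the `map()` fit, the maximum-likelihood sigma stands in for the MAP estimate, and the data values are recalled from the book rather than taken from this transcript.

```r
# Assumed data (not given on the slides).
d <- data.frame(
  brain = c(438, 452, 612, 521, 752, 871, 1350),        # brain volume (cc)
  mass  = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)   # body mass (kg)
)

fit <- lm(brain ~ mass, data = d)

# Point estimates: fitted mean for each row and the ML estimate of sigma.
mu_hat    <- fitted(fit)
sigma_hat <- sqrt(mean(resid(fit)^2))

# Step 1: log-probability of each observation (the log q_i values).
log_q <- dnorm(d$brain, mean = mu_hat, sd = sigma_hat, log = TRUE)

# Steps 2 and 3: sum the log-probabilities and multiply by -2.
dev_by_hand <- -2 * sum(log_q)

# Cross-check against R's built-in log-likelihood for the same fit.
dev_loglik <- -2 * as.numeric(logLik(fit))

print(c(by_hand = dev_by_hand, from_logLik = dev_loglik))
```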
  21. The road to AIC/DIC/WAIC ✓ What’s a good prediction? ✓

    How far is the model from the target? ✓ How can we estimate that distance? • How can we adjust that estimate to account for overfitting?
  22. Deviance overfits • A meta-model of forecasting: • Two samples:

    training and testing, size N • Fit model to training sample, get Dtrain • Use fit to training to compute Dtest • Difference Dtest – Dtrain is overfitting
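The train/test thought experiment can be sketched as a small base-R simulation. Everything specific here is an assumption made for illustration: the helper names `deviance_gauss` and `sim_once`, the true slopes, the known sigma of 1, the sample size, and the number of replicates. Models are fit by least squares, which corresponds to flat priors.

```r
set.seed(1)

# Gaussian deviance with known sigma: -2 * sum of log-probabilities.
deviance_gauss <- function(y, mu, sigma = 1) {
  -2 * sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# One replicate: draw a training and a testing sample of size N, fit models
# with 1 to 5 mean parameters to the training sample, and score each fit on
# both samples.
sim_once <- function(N, beta = c(0.15, -0.4)) {   # beta values are assumed
  make_sample <- function() {
    X <- matrix(rnorm(N * 4), nrow = N, ncol = 4)
    y <- rnorm(N, mean = as.vector(X[, 1:2] %*% beta), sd = 1)
    list(X = cbind(1, X), y = y)                  # first column: intercept
  }
  train <- make_sample()
  test  <- make_sample()
  sapply(1:5, function(k) {
    X_tr <- train$X[, seq_len(k), drop = FALSE]
    X_te <- test$X[, seq_len(k), drop = FALSE]
    b_hat <- lm.fit(X_tr, train$y)$coefficients
    c(train = deviance_gauss(train$y, as.vector(X_tr %*% b_hat)),
      test  = deviance_gauss(test$y,  as.vector(X_te %*% b_hat)))
  })
}

n_sims <- 2000
sims <- replicate(n_sims, sim_once(N = 20))   # array: 2 x 5 x n_sims
avg <- apply(sims, c(1, 2), mean)             # average D_train and D_test
colnames(avg) <- paste0("k", 1:5)
print(round(avg, 1))
print(round(avg["test", ] - avg["train", ], 1))   # average overfitting
```

Training deviance can only fall as parameters are added, while the gap D_test – D_train tends to widen with each added parameter.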
  23. Deviance overfits

    … measured in and out of sample, using a simple prediction scenario. To visualize the results of the thought experiment, what we’ll do now is conduct the thought experiment … thousand times, for each of five different linear regression models. The model that generates the data is:

    $y_i \sim \mathrm{Normal}(\mu_i, \sigma)$, with $\sigma$ fixed
    $\mu_i = \beta_1 x_{1,i} + \beta_2 x_{2,i}$, with $\beta_1 > 0$ and $\beta_2 < 0$

    This corresponds to a Gaussian outcome $y$ for which the intercept is $\alpha = 0$ and the slopes for each of two predictors are $\beta_1$ and $\beta_2$. The models for analyzing the data are linear regressions with between 1 and 5 free parameters. The first model, with 1 free parameter to estimate, is just a linear regression with an unknown mean. Each parameter added to the model adds a predictor variable and its beta-coefficient. Since the “true” model has non-zero coefficients for only the first two predictors, we can say that the true model has 3 parameters. By fitting all five models, with between 1 and 5 parameters, to training samples from the same processes, we can get an impression for how deviance behaves.

    Figure 6.7 shows the results of … thousand simulations for each model type, at two different sample sizes. The function that conducts the simulations is sim.train…

    Data generating model: as above. Models fit to data (flat priors):

    $\mu_i = \alpha$
    $\mu_i = \alpha + \beta_1 x_{1,i}$
    $\mu_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}$
    $\mu_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i}$
    $\mu_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i}$
  24. Deviance overfits

    [Figure 6.7: Deviance in and out of sample. In each plot, models with different numbers of predictor variables are shown on the horizontal axis (number of parameters, 1 to 5). Deviance across … thousand simulations is shown on the vertical. Panels: N = 20 and N = 100. Legend: in, out, +1SD, –1SD. Annotation: data generating model.]
  25. Deviance overfits

    [Figure 6.7, repeated.]
  26. Regularization • Use informative, conservative priors to reduce overfitting =>

    model learns less from sample • But if too informative, model learns too little • Such priors are regularizing

    [Figure: Regularizing priors, density against parameter value (−3 to 3). Curves: Normal(0, 1); thin solid: Normal(0, 0.5); thick solid: Normal(0, 0.2).]

    So the problem is really one of tuning. But as you’ll see, even mild skepticism can help a model do better, and doing better is all we can really hope for in the large world, where no model nor prior is optimal.

    Consider this Gaussian model:

    $y_i \sim \mathrm{Normal}(\mu_i, \sigma)$
    $\mu_i = \alpha + \beta x_i$
    $\alpha \sim \mathrm{Normal}(\ldots, \ldots)$
    $\beta \sim \mathrm{Normal}(\ldots, \ldots)$
    $\sigma \sim \mathrm{Uniform}(\ldots, \ldots)$

    Assume, as is good practice, that the predictor $x$ is standardized so that its standard deviation is 1 and its mean is zero. Then the prior on $\alpha$ is a nearly-flat prior that has no practical effect.
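How a narrower prior makes the model "learn less from the sample" can be sketched with a closed-form MAP estimate in base R. This is a simplified illustration under assumptions, not the lecture's method: sigma is treated as known and equal to 1, the intercept gets a flat prior, each slope gets a Normal(0, prior_sd) prior, and the simulated data, true slopes, and the helper name `map_coefs` are hypothetical. Under those assumptions the posterior mode solves a penalized least-squares problem.

```r
set.seed(2)

# Simulated data: 4 standardized predictors, only the first two matter.
N <- 20
X <- scale(matrix(rnorm(N * 4), N, 4))
y <- as.vector(X[, 1:2] %*% c(0.15, -0.4)) + rnorm(N)   # assumed true slopes

# MAP estimate with Normal(0, prior_sd) priors on the slopes, a flat prior
# on the intercept, and known sigma. The prior enters as a penalty of
# sigma^2 / prior_sd^2 on each slope.
map_coefs <- function(X, y, prior_sd, sigma = 1) {
  D <- cbind(1, X)
  lambda <- sigma^2 / prior_sd^2
  penalty <- diag(c(0, rep(lambda, ncol(X))))   # no penalty on intercept
  as.vector(solve(crossprod(D) + penalty, crossprod(D, y)))
}

# Flat prior (infinite sd) and the three priors shown on the slide.
prior_sds <- c(flat = Inf, sd_1 = 1, sd_0.5 = 0.5, sd_0.2 = 0.2)

# Overall size of the estimated slopes under each prior.
slope_size <- sapply(prior_sds, function(s) {
  sqrt(sum(map_coefs(X, y, s)[-1]^2))
})
print(round(slope_size, 3))
```

The tighter the prior, the more the slopes are pulled toward zero. Whether that helps out-of-sample prediction depends on the sample size and on how large the true effects are.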
  27. Regularization

    [Figure 6.9: Regularizing priors and out-of-sample deviance. Deviance against number of parameters (1 to 5), for N = 20 and N = 100, under priors N(0,1), N(0,0.5), and N(0,0.2); in sample and out of sample.]
  28. Regularization

    [Figure 6.9, repeated; in sample and out of sample.]