Slide 1

Slide 1 text

Statistical Rethinking 07: Fitting Over and Under 2022

Slide 2

Slide 2 text

Mikołaj Kopernik (1473–1543)

Slide 3

Slide 3 text

Copernican Model

Slide 4

Slide 4 text

Problems of Prediction What function describes these points? (fitting, compression) What function explains these points? (causal inference) What would happen if we changed a point's mass? (intervention) What is the next observation from the same process? (prediction) [Scatterplot: brain volume (cc) vs mass (kg)]

Slide 5

Slide 5 text

Leave-one-out cross-validation (1) Drop one point (2) Fit line to remaining (3) Predict dropped point (4) Repeat (1) with next point (5) Score is error on dropped point
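A minimal base-R sketch of these five steps, assuming a hypothetical data frame d with columns mass and brain standing in for the plotted points:

loo_sq_error <- sapply( 1:nrow(d) , function(i) {
    fit <- lm( brain ~ mass , data = d[ -i , ] )   # (1)-(2) drop point i, fit to the rest
    pred <- predict( fit , newdata = d[ i , ] )    # (3) predict the dropped point
    ( d$brain[i] - pred )^2                        # (5) squared error on the dropped point
} )
mean( loo_sq_error )                               # cross-validation score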

Slide 6

Slide 6 text

Leave-one-out cross-validation (1) Drop one point (2) Fit line to remaining (3) Predict dropped point (4) Repeat (1) with next point (5) Score is error on dropped point

Slide 7

Slide 7 text

Leave-one-out cross-validation (1) Drop one point (2) Fit line to remaining (3) Predict dropped point (4) Repeat (1) with next point (5) Score is error on dropped point

Slide 8

Slide 8 text

Leave-one-out cross-validation (1) Drop one point (2) Fit line to remaining (3) Predict dropped point (4) Repeat (1) with next point (5) Score is error on dropped point

Slide 9

Slide 9 text

Leave-one-out cross-validation (1) Drop one point (2) Fit line to remaining (3) Predict dropped point (4) Repeat (1) with next point (5) Score is error on dropped point In: 318 Out: 619

Slide 10

Slide 10 text

Bayesian Cross-Validation We use the entire posterior, not just a point prediction. Cross-validation score is the out-of-sample log pointwise predictive density:

lppd_CV = ∑_{i=1}^{N} (1/S) ∑_{s=1}^{S} log Pr(y_i | θ_{−i,s})

where s indexes samples from a Markov chain and θ_{−i,s} is the s-th sample from the posterior distribution computed from observations omitting y_i. Importance sampling replaces the computation of N posterior distributions by using an estimate. A central-limit argument also gives a standard error for the score: s_PSIS = √( N var(psis_i) ), where N is the number of observations and psis_i is the PSIS estimate for observation i. [Plot: density of log posterior probability of each observation]

Slide 11

Slide 11 text

Bayesian Cross-Validation

lppd_CV = ∑_{i=1}^{N} (1/S) ∑_{s=1}^{S} log Pr(y_i | θ_{−i,s})

log pointwise predictive density (pages 210 and 218): N data points; S samples from posterior; log probability of each point i, computed with posterior that omits point i; average log probability for point i. We draw samples from the full posterior distribution p(θ|y), but we want samples from the reduced leave-one-out posterior distribution p(θ|y_{−i}), so we reweight each sample s by the inverse of the probability of the omitted observation.
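As a sketch: given a hypothetical S-by-N matrix logprob in which logprob[s, i] holds log Pr(y_i | θ_{−i,s}), the score above reduces to two lines of R:

# logprob[s, i] = log-likelihood of point i under the s-th sample from the
# posterior fit WITHOUT point i. Toy stand-in values so the lines below run:
logprob <- matrix( rnorm( 1000 * 20 , mean = -1 ) , nrow = 1000 )
lppd_cv <- sum( colMeans( logprob ) )  # average log prob per point, summed over points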

Slide 12

Slide 12 text

[1] In: 318, Out: 619. [2] In: 289, Out: 865.

Slide 13

Slide 13 text

[1] In: 318, Out: 619. [2] In: 289, Out: 865. [3] In: 201, Out: 12,538.

Slide 14

Slide 14 text

[1] In: 318, Out: 619. [2] In: 289, Out: 865. [3] In: 201, Out: 12,538. [4] In: 120, Out: 25,530. [5] In: 7, Out: 293,840.

Slide 15

Slide 15 text

Cross-validation For simple models, increasing parameters improves fit to sample But may reduce accuracy of predictions out of sample Most accurate model trades off flexibility with overfitting [Plot: relative error vs polynomial terms, error in sample vs error out of sample]

Slide 16

Slide 16 text

1st degree polynomial 6th degree polynomial

Slide 17

Slide 17 text

2nd degree polynomial [Plot: prediction error vs polynomial terms, error in sample vs error out of sample]

Slide 18

Slide 18 text

2nd degree polynomial [Plot: prediction error vs polynomial terms, error in sample vs error out of sample]

Slide 19

Slide 19 text

Regularization Overfitting depends upon the priors Skeptical priors have tighter variance, reduce flexibility Regularization: Function finds regular features of process Good priors are often tighter than you think!

Slide 20

Slide 20 text

β_j ∼ Normal(0, 10), μ_i = α + ∑_{j=1}^{m} β_j x_i^j [Plot: prediction error vs polynomial terms, in-sample and out-of-sample]
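A sketch of this polynomial model in the rethinking package's quap style; the data frame d and its standardized columns y, x1, x2, x3 are hypothetical, and the prior width on the β terms is the knob to tighten:

library(rethinking)
m_poly <- quap(
    alist(
        y ~ dnorm( mu , sigma ),
        mu <- a + b1*x1 + b2*x2 + b3*x3,   # 3rd-degree polynomial mean
        a ~ dnorm( 0 , 1 ),
        b1 ~ dnorm( 0 , 10 ),              # wide prior; try 1, 0.5, 0.1
        b2 ~ dnorm( 0 , 10 ),
        b3 ~ dnorm( 0 , 10 ),
        sigma ~ dexp( 1 )
    ) , data = d )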

Slide 21

Slide 21 text

[Plot: prediction error vs polynomial terms, in-sample and out-of-sample, under priors β_j ∼ Normal(0, 10) and β_j ∼ Normal(0, 1)]

Slide 22

Slide 22 text

[Plot: prediction error vs polynomial terms, in-sample and out-of-sample, under priors β_j ∼ Normal(0, 10), β_j ∼ Normal(0, 1), and β_j ∼ Normal(0, 0.5)]

Slide 23

Slide 23 text

[Plot: prediction error vs polynomial terms, in-sample and out-of-sample, under prior β_j ∼ Normal(0, 0.1)]

Slide 24

Slide 24 text

Regularizing priors How to choose width of prior? For causal inference, use science For pure prediction, can tune the prior using cross-validation Many tasks are a mix of inference and prediction No need to be perfect, just better

Slide 25

Slide 25 text

PAUSE

Slide 26

Slide 26 text

Prediction penalty [Plots: prediction error vs polynomial terms, in-sample and out-of-sample; out-of-sample penalty vs polynomial terms]

Slide 27

Slide 27 text

Penalty prediction For N points, cross-validation requires fitting N models What if you could compute the penalty from a single model fit? Good news! You can: Importance sampling (PSIS) Information criteria (WAIC) [Plot: out-of-sample penalty vs polynomial terms]
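In the rethinking package, both estimates come from one fitted model; a usage sketch, where m_fit is a hypothetical quap or ulam fit:

library(rethinking)
WAIC( m_fit )   # information-criterion estimate of the out-of-sample score
PSIS( m_fit )   # Pareto-smoothed importance-sampling estimate, with diagnostics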

Slide 28

Slide 28 text

Importance Sampling Importance sampling: Use a single posterior distribution for N points to sample from each posterior for N–1 points Key idea: Point with low probability has a strong influence on posterior distribution Can use pointwise probabilities to reweight samples from posterior
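A toy sketch of the reweighting idea; all data, samples, and names below are simulated stand-ins, not the lecture's example:

set.seed(7)
S <- 1000 ; N <- 20
y <- rnorm( N )                                   # toy observations
post_mu <- rnorm( S , mean(y) , 1/sqrt(N) )       # toy samples from full posterior of mu
loglik <- sapply( 1:N , function(i) dnorm( y[i] , post_mu , 1 , log=TRUE ) )
i <- 1                                            # point to leave out
w <- exp( -loglik[ , i ] )                        # importance weight 1/Pr(y_i|theta_s)
w <- w / sum(w)                                   # normalize weights
sum( w * post_mu )                                # approx posterior mean of mu without y_i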

Slide 29

Slide 29 text

[Plot: observations and posterior]

Slide 30

Slide 30 text


Slide 31

Slide 31 text

Smooth Importance Sampling Prof Aki Vehtari (Helsinki), smooth estimator Importance sampling tends to be unreliable, has high variance Pareto-smoothed importance sampling (PSIS) more stable (lower variance) Useful diagnostics Identifies important (high leverage) points (“outliers”)

Slide 32

Slide 32 text

Akaike information criterion Estimates an information-theoretic measure of predictive accuracy (K-L distance). For flat priors and large samples: AIC = (−2) × lppd + 2k, where lppd is the log pointwise predictive density and k is the number of parameters. Hirotugu Akaike (1927–2009) [ah–ka–ee–kay]
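A quick numeric check of this formula in base R, using the built-in cars dataset as a stand-in for a maximum-likelihood fit:

fit <- lm( dist ~ speed , data = cars )   # maximum likelihood fit (flat priors)
k <- length( coef(fit) ) + 1              # parameters, counting sigma
-2 * as.numeric( logLik(fit) ) + 2 * k    # same value as AIC(fit)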

Slide 33

Slide 33 text

Widely Applicable IC AIC of historical interest now Widely Applicable Information Criterion (WAIC), Sumio Watanabe (2010) WAIC is just the log pointwise predictive density (lppd) plus a penalty proportional to the variance in the posterior predictions:

WAIC(y, Θ) = −2( lppd − ∑_i var_θ log p(y_i | θ) )

The penalty term means: compute the variance in log-probabilities for each observation i, then sum these variances to get the total penalty. Very similar to PSIS score, but no automatic diagnostics
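As a sketch, WAIC can be computed directly from an S-by-N matrix of pointwise log-likelihoods; the toy data, posterior samples, and matrix below are hypothetical stand-ins:

set.seed(7)
S <- 1000 ; N <- 20
y <- rnorm( N )
post_mu <- rnorm( S , mean(y) , 1/sqrt(N) )
loglik <- sapply( 1:N , function(i) dnorm( y[i] , post_mu , 1 , log=TRUE ) )
log_mean_exp <- function(x) { m <- max(x) ; m + log( mean( exp(x - m) ) ) }
lppd <- sum( apply( loglik , 2 , log_mean_exp ) )   # log pointwise predictive density
penalty <- sum( apply( loglik , 2 , var ) )         # summed variance of log-prob per point
-2 * ( lppd - penalty )                             # WAIC, on the deviance scale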

Slide 34

Slide 34 text

[Plots: prediction error vs polynomial terms; log pointwise predictive density vs polynomial terms, comparing WAIC, PSIS, lppd, and leave-one-out cross-validation]

Slide 35

Slide 35 text

Overfit / Underfit WAIC, PSIS, and CV measure overfitting Regularization manages overfitting None directly addresses causal inference All are important to understanding how statistical inference works

Slide 36

Slide 36 text

Model Mis-selection Do not use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate Predictive criteria actually prefer confounds & colliders Example: Plant growth experiment [DAG: H0 → H1, T → F → H1]

Slide 37

Slide 37 text

Wrong adjustment set for total causal effect of treatment (blocks mediating path): H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i + β_F F_i
Correct adjustment set for total causal effect of treatment: H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i
[DAG: H0 → H1, T → F → H1]
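Sketches of these two models in quap style, following the book's chapter 6 plant-growth code; the data frame d and column names (h0, h1, treatment, fungus) are assumed from that simulation:

library(rethinking)
m6.7 <- quap(                      # wrong: conditions on the mediator fungus
    alist(
        h1 ~ dnorm( mu , sigma ),
        mu <- h0 * p,
        p <- a + bt*treatment + bf*fungus,
        a ~ dlnorm( 0 , 0.2 ),
        bt ~ dnorm( 0 , 0.5 ),
        bf ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = d )
m6.8 <- quap(                      # correct for the total effect of treatment
    alist(
        h1 ~ dnorm( mu , sigma ),
        mu <- h0 * p,
        p <- a + bt*treatment,
        a ~ dlnorm( 0 , 0.2 ),
        bt ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = d )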

Slide 38

Slide 38 text

With fungus: H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i + β_F F_i
Treatment only: H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i
[Plot: posterior density of the effect of treatment, correct vs biased model]

Slide 39

Slide 39 text

With fungus (m6.7): H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i + β_F F_i
Treatment only (m6.8): H1_i ∼ Normal(μ_i, σ), μ_i = H0_i × p_i, p_i = α + β_T T_i
[Plot: PSIS deviance for m6.7 (H1 ~ H0 + T + F) and m6.8 (H1 ~ H0 + T)]
Wrong model wins at prediction

Slide 40

Slide 40 text

[Plot: PSIS deviance for m6.8 (H1 ~ H0 + T) and m6.7 (H1 ~ H0 + T + F), marking the score in sample, score out of sample, standard error of the score, and the PSIS contrast with its standard error]
Wrong model wins at prediction
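In rethinking, this comparison table comes from a single call; a sketch assuming the m6.7 and m6.8 fits above:

compare( m6.7 , m6.8 , func = PSIS )   # PSIS scores, standard errors, and the contrast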

Slide 41

Slide 41 text

Why does the wrong model win at prediction? [Plot: growth by fungus status (no fungus vs fungus) and by treatment (treatment vs control); DAG: H0 → H1, T → F → H1]

Slide 42

Slide 42 text

Why does the wrong model win at prediction? Fungus is in fact a better predictor than treatment [Plots: growth by fungus status (no fungus vs fungus) and by treatment (control vs treatment); DAG: H0 → H1, T → F → H1]

Slide 43

Slide 43 text

Model Mis-selection Do not use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate However, many analyses are mixes of inferential and predictive chores Still need help finding good functional descriptions while avoiding overfitting [DAG: H0 → H1, T → F → H1]

Slide 44

Slide 44 text

Outliers & Robust Regression Some points are more influential than others "Outliers": Observations in the tails of the predictive distribution Outliers indicate predictions are possibly overconfident, unreliable The model doesn't expect enough variation

Slide 45

Slide 45 text

Outliers & Robust Regression Dropping outliers is bad: Just ignores the problem; predictions are still bad! It's the model that's wrong, not the data First, quantify influence of each point Second, use a mixture model (robust regression)

Slide 46

Slide 46 text

Outliers & Robust Regression Divorce rate example Maine and Idaho both highly unusual Maine: high divorce for trend Idaho: low divorce for trend [Scatterplot: Divorce rate (std) vs Age at marriage (std), highlighting Idaho and Maine]

Slide 47

Slide 47 text

Outliers & Robust Regression Quantify influence: PSIS k statistic WAIC penalty term ("effective number of parameters") [Plot: WAIC penalty vs PSIS Pareto k, highlighting Idaho and Maine]
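These two diagnostics come from the pointwise output of single fits; a sketch following the book's code, using the divorce model m5.3 defined on slide 51:

PSIS_m5.3 <- PSIS( m5.3 , pointwise = TRUE )   # per-point Pareto k values
WAIC_m5.3 <- WAIC( m5.3 , pointwise = TRUE )   # per-point penalty terms
plot( PSIS_m5.3$k , WAIC_m5.3$penalty ,
    xlab = "PSIS Pareto k" , ylab = "WAIC penalty" )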

Slide 48

Slide 48 text

Outliers & Robust Regression [Plots: WAIC penalty vs PSIS Pareto k; Divorce rate (std) vs Age at marriage (std); Idaho and Maine highlighted in both]

Slide 49

Slide 49 text

Mixing Gaussians [Plot: density vs value]

Slide 50

Slide 50 text

Mixing Gaussians [Plot: density vs value, Student-t vs Gaussian]
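A sketch of the curves in this plot; dstudent here is the rethinking package's location-scale Student-t density:

library(rethinking)
curve( dnorm( x , 0 , 1 ) , from = -6 , to = 6 , ylab = "density" )  # Gaussian
curve( dstudent( x , 2 , 0 , 1 ) , add = TRUE , lty = 2 )            # Student-t, nu = 2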

Slide 51

Slide 51 text

m5.3 <- quap(                      # Gaussian outcome model
    alist(
        D ~ dnorm( mu , sigma ),
        mu <- a + bM*M + bA*A,
        a ~ dnorm( 0 , 0.2 ),
        bM ~ dnorm( 0 , 0.5 ),
        bA ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = dat )

m5.3t <- quap(                     # robust version: Student-t outcome, nu = 2
    alist(
        D ~ dstudent( 2 , mu , sigma ),
        mu <- a + bM*M + bA*A,
        a ~ dnorm( 0 , 0.2 ),
        bM ~ dnorm( 0 , 0.5 ),
        bA ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = dat )

Slide 52

Slide 52 text

m5.3 <- quap(                      # Gaussian outcome model
    alist(
        D ~ dnorm( mu , sigma ),
        mu <- a + bM*M + bA*A,
        a ~ dnorm( 0 , 0.2 ),
        bM ~ dnorm( 0 , 0.5 ),
        bA ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = dat )

m5.3t <- quap(                     # robust version: Student-t outcome, nu = 2
    alist(
        D ~ dstudent( 2 , mu , sigma ),
        mu <- a + bM*M + bA*A,
        a ~ dnorm( 0 , 0.2 ),
        bM ~ dnorm( 0 , 0.5 ),
        bA ~ dnorm( 0 , 0.5 ),
        sigma ~ dexp( 1 )
    ) , data = dat )

[Plot: posterior density of bA (effect of age of marriage), Student-t model vs Gaussian model]

Slide 53

Slide 53 text

Robust Regressions Unobserved heterogeneity => mixture of Gaussians Thick tails mean the model is less surprised by extreme values Less surprise, possibly better predictions if extreme values are rare [Scatterplot: Divorce rate (std) vs Age at marriage (std), highlighting Idaho and Maine]

Slide 54

Slide 54 text

Problems of Prediction What is the next observation from the same process? (prediction) Possible to make very good predictions without knowing causes Optimizing prediction does not reliably reveal causes Powerful tools (PSIS, regularization) for measuring and managing accuracy [Plot: prediction error vs polynomial terms]

Slide 55

Slide 55 text

Course Schedule
Week 1: Bayesian inference (Chapters 1, 2, 3)
Week 2: Linear models & Causal Inference (Chapter 4)
Week 3: Causes, Confounds & Colliders (Chapters 5 & 6)
Week 4: Overfitting / MCMC (Chapters 7, 8, 9)
Week 5: Generalized Linear Models (Chapters 10, 11)
Week 6: Integers & Other Monsters (Chapters 11 & 12)
Week 7: Multilevel models I (Chapter 13)
Week 8: Multilevel models II (Chapter 14)
Week 9: Measurement & Missingness (Chapter 15)
Week 10: Generalized Linear Madness (Chapter 16)
https://github.com/rmcelreath/stat_rethinking_2022

Slide 56

Slide 56 text

No content