Statistical Rethinking 2023 - Lecture 07

Slide 1

Slide 1 text

Statistical Rethinking 7. Fitting Over & Under 2023

Slide 2

Slide 2 text

Mikołaj Kopernik (1473–1543)

Slide 3

Slide 3 text

Infinite causes, finite data
Estimator might exist, but not be useful
Struggle against causation: How to use causal assumptions to design estimators, contrast alternative models
Struggle against data: How to make the estimators work
[Figure: causal diagrams with nodes X, Y, Z, A, B, C]

Slide 4

Slide 4 text

Problems of Prediction
What function describes these points? (fitting, compression)
What function explains these points? (causal inference)
What would happen if we changed a point’s mass? (intervention)
What is the next observation from the same process? (prediction)
[Figure: scatterplot of brain volume (cc) against mass (kg)]

Slide 5

Slide 5 text

Leave-one-out cross-validation
(1) Drop one point
(2) Fit line to remaining
(3) Predict dropped point
(4) Repeat (1) with next point
(5) Score is error on dropped
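The five steps can be sketched in base R. This is a minimal illustration on synthetic data, not the lecture's data; the squared-error score and the names `loo_errors`, `in_sample` and `out_sample` are assumptions made for the example.

```r
# Synthetic data (an assumption for illustration, not the lecture's data)
set.seed(7)
n <- 20
x <- runif(n, 35, 60)
y <- 20 * x + rnorm(n, 0, 100)
d <- data.frame(x = x, y = y)

loo_errors <- numeric(n)
for (i in seq_len(n)) {                       # (4) repeat for every point
  fit  <- lm(y ~ x, data = d[-i, ])           # (1)-(2) drop point i, fit line to the rest
  pred <- predict(fit, newdata = d[i, , drop = FALSE])  # (3) predict the dropped point
  loo_errors[i] <- (d$y[i] - pred)^2          # (5) squared error on the dropped point
}

in_sample  <- sum(resid(lm(y ~ x, data = d))^2)  # error when every point is used to fit
out_sample <- sum(loo_errors)                    # leave-one-out error
print(c(In = in_sample, Out = out_sample))
```

For a least-squares line the out-of-sample total is never smaller than the in-sample total, which is the same pattern as the In/Out scores shown on the slides.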

Slide 9

Slide 9 text

Leave-one-out cross-validation
(1) Drop one point
(2) Fit line to remaining
(3) Predict dropped point
(4) Repeat (1) with next point
(5) Score is error on dropped
In: 318
Out: 619

Slide 10

Slide 10 text

Bayesian Cross-Validation
We use the entire posterior, not just a point prediction
Cross-validation score is:
$$\mathrm{lppd}_{CV} = \sum_{i=1}^{N} \frac{1}{S} \sum_{s=1}^{S} \log \Pr(y_i \mid \theta_{-i,s})$$
[Figure: density of the log posterior probability of an observation]

Slide 11

Slide 11 text

Bayesian Cross-Validation
[Book excerpt shown on the slide, partly cropped]
Cross-validation estimates the out-of-sample log-pointwise-predictive-density (lppd). If you have $N$ observations and fit the model $N$ times, dropping a single observation $y_i$ each time, then the out-of-sample lppd is the sum of the average accuracy for each omitted $y_i$:
$$\mathrm{lppd}_{CV} = \sum_{i=1}^{N} \frac{1}{S} \sum_{s=1}^{S} \log \Pr(y_i \mid \theta_{-i,s})$$
where $s$ indexes samples from a Markov chain and $\theta_{-i,s}$ is the $s$-th sample from the posterior distribution computed for observations omitting $y_i$.
Importance sampling replaces the computation of $N$ posterior distributions by using an estimate of the importance of each $i$ to the posterior distribution. We draw samples from the full posterior distribution $p(\theta \mid y)$, but we want samples from the reduced leave-one-out posterior distribution $p(\theta \mid y_{-i})$. We reweight each sample $s$ by the inverse of the probability of the omitted observation.
Annotations on the formula:
log pointwise predictive density (pages 210 and 218)
N data points
S samples from posterior
log probability of each point i, computed with posterior that omits point i
average log probability for point i
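A minimal sketch of the cross-validation score as annotated on this slide: for each point, average its log probability over S samples from the posterior that omits it, then sum over the N points. The toy model is an assumption chosen so that each leave-one-out posterior can be sampled directly without a Markov chain: y ~ Normal(mu, 1) with a flat prior on mu. The data are synthetic.

```r
set.seed(11)
N <- 15      # data points
S <- 2000    # posterior samples per leave-one-out fit
y <- rnorm(N, mean = 1, sd = 1)   # synthetic data

# Average log probability of point i under the posterior that omits point i
avg_logp <- sapply(seq_len(N), function(i) {
  y_minus_i <- y[-i]
  # Assumed toy model: with a flat prior and known sd = 1, the posterior of mu
  # is Normal(mean(y), 1 / sqrt(n)), so it can be sampled directly.
  mu_s <- rnorm(S, mean(y_minus_i), 1 / sqrt(length(y_minus_i)))
  mean(dnorm(y[i], mean = mu_s, sd = 1, log = TRUE))
})

lppd_cv <- sum(avg_logp)   # sum over the N data points
print(lppd_cv)
```

With a real model each of the N posteriors would need its own fit, which is the cost that importance sampling is meant to avoid.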

Slide 12

Slide 12 text

[1] In: 318 Out: 619
[2] In: 289 Out: 865

Slide 13

Slide 13 text

[1] In: 318 Out: 619
[2] In: 289 Out: 865
[3] In: 201 Out: 12,538

Slide 14

Slide 14 text

[1] In: 318 Out: 619
[2] In: 289 Out: 865
[3] In: 201 Out: 12,538
[4] In: 120 Out: 25,530
[5] In: 7 Out: 293,840

Slide 15

Slide 15 text

Cross-validation
For simple models, more parameters improve fit to sample
But may reduce accuracy of predictions out of sample
Most accurate model trades off flexibility with overfitting
[Figure: relative error against polynomial terms (1 to 5), showing error in sample and error out of sample]
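The trade-off can be reproduced on synthetic data. The sketch below fits polynomials with 1 to 5 terms by least squares and reports both errors. Instead of refitting N times it uses the least-squares identity that the leave-one-out residual equals the ordinary residual divided by (1 - leverage); that shortcut and the simulated data are assumptions of this example, not the lecture's procedure.

```r
set.seed(3)
n <- 30
x <- as.numeric(scale(runif(n, 35, 60)))          # standardized predictor
y <- 1 + 0.8 * x - 0.3 * x^2 + rnorm(n, 0, 0.5)   # synthetic data, truly 2nd degree
d <- data.frame(x = x, y = y)

degrees <- 1:5
err <- t(sapply(degrees, function(k) {
  fit <- lm(y ~ poly(x, k), data = d)
  e   <- resid(fit)       # in-sample residuals
  h   <- hatvalues(fit)   # leverage of each point
  c(in_sample = sum(e^2), out_sample = sum((e / (1 - h))^2))
}))
rownames(err) <- degrees
print(round(err, 2))
```

The in-sample column can only go down as terms are added, while the out-of-sample column is free to rise again once the extra flexibility starts fitting noise.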

Slide 16

Slide 16 text

[Figure: fits of a 1st degree polynomial and a 6th degree polynomial]

Slide 17

Slide 17 text

[Figure: 2nd degree polynomial fit; prediction error against polynomial terms (1 to 6), showing error in sample and error out of sample]

Slide 18

Slide 18 text

[Figure: 2nd degree polynomial fit; prediction error against polynomial terms (1 to 5), showing error in sample and error out of sample]

Slide 19

Slide 19 text

Regularization
Overfitting depends upon the priors
Skeptical priors have tighter variance, reduce flexibility
Regularization: Function finds regular features of process
Good priors are often tighter than you think!
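One way to see how a tighter prior reduces flexibility is to look at the posterior mode of the polynomial coefficients. Assuming a Gaussian likelihood with known sigma and independent Normal(0, s) priors on the coefficients, the mode has the ridge-regression form used below. That closed form, the synthetic data and the helper `map_beta` are assumptions of this sketch; the lecture's models are fit differently.

```r
set.seed(5)
n <- 20
x <- as.numeric(scale(runif(n)))
y <- 0.5 * x + rnorm(n, 0, 1)                    # synthetic data; the truth is a straight line
X <- scale(sapply(1:4, function(j) x^j))         # 4 polynomial terms, standardized columns
yc <- y - mean(y)                                # centering removes the intercept

# Posterior mode of beta under beta_j ~ Normal(0, s), sigma treated as known
map_beta <- function(s, sigma = 1) {
  lambda <- sigma^2 / s^2                        # tighter prior => larger penalty
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, yc)))
}

prior_sd <- c(10, 1, 0.5, 0.1)
betas <- sapply(prior_sd, map_beta)
colnames(betas) <- paste0("sd=", prior_sd)
size <- sqrt(colSums(betas^2))                   # overall size of the coefficients
print(round(betas, 3))
print(round(size, 3))
```

As the prior standard deviation shrinks, the coefficient vector is pulled toward zero, so the fitted curve has less room to chase individual points.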

Slide 20

Slide 20 text

$\mu_i = \alpha + \sum_{j=1}^{m} \beta_j x_i^j$
$\beta_j \sim \mathrm{Normal}(0, 10)$
[Figure: prediction error against polynomial terms (1 to 5), in and out of sample]

Slide 21

Slide 21 text

[Figure: prediction error against polynomial terms (1 to 5), in and out of sample, for two priors]
$\beta_j \sim \mathrm{Normal}(0, 10)$
$\beta_j \sim \mathrm{Normal}(0, 1)$

Slide 22

Slide 22 text

[Figure: prediction error against polynomial terms (1 to 5), in and out of sample, for three priors]
$\beta_j \sim \mathrm{Normal}(0, 10)$
$\beta_j \sim \mathrm{Normal}(0, 1)$
$\beta_j \sim \mathrm{Normal}(0, 0.5)$

Slide 23

Slide 23 text

[Figure: prediction error against polynomial terms (1 to 5), in and out of sample]
$\beta_j \sim \mathrm{Normal}(0, 0.1)$

Slide 24

Slide 24 text

Regularizing priors
How to choose width of prior?
For causal inference, use science
For pure prediction, can tune the prior using cross-validation
Many tasks are a mix of inference and prediction
No need to be perfect, just better
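For the pure-prediction case, tuning the prior by cross-validation can be sketched as a search over candidate prior widths, scoring each by leave-one-out error. The example assumes a Gaussian likelihood with known sigma, uses the posterior mode (ridge form) as the prediction, and runs on synthetic data. The helper names `poly_terms` and `fit_predict` are hypothetical.

```r
set.seed(9)
n <- 25
x <- as.numeric(scale(runif(n)))
y <- 0.5 * x + rnorm(n, 0, 1)                      # synthetic data
poly_terms <- function(x) sapply(1:4, function(j) x^j)

# Fit on a training set (posterior mode under beta_j ~ Normal(0, s)),
# then predict the held-out x values
fit_predict <- function(x_train, y_train, x_test, s, sigma = 1) {
  X    <- poly_terms(x_train)
  ctr  <- colMeans(X)
  Xc   <- sweep(X, 2, ctr)                         # center columns on the training set
  beta <- solve(crossprod(Xc) + (sigma^2 / s^2) * diag(ncol(Xc)),
                crossprod(Xc, y_train - mean(y_train)))
  Xt   <- sweep(matrix(poly_terms(x_test), ncol = 4), 2, ctr)
  drop(mean(y_train) + Xt %*% beta)
}

prior_sd <- c(10, 1, 0.5, 0.1)
loo_score <- sapply(prior_sd, function(s) {
  sum(sapply(seq_len(n), function(i) {
    (y[i] - fit_predict(x[-i], y[-i], x[i], s))^2  # error on the dropped point
  }))
})
names(loo_score) <- prior_sd
best_sd <- prior_sd[which.min(loo_score)]
print(loo_score)
print(best_sd)
```

The width with the smallest leave-one-out error is the one a pure-prediction analysis would pick. For a causal question the width should come from scientific knowledge instead.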

Slide 25

Slide 25 text

PAUSE

Slide 26

Slide 26 text

Prediction penalty
[Figure: prediction error against polynomial terms (1 to 5), in and out of sample]

Slide 27

Slide 27 text

Prediction penalty
[Figure, left panel: prediction error against polynomial terms (1 to 5), in and out of sample]
[Figure, right panel: out-of-sample penalty against polynomial terms (1 to 5)]

Slide 28

Slide 28 text

Penalty prediction
For N points, cross-validation requires fitting N models
What if you could estimate the penalty from a single model fit?
Good news! You can:
Importance sampling (PSIS)
Information criteria (WAIC)
[Figure: out-of-sample penalty against polynomial terms (1 to 5)]
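As an illustration of estimating the penalty from a single fit, the sketch below computes WAIC from one set of posterior samples: the in-sample lppd minus a penalty equal to the summed variance of each point's log-likelihood across samples. The toy model (y ~ Normal(mu, 1) with a flat prior on mu, sampled directly) and the synthetic data are assumptions of this example.

```r
set.seed(21)
N <- 30
S <- 4000
y <- rnorm(N, 0.5, 1)                          # synthetic data
mu_s <- rnorm(S, mean(y), 1 / sqrt(N))         # posterior samples from ONE fit (toy model)

# S x N matrix: log-likelihood of each point under each posterior sample
loglik <- sapply(y, function(yi) dnorm(yi, mean = mu_s, sd = 1, log = TRUE))

lppd_i  <- log(colMeans(exp(loglik)))          # in-sample log pointwise predictive density
pwaic_i <- apply(loglik, 2, var)               # pointwise penalty terms
waic    <- -2 * (sum(lppd_i) - sum(pwaic_i))

print(c(lppd = sum(lppd_i), penalty = sum(pwaic_i), WAIC = waic))
```

No model is refit here. The penalty comes entirely from how much each point's log-likelihood varies across the posterior samples.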

Slide 29

Slide 29 text

[Figure, left panel: prediction error against polynomial terms (1 to 5)]
[Figure, right panel: log pointwise predictive density against polynomial terms (1 to 5); labels: WAIC, PSIS, lppd, leave-one-out cross-validation]

Slide 30

Slide 30 text

Overfit
Underfit
WAIC, PSIS, CV measure overfitting
Regularization manages overfitting
None directly address causal inference
Important for understanding statistical inference

Slide 31

Slide 31 text

Model Mis-selection
Do not use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate
Predictive criteria actually prefer confounds & colliders
Example: Plant growth experiment
[Figure: causal diagram with nodes H0, H1, T, F]

Slide 32

Slide 32 text

Wrong adjustment set for total causal effect of treatment (blocks mediating path):
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i + \beta_F F_i$
Correct adjustment set for total causal effect of treatment:
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i$
[Figure: causal diagram with nodes H0, H1, T, F]

Slide 33

Slide 33 text

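Model with fungus:
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i + \beta_F F_i$
Model without fungus:
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i$
[Figure: posterior density of the effect of treatment, with curves labelled correct and biased]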

Slide 34

Slide 34 text

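Model with fungus:
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i + \beta_F F_i$
Model without fungus:
$H_1 \sim \mathrm{Normal}(\mu_i, \sigma)$
$\mu_i = H_0 \times p_i$
$p_i = \alpha + \beta_T T_i$
[Figure: PSIS deviance (350 to 410) for models m6.8 and m6.7; labels: H1 ~ H0 + T + F, H1 ~ H0 + T]
Wrong model wins at prediction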

Slide 35

Slide 35 text

[Figure: PSIS deviance (350 to 410) for models m6.8 and m6.7; labels: H1 ~ H0 + T + F, H1 ~ H0 + T]
Wrong model wins at prediction
Annotations on the figure:
Score in sample
Score out of sample
Standard error of score
PSIS contrast and standard error

Slide 36

Slide 36 text

Why does the wrong model win at prediction?
[Figure: growth (1.2 to 2.0) for no fungus and fungus, with treatment and control groups; causal diagram with nodes H0, H1, T, F]

Slide 37

Slide 37 text

Why does the wrong model win at prediction?
Fungus is in fact a better predictor than treatment
[Figure, left panel: growth (1.2 to 2.0) for no fungus and fungus, split by treatment and control]
[Figure, right panel: growth (1.2 to 2.0) for control and treatment, split by fungus and no fungus]
[Figure: causal diagram with nodes H0, H1, T, F]
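A small simulation can make the point concrete. In the sketch below, treatment affects growth only by reducing fungus. The data-generating numbers are assumptions for illustration. The models are additive least-squares regressions rather than the multiplicative model on the slides, and the predictive score is leave-one-out mean squared error (via the least-squares leverage identity) rather than PSIS.

```r
set.seed(71)
N <- 1000
h0 <- rnorm(N, 10, 2)                                         # initial height
treatment <- rep(0:1, each = N / 2)
fungus <- rbinom(N, size = 1, prob = 0.5 - treatment * 0.4)   # treatment reduces fungus
h1 <- h0 + rnorm(N, 5 - 3 * fungus)                           # fungus reduces growth

m_with_F    <- lm(h1 ~ h0 + treatment + fungus)   # conditions on the mediator
m_without_F <- lm(h1 ~ h0 + treatment)            # total effect of treatment

# Leave-one-out mean squared error, using residual / (1 - leverage)
loo_mse <- function(fit) mean((resid(fit) / (1 - hatvalues(fit)))^2)

print(c(with_F = loo_mse(m_with_F), without_F = loo_mse(m_without_F)))
print(c(with_F = unname(coef(m_with_F)["treatment"]),
        without_F = unname(coef(m_without_F)["treatment"])))
```

The model that includes fungus predicts better, yet its treatment coefficient sits near zero because the mediating path is blocked. The model without fungus predicts worse but recovers a positive effect of treatment.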

Slide 38

Slide 38 text

Model Mis-selection
Do not use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate
However, many analyses are mixes of inferential and predictive chores
Still need help finding good functional descriptions while avoiding overfitting
[Figure: causal diagram with nodes H0, H1, T, F]

Slide 39

Slide 39 text

Outliers & Robust Regression
Some points are more influential than others
“Outliers”: Observations in the tails of predictive distribution
Outliers indicate predictions are possibly overconfident, unreliable
The model doesn’t expect enough variation
[Figure: scatterplot panels, both axes from -4 to 4]

Slide 40

Slide 40 text

Outliers & Robust Regression
Dropping outliers is bad: Just ignores the problem; predictions are still bad!
It’s the model that’s wrong, not the data
First, quantify influence of each point
Second, use a mixture model (robust regression)
[Figure: scatterplot panels, both axes from -4 to 4]

Slide 41

Slide 41 text

Outliers & Robust Regression
Divorce rate example
Maine and Idaho both unusual
Maine: high divorce for trend
Idaho: low divorce for trend
[Figure: divorce rate (std) against age at marriage (std), with Idaho and Maine labelled]

Slide 42

Slide 42 text

Outliers & Robust Regression
Quantify influence:
PSIS k statistic
WAIC penalty term (“effective number of parameters”)
[Figure: WAIC penalty against PSIS Pareto k, with Idaho and Maine labelled]
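The WAIC penalty can be inspected point by point. In this sketch, which uses an assumed toy model (y ~ Normal(mu, 1), flat prior on mu) and synthetic data with one deliberately extreme value, the pointwise penalty is the variance of each observation's log-likelihood across posterior samples. The Pareto k statistic requires the PSIS smoothing procedure and is not computed here.

```r
set.seed(42)
y <- c(rnorm(20, 0, 1), 8)            # 20 ordinary points plus one extreme value (index 21)
N <- length(y)
S <- 4000
mu_s <- rnorm(S, mean(y), 1 / sqrt(N))   # posterior samples for the toy model

# S x N matrix of log-likelihoods, then the variance within each column
loglik    <- sapply(y, function(yi) dnorm(yi, mean = mu_s, sd = 1, log = TRUE))
penalty_i <- apply(loglik, 2, var)

most_influential <- which.max(penalty_i)
print(round(penalty_i, 3))
print(most_influential)
```

The extreme value carries by far the largest penalty term, which is how a point like Idaho or Maine shows up as influential.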

Slide 43

Slide 43 text

Outliers & Robust Regression
[Figure, left panel: WAIC penalty against PSIS Pareto k, with Idaho and Maine labelled]
[Figure, right panel: divorce rate (std) against age at marriage (std), with Idaho and Maine labelled]

Slide 44

Slide 44 text

Mixing Gaussians
[Figure: density against value (-6 to 6)]

Slide 45

Slide 45 text

Mixing Gaussians
[Figure: density against value (-6 to 6), comparing Student-t and Gaussian]

Slide 46

Slide 46 text

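library(rethinking)

# dat is assumed to be a list of standardized variables:
# D (divorce rate), M (marriage rate), A (median age at marriage)

# Gaussian regression
m5.3 <- quap(
    alist(
        D ~ dnorm( mu , sigma ) ,
        mu <- a + bM*M + bA*A ,
        a ~ dnorm( 0 , 0.2 ) ,
        bM ~ dnorm( 0 , 0.5 ) ,
        bA ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
    ) , data = dat )

# Robust regression: same model with a Student-t likelihood (2 degrees of freedom)
m5.3t <- quap(
    alist(
        D ~ dstudent( 2 , mu , sigma ) ,
        mu <- a + bM*M + bA*A ,
        a ~ dnorm( 0 , 0.2 ) ,
        bM ~ dnorm( 0 , 0.5 ) ,
        bA ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
    ) , data = dat )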

Slide 47

Slide 47 text

Slide 48

Slide 48 text

Robust Regressions
Unobserved heterogeneity => mixture of Gaussians
Thick tails mean the model is less surprised by extreme values
Usually impossible to estimate distribution of extreme values
Student-t regression as default?
[Figure: divorce rate (std) against age at marriage (std), with Idaho and Maine labelled]
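The link between mixing Gaussians and thick tails can be checked by simulation. The sketch below draws each value from a Gaussian whose standard deviation differs from draw to draw. Assuming the precisions follow a Gamma(nu/2, nu/2) distribution, the resulting mixture is a Student-t with nu degrees of freedom. The choice nu = 2 mirrors the `dstudent( 2 , ... )` call in the lecture code; everything else is an assumption of the example.

```r
set.seed(2)
n  <- 1e5
nu <- 2

# Each draw gets its own precision: unobserved heterogeneity in the variance
precision <- rgamma(n, shape = nu / 2, rate = nu / 2)
y_mix <- rnorm(n, mean = 0, sd = 1 / sqrt(precision))

tail_mix     <- mean(abs(y_mix) > 3)     # share of simulated draws beyond +/- 3
tail_student <- 2 * pt(-3, df = nu)      # same tail mass under Student-t
tail_gauss   <- 2 * pnorm(-3)            # same tail mass under a single Gaussian

print(c(mixture = tail_mix, student_t = tail_student, gaussian = tail_gauss))
```

The mixture's tail mass tracks the Student-t value and is far larger than the single-Gaussian value, so a Student-t likelihood is much less surprised by points like Idaho and Maine.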

Slide 49

Slide 49 text

https://www.vox.com/2015/5/21/8635369/pinker-taleb

Slide 50

Slide 50 text

Problems of Prediction
What is the next observation from the same process? (prediction)
Possible to make very good predictions without knowing causes
Optimizing prediction does not reliably reveal causes
Powerful tools (PSIS, regularization) for measuring and managing accuracy
[Figure: prediction error against polynomial terms (1 to 5)]

Slide 51

Slide 51 text

Course Schedule
Week 1: Bayesian inference (Chapters 1, 2, 3)
Week 2: Linear models & Causal Inference (Chapter 4)
Week 3: Causes, Confounds & Colliders (Chapters 5 & 6)
Week 4: Overfitting / MCMC (Chapters 7, 8, 9)
Week 5: Generalized Linear Models (Chapters 10, 11)
Week 6: Integers & Other Monsters (Chapters 11 & 12)
Week 7: Multilevel models I (Chapter 13)
Week 8: Multilevel models II (Chapter 14)
Week 9: Measurement & Missingness (Chapter 15)
Week 10: Generalized Linear Madness (Chapter 16)
https://github.com/rmcelreath/stat_rethinking_2023

Slide 52

Slide 52 text

No content