Slide 1

Slide 1 text

Statistical Rethinking 13: Multi-Multilevel Models 2022

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

complete pooling · partial pooling · no pooling

S_i ∼ Binomial(D_i, p_i)
logit(p_i) = α_T[i]
α_j ∼ Normal(ᾱ, σ)
ᾱ ∼ Normal(0, 1.5)
σ ∼ Exponential(1)

[Figure: PSIS score (y-axis, roughly 200-600) versus σ (x-axis, log-spaced from 0.1 to 4.07), locating complete pooling, partial pooling, and no pooling along the σ axis.]
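The partial-pooling model above can also be read generatively: draw tank effects from the population distribution, convert to probabilities, and simulate binomial survival. A minimal Python sketch (the lecture itself uses R and ulam; the values for ᾱ and σ here are illustrative assumptions, not estimates):

```python
import math
import random

def inv_logit(x):
    # logit(p) = alpha  =>  p = inv_logit(alpha)
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
a_bar, sigma = 1.0, 1.5   # assumed population mean and spread (log-odds scale)

tanks = []
for j in range(5):
    a_j = random.gauss(a_bar, sigma)   # tank effect: alpha_j ~ Normal(a_bar, sigma)
    p_j = inv_logit(a_j)               # survival probability in tank j
    D_j = 25                           # tadpoles in tank j
    # survivors: S_j ~ Binomial(D_j, p_j)
    S_j = sum(random.random() < p_j for _ in range(D_j))
    tanks.append((S_j, D_j, p_j))
```

Running the generative model forward like this is also how the priors can be checked before fitting.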

Slide 4

Slide 4 text

Practical Difficulties

Varying effects are a good default, but…
(1) How to use more than one cluster type at the same time? For example, stories and participants
(2) How to calculate predictions
(3) How to sample chains efficiently

[Figure: two panels of responses on a 1-7 scale, one by Story (1-12) and one by Participant (1-50).]

Slide 5

Slide 5 text

Clusters & features

Clusters: kinds of groups in the data
Features: aspects of the model (parameters) that vary by cluster

Cluster        Feature
tanks          survival
stories        treatment effect
individuals    average response
departments    admission rate, bias

[Figure: responses by Story and by Participant, as on the previous slide.]

Slide 6

Slide 6 text

Add clusters: more index variables, more population priors (this lecture)
Add features: more parameters, more dimensions in each population prior (next lecture)

Cluster        Feature
tanks          survival
stories        treatment effect
individuals    average response
departments    admission rate, bias

[Figure: responses by Story and by Participant, as before.]

Slide 7

Slide 7 text

[Figure: experimental apparatus; the focal actor faces a right lever and a left lever, with food and empty trays.]

Slide 8

Slide 8 text

[Figure: with a partner present, one lever delivers food to both actor and partner (the "prosocial option"); the other delivers food to the actor and an empty tray to the partner.]

Slide 9

Slide 9 text

Prosocial chimpanzees

data(chimpanzees)
504 trials, 7 actors, 6 blocks
4 treatments:
(1) right, no partner
(2) left, no partner
(3) right, partner
(4) left, partner

Slide 10

Slide 10 text

Prosocial chimpanzees

data(chimpanzees)
504 trials, 7 actors, 6 blocks
4 treatments:
(1) right, no partner
(2) left, no partner
(3) right, partner
(4) left, partner

[DAG: pulled left (P) influenced by treatment (T = side and condition), actor (A), and block (B).]

Slide 11

Slide 11 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)                       (pulled left)
logit(p_i) = β_T[i],B[i] + α_A[i]          (treatment, block; actor)

[DAG: P, T, A, B, as before.]

Slide 12

Slide 12 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)
logit(p_i) = β_T[i],B[i] + α_A[i]

[DAG: P, T, A, B, as before.]

Slide 13

Slide 13 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)                       probability of left lever
logit(p_i) = β_T[i],B[i] + α_A[i]          log-odds of left lever

Slide 14

Slide 14 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)                       probability of left lever
logit(p_i) = β_T[i],B[i] + α_A[i]          log-odds of left lever
α_j ∼ Normal(ᾱ, σ_A)                       prior for actor effects (handedness)

Slide 15

Slide 15 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)                       probability of left lever
logit(p_i) = β_T[i],B[i] + α_A[i]          log-odds of left lever
α_j ∼ Normal(ᾱ, σ_A)                       prior for actor effects (handedness)
β_j,k ∼ Normal(0, σ_B)                     prior for treatment/block effects

Slide 16

Slide 16 text

Prosocial chimpanzees

P_i ∼ Bernoulli(p_i)                       probability of left lever
logit(p_i) = β_T[i],B[i] + α_A[i]          log-odds of left lever
α_j ∼ Normal(ᾱ, σ_A)                       prior for actor effects (handedness)
β_j,k ∼ Normal(0, σ_B)                     prior for treatment/block effects
σ_A, σ_B ∼ Exponential(1)                  prior for "variance components"

Slide 17

Slide 17 text

Pooling treatment effects?!

Why is it reasonable to partially pool treatment effects?
(1) Because treatments are not completely different
(2) Because there are many possible treatments, and you used only a few
(3) Because it results in better estimates

If parameters get the same prior, it is usually better to learn that prior from the sample:
β_j,k ∼ Normal(0, σ_B)
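Why pooling "results in better estimates" is easiest to see in the simple normal-normal case, where the partially pooled estimate is a precision-weighted compromise between a group's own estimate and the population mean. A Python sketch of that compromise (not from the lecture; the function name and all numbers are illustrative assumptions):

```python
def pooled_estimate(y_j, s_j, mu, tau):
    """Partially pooled estimate for group j under a normal-normal model.

    y_j:     the group's own (no-pooling) estimate
    s_j:     its standard error
    mu, tau: mean and sd of the population distribution (the learned prior)
    """
    w = tau**2 / (tau**2 + s_j**2)   # weight on the group's own estimate
    return w * y_j + (1 - w) * mu

# a noisily estimated group is shrunk strongly toward the population mean...
strong_shrink = pooled_estimate(y_j=2.0, s_j=3.0, mu=0.0, tau=0.5)
# ...while a precisely estimated group barely moves
weak_shrink = pooled_estimate(y_j=2.0, s_j=0.1, mu=0.0, tau=0.5)
```

The amount of shrinkage adapts to the data: σ_B in the model plays the role of tau, and it is estimated from the sample.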

Slide 18

Slide 18 text

data(chimpanzees)
d <- chimpanzees
d$treatment <- 1 + d$prosoc_left + 2*d$condition
dat <- list(
    P = d$pulled_left,
    A = d$actor,
    B = d$block,
    T = d$treatment )

# block interactions
mBT <- ulam(
    alist(
        P ~ bernoulli( p ) ,
        logit(p) <- b[T,B] + a[A],
        ## adaptive priors
        matrix[T,B]:b ~ dnorm( 0 , sigma_B ),
        a[A] ~ dnorm( a_bar , sigma_A ),
        ## hyper-priors
        a_bar ~ dnorm( 0 , 1.5 ),
        sigma_A ~ dexp(1),
        sigma_B ~ dexp(1)
    ) , data=dat , chains=4 , cores=4 )

> precis(mBT,3)
         mean   sd  5.5% 94.5% n_eff Rhat4
b[1,1]  -0.23 0.37 -0.85  0.34  1301  1.00
b[1,2]  -0.01 0.34 -0.56  0.51  2716  1.00
b[1,3]   0.32 0.36 -0.22  0.92   919  1.00
b[1,4]   0.11 0.35 -0.44  0.68  1790  1.00
b[1,5]  -0.36 0.37 -0.98  0.17   922  1.00
b[1,6]  -0.24 0.35 -0.85  0.28  1327  1.01
b[2,1]   0.09 0.35 -0.44  0.67  2071  1.00
b[2,2]  -0.01 0.36 -0.58  0.56  1689  1.00
b[2,3]  -0.12 0.34 -0.69  0.41  1738  1.00
b[2,4]   0.32 0.38 -0.21  0.98  1120  1.00
b[2,5]   0.20 0.35 -0.32  0.79  1769  1.00
b[2,6]   0.67 0.45  0.02  1.42   333  1.01
b[3,1]  -0.36 0.38 -1.01  0.17   735  1.00
b[3,2]  -0.03 0.36 -0.61  0.53  2061  1.00
b[3,3]  -0.21 0.35 -0.79  0.33  1464  1.00
b[3,4]  -0.46 0.38 -1.11  0.06   527  1.01
b[3,5]   0.03 0.37 -0.53  0.62  2478  1.00
b[3,6]  -0.34 0.37 -0.99  0.19   883  1.00
b[4,1]  -0.37 0.40 -1.06  0.20   873  1.01
b[4,2]   0.28 0.36 -0.24  0.88  1205  1.00
b[4,3]   0.27 0.35 -0.24  0.86  1105  1.00
b[4,4]   0.08 0.37 -0.49  0.68  1828  1.00
b[4,5]   0.06 0.34 -0.48  0.60  2025  1.00
b[4,6]   0.45 0.41 -0.13  1.17   597  1.01
a[1]    -0.35 0.27 -0.78  0.09  1408  1.00
a[2]     4.70 1.25  3.09  7.01  1214  1.00
a[3]    -0.62 0.27 -1.08 -0.20  1481  1.00
a[4]    -0.64 0.27 -1.07 -0.20  1424  1.00
a[5]    -0.36 0.27 -0.78  0.06  1237  1.00
a[6]     0.60 0.26  0.18  1.04  1654  1.00
a[7]     2.15 0.39  1.56  2.82  1436  1.00
a_bar    0.62 0.69 -0.47  1.71  1635  1.00
sigma_A  2.02 0.66  1.17  3.17  1482  1.00
sigma_B  0.46 0.18  0.18  0.74   198  1.02

Slide 19

Slide 19 text

Same model code and precis output as Slide 18, highlighting the poorly mixing parameters: sigma_B (n_eff = 198) and b[2,6] (n_eff = 333).

Slide 20

Slide 20 text

[Figure: posterior treatment/block effects on the log-odds scale (R/N, L/N, R/P, L/P, roughly -2 to 2) and actor effects as probability of pulling left for actors 1-7.]

Slide 21

Slide 21 text

[Figure: as before, plus posterior densities of sigma_A (actor variation, concentrated near 2) and sigma_B (treatment/block variation, concentrated below 1).]

Slide 22

Slide 22 text

Variance does not add!

In linear models, variance components are additive: total variation in the outcome is the sum of the components.
This is not true for generalized linear models. The link function breaks additivity, so variation in one component moderates variation in the others.

[Figure: posterior densities of sigma_A (actor variation) and sigma_B (treatment/block variation).]
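The moderation is easy to see numerically: the same shift on the log-odds scale produces very different shifts on the probability scale, depending on where the other components have placed you. A quick Python illustration (the particular values are arbitrary):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# a +1 change in log-odds near the middle of the scale...
delta_mid = inv_logit(0.0 + 1.0) - inv_logit(0.0)    # large change in p

# ...versus the same +1 change when other components put log-odds at 4
delta_edge = inv_logit(4.0 + 1.0) - inv_logit(4.0)   # tiny change in p
```

So a strongly right-handed actor (large α) flattens the apparent treatment/block variation for that actor, even though the log-odds components are identical.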

Slide 23

Slide 23 text

Multilevel predictions & effects

How to compute predictions and interventions (causal effects)?
Predict for the same groups: use the varying effect estimates for each group.
Predict for new groups: ignore the varying effect estimates and marginalize over the population distribution.

[Figure: posterior densities of sigma_A and sigma_B, as before.]

Slide 24

Slide 24 text

New groups

Reedfrog intervention. Target population: 50% predation, 25% large tadpoles.
What is the causal effect of increasing size to 75% large?

library(rethinking)
data(reedfrogs)
d <- reedfrogs
dat <- list(
    S = d$surv,
    D = d$density,
    T = 1:nrow(d),
    P = ifelse(d$pred=="no",1L,2L),
    G = ifelse(d$size=="small",1L,2L) )

mSPG <- ulam(
    alist(
        S ~ binomial( D , p ),
        logit(p) <- a[T] + b[P,G],
        a[T] ~ normal( 0 , sigma ),
        matrix[P,G]:b ~ normal( 0 , 1 ),
        sigma ~ exponential( 1 )
    ), data=dat , chains=4 , cores=4 )

Slide 25

Slide 25 text

post <- extract.samples(mSPG)

# sim under status quo
n_groups <- 1000
n_samples <- 2000
S1 <- matrix(0,nrow=n_samples,ncol=n_groups)
for ( s in 1:n_groups ) {
    # sim a tank from posterior population
    aT <- rnorm(n_samples,0,post$sigma)
    # sample P and G for this group
    P <- sample( 1:2 , size=1 , prob=c(0.5,0.5) )   # 50% pred
    G <- sample( 1:2 , size=1 , prob=c(0.75,0.25) ) # 25% large
    # sim survival
    p <- inv_logit( aT + post$b[,P,G] )
    S1[,s] <- rbinom(n_samples,35,p)
}
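The same marginalization logic, sketched in Python with stand-in "posterior draws" (random placeholders, not samples from the fitted mSPG model; the sizes are reduced to keep the sketch fast):

```python
import math
import random

random.seed(0)

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

n_samples, n_groups = 200, 50

# stand-in posterior draws for sigma and b[P,G] (assumed, not fitted)
sigma = [abs(random.gauss(0.7, 0.1)) for _ in range(n_samples)]
b = {(P, G): [random.gauss(0.0, 0.5) for _ in range(n_samples)]
     for P in (1, 2) for G in (1, 2)}

S1 = []
for s in range(n_groups):
    # sample P and G for this new group
    P = 1 if random.random() < 0.5 else 2    # 50% predation
    G = 2 if random.random() < 0.25 else 1   # 25% large
    sims = []
    for i in range(n_samples):
        aT = random.gauss(0.0, sigma[i])     # new tank from the population
        p = inv_logit(aT + b[(P, G)][i])
        # survivors out of 35 tadpoles: Binomial(35, p)
        sims.append(sum(random.random() < p for _ in range(35)))
    S1.append(sims)
```

The key step is drawing aT fresh for each simulated group: that is what "marginalize over the population distribution" means for new, unobserved tanks.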

Slide 26

Slide 26 text

(Same simulation code as Slide 25.)

Slide 27

Slide 27 text

(Same simulation code as Slide 25.)

Slide 28

Slide 28 text

(Same simulation code as Slide 25.)

Slide 29

Slide 29 text

(Same simulation code as Slide 25.)

[Figure: "Reedfrog status quo" — histogram of simulated number surviving (0-35) under the status quo.]

Slide 30

Slide 30 text

# intervention - 75% large
S2 <- matrix(0,nrow=n_samples,ncol=n_groups)
for ( s in 1:n_groups ) {
    # sim a tank from posterior population
    aT <- rnorm(n_samples,0,post$sigma)
    # sample P and G for this group
    P <- sample( 1:2 , size=1 , prob=c(0.5,0.5) )   # 50% pred
    G <- sample( 1:2 , size=1 , prob=c(0.25,0.75) ) # 75% large
    # sim survival
    p <- inv_logit( aT + post$b[,P,G] )
    S2[,s] <- rbinom(n_samples,35,p)
}

[Figure: "Reedfrog intervention" — histogram of simulated number surviving (0-35) under the intervention.]

Slide 31

Slide 31 text

[Figure: distribution of the change in survival, from -30 (less survival) to +30 (more survival), as a proportion of simulations.]

Slide 32

Slide 32 text

Multilevel predictions

Group variation moderates causal effects.
Averaging over group variation means simulating groups (or using estimates from observed groups, as appropriate).
If you have a generative model, you can simulate interventions for new targets.

[Figure: histograms of simulated survival under the status quo and under the intervention.]

Slide 33

Slide 33 text

PAUSE

Slide 34

Slide 34 text

v ∼ Normal(0, 0.5)
x ∼ Normal(0, exp(v))

Slide 35

Slide 35 text

v ∼ Normal(0, _)
x ∼ Normal(0, exp(v))

Slide 36

Slide 36 text

v ∼ Normal(0, 3)
x ∼ Normal(0, exp(v))

divergent transitions

Slide 37

Slide 37 text

Divergent transitions

Why? The same step size is not optimal everywhere. In regions of high curvature, the simulation cannot follow the surface.
What can we do?
(1) Use a smaller step size
(2) Reparameterize!

v ∼ Normal(0, 3)
x ∼ Normal(0, exp(v))
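The geometry problem is easy to quantify: the conditional scale of x is exp(v), so it varies by orders of magnitude across plausible values of v, and no single step size suits both ends of the funnel. A quick check in Python:

```python
import math

# conditional standard deviation of x at different values of v
sd_low = math.exp(-3.0)    # narrow neck of the funnel
sd_high = math.exp(3.0)    # wide mouth of the funnel

# scale disparity a single step size would have to span
ratio = sd_high / sd_low
```

With v ∼ Normal(0, 3), values of v at ±3 (one standard deviation) are entirely plausible, so the sampler routinely faces a several-hundred-fold change in scale.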

Slide 38

Slide 38 text

v ∼ Normal(0, 3)
x ∼ Normal(0, exp(v))

Small step size helps, but makes exploration slow.

Slide 39

Slide 39 text

"Centered"

v ∼ Normal(0, 3)
x ∼ Normal(0, exp(v))

Slide 40

Slide 40 text

"Centered"                  "Non-centered"
v ∼ Normal(0, 3)            v ∼ Normal(0, 3)
x ∼ Normal(0, exp(v))       z ∼ Normal(0, 1)
                            x = z exp(v)

Slide 41

Slide 41 text

v ∼ Normal(0, 3)
z ∼ Normal(0, 1)
x = z exp(v)
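The two parameterizations describe the same joint distribution for (v, x); only the coordinates the sampler works in change. A Monte Carlo sanity check in Python (sample x both ways and compare the fraction of draws inside a fixed interval):

```python
import math
import random

random.seed(2)
n = 200_000

def frac_in_unit(xs):
    # fraction of draws with -1 < x < 1
    return sum(-1.0 < x < 1.0 for x in xs) / len(xs)

# centered: x ~ Normal(0, exp(v))
centered = []
for _ in range(n):
    v = random.gauss(0.0, 3.0)
    centered.append(random.gauss(0.0, math.exp(v)))

# non-centered: z ~ Normal(0, 1), then x = z * exp(v)
noncentered = []
for _ in range(n):
    v = random.gauss(0.0, 3.0)
    z = random.gauss(0.0, 1.0)
    noncentered.append(z * math.exp(v))

gap = abs(frac_in_unit(centered) - frac_in_unit(noncentered))
```

Because the sampled coordinates (v, z) are independent standard-scale normals, the non-centered posterior has no funnel for the sampler to navigate.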

Slide 42

Slide 42 text

m13.7 <- ulam(
    alist(
        v ~ normal(0,3),
        x ~ normal(0,exp(v))
    ), data=list(N=1) , chains=4 )

m13.7nc <- ulam(
    alist(
        v ~ normal(0,3),
        z ~ normal(0,1),
        gq> real[1]:x <<- z*exp(v)
    ), data=list(N=1) , chains=4 )

Slide 43

Slide 43 text

(Same models as Slide 42.)

Warning: 112 of 2000 (6.0%) transitions ended with a divergence.

> precis( m13.7 )
   mean     sd   5.5%  94.5% n_eff Rhat4
v  1.41   2.37  -1.84   5.93     7  1.46
x 35.93 168.42 -21.15 258.86    30  1.19

> precis( m13.7nc )
    mean     sd   5.5% 94.5% n_eff Rhat4
v  -0.04   3.12  -5.17  4.84  1380     1
z  -0.01   0.96  -1.60  1.51  1495     1
x -19.34 899.98 -30.81 24.86  1874     1

Slide 44

Slide 44 text

[Figure: trace plots for the centered model over 1000 iterations; v (n_eff = 7) and x (n_eff = 30) mix poorly.]

Slide 45

Slide 45 text

[Figure: trace plots for the non-centered model, v (n_eff = 1380), z (n_eff = 1495), and x (n_eff = 1874), mixing well; centered traces (v: n_eff = 7, x: n_eff = 30) shown for comparison.]

Slide 46

Slide 46 text

Non-centered varying effects

"Centered"
S_i ∼ Binomial(D_i, p_i)
logit(p_i) = α_T[i]
α_j ∼ Normal(ᾱ, σ)
ᾱ ∼ Normal(0, 1.5)
σ ∼ Exponential(1)

Slide 47

Slide 47 text

Non-centered varying effects

"Centered"                      "Non-centered"
S_i ∼ Binomial(D_i, p_i)        S_i ∼ Binomial(D_i, p_i)
logit(p_i) = α_T[i]             logit(p_i) = ᾱ + z_T[i] × σ
α_j ∼ Normal(ᾱ, σ)              z_j ∼ Normal(0, 1)
ᾱ ∼ Normal(0, 1.5)              ᾱ ∼ Normal(0, 1.5)
σ ∼ Exponential(1)              σ ∼ Exponential(1)

Slide 48

Slide 48 text

(Same comparison as Slide 47.)

Slide 49

Slide 49 text

Non-centered chimpanzees

"Centered"
P_i ∼ Bernoulli(p_i)
logit(p_i) = β_T[i],B[i] + α_A[i]
α_j ∼ Normal(ᾱ, σ_A)
β_j,k ∼ Normal(0, σ_B)
ᾱ ∼ Normal(0, 1.5)
σ_A, σ_B ∼ Exponential(1)

Slide 50

Slide 50 text

Non-centered chimpanzees

"Centered"                              "Non-centered"
P_i ∼ Bernoulli(p_i)                    P_i ∼ Bernoulli(p_i)
logit(p_i) = β_T[i],B[i] + α_A[i]       logit(p_i) = ᾱ + (z_α,A[i])σ_A + (z_β,T[i],B[i])σ_B
α_j ∼ Normal(ᾱ, σ_A)                    z_α,j ∼ Normal(0, 1)
β_j,k ∼ Normal(0, σ_B)                  z_β,j ∼ Normal(0, 1)
ᾱ ∼ Normal(0, 1.5)                      ᾱ ∼ Normal(0, 1.5)
σ_A, σ_B ∼ Exponential(1)               σ_A, σ_B ∼ Exponential(1)
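The non-centered linear model is the centered one with α and β unfolded: α_j = ᾱ + z_α,j σ_A and β_j,k = z_β,j,k σ_B, so the two forms compute identical log-odds. A small Python check (ᾱ and the sigmas loosely echo the precis summaries from Slide 18; the z values are arbitrary assumptions):

```python
# values loosely echoing the posterior summaries; z values arbitrary
a_bar, sigma_A, sigma_B = 0.62, 2.02, 0.46
z_a = -0.48   # standardized effect for some actor j
z_b = 1.46    # standardized effect for some treatment/block cell (j, k)

# fold the z-scores back into natural-scale parameters
a_j = a_bar + z_a * sigma_A   # actor effect alpha_j
b_jk = z_b * sigma_B          # treatment/block effect beta_jk

centered = b_jk + a_j                                   # centered logit(p)
non_centered = a_bar + z_a * sigma_A + z_b * sigma_B    # non-centered logit(p)
```

This is exactly the folding that the gq> lines in the ulam code below perform, so the non-centered model still reports a and b on the natural scale.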

Slide 51

Slide 51 text

(Same comparison as Slide 50.)

Slide 52

Slide 52 text

(Same comparison as Slide 50.)

Slide 53

Slide 53 text

mBT <- ulam(
    alist(
        P ~ bernoulli( p ) ,
        logit(p) <- b[T,B] + a[A],
        ## adaptive priors
        matrix[T,B]:b ~ dnorm( 0 , sigma_B ),
        a[A] ~ dnorm( a_bar , sigma_A ),
        ## hyper-priors
        a_bar ~ dnorm( 0 , 1.5 ),
        sigma_A ~ dexp(1),
        sigma_B ~ dexp(1)
    ) , data=dat , chains=4 , cores=4 )

mBTnc <- ulam(
    alist(
        P ~ bernoulli( p ) ,
        logit(p) <- a_bar + z_a[A]*sigma_A + z_b[T,B]*sigma_B ,
        ## adaptive priors
        matrix[T,B]:z_b ~ dnorm( 0 , 1 ),
        z_a[A] ~ dnorm( 0 , 1 ),
        ## hyper-priors
        a_bar ~ dnorm( 0 , 1.5 ),
        sigma_A ~ dexp(1),
        sigma_B ~ dexp(1),
        gq> vector[A]:a <<- a_bar + z_a*sigma_A,
        gq> matrix[T,B]:b <<- z_b*sigma_B
    ) , data=dat , chains=4 , cores=4 )

P_i ∼ Bernoulli(p_i)
logit(p_i) = ᾱ + (z_α,A[i])σ_A + (z_β,T[i],B[i])σ_B
z_α,j ∼ Normal(0, 1)
z_β,j ∼ Normal(0, 1)
ᾱ ∼ Normal(0, 1.5)
σ_A, σ_B ∼ Exponential(1)

Slide 54

Slide 54 text

(Same code and model as Slide 53.)

Slide 55

Slide 55 text

(Same code and model as Slide 53.)

Slide 56

Slide 56 text

(Same code and model as Slide 53.)

Slide 57

Slide 57 text

[Figure: effective sample size per parameter, centered model (x-axis) versus non-centered model (y-axis), roughly 500-2500; points above the diagonal mean the non-centered version did better, points below mean the centered version did better. Highlighted: sigma_A (n_eff = 672) and sigma_B (n_eff = 716).]

Slide 58

Slide 58 text

Practical Solutions

Research problems = technical problems
(1) Use more than one cluster type
(2) Calculate predictions
(3) Sample chains efficiently

Practice leads to mastery.

Slide 59

Slide 59 text

Course Schedule

Week 1   Bayesian inference                        Chapters 1, 2, 3
Week 2   Linear models & Causal Inference          Chapter 4
Week 3   Causes, Confounds & Colliders             Chapters 5 & 6
Week 4   Overfitting / MCMC                        Chapters 7, 8, 9
Week 5   Generalized Linear Models                 Chapters 10, 11
Week 6   Ordered categories & Multilevel models    Chapters 12 & 13
Week 7   More Multilevel models                    Chapters 13 & 14
Week 8   Multilevel models & Gaussian processes    Chapter 14
Week 9   Measurement & Missingness                 Chapter 15
Week 10  Generalized Linear Madness                Chapter 16

https://github.com/rmcelreath/stat_rethinking_2022

Slide 60

Slide 60 text

No content