FISH 6003: Week 5 - Model validation and selection

Week 5 lecture for FISH 6003

MI Fisheries Science

February 01, 2018
Transcript

  1. Chapter 5: Model selection and validation

    CatchRate_ij ~ Poisson(μ_ij)
    E(CatchRate) = μ_ij
    log(μ_ij) = GearType_ij + Temperature_ij + FleetDeployment_i
    FleetDeployment_i ~ N(0, σ²)
    Using lme4: m <- glmer(CatchRate ~ GearType + Temperature + (1 | FleetDeployment), family = poisson)
    FISH 6003: Statistics and Study Design for Fisheries. Brett Favaro
  2. Land Acknowledgment We would like to respectfully acknowledge the territory

    in which we gather as the ancestral homelands of the Beothuk, and the island of Newfoundland as the ancestral homelands of the Mi’kmaq and Beothuk. We would also like to recognize the Inuit of Nunatsiavut and NunatuKavut and the Innu of Nitassinan, and their ancestors, as the original people of Labrador. We strive for respectful partnerships with all the peoples of this province as we search for collective healing and true reconciliation and honour this beautiful land together. http://www.mun.ca/aboriginal_affairs/
  3. Confound vs. Collinearity • CONFOUND: A variable that is not

    a variable of interest distorts a relationship for which you are testing • COLLINEARITY: Variables in a model that are highly correlated with each other Confound example (Puig-Pons 2018): Y: Swimming speed; X: Body size; major influence, not measured or modelled: Temperature. Collinearity example: Y: Swimming speed; covariates: Body size, body weight
  4. Credit where it is due: This lecture heavily informed by:

    http://www.highstat.com/index.php/our-10-courses
  5. At this stage you know: • How to perform a

    data exploration • The basic logic of linear regression • How to run a linear model in R • How to interpret results of linear models • But how do we know if the model is describing our data?
  6. Wrecks_i = β0 + β1 Hurricanes_i + error • In

    our data exploration, we made note of a few things about the data • Count, bounded at zero (negative values impossible) • Observations were independent (good) • Against our better judgment, we fit a simple linear model. How can we test whether it fit? • The answer lies in the residuals
  7. Testing assumptions • Normality • Homogeneity of variance • Independence

    • Fixed X To properly assess these, we need multiple observations at every X We don’t have that, so we have to settle for looking at residuals
  8. Testing assumptions Normality 1. Plot a histogram of residuals 2.

    Use a Q-Q plot Should be approx. normal in shape Each dot is a ‘standardized residual’ If the points follow the line, it’s normally distributed If they don’t… divergence from normal distribution
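The normality checks above can be sketched in base R. This is a minimal example with simulated data; all variable names are invented for illustration.

```r
# Fit a simple linear model to fake data that meet the assumptions,
# then check normality of the residuals
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)
m <- lm(y ~ x)

res <- rstandard(m)                        # standardized residuals

hist(res, main = "Standardized residuals") # should look approx. normal
qqnorm(res)
qqline(res)                                # points should follow this line
```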
  9. Testing assumptions Homogeneity of Variance Plot residuals vs. fitted values

    If the plot shows a patternless “stars in the night” scatter (as in the fake data), you’d say homogeneity is confirmed. The actual data show a cone-shaped pattern: the homogeneity assumption is violated. Can we explain this pattern? Plot residuals against all covariates
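The residuals-vs-fitted check can be sketched as follows. The data are simulated so that the spread of the response grows with the covariate, producing the cone-shaped pattern described above; the numbers are made up.

```r
# Simulate heteroscedastic data: residual spread grows with x
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, sd = 0.2 + 0.5 * x)
m <- lm(y ~ x)

plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # a cone shape here = homogeneity violated

# Also plot residuals against each covariate (in the model or not)
plot(x, resid(m), xlab = "x", ylab = "Residuals")
```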
  10. Testing assumptions Common problems Homogeneity of Variance - Model is

    missing a covariate or polynomial term - Model fit is very poor at high values - Lots of low values, fewer high values (probably need a GLM, negative binomial distribution. Come back later)
  11. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation But: Log(0) is undefined Options: - Remove - Add a SMALL CONSTANT: small relative to magnitude of covariate (here, 0.01) - Don’t log transform
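The log(0) problem can be illustrated directly; the counts below are made up.

```r
# Hypothetical count data containing zeros
counts <- c(0, 3, 12, 0, 45, 7, 150, 2)

log(counts)                 # contains -Inf: log(0) is undefined

# Option: add a constant that is SMALL relative to the covariate's magnitude
log_counts <- log(counts + 0.01)
all(is.finite(log_counts))  # TRUE
```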
  12. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables Can’t do this for terns data because we have no other variables
  13. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables 3. Add an interaction term Can’t do this for terns data because we have no other variables
  14. 4. Add polynomial terms of the explanatory variable Didn’t really

    help much Also, you should have a good reason to do this Note that it caused both terms to be NS
  15. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables 3. Add an interaction term 4. Add a polynomial 5. Fit a generalized linear model (later this course) 6. Fit a mixed model (later this course) 7. Fit a generalized additive model (Another course) 8. Allow for variance to change using generalized least squares (Another course) Don’t just try these randomly. Models are used to describe reality. You should have an a priori reason for applying a solution above: Not just “to improve model fit.”
  16. Independence Testing assumptions We assess independence of data points by

    looking at the study design We can also assess it by looking at residuals versus each covariate Here, we have just the one covariate But… things to watch out for: Temporal correlation Model misfit (e.g. influential covariate missing from the model)
  17. Independence Testing assumptions Solutions: Violation due to study design, i.e.

    pseudoreplication - Incorporate study design in model - E.g. If observations nested within ~ 5 or fewer “Sites” incorporate “Site” as a covariate - If more than 10 levels of “Site,” make it a ‘random effect’ term (come back in a few weeks) - E.g. If time series data, compute an “autocorrelation function” (Not sure if we will do this in the present course) - Extreme cases: If the study design is too flawed, give up. Descriptive stats and qualitative descriptions only Violation due to model misfit - Specify a new model - Include additional explanatory variables - Come back in GLM lecture
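Incorporating the study design can be sketched as follows. The "Site" structure and all numbers are invented; the random-effect version (shown as a comment) uses lme4, which is covered later in the course.

```r
# Hypothetical nested design: observations grouped within a few Sites
set.seed(8)
site <- factor(rep(c("A", "B", "C"), each = 10))
x    <- runif(30, 0, 10)
site_effect <- c(A = 0, B = 5, C = -3)[as.character(site)]
y    <- 2 + x + site_effect + rnorm(30)

# Few sites: include Site as a fixed covariate
m <- lm(y ~ x + site)
coef(m)

# Many sites: use a random intercept instead, e.g. with lme4 (sketch only):
# m <- lme4::lmer(y ~ x + (1 | site))
```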
  18. Fixed X Testing assumptions Do we believe that our X

    positions are accurate? If not… solution is beyond the scope of what we’re doing here. Try to avoid this problem by: - Designing a good study - Minimizing the number of steps between data collection and analysis - Recall the lionfish paper (Week 4), where the covariate was biomass per hectare. Uncertainty existed in: - Count of fish - Length of fish (visual estimate) - Length-weight relationship curve (more uncertainty added going from length → weight) - Converting count to density (due to estimation of transect area) - Instead, use the actual counts, thus eliminating three sources of X variation
  19. Also: Look for highly-influential points If something has a Cook’s

    distance > 1, it is considered an ‘outlier’ that could be having an undue impact on your model fit. The real data show no problem. If it looked like the fake data instead: - Tells us that this observation is highly influential - May or may not be a problem - Options: - Live with it - Exclude the outlier - Transform
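Cook's distance can be checked like this. The data are simulated with one deliberately planted influential point; everything here is made up for illustration.

```r
# Simulate clean data, then plant one highly influential observation
set.seed(3)
x <- runif(30, 0, 10)
y <- 1 + x + rnorm(30)
x[30] <- 30              # far outside the range of the other x values
y[30] <- 200             # and far off the trend
m <- lm(y ~ x)

cd <- cooks.distance(m)
which(cd > 1)            # observations with undue influence on the fit
plot(m, which = 4)       # Cook's distance plot
```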
  20. Coefficient of determination After model validation is complete: R2 is

    the proportion of total variance in Y explained by X R2 assesses this model in isolation and is the last thing you report. Do not use it to compare models. R2 can be high even in poor fits.
  21. Recap Linear models require: • Normality • Homogeneity of variance

    • Independence • Fixed X Test with: - Histogram of residuals, qqplot - Plot residuals vs fitted values - Common sense, plot residuals vs. covariates (those in the model, and those not) - Common sense If these are violated… - Some things can be dealt with statistically (transform, add a covariate, etc.) - Some things cannot (study design) - Document all decisions (including “ignore”)
  22. Variance Inflation Factor Recall that during data exploration, we identified

    collinear covariates and chose to eliminate some from the model You can also compute a Variance Inflation Factor after running a model Bottom line: VIF > 3 is bad VIF is essentially a metric of how much your overall model variance is inflated due to the presence of a covariate. High value = the information contained in this covariate is expressed by other variables We already looked for collinearity with pair plots, one relationship at a time. VIF operates in “higher dimensional space” – assesses collinearity as a whole
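In practice one would usually call car::vif() on a fitted model; the sketch below hand-rolls the same calculation (VIF = 1 / (1 − R²) from regressing each covariate on the others) using base R and simulated, made-up covariates.

```r
# Two deliberately collinear covariates plus one independent one
set.seed(4)
n <- 100
size   <- rnorm(n, 50, 10)
weight <- 0.8 * size + rnorm(n, sd = 2)   # strongly collinear with size
temp   <- rnorm(n, 12, 3)

# VIF: regress each covariate on all the others; 1 / (1 - R^2)
vif <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

v <- vif(data.frame(size, weight, temp))
v   # size and weight blow past the VIF > 3 threshold; temp does not
```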
  23. VIF > 3 is bad Options: 1. Sequentially remove highest

    VIF value. Re-fit model. (Bad idea here because ‘culling’ is our variable of primary interest!) 2. Leave the highest one in, remove the next-highest one. Re-fit. Consequence: We will miss significant effects
  24. VIF > 3 is bad Options: 1. Sequentially remove highest

    VIF value. Re-fit model. (Bad idea here because ‘culling’ is our variable of primary interest!) 2. Leave the highest one in, remove the next-highest one. Re-fit. Consequence: We will miss significant effects Let’s remove Culling:PercentageCover interaction
  25. Removing interactions reduced VIF Good practice: Check VIF prior to

    ‘model selection’ * Any time we change our model, we must re-validate!
  26. Standardization In multiple regression, we include variables that can have

    very different scales:
    Covariate: Range
    Shannon-Wiener index: 1.5 - 2.3
    Predator density: 0 - 240
    Percentage coral cover: 17 - 28
    Substrate height: 34.5 - 57
    In these cases, it is good practice to standardize covariates prior to analysis. Do this with the scale() function
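scale() can be sketched as below. The covariate and response values are simulated and purely illustrative; the point is that the slope changes from "per unit of x" to "per SD of x".

```r
# Simulated covariate on a wide scale, and a response that depends on it
set.seed(10)
pred_density <- runif(40, 0, 240)
prey <- 500 - 1.5 * pred_density + rnorm(40, sd = 20)

z <- scale(pred_density)       # centred (mean 0) and standardized (SD 1)
c(mean = mean(z), sd = sd(z))

raw_slope    <- coef(lm(prey ~ pred_density))[2]   # change per unit
scaled_slope <- coef(lm(prey ~ z))[2]              # change per SD
# scaled_slope equals raw_slope * sd(pred_density)
```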
  27. Covariate: Range

    Shannon-Wiener index: 1.5 - 2.3
    Predator density: 0 - 240
    Percentage coral cover: 17 - 28
    Substrate height: 34.5 - 57
    These are standardized AND centred – scale() does this by default. +1 means that covariate value is one SD above the mean (which is zero, when centred). A scaled covariate with value of 2 is 2 SD above the average covariate value. A scaled covariate with value -1 is 1 SD below the average covariate value
  28. For every one-unit increase in Shannon-Wiener index, PreyDensity dropped by

    258.9 units For every one-SD increase in Shannon-Wiener index, PreyDensity dropped by 60.5 units
  29. Should I standardize? Probably yes, if: • Covariates are on

    very different scales • Covariates are measuring very different physical things Probably not, if: • Some covariates have very high variability, and we think that is important in our model. • We have a ‘baseline’ level that remains unchanged across experiments • E.g. We are doing 10 experiments, in each case we use the same dosage of food Don’t just do this to improve model fit. Have an a priori reason for it
  30. Next: Model Selection • How do I decide what covariates

    to put on X? • Ecology is complicated.
  31. Principle of parsimony Our approach must balance the benefits of

    including terms with the cost of complexity If we look only at fit (as determined by R2) we will always favour models with many covariates (because each covariate will explain some of the variation) But then you would end up with very complex models that have little explanatory power We want to favour models with fewer terms – the principle of parsimony
  32. Terminology: Full vs. Nested models Full model: All variables under

    consideration included in the model Nested model: Anything less than above Don’t confuse ‘nested model’ with ‘nested study design’
  33. What do I include? Y: Prey density X: Culling, diversity

    (Shannon-Wiener), predator density, Coral cover, substrate height Option 1: Full model, interpret output as-is. No selection procedure.
  34. Option 1: Full model, interpret output as-is. No selection procedure.

    • May be appropriate in a controlled experiment with clear hypotheses • Tightly controlled lab experiment with well-developed hypothesis • Field study with one manipulation (e.g. fishing gear type) with no temporal or spatial autocorrelation (due to strong study design, and verified through data exploration) Even though there are non-significant terms, leave them in (but remove non-significant interactions) Pros: - Simplicity - No perception of “P-hacking” or other shenanigans
  35. But what if you have multiple competing hypotheses, each of

    which are ecologically sensible? Option 2: Hypothesis testing • Two methods: 1. Use the t-statistic: Highest p-value (Weakest evidence to reject H0 of no effect) Drop the term with the highest P-value and refit until all terms are ‘significant’
  36. Main effect NS… …but interaction IS! Awkward situation Must leave

    in PercentageCover, remove next most non-significant term Drop Substrate Height Final model. Validate and interpret
  37. Option 2: Hypothesis testing • Two methods: 1. Use the

    t-statistic • Falls apart if you have factors with more than two levels • Only works for linear models with normal distributions (i.e. doesn’t work with GLMs… learn about this later) Recall the Urchin paper: Some months “significant” others not Alternate method: 2. F-test
  38. F-test Compare a full model with a nested model H0

    – Models are equally good Highly significant P-value = model would be worse if we removed this term Notice how it doesn’t give us a value for Culling or PercentageCover alone. Why? Procedure: Remove term with highest P-value. Refit.
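The F-test comparison of a full and a nested model is done with anova() in R. The sketch below uses simulated data loosely mimicking the reef example; all names and numbers are made up.

```r
# Fake data: prey density depends on culling and coral cover (no interaction)
set.seed(5)
n <- 60
culling <- rep(c(0, 1), each = n / 2)
cover   <- runif(n, 17, 28)
prey    <- 100 - 30 * culling + 2 * cover + rnorm(n, sd = 5)

full   <- lm(prey ~ culling * cover)   # includes the interaction
nested <- lm(prey ~ culling + cover)   # same model minus the interaction

anova(nested, full)   # H0: the two models are equally good
```

The comparison yields one F and one p-value for the whole term removed, which is why it does not report a value for culling or cover alone.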
  39. Problems with Hypothesis testing 1. T-statistic 2. F-test • Lots

    of steps – multiple comparisons problem. 5% chance of being wrong at each step! • If there is collinearity you missed, then stepping backwards could lead you on the wrong path • Only works with nested models Backwards selection – Start from full model and work backwards Forward selection – Start from reduced model, work towards full
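Backwards selection can be automated with step() (AIC-based), which the slides caution against applying blindly. A sketch with simulated data, where only one covariate has a real effect:

```r
# Only x1 truly affects y; x2 and x3 are pure noise
set.seed(9)
n  <- 60
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 4 * x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)
back <- step(full, direction = "backward", trace = 0)
formula(back)   # the informative covariate survives the culling
```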
  40. Option 3: Goodness of Fit Use the “Akaike Information Criterion”

    (AIC) Score: lower is better. Difference > 2 = meaningful Additional terms in the model increase AIC. Punishes complexity AIC meaningless on its own. It is solely for comparison Can be used with nested and non-nested models
  41. Nested Not nested mod’s AIC is ~12 points higher than mod2’s

    Strong evidence that mod2 is superior AIC is exactly two points different between mod and mod2. Weak evidence mod2 is better. A) Pick the simpler model (mod2) and make that your Final model. Validate + interpret B) Conduct Model averaging. (Not in this course)
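An AIC comparison like the one above can be sketched with AIC(), which accepts nested or non-nested models as long as the response is the same. Data and model names here are invented.

```r
# mod carries an extra, uninformative covariate; mod2 is simpler
set.seed(6)
n  <- 80
x1 <- runif(n); x2 <- runif(n)
y  <- 3 + 2 * x1 + rnorm(n)

mod  <- lm(y ~ x1 + x2)
mod2 <- lm(y ~ x1)

AIC(mod, mod2)   # lower is better; a difference > 2 is meaningful
```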
  42. This is what the AIC score would be if you

    REMOVED this term If you removed the Interaction, AIC would be 285. If you remove Substrate Height AIC would be 280 (a better model) If you remove PredDensity, AIC would be 279 (equally good as removing Substrate Height)
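The output described above comes from drop1(), which reports, for each term, the AIC the model would have if that term were removed. A sketch with made-up data:

```r
# x2 contributes nothing, so dropping it should give a lower (better) AIC
# than dropping the genuinely informative x1
set.seed(7)
n  <- 80
x1 <- runif(n); x2 <- runif(n)
y  <- 3 + 5 * x1 + rnorm(n)

m <- lm(y ~ x1 + x2)
drop1(m)   # "<none>" row = the model with no term removed
```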
  43. Option 4: Information Theoretic Approach Put on your ecologist hat

    Declare hypotheses a priori (usually 10-15) Which has most support, via AIC?
  44. Key Points • Stepwise selection is a process that can

    be informed via t or F-tests. • Only works with nested models • AIC gives information, but does not tell you what to do with it. • Can work with nested or non-nested models. As long as the same thing is on Y, you’re good to go • Information Theoretic Approach • Is a process, informed by AIC
  45. Workflow: Load and verify dataset → Data exploration → Define model selection approach → Fit model → Validate → Report

    • Acquire from your own study, from a database, or from another paper
    • Who, what, where, when, why? (metadata)
    • How much data? What is included? Who owns it? What are you allowed to do with it?
    • Create R project with sensible folder layout
    • Make the data tidy: one row = one obs; one column = one var; one cell = one value
    • Verify data integrity: Did it load correctly? Are data types what they should be? Are there impossible values? Are factor levels correct?
    • Data exploration: follow Zuur et al., 2010: 1. Outliers Y and X 2. Homogeneity Y 3. Normality Y 4. Zero trouble Y 5. Collinearity X 6. Relationships Y and X 7. Interactions 8. Independence Y. Note especially any dependency structure
    • Model selection approach: “as-is”, hypothesis testing, goodness of fit, or information-theoretic
    • Validate: check assumptions of the model you’re using. For regular LMs: normality of residuals, homogeneity of variance of residuals, independence, fixed X
    • Any time the model changes, re-validate
  46. So far, you know how to:

    1. State an ecological question
    2. Draw the study design: note dependency structure, sample sizes per treatment, etc.
    3. Prep the data: keep it tidy; check all variables for typos, entry errors, impossible values; note the types of data (counts, continuous measurement, yes/no, proportions, Likert, etc.); calculate derived quantities from raw data
    4. Data exploration: outliers Y and X; homogeneity Y; normality Y; zeroes Y; collinearity X; relationships Y and X; interactions; independence Y
    5. Present the statistical model: write the equation and type of model; cite a reference paper that explains the model, and the package you will use to run it; decide whether to standardize covariates
    6. Perform model selection. Or don’t. But own it.
    7. Fit the final model
    8. Validate the model: check VIF, remove collinear covariates, refit if necessary; check assumptions (for linear models: normality, homogeneity of variance, independence, fixed X)
    Report model output in the paper; show exploration and validation as an appendix. The process may be interrupted by discoveries along the way (violations, etc.)