FISH 6003: Week 5 - Model validation and selection

Week 5 lecture for FISH 6003

MI Fisheries Science

February 01, 2018
Transcript

  1. Chapter 5: Model selection and validation

    CatchRate_ij ~ Poisson(μ_ij)
    E(CatchRate) = μ_ij
    log(μ_ij) = GearType_ij + Temperature_ij + FleetDeployment_i
    FleetDeployment_i ~ N(0, σ²)
    Using lme4: m <- glmer(CatchRate ~ GearType + Temperature + (1 | FleetDeployment), family = poisson)
    FISH 6003: Statistics and Study Design for Fisheries. Brett Favaro
  2. Land Acknowledgment We would like to respectfully acknowledge the territory

    in which we gather as the ancestral homelands of the Beothuk, and the island of Newfoundland as the ancestral homelands of the Mi’kmaq and Beothuk. We would also like to recognize the Inuit of Nunatsiavut and NunatuKavut and the Innu of Nitassinan, and their ancestors, as the original people of Labrador. We strive for respectful partnerships with all the peoples of this province as we search for collective healing and true reconciliation and honour this beautiful land together. http://www.mun.ca/aboriginal_affairs/
  3. Confound vs. Collinearity • CONFOUND: A variable that is not

    a variable of interest distorts a relationship for which you are testing • COLLINEARITY: Variables in a model that are highly correlated with each other Confound example (Puig-Pons 2018): Y: Swimming speed; X: Body size; major influence, not measured or modelled: Temperature. Collinearity example: Y: Swimming speed; covariates: Body size, body weight
  4. Credit where it is due: This lecture heavily informed by:

    http://www.highstat.com/index.php/our-10-courses
  5. At this stage you know: • How to perform a

    data exploration • The basic logic of linear regression • How to run a linear model in R • How to interpret results of linear models • But how do we know if the model is describing our data?
  6. Wrecks_i = β0 + β1 Hurricanes_i + error • In

    our data exploration, we made note of a few things about the data • Count, bounded at zero (negative values impossible) • Observations were independent (good) • Against our better judgment, we fit a simple linear model. How can we test whether it fit? • The answer lies in the residuals
  7. Testing assumptions • Normality • Homogeneity of variance • Independence

    • Fixed X To properly assess these, we need multiple observations at every X We don’t have that, so we have to settle for looking at residuals
  8. Testing assumptions Normality 1. Plot a histogram of residuals 2.

    Use a Q-Q plot Should be approx. normal in shape Each dot is a ‘standardized residual’ If the points follow the line, it’s normally distributed If they don’t… divergence from normal distribution
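The normality checks above can be sketched in base R. This is a minimal example with simulated data; all variable names are invented for illustration.

```r
# Fit a simple linear model to fake data that meet the assumptions,
# then check normality of the residuals
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)
m <- lm(y ~ x)

res <- rstandard(m)                        # standardized residuals

hist(res, main = "Standardized residuals") # should look approx. normal
qqnorm(res)
qqline(res)                                # points should follow this line
```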
  9. Testing assumptions Homogeneity of Variance Plot residuals vs. fitted values

    If the plot shows a patternless “stars in the night” scatter (as in the fake data), you’d say homogeneity is confirmed. The actual data show a cone-shaped pattern: the homogeneity assumption is violated. Can we explain this pattern? Plot residuals against all covariates
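The residuals-vs-fitted check can be sketched as follows. The data are simulated so that the spread of the response grows with the covariate, producing the cone-shaped pattern described above; the numbers are made up.

```r
# Simulate heteroscedastic data: residual spread grows with x
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, sd = 0.2 + 0.5 * x)
m <- lm(y ~ x)

plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # a cone shape here = homogeneity violated

# Also plot residuals against each covariate (in the model or not)
plot(x, resid(m), xlab = "x", ylab = "Residuals")
```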
  10. Testing assumptions Common problems Homogeneity of Variance - Model is

    missing a covariate or polynomial term - Model fit is very poor at high values - Lots of low values, fewer high values (probably need a GLM, negative binomial distribution. Come back later)
  11. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation But: Log(0) is undefined Options: - Remove - Add a SMALL CONSTANT: small relative to magnitude of covariate (here, 0.01) - Don’t log transform
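The log(0) problem can be illustrated directly; the counts below are made up.

```r
# Hypothetical count data containing zeros
counts <- c(0, 3, 12, 0, 45, 7, 150, 2)

log(counts)                 # contains -Inf: log(0) is undefined

# Option: add a constant that is SMALL relative to the covariate's magnitude
log_counts <- log(counts + 0.01)
all(is.finite(log_counts))  # TRUE
```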
  12. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables Can’t do this for terns data because we have no other variables
  13. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables 3. Add an interaction term Can’t do this for terns data because we have no other variables
  14. 4. Add polynomial terms of the explanatory variable Didn’t really

    help much Also, you should have a good reason to do this Note that it caused both terms to be NS
  15. Solutions when homogeneity of variance violated (Zuur et al. 2007):

    1. Apply a transformation 2. Add explanatory variables 3. Add an interaction term 4. Add a polynomial 5. Fit a generalized linear model (later this course) 6. Fit a mixed model (later this course) 7. Fit a generalized additive model (Another course) 8. Allow for variance to change using generalized least squares (Another course) Don’t just try these randomly. Models are used to describe reality. You should have an a priori reason for applying a solution above: Not just “to improve model fit.”
  16. Independence Testing assumptions We assess independence of data points by

    looking at the study design We can also assess it by looking at residuals versus each covariate Here, we have just the one covariate But… things to watch out for: Temporal correlation Model misfit (e.g. influential covariate missing from the model)
  17. Independence Testing assumptions Solutions: Violation due to study design, i.e.

    pseudoreplication - Incorporate study design in model - E.g. If observations nested within ~ 5 or fewer “Sites” incorporate “Site” as a covariate - If more than 10 levels of “Site,” make it a ‘random effect’ term (come back in a few weeks) - E.g. If time series data, compute an “autocorrelation function” (Not sure if we will do this in the present course) - Extreme cases: If the study design is too flawed, give up. Descriptive stats and qualitative descriptions only Violation due to model misfit - Specify a new model - Include additional explanatory variables - Come back in GLM lecture
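Incorporating the study design can be sketched as follows. The "Site" structure and all numbers are invented; the random-effect version (shown as a comment) uses lme4, which is covered later in the course.

```r
# Hypothetical nested design: observations grouped within a few Sites
set.seed(8)
site <- factor(rep(c("A", "B", "C"), each = 10))
x    <- runif(30, 0, 10)
site_effect <- c(A = 0, B = 5, C = -3)[as.character(site)]
y    <- 2 + x + site_effect + rnorm(30)

# Few sites: include Site as a fixed covariate
m <- lm(y ~ x + site)
coef(m)

# Many sites: use a random intercept instead, e.g. with lme4 (sketch only):
# m <- lme4::lmer(y ~ x + (1 | site))
```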
  18. Fixed X Testing assumptions Do we believe that our X

    positions are accurate? If not… solution is beyond the scope of what we’re doing here. Try to avoid this problem by: - Designing a good study - Minimizing the number of steps between data collection and analysis - Recall the lionfish paper (Week 4), where the covariate was biomass per hectare. Uncertainty existed in: - Count of fish - Length of fish (visual estimate) - Length-weight relationship curve (more uncertainty added going from length → weight) - Converting count to density (due to estimation of transect area) - Instead, use the actual counts, thus eliminating three sources of X variation
  19. Also: Look for highly-influential points If something has a Cook’s

    distance > 1, it is considered an ‘outlier’ that could be having an undue impact on your model fit. The real data show no problem. If it looked like the fake data instead: - Tells us that this observation is highly influential - May or may not be a problem - Options: - Live with it - Exclude the outlier - Transform
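Cook's distance can be checked like this. The data are simulated with one deliberately planted influential point; everything here is made up for illustration.

```r
# Simulate clean data, then plant one highly influential observation
set.seed(3)
x <- runif(30, 0, 10)
y <- 1 + x + rnorm(30)
x[30] <- 30              # far outside the range of the other x values
y[30] <- 200             # and far off the trend
m <- lm(y ~ x)

cd <- cooks.distance(m)
which(cd > 1)            # observations with undue influence on the fit
plot(m, which = 4)       # Cook's distance plot
```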
  20. Coefficient of determination After model validation is complete: R2 is

    the proportion of total variance in Y explained by X R2 assesses this model in isolation and is the last thing you report. Do not use it to compare models. R2 can be high even in poor fits.
  21. Recap Linear models require: • Normality • Homogeneity of variance

    • Independence • Fixed X Test with: - Histogram of residuals, qqplot - Plot residuals vs fitted values - Common sense, plot residuals vs. covariates (those in the model, and those not) - Common sense If these are violated… - Some things can be dealt with statistically (transform, add a covariate, etc.) - Some things cannot (study design) - Document all decisions (including “ignore”)
  22. Variance Inflation Factor Recall that during data exploration, we identified

    collinear covariates and chose to eliminate some from the model You can also compute a Variance Inflation Factor after running a model Bottom line: VIF > 3 is bad VIF is essentially a metric of how much your overall model variance is inflated due to the presence of a covariate. High value = the information contained in this covariate is expressed by other variables We already looked for collinearity with pair plots, one relationship at a time. VIF operates in “higher dimensional space” – assesses collinearity as a whole
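In practice one would usually call car::vif() on a fitted model; the sketch below hand-rolls the same calculation (VIF = 1 / (1 − R²) from regressing each covariate on the others) using base R and simulated, made-up covariates.

```r
# Two deliberately collinear covariates plus one independent one
set.seed(4)
n <- 100
size   <- rnorm(n, 50, 10)
weight <- 0.8 * size + rnorm(n, sd = 2)   # strongly collinear with size
temp   <- rnorm(n, 12, 3)

# VIF: regress each covariate on all the others; 1 / (1 - R^2)
vif <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

v <- vif(data.frame(size, weight, temp))
v   # size and weight blow past the VIF > 3 threshold; temp does not
```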
  23. VIF > 3 is bad Options: 1. Sequentially remove highest

    VIF value. Re-fit model. (Bad idea here because ‘culling’ is our variable of primary interest!) 2. Leave the highest one in, remove the next-highest one. Re-fit. Consequence: We will miss significant effects
  24. VIF > 3 is bad Options: 1. Sequentially remove highest

    VIF value. Re-fit model. (Bad idea here because ‘culling’ is our variable of primary interest!) 2. Leave the highest one in, remove the next-highest one. Re-fit. Consequence: We will miss significant effects Let’s remove Culling:PercentageCover interaction
  25. Removing interactions reduced VIF Good practice: Check VIF prior to

    ‘model selection’ * Any time we change our model, we must re-validate!
  26. Standardization In multiple regression, we include variables that can have

    very different scales:
    Covariate: Range
    Shannon-Wiener index: 1.5 - 2.3
    Predator density: 0 - 240
    Percentage coral cover: 17 - 28
    Substrate height: 34.5 - 57
    In these cases, it is good practice to standardize covariates prior to analysis. Do this with the scale() function
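scale() can be sketched as below. The covariate and response values are simulated and purely illustrative; the point is that the slope changes from "per unit of x" to "per SD of x".

```r
# Simulated covariate on a wide scale, and a response that depends on it
set.seed(10)
pred_density <- runif(40, 0, 240)
prey <- 500 - 1.5 * pred_density + rnorm(40, sd = 20)

z <- scale(pred_density)       # centred (mean 0) and standardized (SD 1)
c(mean = mean(z), sd = sd(z))

raw_slope    <- coef(lm(prey ~ pred_density))[2]   # change per unit
scaled_slope <- coef(lm(prey ~ z))[2]              # change per SD
# scaled_slope equals raw_slope * sd(pred_density)
```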
  27. Covariate: Range

    Shannon-Wiener index: 1.5 - 2.3
    Predator density: 0 - 240
    Percentage coral cover: 17 - 28
    Substrate height: 34.5 - 57
    These are standardized AND centred – scale() does this by default. +1 means that covariate value is one SD above the mean (which is zero, when centred). A scaled covariate with value of 2 is 2 SD above the average covariate value. A scaled covariate with value -1 is 1 SD below the average covariate value
  28. For every one-unit increase in Shannon-Wiener index, PreyDensity dropped by

    258.9 units For every one-SD increase in Shannon-Wiener index, PreyDensity dropped by 60.5 units
  29. Should I standardize? Probably yes, if: • Covariates are on

    very different scales • Covariates are measuring very different physical things Probably not, if: • Some covariates have very high variability, and we think that is important in our model. • We have a ‘baseline’ level that remains unchanged across experiments • E.g. We are doing 10 experiments, in each case we use the same dosage of food Don’t just do this to improve model fit. Have an a priori reason for it
  30. Next: Model Selection • How do I decide what covariates

    to put on X? • Ecology is complicated.
  31. Principle of parsimony Our approach must balance the benefits of

    including terms with the cost of complexity If we look only at fit (as determined by R2) we will always favour models with many covariates (because each covariate will explain some of the variation) But then you would end up with very complex models that have little explanatory power We want to favour models with fewer terms – the principle of parsimony
  32. Terminology: Full vs. Nested models Full model: All variables under

    consideration included in the model Nested model: Anything less than above Don’t confuse ‘nested model’ with ‘nested study design’
  33. What do I include? Y: Prey density X: Culling, diversity

    (Shannon-Wiener), predator density, Coral cover, substrate height Option 1: Full model, interpret output as-is. No selection procedure.
  34. Option 1: Full model, interpret output as-is. No selection procedure.

    • May be appropriate in a controlled experiment with clear hypotheses • Tightly controlled lab experiment with well-developed hypothesis • Field study with one manipulation (e.g. fishing gear type) with no temporal or spatial autocorrelation (due to strong study design, and verified through data exploration) Even though there are non-significant terms, leave them in (but remove non-significant interactions) Pros: - Simplicity - No perception of “P-hacking” or other shenanigans
  35. But what if you have multiple competing hypotheses, each of

    which are ecologically sensible? Option 2: Hypothesis testing • Two methods: 1. Use the t-statistic: Highest p-value (Weakest evidence to reject H0 of no effect) Drop the term with the highest P-value and refit until all terms are ‘significant’
  36. Main effect NS… …but interaction IS! Awkward situation Must leave

    in PercentageCover, remove next most non-significant term Drop Substrate Height Final model. Validate and interpret
  37. Option 2: Hypothesis testing • Two methods: 1. Use the

    t-statistic • Falls apart if you have factors with more than two levels • Only works for linear models with normal distributions (i.e. doesn’t work with GLMs… learn about this later) Recall the Urchin paper: Some months “significant” others not Alternate method: 2. F-test
  38. F-test Compare a full model with a nested model H0

    – Models are equally good Highly significant P-value = model would be worse if we removed this term Notice how it doesn’t give us a value for Culling or PercentageCover alone. Why? Procedure: Remove term with highest P-value. Refit.
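The F-test comparison of a full and a nested model is done with anova() in R. The sketch below uses simulated data loosely mimicking the reef example; all names and numbers are made up.

```r
# Fake data: prey density depends on culling and coral cover (no interaction)
set.seed(5)
n <- 60
culling <- rep(c(0, 1), each = n / 2)
cover   <- runif(n, 17, 28)
prey    <- 100 - 30 * culling + 2 * cover + rnorm(n, sd = 5)

full   <- lm(prey ~ culling * cover)   # includes the interaction
nested <- lm(prey ~ culling + cover)   # same model minus the interaction

anova(nested, full)   # H0: the two models are equally good
```

The comparison yields one F and one p-value for the whole term removed, which is why it does not report a value for culling or cover alone.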
  39. Problems with Hypothesis testing 1. T-statistic 2. F-test • Lots

    of steps – multiple comparisons problem. 5% chance of being wrong at each step! • If there is collinearity you missed, then stepping backwards could lead you on the wrong path • Only works with nested models Backwards selection – Start from full model and work backwards Forward selection – Start from reduced model, work towards full
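Backwards selection can be automated with step() (AIC-based), which the slides caution against applying blindly. A sketch with simulated data, where only one covariate has a real effect:

```r
# Only x1 truly affects y; x2 and x3 are pure noise
set.seed(9)
n  <- 60
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 4 * x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)
back <- step(full, direction = "backward", trace = 0)
formula(back)   # the informative covariate survives the culling
```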
  40. Option 3: Goodness of Fit Use the “Akaike Information Criterion”

    (AIC) Score: lower is better. Difference > 2 = meaningful Additional terms in the model increase AIC. Punishes complexity AIC meaningless on its own. It is solely for comparison Can be used with nested and non-nested models
  41. Nested Not nested mod’s AIC is ~12 points higher than mod2’s

    Strong evidence that mod2 is superior AIC is exactly two points different between mod and mod2. Weak evidence mod2 is better. A) Pick the simpler model (mod2) and make that your Final model. Validate + interpret B) Conduct Model averaging. (Not in this course)
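An AIC comparison like the one above can be sketched with AIC(), which accepts nested or non-nested models as long as the response is the same. Data and model names here are invented.

```r
# mod carries an extra, uninformative covariate; mod2 is simpler
set.seed(6)
n  <- 80
x1 <- runif(n); x2 <- runif(n)
y  <- 3 + 2 * x1 + rnorm(n)

mod  <- lm(y ~ x1 + x2)
mod2 <- lm(y ~ x1)

AIC(mod, mod2)   # lower is better; a difference > 2 is meaningful
```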
  42. This is what the AIC score would be if you

    REMOVED this term If you removed the Interaction, AIC would be 285. If you remove Substrate Height AIC would be 280 (a better model) If you remove PredDensity, AIC would be 279 (equally good as removing Substrate Height)
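The output described above comes from drop1(), which reports, for each term, the AIC the model would have if that term were removed. A sketch with made-up data:

```r
# x2 contributes nothing, so dropping it should give a lower (better) AIC
# than dropping the genuinely informative x1
set.seed(7)
n  <- 80
x1 <- runif(n); x2 <- runif(n)
y  <- 3 + 5 * x1 + rnorm(n)

m <- lm(y ~ x1 + x2)
drop1(m)   # "<none>" row = the model with no term removed
```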
  43. Option 4: Information Theoretic Approach Put on your ecologist hat

    Declare hypotheses a priori (usually 10-15) Which has most support, via AIC?
  44. Key Points • Stepwise selection is a process that can

    be informed via t or F-tests. • Only works with nested models • AIC gives information, but does not tell you what to do with it. • Can work with nested or non-nested models. As long as the same thing is on Y, you’re good to go • Information Theoretic Approach • Is a process, informed by AIC
  45. Workflow: Load and verify dataset → Data exploration → Define model selection approach → Fit model → Validate → Report

    • Acquire from your own study, from a database, or from another paper
    • Who, what, where, when, why? (metadata)
    • How much data? What is included? Who owns it? What are you allowed to do with it?
    • Create R project with sensible folder layout
    • Make the data tidy: one row = one obs; one column = one var; one cell = one value
    • Verify data integrity: Did it load correctly? Are data types what they should be? Are there impossible values? Are factor levels correct?
    • Data exploration: follow Zuur et al., 2010: 1. Outliers Y and X 2. Homogeneity Y 3. Normality Y 4. Zero trouble Y 5. Collinearity X 6. Relationships Y and X 7. Interactions 8. Independence Y. Note especially any dependency structure
    • Model selection approach: “as-is”, hypothesis testing, goodness of fit, or information-theoretic
    • Validate: check assumptions of the model you’re using. For regular LMs: normality of residuals, homogeneity of variance of residuals, independence, fixed X
    • Any time the model changes, re-validate
  46. So far, you know how to:

    1. State an ecological question
    2. Draw the study design: note dependency structure, sample sizes per treatment, etc.
    3. Prep the data: keep it tidy; check all variables for typos, entry errors, impossible values; note the types of data (counts, continuous measurement, yes/no, proportions, Likert, etc.); calculate derived quantities from raw data
    4. Data exploration: outliers Y and X; homogeneity Y; normality Y; zeroes Y; collinearity X; relationships Y and X; interactions; independence Y
    5. Present the statistical model: write the equation and type of model; cite a reference paper that explains the model, and the package you will use to run it; decide whether to standardize covariates
    6. Perform model selection. Or don’t. But own it.
    7. Fit the final model
    8. Validate the model: check VIF, remove collinear covariates, refit if necessary; check assumptions (for linear models: normality, homogeneity of variance, independence, fixed X)
    Report model output in the paper; show exploration and validation as an appendix. The process may be interrupted by discoveries along the way (violations, etc.)