FISH 6002: Week 2 - Introduction to R

Updated Sept 16 2019

MI Fisheries Science

September 15, 2017

Transcript

  1. Week 2: Introduction to R Statistical Software
    FISH 6000: Science Communication for Fisheries
    Brett Favaro, 2017
    This work is licensed under a Creative Commons Attribution 4.0 International License
  2. The overwhelming majority of fisheries papers involve analysis of quantitative data
    R is the environment in which you:
    - Explore data
    - Make plots and figures
    - Conduct statistical analysis
  3. R is:
    - Free and open source
    - Industry standard
    - Community-supported
    - Expandable
    - Based on scripting
  4. R is:
    - Free and open source
    “R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.”
    “The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.”
    From r-project.org:
    https://www.r-project.org/foundation/
    https://www.r-project.org/foundation/Rfoundation-statutes.pdf
  5. R is:
    - Free and open source
    - Industry standard
    http://blog.revolutionanalytics.com/2017/02/job-trends-for-r-and-python.html
    http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages
  6. R is:
    - Free and open source
    - Industry standard
    - Community-supported
    Its popularity means there is a large community of users, and a huge base of helpful resources online
    You have to actively learn R
    • Books (see syllabus)
    • https://www.r-bloggers.com/
    • http://stackoverflow.com/questions/tagged/r
    • http://www.introductoryr.co.uk/R_Resources_for_Beginners.html
    • https://learnr.wordpress.com/
    • http://blog.revolutionanalytics.com/2013/05/top-3-r-resources-for-beginners.html
  7. R is:
    - Free and open source
    - Industry standard
    - Community-supported
    - Expandable
    Packages are pieces of software you install into R that expand its functionality
    Packages are free
  8. Why care about expandability? Why not just use a new program each time?
    • In university, you have access to paid software. That may not be the case after you graduate
    • Invest the time into learning R, and it enables you to grow into more skills. Invest the time into a specific program, and all you learn is that program
    • Packages often get improved over time by the community, and you always get access to the improved version (contrast with SPSS)
  9. [Figure: complexity scale running Basic, Moderate, Complex]
    While basic stats and graphs can be made in Excel, SPSS, etc., you will hit a wall
    Most fisheries research occurs beyond this point
    With R, the wall is your skillfulness, not the software environment
  10. R is:
    - Free and open source
    - Industry standard
    - Community-supported
    - Expandable
    - Based on scripting
    This is probably the most important aspect of R
    Scripts are sets of instructions given to the computer that make it do a task. Scripts are recorded in plain text files with the file extension .R
  11. Concept: RStudio is an integrated development environment (IDE) for R
    Basic: It makes your code easier to read
    Advanced: It adds some advanced features (projects, Rpubs and Markdown, and a few other things)
    https://www.rstudio.com/
    RStudio extends the functionality of R, and makes it easier to use
  12. You will spend 95% of your time in RStudio (5% in Excel, saving CSV files)
    • You rarely enter data into the console. You mostly spend your time writing and running scripts
    “Brett, this is dumb. My stats are going to be simple. I should just use Excel and save time!”
    The problem is that Excel isn’t really easier:
  13. Irrelevant fluff (bad for your brain!)
    • Where’s the code? Have to click on a cell to find it
    • Is it pointing to the right cells?
    • Data and outputs are mixed together in a single sheet
  14. # A simple script
      a <- 2
      b <- 2
      a + b
      [1] 4
    • Comment explains what you’re doing
    • Variables named, values defined
    • Operation clearly stated
    • Output separate from data
  15. # A simple script
      a <- 2
      b <- 2
      a + b
      [1] 4
    This code is generalizable and can be easily adapted to a new problem
    Process is clear. Output is clear. Show it to anyone and they’ll immediately understand it. Come back 10 years from now, and YOU’LL understand it.
    In Excel, 10 years from now: what does this mean (if you can open it)?
    Also, Excel tables are rigid. Move a cell and…
    Worst part: Very hard to error-check
  16. As a scientist, you have to use the right tool for the job
    • BAD: “I’m not familiar with X so I’m not going to try it.”
    • GOOD: “I am aware of X, and have determined it is not the best course of action because Y”
  17. R is a calculator
    In the console:
        > 2
        [1] 2
        > 2 + 2
        [1] 4
        > 2*(2 + 2)
        [1] 8
    Mathematical operators: + - * / ^ log(x) ()
    Full list: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
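The operators listed above can be tried together in one short script. This is only an illustrative sketch; the variable names are made up for the example and do not come from the slides.

```r
# Arithmetic follows the usual order of operations
no_brackets   <- 2 + 2 * 3      # multiplication happens first: 8
with_brackets <- (2 + 2) * 3    # brackets happen first: 12
power         <- 2^3            # exponent: 8

# log() is the natural logarithm unless you give it a base
natural_log <- log(100)             # about 4.605
log_base_10 <- log(100, base = 10)  # 2

no_brackets
with_brackets
power
natural_log
log_base_10
```

Note that function names are case sensitive: `log(x)` works, `Log(x)` does not.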
  18. R is a programming language
    In the console:
        > X <- 1 # Define X, give it value of 1
        > X # What is the value of X?
        [1] 1
    • Anything after a # is a “comment” and is not run
    • R is case sensitive. X and x are two different variables
    • Standard practice in R is to assign value with <-
  19. > X <- 1 # Define X, give it value of 1
      > X == 2 # Is X equal to 2?
      [1] FALSE
      > X <- 1 # Define X, give it value of 1
      > X != 2 # Is X NOT equal to 2?
      [1] TRUE
    Logical operators: < > == != >= <=
    Not: !
    And: &
    Or: |
    Exclusive or: xor(x, y)
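As a rough sketch of how the logical operators combine, the script below reuses the slide's X and stores each result in a variable. The result names are hypothetical, chosen only for this example.

```r
X <- 1  # Define X, give it value of 1

less_than_two <- X < 2                # TRUE
both_true     <- X > 0 & X < 1        # FALSE: the second condition fails
either_true   <- X == 1 | X == 2      # TRUE: the first condition holds
negated       <- !(X == 1)            # FALSE: X is equal to 1
exactly_one   <- xor(X == 1, X == 2)  # TRUE: only one side is TRUE

less_than_two
both_true
either_true
negated
exactly_one
```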
  20. > X <- 1 # Define X, give it value of 1
      > X + 1
      [1] 2
      > X # What is the value of X now?
  21. > X <- 1 # Define X, give it value of 1
      > X + 1
      [1] 2
      > X # What is the value of X now?
      [1] 1
      > X <- X + 1
      > X
      [1] 2
  22. Always work in a script
    So far we’ve been typing into the console. Now it’s time to make a SCRIPT.
    Highlight the lines of code you want to run, then press Ctrl+Enter
  23. Scripts
    • Contain all the instructions that you give to the computer to load data into memory, manage and manipulate data, perform operations, make figures etc.
    • Write code for readability and repeatability
    • Being super efficient is less important than being clear (with small to medium-sized data)
    • Use lots of comments
    • 10 years from now, will I understand my code?
  24. Variables
    Variables are storage units on which we can perform operations
        number_of_fish <- 1
        fishname <- "Salmon" # Note the quotation marks
        number_of_fish + number_of_fish # What does this give?
        fishname + number_of_fish # Now what?
    You can perform operations, but the operations have to make sense
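To see what "the operations have to make sense" looks like in practice, here is a small sketch using the slide's two variables. Wrapping the failing line in `tryCatch()` is an addition for this example only, so the script keeps running and the error text can be inspected; `total` and `attempt` are hypothetical names.

```r
number_of_fish <- 1
fishname <- "Salmon"  # Text must be wrapped in quotation marks

# Adding two numbers works
total <- number_of_fish + number_of_fish
total

# Adding text to a number does not. tryCatch() captures the error
# message instead of stopping the script.
attempt <- tryCatch(
  fishname + number_of_fish,
  error = function(e) conditionMessage(e)
)
attempt

# class() shows why: the two variables hold different data types
class(number_of_fish)
class(fishname)
```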
  25. Naming variables, functions, etc
    Be meaningful
        Good: lobster_catch, temperature
        Bad: A, num
    Be concise
        Good: catch_rate
        Bad: The_number_of_lobsters_that_I_caught_during_my_study
    Avoid typos (seriously!)
        Good: dollar_value
        Bad: dolar_value, doller_value, d0llar_value
    Retain meta-data! Use comments, or keep a text file with the R script. Note things like: Units!
  26. Let’s roll two dice ten times, and plot the results. First, create the variables:
        Die1 <- 0
        Die2 <- 0
        NumRolls <- 10 # Here, we specify number of rolls
        Die1 <- sample(1:6, 10, replace = TRUE)
        Die2 <- sample(1:6, 10, replace = TRUE)
        plot(Die1 + Die2) # Basic plotting command, more later
                          # Gives the combined dice value for each of the ten rolls
  27. Let’s roll two dice ten times, and plot the results. First, create the variables:
        Die1 <- 0
        Die2 <- 0
        NumRolls <- 10 # Here, we specify number of rolls
        Die1 <- sample(1:6, 10, replace = TRUE)
        Die2 <- sample(1:6, 10, replace = TRUE)
    • 1:6 means values from 1 to 6 inclusive
    • 10 means do it NumRolls (i.e. 10) times
    • replace = TRUE means sample with replacement
    Important: Use ? to get help
        ?sample
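The slide's code types the number 10 directly into `sample()`. A small variation, sketched below, passes `NumRolls` instead so the number of rolls only has to be changed in one place. The call to `set.seed()` and the seed value 42 are assumptions added here so the "random" rolls repeat each time the script runs; they are not part of the original slides.

```r
set.seed(42)    # Assumed seed: any fixed number makes the rolls repeatable
NumRolls <- 10  # Here, we specify number of rolls

# Roll each die NumRolls times, sampling values 1 to 6 with replacement
Die1 <- sample(1:6, NumRolls, replace = TRUE)
Die2 <- sample(1:6, NumRolls, replace = TRUE)

length(Die1)        # How many rolls did we get?
range(Die1 + Die2)  # Smallest and largest combined roll
```

Changing `NumRolls` to 100 and re-running the script rolls both dice 100 times without touching any other line.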
  28. Let’s roll two dice ten times, and plot the results. First, create the variables:
        Die1 <- 0
        Die2 <- 0
        NumRolls <- 10 # Here, we specify number of rolls
        Die1 <- sample(1:6, 10, replace = TRUE)
        Die2 <- sample(1:6, 10, replace = TRUE)
        plot(Die1 + Die2) # Basic plotting command, more later
                          # Gives the combined dice value for each of the ten rolls
  29. Die1
      [1] 3 1 2 5 2 2 5 4 5 1
      head(Die1)
      [1] 3 1 2 5 2 2
      summary(Die1)
      Die1[2] # Die1 is a ONE DIMENSIONAL vector with ten values. What’s in position 2?
      [1] 1
  30. You can perform operations on vectors
        Die1
        [1] 3 1 2 5 2 2 5 4 5 1
        Die2
        [1] 5 2 3 2 2 2 4 5 5 3
        Die1 + Die2
        [1] 8 3 5 7 4 4 9 9 10 4
        Die1 * Die2
        [1] 15 2 6 10 4 4 20 20 25 3
        Die1 == Die2
        [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
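Building on the vector operations above, the sketch below types in the same ten values shown on the slide (rather than re-rolling) and combines indexing with logical comparisons. The result names such as `high_rolls` are hypothetical and used only for this example.

```r
# The same rolls shown on the slide, entered by hand
Die1 <- c(3, 1, 2, 5, 2, 2, 5, 4, 5, 1)
Die2 <- c(5, 2, 3, 2, 2, 2, 4, 5, 5, 3)

first_and_third <- Die1[c(1, 3)]       # Positions 1 and 3: 3 2
high_rolls      <- Die1[Die1 > 4]      # Only the values above 4: 5 5 5
num_doubles     <- sum(Die1 == Die2)   # TRUE counts as 1, so this is 3
double_rolls    <- which(Die1 == Die2) # Rolls where the dice match: 5 6 9
mean_total      <- mean(Die1 + Die2)   # Average combined roll: 6.3

first_and_third
high_rolls
num_doubles
double_rolls
mean_total
```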
  31. Naming variables, functions, etc
    Bååth, R (2012). The state of naming conventions in R: https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf
    • alllowercase
    • period.separated
    • underscore_separated
    • lowerCamelCase
    • UpperCamelCase
    All are acceptable. Try to be consistent
    http://stat405.had.co.nz/r-style.html - Hadley Wickham’s style guide
    https://google.github.io/styleguide/Rguide.xml - Google’s style guide
  32. Let’s get a little more sophisticated. It is most common for data to be in a table.
    Using the dice example:
    Option 1: Build a data frame
    Option 2: Type it into Excel, load it into R
  33. A data frame is a data object that can contain more than one type of data
    We have two vectors (sequence of values of same data type):
        Die1
        [1] 3 1 2 5 2 2 5 4 5 1
        Die2
        [1] 5 2 3 2 2 2 4 5 5 3
    We want:
        RollNum   DiceName   Score
        1         Die1       3
        1         Die2       5
        2         Die1       1
        2         Die2       2
        (Number)  (String)   (Number)
  34. Die1
      [1] 3 1 2 5 2 2 5 4 5 1
      Die2
      [1] 5 2 3 2 2 2 4 5 5 3
      RollNum <- c(1:10) # Let’s make a counter for each roll
      RollNum
      [1] 1 2 3 4 5 6 7 8 9 10
    c(1:10) combines the values from one TO ten into a vector called RollNum
  35. DiceData <- data.frame(RollNum, Die1, Die2) # Create a data frame
    Problem: This is wide format. We want to make it LONG format.
  36. install.packages("tidyr")
      require(tidyr)
    • install.packages("tidyr"): Take a package from the Internet, called “tidyr,” and download/install it. Note the quotation marks
    • require(tidyr): Load the package into memory. No more quotation marks. Why?
    Once the package is loaded, you get access to every function contained in the package!
        Die1
        [1] 3 1 2 5 2 2 5 4 5 1
        Die2
        [1] 5 2 3 2 2 2 4 5 5 3
        RollNum
        [1] 1 2 3 4 5 6 7 8 9 10
  37. DiceData_long <- gather(DiceData, key = "DiceName", value = "Score", Die1:Die2)
    • DiceData_long <- : Make a new variable called DiceData_long
    • gather( : Apply the “gather” function… it GATHERS columns into rows
    • DiceData : Do it to DiceData
    • key = "DiceName" : Make a new column called DiceName…
    • value = "Score" : Make a new column called Score…
    • Die1:Die2 : … and fill DiceName and Score with everything present in columns Die1 to Die2 (inclusive) from DiceData
    See: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
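To make the wide-to-long reshaping concrete without needing any package installed, the sketch below builds the same long table by hand in base R. This is a swapped-in illustration of what `gather()` produces, not the method used in the slides; the dice values are the ones shown earlier in the deck.

```r
# Wide format: one row per roll, one column per die
RollNum <- 1:10
Die1 <- c(3, 1, 2, 5, 2, 2, 5, 4, 5, 1)
Die2 <- c(5, 2, 3, 2, 2, 2, 4, 5, 5, 3)
DiceData <- data.frame(RollNum, Die1, Die2)

# Long format: one row per roll per die.
# The Die1 scores are stacked on top of the Die2 scores.
DiceData_long <- data.frame(
  RollNum  = rep(DiceData$RollNum, times = 2),
  DiceName = rep(c("Die1", "Die2"), each = nrow(DiceData)),
  Score    = c(DiceData$Die1, DiceData$Die2),
  stringsAsFactors = FALSE
)

dim(DiceData)       # 10 rows, 3 columns
dim(DiceData_long)  # 20 rows, 3 columns
head(DiceData_long, 3)
```

The tidyr documentation now describes `gather()` as superseded by `pivot_longer()`, so newer scripts may use that function for the same job.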
  38. DiceData_long <- gather(DiceData, DiceName, Score, Die1:Die2)
    See: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
    Now we can work with these data! Use data summary tools
        head(DiceData_long) # First six rows
        summary(DiceData_long)
  39. plot(DiceData_long) # Doesn’t work!
      plot(DiceData_long$Score) # The $ specifies a variable WITHIN a dataframe
      plot(DiceData_long$Score ~ DiceData_long$RollNum) # Y by X
  40. # What do the rolls look like across dice?
      plot(DiceData_long$Score ~ DiceData_long$DiceName)
    Problem: R often gives meaningless error messages.
    Always start by ruling out easy stuff. Is there a typo? Check capitalization. Check brackets. Check commas.
    If something is truly weird, close and re-open R and re-run code.
    Next, check the data types
  41. Check the data types
        sapply(DiceData_long, class) # Apply the function class to each column of DiceData_long
    There are several data types in R: https://www.r-bloggers.com/basic-data-types-in-r/
    DiceName is a ‘character’, or a vector that includes text (i.e. not a number).
    Solution: Turn DiceName into a FACTOR so we can work with it
    http://adv-r.had.co.nz/Data-structures.html
  42. Factor vs Character
    Factor: A special type of numerical vector that has levels described with characters
    Character: A string of characters that has no numerical value
    Therefore, when we tried:
        plot(DiceData_long$Score ~ DiceData_long$DiceName)
    It made no sense from the computer’s perspective. You can’t plot a number against a bunch of text!
  43. Option 1: Plot as a factor
        plot(DiceData_long$Score ~ as.factor(DiceData_long$DiceName))
    Option 2: Convert to factor, then plot
        DiceData_long$DiceName <- as.factor(DiceData_long$DiceName)
        plot(DiceData_long$Score ~ DiceData_long$DiceName)
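The difference between a character vector and a factor can be checked directly. The following is a minimal sketch on a four-value vector made up for this example (`DiceName` and `DiceFactor` here are stand-alone vectors, not columns of the slides' data frame).

```r
# A character vector: just text, no numerical meaning
DiceName <- c("Die1", "Die2", "Die1", "Die2")
class(DiceName)

# Converting to a factor gives R a set of levels to work with
DiceFactor <- as.factor(DiceName)
class(DiceFactor)
levels(DiceFactor)      # The distinct categories: "Die1" "Die2"
as.integer(DiceFactor)  # The numbers stored behind the labels: 1 2 1 2
table(DiceFactor)       # How many values fall in each level
```

Because each level has a number behind it, functions such as `plot()` can treat a factor as a set of groups.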
  44. Reminder: Excel has weird data types too! They’re less predictable, and you can’t easily tell which is being used
  45. Quick note: Save your scripts. NOT your workspace.
    The workspace includes everything that has been loaded into memory, including variables, packages, functions, etc.
    Sometimes packages clash, and functions you write may cause problems. It is always best to start with a fresh workspace.
    Exception: If you’re working with a massive dataset that takes a long time to analyze. Stock assessment people may encounter this.
  46. It is most common for data to be in a table.
    Using the dice example:
    Option 1: Build a data frame
    Option 2: Type it into Excel, load it into R
  47. Assume, rather than having generated these values in R, we got them from an actual experiment. We probably typed them into Excel. Let’s bring those data into R
    In Excel… type it all in. Save as CSV.
    • Make sure there are no weird data types
    • Remember it’s case sensitive
    • Don’t use commas. CSV means COMMA SEPARATED VALUES. If you put in a comma, R will read it as a new column
    • We have entered it in long format, so less data manipulation is needed
    • Note: There are packages that allow you to bring data in from an .XLS file. Use with caution
  48. Bringing data into R
        setwd("Y:/YOUR DIRECTORY HERE")
        DiceData <- read.csv("DiceData.csv")
        head(DiceData)
        sapply(DiceData, class)
    Note: In R versions before 4.0.0, read.csv pulls text data in as a factor by default (since R 4.0.0 the default is stringsAsFactors = FALSE).
    Pro: Most stats are done w/factors
    Con: Error correction must be done on characters
    To disable: read.csv("DiceData.csv", stringsAsFactors = FALSE)
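The effect of `stringsAsFactors` can be seen by reading the same file both ways. The sketch below is self-contained: it writes a tiny four-row table to a temporary CSV file and reads it back, so it does not depend on the `DiceData.csv` file or working directory from the slides. The names `example_data`, `csv_path`, `as_text` and `as_factor` are hypothetical.

```r
# Write a small long-format table to a temporary CSV file
example_data <- data.frame(
  RollNum  = c(1, 1, 2, 2),
  DiceName = c("Die1", "Die2", "Die1", "Die2"),
  Score    = c(3, 5, 1, 2)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read it back twice, stating the text handling explicitly each time
as_text   <- read.csv(csv_path, stringsAsFactors = FALSE)
as_factor <- read.csv(csv_path, stringsAsFactors = TRUE)

sapply(as_text, class)    # DiceName comes in as character
sapply(as_factor, class)  # DiceName comes in as factor
```

Setting `stringsAsFactors` explicitly, as here, makes a script behave the same way regardless of which R version runs it.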
  49. Bringing data into R
        setwd("C:/YOUR DIRECTORY HERE")
        DiceData <- read.csv("DiceData.csv")
    http://plantarum.ca/code/setwd/
    Use relative folder paths in RStudio. Next week
  50. Recap. So far we learned:
    • Basic R syntax
    • Mathematical and logical operators
    • Some basic R commands:
    • Using c() and colon, e.g. c(1:10) means 1,2,3,4,5,6,7,8,9,10. c("cat", "dog") makes a vector of length 2 with values "cat", "dog"
    • How to load packages (tidyr)
    • How to get help (with ?)
    • The basic plot command. A few other commands (sapply)
    • Reading data into R