Upgrade to Pro — share decks privately, control downloads, hide ads and more …

FISH 6002: Week 4 - Introduction to Tidy Data

FISH 6002: Week 4 - Introduction to Tidy Data

Week 4, class meeting of FISH 6002

updated 24 Sept 2019

MI Fisheries Science

September 25, 2017
Tweet

More Decks by MI Fisheries Science

Other Decks in Science

Transcript

  1. Week 4: Intro to Tidy Data FISH 6000: Science Communication

    for Fisheries Brett Favaro 2017 This work is licensed under a Creative Commons Attribution 4.0 International License
  2. To work with data in R, it needs to be

    tidy Imagine a study… You tested three types of crab trap: Big, Medium, Small You deployed each trap three times (24 hr soak) and counted how many crabs you caught Deployment Big Trap Medium Trap Small Trap 1 |||| || | 2 (Extreme winds) |||| ||| | 3 |||| | || Makes sense to people. Nonsense to a computer Into your data booklet, you write:
  3. Tidy data: 1.One column = one variable 2.One row =

    one observation 3.One cell = one value Grolemund and Wickham (2016), Fig 12.1
  4. Deployment Big Trap Medium Trap Small Trap 1 |||| ||

    | 2 (Extreme winds) |||| ||| | 3 |||| | || What does highlighting mean? Mixing two types of information Tallies aren’t machine readable Does the blank mean ZERO, or does it mean we didn’t count it (i.e. missing data)?
  5. You need to tidy the data: Deployment Big Trap Medium

    Trap Small Trap 1 |||| || | 2 (Extreme winds) |||| ||| | 3 |||| | || Remove colours
  6. You need to tidy the data: Deployment Big Trap Medium

    Trap Small Trap 1 5 2 1 2 (Extreme winds) 4 3 1 3 6 2 Use simplest notation
  7. You need to tidy the data: Deployment Big Trap Medium

    Trap Small Trap 1 5 2 1 2 (Extreme winds) 4 3 1 3 6 2 0 Address missing values (May be zero, may be NA (missing))
  8. You need to tidy the data: Deployment Big Trap Medium

    Trap Small Trap Weather 1 5 2 1 Calm 2 4 3 1 Windy 3 6 2 0 Calm One piece of data per cell
  9. Deployment Big Trap Medium Trap Small Trap 1 5 2

    1 2 4 3 1 3 6 2 0 Not done yet… (We’ll ignore weather, for simplicity) What do these numbers indicate? Number of crabs caught in a trap. Number of crabs caught is ONE VARIABLE Number of crabs caught should be ONE COLUMN
  10. Deployment Big Trap Medium Trap Small Trap 1 5 2

    1 2 4 3 1 3 6 2 0 In this study, what is ONE OBSERVATION? One observation = catch from ONE TRAP One trap should be one row
  11. Deployment Big Trap Medium Trap Small Trap 1 5 2

    1 2 4 3 1 3 6 2 0 Not done yet…
  12. Deployment Big Trap Medium Trap Small Trap 1 5 2

    1 2 4 3 1 3 6 2 0 I want to produce: How would I produce a plot of CATCH by TRAPTYPE with the above data frame?
  13. TrapType CatchNum Deployment BigTrap 5 1 BigTrap 4 2 BigTrap

    6 3 MediumTrap 2 1 MediumTrap 3 2 MediumTrap 2 3 SmallTrap 1 1 SmallTrap 1 2 SmallTrap 0 3 Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4 3 1 3 6 2 0 Must reorganize data:
  14. Also: • Raw data table should generally not have derived

    quantities. Best for everything to be *observed* and derive quantities in R with your script. • E.g. Deployment Cod Haddock Total 1 1 2 3 2 2 3 5 3 1 1 2 Deployment Cod Haddock 1 1 2 2 2 3 3 1 1 Deployment Species Catch 1 Cod 1 1 Haddock 2 2 Cod 2 …
  15. TrapType CatchNum Deployment BigTrap 5 1 BigTrap 4 2 BigTrap

    6 3 MediumTrap 2 1 MediumTrap 3 2 MediumTrap 2 3 SmallTrap 1 1 SmallTrap 1 2 SmallTrap 0 3 Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4 3 1 3 6 2 0 plot(CatchNum ~ TrapType, data = TrapData) Easy! Must reorganize data:
  16. Long format: Good for computer Wide format: Good for humans

    Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4 3 1 3 6 2 0 TrapType CatchNum Deployment BigTrap 5 1 BigTrap 4 2 BigTrap 6 3 MediumTrap 2 1 MediumTrap 3 2 MediumTrap 2 3 SmallTrap 1 1 SmallTrap 1 2 SmallTrap 0 3
  17. Exercise – by hand, let’s tidy some data! Get into

    pairs/3’s – draw out ‘tidy’ versions of ‘messy data’ shown above 15 minutes
  18. 1. Using your own data: Collect and enter tidily 2.

    Tidy it in R How to produce Tidy Data
  19. Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3

    BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 10 3 SmallTrap 0 Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4 3 1 3 6 2 0 Let’s start small There are many ways to get from left to right. With very small datasets you can do it manually, or with PivotTables in Excel. But beware: You will make untraceable mistakes. Also, this becomes completely impossible with even moderate-sized datasets
  20. Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4

    3 1 3 6 2 0 TidyData <- gather(data = MessyData, key = TrapType, Value = Catch, BigTrap:SmallTrap) Data frame: MessyData Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0
  21. Reference: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Deployment TrapType Catch 1 BigTrap 5 2 BigTrap

    4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Deployment BigTrap MediumTrap SmallTrap 1 5 2 1 2 4 3 1 3 6 2 0 WideFormat <- spread(data = TidyData, key = TrapType, value = Catch) One line of code!!
  22. Data manipulation Deployment TrapType Catch 1 BigTrap 5 2 BigTrap

    4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Select only BigTrap observations TempData <- filter(TidyData, TrapType == “BigTrap”) Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3 BigTrap 6 TempData <- filter(TidyData, TrapType == “BigTrap” & Catch >= 6) Deployment TrapType Catch 3 BigTrap 6 TempData <- filter(TidyData, TrapType == “BigTrap” & Catch >= 6 | TrapType == “BigTrap” & Catch <= 4) #What will this do?
  23. Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3

    BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Unite columns together: TempData <- unite(TidyData, DeploymentAndTrapType, Deployment, TrapType, sep=“.”) TempData <- separate(TempData, DeploymentAndTrapType, c(“Deployment”, “TrapType”)) Separate them…
  24. Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3

    BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Rename a column: TempData <- rename(TidyData, FishCatch = Catch) NEW name comes first
  25. Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3

    BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Select a column (or columns): TempData <- select(TidyData, TrapType, Catch) Select the top N entries TempData <- top_n(TidyData, 3, Catch)
  26. Helper functions TempData <- select(TidyData, starts_with("Trap")) TempData <- select(TidyData, ends_with(“ment"))

    TempData <- select(TidyData, -Deployment) Just drop the one column identified here
  27. Summary functions (i.e. what we just did) take many values,

    return one. Window functions take many values and return the same number of values, but with an operation performed on each. Experiment with the other window functions in the cheat sheet e.g. Cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
  28. Joining datasheets Recall our trap study We have discovered new

    data that we’d like to use: Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Deployment TemperatureC 1 23 2 25 3 21 4 20 TidyData TemperatureData
  29. First, figure out what connects the two data frames What

    variable occurs in each dataframe? Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Deployment TemperatureC 1 23 2 25 3 21 4 20 TidyData TemperatureData Deployment TrapType Catch Deployment Temperature “Deployment” is your key
  30. CombinedData <- left_join(TidyData, TemperatureData, by=“Deployment”) Join this… …to this… …using

    Deployment as the linking variable (or ‘key’) CombinedData <- right_join(TidyData, TemperatureData, by=“Deployment”) See the difference?
  31. LargerData <- bind_rows(TidyData, ExtraData) What if we want to add

    ROWS, not columns? E.g. Deployment TrapType Catch 1 BigTrap 5 2 BigTrap 4 3 BigTrap 6 1 MediumTrap 2 2 MediumTrap 3 3 MediumTrap 2 1 SmallTrap 1 2 SmallTrap 1 3 SmallTrap 0 Deployment TrapType Catch 4 BigTrap 6 4 MediumTrap 3 4 SmallTrap 1 MoreData
  32. Datasheet Layout Problem: One table or more? From: https://dynamicecology.wordpress.com/2016/ 08/22/ten-commandments-for-good-data-

    management/ • Fully de-normalized – everything in one spreadsheet • Fully normalized – each sheet has minimal information, related by common variables • Partly normalized – Something in between
  33. Plan relationships with a diagram: Note that each spreadsheet is

    connected by common variables, usually with the same name. These are called keys In general, raw data are normalized to some degree (for readability), and are de-normalized for analysis
  34. Fictional study: We want to know which type of trap

    catches more crabs Traps: http://www.fao.org/3/x2590e/x2590e07.htm TrapType A B C D E Each trap is deployed several times: TrapID 1 2 3 … We recorded: CatchNum: the number of crabs caught per trap CarapaceWidth: the carapace width of each crab
  35. Fictional study: We want to know which type of trap

    catches more crabs TrapType A B C D E Unique identifier of deployed trap: TrapID 1 2 3 … CatchNum: CarapaceWidth: fishing boat by Symbolon from the Noun Project FishingDate: When traps were deployed SetLocationLat SetLocationLong SoakTimeHrs
  36. WMO Sea State Code Wave height Characteristics 0 0 metres

    (0 ft) Calm (glassy) 1 0 to 0.1 metres (0.00 to 0.33 ft) Calm (rippled) 2 0.1 to 0.5 metres (3.9 in to 1 ft 7.7 in) Smooth (wavelets) 3 0.5 to 1.25 metres (1 ft 8 in to 4 ft 1 in) Slight 4 1.25 to 2.5 metres (4 ft 1 in to 8 ft 2 in) Moderate 5 2.5 to 4 metres (8 ft 2 in to 13 ft 1 in) Rough 6 4 to 6 metres (13 to 20 ft) Very rough 7 6 to 9 metres (20 to 30 ft) High 8 9 to 14 metres (30 to 46 ft) Very high 9 Over 14 metres (46 ft) Phenomenal SeaState CloudCover More on cloud cover measurements: https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/met.1542
  37. Variable Code name Values Data type A unique numerical label

    for each trap TrapID 1-# of traps in the study Factor Carapace width of each crab in centimeters CarapaceWidth Length in mm rounded to nearest whole number Integer The model of pot used in that observation TrapType A, B, C, D, or E Factor Number of crabs caught in a deployment CatchNum Whole number of fish caught in a deployment Integer Date gear was deployed FishingDate Date gear was deployed Date* Latitude at which gear was set SetLocationLat Decimal degree as given by GPS Numeric Longitude at which gear was set SetLocationLong Decimal degree as given by GPS Numeric Number of hours gear was in the water before retrieval SoakTimeHrs Number of hours passed since gear deployed Integer? Sea state at time of gear retrieval SeaState 0-9 as per previous slide Factor Cloud cover at time of gear retrieval CloudCover 0-9 as per previous slide Factor Example: Imagine a different study with many variables *We will cover working with dates in a future lecture. “Date” isn’t actually a data type in R per se.... Stay tuned.
  38. Problem – We have data that have different numbers of

    replicates Carapace width = one observation per crab Catch number = one observation per trap Fishing day = one observation per day
  39. TrapID TrapType CatchNum FishingDate CarapaceWidth SetLocationL at SetLocationL ong SoakTimeHr

    s SeaSt ate CloudC over 1 A 12 Jan 1, 2019 54 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 44 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 63 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 32 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 76 47.38 -54.4 22 3 5 2 B 14 Jan 1, 2019 … 47.42 -54.2 24 2 6 One table (fully de-normalized) FishingDate SetLocation Lat SetLocation Long SoakTimeHrs SeaState CloudCover TrapID TrapType CatchNum CarapaceWid th Makes sense to people: Many smaller sheets Makes sense to computers: One big sheet Raw data Analysis data
  40. TrapID TrapType CatchNum FishingDate CarapaceWidth SetLocationLat SetLocationLong SoakTimeHrs SeaState CloudCover

    1 A 12 Jan 1, 2019 54 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 44 47.38 -54.4 22 3 5 1 A 12 Jan 1, 2019 63 47.38 -54.4 22 3 5 Is this tidy?
  41. TrapID TrapType CatchNum FishingDay CarapaceWidth 1 A 12 Jan 1,

    2019 54 1 A 12 Jan 1, 2019 44 1 A 12 Jan 1, 2019 63 1 A 12 Jan 1, 2019 32 1 A 12 Jan 1, 2019 76 2 B 14 Jan 1, 2019 … Two tables (partly normalized) FishingDate SetLocationLat SetLocationLong SoakTimeHrs SeaState CloudCover Jan 1, 2019 47.38 -54.4 22 3 5 Jan 2, 2019 47.42 -54.2 24 2 6 Jan 3, 2019 47.36 -54.3 26 2 6 Jan 4, 2019 47.7 -54.6 22 5 5 Jan 5, 2019 47.1 -54.0 21 3 5 FishingDate SetLocation Lat SetLocation Long SoakTimeHrs SeaState CloudCover TrapID TrapType CatchNum FishingD ate Carapace Width The key is: FishingDate
  42. TrapID TrapType CatchNum Fishing Date 1 A 12 Jan 1,

    2019 2 B 14 Jan 1, 2019 3 C 3 Jan 1, 2019 4 D 16 Jan 1, 2019 5 E 2 Jan 1, 2019 Three table solution (fully normalized) TrapID CarapaceWidth 1 54 1 44 1 63 1 32 1 76 FishingDate SetLocationLat SetLocationLong SoakTimeHrs SeaState CloudCover Jan 1, 2019 47.38 -54.4 22 3 5 Jan 2, 2019 47.42 -54.2 24 2 6 Jan 3, 2019 47.36 -54.3 26 2 6 Jan 4, 2019 47.7 -54.6 22 5 5 Jan 5, 2019 47.1 -54.0 21 3 5 TrapID TrapType CatchNum FishingD ate FishingDate SetLocation Lat SetLocation Long SoakTimeHrs SeaState CloudCover TrapID Carapace Width The keys are: TrapID, FishingDate
  43. Tidy data starts with your study design Collecting data in

    the field How will you record data? What will you record it on? How will you lay out your records? How will you make sure you don’t lose them? How will they be inputted into the computer?
  44. Options for data collection Data booklet • Pros: • Free

    form, can be adapted • Small, compact, can fit in pocket • Considerations: • Hard to back up • Be careful not to forget anything • Lose the book, lose everything • Book is water RESISTANT, not water-proof 1. If you’re using your own data: Collect and enter tidily
  45. Datasheets • Pros: • Systematic layout – just make sure

    nothing’s left empty, and you have all your data • Easy backup (photocopying, scanning) • Can print on fully waterproof paper • Great if you have many volunteers or staff members • Considerations: • Data sheets are easy to lose • Inflexible – Don’t use these for pilot studies http://www2.mar.dfo- mpo.gc.ca/fisheries/res/imp/IMAGES/whlkmon.GIF
  46. Electronic devices (e.g. tablets) • Pros: • Systematic layout •

    Backup can be automatic • No transcription error from tablet → computer • Can be good for administering forms to people (e.g. in data collection for social science) • Considerations: • Dead battery = no data collection • Possible to completely lose datasets due to technical error https://zapier.com/learn/forms-surveys/best-data- collection-apps/
  47. Spreadsheet on laptop • Pros: • Enter directly into a

    spreadsheet – complete control over data entry • Skip a step • Considerations: • Will you REALLY be able to operate a laptop while doing fieldwork? • Anything short of a Toughbook (seen left) is fragile
  48. • Back up data after every day of fieldwork, regardless

    of how you enter it • Could be: Photos of field book. Photocopies of data sheets. Offsite backup of CSV file • If possible, store the backup physically off-site. If not, put on two portable HDs, or one HD and one laptop • When you get back to MI, store on internal infrastructure • Make sure you understand all your tools before you go into the field • Remember: Every piece of data costs time and money to collect! Treat it like solid gold! • Don’t leave stuff blank. Write zeroes. • Include enough detail in “notes” so that you can remember their point Digital security: How to secure your laptop before crossing the border: https://www.cba.org/Publications-Resources/CBA-Practice-Link/Young-Lawyers/2008/How-to- secure-your-laptop-before-crossing-the-bord Data Collection Tips