
FISH 6002: Week 5 - Working with messy data

MI Fisheries Science

October 16, 2017

Transcript

  1. Week 5: Working with messy data
    FISH 6002: Data Collection, Management, and Display
    © Brett Favaro 2017
  2. Recall: Tidy data
    1. One column = one variable
    2. One row = one observation
    3. One cell = one value
    4. One column = one data type
    Grolemund and Wickham (2016), Fig 12.1
  3. You will work with two types of data
    • Data given to you by others (this week): data that have already been coded and stored in a table
    • Data you collect yourself (next week): fieldwork, physical datasheets, surveys, data from figures. These are data that have NOT been coded, where you can design the spreadsheet yourself
  4. I have deliberately broken this dataset. Our job is to clean it up.
    When loading data, take four steps:
    1. Did it load correctly?
    2. Are data types what they should be?
    3. Numbers: Are there impossible values?
    4. Factors: Are factor levels correct?
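The four checks can be sketched in base R as follows. This is a minimal illustration, not the course script: the toy data and the column names (id, tl, age, sex) are assumptions standing in for the real file.

```r
# Hypothetical toy data standing in for the broken dataset; the column
# names (id, tl, age, sex) are assumptions, not the course file's layout.
csv_text <- "id,tl,age,sex
1,208,5,F
2,2O8,7,M
3,215,1,f"

# stringsAsFactors = TRUE mimics the default of R versions before 4.0,
# which is why a numeric column with a stray letter loads as a factor.
fish <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(fish)          # Steps 1-2: did it load, and are the data types right?
summary(fish$age)  # Step 3: numbers, any impossible values?
levels(fish$sex)   # Step 4: factors, are the levels correct?
```

Here `str()` shows tl as a factor rather than a number, and `levels()` shows "f" and "F" as separate levels, both of which are signs that something needs cleaning.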
  5. Start with total length. Right now it’s a factor, but we don’t know why.
    We will need to find non-numerical characters.
    To do this, first convert the factor to a character.
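A short sketch of why the conversion to character comes first, using made-up total-length values (only "2O8" appears later in the slides; the other values are hypothetical):

```r
# Hypothetical total-length values; "2O8" contains a letter O, not a zero.
tl <- factor(c("208", "2O8", "215"))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers printed on screen.
codes <- as.numeric(tl)

# Converting to character first preserves the text exactly as typed,
# so non-numeric characters can be searched for afterwards.
tl_chr <- as.character(tl)
tl_chr
```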
  6. How the check works:
    • as.numeric returns NA if the text contains a character that is not part of a number
    • is.na returns TRUE if NA is present
    • filter returns only rows that meet the criteria above
    We are saying: Give me rows with an NA, where it’s NA if it is NOT a number.
    We have discovered two problem rows, ID 62 and 306.
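The combination of as.numeric, is.na and filter described above might look like this with dplyr. The data frame is a hypothetical stand-in: the IDs 62 and 306 and the values "2O8" and "215f" come from the slides, but which value belongs to which ID, and the other rows, are assumptions.

```r
library(dplyr)

# Hypothetical rows; pairing "2O8" with ID 62 and "215f" with ID 306
# is an assumption made for illustration.
fish <- data.frame(
  id = c(61, 62, 306, 307),
  tl = c("301", "2O8", "215f", "410"),
  stringsAsFactors = FALSE
)

# as.numeric() yields NA for text that is not a number, is.na() flags
# those NAs, and filter() keeps only the flagged rows.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
problems <- fish %>%
  filter(is.na(suppressWarnings(as.numeric(tl))))

problems$id
```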
  7. Mutate the data frame (add a new variable).
    Call the new variable tl (i.e. overwrite the old variable called tl).
    If tl is 2O8, then assign it a value of 208.
    If tl is ANYTHING ELSE, assign it a value of tl (i.e. don’t change it).
    Repeat the process for 215f, changing it to 215.
    Turn this variable back into a number.
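One way to write these steps with dplyr, as a sketch on hypothetical rows (the IDs and the value "301" are made up; `fish_raw` is an illustrative name for the untouched copy):

```r
library(dplyr)

# Hypothetical rows containing the two typos named on the slide.
fish <- data.frame(
  id = c(61, 62, 306),
  tl = c("301", "2O8", "215f"),
  stringsAsFactors = FALSE
)

fish_raw <- fish  # keep an unchanged copy for later comparison

fish <- fish %>%
  # If tl is "2O8" use "208"; anything else keeps its current value.
  mutate(tl = ifelse(tl == "2O8", "208", tl)) %>%
  # Repeat for "215f".
  mutate(tl = ifelse(tl == "215f", "215", tl)) %>%
  # Turn the cleaned text back into a number.
  mutate(tl = as.numeric(tl))

fish$tl
```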
  8. Take a step back: To R, e means EXPONENT.
    Our error detection missed this, because we were looking for NAs.
    When “56e8” is converted to a numeric, it doesn’t return NA as other letters do. R reads it as 56 × 10^8.
    (This is why it’s good to retain an unchanged version.)
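The behaviour can be reproduced in base R. The stricter pattern check at the end is an assumption of this sketch, not the method shown on the slide; it is one possible way to catch values that as.numeric accepts but that are still typos.

```r
# Hypothetical character values, including the exponent case from the slide.
tl_chr <- c("301", "2O8", "56e8", "215f")

# "56e8" is read as 56 x 10^8, so it converts without producing an NA.
parsed <- suppressWarnings(as.numeric(tl_chr))
is.na(parsed)

# A stricter check (an assumption, not the slide's method): accept only
# digits with an optional decimal part, so "56e8" is flagged as well.
looks_numeric <- grepl("^[0-9]+(\\.[0-9]+)?$", tl_chr)
tl_chr[!looks_numeric]
```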
  9. We still have a missing value.
    What to do with it must be decided thoughtfully. The right choice depends on context.
    Document what you did and why.
  10. Locate the error.
    Decide what to do:
    - Delete?
    - Go to data source?
    - Assume it’s 297, not 2970?
    Execute the fix.
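A base R sketch of locating and then fixing an impossible value. The rows, the IDs and the plausibility limit `max_plausible_tl` are hypothetical; only the values 2970 and 297 come from the slide, and replacing the value is just one of the options listed.

```r
# Hypothetical rows; only the value 2970 comes from the slide.
fish <- data.frame(id = c(10, 11, 12), tl = c(310, 2970, 285))

# Hypothetical plausibility limit in the same units as tl; a real limit
# should come from knowledge of the species and the sampling method.
max_plausible_tl <- 1500

# Locate: which rows exceed the limit?
suspect <- which(fish$tl > max_plausible_tl)
fish[suspect, ]

# Execute one possible fix, assuming an extra trailing zero was typed.
# Deleting the row or checking the data source are equally valid choices.
fish_fixed <- fish
fish_fixed$tl[suspect] <- 297
```

Whichever option is chosen, a comment next to the fix recording the reason keeps the decision documented.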
  11. Is 50 years a plausible age for a lake trout?
    Consult FishBase: http://fishbase.org/search.php
  12. Is 50 years a plausible age for a lake trout? Yes.
    Is 1 year a plausible age for a trout in this study? Need more info.
    1 year may be plausible, but not for a fish this big. Age 1 is an impossible value here.
  13. Here, we can go back to the original source and discover that it should be 13, not 1.
  14. What are possible values for fish ID?
    - Any label is fine
    - But it has to be UNIQUE (i.e. no duplicates)
    At least one value occurs twice, so we must hunt for duplicates.
  15. Group the data by ID number, and report any group that has more than one row.
    What do we do?
    - Discard one or both?
    - Keep it as is?
    - Reassign an ID number to something different?
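The grouping step can be sketched with dplyr as follows, on hypothetical rows in which ID 2 appears twice:

```r
library(dplyr)

# Hypothetical rows; ID 2 is deliberately duplicated.
fish <- data.frame(id = c(1, 2, 2, 3), tl = c(300, 310, 295, 280))

# Group by ID and keep only the groups that contain more than one row.
dupes <- fish %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup()

dupes
```

Seeing both duplicated rows side by side helps with the decision that follows, for example whether they are true repeats or two different fish sharing a label.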
  16. Each of these differences is explainable:
    - We deleted a row
    - ID is a character, because we made up an ID name
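A base R sketch of comparing a cleaned data frame against a reference version. Both data frames and the made-up ID "2b" are hypothetical; they are built so that the two differences named above (one fewer row, ID stored as character) show up.

```r
# Hypothetical stand-ins for the cleaned result and the reference version.
mine <- data.frame(
  id = c("1", "2", "2b"),   # "2b" is a made-up replacement ID
  tl = c(300, 310, 295),
  stringsAsFactors = FALSE
)
clean <- data.frame(id = c(1, 2, 3, 4), tl = c(300, 310, 295, 280))

# Compare dimensions and column types first.
dim(mine)
dim(clean)
sapply(mine, class)
sapply(clean, class)

# all.equal() returns TRUE or a description of the differences.
all.equal(mine, clean)
```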
  17. http://derekogle.com/fishR/data/data-html/InchLake2.html
    Your task:
    • Take the Inch Lake dataset (InchLake2-Broken.csv)
    • Clean it up
    • Four steps:
      1. Did it load correctly?
      2. Are data types what they should be?
      3. Numbers: Are there impossible values?
      4. Factors: Are factor levels correct?
    • See: FISH6002-Week5_Activity.R to get started
    If you’re stuck: FISH6002-Week5Solution.R (but try yourself first)
    At end: Compare the dataframe you produce to the CLEAN version (InchLake2-clean.csv). Are differences explainable?