FISH 6002: Week 7 - Displaying Data Visually 1

Oct 2, 2017 lecture of FISH 6002 - revised again oct 21 2019

MI Fisheries Science

October 02, 2017

Transcript

  1. Week 7: Displaying Data Visually FISH 6000: Science Communication for

    Fisheries Brett Favaro 2017 This work is licensed under a Creative Commons Attribution 4.0 International License
  2. #year: Year of catch #state: State of catch (MI, MN,

    or WI) #catch: Catch in lbs #value: Value of catch in dollars Date
  3. Month Day Year Hour Minute Second df$date df$date mdy_hms("August 03

    2017 07:14:36 PM") [1] "2017-08-03 19:14:36 UTC" mdy_hms("August 03 2017 07:14:36 PM", tz="America/St_Johns") [1] "2017-08-03 19:14:36 NDT"
  4. [1] "2017/08/03 13:53" Month Day Year Hour Minute ymd_hm("2017/08/03 13:53",

    tz="America/St_Johns") [1] "2017-08-03 13:53:00 NDT"
  5. > dmy("3/8/2017") [1] "2017-08-03" > dmy("3/August/2017") [1] "2017-08-03" > dmy("3/Aug/2017")

    [1] "2017-08-03" > dmy("3/AUG/2017") [1] "2017-08-03" > dmy("3/AuG/2017") [1] "2017-08-03"
  6. Datetime ymd_hm("2017/08/03 13:53", tz="America/St_Johns") [1] "2017-08-03 13:53:00 NDT" hm("13:53") [1]

    "13H 53M 0S" Date Time Number of seconds since 0:00 ymd("2017/08/03") [1] "2017-08-03"
  7. Break the problem into steps Each row is an action

    (setting or hauling gear) within a deployment. Deployment is given by deployment_id. For each action we have a date and a time. We need to create a date-time and calculate the amount of time that passed Problem: Remember, if we use hms to look at time data, we get TIME PASSED SINCE 0:00:00 and AM/PM is ignored. No good!
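The difference between a date-time, a date, and a time can be sketched as follows. This is a minimal illustration, assuming the lubridate package is installed; the variable names are made up for the example and do not come from the course data.

```r
library(lubridate)

# Date-time: a specific instant, with a time zone
dt <- ymd_hm("2017/08/03 13:53", tz = "America/St_Johns")

# Date: a calendar day only
d <- ymd("2017/08/03")

# Time: a period counted from 0:00, with no date attached
tm <- hm("13:53")

class(dt)              # "POSIXct" "POSIXt"
class(d)               # "Date"
period_to_seconds(tm)  # 49980 seconds since 0:00
```

Because the time on its own is only "time passed since 0:00", it cannot be used to measure elapsed time across days, which is why the date and time are combined first.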
  8. Combine the date and time into one column Then, make

    it a Date-Time date_df <- date_df %>% unite(Date, Time, col="DateTime", sep=" ")
  9. Now, calculate durations Let’s make it wide, so within a

    deployment we have a START and END time date_wide <- date_wide %>% mutate(SoakTime = as.duration(set %--% haul))
  10. Durations are stored in seconds, and expressed by default in

    the most appropriate unit (hrs, days, etc) mean(date_wide$SoakTime) [1] 866880 median(date_wide$SoakTime) [1] 189000 If you perform mathematical operations, they will be performed on seconds
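A small sketch of how a duration behaves as a number of seconds, assuming lubridate is installed. The set and haul times below are hypothetical and are not taken from DateExample.csv.

```r
library(lubridate)

# Hypothetical set and haul times for a single deployment
set_time  <- ymd_hm("2017-08-03 08:00", tz = "America/St_Johns")
haul_time <- ymd_hm("2017-08-04 14:00", tz = "America/St_Johns")

# Interval between the two instants, converted to a duration
soak <- as.duration(set_time %--% haul_time)

as.numeric(soak)   # 108000 (seconds)
soak / dhours(1)   # 30 (hours)
```

Dividing by `dhours(1)` is one way to report the result in hours rather than in the underlying seconds.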
  11. date_df <- read.csv("./data/DateExample.csv") date_df %>% unite(Date, Time, col="DateTimeRaw", sep=" ")

    %>% #Combine Dates and Times mutate(DateTime = mdy_hms(DateTimeRaw, tz="America/St_Johns")) %>% # Make Date-Time select(-DateTimeRaw) %>% # Remove the non Date-Time column spread(key=action, value=DateTime) %>% # Make it wide format mutate(SoakTime = as.duration(set %--% haul)) #Calculate duration btwn set and haul A note about piping: This does everything we just did, in a single chunk of code
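The same chain can be tried without the CSV file. The sketch below assumes dplyr, tidyr and lubridate are installed and uses a made-up four-row data frame (`toy`) with the column names the slides imply (deployment_id, action, Date, Time); the real file may differ.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for DateExample.csv: two deployments, each set then hauled
toy <- data.frame(
  deployment_id = c(1, 1, 2, 2),
  action = c("set", "haul", "set", "haul"),
  Date = c("08/03/2017", "08/04/2017", "08/05/2017", "08/05/2017"),
  Time = c("07:00:00 AM", "07:00:00 PM", "06:30:00 AM", "06:30:00 PM"),
  stringsAsFactors = FALSE
)

toy_wide <- toy %>%
  unite(Date, Time, col = "DateTimeRaw", sep = " ") %>%                # combine date and time text
  mutate(DateTime = mdy_hms(DateTimeRaw, tz = "America/St_Johns")) %>% # parse as date-time
  select(-DateTimeRaw) %>%                                             # drop the raw text column
  spread(key = action, value = DateTime) %>%                           # one row per deployment
  mutate(SoakTime = as.duration(set %--% haul))                        # elapsed time, in seconds

as.numeric(toy_wide$SoakTime)  # 129600 43200 (36 h and 12 h)
```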
  12. #year: Year of catch #state: State of catch (MI, MN,

    or WI) #catch: Catch in lbs #value: Value of catch in dollars Date
  13. Data visualization, or dataviz, can completely alter our perception of

    a subject Rose and Rowe (2015) DFO (2016) Both depictions are true… do they change your interpretation?
  14. Graphs should be: • An honest representation of the data

    • An effective display • Efficient – i.e. they should use space well • Visually appealing! (Scientists like aesthetics too) Graphs should tell a story
  15. 3 plotting “ecosystems” in R Base ggplot2 lattice Each has

    different strengths. Very different syntax We will focus on Base, and ggplot2, not lattice. https://bookdown.org/rdpeng/exdata/plotting-systems.html#the-base-plotting-system
  16. What graph do I make? Depends on: • Type of

    data • Continuous • Discrete https://github.com/allisonhorst/stats-illustrations
  17. One-variable plots What stories can we tell with these?

    [Figure: a Continuous panel (x axis 0 to 10000) and a Discrete panel (Apple, Banana); y axes labelled Count, Density, Proportion]
  18. Start with the easiest table(whitefish$state) MI MN WI 31 11

    31 plot(table(whitefish$state)) One-variable – Discrete X Response is “count”
  19. barplot(table(whitefish$state)) plot(table(whitefish$state)) The plot command executes a default plot, based

    on data type. These plots are minimalist in nature. Contrast barplot with plot – what’s different?
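To compare the two commands without the whitefish data, the sketch below rebuilds a toy vector from the counts printed on the earlier slide (MI 31, MN 11, WI 31). The vector name `state` is an assumption made for the example.

```r
# Toy vector with the same counts as table(whitefish$state) on the slide
state <- rep(c("MI", "MN", "WI"), times = c(31, 11, 31))
counts <- table(state)

plot(counts)     # default plot for a one-way table: thin vertical lines
barplot(counts)  # same counts drawn as filled bars
```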
  20. Edward Tufte (https://www.edwardtufte.com/tufte/): Data:ink ratio = data ink / total ink used in graphic

    Data density = # of pieces of data / area of graphic. Fill up the plot with meaningful ink – not decorative ink. R’s plot command, by default, uses plots that maximize data:ink ratio
  21. One-variable plots – Continuous Continuous data (Measured) Counts rows of

    data This is a scatterplot plot(whitefish$catch) In this context, a scatterplot is only useful as it helps us look for outliers
  22. A histogram displays frequency of observations that fit within a

    specified range Height = count of observations that fall into that interval Intervals are (a, b] by default: They include right hand endpoint There were 17 observations where catch was between zero and ~200,000 hist(whitefish$catch)
  23. In R, all graphical elements can be customized (We will

    look at aesthetics in future weeks)
  24. When breaks aren’t uniform: Add up all bar areas to

    get 1 When breaks uniform: Add up all bar heights to get count of observations
  25. # Many calculations are conducted by the hist function prior

    to plotting # and you can extract those values histinfo <- hist(whitefish$catch)
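A short sketch of what the object returned by hist() contains, using a made-up vector rather than the whitefish catches. It also checks the two rules above: intervals include their right-hand endpoint, and bar areas on the density scale add up to 1.

```r
# Hypothetical data; breaks chosen so each interval is (a, b]
x <- c(1, 2, 2, 3, 5, 8, 10)
histinfo <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)

histinfo$breaks  # 0 2 4 6 8 10
histinfo$counts  # 3 1 1 1 1  (the two 2s fall in (0, 2], not (2, 4])

sum(histinfo$counts)                           # 7, the number of observations
sum(histinfo$density * diff(histinfo$breaks))  # 1, total bar area
```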
  26. Everything in this plot has meaning Title – what is

    this plot? Axis labels – what is being plotted? Values Bar area and height convey meaning – width tells interval. Height tells count
  27. Plot of mean catch by state (two-variable) How many pieces

    of information are present in the plot space?
  28. 3 points of data visible data:ink ratio low All 73

    observations visible, summarized as 9 bars data:ink ratio higher
  29. ggplot2 # ggplot(data = df, # aes(DATA YOU WANT TO

    PLOT)) + # GRAPHICAL ELEMENTS, ADDED ONE BY ONE ggplot(data = whitefish, aes(state)) + geom_bar()
  30. ggplot2 a <- ggplot(data = whitefish, aes(state)) + geom_bar() print(a)

    a <- ggplot(data = whitefish, aes(state)) print(a) Key point: ggplot2 works with layered elements
  31. How did I know to type geom_bar()? I knew I

    had… ONE VARIABLE and it is… discrete https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf
  32. a <- ggplot(data = whitefish, aes(state)) + geom_bar() print(a) barplot(table(whitefish$state))

    In base plot, I had to manually calculate a count (using table()), and use the plot() or barplot() commands. Different commands for different plots, and elements must be manually calculated. ggplot uses the same basic syntax for all plots. Elements are layered
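One way to see that ggplot2 does the counting itself is to pull the computed values back out of the plot. This sketch assumes ggplot2 is installed and uses a toy data frame rebuilt from the counts on the earlier slide; `toy` and `bars` are names invented for the example.

```r
library(ggplot2)

# Toy data with the same state counts as the slide (MI 31, MN 11, WI 31)
toy <- data.frame(state = rep(c("MI", "MN", "WI"), times = c(31, 11, 31)))

# Base R: the count is calculated by hand with table()
manual_counts <- table(toy$state)

# ggplot2: geom_bar() calculates the count as part of the layer
bars <- ggplot(data = toy, aes(state)) + geom_bar()

layer_data(bars)$count  # 31 11 31, same as as.vector(manual_counts)
```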
  33. ggplot() plots are customizable a + geom_histogram(bins=10) a + geom_histogram(bins=20)

    a + geom_histogram(bins=30) a + geom_histogram(bins=40)
  34. Defaults differ across ecosystems: a + geom_area(stat="bin") `stat_bin()` using `bins

    = 30`. Pick better value with `binwidth`. R teachable moment: Always understand the defaults
  35. a + geom_histogram(bins = 10) + coord_flip() ggplot has built-in

    functions to make common adjustments to plots with little code e.g. sideways histograms are common in population data Lindeman (1942)
  36. What graph do I make? Depends on: • Type of

    data • Continuous • Discrete • Nominal • Ordinal • Binary https://github.com/allisonhorst/stats-illustrations
  37. Nominal Watermelons Oranges Number eaten Cherries Watermelons Oranges Number eaten

    Cherries Ordinal (based on size) Not Citrus Citrus Number eaten Binary
  38. Building a graph is building a narrative What Contrast do

    I want to emphasize? What story am I trying to tell? Must be honest, efficient, and effective
  39. Continuous Continuous plot(value ~ catch, data=whitefish) Scatterplot Each circle =

    one point of data. Conveys X and Y value Base plot uses empty circles by default, so you can see partially- overlapping points
  40. When designing graphs, ask what question does this graph answer?

    Y should be a response variable; X, an explanatory variable. Contrast: a plot of Catch against Value, and a plot of Value against Catch. What question is each graph answering?
  41. In base plot, you add elements manually “Make a straight

    line, of the results of a linear model” “Add it to whatever has previously been plotted” Scatterplots are often enhanced with visual aids plot(catch ~ year, data=whitefish) abline(lm(catch~year, data=whitefish)) The order of execution matters…
  42. abline(lm(catch~year, data=whitefish)) Error in int_abline(a = a, b = b,

    h = h, v = v, untf = untf, ...) : plot.new has not been called yet plot(catch ~ year, data=whitefish) plot(catch ~ year, data=whitefish) abline(lm(catch~year, data=whitefish))
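The working order can be reproduced with a toy data set. The data frame below is hypothetical (a perfectly linear trend, chosen so the fitted slope is easy to check) and only stands in for whitefish.

```r
# Hypothetical data: catch rises by 5 units per year
toy <- data.frame(year = 2000:2009)
toy$catch <- 100 + 5 * (toy$year - 2000)

fit <- lm(catch ~ year, data = toy)

plot(catch ~ year, data = toy)  # 1. create the plot first
abline(fit)                     # 2. then add the fitted line on top of it

coef(fit)[["year"]]  # fitted slope (5, up to floating-point error)
```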
  43. In ggplot a <- ggplot(data=whitefish, aes(x = year, y =

    catch)) a + geom_point() The aesthetics are: Year on X Catch on Y Then, add points
  44. Note ggplot syntax: a <- ggplot(data=whitefish, aes(x = year, y

    = catch)) a + geom_point() a <- ggplot(data=whitefish, aes(x = year, y = catch)) + geom_point() print(a) #or just a Same result
  45. a <- ggplot(data=whitefish, aes(x = year, y = catch)) +

    geom_point() + geom_smooth(method = "lm") # add a straight line of best fit a ggplot(data=whitefish, aes(x = year, y = catch)) + geom_point() + geom_smooth() # note no method="lm" Can you explain all elements?
  46. ggplot(data=whitefish, aes(x = year, y = catch)) + geom_point() +

    geom_smooth(method = "lm", level=0.99) level=0.95 level=0.5 ggplot(data=whitefish, aes(x = year, y = catch)) + geom_point() + geom_smooth(span=1) span=100
  47. Disclose, in methods or figure caption how you made the

    plot Figure 1: Scatterplot of catch (Y) by year (X) of Lake Whitefish in Lake Superior. The blue line depicts a loess smoother with a span value of 100 (implemented in ggplot2), and the grey shading is a 95% confidence interval.
  48. plot(catch ~ year, data=whitefish) abline(lm(catch~year, data=whitefish)) ggplot(data=whitefish, aes(x = year,

    y = catch)) + geom_point() + geom_smooth(method = "lm") When X and Y are both continuous you’re visualizing a relationship between two variables
  49. Continuous Y, Discrete X Discrete Continuous When X is discrete,

    you are contrasting across groups “What’s the median catch across states”?
  50. “What’s the median catch across states”? Raw data: - How

    many points of data are shown? - Is data:ink ratio high or low? - Is data density high or low? - How many points of data are shown? - Is data:ink ratio higher, lower, or same? - Is data density higher, lower, or same?
  51. Boxplot: Median, 1st quartile, 3rd quartile, min, max

    Outliers: > Q3 + 1.5*IQR or < Q1 - 1.5*IQR. Default: whiskers extend UP TO 1.5 times the IQR from Q3 – no further. IQR: Inter-quartile range
  52. Discrete Discrete Need to add another discrete variable… whitefish <-

    whitefish %>% mutate(BigState = ifelse(state=="MI", "Y", "N")) Is this observation from a “big state”? ggplot(data=whitefish, aes(x=state, y = BigState)) + geom_count()
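A self-contained version of the same idea, assuming dplyr and ggplot2 are installed. The four-row data frame is made up for illustration; only the `state` and `BigState` column names follow the slide.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: one row per observation
toy <- data.frame(state = c("MI", "MN", "WI", "MI"), stringsAsFactors = FALSE) %>%
  mutate(BigState = ifelse(state == "MI", "Y", "N"))  # second discrete variable

toy$BigState  # "Y" "N" "N" "Y"

# geom_count() sizes each point by the number of rows at that x/y combination
p <- ggplot(data = toy, aes(x = state, y = BigState)) + geom_count()
print(p)
```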
  53. One-variable plots What questions can we answer? Continuous Discrete 0

    10000 Apple Banana Count Count Density Proportion Q: How many apples vs. bananas did we eat? We ate more apples than bananas Q: What was the distribution of catch data? Mostly small catches, a few large catches Catch
  54. Two-variable plots Discrete Continuous Continuous Continuous Q: Is there a

    relationship between my explanatory (X) and response variable (Y)? year catch 1970 2000 A: Maybe, catch increases over time sort of state catch Q: Is there a difference between catches across states? A: Yes, Catch higher in MI, etc. MI MN WI
  55. Defining your question is the most important part of graph-making

    The second most important part is accurately describing your data: - Continuous? - Discrete? - Ordinal? - Nominal? - Binary? 3rd: Decide which variable is explanatory and which is the response (for two-variable plots) Finally… make the graph! (In base or ggplot)
  56. Activity There are four graphs printed. In pairs or 3’s,

    take one. • What type of graph is it? • What type(s) of data are in the graph? (discrete, continuous) • What question is the graph asking? • What inference could you draw from this graph? • Describe the main finding shown in the graph
  57. Example Apple Banana Count This is a barplot. It is

    a one-variable plot, and the variable is “fruit type” which is discrete – nominal The question it is asking is: “How many fruits of each type were eaten” and the inference you would draw is “Which fruit type is eaten more?” Described: “Many more apples were eaten than bananas”
  58. Activity There are three graphs printed. In pairs or 3’s,

    take one. • What type of graph is it? • What type(s) of data are in the graph? (discrete, continuous) • What question is the graph asking? • What inference could you draw from this graph? • Describe the main finding shown in the graph
  59. This figure shows the CATCH PER UNIT EFFORT (i.e. snow

    crabs caught per trap deployment) across several types of bait on the X axis. https://peerj.com/articles/6874/ Araya-Schmidt et al. 2019
  60. Number of hurricanes per two-week period Number of “wrecked birds”

    per two-week period – i.e. dead birds found on beaches, presumably killed due to hurricanes https://peerj.com/articles/3287/ Huang et al. 2017
  61. https://peerj.com/articles/5306/ Tixier et al. 2018 Depredation is when something takes

    bait off a hook. Here, they went fishing many days and recorded whether or not depredation occurred on their fishing gear. Black bars are Amsterdam/St. Paul, grey bars are SE Australia
  62. IQR = Q3 – Q1 = 13 – 3 = 10

    Q3 + 1.5 * IQR = 13 + 1.5 * 10 = 28 So… everything > 28 = boxplot outlier
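R's own boxplot calculation can be inspected with boxplot.stats(). The sketch below uses a made-up vector, not the number series from the slide. Note that boxplot.stats() works from "hinges", which can differ slightly from quartiles computed by other methods.

```r
# Hypothetical data with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b <- boxplot.stats(x)  # default coef = 1.5

b$stats  # 1.0 3.0 5.5 8.0 9.0: lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # 30: beyond upper hinge + 1.5 * (8 - 3) = 15.5
```

The upper whisker stops at 9, the largest observation inside the fence, rather than at the fence value itself.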
  63. Before: IQR = Q3 – Q1 = 13 – 3 = 10, so Q3 + 1.5 * IQR = 13 + 1.5 * 10 = 28

    Everything > 28 = outlier. If we also had 27 in the number series… IQR = Q3 – Q1 = 14 – 3 = 11, so Q3 + 1.5 * IQR = 14 + 1.5 * 11 = 30.5. Everything > 30.5 = outlier. No more outliers!
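The upper-fence arithmetic can be checked with a few lines of R. `upper_fence` is a helper written for this example, not a built-in function; the quartile values are the ones quoted on the slides.

```r
# Upper outlier threshold: Q3 + k * IQR, where IQR = Q3 - Q1
upper_fence <- function(q1, q3, k = 1.5) {
  q3 + k * (q3 - q1)
}

upper_fence(q1 = 3, q3 = 13)  # 28
upper_fence(q1 = 3, q3 = 14)  # 30.5
```

Raising Q3 by one both shifts the starting point and widens the IQR, so the threshold moves up by 2.5 rather than 1.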