Data Visualization Using R

William Gunn
January 26, 2013

An invited workshop at Science Online 2013 on data visualization using R.


Transcript

  1. Data Visualization using R

    How to get, manage, and present data to tell a compelling science story
    William Gunn, @mrgunn
    Head of Academic Outreach, Mendeley
    Access point: NRC Visitor
  2. Outline

    1. A short history of graphical presentation of data
    2. Introduction to R
    3. Finding, cleaning, and presenting data
    4. Reproducibility and data sharing
  3. Data viz has a long history

    John Snow’s cholera map helped communicate the idea that cholera was a water-borne disease.
  4. How our eyes and brain perceive

    It takes 200 ms to initiate an eye movement, but the red dot can be found in 100 ms or less. This is due to pre-attentive processing.
  5. There are many “primitive” properties which we perceive

    • Length
    • Width
    • Size
    • Density
    • Hue
    • Color intensity
    • Depth
    • 3-D orientation
  6. Hue

  7. Types of color schemes

    • Sequential – suited for ordered data that progress from low to high. Use light colors for low values and dark colors for higher values.
    • Diverging – uses hue to show the breakpoint and intensity to show divergent extremes.
    • Qualitative – uses different colors to represent different categories. Beware of using hue/saturation to highlight unimportant categories.
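The three scheme types can be sketched in base R. This is only an illustration: it assumes R 3.6 or later, and `hcl.colors()` with these palette names stands in for the ColorBrewer palettes, which are a separate package. The variable names and class counts are made up.

```r
# Sequential: one hue, varying lightness. rev = TRUE reverses the palette's
# default order; print the result to confirm light comes first for low values.
seq_cols <- hcl.colors(5, palette = "Blues", rev = TRUE)

# Diverging: two hues meeting at a neutral midpoint (the breakpoint).
div_cols <- hcl.colors(7, palette = "Blue-Red")

# Qualitative: distinct hues of similar intensity, one per category.
qual_cols <- hcl.colors(4, palette = "Set 2")

print(seq_cols)
print(div_cols)
print(qual_cols)
```

Each call returns a character vector of hex color codes that can be passed to a plotting function.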
  8. Tips for maps

    • Keep it to 5-7 data classes
    • ~8% of men are red-green colorblind
    • Diverging schemes don’t do well when printed or photocopied
    • Colors will often render differently on different screens, especially low-end LCD screens
    • http://colorbrewer2.org
  9. Why R?

    • Open source tool
    • Huge variety of packages for any kind of analysis
    • Saves time repeating data processing steps
    • Allows working with more diverse types of data and much larger datasets than Excel
    • Processing is much faster than Excel
    • Scripts are easily shareable, promoting reproducible work
  10. .csv and .xls / .xlsx

    • Excel files are designed to hold the appearance of the spreadsheet in addition to the data.
    • R just wants the data, so always save as .csv if you have tabular data.
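A minimal sketch of reading a .csv into R. The file contents and column names are made up for illustration; a temporary file stands in for a spreadsheet saved as .csv.

```r
# Hypothetical data written to a temp file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,reading", "a,1.5", "b,2.25", "c,4"), csv_path)

# read.csv() returns a data frame with one column per CSV column.
readings <- read.csv(csv_path, stringsAsFactors = FALSE)

str(readings)            # shows column names and types
mean(readings$reading)   # columns are ordinary vectors
```

With a real file, `csv_path` would simply be the path to the saved .csv.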
  11. Types of data

    • y <- c("abc", "def", "g", "h", "i")
    • y
    • class(y)
    • y[2]
    • length(y)
    • Data can be integer (1, 2, 3, …), numeric (1.0, 2.3, …), character (a, b, c, …), logical (TRUE, FALSE) or other things
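The commands on this slide, with the results R gives. The `classes` vector is added here for illustration. One detail worth knowing: a number typed without a suffix is stored as numeric, and the `L` suffix requests an integer.

```r
y <- c("abc", "def", "g", "h", "i")
y_class <- class(y)    # "character"
second  <- y[2]        # "def" (indexing starts at 1)
n       <- length(y)   # 5

# class() of one value of each basic type
classes <- c(class(1L), class(1), class(2.3), class("a"), class(TRUE))
print(classes)   # "integer" "numeric" "numeric" "character" "logical"
```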
  12. Vectors

    • R can hold data organized in a few different ways
    • vectors – (1,2,3,4) but not (1,2,3,x,y,z)
    • lists – can hold heterogeneous data, e.g. 1, 2, a, x
    • arrays – multi-dimensional
    • dataframes – lists of vectors, like spreadsheets
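A short sketch of the four structures; all names and values are illustrative. Note that R does not reject a mixed vector outright: `c()` silently converts every element to a common type, which is usually not what was intended.

```r
v <- c(1, 2, 3, 4)             # vector: every element has the same type
mixed <- c(1, 2, 3, "x")       # mixing types coerces everything to character
mixed_class <- class(mixed)    # "character"

l <- list(1, 2, "a", "x")      # list: each element keeps its own type
first_class <- class(l[[1]])   # "numeric"
third_class <- class(l[[3]])   # "character"

a <- array(1:24, dim = c(2, 3, 4))   # array: 2 x 3 x 4, one type throughout

# data frame: a list of equal-length vectors, one per column
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
```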
  13. Vector operations

    • x + 1
    • x
    • sum(x)
    • mean(x)
    • mean(x+1)
    • x[2]<-x[2]+1
    • x
    • x+c(2:3)
    • x[2:10] + c(2:3)
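The slide does not show how `x` was defined; the sketch below assumes `x <- 1:10`, which is a guess chosen so that `x[2:10]` is valid. The last two commands illustrate recycling: the shorter vector is repeated to match the longer one, and R warns when the lengths do not divide evenly.

```r
x <- 1:10              # assumed starting value, not from the slide

plus_one <- x + 1      # arithmetic applies element by element; x is unchanged
total    <- sum(x)     # 55
avg      <- mean(x)    # 5.5
avg_plus <- mean(x + 1)   # 6.5

x[2] <- x[2] + 1       # modifies only the second element: now 3

# c(2:3) is recycled as 2, 3, 2, 3, ... across all ten elements
recycled <- x + c(2:3)

# Nine elements plus two do not divide evenly, so R warns;
# tryCatch() captures the warning text instead of printing it.
msg <- tryCatch(x[2:10] + c(2:3), warning = function(w) conditionMessage(w))
print(msg)
```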
  14. Working with lists

    • y <- list(name = "Bob", age = 24)
    • y
    • y$name
    • y[1]
    • y[[1]]
    • class(y[1])
    • class(y[[1]])
    • y <- list(y$name, "Sue")
    • y$name
    • y$age[2] <- list(33)
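A sketch of what these commands show, mainly the difference between `[ ]` and `[[ ]]`. The names `y2` and `people` are introduced here for illustration and are not from the slide. The `people` list is one possible way to hold two records while keeping names.

```r
y <- list(name = "Bob", age = 24)

single <- class(y[1])     # "list": single brackets return a sub-list
double <- class(y[[1]])   # "character": double brackets return the element

# Rebuilding the list without names drops them, so $name finds nothing.
y2 <- list(y$name, "Sue")
lost <- y2$name           # NULL

# One way to keep the names: store a vector per field.
people <- list(name = c("Bob", "Sue"), age = c(24, 33))
people$age[2]             # 33
```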
  15. PLOTS

    • ggplot2 – an implementation of the “grammar of graphics” in R
    • a set of graph types and a way of mapping variables to graph features
    • graph types are called “geoms”
    • mappings are “aesthetics”
    • graphs are built up by layering geoms
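A minimal sketch of layering, assuming the ggplot2 package is installed. It uses the `mtcars` dataset that ships with R; the choice of columns is arbitrary.

```r
library(ggplot2)

# aes() maps variables to graph features: weight to x, fuel economy to y.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                          # layer 1: the points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) # layer 2: a fitted line

n_layers <- length(p$layers)   # one entry per geom added
```

Typing `p` at the console, or calling `print(p)`, draws the plot.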
  16. Types of geoms

    • point – dotplot – takes x,y coords of points
    • abline – line layer – takes slope, intercept
    • line – connect points with a line
    • smooth – fit a curve
    • bar – aka histogram – takes vector of data
    • boxplot – box and whiskers
    • density – to show relative distributions
    • errorbar – what it says on the tin
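One hedged example per geom in the list, assuming ggplot2 is installed and again using `mtcars`. The column choices, the hand-picked slope and intercept, and the mean ± one standard deviation error bars are illustrative assumptions. Current ggplot2 also offers `geom_histogram()` for binned continuous data, shown alongside `geom_bar()`.

```r
library(ggplot2)

cars <- transform(mtcars, cyl = factor(cyl))   # treat cylinders as categories

plots <- list(
  point   = ggplot(cars, aes(wt, mpg)) + geom_point(),
  abline  = ggplot(cars, aes(wt, mpg)) + geom_point() +
              geom_abline(slope = -5, intercept = 37),   # hand-picked line
  line    = ggplot(cars, aes(wt, mpg)) + geom_line(),
  smooth  = ggplot(cars, aes(wt, mpg)) +
              geom_smooth(method = "lm", formula = y ~ x),
  bar     = ggplot(cars, aes(cyl)) + geom_bar(),          # counts per category
  hist    = ggplot(cars, aes(mpg)) + geom_histogram(bins = 10),
  boxplot = ggplot(cars, aes(cyl, mpg)) + geom_boxplot(),
  density = ggplot(cars, aes(mpg, fill = cyl)) + geom_density(alpha = 0.5)
)

# Error bars need explicit ymin/ymax, so summarise each group first.
means <- tapply(cars$mpg, cars$cyl, mean)
sds   <- tapply(cars$mpg, cars$cyl, sd)
summary_df <- data.frame(cyl = names(means),
                         mean = as.numeric(means),
                         sd = as.numeric(sds))

plots$errorbar <- ggplot(summary_df, aes(cyl, mean)) + geom_point() +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2)
```

Each entry can be drawn with, for example, `print(plots$boxplot)`.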