Data Visualization using R
How to get, manage, and present data to tell a compelling science story
William Gunn
@mrgunn
Head of Academic Outreach, Mendeley
Slide 2
Slide 2 text
1. A short history of graphical presentation of data
2. Introduction to R
3. Finding, cleaning, and presenting data
4. Reproducibility and data sharing
Slide 3
Slide 3 text
Data viz has a long history
John Snow’s cholera map helped communicate the idea that cholera was a water-borne disease.
Slide 4
Slide 4 text
Florence Nightingale used dataviz
Slide 5
Slide 5 text
Modernization of dataviz
Slide 6
Slide 6 text
Chart junk: good, bad, and ugly
Which presentation is better?
Slide 8
Slide 8 text
It can be elegant…
Slide 10
Slide 10 text
Tufte
Slide 11
Slide 11 text
Tufte
Slide 12
Slide 12 text
How our eyes and brain perceive
It takes 200 ms to initiate an eye movement, but the red dot
can be found in 100 ms or less. This is due to pre-attentive
processing.
Slide 13
Slide 13 text
Shape is a little slower than color!
Slide 14
Slide 14 text
Pre-attentive processing fails!
Slide 15
Slide 15 text
There are many “primitive”
properties which we perceive
• Length
• Width
• Size
• Density
• Hue
• Color intensity
• Depth
• 3-D orientation
Slide 16
Slide 16 text
Length
Slide 17
Slide 17 text
Width
Slide 18
Slide 18 text
Density
Slide 19
Slide 19 text
Hue
Slide 20
Slide 20 text
Color Intensity
Slide 21
Slide 21 text
Depth
Slide 22
Slide 22 text
3D orientation
Slide 24
Slide 24 text
Types of color schemes
• Sequential – suited for ordered data that progress from low to high. Use light colors for low values and dark colors for high values.
• Diverging – uses hue to show the breakpoint and intensity to show divergent extremes.
• Qualitative – uses different colors to represent different categories. Beware of letting differences in hue or saturation draw attention to unimportant categories.
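The three scheme types can be pulled into R as vectors of hex colors. This is a minimal sketch, assuming the RColorBrewer package (an R port of the ColorBrewer palettes) is installed; the palette names "Blues", "RdBu", and "Set2" are just one pick of each type.

```r
library(RColorBrewer)  # assumption: installed via install.packages("RColorBrewer")

# Sequential: light (low values) to dark (high values)
seq_cols <- brewer.pal(5, "Blues")

# Diverging: two hues moving away from a neutral midpoint
div_cols <- brewer.pal(5, "RdBu")

# Qualitative: distinct hues with no implied order
qual_cols <- brewer.pal(5, "Set2")

# Palettes the package flags as colorblind-friendly
cb_safe <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]

print(seq_cols)
print(cb_safe)
```

Each call returns a character vector of hex codes that can be handed to any plotting function that accepts colors.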
Slide 25
Slide 25 text
Sequential
http://colorbrewer2.org/
Slide 26
Slide 26 text
Diverging
Slide 27
Slide 27 text
Qualitative
Slide 28
Slide 28 text
Tips for maps
• Keep it to 5-7 data classes
• ~8% of men are red-green colorblind
• Diverging schemes don’t do well when
printed or photocopied
• Colors will often render differently on
different screens, especially low-end LCD
screens
• http://colorbrewer2.org
Slide 29
Slide 29 text
Part 2
Introduction to R
Slide 30
Slide 30 text
Why R?
• Open source tool
• Huge variety of packages for any kind of
analysis
• Saves time repeating data processing steps
• Allows working with more diverse types of
data and much larger datasets than Excel
• Processing is much faster than Excel
• Scripts are easily shareable, promoting
reproducible work
Slide 31
Slide 31 text
.csv and .xls / .xlsx
• Excel files are designed to hold the
appearance of the spreadsheet in addition
to the data.
• R just wants the data, so always save as
.csv if you have tabular data
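As a rough illustration of the .csv route, the sketch below writes a tiny made-up table to a temporary file and reads it back with base R's read.csv. The file contents and column names are invented for the example.

```r
# Write a small, made-up table to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,value", "a,1.5", "b,2.5"), csv_path)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

str(dat)          # structure: column names and types
mean(dat$value)   # 2
```

In practice the path would point at the file exported from Excel with "Save as .csv".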
Slide 32
Slide 32 text
data structures
• x<-c(1,2,3,4,5,6,7,8,9,10)
• x
• length(x)
• x[1]
• x[2]
• x<-c(1:10)
• x
Slide 33
Slide 33 text
types of data
• y<-c("abc", "def", "g", "h", "i")
• y
• class(y)
• y[2]
• length(y)
• data can be integer (1L, 2L, 3L, …), numeric (1.0, 2.3, …), character ("a", "b", "c", …), logical (TRUE, FALSE) or other things. A plain 1 is stored as numeric; the L suffix or a range such as 1:3 gives integers.
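A quick way to check which type R has chosen is class(). The values below are arbitrary examples.

```r
# class() reports the type R assigned to each value
int_range <- class(1:3)          # "integer"  (ranges produce integers)
plain_num <- class(c(1, 2, 3))   # "numeric"  (plain numbers are doubles)
int_lit   <- class(3L)           # "integer"  (L suffix forces integer)
chr       <- class("abc")        # "character"
lgl       <- class(TRUE)         # "logical"

print(c(int_range, plain_num, int_lit, chr, lgl))
```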
Slide 34
Slide 34 text
Vectors
• R can hold data organized a few different
ways
• vectors – every element has the same type: (1,2,3,4) but not (1,2,3,x,y,z); mixed input is coerced to a single type
• lists – can hold heterogeneous data
– 1
– 2
– a
• x
• arrays – multi-dimensional
• dataframes – lists of equal-length vectors, like spreadsheets
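The four structures can be sketched side by side. All values and names here are invented for illustration.

```r
# Vector: one type only, so mixing numbers and text coerces everything to character
v <- c(1, 2, 3, "x")
class(v)                      # "character"

# List: elements keep their own types
l <- list(1, 2, "a", c("x", "y"))
class(l[[1]])                 # "numeric"
class(l[[3]])                 # "character"

# Array: one type, any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))
dim(a)                        # 2 3 4

# Data frame: a list of equal-length vectors, one per column
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(df)                   # TRUE
```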
Slide 35
Slide 35 text
Vector operations
• x + 1
• x
• sum(x)
• mean(x)
• mean(x+1)
• x[2]<-x[2]+1
• x
• x+c(2:3)
• x[2:10] + c(2:3)
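The last two lines rely on recycling: when two vectors differ in length, R repeats the shorter one. The sketch below starts from a fresh x <- 1:10 (without the x[2] change made above) and captures the warning R raises when the lengths do not divide evenly.

```r
x <- 1:10

# Length 10 and length 2: the shorter vector is recycled five times
even_fit <- x + c(2, 3)
print(even_fit)   # 3 5 5 7 7 9 9 11 11 13

# Length 9 and length 2: R still recycles, but warns about the mismatch.
# tryCatch() captures the warning message instead of the result.
uneven_msg <- tryCatch(
  x[2:10] + c(2, 3),
  warning = function(w) conditionMessage(w)
)
print(uneven_msg)
```

The warning is worth heeding, since recycling with mismatched lengths is usually a sign of a mistake.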
Slide 36
Slide 36 text
working with lists
• y<-list(name = "Bob", age = 24)
• y
• y$name
• y[1]
• y[[1]]
• class(y[1])
• class(y[[1]])
• y<-list(y$name, "Sue")
• y$name
• y$age[2]<-list(33)
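Two details in the list commands above are easy to miss: single brackets return a list while double brackets return the element itself, and rebuilding a list without names means $name no longer finds anything. A minimal sketch follows; the name y_unnamed and the final y_named variant are additions for illustration, not part of the original slide.

```r
y <- list(name = "Bob", age = 24)

single <- class(y[1])     # "list": a one-element list
double <- class(y[[1]])   # "character": the element itself

# Rebuilding without names drops them, so $name returns NULL
y_unnamed <- list(y$name, "Sue")
is.null(y_unnamed$name)   # TRUE

# One way to keep the names while adding a second person (illustrative)
y_named <- list(name = c(y$name, "Sue"), age = c(y$age, 33))
y_named$age[2]            # 33
```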
Selecting subsets of data
• "["
• "$"
• which
• grep and grepl
• subset
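Each of these tools can be tried on a small data frame. The data below are made up, and the column names gene and expr are placeholders.

```r
df <- data.frame(
  gene = c("abc1", "abd2", "xyz3", "abc4"),
  expr = c(5.2, 0.7, 3.1, 8.4),
  stringsAsFactors = FALSE
)

high_rows <- df[df$expr > 3, ]            # "[" with a logical test on rows
genes     <- df$gene                      # "$" pulls out one column
high_idx  <- which(df$expr > 3)           # positions where the test is TRUE: 1 3 4
abc_idx   <- grep("^abc", df$gene)        # positions matching a pattern: 1 4
abc_lgl   <- grepl("^abc", df$gene)       # TRUE FALSE FALSE TRUE

# subset() combines row conditions and column selection
abc_high <- subset(df, expr > 3 & grepl("^abc", gene), select = gene)
print(abc_high)
```

grep returns positions and grepl returns TRUE/FALSE values, so grepl is the one to use inside a row condition.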
Slide 39
Slide 39 text
PLOTS
• ggplot2 – an implementation of the
“grammar of graphics” in R
• a set of graph types and a way of mapping
variables to graph features
• graph types are called “geoms”
• mappings are “aesthetics”
• graphs are built up by layering geoms
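The layering idea can be sketched with the mtcars dataset that ships with R, assuming the ggplot2 package is installed. The choice of variables is arbitrary.

```r
library(ggplot2)  # assumption: installed via install.packages("ggplot2")

# aes() maps variables to graph features (aesthetics):
# weight to x, fuel economy to y, cylinder count to colour
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                           # layer 1: one point per car
  geom_smooth(method = "lm", se = FALSE)   # layer 2: a fitted line per colour group

print(p)
```

Because the colour mapping is set in the top-level aes(), both layers inherit it, so the fit is drawn separately for each cylinder group.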
Slide 40
Slide 40 text
Types of geoms
• point – scatterplot – takes x,y coords of points
• abline – line layer – takes slope, intercept
• line – connect points with a line
• smooth – fit a curve
• bar – bar chart of counts per category; histogram is the binned equivalent for continuous data
• boxplot – box and whiskers
• density – to show relative distributions
• errorbar – what it says on the tin
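A few of these geoms are sketched below on the built-in mtcars data, assuming ggplot2 is installed. The intercept and slope given to geom_abline are arbitrary illustrative numbers, not a fitted model, and the summary table summ is a helper made up for the error bar example.

```r
library(ggplot2)  # assumption: installed via install.packages("ggplot2")

# bar: counts of a categorical variable
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# histogram: binned counts of a continuous variable
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# boxplot: box and whiskers per group
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# density: overlaid distributions, made translucent with alpha
p_dens <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.4)

# point + abline: scatterplot with a reference line
p_line <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = 37, slope = -5)

# errorbar: needs ymin and ymax, here mean +/- one standard deviation
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
sds   <- aggregate(mpg ~ cyl, data = mtcars, FUN = sd)
summ  <- data.frame(cyl = factor(means$cyl), mean = means$mpg, sd = sds$mpg)

p_err <- ggplot(summ, aes(x = cyl, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2)

print(p_err)
```

Printing any of the other objects (for example print(p_box)) draws that plot.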