Slide 1

Slide 1 text

Data Visualization using R
How to get, manage, and present data to tell a compelling science story
William Gunn, @mrgunn
Head of Academic Outreach, Mendeley
Access point: NRC Visitor

Slide 2

Slide 2 text

1. A short history of graphical presentation of data
2. Introduction to R
3. Finding, cleaning, and presenting data
4. Reproducibility and data sharing

Slide 3

Slide 3 text

Data viz has a long history
John Snow’s cholera map helped communicate the idea that cholera was a water-borne disease.

Slide 4

Slide 4 text

Florence Nightingale used dataviz

Slide 5

Slide 5 text

Modernization of dataviz

Slide 6

Slide 6 text

Chart junk: good, bad, and ugly
Which presentation is better?

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

It can be elegant…

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Tufte

Slide 11

Slide 11 text

Tufte

Slide 12

Slide 12 text

How our eyes and brain perceive
It takes 200 ms to initiate an eye movement, but the red dot can be found in 100 ms or less. This is due to pre-attentive processing.

Slide 13

Slide 13 text

Shape is a little slower than color!

Slide 14

Slide 14 text

Pre-attentive processing fails!

Slide 15

Slide 15 text

There are many “primitive” properties which we perceive
• Length
• Width
• Size
• Density
• Hue
• Color intensity
• Depth
• 3-D orientation

Slide 16

Slide 16 text

Length

Slide 17

Slide 17 text

Width

Slide 18

Slide 18 text

Density

Slide 19

Slide 19 text

Hue

Slide 20

Slide 20 text

Color Intensity

Slide 21

Slide 21 text

Depth

Slide 22

Slide 22 text

3D orientation

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Types of color schemes
• Sequential – suited for ordered data that progress from low to high. Use light colors for low values and dark colors for high values.
• Diverging – uses hue to show the breakpoint and intensity to show divergent extremes.
• Qualitative – uses different colors to represent different categories. Beware of using hue/saturation to highlight unimportant categories.
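
As a rough illustration of the three scheme types, the sketch below builds one palette of each kind with base R's `hcl.colors()` (available in R 3.6 and later). The palette names "Blues 3", "Blue-Red", and "Set 2" and the choice of five classes are assumptions made for this example, not something the slides prescribe.

```r
# Base R sketch of the three scheme types (palette names are this example's choice).
n_classes <- 5

# Sequential: rev = TRUE reverses the default order so the light end comes first.
sequential <- hcl.colors(n_classes, palette = "Blues 3", rev = TRUE)

# Diverging: two hues that meet at a neutral midpoint.
diverging <- hcl.colors(n_classes, palette = "Blue-Red")

# Qualitative: distinct hues, one per category.
qualitative <- hcl.colors(n_classes, palette = "Set 2")

# Draw each palette as a row of swatches.
palettes <- list(sequential = sequential,
                 diverging = diverging,
                 qualitative = qualitative)
op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
for (name in names(palettes)) {
  barplot(rep(1, n_classes), col = palettes[[name]], border = NA,
          space = 0, axes = FALSE, main = name)
}
par(op)
```

The ColorBrewer site linked in the following slides offers hand-tuned palettes of the same three kinds.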

Slide 25

Slide 25 text

Sequential http://colorbrewer2.org/

Slide 26

Slide 26 text

Diverging

Slide 27

Slide 27 text

Qualitative

Slide 28

Slide 28 text

Tips for maps
• Keep it to 5-7 data classes
• ~8% of men are red-green colorblind
• Diverging schemes don’t do well when printed or photocopied
• Colors will often render differently on different screens, especially low-end LCD screens
• http://colorbrewer2.org

Slide 29

Slide 29 text

Part 2: Introduction to R

Slide 30

Slide 30 text

Why R?
• Open source tool
• Huge variety of packages for any kind of analysis
• Saves time repeating data processing steps
• Allows working with more diverse types of data and much larger datasets than Excel
• Processing is much faster than Excel
• Scripts are easily shareable, promoting reproducible work

Slide 31

Slide 31 text

.csv and .xls / .xlsx
• Excel files are designed to hold the appearance of the spreadsheet in addition to the data.
• R just wants the data, so always save tabular data as .csv.
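
To make the point about plain data concrete, here is a minimal round trip through a .csv file. The `measurements` table and its column names are invented for the example, and the file goes to a temporary location rather than a real path.

```r
# Hypothetical table standing in for a spreadsheet.
measurements <- data.frame(
  sample = c("a", "b", "c"),
  value  = c(1.5, 2.25, 3),
  stringsAsFactors = FALSE
)

# Write to a temporary .csv; row.names = FALSE avoids an extra index column.
csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, csv_path, row.names = FALSE)

# The file is plain text: a header line followed by one line per row.
readLines(csv_path)

# Reading it back gives a data frame with the same columns and values.
restored <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
all.equal(restored, measurements)
```

Nothing about fonts, cell colors, or formulas survives the trip, which is exactly what R wants.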

Slide 32

Slide 32 text

data structures
• x <- c(1,2,3,4,5,6,7,8,9,10)
• x
• length(x)
• x[1]
• x[2]
• x <- c(1:10)
• x

Slide 33

Slide 33 text

types of data
• y <- c("abc", "def", "g", "h", "i")
• y
• class(y)
• y[2]
• length(y)
• data can be integer (1L, 2L, 3L, …), numeric (1.0, 2.3, …), character ("a", "b", "c", …), logical (TRUE, FALSE) or other things
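
A quick way to see these types is to ask R with `class()`. The sketch below uses throwaway values chosen for illustration; the one detail that often surprises people is that a bare number such as `1` is stored as numeric, while the `L` suffix or the colon operator gives an integer.

```r
class(1)       # "numeric": a bare number is stored as a double
class(1L)      # "integer": the L suffix requests an integer
class(1:3)     # "integer": the colon operator produces integers
class(2.3)     # "numeric"
class("abc")   # "character"
class(TRUE)    # "logical"

# Explicit conversion between types.
n <- as.integer("42")
class(n)       # "integer"
s <- as.character(3.5)
s              # "3.5"
```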

Slide 34

Slide 34 text

Vectors
• R can hold data organized a few different ways
• vectors – all elements of one type: (1,2,3,4) but not (1,2,3,x,y,z)
• lists – can hold heterogeneous data
  – 1
  – 2
  – a
  – x
• arrays – multi-dimensional
• dataframes – lists of vectors, like spreadsheets
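
The four structures can be compared side by side. This is a small sketch with made-up values; the variable names `v`, `l`, `a`, and `df` are chosen only for the example.

```r
# Vector: one type only, so mixing numbers and text coerces everything to character.
v <- c(1, 2, 3, "x")
class(v)             # "character"

# List: each element keeps its own type.
l <- list(1, 2, "a", TRUE)
sapply(l, class)     # "numeric" "numeric" "character" "logical"

# Array: a vector with dimensions, here 2 x 3 x 2.
a <- array(1:12, dim = c(2, 3, 2))
dim(a)               # 2 3 2

# Data frame: a list of equal-length vectors, one per column.
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(df)          # TRUE
sapply(df, class)    # id is "integer", label is "character"
```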

Slide 35

Slide 35 text

Vector operations
• x + 1
• x
• sum(x)
• mean(x)
• mean(x + 1)
• x[2] <- x[2] + 1
• x
• x + c(2:3)
• x[2:10] + c(2:3)
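
The last two lines of this slide rely on recycling: when two vectors differ in length, R repeats the shorter one. The sketch below starts from a fresh `x <- 1:10` (so it skips the slide's `x[2]` update) and captures the warning R raises when the longer length is not a multiple of the shorter one.

```r
x <- 1:10

x + 1          # 2 3 4 5 6 7 8 9 10 11: the 1 is recycled ten times
x + c(2, 3)    # 3 5 5 7 7 9 9 11 11 13: c(2, 3) is recycled five times

# Nine elements is not a multiple of two, so R still recycles but warns.
result <- withCallingHandlers(
  x[2:10] + c(2, 3),
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
result         # 4 6 6 8 8 10 10 12 12
```

Silent recycling is convenient for `x + 1` but can hide mistakes, so the warning in the last case is worth reading rather than ignoring.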

Slide 36

Slide 36 text

working with lists
• y <- list(name = "Bob", age = 24)
• y
• y$name
• y[1]
• y[[1]]
• class(y[1])
• class(y[[1]])
• y <- list(y$name, "Sue")
• y$name
• y$age[2] <- list(33)
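
The difference between `[`, `[[`, and `$` is easier to see on a small named list. This sketch uses a separate `person` list and a hypothetical `height_cm` field so it does not depend on the slide's `y`; it also shows why `y$name` returns `NULL` after a list is rebuilt without names.

```r
person <- list(name = "Bob", age = 24)

class(person["name"])     # "list": single brackets return a sub-list
class(person[["name"]])   # "character": double brackets return the element itself
person$age                # 24: $ is shorthand for [[ ]] with a name

# Add a new named element (hypothetical field), then update an existing one.
person$height_cm <- 180
person$age <- person$age + 1
names(person)             # "name" "age" "height_cm"

# list() called without names produces an unnamed list, so $name finds nothing.
unnamed <- list(person$name, "Sue")
names(unnamed)            # NULL
unnamed$name              # NULL
```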

Slide 37

Slide 37 text

Loading data
• data <- read.csv("C:/Users/William Gunn/Desktop/Dropbox/Scripting/Data/traffic_accidents/accidents2010_all.csv", header = TRUE, stringsAsFactors = FALSE)
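
After loading, it helps to check what R actually read. The sketch below parses a tiny inline table instead of the real file; the column names and values are invented and are not taken from the accidents dataset. It also shows what the `stringsAsFactors` argument changes.

```r
# Hypothetical stand-in for a CSV file; columns and values are invented.
csv_text <- "accident_id,severity,road_type
1,Slight,Single carriageway
2,Serious,Roundabout
3,Slight,Dual carriageway"

# text = lets read.csv parse a string instead of a file path.
accidents <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

str(accidents)        # column names and types at a glance
head(accidents, 2)    # first two rows
dim(accidents)        # 3 rows, 3 columns

# With stringsAsFactors = TRUE, text columns are read as factors instead.
as_factors <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
class(accidents$severity)    # "character"
class(as_factors$severity)   # "factor"
```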

Slide 38

Slide 38 text

Selecting subsets of data
• "["
• "$"
• which
• grep and grepl
• subset
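
Each of these tools can be tried on a small data frame. The `df` below, with its `city` and `count` columns, is a hypothetical example made up for this sketch.

```r
# Hypothetical data frame used only for illustration.
df <- data.frame(
  city  = c("Ottawa", "Toronto", "Montreal", "Vancouver"),
  count = c(12, 45, 30, 27),
  stringsAsFactors = FALSE
)

df[2, ]                      # "[" by position: second row, all columns
df[df$count > 25, "city"]    # "[" with a logical test: "Toronto" "Montreal" "Vancouver"
df$count                     # "$" extracts one column as a vector

which(df$count > 25)         # 2 3 4: positions where the test is TRUE
grep("on", df$city)          # 2 3: positions of elements matching the pattern
grepl("^T", df$city)         # FALSE TRUE FALSE FALSE: one logical per element

# subset() combines row conditions and column selection using bare column names.
big_on <- subset(df, count > 25 & grepl("on", city), select = city)
big_on
```

`grep` returns positions and `grepl` returns a logical vector, so `grepl` is usually the one to combine with other conditions.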

Slide 39

Slide 39 text

PLOTS
• ggplot2 – an implementation of the “grammar of graphics” in R
• a set of graph types and a way of mapping variables to graph features
• graph types are called “geoms”
• mappings are “aesthetics”
• graphs are built up by layering geoms
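
A minimal sketch of layering, assuming the ggplot2 package is installed. It uses R's built-in `mtcars` dataset rather than any data from this talk, and the choice of variables is only for illustration.

```r
library(ggplot2)   # assumes ggplot2 is installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +    # data plus aesthetic mappings
  geom_point(aes(colour = factor(cyl))) +      # layer 1: one point per car
  geom_smooth(method = "lm", se = FALSE) +     # layer 2: fitted straight line
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

The `aes()` call maps columns to visual properties, and each `geom_*` call adds one layer on top of the previous ones.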

Slide 40

Slide 40 text

Types of geoms
• point – scatterplot – takes x,y coords of points
• abline – line layer – takes slope, intercept
• line – connect points with a line
• smooth – fit a curve
• bar – bar chart of counts (histogram is the binned version for continuous data) – takes vector of data
• boxplot – box and whiskers
• density – to show relative distributions
• errorbar – what it says on the tin
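
A few of these geoms, sketched on the built-in `mtcars` dataset and assuming ggplot2 is installed. The variable choices and the bin width are arbitrary and used only for illustration.

```r
library(ggplot2)   # assumes ggplot2 is installed

# bar: counts rows per category (cyl is treated as categorical).
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# histogram: bins a continuous variable, then counts rows per bin.
hist_plot <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2.5)

# boxplot: distribution of mpg within each category.
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# abline: a reference line from an intercept and slope, drawn over points.
fit <- coef(lm(mpg ~ wt, data = mtcars))
line_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = fit[["(Intercept)"]], slope = fit[["wt"]])

print(bar_plot)
print(hist_plot)
print(box_plot)
print(line_plot)
```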