Slide 1

Slide 1 text

TIDY DATA
Jeff Goldsmith, PhD
Department of Biostatistics

Slide 2

Slide 2 text

Tidy data
• “Middle” step in the wrangling process
Source: R for Data Science

Slide 3

Slide 3 text

Rules for tidy data
• Data tables have an implied structure, which the “tidy data” framework makes explicit
  – Columns are variables
  – Rows are observations
  – Every value has a cell
Source: R for Data Science
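A minimal sketch of a table that follows these three rules. The blood-pressure data and the names `bp_tidy`, `subject`, `visit`, and `sbp` are hypothetical, made up for illustration.

```r
library(tibble)

# Hypothetical data: each row is one observation (a subject at a visit),
# each column is one variable, and each cell holds a single value.
bp_tidy <- tribble(
  ~subject, ~visit, ~sbp,
  "s01",    1,      120,
  "s01",    2,      118,
  "s02",    1,      135,
  "s02",    2,      131
)

bp_tidy
```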

Slide 4

Slide 4 text

Why tidy your data?
• Consistent data structures will simplify your thought process
  – Especially true if you use tools designed for tidy data
  – Sounds like something the “tidyverse” would help with…
• Data written for computers is easier to work with

Slide 5

Slide 5 text

Not all data are tidy
• Column names are values, not variable names
• Single columns contain multiple variables
• Data are stored in multiple tables
• Non-tidiness is sometimes (if only rarely) intentional
• Data written for humans is generally not tidy
  – Human readability is important, but should be a deliberate choice
• Some data aren’t really amenable to tidiness
  – Genomics; neuroimaging
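To make the first two problems concrete, here is a small sketch of what they can look like in practice. Both tables and all of their names (`bp_wide`, `enrolled`, `arm_sex`) are hypothetical.

```r
library(tibble)

# Column names are values: "visit_1" and "visit_2" are really values of
# a visit variable, not variables in their own right.
bp_wide <- tribble(
  ~subject, ~visit_1, ~visit_2,
  "s01",    120,      118,
  "s02",    135,      131
)

# A single column contains multiple variables: `arm_sex` mixes
# treatment arm and sex in one string.
enrolled <- tribble(
  ~arm_sex,      ~n,
  "treatment_F", 12,
  "treatment_M", 10,
  "placebo_F",   11
)

names(bp_wide)
```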

Slide 6

Slide 6 text

[Figure: side-by-side comparison (“vs”). Source: “Tidy Data”, H. Wickham, JSS]

Slide 7

Slide 7 text

[Figure: side-by-side comparison (“vs”). Source: “Tidy Data”, H. Wickham, JSS]

Slide 8

Slide 8 text

[Figure: side-by-side comparison (“vs”)]

Slide 9

Slide 9 text

Relational data
• Data spread across tables with defined relations
• Variables used to define these relations are keys
• Tables are combined by joins
Source: R for Data Science
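As a rough illustration of keys and joins, the sketch below relates two hypothetical tables through a shared `subject` column. The tables `subjects` and `visits` are invented for this example.

```r
library(tibble)
library(dplyr)

# Hypothetical tables; `subject` is the key that relates them.
subjects <- tribble(
  ~subject, ~arm,
  "s01",    "treatment",
  "s02",    "placebo"
)

visits <- tribble(
  ~subject, ~visit, ~sbp,
  "s01",    1,      120,
  "s01",    2,      118,
  "s02",    1,      135
)

# A key should identify each row of its own table uniquely.
key_is_unique <- !any(duplicated(subjects$subject))

# Combine the tables by matching rows on the key.
combined <- left_join(visits, subjects, by = "subject")
```

Checking that the key is unique before joining is a cheap safeguard: a duplicated key in `subjects` would silently multiply rows in the result.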

Slide 10

Slide 10 text

Join types
• Joining datasets x and y
  – Inner joins
  – Outer joins
Source: R for Data Science
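The difference between the join types is easiest to see on a toy example. The sketch below uses dplyr's join functions on two hypothetical tables `x` and `y` that share keys 1 and 2 only. The column names `key`, `val_x`, and `val_y` are illustrative.

```r
library(tibble)
library(dplyr)

# Hypothetical inputs: key 3 appears only in x, key 4 only in y.
x <- tribble(
  ~key, ~val_x,
  1,    "x1",
  2,    "x2",
  3,    "x3"
)

y <- tribble(
  ~key, ~val_y,
  1,    "y1",
  2,    "y2",
  4,    "y4"
)

# Inner join: keep only keys present in both tables.
inner <- inner_join(x, y, by = "key")

# Outer joins: keep unmatched rows and fill the gaps with NA.
left  <- left_join(x, y, by = "key")   # all rows of x
right <- right_join(x, y, by = "key")  # all rows of y
full  <- full_join(x, y, by = "key")   # all rows of x and y
```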

Slide 11

Slide 11 text

Key functions
• For tidying single tables
  – pivot_longer
  – separate
• For untidying single tables
  – pivot_wider
• For combining multiple tables
  – bind_rows
  – *_join
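One way these functions might fit together is sketched below. The input table `wide` and every column name in it are hypothetical; the `*_join` functions are omitted here and the rest are called with their documented tidyr and dplyr arguments.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical untidy input: `id` mixes arm and subject, and the
# visit number is stored in the column names.
wide <- tribble(
  ~id,             ~visit_1, ~visit_2,
  "treatment_s01", 120,      118,
  "placebo_s02",   135,      131
)

# pivot_longer: move the visit columns into name/value pairs.
long <- pivot_longer(
  wide,
  cols = visit_1:visit_2,
  names_to = "visit",
  names_prefix = "visit_",
  values_to = "sbp"
)

# separate: split one column into two variables.
tidy <- separate(long, id, into = c("arm", "subject"), sep = "_")

# pivot_wider: go back to a human-readable wide layout.
back_to_wide <- pivot_wider(
  tidy,
  names_from = visit,
  values_from = sbp,
  names_prefix = "visit_"
)

# bind_rows: stack another table that has the same columns.
more <- tribble(
  ~arm,      ~subject, ~visit, ~sbp,
  "placebo", "s03",    "1",    127
)
stacked <- bind_rows(tidy, more)
```

Note that `pivot_longer` leaves `visit` as a character column here, which is why `more` also stores it as `"1"`; converting it to a number would be a separate step.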