P8105: Tidy Data

1 TIDY DATA
Jeff Goldsmith, PhD
Department of Biostatistics

2 Tidy data
• “Middle” step in the wrangling process
(R for Data Science)

3 Rules for tidy data
• Data tables have an implied structure which the “tidy data” framework makes explicit
– Columns are variables
– Rows are observations
– Every value has a cell
(R for Data Science)
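The three rules above can be illustrated with a small table. This is a minimal sketch using made-up blood pressure readings; the dataset and the names `bp_tidy`, `subject_id`, `visit`, and `systolic` are hypothetical and not from the slides.

```r
library(tibble)

# Hypothetical data: one row per subject per visit (the observation),
# one column per variable, and one value in each cell.
bp_tidy <- tribble(
  ~subject_id, ~visit,     ~systolic,
  1,           "baseline", 128,
  1,           "followup", 121,
  2,           "baseline", 140,
  2,           "followup", 133
)

bp_tidy
```

Because `visit` is stored as a column rather than spread across column names, grouping, filtering, and plotting by visit need no reshaping first.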

4 Why tidy your data?
• Consistent data structures will simplify your thought process
– Especially true if you use tools designed for tidy data
– Sounds like something the “tidyverse” would help with…
• Data written for computers is easier to work with

5 Not all data are tidy
• Column names are values, not variable names
• Single columns contain multiple variables
• Data are stored in multiple tables
• Non-tidiness is sometimes (if only rarely) intentional
• Data written for humans is generally not tidy
– Human readability is important, but should be a deliberate choice
• Some data aren’t really amenable to tidiness
– Genomics; neuroimaging
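The first two problems in this list (values stored in column names, and several variables packed into one column) can be sketched on a toy table. The data and the names `bp_wide`, `bp_long`, and `bp_tidy` are hypothetical; the code assumes the tidyr functions `pivot_longer()` and `separate()`.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the column names bp_baseline / bp_followup
# are really values of a "visit" variable, and each cell holds two
# measurements ("systolic/diastolic").
bp_wide <- tribble(
  ~subject_id, ~bp_baseline, ~bp_followup,
  1,           "128/82",     "121/79",
  2,           "140/90",     "133/85"
)

# Move the visit labels out of the column names and into a column.
bp_long <- pivot_longer(
  bp_wide,
  cols = bp_baseline:bp_followup,
  names_to = "visit",
  names_prefix = "bp_",
  values_to = "bp"
)

# Split the combined cell into one column per variable;
# convert = TRUE turns the resulting strings into integers.
bp_tidy <- separate(
  bp_long, bp,
  into = c("systolic", "diastolic"),
  sep = "/",
  convert = TRUE
)

bp_tidy
```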

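6 [Figure: table comparison (“vs”), from “Tidy Data”, H. Wickham, JSS]

7 [Figure: table comparison (“vs”), from “Tidy Data”, H. Wickham, JSS]

8 [Figure: table comparison (“vs”), from https://github.com/jennybc/lotr-tidy/blob/master/01-intro.md]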

9 Relational data
• Data spread across tables with defined relations
• Variables used to define these relations are keys
• Tables are combined by joins
(R for Data Science)
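As a rough illustration of keys and joins, the sketch below uses two hypothetical tables, `subjects` and `visits`, related through a `subject_id` column. The tables and names are invented for this example; the calls are standard dplyr functions.

```r
library(tibble)
library(dplyr)

# Hypothetical subject-level table: one row per subject.
subjects <- tribble(
  ~subject_id, ~sex,
  1,           "F",
  2,           "M",
  3,           "F"
)

# Hypothetical visit-level table: possibly many rows per subject.
visits <- tribble(
  ~subject_id, ~visit,     ~systolic,
  1,           "baseline", 128,
  1,           "followup", 121,
  2,           "baseline", 140
)

# subject_id works as a key for `subjects` only if it identifies each
# row uniquely; an empty result here means there are no duplicates.
duplicated_keys <- filter(count(subjects, subject_id), n > 1)

# Combine the tables by joining on the key: every visit row gains the
# subject-level variables.
visits_full <- left_join(visits, subjects, by = "subject_id")

visits_full
```

Checking key uniqueness before joining is a design choice rather than a requirement, but it catches accidental row duplication early.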

10 Join types
• Joining datasets x and y
– Inner joins
– Outer joins
(R for Data Science)
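The difference between inner and outer joins shows up in which keys survive. A minimal sketch, assuming two toy tables `x` and `y` that share a column named `key` (the values are invented), using dplyr's join functions:

```r
library(tibble)
library(dplyr)

# Toy tables: keys 1 and 2 appear in both, 3 only in x, 4 only in y.
x <- tribble(
  ~key, ~val_x,
  1,    "x1",
  2,    "x2",
  3,    "x3"
)

y <- tribble(
  ~key, ~val_y,
  1,    "y1",
  2,    "y2",
  4,    "y4"
)

# Inner join: keep only keys present in both tables.
inner <- inner_join(x, y, by = "key")

# Outer joins: keep unmatched rows and fill the gaps with NA.
left  <- left_join(x, y, by = "key")   # all rows of x
right <- right_join(x, y, by = "key")  # all rows of y
full  <- full_join(x, y, by = "key")   # all rows of x and y

sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```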

11 Key functions
• For tidying single tables
– pivot_longer
– separate
• For untidying single tables
– pivot_wider
• For combining multiple tables
– bind_rows
– *_join
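Two of the listed functions, `bind_rows` and `pivot_wider`, might be used as follows. This is a sketch on hypothetical per-site tables (`site_a`, `site_b`); the column names are assumptions made for the example.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical tables with identical columns, one per study site.
site_a <- tribble(
  ~subject_id, ~visit,     ~systolic,
  1,           "baseline", 128,
  1,           "followup", 121
)

site_b <- tribble(
  ~subject_id, ~visit,     ~systolic,
  2,           "baseline", 140,
  2,           "followup", 133
)

# Stack the tables; .id records which list element each row came from.
all_sites <- bind_rows(list(site_a = site_a, site_b = site_b), .id = "site")

# "Untidy" on purpose: one column per visit, which is often easier
# for a human to read in a report.
wide <- pivot_wider(all_sites, names_from = visit, values_from = systolic)

wide
```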
