coRps R Building Blocks Series - 6 - Importing Data

Learning Objectives • Know what diﬀerent packages exist to read
in data and when to use them • Understand key parameters useful across diﬀerent packages • Develop an approach to reviewing and knowing how to import data

Tidyverse packages in the flow

Options on options on options read.table() readr::read_csv() data.table::fread() readr::read_tsv() vroom::vroom()
readxl::read_excel() haven::read_dta() googlesheets4::read_sheet() gophr::read_msd() googlesheets4::range_speedread() tameDP::tame_dp()

A preference for csv • Does not need a specific
platform or software to open • Faster, less complicated (no user formatting stored) • Typically consumes less memory *csv = comma separated value

Letʼs read in data read_csv("2022-03-28/simple.csv") BUT WAIT! WHAT IS THE
FIRST STEP BEFORE READING IN ANY DATA?

Inspection Before reading in any file, we want to inspect
it first to understand its structure and any possible complications. !

Letʼs read in data df_months <- read_csv("2022-03-28/simple.csv") object file path
from working directory function output saved into If you are working on a PC, remember to change file path backslashes (“\”) to forward slashes (“/”) or use an escape first (“\\”).

Letʼs read in data column delimiter column structures size (R
x C)+ read in as tibble not “data frame” data preview (10 rows)

We can also export using readr write_csv(df_months, "2022-03-28/simple_out.csv", na =
“”) function file name and path from working directory where to save data frame to output store missing values as blank cells

Parameter specifications (read_csv) - 20 read_csv( file, col_names = TRUE,
col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy() ) file to import specify column types specify source file number of lines to skip from top non-unique column names handling structure type guess range The other import functions all have a very similar set of parameters. If you understand read_csv, you can understand the structure of the rest. define missing values

Running into parsing trouble column parsing based on guess in
X number of rows; when there are missing values to start, it may parse your column incorrectly use problem to inspect the error

Column types col_types c = character i = integer n
= number d = double l = logical f = factor D = date T = date time t = time ? = guess _ or - = skip

Define column types read_csv("2022-03-28/simple.csv", col_types = cols(mech_code = "c"))

Itʼs often easier to read in all columns as character
read_csv("2022-03-28/simple.csv", col_types = cols(.default = "c"))

Where to use what to use when readr::read_csv() data.table::fread() readr::read_tsv()
vroom::vroom() readxl::read_excel() haven::read_dta() googlesheets4::read_sheet() gophr::read_msd() googlesheets4::range_speedread() tameDP::tame_dp() read in small/medium sized csv read in small/medium sized tsv read in an Excel file read small Google Sheet read large Google Sheet read in large csv very fast “doesnʼt fit quite so well into the tidyverse” read in large csv fast read in SAS/Stata/SPSS file read in MER Structured Dataset applies necessary structure read in PEPFAR Data Pack tidies format

What are some threats to tidy data here?

What are some threats to tidy data here? Wide data,
1st row is not machine readable, 2nd row “headers” have multiple pieces of information !

How would you read this in? (pseudo code)

What is going on? read_excel("2022-03-28/hfr.xlsx") help file parameters

With Excel, you want to specify the sheet to read
excel_sheets("2022-03-28/hfr.xlsx") read_excel("2022-03-28/hfr.xlsx", sheet = “HFR”, skip = 1)

Looping over multiple sheets purrr::map_dfr(.x = c(“HFR”, “ HFR Eridanus”),
.f = ~read_excel("2022-03-28/hfr.xlsx", sheet = .x, skip = 1)) OR excel_sheets("2022-03-28/hfr.xlsx") %>% stringr::str_subset("HFR") %>% purrr::map_dfr(.f = ~read_excel("2022-03-28/hfr.xlsx", sheet = .x, skip = 1)) identify sheets in workbook limit vector to those that match HFR iterate read function over multiple sheets and combine into a tibble iterate read function over multiple sheets and combine into a tibble

gs4_auth() read_sheet(as_sheets_id("1B_DcG1WqZv6xo_eBBye-Q31tMGq_DjCBcjlWs3xINOw")) Very similar step reading in Google Sheets need
to authenticate with your USAID email onces each session Specify the file id (from the url) using as_sheet_id()

Reading in a MSD vroom(filepath_msd) versus gophr::read_msd(filepath_msd) errors on read
in due to parsing guesses mech_code should be stored as a character; fiscal_year should be integer and targets/ results should be double utilizes vroom, but is setup to read in all the columns as we want them

One last thing Not discussed earlier, but you can define
your own .name_repair, when there are duplicates, special character, etc in variable/ column names. A powerful package that can do most of the leg work for you is janitor.

Take advantage of RStudio cheat sheets

Pseudo Coding Exercise How would you read this file in?
What steps would you need to take to clean it up and make it tidy/useable?

Homework Problem: Using what youʼve learned, import in all the
data from the HIV estimates to create a tidy dataset. • Data do not start on the first row (hint - readxl parameters) • Missing values are written as “…” (hint - readxl parameters) • Columns are not uniquely named and have special character (hint - janitor and dplyr::rename()) • There are multiple tabs (hint - purrr::map_dfr()) • Values cells often contain characters (eg “<”) and thousand separators are spaces (hint - stringr::str_remove()) • Upper and lower bounds should be separate columns (hint - tidyr::separate())

Resources & Sources • R for Data Science, Garrett Grolemund
and Hadley Wickham • https://www.tidyverse.org/ • janitor package • https://www.geeksforgeeks.org/diﬀerence-between-csv-and-excel/

coRps R Building Blocks Series - 6 - Importing ...

coRps R Building Blocks Series - 6 - Importing Data

Aaron

More Decks by Aaron

Other Decks in Programming

Featured

Transcript

Learning Objectives • Know what diﬀerent packages exist to read

Tidyverse packages in the flow

Options on options on options read.table() readr::read_csv() data.table::fread() readr::read_tsv() vroom::vroom()

A preference for csv • Does not need a specific

Letʼs read in data read_csv("2022-03-28/simple.csv") BUT WAIT! WHAT IS THE

Inspection Before reading in any file, we want to inspect

Letʼs read in data df_months <- read_csv("2022-03-28/simple.csv") object file path

Letʼs read in data column delimiter column structures size (R

We can also export using readr write_csv(df_months, "2022-03-28/simple_out.csv", na =

Parameter specifications (read_csv) - 20 read_csv( file, col_names = TRUE,

Running into parsing trouble column parsing based on guess in

Column types col_types c = character i = integer n

Define column types read_csv("2022-03-28/simple.csv", col_types = cols(mech_code = "c"))

Itʼs often easier to read in all columns as character

Where to use what to use when readr::read_csv() data.table::fread() readr::read_tsv()

What are some threats to tidy data here?

What are some threats to tidy data here? Wide data,

How would you read this in? (pseudo code)

What is going on? read_excel("2022-03-28/hfr.xlsx") help file parameters

With Excel, you want to specify the sheet to read

Looping over multiple sheets purrr::map_dfr(.x = c(“HFR”, “ HFR Eridanus”),

gs4_auth() read_sheet(as_sheets_id("1B_DcG1WqZv6xo_eBBye-Q31tMGq_DjCBcjlWs3xINOw")) Very similar step reading in Google Sheets need

Reading in a MSD vroom(filepath_msd) versus gophr::read_msd(filepath_msd) errors on read

One last thing Not discussed earlier, but you can define

Take advantage of RStudio cheat sheets

Pseudo Coding Exercise How would you read this file in?

Homework Problem: Using what youʼve learned, import in all the

Resources & Sources • R for Data Science, Garrett Grolemund