Upgrade to Pro — share decks privately, control downloads, hide ads and more …

coRps R Building Blocks Series - 6 - Importing Data

Aaron
March 28, 2022

coRps R Building Blocks Series - 6 - Importing Data

Aaron

March 28, 2022
Tweet

More Decks by Aaron

Other Decks in Programming

Transcript

  1. Learning Objectives • Know what different packages exist to read

    in data and when to use them • Understand key parameters useful across different packages • Develop an approach to reviewing and knowing how to import data
  2. Options on options on options read.table() readr::read_csv() data.table::fread() readr::read_tsv() vroom::vroom()

    readxl::read_excel() haven::read_dta() googlesheets4::read_sheet() gophr::read_msd() googlesheets4::range_speedread() tameDP::tame_dp()
  3. A preference for csv • Does not need a specific

    platform or software to open • Faster, less complicated (no user formatting stored) • Typically consumes less memory *csv = comma separated value
  4. Inspection Before reading in any file, we want to inspect

    it first to understand its structure and any possible complications. !
  5. Letʼs read in data df_months <- read_csv("2022-03-28/simple.csv") object file path

    from working directory function output saved into If you are working on a PC, remember to change file path backslashes (“\”) to forward slashes (“/”) or use an escape first (“\\”).
  6. Letʼs read in data column delimiter column structures size (R

    x C)+ read in as tibble not “data frame” data preview (10 rows)
  7. We can also export using readr write_csv(df_months, "2022-03-28/simple_out.csv", na =

    “”) function file name and path from working directory where to save data frame to output store missing values as blank cells
  8. Parameter specifications (read_csv) - 20 read_csv( file, col_names = TRUE,

    col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy() ) file to import specify column types specify source file number of lines to skip from top non-unique column names handling structure type guess range The other import functions all have a very similar set of parameters. If you understand read_csv, you can understand the structure of the rest. define missing values
  9. Running into parsing trouble column parsing based on guess in

    X number of rows; when there are missing values to start, it may parse your column incorrectly use problem to inspect the error
  10. Column types col_types c = character i = integer n

    = number d = double l = logical f = factor D = date T = date time t = time ? = guess _ or - = skip
  11. Itʼs often easier to read in all columns as character

    read_csv("2022-03-28/simple.csv", col_types = cols(.default = "c"))
  12. Where to use what to use when readr::read_csv() data.table::fread() readr::read_tsv()

    vroom::vroom() readxl::read_excel() haven::read_dta() googlesheets4::read_sheet() gophr::read_msd() googlesheets4::range_speedread() tameDP::tame_dp() read in small/medium sized csv read in small/medium sized tsv read in an Excel file read small Google Sheet read large Google Sheet read in large csv very fast “doesnʼt fit quite so well into the tidyverse” read in large csv fast read in SAS/Stata/SPSS file read in MER Structured Dataset applies necessary structure read in PEPFAR Data Pack tidies format
  13. What are some threats to tidy data here? Wide data,

    1st row is not machine readable, 2nd row “headers” have multiple pieces of information !
  14. With Excel, you want to specify the sheet to read

    excel_sheets("2022-03-28/hfr.xlsx") read_excel("2022-03-28/hfr.xlsx", sheet = “HFR”, skip = 1)
  15. Looping over multiple sheets purrr::map_dfr(.x = c(“HFR”, “ HFR Eridanus”),

    .f = ~read_excel("2022-03-28/hfr.xlsx", sheet = .x, skip = 1)) OR excel_sheets("2022-03-28/hfr.xlsx") %>% stringr::str_subset("HFR") %>% purrr::map_dfr(.f = ~read_excel("2022-03-28/hfr.xlsx", sheet = .x, skip = 1)) identify sheets in workbook limit vector to those that match HFR iterate read function over multiple sheets and combine into a tibble iterate read function over multiple sheets and combine into a tibble
  16. gs4_auth() read_sheet(as_sheets_id("1B_DcG1WqZv6xo_eBBye-Q31tMGq_DjCBcjlWs3xINOw")) Very similar step reading in Google Sheets need

    to authenticate with your USAID email onces each session Specify the file id (from the url) using as_sheet_id()
  17. Reading in a MSD vroom(filepath_msd) versus gophr::read_msd(filepath_msd) errors on read

    in due to parsing guesses mech_code should be stored as a character; fiscal_year should be integer and targets/ results should be double utilizes vroom, but is setup to read in all the columns as we want them
  18. One last thing Not discussed earlier, but you can define

    your own .name_repair, when there are duplicates, special character, etc in variable/ column names. A powerful package that can do most of the leg work for you is janitor.
  19. Pseudo Coding Exercise How would you read this file in?

    What steps would you need to take to clean it up and make it tidy/useable?
  20. Homework Problem: Using what youʼve learned, import in all the

    data from the HIV estimates to create a tidy dataset. • Data do not start on the first row (hint - readxl parameters) • Missing values are written as “…” (hint - readxl parameters) • Columns are not uniquely named and have special character (hint - janitor and dplyr::rename()) • There are multiple tabs (hint - purrr::map_dfr()) • Values cells often contain characters (eg “<”) and thousand separators are spaces (hint - stringr::str_remove()) • Upper and lower bounds should be separate columns (hint - tidyr::separate())
  21. Resources & Sources • R for Data Science, Garrett Grolemund

    and Hadley Wickham • https://www.tidyverse.org/ • janitor package • https://www.geeksforgeeks.org/difference-between-csv-and-excel/