Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Organising data in spreadsheets

Organising data in spreadsheets

Basic concepts to take into consideration when storing data in spreadsheets.

The slides were developed for an undergraduate course in statistics.

Peter Kamerman

February 10, 2021
Tweet

More Decks by Peter Kamerman

Other Decks in Research

Transcript

  1. Organising data: constituents of a tidy table Tables consist of

    values [In this table: 6 numbers, 12 strings] Values are arranged by variables [In this table: 3 variables] Values are arranged by observations [In this table: 6 observations] Wickham H. J Stat Softw 59: 1-23, 2014
  2. Organising data: long vs wide table format Participant Weight_kg Sex

    Treatment breaths_per_min ID001 75 M A 26 ID001 75 M B 25 ID002 64 F A 16 ID002 64 F B 21 ID003 82 M A 14 ID003 82 M B 23 ID004 62 F A 27 ID004 62 F B 16 ID005 72 M A 14 ID005 72 M B 18 Participant Weight_kg Sex Treatment_A_breaths_per_min Treatment_B_breaths_per_min ID001 75 M 26 25 ID002 64 F 16 21 ID003 82 M 14 23 ID004 62 F 27 16 ID005 72 M 14 18 Long (narrow) Wide Scenario • 200 study participants (only first 5 shown) • Administered drug A and B two weeks apart (administered in random order) • Measured: respiratory rate in breaths per minute, 30 minutes after taking each drug.
  3. Organising data But…data collected from laboratory experiments typically are collected

    and captured in a fairly structured format, making the job of making and keeping it tidy not too bad.
  4. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Choose good variable names • Use snake_case, CamelCase, or hyphenated-words I prefer snake_case because it is the easiest to read • Use short, but meaningful names • Include the measurement units were possible If not, make sure the units are specified in the code book • Do not use special characters (e.g., $ @ % # & * ( ) ! / ^)
  5. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Be consistent • Use consistent codes for categorical variables (e.g., ”male” or “Male” or “1”. Do not chop and change) • Use consistent variable names across sheets/files (e.g., “Glucose_10wk” or “gluc_10weeks”. Do not chop and change) • Use consistent observation identifiers across sheets/files (e.g., “153” or “ID153” or “mouse-153”. Do not chop and change)
  6. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018; xkcd comics: https://xkcd.com/1179. Always use ISO8601 date format (YYYY-MM-DD) • Never use the build-in spreadsheet date format. • Always convert spreadsheet date columns/cells to text format • Alternatively, have separate columns for year, month, and day
  7. Organising data: tidy spreadsheets Basic table rules • Always start

    at cell: “A1” • Never “freeze panes” • Never “Hide” columns/rows
  8. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 No blank cells Incorrect Correct
  9. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 No multi-line variable names or merged cells Incorrect Correct (wide format) Correct (long format)
  10. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Only one variable per column Incorrect Correct
  11. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Do not use colour for variable coding Incorrect Correct
  12. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Keep the raw data pristine (no calculations) Incorrect (calculations & 2x tables) Correct
  13. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Keep the raw data pristine (no calculations) • Rather duplicate the raw data and make calculations on the duplicate
  14. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 Make a code book (data dictionary) • The code book must be stored in a separate “sheet” to the raw data or in a separate file altogether • The code book may contain the following information o The full name of each variable (e.g., “Body temperature on day 1”) o The column names used in the spreadsheet (e.g., “Temperature_C_day_1”) o A longer explanation of what the variable means (e.g., “Body temperature was measured on a daily basis at 13:00, using a rectal thermometer”) continued on next slide…
  15. Organising data: tidy spreadsheets Broman KW & Woo KH. Am

    Stat 72: 2-10, 2018 …continued from previous slide • The code book may contain the following information o Units of measure (e.g., “degrees Celsius”) o Expected values (continuous data) (if relevant) (e.g., “Minimum = 33oC, Maximum = 40oC”) o Encoding of categorical data (nominal / ordinal data) (if relevant) (e.g., M = male, F = female 0 = no disease, 1 = stage I, 2 = stage II, 3 = stage III)