values [In this table: 6 numbers, 12 strings]
Values are arranged by variables [In this table: 3 variables]
Values are arranged by observations [In this table: 6 observations]
Wickham H. J Stat Softw 59: 1-23, 2014
Scenario
• 200 study participants (only first 5 shown)
• Administered drug A and B two weeks apart (administered in random order)
• Measured: respiratory rate in breaths per minute, 30 minutes after taking each drug

Long (narrow)
Participant  Weight_kg  Sex  Treatment  breaths_per_min
ID001        75         M    A          26
ID001        75         M    B          25
ID002        64         F    A          16
ID002        64         F    B          21
ID003        82         M    A          14
ID003        82         M    B          23
ID004        62         F    A          27
ID004        62         F    B          16
ID005        72         M    A          14
ID005        72         M    B          18

Wide
Participant  Weight_kg  Sex  Treatment_A_breaths_per_min  Treatment_B_breaths_per_min
ID001        75         M    26                           25
ID002        64         F    16                           21
ID003        82         M    14                           23
ID004        62         F    27                           16
ID005        72         M    14                           18
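The relationship between the wide and long layouts can be sketched in a few lines of Python. This is a minimal illustration using only the standard library and the five participants shown on the slide; the function name `wide_to_long` is an assumption made for this example and is not part of the original material.

```python
# Wide-format rows copied from the slide (first 5 participants only).
wide_rows = [
    {"Participant": "ID001", "Weight_kg": 75, "Sex": "M",
     "Treatment_A_breaths_per_min": 26, "Treatment_B_breaths_per_min": 25},
    {"Participant": "ID002", "Weight_kg": 64, "Sex": "F",
     "Treatment_A_breaths_per_min": 16, "Treatment_B_breaths_per_min": 21},
    {"Participant": "ID003", "Weight_kg": 82, "Sex": "M",
     "Treatment_A_breaths_per_min": 14, "Treatment_B_breaths_per_min": 23},
    {"Participant": "ID004", "Weight_kg": 62, "Sex": "F",
     "Treatment_A_breaths_per_min": 27, "Treatment_B_breaths_per_min": 16},
    {"Participant": "ID005", "Weight_kg": 72, "Sex": "M",
     "Treatment_A_breaths_per_min": 14, "Treatment_B_breaths_per_min": 18},
]


def wide_to_long(rows):
    """Return one row per participant per treatment (long/narrow layout)."""
    long_rows = []
    for row in rows:
        for treatment in ("A", "B"):
            long_rows.append({
                "Participant": row["Participant"],
                "Weight_kg": row["Weight_kg"],
                "Sex": row["Sex"],
                "Treatment": treatment,
                # Pick the measurement from the matching wide column.
                "breaths_per_min": row[f"Treatment_{treatment}_breaths_per_min"],
            })
    return long_rows


long_rows = wide_to_long(wide_rows)
for long_row in long_rows:
    print(long_row)
```

Each wide row becomes two long rows, so the five participants shown produce ten observations. Data-frame libraries offer the same reshaping as a built-in operation; the loop is written out here only to make the mapping between columns explicit.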
Choose good variable names
• Use snake_case, CamelCase, or hyphenated-words
  o I prefer snake_case because it is the easiest to read
• Use short but meaningful names
• Include the measurement units where possible
  o If not, make sure the units are specified in the code book
• Do not use special characters (e.g., $ @ % # & * ( ) ! / ^)
Stat 72: 2-10, 2018
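A quick automated check can catch column names that break the rules above. The sketch below assumes a name is acceptable if it starts with a letter and contains only letters, digits, underscores, or hyphens; that rule, the pattern `SAFE_NAME`, and the function `find_problem_names` are illustrative assumptions, and the three flagged column names are made up for the example.

```python
import re

# Assumed rule: start with a letter, then letters, digits, "_" or "-" only.
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def find_problem_names(names):
    """Return the names that contain spaces or special characters."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


# The last three names are hypothetical examples of what to avoid.
columns = ["Participant", "Weight_kg", "Sex",
           "breaths/min", "weight (kg)", "glucose_%"]

problems = find_problem_names(columns)
print(problems)  # ['breaths/min', 'weight (kg)', 'glucose_%']
```

The check does not judge whether a name is short or meaningful, and it cannot tell whether the units are present; those still need a human reviewer.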
Be consistent
• Use consistent codes for categorical variables (e.g., pick one of “male”, “Male”, or “1”. Do not chop and change)
• Use consistent variable names across sheets/files (e.g., pick one of “Glucose_10wk” or “gluc_10weeks”. Do not chop and change)
• Use consistent observation identifiers across sheets/files (e.g., pick one of “153”, “ID153”, or “mouse-153”. Do not chop and change)
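Some inconsistencies in categorical codes can be found mechanically. The following sketch groups values that differ only in letter case or surrounding whitespace; the function name `find_inconsistent_codes` and the sample column are assumptions made for illustration.

```python
from collections import defaultdict


def find_inconsistent_codes(values):
    """Group values that differ only by case or surrounding whitespace.

    Returns a dict mapping each normalised code to its sorted variants,
    keeping only codes that were written in more than one way.
    """
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().lower()].add(value)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# Hypothetical column in which "male" has been entered three different ways.
sex_column = ["male", "Male", "female", "male ", "female"]

inconsistent = find_inconsistent_codes(sex_column)
print(inconsistent)  # {'male': ['Male', 'male', 'male ']}
```

This only catches spelling variants of the same code. Mixing genuinely different schemes (for example “male” in one sheet and “1” in another) cannot be detected this way and has to be checked against the code book.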
Always use ISO 8601 date format (YYYY-MM-DD)
• Never use the built-in spreadsheet date format
• Always convert spreadsheet date columns/cells to text format
• Alternatively, have separate columns for year, month, and day
Stat 72: 2-10, 2018; xkcd comics: https://xkcd.com/1179.
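Once dates are stored as text, they can be checked or converted with a short script. The sketch below uses Python's standard `datetime` module; the helper names `to_iso8601` and `is_iso8601_date` are assumptions for this example, and the input format must be known in advance because a string such as 02/03/2013 is ambiguous on its own.

```python
from datetime import datetime


def to_iso8601(text, input_format):
    """Parse date text in a known format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, input_format).date().isoformat()


def is_iso8601_date(text):
    """Return True only for a valid, zero-padded YYYY-MM-DD date."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts unpadded input such as "2013-2-27",
    # so require that the text survives a round trip unchanged.
    return parsed.date().isoformat() == text


print(to_iso8601("27/02/2013", "%d/%m/%Y"))  # 2013-02-27
print(is_iso8601_date("2013-02-27"))         # True
print(is_iso8601_date("02/27/2013"))         # False
```

Running `is_iso8601_date` over a whole date column before analysis is a cheap way to confirm that no cell has been silently reformatted by the spreadsheet program.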
Make a code book (data dictionary)
• The code book must be stored in a separate “sheet” from the raw data, or in a separate file altogether
• The code book may contain the following information
  o The full name of each variable (e.g., “Body temperature on day 1”)
  o The column names used in the spreadsheet (e.g., “Temperature_C_day_1”)
  o A longer explanation of what the variable means (e.g., “Body temperature was measured on a daily basis at 13:00, using a rectal thermometer”)
  o Units of measure (e.g., “degrees Celsius”)
  o Expected values for continuous data, if relevant (e.g., “Minimum = 33 °C, Maximum = 40 °C”)
  o Encoding of categorical (nominal / ordinal) data, if relevant (e.g., M = male, F = female; 0 = no disease, 1 = stage I, 2 = stage II, 3 = stage III)
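A code book can also be kept in machine-readable form, so that the raw data can be checked against it. The sketch below is one possible layout, not a standard: the dictionary keys (`full_name`, `units`, `minimum`, `maximum`, `codes`), the column name `Disease_stage`, and the function `check_row` are all assumptions made for illustration, while the example values come from the slides.

```python
# Hypothetical machine-readable code book built from the slide examples.
CODE_BOOK = {
    "Temperature_C_day_1": {
        "full_name": "Body temperature on day 1",
        "units": "degrees Celsius",
        "minimum": 33,
        "maximum": 40,
    },
    "Sex": {
        "full_name": "Sex",
        "codes": {"M": "male", "F": "female"},
    },
    "Disease_stage": {  # assumed column name for the disease encoding
        "full_name": "Disease stage",
        "codes": {0: "no disease", 1: "stage I", 2: "stage II", 3: "stage III"},
    },
}


def check_row(row, code_book):
    """Return a list of messages for values that contradict the code book."""
    problems = []
    for column, value in row.items():
        entry = code_book.get(column)
        if entry is None:
            problems.append(f"{column}: not described in the code book")
            continue
        if "codes" in entry and value not in entry["codes"]:
            problems.append(f"{column}: unexpected code {value!r}")
        if "minimum" in entry and value < entry["minimum"]:
            problems.append(f"{column}: {value} is below the expected minimum")
        if "maximum" in entry and value > entry["maximum"]:
            problems.append(f"{column}: {value} is above the expected maximum")
    return problems


good_row = {"Temperature_C_day_1": 37.2, "Sex": "M", "Disease_stage": 2}
bad_row = {"Temperature_C_day_1": 42, "Sex": "male", "Weight": 75}

print(check_row(good_row, CODE_BOOK))
for message in check_row(bad_row, CODE_BOOK):
    print(message)
```

A value outside the expected range is not necessarily wrong, so the function only reports problems and leaves the decision about each flagged value to the analyst.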