Slide 1

Slide 1 text

www.meetup.com/Montreal-R-User-Group/

Slide 2

Slide 2 text

Basics and loading data › Etienne Low-Decarie (material in part prepared by Zofia Taranu)

Slide 3

Slide 3 text

Learning Objectives
›  Create an R project
›  Look at data in R (reminder)
›  Create data that is appropriate for use with R
›  Import data
›  Deal with broken data files
  •  Identify the problem
  •  Fix it in a spreadsheet
  •  Fix it in R
›  Deal with bad data formats
  •  Reshape from wide to long
›  Save and export data

Slide 4

Slide 4 text

Create an R project ›  Create a basic folder structure for each of your projects ›  Helps you stay organized ›  R will look in here and save to here

Slide 5

Slide 5 text

Create an R project ›  You can download the whole structure with R scripts, data, etc from: ›  https://github.com/zerotorhero/MBSU

Slide 6

Slide 6 text

Create an R project in RStudio ›  In RStudio, create a project in that folder ›  This tells R which project and folder you are working in

Slide 7

Slide 7 text

Create an R project ›  In the existing directory you have created or downloaded

Slide 8

Slide 8 text

Create an R project ›  Quit RStudio and re-open it by double-clicking on the project file ›  RStudio will reopen all the scripts that were open when you quit ›  You can browse the files in your project folder directly

Slide 9

Slide 9 text

The Scripts › What is this? •  A text file that contains all the commands you will use! › Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!

Slide 10

Slide 10 text

Create an R script

Slide 11

Slide 11 text

The Script (s)

Slide 12

Slide 12 text

Also: File -> Open File -> 02_Basics and importing data.R › Recommendation: create your own new script ›  Refer to the provided code only if needed ›  Avoid copy-pasting or running the code directly from the script

Slide 13

Slide 13 text

The Scripts › Text in the R script is the input, for example: # this is my command › To run a command, highlight it and run it (shortcut: CTRL ↵ or ⌘ ↵)

Slide 14

Slide 14 text

Housekeeping ›  Steps you should take before running code ›  Written as a section at the top of scripts

Slide 15

Slide 15 text

Housekeeping › Removes all variables from memory › Prevents errors such as use of older data › Type this in your R script:
# Clear R memory
rm(list = ls())

Slide 16

Slide 16 text

Commenting ›  The # symbol tells R to ignore the rest of the line ›  Use it for commenting/documenting ›  Annotating someone’s script is a good way to learn ›  Remember what you did ›  Tell collaborators what you did ›  A good step towards reproducible science

Slide 17

Slide 17 text

Directory › A directory is a folder; a path is the set of instructions to get to that folder › A “/” separates folders and files in a path › “.” indicates the current working directory ›  i.e. where you created your R project › To find out what the current working directory is, type getwd() in the console › RStudio sets the working directory to the folder containing your R project
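The path ideas above can be tried directly in the console. This is a minimal sketch: the "Data" folder and the file name are taken from this workshop's layout and are only assumptions about what exists on your machine.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# "." is shorthand for that same folder
same_folder <- normalizePath(".")
print(same_folder)

# Build a path relative to the working directory.
# "Data" and "iris_good.csv" follow the workshop layout (assumed, may not exist).
data_path <- file.path(".", "Data", "iris_good.csv")
print(data_path)

# TRUE only if the file is really there
file.exists(data_path)
```

If file.exists() returns FALSE, the usual cause is that RStudio was not opened from the project file, so the working directory is somewhere else.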

Slide 18

Slide 18 text

Look at data › Working with a data frame
data(CO2)        # load a built-in data file
head(CO2)        # peek at first few rows
str(CO2)         # structure of the object
names(CO2)       # names of items in the object
attributes(CO2)  # attributes of the object
summary(CO2)     # summary statistics
plot(CO2)        # plot of all variable combinations

Slide 19

Slide 19 text

Look at data ›  data() ›  head(), str(), names(), attributes(), summary(), plot() ›  Discuss the difference between how you store data and how data is stored in R

Slide 20

Slide 20 text

Prep data for R ›  Comma-separated values files (.csv) in the Data folder ›  Can be created from almost all apps (Excel, LibreOffice, Google Docs) ›  File -> Save as… -> .csv

Slide 21

Slide 21 text

Prep data for R ›  short informative column headings ›  starting with a letter ›  no spaces
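To see why these heading rules matter, it can help to look at what R does with headings that break them. The sketch below uses make.names(), the base R function that converts strings into syntactically valid names; the example headings are made up for illustration.

```r
# Hypothetical column headings as they might appear in a spreadsheet
messy <- c("Sepal Length", "2nd trial", "width (cm)")

# make.names() shows what R turns them into:
# invalid characters become "." and names that start with a digit get an "X" prefix
clean <- make.names(messy)
print(clean)
```

Choosing short headings that start with a letter and contain no spaces avoids this renaming, so the names in R match the names in the file.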

Slide 22

Slide 22 text

Prep data for R ›  Column values match their intended use ›  No text in numeric columns ›  including spaces ›  NA (not available) is allowed ›  Avoid numeric values for data that does not have numeric meaning ›  Subject, Replicate, Treatment ›  1,2,3 -> A,B,C or S1,S2,S3 or …

Slide 23

Slide 23 text

Prep data for R ›  no gimmicks ›  no notes, additional headings, merged cells

Slide 24

Slide 24 text

Prep data for R ›  Prefer long format
›  Wide
  •  Each level of a factor gets a column
  •  Multiple measurements per row
  •  Excel, SPSS…
  •  Pros: plays nice with humans; no data repetition; “eyeballable”
  •  Cons: does not play nice with R
›  Long
  •  Levels are expressed in a column
  •  One measured value per row
  •  e.g. really long: XML, JSON (tag:content pairs)
  •  Pros: plays nice with computers (API, databases, plyr, ggplot2…)
  •  Cons: does not play nice with humans; lots of copy-pasting, and forget eyeballing it!
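A small sketch of going from wide to long, using the base R reshape() function so that no extra package is needed. The data frame, its column names (plant, day1, day2) and the new names (day, height) are all invented for illustration.

```r
# Wide: each measurement day gets its own column (hypothetical data)
wide <- data.frame(
  plant = c("P1", "P2"),
  day1 = c(1.2, 1.5),
  day2 = c(2.3, 2.9)
)

# Long: one measured value per row, with the day expressed in a column
long <- reshape(
  wide,
  direction = "long",
  varying = c("day1", "day2"),  # the columns to stack
  v.names = "height",           # name of the stacked value column
  timevar = "day",              # name of the column holding the level
  times = c(1, 2),              # values to put in that column
  idvar = "plant"               # identifies each subject
)
print(long)
```

The two plants measured on two days become four rows, one per plant and day combination.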

Slide 25

Slide 25 text

Prep data for R ›  Try to prep your data for R or find data you find interesting online and prep it for R ›  Note: it is possible to do all your data prep work within R ›  can be very tedious ›  keeps original data intact ›  can even switch between long and wide

Slide 26

Slide 26 text

Importing data › We will now import the data file:
iris_data <- read.csv("./Data/iris_good.csv")
•  iris_data: object (name)
•  read.csv: command (what am I doing)
•  "./Data/iris_good.csv": argument (what am I applying this to)
› Recall: to find out what arguments a command requires, use help “?”:
?read.csv
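If the workshop file is not at hand, the same import step can be tried on a throwaway file. This sketch writes a tiny CSV to a temporary location first; the file contents and the name example_data are made up for illustration.

```r
# Write a small, well-formed CSV to a temporary file (hypothetical contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Sepal.Length,Sepal.Width,Species",
  "5.1,3.5,setosa",
  "4.9,3.0,setosa",
  "7.0,3.2,versicolor"
), csv_path)

# Import it: the first line is used as the column headings
example_data <- read.csv(csv_path)

# Check what R made of it
head(example_data)
str(example_data)
```

Because the numeric columns contain only numbers, str() reports them as numeric without any further work.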

Slide 27

Slide 27 text

Importing data Notice that R-Studio now provides information on the iris_data

Slide 28

Slide 28 text

Importing data ›  Try importing the data you prepared for R ›  my_good_data<-read.csv(…) ›  Try importing data that is not ready for R ›  not_ready<-read.csv(…) ›  Look at both

Slide 29

Slide 29 text

Manipulate the data ›  Do calculations ›  Eg:

Slide 30

Slide 30 text

Manipulate the data ›  Write your own function
# General template: my.function and do.something are placeholders
my.function <- function(arguments) {
  results <- do.something(arguments)
  return(data.frame(results))
}

# Area of an ellipse from its width and length
area.ellipse <- function(width, length) {
  area <- pi * (width / 2) * (length / 2)
  return(data.frame(area = area))
}

# Apply it to the sepal measurements
with(iris_data, area.ellipse(Sepal.Width, Sepal.Length))
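As a variation on the ellipse-area function from this slide, the sketch below returns a plain numeric vector instead of a data frame, which makes it easy to store the result as a new column. It uses the built-in iris data rather than the imported file, and the names iris_copy and Sepal.Area are only illustrative.

```r
# Same formula as on the slide, returning a numeric vector
area.ellipse <- function(width, length) {
  pi * (width / 2) * (length / 2)
}

# Work on a copy of the built-in iris data frame
iris_copy <- iris

# One area per row, stored as a new column
iris_copy$Sepal.Area <- area.ellipse(iris_copy$Sepal.Width, iris_copy$Sepal.Length)

head(iris_copy[, c("Sepal.Width", "Sepal.Length", "Sepal.Area")])
```

Because R arithmetic is vectorised, the function works on whole columns without a loop.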

Slide 31

Slide 31 text

Manipulate the data ›  Do calculations for each group ›  E.g.:
iris_data$mean.sepal.length[iris_data$Species == "setosa"] <- mean(iris_data$Sepal.Length[iris_data$Species == "setosa"])
›  Replace specific values (if Species is stored as a factor, convert it with as.character() first):
iris_data$Species[iris_data$Species == "setosa"] <- "Setosa"
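Doing the calculation for every group at once, rather than one species at a time, can be sketched with the base R functions tapply() and ave(). This uses the built-in iris data; iris_copy and group_means are illustrative names, and renaming a factor level through levels() is one possible approach, not necessarily the one used in the workshop script.

```r
iris_copy <- iris

# One mean per species, returned as a named vector
group_means <- tapply(iris_copy$Sepal.Length, iris_copy$Species, mean)
print(group_means)

# The same group mean repeated on every row of its group (ave() uses mean by default)
iris_copy$mean.sepal.length <- ave(iris_copy$Sepal.Length, iris_copy$Species)

# Replace a specific value in a factor by renaming its level
levels(iris_copy$Species)[levels(iris_copy$Species) == "setosa"] <- "Setosa"
table(iris_copy$Species)
```

Renaming the level avoids the missing values that direct assignment of a new string into a factor column would produce.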

Slide 32

Slide 32 text

Manipulate the data ›  Manipulate your data ›  my_data_with_calcs<-my_data ›  means for a group, density, volume…

Slide 33

Slide 33 text

Save data
# Save: write iris_data to an R data file
save(iris_data, file = "./Output/iris_cleaned.R")
# Clear: empty your memory
rm(list = ls())
# Reload iris_data
load("./Output/iris_cleaned.R")
head(iris_data)  # looking good!
… that’s it!
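The save, clear, reload cycle can be tried without touching the project folders by using a temporary file. In this sketch iris_copy and rdata_path are illustrative names, and the .RData extension is simply the common convention for files written by save().

```r
# Something to save: a copy of the built-in iris data
iris_copy <- iris

# Save it to a temporary file
rdata_path <- tempfile(fileext = ".RData")
save(iris_copy, file = rdata_path)

# Remove the object, as if starting a fresh session
rm(iris_copy)

# load() restores the object under its original name
# and returns the names of everything it restored
loaded_names <- load(rdata_path)
print(loaded_names)
head(iris_copy)
```

Note that load() recreates objects under the names they were saved with, so it can silently overwrite an existing object of the same name.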

Slide 34

Slide 34 text

Exporting data
write.table(normalized_iris, file = "./Output/normalized_iris.csv", sep = ",")
… that’s it!
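One thing to watch when exporting: by default write.table() also writes the row names, and the header line then has one entry fewer than the data lines. A possible alternative, sketched here on the built-in iris data with a temporary file, is write.csv() with row.names = FALSE; out_path and round_trip are illustrative names.

```r
# Export without row names so the header lines up with the data
out_path <- tempfile(fileext = ".csv")
write.csv(iris, file = out_path, row.names = FALSE)

# Read the file back in to check that it survived the round trip
round_trip <- read.csv(out_path)
str(round_trip)
```

A file written this way opens cleanly in a spreadsheet application, with one heading per column.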

Slide 35

Slide 35 text

Save and export ›  Manipulate your data ›  save and export : my_data_with_calcs › open it in your previously favourite data app

Slide 36

Slide 36 text

Hard challenge ›  Read in the file "CO2_broken.csv" ›  This is probably what your data or the data you downloaded looks like ›  You can do it in R (or not…) ›  Read your own un-prepped data [PLEASE DON’T: look at the answers in the script before trying. DO: work with your neighbors and have FUN! HINT: There are 4 errors]

Slide 38

Slide 38 text

Broken data › ERROR 1 › Try to read in the data file iris_broken.csv:
> iris_data<-read.csv("iris_broken.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'iris_broken.csv': No such file or directory
› It didn’t work because the extension was .txt and not .csv:
iris_data<-read.csv("iris_broken.txt")

Slide 39

Slide 39 text

Broken data › ERROR 2
> head(iris_data)
  I.did.my.best.to.create.the.most.annoying.file.to.import.into.R
1 This really looks like the first file I ever imported into R\t\t\t\t\t
2 I since do a way better job of cleaning up my data\t\t\t \t\t
3 But some collaborators will never diverge from their sloppy ways\t \t\t\t\t
4 \tSepal.Length\tSepal.Width\tPetal.Length\tPetal.Width \tSpecies
5 1\t5.1\t3.5\t1.4\t0.2\tsetosa
6 2\t4.9\t3\t1.4\t0.2\tsetosa
› The data appear to be lumped into one column!

Slide 40

Slide 40 text

Broken data › ERROR 2: re-import the data, but specify the separation among entries •  The sep argument tells R what character separates the values on each line of the file (here TAB was used; sep = "" treats any whitespace, including TAB, as a separator)
iris_data <- read.csv("iris_broken.txt", sep = "")
head(iris_data)
str(iris_data)
› ERROR 3: the first 4 lines are useless, so skip them
iris_data <- read.csv("iris_broken.txt", sep = "", skip = 4)
head(iris_data)
str(iris_data)
› Is anything else strange?
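The effect of sep and skip can be reproduced on a small made-up file. In this sketch the file contents, broken_path and the object names are invented for illustration, and sep = "\t" is used to name the TAB explicitly, whereas the slides use sep = "" (any whitespace).

```r
# A tab-separated file with two note lines above the real header (hypothetical)
broken_path <- tempfile(fileext = ".txt")
writeLines(c(
  "A note someone typed above the data",
  "Another note",
  "Length\tWidth\tSpecies",
  "5.1\t3.5\tsetosa",
  "4.9\t3.0\tsetosa"
), broken_path)

# Default import: read.csv looks for commas, finds none,
# and puts everything into a single column
one_column <- read.csv(broken_path)
ncol(one_column)

# Name the separator and skip the two note lines
fixed <- read.csv(broken_path, sep = "\t", skip = 2)
str(fixed)
```

Counting the junk lines by opening the file in a text editor first is usually the quickest way to pick the right skip value.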

Slide 41

Slide 41 text

Broken data › ERROR 4 › Most continuous variables are listed as factors (categorical) •  Due to missing values entered as “Forgot_this_value” and “na” •  Recall that, by default, R only recognizes “NA” (capitalized) as a missing value
> str(iris_data)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length : Factor w/ 35 levels "4.3","4.4","4.5",..: 9 7 5 4 8 12 4 ...
 $ Sepal.Width : Factor w/ 24 levels "2","2.2","2.3",..: 15 10 12 11 16 ...
 $ Petal.Length : Factor w/ 45 levels "1","1.1","1.2",..: 5 5 4 6 5 8 5 6 5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...

Slide 42

Slide 42 text

ERROR 4

Slide 43

Slide 43 text

ERROR 4 › The fix!
read.csv("iris_broken.txt", sep = "", skip = 4, na.strings = c("NA", "na", "Forgot_this_value"))
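A self-contained sketch of what na.strings changes, using a small made-up file. The contents, na_path and the object names are invented for illustration; the missing-value markers mirror the ones named on the slides.

```r
# A CSV where missing values were typed in three different ways (hypothetical)
na_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Length,Width",
  "5.1,3.5",
  "na,3.0",
  "4.7,Forgot_this_value",
  "NA,3.1"
), na_path)

# Default import: "na" and "Forgot_this_value" are treated as text,
# so neither column is read as numeric
raw <- read.csv(na_path)
class(raw$Length)
class(raw$Width)

# Tell R which strings mean "missing"; matching is case-sensitive
fixed <- read.csv(na_path, na.strings = c("NA", "na", "Forgot_this_value"))
class(fixed$Length)
colSums(is.na(fixed))
```

Because the match is exact, a marker spelled with different capitalisation has to be listed separately.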

Slide 44

Slide 44 text

Broken data › ERROR 5 ›  OK, we’re nearly done! ›  Some variables still appear as factors •  Row 23 of Sepal.Width was entered as “_3.6” instead of “3.6”
iris_data$Sepal.Width[23]
class(iris_data$Sepal.Width)
›  Two new tools we will need •  as.is (an argument of read.csv): tells R to leave the variable alone •  as.numeric() (a function): tells R to make the variable numeric

Slide 45

Slide 45 text

Broken data › ERROR 5
iris_data <- read.csv("iris_broken.txt", sep = "", skip = 4, na.strings = c("NA", "na", "Forgot_this_value"), as.is = c("Sepal.Width", "Petal.Length"))
› Now R thinks these variables are only characters •  So next we will use as.numeric
iris_data$Sepal.Width <- as.numeric(iris_data$Sepal.Width)
iris_data$Petal.Length <- as.numeric(iris_data$Petal.Length)
•  Notice the WARNING message: NAs were introduced where non-numeric values were found
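Before letting as.numeric() turn a typo into NA, it can be worth finding the offending entries and, where the intended value is obvious, repairing them. The sketch below does this on a made-up one-column file; the file, typo_path, typo_data and bad_rows are illustrative names, and it assumes a stray leading underscore is the only problem.

```r
# A numeric column with one typo (hypothetical data)
typo_path <- tempfile(fileext = ".csv")
writeLines(c("Width", "3.5", "_3.6", "3.0"), typo_path)

# as.is keeps the column as plain text instead of converting it
typo_data <- read.csv(typo_path, as.is = "Width")
class(typo_data$Width)

# Rows that are not missing but fail to convert are the typos
converted <- suppressWarnings(as.numeric(typo_data$Width))
bad_rows <- which(is.na(converted) & !is.na(typo_data$Width))
typo_data$Width[bad_rows]

# Repair: strip the leading underscore, then convert
typo_data$Width <- as.numeric(sub("^_", "", typo_data$Width))
print(typo_data$Width)
```

Correcting the value in the source file is still the better long-term fix; doing it in the script keeps a record of what was changed.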
