SOC 4650 & SOC 5650 - Lecture 03

GETTING STARTED
WELCOME!
▸ Install the janitor and naniar packages in R
▸ Please sit in the same place you sat last night!
▸ Check Slack’s #_news channel for a link to this week’s entry ticket

INTRO TO GISC
THE NATURE OF SPATIAL DATA (PART 1)
CHRISTOPHER PRENER, PH.D.
SPRING, 2018 / WEEK 04 / LECTURE 03

AGENDA
1. Front Matter
2. GISc & Public Policy
3. Types of Data
4. Methodological Challenges in GISc
5. Data Wrangling
6. Back Matter

1. FRONT MATTER

ANNOUNCEMENTS
▸ LP-04, Lab-02, and PS-01 are all due next class (February 12th)
▸ Please sit at the same computer you worked at last class!
▸ We’ll be putting final project workgroups together this week.

2. GISC & PUBLIC POLICY

3. TYPES OF DATA

WHAT ARE SPATIAL DATA?
▸ Tabular
▸ Geometric

TABULAR DATA
Tabular variables can be categorical (qualitative), ordinal, or continuous.
▸ crime, “ucr crime type”: 1. murder, 2. rape, 3. robbery, 4. aggravated assault
▸ crimeStr, “ucr crime type, string”: murder, rape, robbery, aggravated assault
▸ newCrime, “new incident”: 1. no, 2. yes (also shown with the codes 0. and 1.)
▸ leadLevels, “rate of exposure”: 1. very low, 2. low, 3. high, 4. very high
▸ leadCount, “number of high tests”: 1, 2, …, 85

GEOMETRIC DATA
▸ Vector (point, line, polygon)
▸ Raster

POINT GEOMETRIC DATA - SHAW CRIMES
point: zero-dimensional data - no length or width

LINE GEOMETRIC DATA - STREET CENTERLINES
line: one-dimensional data - has length but not width

POLYGON GEOMETRIC DATA - BUILDING FOOTPRINTS
polygon: two-dimensional data - has length and width

RASTER GEOMETRIC DATA - SHAW CRIMES

4. METHODOLOGICAL CHALLENGES IN GISC

“EVERYTHING IS RELATED TO EVERYTHING ELSE, BUT NEAR THINGS ARE MORE RELATED THAN DISTANT THINGS”
Waldo Tobler, “A computer movie simulating urban growth in the Detroit region” (1970)

NEAR THINGS ARE LIKE OTHER NEAR THINGS
Geographic patterning or “clustering” is referred to as spatial autocorrelation.
Example patterns: I = -1.000, I = 0.000, I = 0.857
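The I values on the slide look like Moran's I, a common measure of spatial autocorrelation. That reading is an assumption, since the slide does not name the statistic. A minimal base R sketch, using a hypothetical 4 x 4 grid, rook (shared-edge) neighbors, and a made-up helper named morans_i_grid(), shows how a perfectly alternating pattern produces I = -1:

```r
# Moran's I for values on a regular grid, using rook (shared-edge) neighbors.
# The grids and the helper name morans_i_grid() are hypothetical illustrations.
morans_i_grid <- function(m) {
  n_row <- nrow(m)
  n_col <- ncol(m)
  z <- m - mean(m)   # deviations from the mean
  cross <- 0         # running sum of w_ij * z_i * z_j
  s0 <- 0            # running sum of all weights
  for (i in seq_len(n_row)) {
    for (j in seq_len(n_col)) {
      # candidate neighbors to the north, south, west, and east
      nbrs <- rbind(c(i - 1, j), c(i + 1, j), c(i, j - 1), c(i, j + 1))
      keep <- nbrs[, 1] >= 1 & nbrs[, 1] <= n_row &
        nbrs[, 2] >= 1 & nbrs[, 2] <= n_col
      nbrs <- nbrs[keep, , drop = FALSE]
      cross <- cross + sum(z[i, j] * z[nbrs])
      s0 <- s0 + nrow(nbrs)
    }
  }
  (length(m) / s0) * (cross / sum(z^2))
}

# checkerboard: every cell differs from all of its neighbors
checker <- outer(1:4, 1:4, function(r, k) (r + k) %% 2)
# two halves: low values in the left columns, high values in the right columns
halves <- matrix(rep(c(0, 0, 1, 1), each = 4), nrow = 4)

morans_i_grid(checker)  # -1: perfect negative autocorrelation
morans_i_grid(halves)   # about 0.67: similar values cluster together
```

Values near 0 would indicate no spatial pattern, which is the middle case on the slide.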

PAST BEHAVIOR IS PREDICTIVE
When patterns repeat over time, like traveling the same route for your commute, we refer to them as having temporal autocorrelation.

THE SIERRA NEVADA

KEY TERM
Topographic maps are special types of reference maps that use a basemap with contour lines describing terrain: changes in the altitude and shape of the earth’s surface.

SPATIAL SAMPLING
▸ Where did these data come from?
▸ How were they collected?
▸ Are these data valid?
▸ How are these data measured?
▸ Are they representative?
▸ Are these data appropriate for the question or application at hand?

THE MAINE COAST

SMALL SCALE
Target Scale: 1:20,000,000
Miles per Inch: ~316

LARGER SCALE
Target Scale: 1:5,000,000
Miles per Inch: ~79

LARGER SCALE
Target Scale: 1:500,000
Miles per Inch: ~7.89

SCALE & DISTANCE
1:20,000,000 vs. 1:500,000

SCALE & REPRESENTATION
1:20,000,000 vs. 1:500,000
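The miles-per-inch figures on the scale slides follow from the representative fraction: one inch on a 1:N map covers N inches on the ground, and a mile contains 63,360 inches. A short sketch of that arithmetic (the helper name miles_per_inch() is hypothetical):

```r
# Convert a map scale denominator (the N in 1:N) to ground miles per map inch.
inches_per_mile <- 12 * 5280   # 63,360 inches in a statute mile

miles_per_inch <- function(scale_denominator) {
  scale_denominator / inches_per_mile
}

scales <- c(20000000, 5000000, 500000)
round(miles_per_inch(scales), 2)
# 315.66  78.91   7.89
```

A smaller denominator means a larger scale, so each inch of map covers less ground and can show more detail.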

LONDON

RAW QUANTITATIVE DATA

NORMALIZED QUANTITATIVE DATA
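Raw counts often track how many people live in an area, so normalizing by population can change which areas stand out. A minimal sketch using made-up tract values (not the London data shown on the slides):

```r
# Hypothetical tracts: raw counts of an event and total population.
tracts <- data.frame(
  tract = c("A", "B", "C"),
  count = c(50, 50, 10),
  pop   = c(10000, 2500, 400)
)

# Normalize the raw count to a rate per 1,000 residents.
tracts$rate_per_1000 <- tracts$count / tracts$pop * 1000

# Sort from the highest rate to the lowest.
tracts[order(-tracts$rate_per_1000), ]
```

In this toy example, tract C has the smallest raw count but the highest rate (25 per 1,000), while tract A ties for the largest count but has the lowest rate (5 per 1,000).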

5. DATA WRANGLING

HIGH LEVEL WORKFLOW
FOR EACH STEP:
1. Plan
2. Organize
3. Document
4. Execute

“IT IS OFTEN SAID THAT 80% OF DATA ANALYSIS IS SPENT ON THE PROCESS OF CLEANING AND PREPARING THE DATA”
Hadley Wickham, “Tidy Data” (2014)

“HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY IN ITS OWN WAY”
Leo Tolstoy, Anna Karenina (1878)

“LIKE FAMILIES, TIDY DATASETS ARE ALL ALIKE BUT EVERY MESSY DATASET IS MESSY IN ITS OWN WAY.”
Hadley Wickham, “Tidy Data” (2014)

“TIDY DATASETS PROVIDE A STANDARDIZED WAY TO LINK THE STRUCTURE OF A DATASET (ITS PHYSICAL LAYOUT) WITH ITS SEMANTICS (ITS MEANING).”
Hadley Wickham, “Tidy Data” (2014)

TIDY DATA
Each variable should be saved in its own column.

VARIABLES
▸ Are variables named consistently?
▸ Are variable names clear?
▸ Are variables stored in the format that makes the most sense for their data?
▸ Do variables represent one and only one construct?
▸ Is there a unique identification variable?
▸ Are there missing or incomplete data?

TIDY DATA
Each observation should be saved in its own row.

OBSERVATIONS
▸ What is the observational unit?

TIDY DATA
Each table (i.e. each file or data frame) should contain a single observational unit. Example: a table that mixes Car, Dealer, and Brand becomes separate Cars, Dealers, and Brands tables.

OBSERVATIONS
▸ What is the observational unit?
  • Do the data need to be subset into tables with different observational units?
▸ Are there duplicate observations?
▸ Are there “near” duplicate observations?

PACKAGES
▸ readr for reading data
▸ magrittr for its pipe operator, but we don’t need to load it explicitly
▸ janitor for its data cleaning functions
▸ dplyr for data wrangling functions
▸ naniar for missing data analyses

READING IN TABULAR DATA
f(x): read_csv("data/filename.csv")
Available in readr. Download via CRAN as part of tidyverse.

Parameters:
▸ filename is the name of the file you wish to load

Using a hypothetical file:
> leadData <- read_csv("data/leadData.csv")

Always copy and paste your raw data into a data subfolder of your project.

The read_csv() function is easily confused with read.csv(). These are not the same function, and they produce slightly different data frames!
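One way to see how read_csv() and read.csv() differ is to read the same file with both. The sketch below writes a small made-up CSV to a temporary file so it can run on its own; the column names and values are hypothetical.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("geo id,count tested", "29510118100,445", "29510117400,309"), path)

tidy_df <- suppressMessages(read_csv(path))  # readr
base_df <- read.csv(path)                    # base R

class(tidy_df)   # includes "tbl_df": read_csv() returns a tibble
class(base_df)   # "data.frame"

names(tidy_df)   # "geo id" "count tested": names are kept as written
names(base_df)   # "geo.id" "count.tested": spaces are replaced with periods
```

Because read_csv() keeps names exactly as written, spaces in variable names survive the import and need to be cleaned up afterwards.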

RENAMING VARIABLES EN MASSE
f(x): clean_names(.data, case)
Available in janitor. Download via CRAN.

Parameters:
▸ .data is a data frame or a tibble
▸ case can take on a range of options, but I recommend using either snake for snake_case or small_camel for camelCase.

Using the stlLead data from stlData:
> leadData <- clean_names(leadData, case = "snake")

It doesn’t really matter how you name variables as long as (a) there are no spaces in the name and (b) you name them consistently and clearly.

> str(leadData)
'data.frame': 106 obs. of 15 variables:
 $ geoID    : num 2.95e+10 2.95e+10 2.95e+10 2.95e+10 ...
 $ tractCE  : int 118100 117400 126700 119102 126800 126900 ...
 $ nameLSAD : chr "Census Tract 1181" "Census Tract 1174" ...
> leadData <- clean_names(leadData, case = "snake")
> str(leadData)
'data.frame': 106 obs. of 15 variables:
 $ geo_id    : num 2.95e+10 2.95e+10 2.95e+10 2.95e+10 2.95e+10 ...
 $ tract_ce  : int 118100 117400 126700 119102 126800 126900 ...
 $ name_lsad : chr "Census Tract 1181" "Census Tract 1174" ...
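The slide output shows only the snake option. As a rough sketch, the two recommended case values can be compared on a made-up data frame whose names include spaces (the data frame messy and its columns are hypothetical):

```r
library(janitor)

# Hypothetical data frame with inconsistent names, including spaces.
messy <- data.frame(
  `Geo ID` = c(1, 2),
  `Count Tested` = c(445, 309),
  pct_elevated = c(12.5, 18.2),
  check.names = FALSE
)

snake_names <- names(clean_names(messy, case = "snake"))
camel_names <- names(clean_names(messy, case = "small_camel"))

snake_names   # "geo_id" "count_tested" "pct_elevated"
camel_names   # camelCase versions of the same names, with no spaces
```

Either option satisfies both naming rules from the slide: no spaces, and one consistent convention across every variable.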

RENAMING A SINGLE VARIABLE
f(x): rename(.data, newName = oldName)
Available in dplyr. Download via CRAN as part of tidyverse.

Parameters:
▸ newName is the new variable name you want to use
▸ oldName is the current name of the variable

Using the povertyU18 variable from stlData’s stlLead data:
> leadData <- rename(leadData, povertyKids = povertyU18)

It doesn’t really matter how you name variables as long as (a) there are no spaces in the name and (b) you name them consistently and clearly.

PIPELINES
```{r rename-variables}
leadData %>%
  clean_names(case = "snake") %>%
  rename(pvty_u18 = poverty_u_18) %>%
  rename(pvty_u18_moe = poverty_u_18_moe) -> leadData
```
1. Take the leadData data frame, then
2. standardize all of the variable names to snake_case, then
3. rename poverty_u_18 to pvty_u18, then
4. rename poverty_u_18_moe to pvty_u18_moe, and
5. assign the changes to leadData.

PIPELINES
1. Keep them short
2. Keep them focused on a single group of tasks, like renaming variables
3. Remember not to include the data frame reference for functions in the pipeline
4. Some functions are not compatible (or at least easily compatible) with the pipe operator
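Guideline 3 above can be illustrated by writing the same two renames with and without the pipe. This is a sketch on a hypothetical two-column data frame, not the stlLead data:

```r
library(dplyr)

# Hypothetical data frame standing in for leadData.
df <- data.frame(geoID = c(1, 2), povertyU18 = c(40, 55))

# Without the pipe, each function call names the data frame it works on.
step1 <- rename(df, geo_id = geoID)
unpiped <- rename(step1, pvty_u18 = povertyU18)

# In a pipeline, the data frame is named once at the start and then passed along,
# so the later functions do not repeat it.
df %>%
  rename(geo_id = geoID) %>%
  rename(pvty_u18 = povertyU18) -> piped

identical(unpiped, piped)  # TRUE
```

Both versions produce the same data frame; the pipeline simply avoids the intermediate object and the repeated data frame reference.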

MISSING DATA ANALYSIS
f(x): miss_var_summary(.data)
Available in naniar. Download via CRAN.

Parameters:
▸ .data is a data frame or a tibble

Using the stlLead data from stlData:
> miss_var_summary(leadData)

Output can optionally be assigned to a new data object.

> miss_var_summary(leadData)
# A tibble: 15 x 3
  variable     n_missing percent
  <chr>            <int>   <dbl>
1 geoID                0       0
2 tractCE              0       0
3 nameLSAD             0       0
4 countTested          0       0
5 pctElevated          0       0
6 totalPop             0       0
7 totalPop_MOE         0       0
8 white                0       0

MISSING DATA ANALYSIS
f(x): miss_case_summary(.data)

Parameters:
▸ .data is a data frame or a tibble

Using the stlLead data from stlData:
> miss_case_summary(leadData)

Output can optionally be assigned to a new data object.

> miss_case_summary(leadData)
# A tibble: 106 x 3
   case n_missing percent
  <int>     <int>   <dbl>
1     1         0       0
2     2         0       0
3     3         0       0
4     4         0       0
5     5         0       0
# ... with 101 more rows
> missingCases <- miss_case_summary(leadData)
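The stlLead output above contains no missing values, so every count is zero. A small sketch with made-up data that does include NA values may make the two summaries easier to read (the data frame toy is hypothetical):

```r
library(naniar)

# Hypothetical data with a few missing values.
toy <- data.frame(
  geoID       = c(1, 2, 3, 4),
  countTested = c(445, NA, 309, NA),
  pctElevated = c(12.5, 18.2, NA, 9.1)
)

var_summary  <- miss_var_summary(toy)    # one row per variable
case_summary <- miss_case_summary(toy)   # one row per observation

var_summary
case_summary

# Base R cross-check of the per-variable counts.
na_counts <- colSums(is.na(toy))
na_counts
```

Here countTested is missing in two rows and pctElevated in one, which the base R cross-check confirms.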

FIND DUPLICATE VALUES
f(x): get_dupes(.data, varList)
Available in janitor. Download via CRAN.

Parameters:
▸ varList can be optionally specified to look for duplicates in only one variable or in a list of variables

Using the stlLead data from stlData:
> get_dupes(leadData)

If no varList is supplied, this will look for duplicates across all columns.

> get_dupes(leadData)
No variable names specified - using all columns.
No duplicate combinations found of: geoID, tractCE, nameLSAD, countTested, pctElevated, totalPop, totalPop_MOE, white, white_MOE, ... and 6 other variables
# A tibble: 0 x 16
# ... with 16 variables: geoID <dbl>, tractCE <int>, nameLSAD <chr>,
#   countTested <int>, pctElevated <dbl>, totalPop <int>,
#   totalPop_MOE <int>, white <int>, white_MOE <int>, black <int>,
#   black_MOE <int>, povertyTot <int>, povertyTot_MOE <int>,
#   povertyU18 <int>, povertyU18_MOE <int>, dupe_count <int>

> get_dupes(leadData, geoID)
No duplicate combinations found of: geoID
# A tibble: 0 x 16
# ... with 16 variables: geoID <dbl>, dupe_count <int>,
#   tractCE <int>, nameLSAD <chr>, countTested <int>,
#   pctElevated <dbl>, totalPop <int>, totalPop_MOE <int>,
#   white <int>, white_MOE <int>, black <int>, black_MOE <int>,
#   povertyTot <int>, povertyTot_MOE <int>, povertyU18 <int>,
#   povertyU18_MOE <int>
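Since stlLead has no duplicates, both calls above return zero rows. A sketch on made-up data with one repeated identifier shows what get_dupes() returns when it does find something (the data frame toy is hypothetical):

```r
library(janitor)

# Hypothetical data in which geoID 102 appears twice.
toy <- data.frame(
  geoID       = c(101, 102, 102, 103),
  countTested = c(445, 309, 309, 120)
)

# Look for duplicates in the identification variable only.
dupes <- get_dupes(toy, geoID)
dupes
```

The result keeps only the duplicated rows and adds a dupe_count column recording how many times each duplicated value appears.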

SELECTING VARIABLES
f(x): select(.data, varList)
Available in dplyr. Download via CRAN as part of tidyverse.

Parameters:
▸ varList is a list of variables to either retain or remove

Using the stlLead data from stlData:
> elevateData <- select(leadData, geoID, pctElevated)

This variation of select() will keep the listed variables in your data frame.

Using the stlLead data from stlData:
> demoData <- select(leadData, -countTested, -pctElevated)

This variation of select() will remove the listed variables from your data frame.

SUBSETTING OBSERVATIONS
f(x): filter(.data, expression)

Parameters:
▸ varName is a variable whose characteristics will be used to identify rows to retain
▸ expression provides instructions on how to evaluate that variable (see cookbook handout)

Using the stlLead data from stlData:
> highData <- filter(leadData, pctElevated >= 15)

The filter() function will always keep the observations that are evaluated as TRUE.

This example evaluates each observation of pctElevated to see if it is greater than or equal to 15. If it is, it retains that row. If it is not, it removes that row.
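One consequence of keeping only rows that evaluate as TRUE is that rows with a missing value in the tested variable are dropped, because their comparison evaluates to NA rather than TRUE. A minimal sketch with made-up data (the data frame toy is hypothetical):

```r
library(dplyr)

# Hypothetical data with one missing pctElevated value.
toy <- data.frame(
  geoID       = c(1, 2, 3, 4),
  pctElevated = c(12.5, 18.2, NA, 15)
)

highData <- filter(toy, pctElevated >= 15)

# Row 1 is FALSE and row 3 is NA, so only rows 2 and 4 are kept.
highData$geoID   # 2 4
```

The row with exactly 15 is retained because the expression uses greater than or equal to.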

ALTERING A VARIABLE
f(x): mutate(.data, varName = expression)

Parameters:
▸ varName is either an existing variable you want to edit or a new variable you want to create
▸ expression provides instructions on how to alter the variable (see cookbook handout)

Using the pctElevated variable from stlData’s stlLead data:
> leadData <- mutate(leadData, highLead = ifelse(pctElevated >= 15, "high", "low"))

The ifelse() function evaluates a statement to be either TRUE or FALSE for each observation. If TRUE, it returns the first value (high in this case). If FALSE, it returns the second.

This example evaluates each observation of pctElevated to see if it is greater than or equal to 15. If it is, it enters a value of high in the new variable highLead. If it is not, it enters a value of low in the new variable. Note how quotes are used.

Using the countTested variable from stlData’s stlLead data:
> leadData <- mutate(leadData, countTested = ifelse(geoID == 29510118100, 445, countTested))

This example edits a single observation of countTested by using a unique identification variable, geoID. It changes the errant value in that observation to 445 while retaining the original values in the remaining observations.

Using the countTested variable from stlData’s stlLead data:
> leadData <- mutate(leadData, countTested = as.character(countTested))

This example converts countTested from numeric to character, so every value in the variable is stored as a string.

PIPELINES
```{r rename-variables}
leadData %>%
  mutate(highLead = ifelse(pctElevated >= 15, "high", "low")) %>%
  mutate(countTested = ifelse(geoID == 29510118100, 445, countTested)) -> leadData
```
1. Take the leadData data frame, then
2. create a new variable named highLead, then
3. fix an errant value in countTested, then
4. assign the changes to leadData.
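The same two mutate() steps can be tried on a small made-up data frame, which makes it easy to check the result by eye. The values below, including the errant 9999, are hypothetical:

```r
library(dplyr)

# Hypothetical data; 9999 stands in for an errant countTested value.
toy <- data.frame(
  geoID       = c(1, 2, 3),
  countTested = c(445, 9999, 120),
  pctElevated = c(12.5, 18.2, 15)
)

toy %>%
  mutate(highLead = ifelse(pctElevated >= 15, "high", "low")) %>%
  mutate(countTested = ifelse(geoID == 2, 309, countTested)) -> toy

toy$highLead      # "low" "high" "high"
toy$countTested   # 445 309 120
```

Only the observation with geoID equal to 2 is changed; every other row keeps its original countTested value because ifelse() returns countTested itself when the test is FALSE.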

6. BACK MATTER

AGENDA REVIEW
2. GISc & Public Policy
3. Types of Data
4. Methodological Challenges in GISc
5. Data Wrangling

REMINDERS
▸ LP-04, Lab-02, and PS-01 are all due next class (February 12th)
▸ We’ll be putting final project workgroups together this week.
▸ Please close any open GitHub Issues in your assignment repository!
