midfieldr
Data, methods, & metrics for studying student persistence
Richard Layton (Rose-Hulman Institute of Technology), Russell Long (Purdue University), Matthew Ohland (Purdue University), Nichole Ramirez (Purdue University)
useR! Conference, Brisbane, 2018-07-11

In education, cross-sectional designs are typical
[Figure: groups 1, 2, and 3 shown as different groups observed at one time]

Longitudinal studies offer some advantages
[Figure: the same groups followed over time, year 1 through year 6]

MIDFIELD is a database for longitudinal studies
1.6 M undergraduate students at 21 US institutions
Whole-population data from registrars
1987–present

MIDFIELD data are curated in four categories: students, courses, terms, and degrees
MIDFIELD: 1.6 M students

R package midfielddata provides a stratified sample
Same four categories: students, courses, terms, and degrees
midfielddata: 98,000 students

Each observation is a unique student
Data set: midfieldstudents (98,000 observations, 19 MB of memory)
Variables: student ID, institution, term, major, transfer, sex, race, age, US citizen, home ZIP code, SAT, ACT

Each observation is one term for one student
Data set: midfieldterms (729,000 observations, 82 MB of memory)
Variables: student ID, institution, term, major, level, standing, co-op, credit hours, GPA

Each observation is one course for one student
Data set: midfieldcourses (3.5 M observations, 348 MB of memory)
Variables: student ID, institution, term, course (section, hours, type, grade, instructor)

Each observation is a unique student
Data set: midfielddegrees (98,000 observations, 10 MB of memory)
Variables: student ID, institution, term, major, degree

midfielddata provides the data
Data sets: midfieldstudents, midfieldterms, midfieldcourses, midfielddegrees

midfieldr provides the tools
    library(midfielddata)
    library(midfieldr)
Functions: cip_filter(), ever_filter(), grad_filter(), race_sex_join(), multiway_order(), etc.

Both packages are currently available on GitHub

Why R?
Increase accessibility to the data
Support a collaborative community of researchers
Share our methods and metrics

Which is stickier: Engineering or Stats/Applied-Math?
$$\text{stickiness} = \frac{N\ \text{graduates of a program}}{N\ \text{students ever enrolled in the program}}$$
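The stickiness ratio can be illustrated with a small base-R sketch. The helper name `stickiness_ratio` and the counts for the two programs are made up for illustration; nothing here is part of midfieldr.

```r
# Stickiness: graduates of a program divided by students ever enrolled in it.
# `stickiness_ratio` is a hypothetical helper, not a midfieldr function.
stickiness_ratio <- function(n_grad, n_ever) {
  stopifnot(is.numeric(n_grad), is.numeric(n_ever),
            length(n_grad) == length(n_ever))
  # A program with nobody ever enrolled has no defined stickiness.
  ifelse(n_ever > 0, n_grad / n_ever, NA_real_)
}

# Made-up counts for two hypothetical programs.
n_ever <- c(programA = 200, programB = 50)
n_grad <- c(programA = 90, programB = 30)

stickiness_ratio(n_grad, n_ever)  # 0.45 for programA, 0.60 for programB
```

A higher value means a larger share of the students who ever enrolled in the program went on to graduate from it.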

We start with the programs’ Classification of Instructional Programs (CIP) codes
[Diagram: cip_filter() from midfieldr applied to the cip data set in midfieldr, 1584 observations]

cip_filter() helps us find the codes we want
27-series: Applied math and stats
#> # A tibble: 4 x 2
#>   cip4  cip4name
#> 1 2701  Mathematics
#> 2 2703  Applied Mathematics
#> 3 2705  Statistics
#> 4 2799  Mathematics and Statistics, Other
14-series: Engineering
#> # A tibble: 1 x 2
#>   cip2  cip2name
#> 1 14    Engineering
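The idea of picking out a code series can be sketched in base R. This is not the cip_filter() implementation or its interface: the helper `filter_by_series`, the toy table, and its single `code` column are assumptions, with rows copied from the output shown on the slide.

```r
# Toy lookup table using the five rows shown on the slide. The real cip data
# has 1584 observations and separate cip2/cip4 columns; merging them into one
# `code` column is a simplification made for this sketch.
cip_toy <- data.frame(
  code = c("2701", "2703", "2705", "2799", "14"),
  name = c("Mathematics", "Applied Mathematics", "Statistics",
           "Mathematics and Statistics, Other", "Engineering"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows whose code starts with a series prefix.
filter_by_series <- function(cip, series) {
  cip[startsWith(cip$code, series), , drop = FALSE]
}

math_stats  <- filter_by_series(cip_toy, "27")  # the four 27-series rows
engineering <- filter_by_series(cip_toy, "14")  # the single 14-series row
```

Matching on a prefix works because CIP codes are hierarchical: every code in a series begins with the same two digits.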

ever_filter() identifies all students ever enrolled in these two programs
[Diagram: ever_filter() from midfieldr applied to midfieldterms from midfielddata, 730,000 observations]

ever_filter() identifies all students ever enrolled in these two programs... nearly 20,000
#> # A tibble: 19,404 x 2
#>    id          program
#>  1 MID25783162 Engineering
#>  2 MID25783166 Engineering
#>  3 MID25783167 Engineering
#>  4 MID25783178 Engineering
#>  5 MID25783197 Engineering
#>  6 MID25783199 Engineering
#>  7 MID25783257 Engineering
#>  8 MID25783259 Engineering
#>  9 MID25783275 Engineering
#> 10 MID25783388 Engineering
#> # ... with 19,394 more rows
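What "ever enrolled" means can be shown with a base-R sketch on toy term records. The student IDs, the six-digit codes, the `cip` column name, and the helper `ever_enrolled_in` are all hypothetical; this illustrates the concept and is not how ever_filter() is written.

```r
# Toy term records: one row per student per term (made-up IDs and codes).
terms_toy <- data.frame(
  id  = c("S1", "S1", "S2", "S2", "S3", "S4"),
  cip = c("140801", "140801", "270101", "140901", "520101", "270501"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: students with at least one term in a program whose
# code starts with `series`, each listed once.
ever_enrolled_in <- function(terms, series, program) {
  ids <- unique(terms$id[startsWith(terms$cip, series)])
  data.frame(id = ids, program = rep(program, length(ids)),
             stringsAsFactors = FALSE)
}

ever_toy <- rbind(
  ever_enrolled_in(terms_toy, "14", "Engineering"),
  ever_enrolled_in(terms_toy, "27", "Statistics and Applied Math")
)
```

In this toy data S1 has two Engineering terms but is listed once, S2 switched programs and so counts as ever enrolled in both, and S3 appears in neither.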

race_sex_join() adds student race and sex variables to the data frame
[Diagram: race_sex_join() from midfieldr applied to midfieldstudents from midfielddata, 98,000 observations]

We group and summarize these data by program, race, and sex
#> Observations: 28
#> Variables: 4
#> $ program "Engineering", "Engineering"
#> $ race "Asian", "Asian", "Black", "
#> $ sex "Female", "Male", "Female",
#> $ ever 302, 998, 734, 1273, 177, 57
These counts are a little high because "possible graduation in 6 years" is not yet accounted for.
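The join-then-count step can be sketched in base R with toy data. The IDs and demographic values are made up, and `merge()` plus `aggregate()` stand in for whatever race_sex_join() and the grouping code in the talk actually do.

```r
# Toy inputs (hypothetical IDs and values).
enrolled_toy <- data.frame(
  id = c("S1", "S2", "S3", "S4", "S5"),
  program = c("Engineering", "Engineering", "Engineering",
              "Statistics and Applied Math", "Statistics and Applied Math"),
  stringsAsFactors = FALSE
)
students_toy <- data.frame(
  id = c("S1", "S2", "S3", "S4", "S5"),
  race = c("Asian", "Asian", "Black", "Asian", "Asian"),
  sex = c("Female", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Attach race and sex by student ID, keeping every ever-enrolled row.
joined <- merge(enrolled_toy, students_toy, by = "id", all.x = TRUE)

# Count students in each program/race/sex group; the count column is `ever`.
joined$ever <- 1L
ever_counts <- aggregate(ever ~ program + race + sex, data = joined, FUN = sum)
```

The result has one row per program/race/sex combination that occurs in the data, which is the same shape as the 28-observation summary on the slide.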

grad_filter() identifies the students graduating from these two programs
[Diagram: grad_filter() from midfieldr applied to midfielddegrees from midfielddata, 98,000 observations]

Again, we group and summarize these data (7500 graduates) by program, race, and sex
#> Observations: 27
#> Variables: 4
#> $ program "Engineering", "Engineering"
#> $ race "Asian", "Asian", "Black", "
#> $ sex "Female", "Male", "Female",
#> $ grad 129, 445, 276, 395, 55, 206,

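We join the graduates to the ever-enrolled students and compute stickiness for program/race/sex groups with more than 5 students ever enrolled
    library(dplyr)
    # Join on the three grouping columns shared by both summaries
    stickiness <- left_join(ever_enrolled, graduated, by = c("program", "race", "sex")) %>%
      filter(ever > 5) %>%
      mutate(stickiness = grad / ever)
Graph using conventional ggplot2 functions.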
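The earlier summaries have 28 ever-enrolled groups but only 27 graduate groups, so a left join leaves at least one group with a missing `grad` value. The sketch below, on made-up counts, shows one way to handle that; treating a missing group as zero graduates is an assumption of this example, not something the talk states.

```r
library(dplyr)

# Made-up counts for three groups; the third group is absent from `graduated`.
ever_enrolled <- tibble(
  program = c("Engineering", "Engineering", "Statistics and Applied Math"),
  race = c("Asian", "Asian", "Asian"),
  sex = c("Female", "Male", "Female"),
  ever = c(300, 1000, 8)
)
graduated <- tibble(
  program = c("Engineering", "Engineering"),
  race = c("Asian", "Asian"),
  sex = c("Female", "Male"),
  grad = c(120, 450)
)

stickiness <- left_join(ever_enrolled, graduated,
                        by = c("program", "race", "sex")) %>%
  filter(ever > 5) %>%
  # Assumption: a group missing from `graduated` had zero graduates.
  mutate(grad = coalesce(grad, 0),
         stickiness = grad / ever)
```

Without the `coalesce()` step the third row would carry `NA` for both `grad` and `stickiness`, and plotting functions would typically drop it.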

[Figure: stickiness (axis from 0.1 to 0.6) by race/sex group, in two panels: Statistics and Applied Math, and Engineering. Groups as listed on the axis of both panels: Black Male, Hispanic Male, Native American Female, Black Female, Native American Male, Hispanic Female, White Female, White Male, Asian Female, Asian Male]

To find out more...
R packages
MIDFIELD Project
midfi[email protected]
Support provided by the US National Science Foundation, Grant 1545667, "Expanding Access to and Participation in the Multiple-Institution Database for Investigating Engineering Longitudinal Development"