Making MIDFIELD More Accessible
A workshop for R beginners
Richard Layton, Matthew Ohland, Russell Long, Marisa Orr
2018-10-03, FIE Conference, San Jose, CA

Min  Topic
10   Introductions
20   Elements of effective graphs
30   Getting started with R (tutorial)
20   Accessing the MIDFIELD data
20   (break)
40   Using midfieldr (tutorial)
10   Extending your repertoire
10   Next steps
20   Conversations

R-Bar volunteers can help with software issues.

Elements of effective graphs

In your handout, list the slices A thru E from largest to smallest
[Figure: pie chart with slices A, B, C, D, E]
• B (largest)
• D
• A
• C
• E (smallest)
Adapted from (Robbins 2013) Ch. 2

The same data arranged along a common axis
Comparing values along a common axis is a high-accuracy visual task.
[Figure: dot plot with rows E, C, A, D, B; axis ticks from 17 to 23]

Slices are what percentage of the whole?
[Figure: 3D pie chart with slices A, B, C, D]
Fill in the blanks. The total should be 100%.
A. ___
B. ___
C. ___
D. ___

3D-effects distort our judgment
[Figure: 3D pie chart with slices A, B, C, D]
The total should be 100%.
A. 20%
B. 20%
C. 20%
D. 40%

Again, the same data arranged along a common axis
A high-accuracy visual task.
[Figure: dot plot with rows A, B, C, D; axis ticks from 20 to 40]

Write down the heights of the bars
This is a visual inspection only.
Fill in the blanks.
A. ___
B. ___
C. ___
D. ___
Adapted from (Robbins 2013) p. 22

Again, 3D-effects distort our judgment
This is a visual inspection only.
A. 2
B. 4
C. 6
D. 8

Again, the same data arranged along a common axis
A high-accuracy visual task.
[Figure: rows A, B, C, D; axis ticks 2, 4, 6, 8]

You can use bars, but must include zero
[Figure: bars A, B, C, D; axis ticks 0, 2, 4, 6, 8]

If you mark the endpoints, you can omit the bar
[Figure: endpoints A, B, C, D; axis ticks 0, 2, 4, 6, 8]

Producing a “dot plot” with rows ordered per the data
[Figure: dot plot with rows A, B, C, D; axis ticks 0, 2, 4, 6, 8]

Try estimating areas of three states
Visual estimation of area is a low-accuracy task. South Carolina (SC) ≈ 83,000 sq. km.
[Figure: FL, GA, AL, and SC shown as areas]
FL ___ x 1000 sq. km
GA ___ x 1000 sq. km
AL ___ x 1000 sq. km
SC 83 x 1000 sq. km
Adapted from (Ihaka 2007)

Again, the same data arranged along a common axis
[Figure: FL, GA, AL, and SC on a common axis]
FL ___ x 1000 sq. km
GA ___ x 1000 sq. km
AL ___ x 1000 sq. km
SC 83 x 1000 sq. km

Your estimates have probably improved
FL 170 x 1000 sq. km
GA 154 x 1000 sq. km
AL 136 x 1000 sq. km
SC 83 x 1000 sq. km
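
The common-axis comparison of the four state areas can be sketched in base R as a dot plot with rows ordered by the data. This is a minimal illustration, not the code used to produce the workshop figures; the areas are the values shown on the slide, and the axis label and title are assumptions.

```r
# State areas in thousands of sq. km, as shown on the slide
area <- c(FL = 170, GA = 154, AL = 136, SC = 83)

# Order the rows by the data. dotchart() draws the first element in the
# bottom row, so an ascending sort puts the largest state in the top row.
area_sorted <- sort(area)

# Draw the dot plot on a common axis that includes zero
dotchart(area_sorted,
         xlim = c(0, 180),
         pch  = 19,
         xlab = "Area (x 1000 sq. km)",
         main = "State areas on a common axis")
```

Ordering the rows by value rather than alphabetically makes the ranking readable at a glance.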

When color represents area, what story emerges?
Color used deceptively: 2012 election by county (Obama, Romney)

When color represents voters?
Color used judiciously: each dot is 100 votes for Obama or Romney.
Color represents a quantity: each dot is 100 votes.

The experts tell us
Optimal design primarily depends on
• The message to be conveyed
• The variables to be shown
(Doumont 2009)

The experts tell us
The task of the designer is to give visual access to the subtle and the difficult, that is, reveal the complex. (Tufte 1983)

The experts tell us
What’s your point? Seriously, that’s the most important question. (Evergreen 2017)

R is designed with statistical analysis and data graphics in mind
Well-designed data graphics are accessible, even to the beginner
• makes graphical exploration of data accessible to all
• work in progress is easily disseminated via GitHub
And because R is open-source
• new packages appear regularly, and one might solve your problem
• anyone can help us find errors and add features to our packages

Getting started with R (tutorial)

This self-paced tutorial introduces basic R
• Don’t worry about the pace of your work.
• Everyone works and learns new material at a different pace.
• Please ask questions of your neighbors as well as the facilitators.
• If you finish early, ask if anyone near you needs assistance.
• Save your work regularly.

Accessing the MIDFIELD data

In education, cross-sectional designs are typical
[Figure: groups 1, 2, and 3 on a time axis; different groups at one time]

Longitudinal studies offer some advantages
[Figure: years 1 through 6 on a time axis; same groups over time]

MIDFIELD is a database for longitudinal studies
• 1.6 M undergraduate students at 21 US institutions
• whole-population data from registrars
• 1987–present

MIDFIELD data are curated in four categories
students, courses, terms, degrees
MIDFIELD: 1.6 M students

R package midfielddata provides a stratified sample
students, courses, terms, degrees
midfielddata: 98,000 students

Each observation is a unique student
midfieldstudents: 98,000 observations, 19 MB of memory
Variables: student ID, institution, term, major, transfer, sex, race, age, US citizen, home zip code, SAT, ACT

Each observation is one term for one student
midfieldterms: 729,000 observations, 82 MB of memory
Variables: student ID, institution, term, major, level, standing, co-op, credit hours, GPA

Each observation is one course for one student
midfieldcourses: 3.5 M observations, 348 MB of memory
Variables: student ID, institution, term, course (section, hours, type, grade, instructor)

Each observation is a unique student
midfielddegrees: 98,000 observations, 10 MB of memory
Variables: student ID, institution, term, major, degree
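
Because the four tables share a student ID, records in one table can be joined to attributes in another. The sketch below illustrates that idea with two small made-up data frames in base R. The column names (`id`, `sex`, `term`, `gpa`) and all values are hypothetical, not the actual midfielddata variable names or data.

```r
# Hypothetical toy tables; names and values are illustrative only
students <- data.frame(
  id  = c("s1", "s2", "s3"),
  sex = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  id   = c("s1", "s1", "s2", "s3", "s3", "s3"),
  term = c("2001F", "2002S", "2001F", "2001F", "2002S", "2002F"),
  gpa  = c(3.2, 3.4, 2.9, 3.8, 3.7, 3.9),
  stringsAsFactors = FALSE
)

# In the student table, each observation is a unique student
stopifnot(!anyDuplicated(students$id))

# Join student attributes onto each term record using the shared ID
terms_with_sex <- merge(terms, students, by = "id")

# Count the terms on record for each student
terms_per_student <- table(terms_with_sex$id)

print(terms_with_sex)
print(terms_per_student)
```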

midfielddata provides the data
midfieldstudents, midfieldterms, midfieldcourses, midfielddegrees

midfieldr provides the tools
library(midfielddata)
library(midfieldr)
cip_filter(), ever_filter(), grad_filter(), race_sex_join(), multiway_order(), etc.

Preparing for the workshop, you installed both packages
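
One way to confirm that both packages are available before the tutorial is sketched below. It uses only base R and takes the package names from the slides; it assumes nothing about the packages' contents beyond what R itself reports, and it is not part of the workshop materials.

```r
# Package names as given on the slides
pkgs <- c("midfielddata", "midfieldr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE if not
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# If the data package is available, list the names of its data sets
if (installed[["midfielddata"]]) {
  print(data(package = "midfielddata")$results[, "Item"])
}

# If the tools package is available, list the functions it exports
if (installed[["midfieldr"]]) {
  print(sort(getNamespaceExports("midfieldr")))
}
```

If either entry prints `FALSE`, an R-Bar volunteer can help with the installation.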

midfieldr provides functions for working with midfielddata
Some of the functions you will use today are:

Function          Provides
cip_filter()      Identify programs by CIP code
cip_label()       Label your programs
ever_filter()     Find all students ever enrolled in your programs
grad_filter()     Find all graduates of your programs
race_sex_join()   Join student race/ethnicity and sex to the data
multiway_order()  Order the rows and panels of multiway data

• We’ll work with midfieldr after the break.

Using midfieldr (tutorial)

This self-paced tutorial illustrates midfieldr functions
• Don’t worry about the pace of your work.
• Everyone works and learns new material at a different pace.
• Please ask questions of your neighbors as well as the facilitators.
• If you finish early, ask if anyone near you needs assistance.
• Save your work regularly.

Extending your repertoire: Metrics & graphics

Graduation rates of starters
[Figure 4: Graduation rates of starters]

Stickiness in major and in any other major
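
As a rough illustration of the arithmetic behind such a metric, the sketch below assumes stickiness is the ratio of students graduating from a program to students ever enrolled in it. Both that definition and the student IDs are assumptions made for this example; it uses base R only and does not call midfieldr.

```r
# Hypothetical toy data: IDs of students ever enrolled in a program
# and IDs of the students who graduated from it
ever_enrolled <- c("s01", "s02", "s03", "s04", "s05", "s06", "s07", "s08")
graduated     <- c("s02", "s05", "s06")

# Every graduate of the program must also have been enrolled in it
stopifnot(all(graduated %in% ever_enrolled))

# Assumed definition: graduates divided by students ever enrolled
stickiness <- length(graduated) / length(ever_enrolled)
cat("Stickiness:", round(100 * stickiness, 1), "%\n")
```

In the tutorial, the two groups would come from functions such as `ever_filter()` and `grad_filter()` rather than from hand-typed vectors.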

Starting and destination majors of all women ever in EE
[Figure: Sankey diagram, women ever enrolled in Electrical Engineering; axes Starting Major and Year 6 Destination; categories Non-ENG, Other-ENG, EE, Unknown; number of students (x1000)]

Migration yield

Comparing graduation rates of starters and migrators

Next steps

Next steps in learning to use midfieldr
Several more vignettes (tutorials) are on the midfieldr website.

Next steps if you want more than a MIDFIELD sample
MIDFIELD: 1.6 M students (students, courses, terms, degrees)
Talk to a member of the MIDFIELD team. Names and emails are on the website.

Talk to a member of the MIDFIELD team

Next steps in learning R
Authors: Hadley Wickham, Garrett Grolemund, Robert Kabacoff
Or just google it. Your problem may already be solved.

Next steps in learning about graph design
Authors: Edward Tufte, Howard Wainer, Naomi Robbins, Charles Kostelnick, Michael Hassett

Conversations
An unstructured time to relax, talk, question, and share.

