Slide 1

Slide 1 text

Making MIDFIELD More Accessible
A workshop for R beginners
Richard Layton, Matthew Ohland, Russell Long, Marisa Orr
2018-10-03, FIE Conference, San Jose, CA

Slide 2

Slide 2 text

R-Bar volunteers can help with software issues
Min  Topic
10   Introductions
20   Elements of effective graphs
30   Getting started with R (tutorial)
20   Accessing the MIDFIELD data
20   (break)
40   Using midfieldr (tutorial)
10   Extending your repertoire
10   Next steps
20   Conversations

Slide 3

Slide 3 text

Elements of effective graphs

Slide 4

Slide 4 text

In your handout, list the slices A through E from largest to smallest
[Figure: pie chart with slices A, B, C, D, E]
Adapted from (Robbins 2013), Ch. 2

Slide 5

Slide 5 text

In your handout, list the slices A through E from largest to smallest
[Figure: pie chart with slices A, B, C, D, E]
• B (largest)
Adapted from (Robbins 2013), Ch. 2

Slide 6

Slide 6 text

In your handout, list the slices A through E from largest to smallest
[Figure: pie chart with slices A, B, C, D, E]
• B (largest)
• D
• A
• C
• E (smallest)
Adapted from (Robbins 2013), Ch. 2

Slide 7

Slide 7 text

The same data arranged along a common axis
Comparing values along a common axis is a high-accuracy visual task.
[Figure: rows E, C, A, D, B plotted on a common axis from 17 to 23]

Slide 8

Slide 8 text

Slices are what percentage of the whole?
[Figure: pie chart with slices A, D, C, B]
Fill in the blanks. The total should be 100%.
A.
B.
C.
D.

Slide 9

Slide 9 text

3D effects distort our judgment
[Figure: pie chart with slices A, D, C, B]
Fill in the blanks. The total should be 100%.
A. 20%
B. 20%
C. 20%
D. 40%

Slide 10

Slide 10 text

Again, the same data arranged along a common axis
A high-accuracy visual task.
[Figure: rows A, B, C, D plotted on a common axis from 20 to 40]

Slide 11

Slide 11 text

Write down the heights of the bars
This is a visual inspection only.
Fill in the blanks
A.
B.
C.
D.
Adapted from (Robbins 2013), p. 22

Slide 12

Slide 12 text

Again, 3D effects distort our judgment
This is a visual inspection only.
Fill in the blanks
A. 2
B. 4
C. 6
D. 8

Slide 13

Slide 13 text

Again, the same data arranged along a common axis
A high-accuracy visual task.
[Figure: rows A, B, C, D plotted on a common axis from 2 to 8]

Slide 14

Slide 14 text

You can use bars, but the axis must include zero
[Figure: bars for A, B, C, D on an axis from 0 to 8]

Slide 15

Slide 15 text

If you mark the endpoints, you can omit the bar
[Figure: endpoints only for A, B, C, D on an axis from 0 to 8]

Slide 16

Slide 16 text

Producing a “dot plot” with rows ordered by the data values
[Figure: dot plot of A, B, C, D on an axis from 0 to 8]
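A dot plot like this one can be sketched in base R. The sketch below is a minimal illustration, not the code used to produce the slide: the values are the bar heights given earlier in the deck (A = 2, B = 4, C = 6, D = 8), and the axis label "Value" is an assumption.

```r
# Values taken from the earlier slides (A = 2, B = 4, C = 6, D = 8)
heights <- c(A = 2, B = 4, C = 6, D = 8)

# Sort ascending: dotchart() draws the first element at the bottom,
# so the largest value ends up in the top row
ordered_heights <- sort(heights)

# Base R dot plot on a common axis that starts at zero
dotchart(ordered_heights, xlim = c(0, 8), pch = 19, xlab = "Value")
```

Ordering the rows by value rather than alphabetically lets the reader see the ranking at a glance.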

Slide 17

Slide 17 text

Try estimating areas of three states
Visual estimation of area is a low-accuracy task. South Carolina (SC) ≈ 83,000 sq. km.
State  Area (x 1000 sq. km)
FL     ___
GA     ___
AL     ___
SC     83
Adapted from (Ihaka 2007)

Slide 18

Slide 18 text

Again, the same data arranged along a common axis
State  Area (x 1000 sq. km)
FL     ___
GA     ___
AL     ___
SC     83

Slide 19

Slide 19 text

Your estimates have probably improved
State  Area (x 1000 sq. km)
FL     170
GA     154
AL     136
SC     83

Slide 20

Slide 20 text

When color represents area, what story emerges?
Color used deceptively, 2012 election by county: Obama, Romney
http://www-personal.umich.edu/~mejn/election/2012/

Slide 21

Slide 21 text

When color represents voters?
Color used judiciously, each dot 100 votes for: Obama, Romney
Color represents a quantity: each dot is 100 votes.
http://coach.weinstein.to/lets-get-specific/election-results/

Slide 22

Slide 22 text

The experts tell us
Optimal design primarily depends on
• The message to be conveyed
• The variables to be shown
(Doumont 2009)
Image from http://www.principiae.be/pdfs/Principiae-2014.pdf

Slide 23

Slide 23 text

The experts tell us
The task of the designer is to give visual access to the subtle and the difficult, that is, reveal the complex.
(Tufte 1983)
Image from https://en.wikipedia.org/wiki/Edward_Tufte

Slide 24

Slide 24 text

The experts tell us
What’s your point? Seriously, that’s the most important question.
(Evergreen 2017)
Image from https://tei.cgu.edu/people/stephanie-evergreen-phd/

Slide 25

Slide 25 text

R is designed with statistical analysis and data graphics in mind
Well-designed data graphics are accessible, even to the beginner
• makes graphical exploration of data accessible to all
• work in progress is easily disseminated via GitHub
And because R is open-source
• new packages appear regularly; one might solve your problem
• anyone can help us find errors and add features to our packages

Slide 26

Slide 26 text

Getting started with R (tutorial)

Slide 27

Slide 27 text

This self-paced tutorial introduces basic R
• Don’t worry about the pace of your work.
• Everyone works and learns new material at a different pace.
• Please ask questions of your neighbors as well as the facilitators.
• If you finish early, ask if anyone near you needs assistance.
• Save your work regularly.

Slide 28

Slide 28 text

https://midfieldr.github.io/workshops
Create an R project, start an R script, add code, rinse, repeat.

Slide 29

Slide 29 text

Accessing the MIDFIELD data

Slide 30

Slide 30 text

In education, cross-sectional designs are typical
[Figure: group 1, group 2, and group 3 on a time axis; different groups at one time]

Slide 31

Slide 31 text

Longitudinal studies offer some advantages
[Figure: time axis from year 1 to year 6; same groups over time]

Slide 32

Slide 32 text

MIDFIELD is a database for longitudinal studies
• 1.6 M undergraduate students at 21 US institutions
• whole-population data from registrars
• 1987–present

Slide 33

Slide 33 text

MIDFIELD data are curated in four categories
Categories: students, courses, terms, degrees
MIDFIELD: 1.6 M students

Slide 34

Slide 34 text

R package midfielddata provides a stratified sample
Categories: students, courses, terms, degrees
midfielddata: 98,000 students

Slide 35

Slide 35 text

Each observation is a unique student
Categories: students, courses, terms, degrees
students: student ID, institution, term, major, transfer, sex, race, age, US citizen, home ZIP code, SAT, ACT
midfielddata: 98,000 students
midfieldstudents: 98,000 observations, 19 Mb of memory

Slide 36

Slide 36 text

Each observation is one term for one student
Categories: students, courses, terms, degrees
students: student ID, institution, term, major, transfer, sex, race, age, US citizen, home ZIP code, SAT, ACT
terms: student ID, institution, term, major, level, standing, co-op, credit hours, GPA
midfielddata: 98,000 students
midfieldterms: 729,000 observations, 82 Mb of memory

Slide 37

Slide 37 text

Each observation is one course for one student
Categories: students, courses, terms, degrees
students: student ID, institution, term, major, transfer, sex, race, age, US citizen, home ZIP code, SAT, ACT
terms: student ID, institution, term, major, level, standing, co-op, credit hours, GPA
courses: student ID, institution, term, course (section, hours, type, grade, instructor)
midfielddata: 98,000 students
midfieldcourses: 3.5 M observations, 348 Mb of memory

Slide 38

Slide 38 text

Each observation is a unique student
Categories: students, courses, terms, degrees
students: student ID, institution, term, major, transfer, sex, race, age, US citizen, home ZIP code, SAT, ACT
terms: student ID, institution, term, major, level, standing, co-op, credit hours, GPA
courses: student ID, institution, term, course (section, hours, type, grade, instructor)
degrees: student ID, institution, term, major, degree
midfielddata: 98,000 students
midfielddegrees: 98,000 observations, 10 Mb of memory
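Because every table carries a student ID, records from one table can be matched to another. The base R sketch below illustrates the idea on two tiny made-up data frames; the column names (`id`, `sex`, `term`, `gpa`), the term encoding, and all values are assumptions for illustration and are not the actual midfielddata variable names or data.

```r
# Hypothetical stand-in for the students table: one row per student
students <- data.frame(
  id  = c("S1", "S2", "S3"),
  sex = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the terms table: one row per student per term
terms <- data.frame(
  id   = c("S1", "S1", "S2", "S3", "S3", "S3"),
  term = c(19991, 19993, 19991, 20001, 20003, 20011),
  gpa  = c(3.2, 3.4, 2.8, 3.9, 3.7, 3.8),
  stringsAsFactors = FALSE
)

# Attach student attributes to each term record via the shared ID
student_terms <- merge(terms, students, by = "id")

# Count the terms on record for each student
terms_per_student <- table(student_terms$id)
print(terms_per_student)
```

The merged result keeps one row per student-term, which is why the terms and courses tables are so much larger than the students table.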

Slide 39

Slide 39 text

midfielddata provides the data
• midfieldstudents
• midfieldterms
• midfieldcourses
• midfielddegrees

Slide 40

Slide 40 text

midfieldr provides the tools
Data: midfieldstudents, midfieldterms, midfieldcourses, midfielddegrees
library(midfielddata)
library(midfieldr)
cip_filter()
ever_filter()
grad_filter()
race_sex_join()
multiway_order()
etc.

Slide 41

Slide 41 text

Preparing for the workshop, you installed both packages
https://midfieldr.github.io/midfieldr

Slide 42

Slide 42 text

midfieldr provides functions for working with midfielddata
Some of the functions you will use today:
Function          Provides
cip_filter()      Identify programs by CIP code
cip_label()       Label your programs
ever_filter()     Find all students ever enrolled in your programs
grad_filter()     Find all graduates of your programs
race_sex_join()   Join student race/ethnicity and sex to the data
multiway_order()  Order the rows and panels of multiway data

Slide 43

Slide 43 text

midfieldr provides functions for working with midfielddata
Some of the functions you will use today:
Function          Provides
cip_filter()      Identify programs by CIP code
cip_label()       Label your programs
ever_filter()     Find all students ever enrolled in your programs
grad_filter()     Find all graduates of your programs
race_sex_join()   Join student race/ethnicity and sex to the data
multiway_order()  Order the rows and panels of multiway data
• We’ll work with midfieldr after the break.
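To give a feel for what "identify programs by CIP code" means, here is a base R sketch on a small hand-made lookup table. It deliberately does not call `cip_filter()` and is not how midfieldr implements it; the data frame `cip`, its column names, and the shortened program names are assumptions made for illustration.

```r
# Hypothetical lookup table of 6-digit CIP codes (program names abbreviated)
cip <- data.frame(
  cip6 = c("140801", "141001", "141901", "520201"),
  name = c("Civil Engineering",
           "Electrical Engineering",
           "Mechanical Engineering",
           "Business Administration"),
  stringsAsFactors = FALSE
)

# CIP codes are hierarchical: the first two digits give the broad field.
# Codes beginning with "14" belong to the Engineering series.
engineering <- cip[startsWith(cip$cip6, "14"), ]
print(engineering)

# A longer prefix narrows the search, here to the 14.10 series
electrical <- cip[startsWith(cip$cip6, "1410"), ]
print(electrical)
```

The tutorial after the break shows the midfieldr functions themselves.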

Slide 44

Slide 44 text

Using midfieldr (tutorial)

Slide 45

Slide 45 text

This self-paced tutorial illustrates midfieldr functions
• Don’t worry about the pace of your work.
• Everyone works and learns new material at a different pace.
• Please ask questions of your neighbors as well as the facilitators.
• If you finish early, ask if anyone near you needs assistance.
• Save your work regularly.

Slide 46

Slide 46 text

https://midfieldr.github.io/midfieldr
Start a new R script, add a line of code, run it.
Examine the result, repeat.

Slide 47

Slide 47 text

Extending your repertoire: Metrics & graphics

Slide 48

Slide 48 text

Graduation rates of starters
[Figure 4: Graduation rates of starters]
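As a rough sketch of how such a metric could be computed, the base R example below assumes that a program's graduation rate of starters is the share of students who started in the program and also graduated from it. That definition, the data frame `starters`, its column names, and all values are assumptions for illustration; they are not taken from the workshop materials.

```r
# Hypothetical records: one row per student who started in a program.
# grad_program is NA when no degree was earned.
starters <- data.frame(
  id            = paste0("S", 1:8),
  start_program = c("EE", "EE", "EE", "EE", "ME", "ME", "ME", "ME"),
  grad_program  = c("EE", "EE", NA,   "ME", "ME", "ME", "ME", NA),
  stringsAsFactors = FALSE
)

# Assumed rule: a starter counts as a graduate only when the degree
# program matches the starting program
starters$graduated <- !is.na(starters$grad_program) &
  starters$grad_program == starters$start_program

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the rate per starting program
grad_rate <- tapply(starters$graduated, starters$start_program, mean)
print(round(100 * grad_rate, 1))
```

Under this rule, a student who starts in one program and graduates from another lowers the rate of the starting program, so a different definition would give different numbers.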

Slide 49

Slide 49 text

Stickiness in major and in any other major

Slide 50

Slide 50 text

Starting and destination majors of all women ever in EE
[Figure: Sankey diagram, "Women ever enrolled in Electrical Engineering"; starting major (Non-ENG, Other-ENG, EE, Unknown) to year 6 destination (Non-ENG, Other-ENG, EE); number of students (x1000)]

Slide 51

Slide 51 text

Migration yield

Slide 52

Slide 52 text

Comparing graduation rates of starters and migrators

Slide 53

Slide 53 text

Next steps

Slide 54

Slide 54 text

Next steps in learning to use midfieldr
Several more vignettes (tutorials) are available on the midfieldr website

Slide 55

Slide 55 text

Next steps if you want more than a MIDFIELD sample
Categories: students, courses, terms, degrees
MIDFIELD: 1.6 M students
Talk to a member of the MIDFIELD team. Names and emails are on the website.

Slide 56

Slide 56 text

Talk to a member of the MIDFIELD team

Slide 57

Slide 57 text

Next steps in learning R
• Hadley Wickham and Garrett Grolemund
• Robert Kabacoff
• StackExchange.com
• Or just google it. Your problem may already be solved.

Slide 58

Slide 58 text

Next steps in learning about graph design
• Edward Tufte
• Howard Wainer
• Naomi Robbins
• Charles Kostelnick and Michael Hassett

Slide 59

Slide 59 text

Conversations
An unstructured time to relax, talk, question, and share.

Slide 60

Slide 60 text

References
Doumont, Jean-luc. 2009. Trees, Maps, and Theorems: Effective Communication for Rational Minds. 2nd ed. Kraainem, Belgium: Principiae.
Evergreen, Stephanie D. H. 2017. Effective Data Visualization: The Right Chart for the Right Data. Sage.
Ihaka, Ross. 2007. “Statistics 787 Lecture Slides.”
Kabacoff, Robert. 2015. R in Action: Data Analysis and Graphics with R. 2nd ed. Manning Publications Co.
Kostelnick, Charles, and Michael Hassett. 2003. Shaping Information: The Rhetoric of Visual Conventions. Southern Illinois University.
Robbins, Naomi. 2013. Creating More Effective Graphs. Chart House.
Tufte, Edward. 1983. The Visual Display of Quantitative Information. Graphics Press.
Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. Copernicus.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. Sebastopol, CA: O’Reilly Media, Inc.