Slide 1

Slide 1 text

BOX PLOTS
A CASE STUDY IN DEBUGGING AND PERSEVERANCE
Kara Woo | @kara_woo
Sage Bionetworks
rstudio::conf 2019

Slide 2

Slide 2 text

THIS IS ONLY KIND OF ABOUT BOX PLOTS

Slide 3

Slide 3 text

“When I try to produce boxplots with colours depending on a categorical variable, these appear overlapping if varwidth is set to TRUE”
GitHub user mcol

Slide 4

Slide 4 text

“This should be straightforward.”

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

How do I know what the bug is?
How do I fix it?
How do I know when I’m done?

Slide 7

Slide 7 text

HOW DO YOU KNOW WHAT THE BUG IS?
Isolate the problem
Photo: Jia Ye

Slide 8

Slide 8 text

REPREX
minimal reproducible example
Photo: Sarah Dorweiler

Slide 9

Slide 9 text

reprex::reprex()

Slide 10

Slide 10 text

require(ggplot2)
#> Loading required package: ggplot2
ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = TRUE)
#> Warning: position_dodge requires non-overlapping x intervals

Slide 11

Slide 11 text

require(ggplot2)
#> Loading required package: ggplot2
ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = FALSE)

Slide 12

Slide 12 text

Follow the trails
Photo: Jia Ye

Slide 13

Slide 13 text

require(ggplot2)
#> Loading required package: ggplot2
ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = TRUE)
#> Warning: position_dodge requires non-overlapping x intervals

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

position_dodge() → PositionDodge → collide() → pos_dodge()

Slide 16

Slide 16 text

COLLIDE()
• Gets information about box location
• Looks for box overlap
• Passes boxes that share a position to pos_dodge()
POS_DODGE()
• Scales boxes down
• Places boxes side by side
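The division of labour between the two functions can be sketched in base R. This is a simplified illustration, not ggplot2's actual code: `dodge_group()` and `collide_sketch()` are hypothetical names, and the sketch assumes boxes count as sharing a position only when their `xmin` values are identical.

```r
# Hypothetical stand-in for pos_dodge(): shrink each box in a group to 1/n of
# the shared width and lay the boxes out left to right.
dodge_group <- function(boxes) {
  n <- nrow(boxes)
  left <- min(boxes$xmin)
  slot <- (max(boxes$xmax) - left) / n
  boxes$xmin <- left + (seq_len(n) - 1) * slot
  boxes$xmax <- boxes$xmin + slot
  boxes
}

# Hypothetical stand-in for collide(): group boxes that share an xmin and
# hand each group to the dodging strategy.
collide_sketch <- function(data, strategy = dodge_group) {
  pieces <- lapply(split(data, data$xmin), strategy)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Two x positions with two equally wide boxes at each.
boxes <- data.frame(
  x    = c(1, 2, 1, 2),
  xmin = c(0.625, 1.625, 0.625, 1.625),
  xmax = c(1.375, 2.375, 1.375, 2.375)
)
dodged <- collide_sketch(boxes)
dodged
```

With equal-width boxes, each pair ends up side by side inside the space one box used to occupy.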

Slide 17

Slide 17 text

> debug(ggplot2:::collide)

Slide 18

Slide 18 text

> debug(ggplot2:::collide)
> ggplot(data = iris, aes(Species, Sepal.Length)) +
>   geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = FALSE)
#> debugging in: collide(data, params$width,
#>     name = "position_dodge", strategy = pos_dodge,
#>     n = params$n, check.width = FALSE)
Browse[2]>

Slide 19

Slide 19 text

> debug(ggplot2:::collide)
> ggplot(data = iris, aes(Species, Sepal.Length)) +
>   geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = FALSE)
#> debugging in: collide(data, params$width,
#>     name = "position_dodge", strategy = pos_dodge,
#>     n = params$n, check.width = FALSE)
Browse[2]> data
#>   ... x  xmin  xmax
#> 1 ... 1 0.625 1.375
#> 2 ... 2 1.625 2.375
#> 3 ... 3 2.625 3.375
#> 4 ... 1 0.625 1.375
#> 5 ... 2 1.625 2.375
#> 6 ... 3 2.625 3.375
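A small base R sketch of how the debug flag behaves may help here. `scale_width()` is a made-up toy function standing in for an internal function such as `collide()`. Once a function is flagged, each call opens the browser, where `n` steps to the next expression, `c` continues, and `Q` quits.

```r
# A toy function standing in for an internal function you want to inspect.
scale_width <- function(width, n) width / n

debug(scale_width)                    # every later call opens the browser
flag_on <- isdebugged(scale_width)    # TRUE while the flag is set

undebug(scale_width)                  # remove the flag when you are done
flag_off <- isdebugged(scale_width)   # FALSE again

# debugonce(scale_width) would instead clear the flag after a single call.
```

Unexported functions need the triple-colon form, as in `debug(ggplot2:::collide)` on the slide.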

Slide 20

Slide 20 text

VARWIDTH = FALSE
data
#>   ... x  xmin  xmax
#> 1 ... 1 0.625 1.375
#> 2 ... 2 1.625 2.375
#> 3 ... 3 2.625 3.375
#> 4 ... 1 0.625 1.375
#> 5 ... 2 1.625 2.375
#> 6 ... 3 2.625 3.375
VARWIDTH = TRUE
data
#>   ... x      xmin     xmax
#> 1 ... 1 0.6553988 1.344601
#> 2 ... 2 1.8750000 2.125000
#> 3 ... 3 2.7984436 3.201556
#> 4 ... 1 0.8063508 1.193649
#> 5 ... 2 1.6250000 2.375000
#> 6 ... 3 2.6599632 3.340037

Slide 21

Slide 21 text

Boxes with different xmin aren’t treated as the same position.
collide <- function(data, ...) {
  # ...
  plyr::ddply(data, "xmin", strategy, ..., width = width)
  # ...
}
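The effect of grouping on `xmin` can be reproduced with the values from the debugging output. In this sketch, base R's `split()` stands in for `plyr::ddply(data, "xmin", ...)`; that substitution, and the data frame names, are assumptions made for illustration.

```r
# Box positions copied from the debugging output shown on the slides.
fixed_width <- data.frame(
  x    = c(1, 2, 3, 1, 2, 3),
  xmin = c(0.625, 1.625, 2.625, 0.625, 1.625, 2.625),
  xmax = c(1.375, 2.375, 3.375, 1.375, 2.375, 3.375)
)
var_width <- data.frame(
  x    = c(1, 2, 3, 1, 2, 3),
  xmin = c(0.6553988, 1.8750000, 2.7984436, 0.8063508, 1.6250000, 2.6599632),
  xmax = c(1.344601, 2.125000, 3.201556, 1.193649, 2.375000, 3.340037)
)

# Group the boxes the way the dodging strategy sees them: by exact xmin.
groups_fixed <- split(fixed_width, fixed_width$xmin)
groups_var   <- split(var_width, var_width$xmin)

length(groups_fixed)  # 3 groups, two boxes each
length(groups_var)    # 6 groups, one box each
```

With `varwidth = TRUE` every box lands in a group of its own, so the dodging strategy never sees two boxes together and has nothing to separate.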

Slide 22

Slide 22 text

HOW DO YOU FIX IT?
Experiment
Photo: Jia Ye

Slide 23

Slide 23 text

ccc6bbb4 This doesn't fix position_dodge, but it might be in the right direction?
f5946680 Commit before I break something else

Slide 24

Slide 24 text

INITIAL “FIX”
What’s wrong with this picture?

Slide 25

Slide 25 text

INITIAL “FIX”
What’s wrong with this picture?
• Boxes in the wrong order
• Doesn’t work for continuous x axes
• Incorrect scaling

Slide 26

Slide 26 text

HOW DO YOU FIX IT?
Test many scenarios
Photo: Jia Ye

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

HOW DO YOU FIX IT?
Make it general
Photo: Jia Ye

Slide 29

Slide 29 text

Can we extend to bars?
Can we solve other open issues?
Arbitrary rectangles?

Slide 30

Slide 30 text

POSITION_DODGE2()
→ Used for boxes, bars, rectangles
→ Finds overlap by comparing xmin to previous xmax
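The overlap rule can be sketched in base R as follows. This is a minimal illustration rather than the code in ggplot2: `overlap_groups()` is a hypothetical helper, and it assumes boxes are sorted by `xmin` before each one is compared with the box before it.

```r
# Hypothetical helper: walk the boxes in xmin order and start a new group
# whenever a box's xmin is at or beyond the previous box's xmax.
overlap_groups <- function(boxes) {
  boxes <- boxes[order(boxes$xmin), ]
  n <- nrow(boxes)
  group <- integer(n)
  counter <- 0L
  for (i in seq_len(n)) {
    if (i == 1 || boxes$xmin[i] >= boxes$xmax[i - 1]) {
      counter <- counter + 1L
    }
    group[i] <- counter
  }
  boxes$group <- group
  boxes
}

# The varwidth = TRUE positions from the debugging output.
boxes <- data.frame(
  x    = c(1, 2, 3, 1, 2, 3),
  xmin = c(0.6553988, 1.8750000, 2.7984436, 0.8063508, 1.6250000, 2.6599632),
  xmax = c(1.344601, 2.125000, 3.201556, 1.193649, 2.375000, 3.340037)
)
grouped <- overlap_groups(boxes)
grouped$group
```

Unlike grouping on exact `xmin`, this puts the two boxes at each x position into the same group even though their widths differ.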

Slide 31

Slide 31 text

HOW DO YOU KNOW WHEN YOU’RE DONE?
Again, test
Photo: Jia Ye

Slide 32

Slide 32 text

HOW DO YOU KNOW WHEN YOU’RE DONE?
Don’t let perfect be the enemy of good
Photo: Jia Ye

Slide 33

Slide 33 text

CAN POSITION_DODGE2() REPLACE POSITION_DODGE()?

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

• Isolate the problem
• Follow the trails
• Experiment
• Test many scenarios
• Make it general
• Don’t let perfect be the enemy of good
https://github.com/tidyverse/ggplot2/pull/2196

Slide 37

Slide 37 text

Thank you
Thanks to @mcol for reporting this bug, Hadley Wickham for repeated code reviews, and Sean Kross and Karthik Ram for feedback on this presentation.