Slide 1

Slide 1 text

Data visualization APAM E4990 Modeling Social Data Jake Hofman Columbia University February 15, 2019 Jake Hofman (Columbia University) Data visualization February 15, 2019 1 / 21

Slide 2

Slide 2 text

Why visualize?
1. To explore and understand data
2. To communicate with your readers

Slide 3

Slide 3 text

Anscombe’s quartet (1973)¹
What’s the difference between these four data sets?
¹ https://www.jstor.org/stable/2682899

Slide 4

Slide 4 text

Anscombe’s quartet
What’s the difference between these four data sets?

Slide 5

Slide 5 text

Anscombe’s quartet²
What’s the difference between these four data sets?
² http://vissoc.co/lookatdata.html
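
One way to put numbers on the question, sketched in R with the `anscombe` data frame that ships with base R (columns x1..x4 and y1..y4). The name `anscombe_stats` is ours, not from the slides. The four sets agree closely on means, variances, and correlation, so summary statistics alone cannot tell them apart; the differences only show up when the points are plotted.

```r
# Summary statistics for each of the four Anscombe data sets.
# `anscombe` is a built-in data frame with columns x1..x4 and y1..y4.
anscombe_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  var_y  = var(y),
    cor_xy = cor(x, y))
})
colnames(anscombe_stats) <- paste0("set", 1:4)

# One column per data set; the rows are nearly identical across columns.
print(round(anscombe_stats, 2))
```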

Slide 6

Slide 6 text

So. Many. Options.³
³ https://serialmentor.com/dataviz/directory-of-visualizations.html

Slide 7

Slide 7 text

Even. More. Options.

Slide 8

Slide 8 text

Good plots (a la Mackinlay 1986)
Good plots should express the facts as effectively as possible
• “Tell the truth and nothing but the truth”
• Use encodings that people can easily decode
• Make a clear and concise point
• Have a one sentence take-away

Slide 9

Slide 9 text

Good plots (a la Mackinlay 1986)
Automating the Design of Graphical Presentations of Relational Information
JOCK MACKINLAY, Stanford University
The goal of the research described in this paper is to develop an application-independent presentation tool that automatically designs effective graphical presentations (such as bar charts, scatter plots, and connected graphs) of relational information. Two problems are raised by this goal: The codification of graphic design criteria in a form that can be used by the presentation tool, and the generation of a wide variety of designs so that the presentation tool can accommodate a wide variety of information. The approach described in this paper is based on the view that graphical presentations are sentences of graphical languages. The graphic design issues are codified as expressiveness and effectiveness criteria for graphical languages. Expressiveness criteria determine whether a graphical language can express the desired information. Effectiveness criteria determine whether a graphical language exploits the capabilities of the output medium and the human visual system. A wide variety of designs can be systematically generated by using a composition algebra that composes a small set of primitive graphical languages. Artificial intelligence techniques are used to implement a prototype presentation tool called APT (A Presentation Tool), which is based on the composition algebra and the graphic design criteria.
Categories and Subject Descriptors: D.2.2 [Software Engineering]: Tools and Techniques--user interfaces; H.1.2 [Models and Principles]: User/Machine Systems--human information processing; H.3.4 [Information Storage and Retrieval]: Systems and Software; I.2.1 [Artificial Intelli-

Slide 10

Slide 10 text

Some visualizations are better⁴ than others
[Figure from "Automating the Design of Graphical Presentations", p. 125: perceptual tasks ranked from "More accurate" to "Less accurate": Position, Length, ..., Volume, Color (not shown)]
Fig. 14. Accuracy ranking of quantitative perceptual tasks. Higher tasks are accomplished more accurately than lower tasks. Cleveland and McGill empirically verified the basic properties of this ranking.
Quantitative Ordinal Nominal
Position Position Position
Color Hue
⁴ Perceived more accurately

Slide 11

Slide 11 text

Cleveland & McGill 1984
What percent is the smaller region of the larger region?
[Figure from Journal of the American Statistical Association, September 1984, p. 534: five panels labeled TYPE 1 to TYPE 5, each marking two values A and B]
Figure 4. Graphs from position-length experiment.
tracted by perceiving position along a scale, in this case the horizontal axis. The y values can be perceived in a similar manner. The real power of a Cartesian graph, however, does not derive only from one's ability to perceive the x and y values separately but, rather, from one's ability to understand the relationship of x and y. For example, in Figure 7 we see that the relationship is nonlinear and see the nature of that nonlinearity. The elementary task that en-
The eye-brain system is capable of extracting such a slope by perceiving the direction of the line segment joining (xi, yi) and (xj, yj). We conjecture that the perception of these slopes allows the eye-brain system to imagine a smooth curve through the points, which is then used to judge the pattern. For example, in Figure 7 one can perceive that the slopes for pairs of points on the left side of the plot are greater than those on the right side of the plot, which is what enables one to judge that the rela-

Slide 12

Slide 12 text

Cleveland & McGill 1984 / Heer & Bostock 2010
[Figure, cropped: log absolute errors against proportional difference (%) for each proportional judgment type; curves computed with lowess]
[Figure: two panels, "Cleveland & McGill's Results" (T1 to T5) and "Crowdsourced Results" (T1 to T9), each with a Log Error axis from 1.0 to 3.5]
Figure 4: Proportional judgment results (Exp. 1A & B). Top: Cleveland & McGill’s [7] lab study. Bottom: MTurk studies. Error bars indicate 95% confidence intervals.
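
Both panels report "Log Error". A minimal sketch in R of that measure as we understand Cleveland & McGill's definition, log2(|judged percent - true percent| + 1/8). The function name `log_error` and the example responses are hypothetical, not taken from either study.

```r
# Log absolute error of a proportional judgment:
# log2(|judged percent - true percent| + 1/8).
# The 1/8 keeps the logarithm finite when a judgment is exactly right.
log_error <- function(judged_pct, true_pct) {
  log2(abs(judged_pct - true_pct) + 1/8)
}

# Hypothetical responses to a stimulus whose true answer is 40%.
true_pct   <- 40
judged_pct <- c(40, 42, 35, 50)
errors <- log_error(judged_pct, true_pct)

print(round(errors, 2))
print(mean(errors))  # the kind of per-type average the figure summarizes
```

A perfect judgment scores log2(1/8) = -3, and the score grows with the size of the miss.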

Slide 13

Slide 13 text

Different strokes for different data types
• Quantitative: numerical values in a range (e.g., height)
• Ordinal: categories with natural ordering (e.g., day of week)
• Nominal: categories with no natural ordering (e.g., gender)
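
A rough sketch of how these three types are commonly represented in R: numeric vectors for quantitative data, ordered factors for ordinal data, and plain factors for nominal data. The values below are made up for illustration.

```r
# Quantitative: numerical values in a range.
height_cm <- c(162.5, 180.1, 175.0)

# Ordinal: categories with a natural ordering, stored as an ordered factor.
day <- factor(c("Mon", "Wed", "Tue"),
              levels = c("Mon", "Tue", "Wed"), ordered = TRUE)

# Nominal: categories with no natural ordering, stored as a plain factor.
gender <- factor(c("female", "male", "female"))

print(is.numeric(height_cm))  # numeric vector
print(is.ordered(day))        # ordered factor
print(is.ordered(gender))     # unordered factor
print(day[1] < day[2])        # order comparisons are defined for ordered factors
```

Plotting libraries such as ggplot2 use these types to pick default scales, so getting the type right is part of choosing an encoding.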

Slide 14

Slide 14 text

Different strokes for different data types
[Figure: perceptual tasks ranked separately for Quantitative, Ordinal, and Nominal data. Tasks named: Position, Length, Angle, Slope, Area, Volume, Density, Color Saturation, Color Hue, Texture, Connection, Containment, Shape]
Fig. 15. Ranking of perceptual tasks. The tasks shown in the gray boxes are not relevant to these types of data.
An example analysis for area perception is shown in Figure 16. The top line shows that a series of decreasing areas can be used to encode a tenfold quantitative range. Of course, in a real diagram such as Figure 13, the areas would be laid out

Slide 15

Slide 15 text

Different colors for different data types⁵
Quantitative
Nominal
⁵ https://serialmentor.com/dataviz/color-basics.html
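
A small sketch of the same contrast in ggplot2, assuming the `mpg` data set that ships with the package (it is not used in the slides). Mapping a numeric column to color gives a continuous gradient by default, while mapping a categorical column gives distinct hues.

```r
library(ggplot2)

# Quantitative variable (city mileage, numeric) mapped to color:
# ggplot2 defaults to a continuous gradient.
p_quant <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Nominal variable (drive train, character) mapped to color:
# ggplot2 defaults to a set of distinct hues.
p_nominal <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

print(p_quant)
print(p_nominal)
```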

Slide 16

Slide 16 text

A grammar of graphics
A language to describe the components of a graphic

Slide 17

Slide 17 text

Grammar of graphics a la ggplot2⁶
⁶ http://r4ds.had.co.nz

Slide 18

Slide 18 text

Grammar of graphics a la ggplot2⁷
1. Get your data into the right format
2. Map variables to aesthetics
3. Choose a geometry for your plot
4. Set co-ordinate system and scales
5. Add annotations, legends, and labels
⁷ http://vissoc.co/makeplot.html
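
The five steps can be sketched on a small built-in data set. This example is ours, not from the slides: it reshapes base R's `anscombe` data frame into one row per observation and then builds the plot one layer per step. The axis limits and labels are arbitrary choices.

```r
library(ggplot2)

# 1. Get your data into the right format: one row per (set, x, y) observation.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Set", i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

p <- ggplot(long, aes(x = x, y = y)) +                      # 2. map variables to aesthetics
  geom_point() +                                            # 3. choose a geometry
  coord_cartesian(xlim = c(0, 20), ylim = c(0, 14)) +       # 4. set coordinate system and scales
  facet_wrap(~ set) +                                       #    one panel per data set
  labs(x = "x", y = "y", title = "Anscombe's quartet")      # 5. add annotations and labels

print(p)
```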

Slide 19

Slide 19 text

Grammar of graphics a la ggplot2
[Figure: "Health and wealth of countries over time". Life expectancy (30 to 80) against GDP per capita (log scale, 1e+03 to 1e+05), one panel per year (1957, 1977, 1997); point size shows pop (300,000,000 to 1,200,000,000) and color shows continent (Africa, Americas, Asia, Europe, Oceania)]
library(ggplot2)
library(gapminder)  # provides the gapminder data frame
library(scales)     # provides comma() for the size legend labels

# The figure shows only 1957, 1977, and 1997; unfiltered data gives one panel per survey year.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  scale_size_area(labels = comma) +
  labs(x = 'GDP per capita', y = 'Life expectancy', title = 'Health and wealth of countries over time') +
  facet_wrap(~ year)

Slide 20

Slide 20 text

Benefits
• Lowers the barrier to asking questions of your data
• Lets you explore more, and faster
• Easily produces publication-ready plots
• Large and active user base for support

Slide 21

Slide 21 text

Acknowledgements
Slides are generously adapted from Çağatay Demiralp, whose slides are generously adapted from Jeff Heer’s Data Visualization course