A Simple Scalable Visualization Approach for Large Complex Data

Slide 1

Slide 1 text

Slide 2

Slide 2 text

CONTEXT: DEEP ANALYSIS OF LARGE COMPLEX DATA
» Goals of analysis:
  – Uncover interesting or previously unknown behavior
  – Develop new insights
  – Identify deviations from expected behavior
  – Confirm or reject hypotheses or suspicions
» Visualization is critical in this process

Slide 3

Slide 3 text

DEEP ANALYSIS OF LARGE COMPLEX DATA
» Data most often do not come with a manual for what to do
» If we already (think we) know the algorithm / model to apply and simply apply it to the data, we are not doing analysis, we are processing
» Deep analysis means detailed, comprehensive analysis that does not lose important information in the data
» It means learning from the data, not forcing our preconceptions on the data
» It means being willing and able to use any of the 1000s of statistical, machine learning, and visualization methods as dictated by the data
» It means trial and error, an iterative process of hypothesizing, fitting, validating, learning

Slide 4

Slide 4 text

DEEP ANALYSIS OF LARGE COMPLEX DATA
» Large complex data can mean any or all of the following:
  – Large number of records
  – Many variables
  – Complex data structures not readily put into tabular form of cases by variables
  – Intricate patterns and dependencies that require complex models and methods of analysis
  – Data that do not conform to the simple assumptions made by many algorithms

Slide 5

Slide 5 text

EXAMPLE: POWER GRID DATA
» 2 TB data set of high-frequency power grid measurements at several locations on the grid
» Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
[Figure: frequency (59.998 to 60.003) vs. time in seconds (41 to 46)]

Slide 6

Slide 6 text

KEYS TO EFFECTIVE VISUALIZATION
» Flexibility
» Rapid and cheap development
» Scalability

Slide 7

Slide 7 text

FLEXIBILITY
» Visualizations must be easily tailored to the domain, data, or analysis context, which can readily change throughout the course of the analysis
» Any effective visual method that can be imagined should be able to be employed

Slide 8

Slide 8 text

RAPID AND CHEAP DEVELOPMENT
» Analysis process is iterative – repeated trial and error
» Visualization is usually the driver of the iteration – it is most effective at helping us realize we are doing something wrong or at giving us ideas of something new to try
» Spending a lot of time / effort / money on any single visualization in this process will slow or stop the iteration
» Most often this requires skill and knowledge using a high-level programming environment, although systems like Tableau, Lyra, and others are helping to make it easier to rapidly specify visualizations without programming

Slide 9

Slide 9 text

SCALABILITY
» We cannot rely on summary plots alone
» We need to be able to rapidly and flexibly look at big data in detail, at scale
[Figure: average y value for Sets 1 to 4 (summary plot), alongside scatterplots of y vs. x for each of Sets 1 to 4]
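
The point about summary plots can be illustrated with a short sketch. It assumes the four sets on the slide are Anscombe's quartet, which the slide does not name; the `anscombe` data frame used here ships with base R. The summary (mean of y) is nearly the same for every set, while the detailed plots differ sharply.

```r
# Anscombe's quartet is available in base R as the `anscombe` data frame
# (columns x1..x4 and y1..y4, one x/y pair per set).
# Assumption: this is the data behind the slide's "Set 1" to "Set 4".
data(anscombe)

# Summary view: the mean of y is essentially identical in all four sets.
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)
print(round(y_means, 2))

# Detailed view: one scatterplot per set shows how different the sets are.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```

A summary statistic alone would suggest the four sets are interchangeable; only the per-set panels reveal otherwise.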

Slide 10

Slide 10 text

Is there a solution that

Slide 11

Slide 11 text

TRELLIS DISPLAY
» Data are split into meaningful subsets, usually conditioning on variables of the dataset
» A visualization method is applied to each subset
» The image for each subset is called a “panel”
» Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
[Figures: seasonal component vs. time (1960 to 1990), one panel per month (Jan to Dec); average yearly deaths due to cancer per 100000, with panels rate.male and rate.female]

Slide 12

Slide 12 text

WHY TRELLIS IS EFFECTIVE
» Edward Tufte’s term for panels in Trellis Display is small multiples:
  – Once a viewer understands one panel, they have immediate access to the data in all other panels
  – Small multiples directly depict comparisons to reveal repetition and change, pattern and surprise
» Fisher barley data example
  – Average barley yields for 10 varieties at 6 sites across 2 years
  – A glaring error in the data went unnoticed for nearly 60 years
[Figure: barley yield (bushels/acre) by variety, one panel per site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca), grouped by year (1932, 1931)]
Sources: The Visual Display of Quantitative Information, Tufte; Visualizing Data, Cleveland

Slide 13

Slide 13 text

TRELLIS VS. A CUSTOM INTERFACE
» Often complexity in data (multiple variables, etc.) is handled by making a custom user interface to navigate the complexities
» Example: choosing model parameters for smoothing data
[Figure: scatterplot of y (-2 to 4) vs. x (0 to 8)]

Slide 14

Slide 14 text

TRELLIS VS. A CUSTOM INTERFACE
» Smoothing parameters:
  – Degree of smoothing polynomial (0, 1, 2)
  – Span of smoothing window (from 0 to 1)
» Goal: choose parameters that yield the smoothest curve possible that still follows the pattern in the data
Demo

Slide 15

Slide 15 text

TRELLIS VS. A CUSTOM INTERFACE
» A very simple alternative: Trellis display across a range of parameter settings
[Figure: y vs. x, one panel per combination of span (0.01, 0.05, 0.1, 0.2, 0.35, 0.5, 0.75) and degree (0, 1, 2)]
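
A display of this kind can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data are simulated (the slide's data are not available), the smoother is assumed to be `loess()` from base R, and the span grid omits the slide's smallest value (0.01) to keep every local fit well determined on this small sample.

```r
library(lattice)  # ships with standard R installations

# Simulated data, made up for illustration: a smooth signal plus noise.
set.seed(1)
dat <- data.frame(x = sort(runif(300, 0, 8)))
dat$y <- 1 + sin(dat$x) + rnorm(300, sd = 0.5)

# Every combination of the two smoothing parameters.
grid <- expand.grid(span = c(0.05, 0.1, 0.2, 0.35, 0.5, 0.75), degree = 0:2)

# Fit one loess model per combination and stack the fitted curves.
fits <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  fit <- loess(y ~ x, data = dat, span = grid$span[i], degree = grid$degree[i])
  data.frame(dat, fitted = fitted(fit),
             span = grid$span[i], degree = grid$degree[i])
}))
fits$span <- factor(fits$span)
fits$degree <- factor(fits$degree)

# One panel per (span, degree) pair: raw points plus the fitted curve.
p <- xyplot(y ~ x | span * degree, data = fits, fitted = fits$fitted,
            panel = function(x, y, subscripts, fitted, ...) {
              panel.xyplot(x, y, col = "grey60", ...)
              panel.lines(x, fitted[subscripts], col = "black")
            },
            strip = strip.custom(strip.names = c(TRUE, TRUE)),
            layout = c(6, 3))
print(p)
```

All 18 candidate fits appear side by side, so the choice can be made by scanning the grid rather than by adjusting controls and remembering earlier settings.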

Slide 16

Slide 16 text

TRELLIS VS. A CUSTOM INTERFACE
» Hypothesis: choosing the most appropriate smoothing model is easier with the Trellis display
  – We can see multiple models at once – we don’t have to try to remember what we’ve seen with other parameter choices
  – Would be interesting to validate this with an experiment
» If you are skeptical about this claim:
  – It’s still hard to argue that it would be any more difficult
  – It is most likely much faster to make a reasonable choice from the Trellis display (consider a timed experiment)
  – Creating the Trellis display is much easier and much faster than creating the interface

Slide 17

Slide 17 text

WHY TRELLIS IS EFFECTIVE
» Trellis display is flexible
  – Data can be broken up in many ways, facilitating many different views of the data, the ability to visualize higher dimensions, etc.
  – You can plot anything you want inside a panel
  – When using a programming environment like R, a large collection of plotting methods is at your disposal
» Trellis displays can be developed rapidly
  – As discussed in the previous example
  – Displays can be specified through a simple set of commands
» But does Trellis display scale?

Slide 18

Slide 18 text

SCALING TRELLIS
» Big data lends itself nicely to the idea of small multiples
  – Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
  – It is natural to break the data up based on these dimensions and plot it
» But this means Trellis displays with potentially thousands or millions of panels
» We can create millions of plots, but we will never be able to (or want to) view all of them!

Slide 19

Slide 19 text

SCALING TRELLIS
» Pioneering statistician John Tukey recognized this problem decades ago
  “As multiple-aspect data continues to grow…, the ability of human eyes to scan the reasonable displays soon runs out”
» He put forth the idea of computing diagnostic quantities for each panel that judge the relative interest or importance of viewing a panel
  “It seems natural to call such computer guiding diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays.”

Slide 20

Slide 20 text

SCALING TRELLIS
» To scale, we can apply the same steps as in Trellis display, with one extra step:
  – Data are split into meaningful subsets, usually conditioning on variables of the dataset
  – A visualization method is applied to each subset
  – A set of cognostic metrics is computed for each subset
  – Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics
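
The four steps can be sketched with base R and lattice alone. This is a minimal illustration, not the Trelliscope implementation: it uses the `barley` data that ships with lattice, and the names `by_site`, `panels`, and `cogs` are made up for this sketch.

```r
library(lattice)  # provides dotplot() and the barley data

# 1. Split the data into subsets, conditioning on site.
by_site <- split(barley, barley$site)

# 2. Apply a visualization method to each subset (one plot object per panel).
panels <- lapply(by_site, function(d) {
  dotplot(variety ~ yield, data = d, groups = year,
          xlab = "Barley Yield (bushels/acre)")
})

# 3. Compute a set of cognostics for each subset.
cogs <- data.frame(
  site = names(by_site),
  mean_yield = sapply(by_site, function(d) mean(d$yield)),
  row.names = NULL
)

# 4. Arrange the panels through the cognostics, here ordered by mean yield.
ord <- order(cogs$mean_yield)
print(cogs[ord, ])
for (i in ord) print(panels[[i]])
```

With six subsets the ordering is a convenience; with thousands of subsets, the cognostics table becomes the only practical way to decide which panels to look at.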

Slide 21

Slide 21 text

SIMPLE EXAMPLE
» Barley data
  – Variety vs. yield plotted for each of 6 sites
  – Small data set but illustrates the principle
» Suppose we compute the following cognostics for each site:
  – Average yield
  – Site name
[Figure: barley yield (bushels/acre) by variety, one panel per site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca), grouped by year (1932, 1931)]

Slide 22

Slide 22 text

SORT BY AVERAGE YIELD

Slide 23

Slide 23 text

SORT BY SITE NAME

Slide 24

Slide 24 text

FILTER ON MEAN YIELD < 30
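
The three interactions named on these slides (sort by average yield, sort by site name, filter on mean yield < 30) reduce to ordinary operations on a table of cognostics. A minimal sketch in base R, assuming the `barley` data from lattice; the column name `mean_yield` is made up for this example.

```r
library(lattice)  # provides the barley data

# Cognostics table: one row per site with its mean yield.
cogs <- aggregate(yield ~ site, data = barley, FUN = mean)
names(cogs)[2] <- "mean_yield"

# Sort by average yield.
by_yield <- cogs[order(cogs$mean_yield), ]

# Sort by site name (alphabetical).
by_name <- cogs[order(as.character(cogs$site)), ]

# Filter on mean yield < 30.
low_yield <- cogs[cogs$mean_yield < 30, ]

print(by_yield)
print(by_name)
print(low_yield)
```

Each result determines which panels are shown and in what order; the panels themselves do not need to be recomputed.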

Slide 25

Slide 25 text

FUN WITH COGNOSTICS
» We can get creative with cognostics, even drawing from other data sets if available
  – Mean difference between 1932 and 1931 yields
  – Lat/long coordinates of the sites
  – Average temperature at each of the sites
  – Model coefficients
  – Anomaly scores
  – etc.
» Often arriving at a set of useful cognostics is an iterative process
» Often domain expertise can provide meaningful cognostics

Slide 26

Slide 26 text

TRELLISCOPE
An implementation of scalable Trellis display with cognostics

Slide 27

Slide 27 text

TESSERA
» Trelliscope is part of a larger project, Tessera
» Tessera is a general environment for deep analysis of large complex data
» Tessera provides scalable access to 1000s of analytic methods of statistics, machine learning, and visualization available in the R environment
» Trelliscope is the visualization component of Tessera
» (recall the importance of visualization and analysis living together)
» More information at http://tessera.io

Slide 28

Slide 28 text

BACK END AGNOSTIC
Interface stays the same regardless of back end

Slide 29

Slide 29 text

CREATING TRELLISCOPE DISPLAYS
» Trelliscope displays are created by specifying a way to divide the data, a panel function, and (optionally) a cognostics function
Trellis code:
    library(lattice)  # provides dotplot() and the barley data
    dotplot(variety ~ yield | site, data = barley, groups = year,
      auto.key = list(space = "top", columns = 2),
      xlab = "Barley Yield (bushels/acre)",
      aspect = 0.5, layout = c(3, 2), ylab = NULL)

Slide 30

Slide 30 text

CREATING TRELLISCOPE DISPLAYS
» Trelliscope displays are created by specifying a way to divide the data, a panel function, and (optionally) a cognostics function
Trelliscope code:
    barley %>%
      qtrellis(by = "site",
        panel = function(x)
          dotplot(variety ~ yield, data = x, groups = year,
            auto.key = list(space = "top", columns = 2),
            xlab = "Barley Yield (bushels/acre)"),
        cog = function(x) list(
          meanYield = cog(mean(x$yield)),
          meanYieldDiff = cog(mean(x$yield * ifelse(x$year == "1931", -1, 1)))),
        lims = list(x = "same"),
        width = 300, height = 200,
        layout = c(2, 3))

Slide 31

Slide 31 text

TRELLISCOPE VIEWER
» A web-based viewer of Trelliscope displays allows the user to interact with panels based on cognostics
  – Layout (rows, columns), paging
  – Sorting, filtering on cognostics
  – Univariate and multivariate visual range filters
Demo

Slide 32

Slide 32 text

A (MUCH) BIGGER EXAMPLE
» High frequency financial trade data
  – 25ms resolution (trade price, volume, etc.)
  – Hundreds of gigabytes stored on Hadoop
  – Use Tessera to partition data by stock symbol and day
  – Use Tessera to aggregate within symbol / day by second
  – Over 900K subsets
  – Use Trelliscope to visualize price vs. time
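
The partition and aggregate steps can be sketched on a toy scale in base R. This is not the Tessera API and does not touch Hadoop; the trades are simulated, and the symbols ("AAA", "BBB"), day labels, and column names are all made up for illustration.

```r
# Simulated trades, made up for illustration: two symbols, two days,
# sub-second timestamps within a single minute.
set.seed(42)
n <- 2000
trades <- data.frame(
  symbol = sample(c("AAA", "BBB"), n, replace = TRUE),
  day    = sample(c("day1", "day2"), n, replace = TRUE),
  time   = runif(n, 0, 60),                 # seconds, fractional
  price  = round(runif(n, 99, 101), 2),
  volume = sample(c(100, 200, 500), n, replace = TRUE)
)

# Partition by symbol and day.
subsets <- split(trades, list(trades$symbol, trades$day), drop = TRUE)

# Aggregate within each symbol / day subset to one-second resolution:
# mean price and total volume per second.
by_second <- lapply(subsets, function(d) {
  sec <- floor(d$time)
  data.frame(
    second = sort(unique(sec)),
    price  = as.vector(tapply(d$price, sec, mean)),
    volume = as.vector(tapply(d$volume, sec, sum))
  )
})

# Panel function: price vs. time for one subset.
panel <- function(d) {
  plot(d$second, d$price, type = "l",
       xlab = "Time (seconds)", ylab = "Price")
}
panel(by_second[[1]])
```

The same three pieces (a partitioning rule, a per-subset aggregation, and a panel function) are what would be scaled out across the full set of symbol / day subsets.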

Slide 33

Slide 33 text

DEMO
High Frequency Trade Data

Slide 34

Slide 34 text

BENEFITS OF TRELLISCOPE
» Creating a Trelliscope display is much less expensive and often just as effective as writing a custom application
» No technical expertise is needed to view a Trelliscope display
» Provides a user friendly way to bring domain experts into the iterative analysis process – visualization is the best medium for communication
» Provides the ability to flexibly look at the data in detail, even when it’s big
» Visual and analytic methods coexist

Slide 35

Slide 35 text

A SIMPLE, SCALABLE INTERACTIVITY PARADIGM
» Instead of writing a custom interactive user interface for each big data problem that comes along, the idea of Trelliscope is to parametrize the desired interactivity into the data partitioning and specification of cognostics
» This interactive paradigm has many advantages
  – The UI is always the same – once you learn it, you don’t need to learn a new interface for each new data set / visualization app
  – Facilitates simultaneous comparisons against different views of the data (as opposed to adjusting knobs and not remembering what you saw under different settings, etc.)
  – Preserves state, provenance, etc. in a standard way – a very important aspect of interactive visualization