A Simple Scalable Visualization Approach for Large Complex Data

hafen
June 09, 2015


Transcript

Slide 2: CONTEXT: DEEP ANALYSIS OF LARGE COMPLEX DATA

» Goals of analysis:
  – Uncover interesting or previously unknown behavior
  – Develop new insights
  – Identify deviations from expected behavior
  – Confirm or reject hypotheses or suspicions
» Visualization is critical in this process
Slide 3: DEEP ANALYSIS OF LARGE COMPLEX DATA

» Data most often do not come with a manual for what to do
» If we already (think we) know the algorithm or model to apply and simply apply it to the data, we are not doing analysis, we are processing
» Deep analysis means detailed, comprehensive analysis that does not lose important information in the data
» It means learning from the data, not forcing our preconceptions onto the data
» It means being willing and able to use any of the thousands of statistical, machine learning, and visualization methods, as dictated by the data
» It means trial and error: an iterative process of hypothesizing, fitting, validating, and learning
Slide 4: DEEP ANALYSIS OF LARGE COMPLEX DATA

» Large complex data means any or all of the following:
  – Large number of records
  – Many variables
  – Complex data structures not readily put into the tabular form of cases by variables
  – Intricate patterns and dependencies that require complex models and methods of analysis
  – Data that do not conform to the simple assumptions made by many algorithms
Slide 5: EXAMPLE: POWER GRID DATA

» A 2 TB data set of high-frequency power grid measurements at several locations on the grid
» Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)

[Figure: scatterplot of frequency (Hz) vs. time (seconds) showing bad-data artifacts around 60 Hz]
Slide 7: FLEXIBILITY

» Visualizations must be easily tailored to the domain, data, or analysis context, which can readily change throughout the course of the analysis
» Any effective visual method that can be imagined should be able to be employed
Slide 8: RAPID AND CHEAP DEVELOPMENT

» The analysis process is iterative – repeated trial and error
» Visualization is usually the driver of the iteration – it is most effective at helping us realize we are doing something wrong, or at giving us ideas for something new to try
» Spending a lot of time, effort, or money on any single visualization in this process will slow or stop the iteration
» Most often this requires skill and knowledge in a high-level programming environment, although systems like Tableau, Lyra, and others are helping to make it easier to rapidly specify visualizations without programming
Slide 9: SCALABILITY

» We cannot rely on summary plots alone
» We need to be able to rapidly and flexibly look at big data in detail, at scale

[Figure: four data sets (Set 1–4) with nearly identical average y values but very different x–y patterns]
Slide 11: TRELLIS DISPLAY

» Data are split into meaningful subsets, usually by conditioning on variables of the dataset
» A visualization method is applied to each subset
» The image for each subset is called a “panel”
» Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis

[Figures: Trellis display of a monthly seasonal component over time (Jan–Dec panels); Trellis display of average yearly cancer deaths per 100,000, rate.male vs. rate.female]
Slide 12: WHY TRELLIS IS EFFECTIVE

» Edward Tufte’s term for the panels in a Trellis display is small multiples:
  – Once a viewer understands one panel, they have immediate access to the data in all other panels
  – Small multiples directly depict comparisons to reveal repetition and change, pattern and surprise
» Fisher barley data example:
  – Average barley yields for 10 varieties at 6 sites across 2 years
  – A glaring error in the data went unnoticed for nearly 60 years

[Figure: Trellis display of barley yield (bushels/acre) by variety for six sites, 1931 vs. 1932]
(The Visual Display of Quantitative Information, Tufte; Visualizing Data, Cleveland)
Slide 13: TRELLIS VS. A CUSTOM INTERFACE

» Often complexity in data (multiple variables, etc.) is handled by building a custom user interface to navigate the complexities
» Example: choosing model parameters for smoothing data

[Figure: scatterplot of y vs. x to be smoothed]
Slide 14: TRELLIS VS. A CUSTOM INTERFACE

» Smoothing parameters:
  – Degree of the smoothing polynomial (0, 1, 2)
  – Span of the smoothing window (from 0 to 1)
» Goal: choose parameters that yield the smoothest curve possible that still follows the pattern in the data

Demo
Slide 15: TRELLIS VS. A CUSTOM INTERFACE

» A very simple alternative: a Trellis display across a range of parameter settings

[Figure: Trellis grid of smooths, one panel per setting, spans 0.01–0.75 crossed with degrees 0–2]
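The idea of laying out one panel per parameter setting can be sketched in a few lines. This is a minimal, stdlib-only Python illustration: a simple moving average stands in for the loess smoother on the slide, and the data and span values are made up for the example.

```python
# Sketch: evaluate a smoother over a grid of span settings, one result
# per Trellis panel. A moving average stands in for loess; data and
# span values are illustrative.

def moving_average(y, span):
    """Smooth y with a centered window covering roughly `span` of the series."""
    n = len(y)
    half = max(1, int(span * n) // 2)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = y[lo:hi]
        out.append(sum(window) / len(window))
    return out

y = [0, 1, 4, 2, 5, 3, 6, 4, 7, 5]           # toy data
spans = [0.1, 0.35, 0.75]                     # one panel per setting
panels = {span: moving_average(y, span) for span in spans}
# Plotting panels side by side lets us compare all settings at once,
# rather than adjusting a slider and relying on memory.
```

The larger the span, the flatter the curve, so scanning the panels from left to right shows the smoothness/fidelity trade-off directly.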
Slide 16: TRELLIS VS. A CUSTOM INTERFACE

» Hypothesis: choosing the most appropriate smoothing model is easier with the Trellis display
  – We can see multiple models at once – we don’t have to try to remember what we’ve seen with other parameter choices
  – It would be interesting to validate this with an experiment
» If you are skeptical about this claim:
  – It’s still hard to argue that it would be any more difficult
  – It is most likely much faster to make a reasonable choice from the Trellis display (consider a timed experiment)
  – Creating the Trellis display is much easier and much faster than creating the interface
Slide 17: WHY TRELLIS IS EFFECTIVE

» Trellis display is flexible
  – Data can be broken up in many ways, facilitating many different views of the data, the ability to visualize higher dimensions, etc.
  – You can plot anything you want inside a panel
  – When using a programming environment like R, a large collection of plotting methods is at your disposal
» Trellis displays can be developed rapidly
  – As discussed in the previous example
  – Displays can be specified through a simple set of commands
» But does Trellis display scale?
Slide 18: SCALING TRELLIS

» Big data lends itself nicely to the idea of small multiples
  – Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
  – It is natural to break the data up along these dimensions and plot it
» But this means Trellis displays with potentially thousands or millions of panels
» We can create millions of plots, but we will never be able to (or want to) view all of them!
Slide 19: SCALING TRELLIS

» Pioneering statistician John Tukey recognized this problem decades ago:
  “As multiple-aspect data continues to grow…, the ability of human eyes to scan the reasonable displays soon runs out”
» He put forth the idea of computing diagnostic quantities for each panel that judge the relative interest or importance of viewing that panel:
  “It seems natural to call such computer guiding diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays.”
Slide 20: SCALING TRELLIS

» To scale, we apply the same steps as in Trellis display, with one extra step:
  – Data are split into meaningful subsets, usually by conditioning on variables of the dataset
  – A visualization method is applied to each subset
  – A set of cognostic metrics is computed for each subset
  – Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement specified through interactions with the cognostics
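The steps above can be sketched in a few lines of plain Python. The records, field names, and yield values below are illustrative, and the panel-drawing step is omitted; the point is that each subset gets a cognostic, and the cognostics then drive which panels are shown first.

```python
# Sketch of the scaled-Trellis steps: split, panel, cognostics, arrange.
# Records, field names, and values are illustrative assumptions.
from collections import defaultdict

records = [
    {"site": "Waseca", "yield": 48.9}, {"site": "Waseca", "yield": 55.2},
    {"site": "Duluth", "yield": 28.1}, {"site": "Duluth", "yield": 30.4},
    {"site": "Morris", "yield": 40.0}, {"site": "Morris", "yield": 38.5},
]

# 1. Split into meaningful subsets by a conditioning variable.
subsets = defaultdict(list)
for r in records:
    subsets[r["site"]].append(r)

# 2. A panel function would be applied to each subset here (plot omitted).

# 3. Compute a cognostic for each subset, e.g. mean yield.
cognostics = {site: sum(r["yield"] for r in rows) / len(rows)
              for site, rows in subsets.items()}

# 4. Arrange panels by interacting with the cognostics,
#    e.g. view sites sorted by mean yield, highest first.
panel_order = sorted(cognostics, key=cognostics.get, reverse=True)
```

With millions of subsets, only the sort/filter result is ever rendered, which is what makes the display navigable at scale.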
Slide 21: SIMPLE EXAMPLE

» Barley data
  – Variety vs. yield plotted for each of 6 sites
  – A small data set, but it illustrates the principle
» Suppose we compute the following cognostics for each site:
  – Average yield
  – Site name

[Figure: Trellis display of barley yield (bushels/acre) by variety for six sites, 1931 vs. 1932]
Slide 25: FUN WITH COGNOSTICS

» We can get creative with cognostics, even drawing from other data sets if available:
  – Mean difference between 1932 and 1931 yields
  – Latitude/longitude coordinates of the sites
  – Average temperature at each of the sites
  – Model coefficients
  – Anomaly scores
  – etc.
» Arriving at a set of useful cognostics is often an iterative process
» Domain expertise can often provide meaningful cognostics
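The first item in that list, the mean 1932-minus-1931 yield difference, is a cognostic tailor-made for surfacing the barley anomaly. A stdlib-only Python sketch, with made-up records and field names, might look like:

```python
# Sketch: a "mean yield difference" cognostic per site (1932 minus 1931).
# Records, field names, and values are illustrative assumptions.
from collections import defaultdict

records = [
    {"site": "Morris", "year": 1931, "yield": 29.3},
    {"site": "Morris", "year": 1932, "yield": 41.5},
    {"site": "Waseca", "year": 1931, "yield": 57.0},
    {"site": "Waseca", "year": 1932, "yield": 48.2},
]

def mean_yield_diff(rows):
    """Mean 1932 yield minus mean 1931 yield for one site's records."""
    by_year = defaultdict(list)
    for r in rows:
        by_year[r["year"]].append(r["yield"])
    mean = lambda xs: sum(xs) / len(xs)
    return mean(by_year[1932]) - mean(by_year[1931])

sites = defaultdict(list)
for r in records:
    sites[r["site"]].append(r)
cogs = {site: mean_yield_diff(rows) for site, rows in sites.items()}
# Sorting panels by this cognostic floats anomalous sites (large
# positive differences, where 1932 unexpectedly beat 1931) to the top.
```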
Slide 27: TESSERA

» Trelliscope is part of a larger project, Tessera
» Tessera is a general environment for deep analysis of large complex data
» Tessera provides scalable access to thousands of analytic methods of statistics, machine learning, and visualization available in the R environment
» Trelliscope is the visualization component of Tessera (recall the importance of visualization and analysis living together)
» More information at http://tessera.io
Slide 29: CREATING TRELLISCOPE DISPLAYS

» Trelliscope displays are created by specifying a way to divide the data, a panel function, and (optionally) a cognostics function

Trellis code:

dotplot(variety ~ yield | site, data = barley,
        groups = year,
        auto.key = list(space = "top", columns = 2),
        xlab = "Barley Yield (bushels/acre)",
        aspect = 0.5, layout = c(3, 2), ylab = NULL)
Slide 30: CREATING TRELLISCOPE DISPLAYS

» Trelliscope displays are created by specifying a way to divide the data, a panel function, and (optionally) a cognostics function

Trelliscope code:

barley %>%
  qtrellis(by = "site",
    panel = function(x)
      dotplot(variety ~ yield, data = x,
              groups = year,
              auto.key = list(space = "top", columns = 2),
              xlab = "Barley Yield (bushels/acre)"),
    cog = function(x) list(
      meanYield = cog(mean(x$yield)),
      meanYieldDiff = cog(mean(x$yield * ifelse(x$year == "1931", -1, 1)))),
    lims = list(x = "same"),
    width = 300, height = 200,
    layout = c(2, 3))
Slide 31: TRELLISCOPE VIEWER

» A web-based viewer of Trelliscope displays allows the user to interact with panels based on cognostics:
  – Layout (rows, columns) and paging
  – Sorting and filtering on cognostics
  – Univariate and multivariate visual range filters

Demo
Slide 32: A (MUCH) BIGGER EXAMPLE

» High-frequency financial trade data
  – 25 ms resolution (trade price, volume, etc.)
  – Hundreds of gigabytes stored on Hadoop
  – Use Tessera to partition the data by stock symbol and day
  – Use Tessera to aggregate within symbol / day by second
  – Over 900K subsets
  – Use Trelliscope to visualize price vs. time
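The partition-then-aggregate steps can be sketched without Hadoop. This stdlib-only Python fragment shows the shape of the computation on a handful of toy trades; the symbols, field names, and prices are illustrative, and in Tessera the same logic would run distributed over the 900K+ subsets.

```python
# Sketch: partition trade records by (symbol, day), then aggregate each
# partition to a per-second mean price. Fields and values are illustrative.
from collections import defaultdict

trades = [
    {"symbol": "AAPL", "day": "2014-01-02", "time": 34200.025, "price": 79.10},
    {"symbol": "AAPL", "day": "2014-01-02", "time": 34200.450, "price": 79.12},
    {"symbol": "AAPL", "day": "2014-01-02", "time": 34201.100, "price": 79.08},
    {"symbol": "MSFT", "day": "2014-01-02", "time": 34200.300, "price": 37.20},
]

# Partition by (symbol, day); within each partition, bucket prices by
# whole second (time is seconds since midnight in this toy example).
partitions = defaultdict(lambda: defaultdict(list))
for t in trades:
    partitions[(t["symbol"], t["day"])][int(t["time"])].append(t["price"])

# Aggregate: mean price within each second of each partition. Each
# (symbol, day) entry becomes one Trelliscope panel of price vs. time.
aggregated = {
    key: {sec: sum(ps) / len(ps) for sec, ps in buckets.items()}
    for key, buckets in partitions.items()
}
```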
Slide 34: BENEFITS OF TRELLISCOPE

» Creating a Trelliscope display is much less expensive than, and often just as effective as, writing a custom application
» No technical expertise is needed to view a Trelliscope display
» Provides a user-friendly way to bring domain experts into the iterative analysis process – visualization is the best medium for communication
» Provides the ability to flexibly look at the data in detail, even when it’s big
» Visual and analytic methods coexist
Slide 35: A SIMPLE, SCALABLE INTERACTIVITY PARADIGM

» Instead of writing a custom interactive user interface for each big data problem that comes along, the idea of Trelliscope is to parametrize the desired interactivity into the data partitioning and the specification of cognostics
» This interaction paradigm has many advantages:
  – The UI is always the same – once you learn it, you don’t need to learn a new interface for each new data set / visualization app
  – It facilitates simultaneous comparisons across different views of the data (as opposed to adjusting knobs and not remembering what you saw under different settings)
  – It preserves state, provenance, etc. in a standard way – a very important aspect of interactive visualization