Tessera - Open source environment for deep analysis of large complex data

Tessera

Joint work with: Bill Cleveland (Purdue), Saptarshi Guha (Mozilla), and many other researchers at PNNL and Purdue

»  Motivation
»  Statistics: Divide and Recombine
»  Computational Environment: Tessera

DEEP ANALYSIS OF LARGE COMPLEX DATA

Example: Power Grid
»  2 TB data set of high-frequency power grid measurements at several locations on the grid
»  Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
[Figure: frequency versus time (seconds)]

“Restricting one's self to planned analysis - failing to accompany
it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey

“If [you have no] concern about error bars, about heterogeneity,
about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan

Systems that do not work

What we want to be able to do:
»  Work in a familiar high-level statistical programming environment
»  Have access to the 1000s of statistical, ML, and vis methods
»  Minimize time thinking about code or distributed systems
»  Maximize time thinking about the data
»  Be able to analyze large complex data with nearly as much flexibility and ease as small data

DIVIDE AND RECOMBINE

Divide and Recombine (D&R)
»  Simple idea:
   –  Specify meaningful, persistent divisions of the data
   –  Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
   –  Results are recombined to yield a statistically valid D&R result for the analytic method
»  D&R is not the same as MapReduce (but makes heavy use of it)
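The three steps can be sketched in a few lines. This is a minimal illustration in plain Python, not the datadr API: the function names `divide`, `apply_method`, and `recombine`, the toy records, and the choice of the mean as both analytic method and recombination are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def divide(records, key_fn):
    """Group records into subsets keyed by key_fn(record)."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key_fn(rec)].append(rec)
    return dict(subsets)

def apply_method(subsets, method):
    """Apply the same analytic method to every subset independently."""
    return {key: method(subset) for key, subset in subsets.items()}

def recombine(outputs, combine):
    """Combine the per-subset outputs into a single result."""
    return combine(list(outputs.values()))

# Hypothetical toy data: two sites with two measurements each.
records = [
    {"site": "A", "value": 1.0},
    {"site": "A", "value": 3.0},
    {"site": "B", "value": 10.0},
    {"site": "B", "value": 14.0},
]

subsets = divide(records, key_fn=lambda r: r["site"])
outputs = apply_method(subsets, lambda rows: mean(r["value"] for r in rows))
result = recombine(outputs, mean)
```

Because each call to the method touches only one subset, the middle step is the part that can run in parallel on a distributed back end.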

[Diagram: the data are divided into subsets; one analytic method of an analysis thread is applied to each subset to produce outputs; the outputs are recombined by statistic recombination (a result), analytic recombination (new data for an analysis sub-thread), or visualization recombination (visual displays)]

How to Divide the Data?
»  It depends!
»  Random replicate division
   –  Randomly partition the data
»  Conditioning variable division
   –  Very often data are “embarrassingly divisible”
   –  Break up the data based on the subject matter
   –  Example:
      •  25 years of 90 daily financial variables for 100 banks in the U.S.
      •  Divide the data by bank
      •  Divide the data by year
      •  Divide the data by geography
   –  This is the major division method used in our own analyses
   –  Has already been widely used in statistics, machine learning, and visualization for datasets of all sizes
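The two division methods differ only in how a row is assigned to a subset. The sketch below is plain Python with hypothetical helper names and a made-up bank/year table; it is not how datadr implements division.

```python
import random
from collections import defaultdict

def conditioning_division(rows, variable):
    """One subset per distinct value of a conditioning variable."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[variable]].append(row)
    return dict(subsets)

def random_replicate_division(rows, n_subsets, seed=0):
    """Shuffle the rows, then deal them into n_subsets roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

# Hypothetical data: 3 banks observed over 4 years.
rows = [{"bank": f"bank{b}", "year": 1990 + y, "x": b * 10 + y}
        for b in range(3) for y in range(4)]

by_bank = conditioning_division(rows, "bank")   # subject-matter division
by_year = conditioning_division(rows, "year")   # a different subject-matter division
replicates = random_replicate_division(rows, n_subsets=4)
```

The same data can be divided several ways; which division is persisted depends on the questions being asked of it.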

Analytic Recombination
»  Analytic recombination begins with applying an analytic method independently to each subset
   –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
»  For conditioning-variable division:
   –  Typically the recombination depends mostly on the subject matter
   –  Example:
      •  Subsets each have the same model with parameters (e.g. a linear model)
      •  Parameters are modeled as stochastic too: independent draws from a distribution
      •  Recombination: analysis to build a statistical model for the parameters using the subset estimated coefficients

Conditioning Division Example:

Analytic Recombination
»  Analytic recombination begins with applying an analytic method independently to each subset
   –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
»  For random replicate division:
   –  Observations are seen as exchangeable, with no conditioning variables considered
   –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division

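Analytic Recombination with Random Division – The Naïve Approach:

Linear model:
$$Y = X\beta + \epsilon$$

Partition into random subsets:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_r \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_r \end{bmatrix} \beta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_r \end{bmatrix}$$

Entire-data least squares estimate:
$$\hat{\beta} = \left( \sum_{s=1}^{r} X_s' X_s \right)^{-1} \sum_{s=1}^{r} X_s' Y_s$$

D&R approximation:
$$\ddot{\beta} = \frac{1}{r} \sum_{s=1}^{r} (X_s' X_s)^{-1} X_s' Y_s$$

Under certain conditions, we can show:
$$\ddot{\beta} \approx \hat{\beta}$$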
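A small numerical check makes the approximation concrete. The sketch below is plain Python under stated assumptions: a simulated simple regression with an intercept and one covariate, six subsets formed by striding over i.i.d. simulated rows, and hand-written 2x2 helpers (`normal_equations`, `solve_2x2`) that are hypothetical names introduced here. It computes both the entire-data estimate and the naive D&R average of the subset estimates.

```python
import random

def normal_equations(xs, ys):
    """Return X'X (2x2) and X'Y (length 2) for the design matrix [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return [[n, sx], [sx, sxx]], [sy, sxy]

def solve_2x2(a, b):
    """Solve the 2x2 linear system a * beta = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

# Simulated data (assumption): y = 2 + 0.5 x + noise.
rng = random.Random(1)
n, r = 6000, 6
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# The rows are i.i.d., so striding over them gives random replicate subsets.
subsets = [(xs[s::r], ys[s::r]) for s in range(r)]
parts = [normal_equations(sub_x, sub_y) for sub_x, sub_y in subsets]

# Entire-data estimate: sum X_s'X_s and X_s'Y_s over subsets, then solve once.
xtx = [[sum(p[0][i][j] for p in parts) for j in range(2)] for i in range(2)]
xty = [sum(p[1][i] for p in parts) for i in range(2)]
beta_hat = solve_2x2(xtx, xty)

# D&R approximation: solve within each subset, then average the estimates.
subset_fits = [solve_2x2(a, b) for a, b in parts]
beta_ddot = [sum(fit[i] for fit in subset_fits) / r for i in range(2)]
```

On data like these the two estimates differ by far less than their sampling error, and each subset fit needs only a single pass over its own rows.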

Note
»  We can do this for GLMs, general factor-response models, etc.
»  To run this, we only need R’s lm() function
»  Computation is embarrassingly parallel
»  We can (and want to) do it in one pass through the data
»  But we can do better in terms of accuracy

Scatter Matrix Stability Weighting
»  Compute a measure of concordance between the scatter matrix of individual blocks, $X_s^T X_s$, and the overall scatter matrix, $X^T X$
»  Use this measure to weight the averaging to obtain the final estimate
»  Requires two passes: one to get the overall scatter matrix and one to compare blocks to the overall
»  Try to avoid iteration, as data is most often too large to fit in memory and disk IO is slow

Subset Likelihood Modeling
»  Suppose we have a log likelihood for a hypothesized model to fit with n independent observations, $x_i$:
$$\ell(\theta) = \log \prod_{i=1}^{n} f(x_i \mid \theta)$$
»  Break the data into r subsets:
$$\ell(\theta) = \sum_{s=1}^{r} \ell_s(\theta)$$
»  Fit subset likelihoods with a parametric model (e.g. quadratic)
»  Recombine by summing the fitted subset likelihood models to get a fitted all-data likelihood model
»  Approach the problem as building a model for the likelihood
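As a sketch of the idea, assume a N(θ, 1) model, whose log-likelihood is exactly quadratic in θ, so a parabola through three evaluation points recovers each subset likelihood without error. For other models the quadratic is only an approximation and the choice of evaluation points matters. The helper names and the evaluation points 2, 3, 4 are assumptions for illustration.

```python
import random

def log_lik(theta, xs):
    """Log-likelihood of N(theta, 1) data, up to an additive constant."""
    return -0.5 * sum((x - theta) ** 2 for x in xs)

def fit_quadratic(f, t0, t1, t2):
    """Coefficients (a, b, c) of the parabola a*t^2 + b*t + c through three points of f."""
    y0, y1, y2 = f(t0), f(t1), f(t2)
    d01 = (y1 - y0) / (t1 - t0)          # first divided differences
    d12 = (y2 - y1) / (t2 - t1)
    a = (d12 - d01) / (t2 - t0)          # second divided difference
    b = d01 - a * (t0 + t1)
    c = y0 - a * t0 ** 2 - b * t0
    return a, b, c

rng = random.Random(7)
data = [rng.gauss(3.0, 1.0) for _ in range(900)]
r = 9
subsets = [data[s::r] for s in range(r)]

# Fit a quadratic model to each subset log-likelihood.
fits = [fit_quadratic(lambda t, xs=xs: log_lik(t, xs), 2.0, 3.0, 4.0)
        for xs in subsets]

# Recombine by summing the fitted models; coefficients of a sum of parabolas add.
a = sum(f[0] for f in fits)
b = sum(f[1] for f in fits)

theta_dr = -b / (2 * a)                  # maximizer of the summed quadratic
theta_full = sum(data) / len(data)       # all-data MLE for this model
```

Only three numbers per subset travel to the recombination step, whatever the subset size.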

Bag of Little Bootstraps (BLB)
»  Divide the data $X_1, \ldots, X_n$ into $s$ random subsets $\check{X}^{(j)}_1, \ldots, \check{X}^{(j)}_{b(n)}$, $j = 1, \ldots, s$
»  Resample* each subset $r$ times, $X^{*(k)}_1, \ldots, X^{*(k)}_n$, and compute the estimate $\hat{\theta}^{*(k)}_n$ for each resample, $k = 1, \ldots, r$
»  Compute the bootstrap metric for each subset: $\xi(\hat{\theta}^{*(1)}_n, \ldots, \hat{\theta}^{*(r)}_n) = \xi^*_j$
»  Recombine: $\mathrm{avg}(\xi^*_1, \ldots, \xi^*_s)$
* with n replicates
Image from “Bootstrapping Big Data”, Ariel Kleiner et al.
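The procedure can be sketched for one concrete case: the standard error of a sample mean. The function name, the subset size of 200, the 50 resamples, and the simulated data are all assumptions chosen to keep the example small. Each resample draws n points from a b-point subset and is stored as counts, so it never materializes n values.

```python
import random
from collections import Counter
from statistics import mean, stdev

def blb_standard_error(data, n_subsets, subset_size, n_resamples, seed=0):
    """Bag of Little Bootstraps estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    shuffled = list(data)
    rng.shuffle(shuffled)
    per_subset = []
    for j in range(n_subsets):
        # Divide: disjoint random subsets of size b.
        subset = shuffled[j * subset_size:(j + 1) * subset_size]
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the b-point subset, kept as multiplicities.
            counts = Counter(rng.choices(range(subset_size), k=n))
            estimates.append(sum(subset[i] * c for i, c in counts.items()) / n)
        # Bootstrap metric for this subset: spread of the resampled estimates.
        per_subset.append(stdev(estimates))
    # Recombine: average the metric across subsets.
    return mean(per_subset)

rng_data = random.Random(42)
data = [rng_data.gauss(0.0, 1.0) for _ in range(2000)]

se_blb = blb_standard_error(data, n_subsets=5, subset_size=200, n_resamples=50)
se_formula = stdev(data) / len(data) ** 0.5   # textbook value, for comparison
```

For the mean, a closed-form standard error exists, which is what makes it a convenient check; the same loop applies to estimators that have no such formula.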

Consensus MCMC
»  Assume observations are conditionally independent across subsets, given parameters:
$$p(\theta \mid x) \propto \prod_{s=1}^{r} p(x_s \mid \theta)\, p(\theta)^{1/r}$$
»  Run a separate Monte Carlo algorithm for each subset
»  Combine the posterior simulations from each subset to produce a set of global draws representing the consensus belief among all subsets
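The slide does not say how the subset draws are combined. One common rule, assumed here, is a precision-weighted average of the draws across subsets, which is exact when every subposterior is Gaussian. The sketch uses a N(θ, 1) likelihood with a N(0, 100) prior so that each subposterior can be sampled directly; that direct sampling stands in for the per-subset MCMC run, and all names and constants are illustrative.

```python
import random
from statistics import mean, stdev, variance

def subposterior_draws(xs, r, n_draws, rng, sigma2=1.0, tau2=100.0):
    """Draws from p(x_s | theta) * p(theta)^(1/r) for a N(theta, sigma2) likelihood
    and a N(0, tau2) prior; the fractional prior is proportional to N(0, r * tau2)."""
    precision = len(xs) / sigma2 + 1.0 / (r * tau2)
    center = (sum(xs) / sigma2) / precision
    sd = precision ** -0.5
    return [rng.gauss(center, sd) for _ in range(n_draws)]

def consensus(draws_by_subset):
    """Combine draw g from every subset by a precision-weighted average."""
    weights = [1.0 / variance(d) for d in draws_by_subset]
    total = sum(weights)
    n_draws = len(draws_by_subset[0])
    return [sum(w * d[g] for w, d in zip(weights, draws_by_subset)) / total
            for g in range(n_draws)]

rng = random.Random(3)
data = [rng.gauss(1.5, 1.0) for _ in range(1000)]
r = 5
subsets = [data[s::r] for s in range(r)]

draws = [subposterior_draws(xs, r, 2000, rng) for xs in subsets]
combined = consensus(draws)
combined_mean, combined_sd = mean(combined), stdev(combined)

# Exact full-data posterior for this conjugate model, for comparison.
full_precision = len(data) / 1.0 + 1.0 / 100.0
full_mean = sum(data) / full_precision
full_sd = full_precision ** -0.5
```

In this conjugate setting the consensus draws should match the full-data posterior closely; for non-Gaussian posteriors the weighted average is only an approximation.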

Visual Recombination:

Why Trellis is Effective
»  Edward Tufte’s term for panels in Trellis Display is small multiples:
   –  “The same graphical design structure is repeated for each slice of a data set”
   –  Once a viewer understands one panel, they have immediate access to the data in all other panels
   –  Small multiples directly depict comparisons to reveal repetition and change, pattern and surprise
»  Fisher barley data example
   –  Average barley yields for 10 varieties at 6 sites across 2 years
   –  A glaring error in the data went unnoticed for nearly 60 years
[Figure: Trellis display of barley yield (bushels/acre) by variety (Svansota, No. 462, Manchuria, No. 475, Velvet, Peatland, Glabron, No. 457, Wisconsin No. 38, Trebi), one panel per site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca), for 1931 and 1932]
Sources: The Visual Display of Quantitative Information, Tufte; Visualizing Data, Cleveland

Scaling Trellis
»  What do we do when the number of panels is very large?
   –  Trellis can scale computationally, but does not scale visually
   –  We cannot look at millions of panels
»  John Tukey realized this problem decades ago
   –  “As multiple-aspect data continues to grow…, the ability of human eyes to scan the reasonable displays soon runs out”
   –  He put forth the idea of computing diagnostic quantities for each panel that judge the relative interest or importance of viewing a panel
“It seems natural to call such computer guiding diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays.”

Visual Recombination
»  For each subset
   –  Specify a visualization
   –  Specify a set of cognostics, metrics that identify an attribute of interest in the subset
»  Recombine visually by sampling, sorting, or filtering subsets based on the cognostics
»  Cognostics are computed for all subsets
»  Panels are not
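A minimal sketch of recombining by cognostics, in plain Python with made-up subsets: the two cognostics chosen here (mean and range) and the thresholds are assumptions, and this is not the Trelliscope API. Cognostics are cheap summaries computed for every subset, and they decide which subsets are worth looking at.

```python
from statistics import mean

# Hypothetical subsets, keyed by subset name.
subsets = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [5.0, 5.0, 5.0, 5.0],
    "C": [9.0, 1.0, 9.0, 1.0],
}

def cognostics(values):
    """Metrics describing one subset; computed for every subset."""
    return {"mean": mean(values), "range": max(values) - min(values)}

cogs = {key: cognostics(vals) for key, vals in subsets.items()}

# Sort: subsets with the largest range first.
by_range = sorted(cogs, key=lambda k: cogs[k]["range"], reverse=True)

# Filter: keep only subsets whose mean is at least 5.
high_mean = [k for k in cogs if cogs[k]["mean"] >= 5.0]

# View only the two most variable subsets instead of all of them.
to_view = by_range[:2]
```

With millions of subsets the table of cognostics stays small enough to sort and filter interactively, even when the panels themselves could never all be viewed.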

TESSERA Software for Divide and Recombine

Tessera: Front End
»  R
   –  Elegant design makes programming with the data very efficient
   –  Saves the analyst time, which is more important than processing time
   –  Access to 1000s of analytic methods of statistics, machine learning, and visualization
   –  Very large supporting and user community
»  D&R Interface
   –  datadr R package: R implementation of D&R that ties to scalable back ends
   –  Trelliscope R package: scalable Trellis display system

Tessera: Back End
»  datadr and Trelliscope are high-level interfaces for specifying D&R analytic and visual methods that hide the details of distributed computing
   –  So how does the scalable computation get done?
»  Sufficient conditions for a Tessera back end:
   –  Key / Value storage
   –  MapReduce computation
[Diagram: datadr / Trelliscope form the interface layer, on top of MapReduce (computation) and a key/value store (storage)]

Back End Agnostic And more… (like Spark!)

datadr
»  Representation of distributed data objects (ddo) / data frames (ddf) as R objects
»  A division framework
   –  Conditioning variable division
   –  Random replicate division
»  An extensible framework for applying generic transformations or common analytical methods (blb, etc.)
»  Recombine: collect, average, rbind, etc.
»  Goal is to implement best analytic method / recombination pairs
»  Common data operations
   –  filter, join, sample, read.table and friends
»  Division-independent methods:
   –  quantile, aggregate, hexbin

Trelliscope
»  Trelliscope works on ddo / ddf datadr objects
   –  Data can be in memory, on disk, or on a scalable storage back end like the Hadoop Distributed File System
»  The analyst specifies a panel function to be applied to each subset
   –  The function can consist of any collection of R plotting commands
   –  Panels are “potential” – in that they are not computed up front, but any panel can be potentially viewed, even if it is impossible or infeasible to view all of them
»  The analyst also specifies a cognostics function
   –  This function returns a vector of metrics about each panel that describe some behavior of interest in the data slice
   –  Panels can be sorted, filtered, and arranged based on the cognostics, providing the interface to access any of the potentially large number of panels

Trelliscope Viewer
»  A web-based viewer of Trelliscope displays, built with Shiny, allows the user to interact with panels based on cognostics
   –  Layout (rows, columns), paging
   –  Sorting, filtering on cognostics
   –  Univariate and multivariate visual range filters
   –  More to come…

DEMO Zillow Home Price Data

Example: UN Voting Data
»  ~511K observations of voting records
»  ~1500 UN resolutions
»  ~3100 votes (resolutions can have multiple issues)
»  ~200 countries
»  From 1988 to 2013
»  Votes are “yes”, “no”, “abstain”

'data.frame': 510694 obs. of 8 variables:
 $ country      : Factor w/ 195 levels "Afghanistan",..: 1 1 1 1 ...
 $ sessionNumber: int 54 58 44 51 66 45 45 60 45 67 ...
 $ resolutionID : int 14306 14548 13546 14047 15566 13633 13613 ...
 $ resolution   : Factor w/ 1492 levels "","2015 Review",..: 932 ...
 $ voteDate     : Date, format: "1999-12-06" "2003-12-08" ...
 $ issue        : Factor w/ 16 levels "A","B","C","E",..: 14 12 5 ...
 $ vote         : Factor w/ 3 levels "abstain","no",..: 3 3 3 3 3 ...
 $ region       : Factor w/ 55 levels "Africa, Middle East",..: 42 ...

Split the Data by Country

> byCountry <- divide(un, by = "country")
>
> byCountry[[1]]
$key
[1] "country=Afghanistan"

$value
  sessionNumber resolutionID resolution   voteDate issue vote
1            54        14306   R/54/164 1999-12-06     T  yes
2            58        14548    R/58/53 2003-12-08     R  yes
3            44        13546   R/44/110 1989-12-06     F  yes
4            51        14047    R/51/17 1996-11-03     U  yes
...

Panel Function
»  For each country:
   –  Compute percentage of votes agreeing w/ U.S. in each year (ignore abstain)
   –  Plot them with a smooth local regression fit superposed
»  Ex: Afghanistan
[Figure: percentage of votes agreeing with U.S. (0 to 100) versus year (1990 to 2010) for Afghanistan]
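The quantity plotted in each panel can be sketched as follows. This is plain Python on a tiny made-up vote table, not the deck's R panel function. It assumes that "ignore abstain" means skipping any vote where either the country or the U.S. abstained, and it omits the local regression smooth; the function name and the (year, vote_id) keying are hypothetical.

```python
from collections import defaultdict

def percent_agreement_by_year(country_votes, us_votes):
    """Percent of a country's votes matching the U.S. vote, per year.

    Both arguments map (year, vote_id) -> 'yes' | 'no' | 'abstain'.
    Votes where either side abstained, or the U.S. did not vote, are skipped.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for (year, vote_id), vote in country_votes.items():
        us_vote = us_votes.get((year, vote_id))
        if vote == "abstain" or us_vote in (None, "abstain"):
            continue
        total[year] += 1
        agree[year] += vote == us_vote   # True counts as 1
    return {year: 100.0 * agree[year] / total[year] for year in sorted(total)}

# Hypothetical votes for one country and for the U.S.
us = {(1990, 1): "no", (1990, 2): "yes", (1990, 3): "no",
      (1991, 4): "no", (1991, 5): "abstain"}
country = {(1990, 1): "yes", (1990, 2): "yes", (1990, 3): "abstain",
           (1991, 4): "no", (1991, 5): "yes"}

pct = percent_agreement_by_year(country, us)
```

Applied to each country's subset, this yields one small year-by-percentage series per panel, which is the data the panel function then plots.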

Cognostics Function
»  For each country, compute:
   –  Mean percent agreement
   –  Most recent percent agreement
   –  Change in agreement during Clinton, W. Bush, and Obama administrations
   –  A link to the country on Wikipedia
»  Ex: Afghanistan

$meanPct
[1] 18.60343

$endPct
[1] 17.46032

$clintonDelta
[1] -2.119509

$bushDelta
[1] -0.4905913

$obamaDelta
[1] 3.272854

$wiki
[1] "<a href=\"http://en.wikipedia.org/wiki/Afghanistan\" target=\"_blank\">link</a>"

Benefits of Trelliscope
»  Helps drive the iterative statistical analysis process
»  The interactive paradigm of the Trelliscope Viewer:
   –  Once you learn it, you don’t need to create or learn a new interface for each new data set / visualization
   –  Good for preserving state, provenance, etc.
   –  Facilitates comparisons against different views of the data (as opposed to adjusting knobs and not remembering what you saw under different settings, etc.)
»  Fosters interaction with domain scientists – visualization is the best medium for communication
»  Ability to look at the data in detail, even when it’s big
»  Visual and numerical methods coexist

Tessera: What’s Next
»  In-memory back ends
   –  Spark
   –  GridGain in-memory HDFS
»  More user-friendly support for EMR / cloud
»  Under-the-hood optimizations
»  Easy-to-use implementations of analytical recombination techniques

Resources
»  tessera.io
   –  Scripts to get an environment set up
      •  Workstation
      •  Vagrant
      •  AWS Elastic MapReduce
   –  Links to tutorials, papers
   –  Blog
»  github.com/tesseradata
»  @TesseraIO

How to Help & Contribute
»  Open source BSD / Apache license
»  Google user group
»  Start using it!
   –  If you have some applications in mind, give it a try!
   –  You don’t need big data or a cluster to use Tessera
   –  Ask us for help, let us help you showcase your work
   –  Give us feedback
»  See the resources page on tessera.io
»  Theoretical / methodological research
   –  There’s plenty of fertile ground

Acknowledgements
»  U.S. Department of Defense Advanced Research Projects Agency, XDATA program
»  U.S. Department of Homeland Security, Science and Technology Directorate
»  Division of Math Sciences CDS&E Grant, National Science Foundation
»  PNNL, operated by Battelle for the U.S. Department of Energy, LDRD Program, Signature Discovery and Future Power Grid Initiatives
