
Tessera - NY Open Statistical Programming Meetup

hafen
July 30, 2015

Transcript

  1. Joint work with: Bill Cleveland (Purdue), Saptarshi Guha (Mozilla), and many other researchers at PNNL and Purdue. Funding from the DARPA XDATA Program (and others).
  2. What is Tessera?
     »  A high-level R interface for analyzing complex data, large and small
     »  Code is simple and consistent regardless of data size
     »  Powered by the statistical methodology Divide and Recombine (D&R)
     »  Provides access to 1000s of statistical, machine learning, and visualization methods
     »  Detailed, flexible, scalable visualization with Trelliscope
     http://tessera.io
  3. Tessera
     »  Front end: two R packages, datadr & trelliscope
     »  Back ends: R, Hadoop, Spark, etc.
     »  R <-> back-end bridges: RHIPE, SparkR, etc.
     Stack: datadr / trelliscope sit on top of a MapReduce interface (computation) and a key/value store (storage)
  4. Back-End Agnostic Interface
     The same datadr / trelliscope R interface runs against multiple back ends:
     »  Memory: R computation, in-memory storage
     »  Local disk: multicore R computation, local-disk storage
     »  HDFS: RHIPE / Hadoop computation, HDFS storage
     »  HDFS: SparkR / Spark computation, HDFS storage (under development)
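The back-end-agnostic idea can be sketched with datadr: the analysis code stays the same and only the connection to the data changes. A minimal sketch, assuming the `ddf()`, `localDiskConn()`, and `hdfsConn()` functions from the datadr documentation; the paths are illustrative.

```r
library(datadr)

# In-memory back end: wrap an ordinary data frame
d <- ddf(iris)

# Local-disk back end: same ddf() call, only the connection differs
# d <- ddf(localDiskConn("/tmp/iris_kv"))   # illustrative path

# HDFS back end via RHIPE: again, only the connection changes
# d <- ddf(hdfsConn("/user/me/iris_kv"))    # illustrative path

# Downstream analysis code is identical regardless of back end
summary(d)
```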
  5. Trelliscope
     »  Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot2
     »  The number of panels can be very large
     »  Panels can be interactively navigated through the use of cognostics
     demo: https://tessera.shinyapps.io/ny-demo
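To make the cognostics idea concrete, here is a rough sketch of building a Trelliscope display: one panel per subset, with a cognostic (a per-panel metric) used to sort and filter panels in the viewer. The `vdbConn()`, `makeDisplay()`, and `cog()` calls follow my reading of the trelliscope package documentation and should be treated as an assumption, not a verbatim recipe.

```r
library(datadr)
library(trelliscope)

# connect to a "visualization database" directory (path illustrative)
vdbConn("vdb")

# one panel per species
bySpecies <- divide(ddf(iris), by = "Species")

makeDisplay(bySpecies,
  name = "sepal_scatter",
  # panelFn: the plot drawn for each subset
  panelFn = function(x) plot(x$Sepal.Length, x$Sepal.Width),
  # cogFn: per-panel metrics for interactive sorting/filtering
  cogFn = function(x) list(
    meanSL = cog(mean(x$Sepal.Length), desc = "mean sepal length")))

view()  # browse the display in the Trelliscope viewer
```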
  6. rbokeh (tangent part I)
     »  Much of the interactivity we seek in visualization can be parameterized into a Trelliscope display by specifying the data partitioning and cognostics
     »  But more transient interactivity can also be very useful (e.g. zoom/pan, tooltips, etc.)
     »  rbokeh: an R interface to the Bokeh plotting library (http://bokeh.pydata.org/)
     http://hafen.github.io/rbokeh/
     https://github.com/bokeh/rbokeh
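A minimal rbokeh sketch of the transient interactivity mentioned above, based on the package's documented `figure()` / `ly_points()` interface; the figure dimensions and hover columns are arbitrary choices for illustration.

```r
library(rbokeh)

# zoom/pan and hover tooltips come with the HTML output
figure(width = 600, height = 400) %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
            color = Species,
            hover = c(Sepal.Length, Sepal.Width))
```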
  7. htmlwidgets gallery (tangent part II)
     While we're on the subject of htmlwidgets, take a look at all the great work people are doing: http://hafen.github.io/htmlwidgetsgallery/
  8. Example: Power Grid
     »  2 TB data set of high-frequency power grid measurements at several locations on the grid
     »  Exploratory analysis and statistical modeling helped identify, validate, and build precise algorithms to filter out several types of bad data that had gone unnoticed in prior analyses (~20% bad data!)
     [Figure: frequency (Hz, near 60.000) plotted against time (seconds), illustrating bad-data artifacts]
  9. “Restricting one's self to planned analysis - failing to accompany it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey
  10. “If [you have no] concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
  11. What we want to be able to do:
     »  Work in a familiar high-level statistical programming environment
     »  Have access to the 1000s of statistical, ML, and visualization methods
     »  Minimize time thinking about code or distributed systems
     »  Maximize time thinking about the data
     »  Be able to analyze large complex data with nearly as much flexibility and ease as small data
  12. Divide and Recombine (D&R)
     »  Simple idea:
        –  Specify meaningful, persistent divisions of the data
        –  Apply analytic or visual methods independently to each subset of the divided data, in embarrassingly parallel fashion
        –  Recombine the results to yield a statistically valid D&R result for the analytic method
     »  D&R is not the same as MapReduce (but makes heavy use of it)
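The divide / apply / recombine steps above can be sketched in datadr. This is a minimal sketch assuming the `divide()`, `addTransform()`, `recombine()`, and `combRbind` names from the datadr documentation, using a small built-in data set for illustration.

```r
library(datadr)

# 1. Divide: a meaningful, persistent division (here, by species)
bySpecies <- divide(ddf(iris), by = "Species")

# 2. Apply: any small-data R method, run independently per subset
bySpecies <- addTransform(bySpecies, function(x)
  data.frame(meanSL = mean(x$Sepal.Length)))

# 3. Recombine: row-bind the per-subset results into one data frame
recombine(bySpecies, combRbind)
```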
  13. How to Divide the Data?
     »  Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
     »  It is therefore natural to break the data up along these dimensions and apply visual or analytical methods to the subsets individually
     »  We call this “conditioning-variable” division
     »  In practice it is by far the most common thing we do (and it’s nothing new)
     »  Another option is “random replicate” division
  14. Analytic Recombination
     »  Analytic recombination begins with applying an analytic method independently to each subset
        –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
     »  For conditioning-variable division:
        –  Typically the recombination depends on the subject matter
        –  Example: apply the same model to each subset, combine the estimated coefficients from the subsets, and then build a statistical model from, or visually study, the resulting collection of coefficients
  15. Analytic Recombination
     »  For random-replicate division:
        –  Observations are seen as exchangeable, with no conditioning variables considered
        –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
        –  Results are often approximations
     »  Approaches that fit this paradigm:
        –  Coefficient averaging
        –  Subset likelihood modeling
        –  Bag of little bootstraps (BLB)
        –  Consensus MCMC
        –  Alternating direction method of multipliers (ADMM)
     Our approach, BLB: from the data $X_1, \ldots, X_n$, draw $s$ subsamples $\check X^{(j)}_1, \ldots, \check X^{(j)}_{b(n)}$, $j = 1, \ldots, s$. From each subsample, draw $r$ bootstrap resamples $X^{*(1)}_1, \ldots, X^{*(1)}_n; \; \ldots; \; X^{*(r)}_1, \ldots, X^{*(r)}_n$ and compute the estimates $\hat\theta^{*(1)}_n, \ldots, \hat\theta^{*(r)}_n$. Summarize each subsample as $\xi^*_j = \xi(\hat\theta^{*(1)}_n, \ldots, \hat\theta^{*(r)}_n)$, then recombine as $\mathrm{avg}(\xi^*_1, \ldots, \xi^*_s)$.