
Tessera - NY Open Statistical Programming Meetup

hafen
July 30, 2015

Transcript

  1. Joint work with: Bill Cleveland (Purdue), Saptarshi Guha (Mozilla), and many other researchers at PNNL and Purdue. Funding from the DARPA XDATA Program (and others).
  2. What is Tessera?
     »  A high-level R interface for analyzing complex data, large and small
     »  Code is simple and consistent regardless of data size
     »  Powered by the statistical methodology Divide and Recombine (D&R)
     »  Provides access to 1000s of statistical, machine learning, and visualization methods
     »  Detailed, flexible, scalable visualization with Trelliscope
     http://tessera.io
  3. Tessera
     »  Front end: two R packages, datadr & trelliscope
     »  Back ends: R, Hadoop, Spark, etc.
     »  R <-> back-end bridges: RHIPE, SparkR, etc.
     Stack: datadr / trelliscope sit on top of a MapReduce interface (computation) and a key/value store (storage)
  4. Back-End Agnostic Interface
     The same datadr / trelliscope R interface runs against multiple back ends:
     »  Memory: R computation, in-memory storage
     »  Local disk: multicore R computation, local-disk storage
     »  HDFS: RHIPE / Hadoop computation, HDFS storage
     »  HDFS: SparkR / Spark computation, HDFS storage (under development)
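The back-end-agnostic idea can be sketched with datadr: the analysis code stays the same and only the connection to the data changes. A minimal sketch, assuming the `ddf()`, `localDiskConn()`, and `hdfsConn()` functions from the datadr documentation; the paths are illustrative.

```r
library(datadr)

# In-memory back end: wrap an ordinary data frame
d <- ddf(iris)

# Local-disk back end: same ddf() call, only the connection differs
# d <- ddf(localDiskConn("/tmp/iris_kv"))   # illustrative path

# HDFS back end via RHIPE: again, only the connection changes
# d <- ddf(hdfsConn("/user/me/iris_kv"))    # illustrative path

# Downstream analysis code is identical regardless of back end
summary(d)
```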
  5. Trelliscope
     »  Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot2
     »  The number of panels can be very large
     »  Panels can be interactively navigated through the use of cognostics
     demo: https://tessera.shinyapps.io/ny-demo
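To make the cognostics idea concrete, here is a rough sketch of building a Trelliscope display: one panel per subset, with a cognostic (a per-panel metric) used to sort and filter panels in the viewer. The `vdbConn()`, `makeDisplay()`, and `cog()` calls follow my reading of the trelliscope package documentation and should be treated as an assumption, not a verbatim recipe.

```r
library(datadr)
library(trelliscope)

# connect to a "visualization database" directory (path illustrative)
vdbConn("vdb")

# one panel per species
bySpecies <- divide(ddf(iris), by = "Species")

makeDisplay(bySpecies,
  name = "sepal_scatter",
  # panelFn: the plot drawn for each subset
  panelFn = function(x) plot(x$Sepal.Length, x$Sepal.Width),
  # cogFn: per-panel metrics for interactive sorting/filtering
  cogFn = function(x) list(
    meanSL = cog(mean(x$Sepal.Length), desc = "mean sepal length")))

view()  # browse the display in the Trelliscope viewer
```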
  6. rbokeh (tangent part I)
     »  Much of the interactivity we seek in visualization can be parameterized into a Trelliscope display by specifying the data partitioning and cognostics
     »  But more transient interactivity can also be very useful (e.g. zoom/pan, tooltips, etc.)
     »  rbokeh: an R interface to the Bokeh plotting library (http://bokeh.pydata.org/)
     http://hafen.github.io/rbokeh/
     https://github.com/bokeh/rbokeh
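A minimal rbokeh sketch of the transient interactivity mentioned above, based on the package's documented `figure()` / `ly_points()` interface; the figure dimensions and hover columns are arbitrary choices for illustration.

```r
library(rbokeh)

# zoom/pan and hover tooltips come with the HTML output
figure(width = 600, height = 400) %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
            color = Species,
            hover = c(Sepal.Length, Sepal.Width))
```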
  7. htmlwidgets gallery (tangent part II)
     While we're on the subject of htmlwidgets, take a look at all the great work people are doing: http://hafen.github.io/htmlwidgetsgallery/
  8. Example: Power Grid
     »  2 TB data set of high-frequency power grid measurements at several locations on the grid
     »  Exploratory analysis and statistical modeling helped identify, validate, and build precise algorithms to filter out several types of bad data that had gone unnoticed in prior analyses (~20% bad data!)
     [Figure: frequency (Hz, near 60.000) plotted against time (seconds), illustrating bad-data artifacts]
  9. “Restricting one's self to planned analysis - failing to accompany it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey
  10. “If [you have no] concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
  11. What we want to be able to do:
     »  Work in a familiar high-level statistical programming environment
     »  Have access to the 1000s of statistical, ML, and visualization methods
     »  Minimize time thinking about code or distributed systems
     »  Maximize time thinking about the data
     »  Be able to analyze large complex data with nearly as much flexibility and ease as small data
  12. Divide and Recombine (D&R)
     »  Simple idea:
        –  Specify meaningful, persistent divisions of the data
        –  Apply analytic or visual methods independently to each subset of the divided data, in embarrassingly parallel fashion
        –  Recombine the results to yield a statistically valid D&R result for the analytic method
     »  D&R is not the same as MapReduce (but makes heavy use of it)
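The divide / apply / recombine steps above can be sketched in datadr. This is a minimal sketch assuming the `divide()`, `addTransform()`, `recombine()`, and `combRbind` names from the datadr documentation, using a small built-in data set for illustration.

```r
library(datadr)

# 1. Divide: a meaningful, persistent division (here, by species)
bySpecies <- divide(ddf(iris), by = "Species")

# 2. Apply: any small-data R method, run independently per subset
bySpecies <- addTransform(bySpecies, function(x)
  data.frame(meanSL = mean(x$Sepal.Length)))

# 3. Recombine: row-bind the per-subset results into one data frame
recombine(bySpecies, combRbind)
```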
  13. How to Divide the Data?
     »  Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
     »  It is therefore natural to break the data up along these dimensions and apply visual or analytical methods to the subsets individually
     »  We call this “conditioning-variable” division
     »  In practice it is by far the most common thing we do (and it’s nothing new)
     »  Another option is “random replicate” division
  14. Analytic Recombination
     »  Analytic recombination begins with applying an analytic method independently to each subset
        –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
     »  For conditioning-variable division:
        –  Typically the recombination depends on the subject matter
        –  Example: apply the same model to each subset, combine the estimated coefficients from the subsets, and then build a statistical model from, or visually study, the resulting collection of coefficients
  15. Analytic Recombination
     »  For random-replicate division:
        –  Observations are seen as exchangeable, with no conditioning variables considered
        –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
        –  Results are often approximations
     »  Approaches that fit this paradigm:
        –  Coefficient averaging
        –  Subset likelihood modeling
        –  Bag of little bootstraps (BLB)
        –  Consensus MCMC
        –  Alternating direction method of multipliers (ADMM)
     Our approach, BLB: from the data $X_1, \ldots, X_n$, draw $s$ subsamples $\check X^{(j)}_1, \ldots, \check X^{(j)}_{b(n)}$, $j = 1, \ldots, s$. From each subsample, draw $r$ bootstrap resamples $X^{*(1)}_1, \ldots, X^{*(1)}_n; \; \ldots; \; X^{*(r)}_1, \ldots, X^{*(r)}_n$ and compute the estimates $\hat\theta^{*(1)}_n, \ldots, \hat\theta^{*(r)}_n$. Summarize each subsample as $\xi^*_j = \xi(\hat\theta^{*(1)}_n, \ldots, \hat\theta^{*(r)}_n)$, then recombine as $\mathrm{avg}(\xi^*_1, \ldots, \xi^*_s)$.