Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Tessera - Open source environment for deep analysis of large complex data

hafen
November 15, 2014

Tessera - Open source environment for deep analysis of large complex data

Talk given at Seattle userR meetup

hafen

November 15, 2014
Tweet

More Decks by hafen

Other Decks in Science

Transcript

  1. Example: Power Grid »  2 TB data set of high-frequency

    power grid measurements at several locations on the grid »  Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!) Time (seconds) Frequency 59.998 59.999 60.000 60.001 60.002 60.003 41 42 43 44 45 46 31 1 20 1 1 18 2
  2. “Restricting one's self to planned analysis - failing to accompany

    it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey
  3. “If [you have no] concern about error bars, about heterogeneity,

    about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
  4. What we want to be able to do: »  Work

    in familiar high-level statistical programming environment »  Have access to the 1000s of statistical, ML, and vis methods »  Minimize time thinking about code or distributed systems »  Maximize time thinking about the data »  Be able to analyze large complex data with nearly as much flexibility and ease as small data
  5. Divide and Recombine (D&R) »  Simple idea: – specify meaningful, persistent

    divisions of the data – analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion – Results are recombined to yield a statistically valid D&R result for the analytic method »  D&R is not the same as MapReduce (but makes heavy use of it)
  6. Data Subset Subset Subset Subset Subset Subset Divide Output Output

    Output Output Output Output One Analytic Method of Analysis Thread Recombine Result Statistic Recombination New Data for Analysis Sub-Thread Analytic Recombination Visual Displays Visualization Recombination
  7. How to Divide the Data? »  It depends! »  Random

    replicate division –  randomly partition the data »  Conditioning variable division –  Very often data are “embarrassingly divisible” –  Break up the data based on the subject matter –  Example: •  25 years of 90 daily financial variables for 100 banks in the U.S. •  Divide the data by bank •  Divide the data by year •  Divide the data by geography –  This is the major division method used in our own analyses –  Has already been widely used in statistics, machine learning, and visualization for datasets of all sizes
  8. Analytic Recombination »  Analytic recombination begins with applying an analytic

    method independently to each subset –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R) »  For conditioning-variable division: –  Typically the recombination depends mostly on the subject matter –  Example: •  subsets each with the same model with parameters (e.g. linear model) •  parameters are modeled as stochastic too: independent draws from a distribution •  recombination: analysis to build statistical model for the parameters using the subset estimated coefficients
  9. Analytic Recombination »  Analytic recombination begins with applying an analytic

    method independently to each subset –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R) »  For random replicate division: –  Observations are seen as exchangeable, with no conditioning variables considered –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
  10. Analytic Recombination with Random Division – The Naïve Approach: Y

    = X + ✏ Linear model: ˆ = r X s=1 X0 s Xs ! 1 r X s=1 X0 s Ys Entire-data least squares estimate: ¨ = 1 r r X s=1 (X0 s Xs) 1X0 s Ys D&R approximation: ¨ ⇡ ˆ Under certain conditions, we can show: Partition into random subsets: 2 6 6 6 4 Y1 Y2 . . . Yr 3 7 7 7 5 = 2 6 6 6 4 X1 X2 . . . Xr 3 7 7 7 5 + 2 6 6 6 4 ✏1 ✏ . . . ✏ 3 7 7 7 5
  11. Note »  We can do this for GLMs, general factor-

    response models, etc. »  To run this, we only need R’s lm() function »  Computation is embarrassingly parallel »  We can (and want to) do it in one pass through the data »  But we can do better in terms of accuracy
  12. Scatter Matrix Stability Weighting »  Compute a measure of concordance

    between the scatter matrix of individual blocks XT S X S and the overall scatter matrix XTX »  Use this measure to weight the averaging to obtain the final estimate »  Requires two passes – one to get overall scatter matrix and one to compare blocks to overall »  Try to avoid iteration as data is most often too large to fit in memory and disk IO is slow
  13. Subset Likelihood Modeling »  Suppose we have a log likelihood

    for a hypothesized model to fit with n independent observations, xi »  Break the data into r subsets »  Fit subset likelihoods with parametric model (e.g. quadratic) »  Recombine by summing the fitted subset likelihood models to get a fitted all-data likelihood model »  Approach the problem as building a model for the likelihood `(✓) = log n Y i=1 f(xi | ✓) `(✓) = r X s=1 `s(✓)
  14. Bag of Little Bootstraps Our Approach: BLB X1, . .

    . , Xn . . . . . . X⇤(1) 1 , . . . , X⇤(1) n X⇤(2) 1 , . . . , X⇤(2) n ˆ ✓⇤(1) n ˆ ✓⇤(2) n . . . . . . X⇤(1) 1 , . . . , X⇤(1) n X⇤(2) 1 , . . . , X⇤(2) n ˆ ✓⇤(1) n ˆ ✓⇤(2) n . . . ˇ X(1) 1 , . . . , ˇ X(1) b(n) avg(⇠⇤ 1 , . . . , ⇠⇤ s ) ˇ X(s) 1 , . . . , ˇ X(s) b(n) X⇤(r) 1 , . . . , X⇤(r) n X⇤(r) 1 , . . . , X⇤(r) n ˆ ✓⇤(r) n ˆ ✓⇤(r) n ⇠(ˆ ✓⇤(1) n , . . . , ˆ ✓⇤(r) n ) = ⇠⇤ 1 ⇠(ˆ ✓⇤(1) n , . . . , ˆ ✓⇤(r) n ) = ⇠⇤ s Divide into random subsets Resample* each subset and compute estimate Compute bootstrap metric Image from “Bootstrapping Big Data”, Ariel Kleiner, et. al. * with n replicates
  15. Consensus MCMC »  Assume observations are conditionally independent across subsets,

    given parameters »  Run a separate Monte Carlo algorithm for each subset »  Combine the posterior simulations from each subset to produce a set of global draws representing the consensus belief among all subsets p ( ✓ | x ) / r Y s=1 p ( xs | ✓ ) p ( ✓ )1/r
  16. Why Trellis is Effective »  Edward Tufte’s term for panels

    in Trellis Display is small multiples: –  “The same graphical design structure is repeated for each slice of a data set” –  Once a viewer understands one panel, they have immediate access to the data in all other panels –  Small multiples directly depict comparisons to reveal repetition and change, pattern and surprise »  Fisher barley data example –  Average barley yields for 10 varieties at 6 sites across 2 years –  A glaring error in the data went unnoticed for nearly 60 years 26 26 Barley Yield (bushels/acre) Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi 20 30 40 50 60 Grand Rapids Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi Duluth Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi University Farm Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi Morris Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi Crookston Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi Waseca 1932 1931 The Visual Display of Quantitative Information, Tufte Visualizing Data, Cleveland
  17. Scaling Trellis »  What do we do when the number

    of panels is very large? –  Trellis can scale computationally, but does not scale visually –  We cannot look at millions of panels »  John Tukey realized this problem decades ago –  “As multiple-aspect data continues to grow…, the the ability of human eyes to scan the reasonable displays soon runs out” –  He put forth the idea of computing diagnostics quantities for each panel that judge the relative interest or importance of viewing a panel “It seems natural to call such computer guiding diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays.”
  18. Visual Recombination »  For each subset – Specify a visualization – Specify

    a set of cognostics, metrics that identify an attribute of interest in the subset »  Recombine visually by sampling, sorting, or filtering subsets based on the cognostics »  Cognostics are computed for all subsets »  Panels are not
  19. Tessera: Front End »  R –  Elegant design makes programming

    with the data very efficient –  Saves the analyst time, which is more important than processing time –  Access to 1000s of analytic methods of statistics, machine learning, and visualization –  Very large supporting and user community »  D&R Interface –  datadr R package: R implementation of D&R that ties to scalable back ends –  Trelliscope R package: scalable Trellis display system
  20. Tessera: Back End »  Datadr and Trelliscope are high-level interfaces

    for specifying D&R analytic and visual methods that hide the details of distributed computing –  So how does the scalable computation get done? »  Sufficient conditions for a Tessera back-end: –  Key / Value storage –  MapReduce computation datadr / trelliscope Key/Value Store MapReduce Interface Computation Storage
  21. datadr »  Representation of distributed data objects (ddo) / data

    frames (ddf) as R objects »  A division framework –  conditional variable division –  random replicate division »  An extensible framework for applying generic transformations or common analytical methods (blb, etc.) »  Recombine: collect, average, rbind, etc. »  Goal is to implement best analytic method / recombination pairs »  Common data operations –  filter, join, sample, read.table and friends »  Division-independent methods: –  quantile, aggregate, hexbin
  22. Trelliscope »  Trelliscope works on ddo / ddf datadr objects

    –  Data can be in memory, on disk, or on a scalable storage back-end like the Hadoop Distributed File System »  The analyst specifies a panel function to be applied to each subset –  The function can consist of any collection of R plotting commands –  Panels are “potential” – in that they are not computed up front, but any panel can be potentially viewed, even if it is impossible or infeasible to view all of them »  The analyst also specifies a cognostics function –  This function returns a vector of metrics about each panel that describe some behavior of interest in the data slice –  Panels can be sorted, filtered, arranged based on the cognostics, providing the interface to access any of the potentially large number of panels
  23. Trelliscope Viewer »  A web-based viewer of Trelliscope displays allows

    the user to interact with panels based on cognostics, built with Shiny – Layout (rows, columns), paging – Sorting, filtering on cognostics – Univariate and multivariate visual range filters – More to come…
  24. Example: UN Voting Data »  ~511K observations of voting records

    »  ~1500 UN resolutions »  ~ 3100 votes (resolutions can have multiple issues) »  ~ 200 countries »  From 1988 to 2013 »  Votes are “yes”, “no”, “abstain” 'data.frame': 510694 obs. of 8 variables: $ country : Factor w/ 195 levels "Afghanistan",..: 1 1 1 1 ... $ sessionNumber: int 54 58 44 51 66 45 45 60 45 67 ... $ resolutionID : int 14306 14548 13546 14047 15566 13633 13613 ... $ resolution : Factor w/ 1492 levels "","2015 Review",..: 932 ... $ voteDate : Date, format: "1999-12-06" "2003-12-08" ... $ issue : Factor w/ 16 levels "A","B","C","E",..: 14 12 5 ... $ vote : Factor w/ 3 levels "abstain","no",..: 3 3 3 3 3 ... $ region : Factor w/ 55 levels "Africa, Middle East",..: 42 ...
  25. Split the Data by Country > byCountry <- divide(un, by

    = "country") > > byCountry[[1]] $key [1] "country=Afghanistan" $value sessionNumber resolutionID resolution voteDate issue vote 1 54 14306 R/54/164 1999-12-06 T yes 2 58 14548 R/58/53 2003-12-08 R yes 3 44 13546 R/44/110 1989-12-06 F yes 4 51 14047 R/51/17 1996-11-03 U yes ...
  26. Panel Function »  For each country: –  Compute percentage of

    votes agreeing w/ U.S. in each year (ignore abstain) –  Plot them with a smooth local regression fit superposed »  Ex: Afghanistan -->> 1990 1995 2000 2005 2010 0 20 40 60 80 100 Year Percentage of votes agreeing with U.S.
  27. Cognostics Function »  For each country, compute: –  Mean percent

    agreement –  Most recent percent agreement –  Change in agreement during Clinton, W. Bush, and Obama administrations –  A link to the country on wikipedia »  Ex: Afghanistan -->> $meanPct [1] 18.60343 $endPct [1] 17.46032 $clintonDelta [1] -2.119509 $bushDelta [1] -0.4905913 $obamaDelta [1] 3.272854 $wiki [1] "<a href=\"http://en.wikipe dia.org/wiki/Afghanistan\" target=\"_blank\">link</a>"
  28. Benefits of Trelliscope »  Helps drive the iterative statistical analysis

    process »  The interactive paradigm of the Trelliscope Viewer: –  Once you learn it, you don’t need to create or learn a new interface for each new data set / visualization –  Good for preserving state, provenance, etc. –  Facilitates comparisons against different views of the data (as opposed to adjusting knobs and not remembering what you saw under different settings, etc.) »  Fosters interaction with domain scientists – visualization is the best medium for communication »  Ability to look at the data in detail, even when it’s big »  Visual and numerical methods coexist
  29. Tessera: What’s Next »  In-memory back-ends – Spark – GridGain in-memory HDFS

    »  More user-friendly support for EMR / cloud »  Under the hood optimizations »  Easy-to-use implementations of analytical recombination techniques
  30. Resources »  tessera.io – Scripts to get an environment set up

    •  Workstation •  Vagrant •  AWS Elastic MapReduce – Links to tutorials, papers – Blog »  github.com/tesseradata »  @TesseraIO
  31. How to Help & Contribute »  Open source BSD /

    Apache license »  Google user group »  Start using it! –  If you have some applications in mind, give it a try! –  You don’t need big data or a cluster to use Tessera –  Ask us for help, let us help you showcase your work –  Give us feedback »  See resources page in tessera.io »  Theoretical / methodological research –  There’s plenty of fertile ground
  32. Acknowledgements »  U.S. Department of Defense Advanced Research Projects Agency,

    XDATA program »  U.S. Department of Homeland Security, Science and Technology Directorate »  Division of Math Sciences CDS&E Grant, National Science Foundation »  PNNL, operated by Battelle for the U.S. Department of Energy, LDRD Program, Signature Discovery and Future Power Grid Initiatives