
Interface, Design, and Computational Considerations for Divide and Recombine

Ryan Hafen (@hafenstats)
June 12, 2015


  1. Interface, Design, and Computational Considerations for Divide and Recombine
     Ryan Hafen, @hafenstats
     Hafen Consulting, Purdue University
     Interface Symposium, June 12, 2015
  2. Example: Power Grid
     »  2 TB data set of high-frequency power grid measurements at several locations on the grid
     »  Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
     [Plot: grid frequency (59.998–60.003 Hz) vs. time (41–46 seconds), illustrating bad-data artifacts in the high-frequency measurements]
  3. “If [you have no] concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
  4. Divide and Recombine (D&R)
     »  Simple idea:
        –  specify a meaningful division of the data
        –  apply an analytic or visual method independently to each subset of the divided data in embarrassingly parallel fashion
        –  recombine the results to yield a statistically valid D&R result for the analytic method
     »  D&R is not the same as MapReduce (but makes heavy use of it)
  5. [Diagram: the D&R workflow — the data are divided into subsets; one analytic method of an analysis thread is applied independently to each subset, producing per-subset outputs; the outputs are recombined into a result. Recombination can be a statistic recombination (yielding a result statistic), an analytic recombination (yielding new data for an analysis sub-thread), or a visualization recombination (yielding visual displays).]
  6. How to Divide the Data?
     »  Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
     »  It is therefore natural to break the data up based on these dimensions and apply visual or analytical methods to the subsets individually
     »  We call this “conditioning-variable” division
     »  It is in practice by far the most common thing we do (and it’s nothing new)
     »  Another option is “random replicate” division
  7. Analytic Recombination
     »  Analytic recombination begins with applying an analytic method independently to each subset
        –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
     »  For conditioning-variable division:
        –  Typically the recombination depends on the subject matter
        –  Example: apply the same model to each subset, then combine the subset estimated coefficients into a statistical model or visually study the resulting collection of coefficients
  8. Analytic Recombination
     »  For random replicate division:
        –  Observations are seen as exchangeable, with no conditioning variables considered
        –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
        –  Results are often approximations
     »  Approaches that fit this paradigm:
        –  Coefficient averaging
        –  Subset likelihood modeling
        –  Bag of little bootstraps (BLB)
        –  Consensus MCMC
        –  Alternating direction method of multipliers (ADMM)
     [Diagram (“Our Approach: BLB”): from the data X_1, …, X_n, draw s subsamples X̌^(k)_1, …, X̌^(k)_b(n); from each subsample generate r size-n bootstrap resamples X*^(1)_1, …, X*^(r)_n with estimates θ̂*^(1)_n, …, θ̂*^(r)_n; form ξ(θ̂*^(1)_n, …, θ̂*^(r)_n) = ξ*_k for each subsample; recombine as avg(ξ*_1, …, ξ*_s)]
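     As a concrete illustration of one of these approaches, here is a minimal sketch of coefficient averaging under random-replicate division with datadr. It uses the housing data introduced later in this deck; the rrDiv() division specification and the subset size of 1000 rows, the use of plain lm() coefficients, and the combRbind-then-average recombination are illustrative assumptions rather than the deck's own code.

     library(datadr)

     # Random-replicate division: subsets of roughly 1000 rows each,
     # with no conditioning variables (observations treated as exchangeable)
     byRandom <- divide(housing, by = rrDiv(1000), update = TRUE)

     # Fit the same small-data model independently to each subset and
     # return its coefficients as a one-row data frame
     subsetCoefs <- addTransform(byRandom, function(x) {
       fit <- lm(medListPriceSqft ~ time, data = x)
       data.frame(intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
     })

     # Recombine by row-binding the per-subset coefficients, then average them
     coefTable <- recombine(subsetCoefs, combRbind)
     colMeans(coefTable[, c("intercept", "slope")], na.rm = TRUE)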
  9. Visual Recombination
     »  Data are split into meaningful subsets, usually by conditioning on variables of the dataset
     »  For each subset:
        –  A visualization method is applied
        –  A set of cognostics – metrics that identify attributes of interest in the subset – is computed
     »  Recombine visually by sampling, sorting, or filtering subsets based on the cognostics
     »  Implemented in the trelliscope package
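     A rough sketch of what such a visual recombination might look like with trelliscope, applied to the byCounty division built later in this deck. The panelFn/cogFn/makeDisplay() interface and vdbConn() setup follow the Tessera tutorials, but the display name, the directory path, and the particular panel and cognostic functions here are illustrative assumptions.

     library(trelliscope)
     library(lattice)

     # Connect to a "visualization database" directory (path is illustrative)
     vdbConn("housing_vdb", autoYes = TRUE)

     # Panel function: the plot applied independently to each subset
     timePanel <- function(x)
       xyplot(medListPriceSqft ~ time, data = x, type = "b",
              ylab = "Median list price / sq. ft.")

     # Cognostics function: per-subset metrics used to sample, sort, and filter panels
     priceCog <- function(x) list(
       slope = cog(coef(lm(medListPriceSqft ~ time, data = x))[2],
                   desc = "slope of list price vs. time"),
       nObs  = cog(sum(!is.na(x$medListPriceSqft)),
                   desc = "number of non-missing list price observations"))

     # Create the display; panels are then browsed, sorted, and filtered in the viewer
     makeDisplay(byCounty,
                 name    = "list_price_vs_time",
                 desc    = "median list price per sq. ft. vs. time, by county",
                 panelFn = timePanel,
                 cogFn   = priceCog)

     view()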
  10. Data structures for D&R
      »  Must be able to break data down into pieces for independent storage / computation
      »  Recall the potential for “complex data structures not readily put into tabular form of cases by variables”
      »  Key-value pairs: a flexible storage paradigm for divided data
         –  each subset is an R list with two elements: key, value
         –  keys and values can be any R object
  11. [[1]]
      $key
      [1] "setosa"

      $value
        Sepal.Length Sepal.Width Petal.Length Petal.Width
      1          5.1         3.5          1.4         0.2
      2          4.9         3.0          1.4         0.2
      3          4.7         3.2          1.3         0.2
      4          4.6         3.1          1.5         0.2
      5          5.0         3.6          1.4         0.2
      ...

      [[2]]
      $key
      [1] "versicolor"

      $value
         Sepal.Length Sepal.Width Petal.Length Petal.Width
      51          7.0         3.2          4.7         1.4
      52          6.4         3.2          4.5         1.5
      53          6.9         3.1          4.9         1.5
      54          5.5         2.3          4.0         1.3
      55          6.5         2.8          4.6         1.5
      ...
  12. Distributed data objects (ddo)
      »  A collection of k/v pairs that constitutes a set of data
      »  Arbitrary data structure (but the same structure across subsets)

      > irisDdo

      Distributed data object backed by 'kvMemory' connection

       attribute      | value
      ----------------+--------------------------------------------
       size (stored)  | 12.67 KB
       size (object)  | 12.67 KB
       # subsets      | 3

      * Other attributes: getKeys()
      * Missing attributes: splitSizeDistn
  13. Distributed data frames (ddf)
      »  A distributed data object where the value of each key-value pair is a data frame
      »  Now we have more meaningful attributes (names, number of rows & columns, summary statistics, etc.)

      > irisDdf

      Distributed data frame backed by 'kvMemory' connection

       attribute      | value
      ----------------+-----------------------------------------------------
       names          | Sepal.Length(num), Sepal.Width(num), and 3 more
       nrow           | 150
       size (stored)  | 12.67 KB
       size (object)  | 12.67 KB
       # subsets      | 3

      * Other attrs: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
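      A minimal sketch of how in-memory objects like irisDdo and irisDdf above might be constructed. Following the datadr tutorials, it assumes ddo() and ddf() accept a list of key-value pairs, each a two-element list of key and value, and that updateAttributes() fills in the missing summary attributes.

      library(datadr)

      # Build key-value pairs by hand: key = species name, value = that species' rows
      irisKV <- list(
        list("setosa",     subset(iris, Species == "setosa")),
        list("versicolor", subset(iris, Species == "versicolor")),
        list("virginica",  subset(iris, Species == "virginica")))

      # Instantiate a distributed data object and a distributed data frame,
      # both backed by the in-memory ('kvMemory') connection
      irisDdo <- ddo(irisKV)
      irisDdf <- ddf(irisKV)

      # Compute the missing attributes (summaries, split size distributions, etc.)
      irisDdf <- updateAttributes(irisDdf)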
  14. D&R computation
      »  MapReduce is sufficient for all D&R operations
         –  Everything uses MapReduce under the hood
         –  Division, recombination, summaries, etc.
  15. What does a candidate back end need?
      »  MapReduce that can run R in the map and reduce
      »  Distributed key-value store
      »  Fast random access by key
      »  Ability to broadcast auxiliary data to nodes
      »  A control mechanism to handle backend-specific settings (Hadoop parameters, etc.)
      »  To plug in a back end, implement methods that tie to generic MapReduce and data connection classes
  16. datadr
      »  Distributed data types / backend connections:
         –  localDiskConn(), hdfsConn(), sparkDataConn(): connections to ddo / ddf objects persisted on a backend storage system
         –  ddo(): instantiate a ddo from a backend connection
         –  ddf(): instantiate a ddf from a backend connection
      »  Conversion methods between data stored on different backends
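      For example, a sketch of persisting a division to a backend and re-attaching to it later. The directory paths are illustrative, and the convert() call at the end follows the pattern described in the datadr documentation for moving data between backends; treat it as an assumption rather than verified code.

      library(datadr)

      # Connect to (or create) a local-disk key-value store at an illustrative path
      localConn <- localDiskConn("/tmp/housing_by_county", autoYes = TRUE)

      # Divide straight onto that backend by passing the connection as the output
      byCountyDisk <- divide(housing, by = c("county", "state"),
                             output = localConn, update = TRUE)

      # Re-attach later by instantiating a ddf from the connection
      byCountyDisk <- ddf(localDiskConn("/tmp/housing_by_county"))

      # Convert between backends, e.g. local disk -> HDFS (path illustrative)
      # byCountyHdfs <- convert(byCountyDisk, hdfsConn("/user/me/housing_by_county"))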
  17. datadr: division-independent methods
      »  drQuantile(): estimate all-data quantiles, optionally by a grouping variable
      »  drAggregate(): all-data tabulation
      »  drHexbin(): all-data hexagonal binning aggregation
      »  summary() method computes numerically stable moments and other summary stats (freq table, range, #NA, etc.)
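      A hedged sketch of drQuantile() on the byCounty division created later in this deck; the var argument and the fval/q column names follow the datadr tutorial, and the plot is simply one way to inspect the result.

      # Approximate all-data quantiles of median list price per sq. ft.
      priceQ <- drQuantile(byCounty, var = "medListPriceSqft")
      head(priceQ)

      # drQuantile returns f-value / quantile pairs that can be plotted directly
      plot(priceQ$fval, priceQ$q,
           xlab = "f-value", ylab = "median list price / sq. ft.")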
  18. datadr: division and recombination
      »  A divide() function takes a ddf and splits it by columns in the data or randomly
      »  Division of ddos with arbitrary data structures must typically be done with custom MapReduce code (unless the data can be temporarily transformed into a ddf)
      »  Analytic methods are applied to a ddo/ddf with the addTransform() function
      »  Recombinations are specified with recombine(), which provides standard combiner methods, such as combRbind, which binds transformed results into a single data frame
  19. datadr: data operations
      »  drLapply(): apply a function to each subset of a ddo/ddf and obtain a new ddo/ddf
      »  drJoin(): join multiple ddo/ddf objects by key
      »  drSample(): take a random sample of subsets of a ddo/ddf
      »  drFilter(): filter out subsets of a ddo/ddf that do not meet a specified criterion
      »  drSubset(): return a subset data frame of a ddf
      »  drRead.table() and friends
      »  mrExec(): run a traditional MapReduce job on a ddo/ddf
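      A rough illustration of a few of these operations on the byCounty and byCountySlope objects built later in this deck; the particular predicates and the named-argument usage of drJoin() are assumptions based on the datadr documentation.

      # Keep only counties with at least 12 months of observations
      bigCounties <- drFilter(byCounty, function(x) nrow(x) >= 12)

      # Apply a function to every subset, obtaining a new ddo of per-county means
      countyMeans <- drLapply(byCounty,
        function(x) mean(x$medListPriceSqft, na.rm = TRUE))

      # Join two objects that share keys (per-county slopes and the raw data)
      joined <- drJoin(slope = byCountySlope, raw = byCounty)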
  20. Running a MapReduce job against data on HDFS:

      maxMap <- expression({
        for(curMapVal in map.values)
          collect("max", max(curMapVal$Petal.Length))
      })

      maxReduce <- expression(
        pre = {
          globalMax <- NULL
        },
        reduce = {
          globalMax <- max(c(globalMax, unlist(reduce.values)))
        },
        post = {
          collect(reduce.key, globalMax)
        }
      )

      # backend-specific settings can be supplied via the 'control' argument
      maxRes <- mrExec(hdfsConn("path_to_data"),
        map = maxMap,
        reduce = maxReduce)

  21. The same map and reduce expressions against data managed by Spark:

      maxRes <- mrExec(sparkDataConn("path_to_data"),
        map = maxMap,
        reduce = maxReduce)

  22. ... against data on local disk:

      maxRes <- mrExec(localDiskConn("path_to_data"),
        map = maxMap,
        reduce = maxReduce)

  23. ... or against an in-memory ddo/ddf:

      maxRes <- mrExec(data,
        map = maxMap,
        reduce = maxReduce)
  24. D&R Example
      »  Zillow home price data:

      > head(housing)
         fips         county state       time nSold medListPriceSqft medSoldPriceSqft
      1 06001 Alameda County    CA 2008-10-01    NA         307.9787         325.8118
      2 06001 Alameda County    CA 2008-11-01    NA         299.1667               NA
      3 06001 Alameda County    CA 2008-11-01    NA               NA         318.1150
      4 06001 Alameda County    CA 2008-12-01    NA         289.8815         305.7878
      5 06001 Alameda County    CA 2009-01-01    NA         288.5000         291.5977
      6 06001 Alameda County    CA 2009-02-01    NA         287.0370               NA
  25. D&R Example
      »  Divide by county and state:

      > byCounty <- divide(housing,
      +   by = c("county", "state"), update = TRUE)
      >
      > byCounty

      Distributed data frame backed by 'kvMemory' connection

       attribute      | value
      ----------------+----------------------------------------------------------------
       names          | fips(cha), time(Dat), nSold(num), and 2 more
       nrow           | 224369
       size (stored)  | 16.45 MB
       size (object)  | 16.45 MB
       # subsets      | 2883

      * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
      * Conditioning variables: county, state
  26. D&R Example
      »  Look at a subset:

      > byCounty[[1]]

      $key
      [1] "county=Abbeville County|state=SC"

      $value
         fips       time nSold medListPriceSqft medSoldPriceSqft
      1 45001 2008-10-01    NA         73.06226               NA
      2 45001 2008-11-01    NA         70.71429               NA
      3 45001 2008-12-01    NA         70.71429               NA
      4 45001 2009-01-01    NA         73.43750               NA
      5 45001 2009-02-01    NA         78.69565               NA
      ...
  27. D&R Example
      »  Look at a subset by key:

      > byCounty[["county=Monongalia County|state=WV"]]

      $key
      [1] "county=Monongalia County|state=WV"

      $value
         fips       time nSold medListPriceSqft medSoldPriceSqft
      1 54061 2008-10-01    NA         120.4167               NA
      2 54061 2008-11-01    NA         121.7949               NA
      3 54061 2008-11-01    NA               NA               NA
      4 54061 2008-12-01    NA         121.3571               NA
      5 54061 2009-01-01    NA         121.3571               NA
      ...
  28. D&R Example
      »  Apply a transformation to get the slope of a fitted line of list price vs. time:

      > lmCoef <- function(x)
      +   coef(lm(medListPriceSqft ~ time, data = x))[2]
      >
      > byCountySlope <- addTransform(byCounty, lmCoef)
      >
      > byCountySlope[[1]]

      $key
      [1] "county=Abbeville County|state=SC"

      $value
                time
      -0.0002323686
  29. D&R Example
      »  Recombine the slope coefficients into a data frame:

      > countySlopes <- recombine(byCountySlope, combRbind)
      >
      > head(countySlopes)
                        county state           val
      time    Abbeville County    SC -0.0002323686
      time1      Acadia Parish    LA  0.0019518441
      time2    Accomack County    VA -0.0092717711
      time3         Ada County    ID -0.0030197554
      time4       Adair County    IA -0.0308381951
      time5       Adair County    KY  0.0034399585
  30. A Note About Lazy Evaluation
      »  Systems like Spark provide lazy evaluation
         –  Specify a series of computation steps but don’t execute until a result is asked for
         –  The idea is that the resulting computation graph can be optimized
      »  In D&R, we (mostly) don’t do this
         –  Any divide, recombine, or function beginning with "dr" immediately kicks off a MapReduce job
         –  This is a deliberate choice made for good reason
  31. Why Not Lazy Evaluation in D&R?
      »  Divisions can typically be accomplished with one MapReduce job, and they are meant to be persistent, so why not compute them right away?
      »  Applying an analytic method and recombining is also one MapReduce job, and we want the result right away in this case as well
      »  So really, we don’t need lazy evaluation
      »  OK, there are a few cases where we string data operations together, e.g. divide followed by drFilter, etc.
         –  You could argue we should have lazy evaluation here
         –  Why not? Debugging!
  32. Debugging in Distributed Computing
      »  Distributed debugging is very difficult
         –  Which subset did the error come from?
         –  What was the environment like in the R instance running on the node where the error occurred?
         –  etc.
      »  One of the most common causes of bugs is specifying operations on data we have not yet seen, so we do not know exactly what its structure is (and we get it wrong)
      »  This is a major reason we don’t lazily evaluate sequences of commands
  33. The One Lazy Evaluation Exception
      »  Applying transformations to ddo/ddf objects with addTransform is a lazily evaluated operation
         –  The transformation is made note of and applied when the transformed object is computed on
         –  We can do this and still keep things simple
         –  A transformed object behaves in every way as if it has already been transformed
  34. Lazy Evaluation of addTransform

      > lmCoef <- function(x)
      +   coef(lm(medListPriceSqft ~ time, data = x))[2]
      >
      > byCountySlope <- addTransform(byCounty, lmCoef)
      > byCountySlope

      Transformed distributed data object backed by 'kvMemory' connection

       attribute      | value
      ----------------+----------------------------------------------------------------
       size (stored)  | 16.45 MB (before transformation)
       size (object)  | 16.45 MB (before transformation)
       # subsets      | 2883

      * Other attributes: getKeys()
      * Conditioning variables: county, state

      > byCountySlope[[1]]

      $key
      [1] "county=Abbeville County|state=SC"

      $value
                time
      -0.0002323686
  35. Recap of Some Key Points
      »  D&R is a simple but powerful and scalable paradigm
      »  Think of D&R as turning a big-data problem into many small-data problems, which we can attack with the full arsenal of R
      »  MapReduce is sufficient for D&R, but not the same thing
      »  We strive to use methods that do not require iterative application of MapReduce
      »  Key-value pairs for storage provide the flexibility we need to deal with large, complex data
      »  Divisions are persistent (and expensive to compute) and should be well thought out
      »  A single data set can (and usually does) have multiple divisions
      »  Typically there are many recombinations applied to a given division – recombinations are much faster to compute
  36. Learning More About Tessera
      »  tessera.io
         –  Scripts to get an environment set up
            •  Workstation
            •  Vagrant
            •  AWS Elastic MapReduce
         –  Links to tutorials, papers
         –  Blog
      »  github.com/tesseradata
      »  @TesseraIO
  37. How to Help & Contribute
      »  Open source, BSD / Apache license
      »  Google user group
      »  Start using it!
         –  If you have some applications in mind, give it a try!
         –  You don’t need big data or a cluster to use Tessera
         –  Ask us for help, let us help you showcase your work
         –  Give us feedback
      »  See the resources page at tessera.io
      »  Theoretical / methodological research
         –  There’s plenty of fertile ground
  38. Acknowledgements
      »  U.S. Department of Defense Advanced Research Projects Agency, XDATA program
      »  U.S. Department of Homeland Security, Science and Technology Directorate
      »  Division of Math Sciences CDS&E Grant, National Science Foundation
      »  PNNL, operated by Battelle for the U.S. Department of Energy: LDRD Program, Signature Discovery and Future Power Grid Initiatives