» Power grid measurements at several locations on the grid
» Identified, validated, and built precise statistical algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
[Figure: grid frequency (59.998–60.003 Hz) plotted against time (41–46 seconds)]
“… about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.”
– Michael Jordan
» Specify a division of the data
– Apply an analytic or visual method independently to each subset of the divided data, in embarrassingly parallel fashion
– Recombine the results to yield a statistically valid D&R result for the analytic method
» D&R is not the same as MapReduce (but makes heavy use of it)
[Diagram: the D&R analysis thread – divided data flows through one analytic method per subset, producing outputs that are recombined into a result; recombination may be a statistic recombination, an analytic recombination (feeding new data for analysis sub-threads), or a visualization recombination (visual displays)]
» Data is often big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
» It is therefore natural to break the data up along these dimensions and apply visual or analytical methods to the subsets individually
» We call this “conditioning-variable” division
» It is in practice by far the most common thing we do (and it’s nothing new)
» Another option is “random replicate” division
» Apply an analytic method independently to each subset
– The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
» For conditioning-variable division:
– Typically the recombination depends on the subject matter
– Example: apply the same model to each subset, then build a statistical model from the collection of subset-estimated coefficients or study it visually (see the sketch below)
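A minimal base-R sketch of this recombination idea, assuming a hypothetical data frame d with columns group, x, and y (any per-subset model would do):

# divide by a conditioning variable, apply the same model to each subset,
# and recombine the per-subset coefficients for further study
subsets <- split(d, d$group)
fits <- lapply(subsets, function(s) coef(lm(y ~ x, data = s)))
coefs <- do.call(rbind, fits)   # one row of estimated coefficients per subset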
» Divide the data based on variables of the dataset
» For each subset:
– A visualization method is applied
– A set of cognostics – metrics that identify attributes of interest in the subset – is computed
» Recombine visually by sampling, sorting, or filtering subsets based on the cognostics
» Implemented in the trelliscope package (see the sketch below)
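As a rough sketch of how this might look with trelliscope (the byCounty division, panel function, and cognostic below are illustrative assumptions, not taken from the talk):

library(trelliscope)
vdbConn("housing_vdb", autoYes = TRUE)   # a "visualization database" to hold displays
makeDisplay(byCounty,
  name = "list_price_vs_time",
  panelFn = function(x)                  # visualization method applied to each subset
    lattice::xyplot(medListPriceSqft ~ time, data = x),
  cogFn = function(x)                    # cognostics used to sample/sort/filter panels
    list(slope = cog(coef(lm(medListPriceSqft ~ time, data = x))[[2]],
                     desc = "slope of list price vs. time")))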
» D&R breaks data down into pieces for independent storage / computation
» Recall the potential for “complex data structures not readily put into tabular form of cases by variables”
» Key-value pairs: a flexible storage paradigm for divided data
– Each subset is an R list with two elements: key and value (see the sketch below)
– Keys and values can be any R object
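For example, one key-value pair for the housing data shown later might look like the following (the values here are illustrative):

# a subset is just a two-element list: a key naming the subset and a value
kv <- list(
  "county=Abbeville County|state=SC",                  # key: any R object
  data.frame(fips = "45001", time = as.Date("2008-10-01"),
             nSold = NA, medListPriceSqft = 73.06226)) # value: any R object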
» A “ddo” (distributed data object) is a collection of key-value pairs that constitutes a set of data
» Arbitrary data structure (but the same structure across subsets)

> irisDdo

Distributed data object backed by 'kvMemory' connection

 attribute      | value
----------------+--------------------------------------------
 size (stored)  | 12.67 KB
 size (object)  | 12.67 KB
 # subsets      | 3

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn
» A “ddf” (distributed data frame) is a ddo where the value of each key-value pair is a data frame
» Now we have more meaningful attributes (names, number of rows & columns, summary statistics, etc.)

> irisDdf

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+-----------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), and 3 more
 nrow           | 150
 size (stored)  | 12.67 KB
 size (object)  | 12.67 KB
 # subsets      | 3

* Other attrs: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
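For reference, here is a plausible way objects like these are created with datadr's in-memory backend (the exact calls behind the printed output above are an assumption):

library(datadr)
# three key-value pairs, one per species, matching "# subsets | 3" above
irisKV <- list(
  list("setosa",     subset(iris, Species == "setosa")),
  list("versicolor", subset(iris, Species == "versicolor")),
  list("virginica",  subset(iris, Species == "virginica")))
irisDdo <- ddo(irisKV)   # distributed data object: values may be arbitrary structures
irisDdf <- ddf(irisKV)   # distributed data frame: values are data frames
irisDdf <- updateAttributes(irisDdf)   # fills in attributes such as splitSizeDistn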
» Requirements for a back end:
– Ability to run R in the map and reduce steps
– Distributed key-value store
– Fast random access by key
– Ability to broadcast auxiliary data to nodes
– A control mechanism to handle backend-specific settings (Hadoop parameters, etc.)
» To plug in a back end, implement methods that tie to generic MapReduce and data connection classes
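To make "run R in the map and reduce" concrete, here is a sketch of datadr's MapReduce interface, computing a global maximum over the irisDdf object above (the map.values / reduce.values / collect() conventions are datadr's; the specifics are illustrative):

# map: emit each subset's maximum under a single key
maxMap <- expression({
  for (v in map.values)
    collect("max", max(v$Sepal.Length))
})
# reduce: fold the per-subset maxima into one global maximum
maxReduce <- expression(
  pre    = { globalMax <- NULL },
  reduce = { globalMax <- max(c(globalMax, unlist(reduce.values))) },
  post   = { collect(reduce.key, globalMax) }
)
maxRes <- mrExec(irisDdf, map = maxMap, reduce = maxReduce)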
» hdfsConn(), sparkDataConn(): connections to ddo / ddf objects persisted on a backend storage system
– ddo(): instantiate a ddo from a backend connection
– ddf(): instantiate a ddf from a backend connection
» Conversion methods between data stored on different backends
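A sketch of connecting to persisted data (the HDFS path is hypothetical):

# instantiate a ddf from data persisted on HDFS
housingConn <- hdfsConn("/user/me/housing")
housing <- ddf(housingConn)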
» divide() takes a ddf and splits it by columns in the data or randomly
» Division of ddos with arbitrary data structures must typically be done with custom MapReduce code (unless the data can be temporarily transformed into a ddf)
» Analytic methods are applied to a ddo/ddf with the addTransform function
» Recombinations are specified with recombine(), which provides standard combiner methods, such as combRbind, which binds the transformed results into a single data frame (see the sketch below)
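Putting these together, a sketch along the lines of the housing example that follows (assuming housing is already a ddf; the model is illustrative):

# divide by conditioning variables
byCounty <- divide(housing, by = c("county", "state"))
# analytic method: slope of median list price vs. time within each county
lmCoef <- function(x) coef(lm(medListPriceSqft ~ time, data = x))[2]
byCountySlope <- addTransform(byCounty, lmCoef)
# recombine: bind the per-county slopes into a single data frame
countySlopes <- recombine(byCountySlope, combRbind)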
» drLapply(): apply a function to each subset of a ddo/ddf and obtain a new ddo/ddf
» drJoin(): join multiple ddo/ddf objects by key
» drSample(): take a random sample of subsets of a ddo/ddf
» drFilter(): filter out subsets of a ddo/ddf that do not meet a specified criterion (example below)
» drSubset(): return a subset data frame of a ddf
» drRead.table() and friends
» mrExec(): run a traditional MapReduce job on a ddo/ddf
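For instance, dropping counties with no list-price data at all (a sketch reusing the byCounty division from above):

# keep only subsets with at least one non-missing list price
byCountyKept <- drFilter(byCounty, function(v) any(!is.na(v$medListPriceSqft)))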
   fips         county state       time nSold medListPriceSqft medSoldPriceSqft
1 06001 Alameda County    CA 2008-10-01    NA         307.9787         325.8118
2 06001 Alameda County    CA 2008-11-01    NA         299.1667               NA
3 06001 Alameda County    CA 2008-11-01    NA               NA         318.1150
4 06001 Alameda County    CA 2008-12-01    NA         289.8815         305.7878
5 06001 Alameda County    CA 2009-01-01    NA         288.5000         291.5977
6 06001 Alameda County    CA 2009-02-01    NA         287.0370               NA
$key
"county=Abbeville County|state=SC"

$value
   fips       time nSold medListPriceSqft medSoldPriceSqft
1 45001 2008-10-01    NA         73.06226               NA
2 45001 2008-11-01    NA         70.71429               NA
3 45001 2008-12-01    NA         70.71429               NA
4 45001 2009-01-01    NA         73.43750               NA
5 45001 2009-02-01    NA         78.69565               NA
...
> countySlopes <- recombine(byCountySlope, combRbind)
> head(countySlopes)
                county state           val
time  Abbeville County    SC -0.0002323686
time1    Acadia Parish    LA  0.0019518441
time2  Accomack County    VA -0.0092717711
time3       Ada County    ID -0.0030197554
time4     Adair County    IA -0.0308381951
time5     Adair County    KY  0.0034399585
» Many big-data systems use lazy evaluation
– Specify a series of computation steps, but don’t execute until a result is asked for
– The idea is that the resulting computation graph can be optimized
» In D&R, we (mostly) don’t do this
– Any divide, recombine, or function beginning with dr immediately kicks off a MapReduce job
– This is a deliberate choice, made for good reason
» Divisions can be accomplished with one MapReduce job, and they are meant to be persistent, so why not compute right away?
» Applying an analytic method and recombining is also one MapReduce job, and we want the result right away in this case as well
» So really, we don’t need lazy evaluation
» OK, there are a few cases where we string data operations together, e.g. divide followed by drFilter, etc.
– You could argue we should have lazy evaluation here
– Why not? Debugging!
» When an error occurs in a distributed job, we want to know:
– Which subset did the error come from?
– What was the environment like in the R instance running on the node where the error occurred?
– etc.
» One of the most common causes of bugs is specifying operations on data we have not yet seen, so we do not know exactly what its structure is (and we get it wrong)
» This is a major reason we don’t lazily evaluate sequences of commands
» One exception: applying a method to objects with addTransform is a lazily evaluated operation
– The transformation is noted and applied when the transformed object is computed on
– We can do this and still keep things simple
– A transformed object behaves in every way as if it has already been transformed
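A small sketch of this behavior, reusing the hypothetical byCounty division from earlier:

# nothing is computed here; the transform is only recorded
totalSold <- addTransform(byCounty, function(x) sum(x$nSold, na.rm = TRUE))
# the deferred transform runs when a result is requested
head(recombine(totalSold, combRbind))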
» D&R is a simple but powerful and scalable paradigm
» Think of D&R as turning one big-data problem into many small-data problems, which we can attack with the full arsenal of R
» MapReduce is sufficient for D&R, but it is not the same thing
» We strive to use methods that do not require iterative application of MapReduce
» Key-value pairs for storage provide the flexibility we need to deal with large, complex data
» Divisions are persistent (and expensive to compute) and should be well thought out
» A single data set can (and usually does) have multiple divisions
» Typically many recombinations are applied to a given division – recombinations are much faster to compute
» Open source: Apache license
» Google user group
» Start using it!
– If you have some applications in mind, give it a try!
– You don’t need big data or a cluster to use Tessera
– Ask us for help; let us help you showcase your work
– Give us feedback
» See the resources page at tessera.io
» Theoretical / methodological research – there’s plenty of fertile ground
» DARPA XDATA program
» U.S. Department of Homeland Security, Science and Technology Directorate
» Division of Math Sciences CDS&E Grant, National Science Foundation
» PNNL, operated by Battelle for the U.S. Department of Energy, LDRD Program, Signature Discovery and Future Power Grid Initiatives