analyzing complex data large and small
» Code is simple and consistent regardless of size
» Powered by the Divide and Recombine (D&R) statistical methodology
» Provides access to 1000s of statistical, machine learning, and visualization methods
» Detailed, flexible, scalable visualization with Trelliscope
http://tessera.io
» Back ends: R, Hadoop, Spark, etc.
» R <-> back-end bridges: RHIPE, SparkR, etc.
[Architecture diagram: datadr / trelliscope on top of a MapReduce interface over a key/value store, spanning the computation and storage layers]
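The point of the architecture above is that the front-end code stays the same while the connection object selects the back end. A minimal sketch, assuming the datadr package API (circa version 0.8) — `localDiskConn` can be swapped for `hdfsConn` to run on Hadoop via RHIPE:

```r
# Sketch, assuming the datadr API (~0.8): the same divide() call works
# whether the back end is memory, local disk, or HDFS -- only the
# connection object changes.
library(datadr)

# in-memory distributed data frame
irisDdf <- ddf(iris)

# local-disk back end; swap in hdfsConn("/some/hdfs/path") for Hadoop
diskConn <- localDiskConn(file.path(tempdir(), "bySpecies"), autoYes = TRUE)

bySpecies <- divide(irisDdf, by = "Species", output = diskConn)
bySpecies[["Species=setosa"]]  # retrieve one key/value pair (one subset)
```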
or faceting in ggplot
» Number of panels can be very large
» Panels can be interactively navigated through the use of cognostics
Demo: https://tessera.shinyapps.io/ny-demo
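A minimal sketch of a Trelliscope display, assuming the trelliscope and datadr package APIs (the display name and cognostic names here are illustrative): a panel function draws one plot per subset, and a cognostics function computes per-panel metrics that the viewer uses for sorting and filtering.

```r
# Sketch, assuming the trelliscope + datadr APIs; lattice draws the panels.
library(datadr)
library(trelliscope)
library(lattice)

vdbConn("vdb", autoYes = TRUE)   # visualization database that holds displays

bySpecies <- divide(ddf(iris), by = "Species")

# one plot per subset
panelFn <- function(x) xyplot(Sepal.Width ~ Sepal.Length, data = x)

# cognostics: per-panel metrics used to navigate a large number of panels
cogFn <- function(x) list(
  meanSL = cog(mean(x$Sepal.Length), desc = "mean sepal length"),
  n      = cog(nrow(x), desc = "number of observations")
)

makeDisplay(bySpecies, name = "sepal_by_species",
            panelFn = panelFn, cogFn = cogFn)
view()  # open the viewer; sort/filter panels by cognostics
```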
seek in visualization can be easily parameterized into a Trelliscope display via specification of data partitioning and cognostics
» But more transient interactivity can also be very useful (e.g. zoom/pan, tooltips, etc.)
» rbokeh: an R interface to the Bokeh plotting library (http://bokeh.pydata.org/)
http://hafen.github.io/rbokeh/
https://github.com/bokeh/rbokeh
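A minimal rbokeh sketch, assuming the package's `figure()` / `ly_points()` API — zoom and pan come with the default toolbar, and the `hover` argument adds tooltips:

```r
# Sketch, assuming the rbokeh API: transient interactivity
# (zoom/pan via the default toolbar, tooltips via hover).
library(rbokeh)

figure(width = 600, height = 400) %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
            color = Species,
            hover = c(Sepal.Length, Sepal.Width))  # tooltip on mouseover
```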
power grid measurements at several locations on the grid
» Exploratory analysis and statistical modeling helped identify, validate, and build precise algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
[Figure: frequency (59.998–60.003 Hz) vs. time (seconds) for power grid measurements]
about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
in familiar high-level statistical programming environment
» Have access to the 1000s of statistical, ML, and vis methods
» Minimize time thinking about code or distributed systems
» Maximize time thinking about the data
» Be able to analyze large complex data with nearly as much flexibility and ease as small data
divisions of the data
– Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
– Results are recombined to yield a statistically valid D&R result for the analytic method
» D&R is not the same as MapReduce (but makes heavy use of it)
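The divide / apply / recombine steps above can be sketched with datadr (assuming its API; the summary statistic is an illustrative choice):

```r
# Sketch of the D&R pattern, assuming the datadr API:
# divide, apply a method to each subset independently, recombine.
library(datadr)

bySpecies <- divide(ddf(iris), by = "Species")       # division

perSubset <- addTransform(bySpecies, function(x)     # applied per subset,
  data.frame(meanSL = mean(x$Sepal.Length)))         # embarrassingly parallel

recombine(perSubset, combRbind)                      # recombination
```

Note that `addTransform` is lazy: the per-subset computation is deferred until `recombine` triggers the (possibly distributed) MapReduce job.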
big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
» It is therefore natural to break the data up based on these dimensions and apply visual or analytical methods to the subsets individually
» We call this “conditioning variable” division
» It is in practice by far the most common thing we do (and it’s nothing new)
» Another option is “random replicate” division
method independently to each subset
– The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
» For conditioning-variable division:
– Typically the recombination depends on the subject matter
– Example: apply the same model to each subset, then combine the per-subset estimated coefficients and build a statistical model or visually study the resulting collection of coefficients
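The coefficient-recombination example above can be sketched as follows, assuming the datadr API (the model formula is illustrative): an ordinary `lm` fit is applied per subset, and the estimated coefficients are recombined into one data frame for further modeling or plotting.

```r
# Sketch, assuming the datadr API: fit the same small-data model to each
# subset, then recombine the per-subset coefficient estimates.
library(datadr)

bySpecies <- divide(ddf(iris), by = "Species")

coefFn <- function(x) {
  fit <- lm(Sepal.Length ~ Petal.Length, data = x)
  as.data.frame(t(coef(fit)))   # one row of coefficients per subset
}

coefs <- recombine(addTransform(bySpecies, coefFn), combRbind)
coefs   # per-species intercepts and slopes, ready to model or visualize
```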