Slide 1

Slide 1 text

Tessera

Slide 2

Slide 2 text

Joint work with: Bill Cleveland (Purdue); Saptarshi Guha (Mozilla); many other researchers at PNNL and Purdue. Funding from the DARPA XDATA Program (and others).

Slide 3

Slide 3 text

What is Tessera?
»  A high-level R interface for analyzing complex data, large and small
»  Code is simple and consistent regardless of size
»  Powered by the statistical methodology Divide and Recombine (D&R)
»  Provides access to 1000s of statistical, machine learning, and visualization methods
»  Detailed, flexible, scalable visualization with Trelliscope
http://tessera.io

Slide 4

Slide 4 text

Tessera
»  Front end: two R packages, datadr & trelliscope
»  Back ends: R, Hadoop, Spark, etc.
»  R <-> backend bridges: RHIPE, SparkR, etc.
[Diagram: datadr / trelliscope, Interface, Computation (MapReduce), Storage (Key/Value Store)]

Slide 5

Slide 5 text

Back End Agnostic Interface
[Diagram: datadr / trelliscope connect through a common interface to several computation and storage back ends]
»  Computation: R; Storage: Memory
»  Computation: SparkR / Spark; Storage: HDFS
»  Computation: RHIPE / Hadoop; Storage: HDFS
»  Computation: Multicore R; Storage: Local Disk
(under development)

Slide 6

Slide 6 text

Trelliscope
»  Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot
»  Number of panels can be very large
»  Panels can be interactively navigated through the use of cognostics
demo: https://tessera.shinyapps.io/ny-demo
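The navigation idea above can be sketched in a few lines: compute a handful of summary metrics (cognostics) for each panel, then sort and filter panels on those metrics. This is a plain-Python illustration only, not the Trelliscope API; the panel names, the data, and the `cognostics` helper are hypothetical.

```python
from statistics import mean

# Hypothetical data: one list of measurements per panel (e.g. per location).
panels = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [10.0, 10.5, 9.5],
    "site_c": [4.0, 8.0, 0.0],
}

def cognostics(values):
    """Per-panel summary metrics used to sort and filter panels."""
    return {"mean": mean(values), "range": max(values) - min(values), "n": len(values)}

cogs = {key: cognostics(values) for key, values in panels.items()}

# "Navigate" the display: keep panels whose range is at least 1.5,
# ordered from largest to smallest range.
selected = sorted(
    (key for key, cog in cogs.items() if cog["range"] >= 1.5),
    key=lambda key: cogs[key]["range"],
    reverse=True,
)
print(selected)
```

With thousands of panels, only the cognostics table needs to be searched to decide which panels to look at.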

Slide 7

Slide 7 text

rbokeh (tangent part I)
»  A lot of the interactivity we seek in visualization can be easily parameterized into a Trelliscope display via specification of data partitioning and cognostics
»  But more transient interactivity can also be very useful (e.g. zoom/pan, tooltips, etc.)
»  rbokeh: an R interface to the Bokeh plotting library (http://bokeh.pydata.org/)
http://hafen.github.io/rbokeh/
https://github.com/bokeh/rbokeh

Slide 8

Slide 8 text

htmlwidgets gallery (tangent part II)
While we’re on the subject of htmlwidgets… Take a look at all the great work people are doing on htmlwidgets:
http://hafen.github.io/htmlwidgetsgallery/

Slide 9

Slide 9 text

DEEP ANALYSIS OF

Slide 10

Slide 10 text

Example: Power Grid
»  2 TB data set of high-frequency power grid measurements at several locations on the grid
»  Exploratory analysis and statistical modeling helped identify, validate, and build precise algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
[Figure: Frequency (59.998 to 60.003) plotted against Time (seconds, 41 to 46)]

Slide 11

Slide 11 text

“Restricting one's self to planned analysis - failing to accompany it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey

Slide 12

Slide 12 text

DEEP ANALYSIS OF

Slide 13

Slide 13 text

“If [you have no] concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan

Slide 14

Slide 14 text

DEEP ANALYSIS OF

Slide 15

Slide 15 text

Systems that do not work

Slide 16

Slide 16 text

What we want to be able to do:
»  Work in a familiar high-level statistical programming environment
»  Have access to the 1000s of statistical, ML, and vis methods
»  Minimize time thinking about code or distributed systems
»  Maximize time thinking about the data
»  Be able to analyze large complex data with nearly as much flexibility and ease as small data

Slide 17

Slide 17 text

DIVIDE AND RECOMBINE

Slide 18

Slide 18 text

Divide and Recombine (D&R)
»  Simple idea:
   – specify meaningful, persistent divisions of the data
   – analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
   – results are recombined to yield a statistically valid D&R result for the analytic method
»  D&R is not the same as MapReduce (but makes heavy use of it)
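The three steps above can be sketched as follows. This is a minimal plain-Python illustration of the pattern, not the datadr API: the function names `divide`, `apply_method`, and `recombine`, and the sensor records, are hypothetical, and the recombination shown (collecting per-subset means into a table) is just one simple choice.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (sensor id, measurement).
records = [("s1", 60.0), ("s2", 59.0), ("s1", 61.0), ("s2", 58.0), ("s3", 60.5)]

def divide(rows, key_fn):
    """Divide: group rows into subsets keyed by key_fn(row)."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[key_fn(row)].append(row)
    return dict(subsets)

def apply_method(subsets, method):
    """Apply: run the same method on each subset independently.
    Each call touches only its own subset, so the loop could run in parallel."""
    return {key: method(rows) for key, rows in subsets.items()}

def recombine(results):
    """Recombine: here, simply collect per-subset results into a sorted table."""
    return sorted(results.items())

by_sensor = divide(records, key_fn=lambda row: row[0])
subset_means = apply_method(by_sensor, lambda rows: mean(v for _, v in rows))
table = recombine(subset_means)
print(table)
```

Because no subset needs to see any other subset, the apply step maps naturally onto a MapReduce-style back end, while the division and recombination choices remain statistical decisions.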

Slide 19

Slide 19 text

Divide and Recombine paper

Slide 20

Slide 20 text

How to Divide the Data?
»  Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
»  It is therefore natural to break the data up based on these dimensions and apply visual or analytical methods to the subsets individually
»  We call this “conditioning variable” division
»  It is in practice by far the most common thing we do (and it’s nothing new)
»  Another option is “random replicate” division
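The two kinds of division can be contrasted with a short sketch. It is plain Python rather than datadr code, and the row layout, function names, and the round-robin dealing used for the random replicates are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical rows: (location, value).
rows = [("north", i) for i in range(6)] + [("south", i) for i in range(4)]

def conditioning_division(rows, key_fn):
    """Group rows by the value of a conditioning variable."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[key_fn(row)].append(row)
    return dict(subsets)

def random_replicate_division(rows, n_subsets, seed=0):
    """Shuffle rows, then deal them round-robin into n_subsets subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

by_location = conditioning_division(rows, key_fn=lambda row: row[0])
replicates = random_replicate_division(rows, n_subsets=3)

print({key: len(subset) for key, subset in by_location.items()})
print([len(subset) for subset in replicates])
```

Conditioning-variable subsets follow the structure of the data (their sizes are whatever the data dictate), whereas random replicate subsets are roughly equal in size and each mixes rows from every location.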

Slide 21

Slide 21 text

Analytic Recombination
»  Analytic recombination begins with applying an analytic method independently to each subset
   –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
»  For conditioning-variable division:
   –  Typically the recombination depends on the subject matter
   –  Example: apply the same model to each subset, combine the subset estimated coefficients, and build a statistical model or visually study the resulting collection of coefficients
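The example in the last bullet might look like this in miniature: fit the same simple linear model to each subset and collect the coefficients for later modeling or plotting. The sketch is plain Python with a hand-written least-squares fit; the subject data and the `fit_line` helper are hypothetical and stand in for whatever small-data method is appropriate.

```python
from statistics import mean

# Hypothetical subsets: for each subject, a list of (x, y) observations.
subsets = {
    "subject_1": [(0, 1.0), (1, 3.0), (2, 5.0)],   # y = 1 + 2x
    "subject_2": [(0, 0.0), (1, 1.0), (2, 2.0)],   # y = x
    "subject_3": [(0, 4.0), (1, 3.0), (2, 2.0)],   # y = 4 - x
}

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return {"intercept": y_bar - slope * x_bar, "slope": slope}

# Apply the same model to every subset, then recombine into one collection.
coefficients = {key: fit_line(points) for key, points in subsets.items()}
for key, coef in sorted(coefficients.items()):
    print(key, coef)
```

The resulting collection of coefficients is itself a small data set, which can be modeled further or studied visually.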

Slide 22

Slide 22 text

Analytic Recombination
»  For random replicate division:
   –  Observations are seen as exchangeable, with no conditioning variables considered
   –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
   –  Results are often approximations
»  Approaches that fit this paradigm
   –  Coefficient averaging
   –  Subset likelihood modeling
   –  Bag of little bootstraps
   –  Consensus MCMC
   –  Alternating direction method of multipliers (ADMM)
[Figure: "Our Approach: BLB" schematic. The data $X_1, \dots, X_n$ are divided into $s$ subsamples $\check{X}^{(j)}_1, \dots, \check{X}^{(j)}_{b(n)}$, $j = 1, \dots, s$. Each subsample is resampled $r$ times to size $n$, giving $X^{*(k)}_1, \dots, X^{*(k)}_n$ with estimates $\hat{\theta}^{*(k)}_n$, $k = 1, \dots, r$. These are summarized per subsample as $\xi(\hat{\theta}^{*(1)}_n, \dots, \hat{\theta}^{*(r)}_n) = \xi^*_j$, and the final result is $\mathrm{avg}(\xi^*_1, \dots, \xi^*_s)$.]
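As an illustration of one approach from the list, here is a small sketch of the bag of little bootstraps used to estimate the standard error of a sample mean. It is a plain-Python sketch under stated assumptions, not code from the talk or from datadr: the function name, the defaults `s=5`, `r=50`, and `gamma=0.7` (subsample size $b(n) = n^{\gamma}$), and the use of the standard deviation as the quality measure $\xi$ are all choices made for this example.

```python
import random
from statistics import mean, stdev

def blb_standard_error(data, estimator, s=5, r=50, gamma=0.7, seed=0):
    """Bag of little bootstraps estimate of the standard error of estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    b = int(n ** gamma)                      # subsample size b(n) < n
    xis = []
    for _ in range(s):
        subsample = rng.sample(data, b)      # b points drawn without replacement
        estimates = []
        for _ in range(r):
            # Resample n points (with replacement) from the b-point subsample.
            resample = rng.choices(subsample, k=n)
            estimates.append(estimator(resample))
        xis.append(stdev(estimates))         # xi: quality measure for this subsample
    return mean(xis)                         # average xi over the s subsamples

# Simulated standard normal data (an assumption for the demo).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
se = blb_standard_error(data, mean)
print(round(se, 4))
```

For 1000 standard normal observations the true standard error of the mean is about $1/\sqrt{1000} \approx 0.032$, so the printed value should land in that neighborhood. Each subsample is processed independently, which is what makes the method fit the D&R pattern.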

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content