Tessera - NY Open Statistical Programming Meetup

hafen
July 30, 2015

Transcript

  1. Tessera

  2. Joint work with:
    »  Bill Cleveland, Purdue
    »  Saptarshi Guha, Mozilla
    »  Many other researchers at PNNL and Purdue
    Funding from DARPA XDATA Program (and others)
  3. What is Tessera?
    »  A high level R interface for analyzing complex data large and small
    »  Code is simple and consistent regardless of size
    »  Powered by statistical methodology Divide and Recombine (D&R)
    »  Provides access to 1000s of statistical, machine learning, and visualization methods
    »  Detailed, flexible, scalable visualization with Trelliscope
    http://tessera.io
  4. Tessera
    »  Front end: two R packages, datadr & trelliscope
    »  Back ends: R, Hadoop, Spark, etc.
    »  R <-> backend bridges: RHIPE, SparkR, etc.
    [Diagram labels: datadr / trelliscope; Interface; Computation: MapReduce; Storage: Key/Value Store]
  5. Back End Agnostic Interface
    [Diagram: datadr / trelliscope over a common interface, with computation / storage pairs below; one pair is marked "(under development)"]
    »  Computation: R; Storage: Memory
    »  Computation: SparkR / Spark; Storage: HDFS
    »  Computation: RHIPE / Hadoop; Storage: HDFS
    »  Computation: Multicore R; Storage: Local Disk
  6. Trelliscope
    »  Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot
    »  Number of panels can be very large
    »  Panels can be interactively navigated through the use of cognostics
    demo: https://tessera.shinyapps.io/ny-demo
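To make the idea of navigating panels by cognostics concrete, here is a minimal sketch in plain Python rather than the trelliscope R API. It treats cognostics as per-panel summary metrics and uses them to decide which panels to look at first. The panel keys, the chosen metrics, and the helper names `cognostics` and `top_panels` are all hypothetical, made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical data: one entry per panel (e.g. one per site), each a short series.
panels = {
    "site_A": [1.0, 1.2, 0.9, 1.1],
    "site_B": [5.0, 0.1, 9.7, 2.2],
    "site_C": [3.0, 3.0, 3.1, 2.9],
}

def cognostics(values):
    """Compute a few summary metrics for one panel's data."""
    return {
        "mean": mean(values),
        "sd": pstdev(values),
        "range": max(values) - min(values),
    }

# One row of cognostics per panel, keyed by panel.
cogs = {key: cognostics(values) for key, values in panels.items()}

def top_panels(cogs, metric, n):
    """Return the keys of the n panels with the largest value of `metric`."""
    return sorted(cogs, key=lambda k: cogs[k][metric], reverse=True)[:n]

# Look first at the panel with the most variability.
most_variable = top_panels(cogs, "sd", 1)
```

With thousands of panels, the same table of metrics could drive sorting and filtering so that only a manageable number of panels is viewed at a time.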
  7. rbokeh (tangent part I)
    »  A lot of interactivity we seek in visualization can be easily parameterized into a Trelliscope display via specification of data partitioning and cognostics
    »  But more transient interactivity can also be very useful (e.g. zoom/pan, tooltips, etc.)
    »  rbokeh: an R interface to the Bokeh plotting library (http://bokeh.pydata.org/)
    http://hafen.github.io/rbokeh/
    https://github.com/bokeh/rbokeh
  8. htmlwidgets gallery (tangent part II)
    While we’re on the subject of htmlwidgets… Take a look at all the great work people are doing on htmlwidgets: http://hafen.github.io/htmlwidgetsgallery/
  9. DEEP ANALYSIS OF

  10. Example: Power Grid
    »  2 TB data set of high-frequency power grid measurements at several locations on the grid
    »  Exploratory analysis and statistical modeling helped identify, validate, and build precise algorithms to filter out several types of bad data that had gone unnoticed in several prior analyses (~20% bad data!)
    [Figure: Frequency vs. Time (seconds)]
  11. “Restricting one's self to planned analysis - failing to accompany it with exploration - loses sight of the most interesting results too frequently to be comfortable.” – John Tukey
  12. DEEP ANALYSIS OF

  13. “If [you have no] concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re a statistician – then … there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.” – Michael Jordan
  14. DEEP ANALYSIS OF

  15. Systems that do not work

  16. What we want to be able to do:
    »  Work in familiar high-level statistical programming environment
    »  Have access to the 1000s of statistical, ML, and vis methods
    »  Minimize time thinking about code or distributed systems
    »  Maximize time thinking about the data
    »  Be able to analyze large complex data with nearly as much flexibility and ease as small data
  17. DIVIDE AND RECOMBINE

  18. Divide and Recombine (D&R)
    »  Simple idea:
      –  specify meaningful, persistent divisions of the data
      –  analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
      –  Results are recombined to yield a statistically valid D&R result for the analytic method
    »  D&R is not the same as MapReduce (but makes heavy use of it)
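The three steps on this slide can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not the datadr API: the function names `divide`, `apply_to_subsets`, and `recombine` and the toy records are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

def divide(records, key_fn):
    """Divide: group records into subsets keyed by key_fn(record)."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key_fn(rec)].append(rec)
    return dict(subsets)

def apply_to_subsets(subsets, method):
    """Apply: run the same method on every subset independently.
    Each call touches only its own subset, so this step is embarrassingly
    parallel and could be handed to any parallel back end."""
    return {key: method(subset) for key, subset in subsets.items()}

def recombine(results, combiner):
    """Recombine: merge the per-subset outputs into one result."""
    return combiner(list(results.values()))

# Hypothetical (sensor, frequency) readings.
records = [
    ("s1", 60.001), ("s1", 59.999),
    ("s2", 60.003), ("s2", 60.001), ("s2", 59.998),
]

subsets = divide(records, key_fn=lambda rec: rec[0])
subset_means = apply_to_subsets(subsets, lambda sub: mean(v for _, v in sub))
overall = recombine(subset_means, mean)
```

Note that the mean of subset means equals the overall mean only when subsets have equal size. Choosing a recombination that is statistically valid for the method at hand is the part of D&R that needs care.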
  19. Divide and Recombine paper

  20. How to Divide the Data?
    »  Typically “big data” is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
    »  It is therefore natural to break the data up based on these dimensions and apply visual or analytical methods to the subsets individually
    »  We call this “conditioning variable” division
    »  It is in practice by far the most common thing we do (and it’s nothing new)
    »  Another option is “random replicate” division
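The two division schemes named on this slide differ only in how rows are assigned to subsets. The sketch below contrasts them in plain Python. The function names, the `sensor` field, and the toy rows are hypothetical, and the shuffle-then-split rule is just one simple way to form random replicates.

```python
import random
from collections import defaultdict

def conditioning_division(rows, variable):
    """One subset per distinct value of a conditioning variable."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[variable]].append(row)
    return dict(subsets)

def random_replicate_division(rows, n_subsets, seed=0):
    """Shuffle the rows, then deal them into n_subsets near-equal subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

# Hypothetical rows: 12 readings from 3 sensors.
rows = [{"sensor": f"s{i % 3}", "value": float(i)} for i in range(12)]

by_sensor = conditioning_division(rows, "sensor")     # subsets mean something
replicates = random_replicate_division(rows, n_subsets=4)  # subsets are arbitrary
```

In the first case each subset corresponds to a subject-matter unit (here a sensor). In the second, subsets carry no meaning of their own and exist only to make computation tractable.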
  21. Analytic Recombination
    »  Analytic recombination begins with applying an analytic method independently to each subset
      –  The beauty of this is that we can use any of the small-data methods we have available (think of the 1000s of methods in R)
    »  For conditioning-variable division:
      –  Typically the recombination depends on the subject matter
      –  Example: apply the same model to each subset and combine the subset estimated coefficients and build a statistical model or visually study the resulting collection of coefficients
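The example in the last bullet (same model per subset, then study the collection of coefficients) might look like the following sketch. It uses a hand-written simple linear regression so that it has no dependencies. The subset keys and data are invented for illustration, and `fit_line` is a hypothetical helper, not a Tessera function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical subsets from a conditioning-variable division (one per location).
subsets = {
    "loc_A": ([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0]),
    "loc_B": ([0, 1, 2, 3], [0.0, 0.5, 1.0, 1.5]),
    "loc_C": ([0, 1, 2, 3], [2.0, 1.0, 0.0, -1.0]),
}

# Apply the same model to every subset independently.
coefs = {key: fit_line(xs, ys) for key, (xs, ys) in subsets.items()}

# Recombine: gather the slopes into one collection that can then be
# modeled or plotted, here simply ordered from smallest to largest.
slopes = sorted((slope, key) for key, (intercept, slope) in coefs.items())
```

The recombined object is itself a small data set (one row of coefficients per subset), which is why ordinary small-data tools apply to it.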
  22. Analytic Recombination
    »  For random replicate division:
      –  Observations are seen as exchangeable, with no conditioning variables considered
      –  Division methods are based on statistical matters, not the subject matter as in conditioning-variable division
      –  Results are often approximations
    »  Approaches that fit this paradigm
      –  Coefficient averaging
      –  Subset likelihood modeling
      –  Bag of little bootstraps
      –  Consensus MCMC
      –  Alternating direction method of multipliers (ADMM)
    [Figure: "Our Approach: BLB". The data $X_1, \dots, X_n$ are divided into $s$ subsets $\check{X}^{(j)}_1, \dots, \check{X}^{(j)}_{b(n)}$, $j = 1, \dots, s$. Each subset yields $r$ resamples $X^{*(k)}_1, \dots, X^{*(k)}_n$ with estimates $\hat{\theta}^{*(k)}_n$, $k = 1, \dots, r$, summarized as $\xi(\hat{\theta}^{*(1)}_n, \dots, \hat{\theta}^{*(r)}_n) = \xi^*_j$. The final result is $\mathrm{avg}(\xi^*_1, \dots, \xi^*_s)$.]
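As a rough illustration of the bag of little bootstraps listed above, the sketch below estimates the standard error of a sample mean. It follows the general shape of the method: $s$ subsets of size $b(n)$, $r$ resamples of size $n$ from each subset, a quality measure $\xi$ per subset, then an average. The choices here are assumptions for the example, not taken from the slides: the estimator is the mean, $\xi$ is the standard deviation of the resampled estimates, $b(n) = n^{0.7}$, and the function name `blb_standard_error` is hypothetical.

```python
import random
from statistics import mean, stdev

def blb_standard_error(data, s=5, r=50, gamma=0.7, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))          # subset size b(n), much smaller than n
    xis = []
    for _ in range(s):
        # Draw one small subset without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(r):
            # Resample n points (not b) with replacement from the subset,
            # so each estimate behaves like one computed on a full-size sample.
            resample = rng.choices(subset, k=n)
            estimates.append(sum(resample) / n)
        # Quality measure for this subset: spread of the resampled estimates.
        xis.append(stdev(estimates))
    # Recombine by averaging the per-subset quality measures.
    return mean(xis)

# Hypothetical data: 1000 draws from a standard normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
se = blb_standard_error(data)
```

For this toy input the result should be in the neighborhood of $1/\sqrt{1000} \approx 0.032$, the theoretical standard error of the mean. Each subset is processed independently, which is what makes the approach fit the D&R pattern.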