Mining Big Data Streams: Better Algorithms or Faster Systems?

The rate at which the world produces data is growing steadily, thus creating ever larger streams of continuously evolving data. However, current (de-facto standard) solutions for big data analysis are not designed to mine evolving streams. So, should we find better algorithms to mine data streams, or should we focus on building faster systems?

In this talk, we debunk this false dichotomy between algorithms and systems, and we argue that the data mining and distributed systems communities need to work together to bring about the next revolution in data analysis. In doing so, we introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza.

As a case study, we present one of SAMOA's main algorithms for classification, the Vertical Hoeffding Tree (VHT). Then, we analyze the algorithm from a distributed systems perspective, highlight the issue of load balancing, and describe a generalizable solution to it. Finally, we conclude by envisioning system-algorithm co-design as a promising direction for the future of big data analytics.

Transcript

  1. Mining Big Data Streams Better Algorithms or Faster Systems? 


    
 Gianmarco De Francisci Morales
 gdfm@acm.org
 @gdfm7
  2. Vision: Algorithms & Systems. A distributed stream mining platform: a development and

    collaboration framework for researchers, and a library of state-of-the-art algorithms
 for practitioners 2
  3. Agenda SAMOA
 (Scalable Advanced Massive Online Analysis) VHT
 (Vertical Hoeffding

    Tree) PKG
 (Partial Key Grouping) 3 System Algorithm API
  4. Visiting Scientist 
 @Aalto DMG Scientist @Yahoo Labs PPMC @

    Apache SAMOA Committer @ Apache Pig Contributor for Hadoop, 
 Giraph, Storm, S4, Grafos.ml 4
  5. What do I work on? Systems Distributed Mining News Streaming

    Grid Admin —2008 —2009 —2010 —2011 —2012 —2013 —2014 —2015 • IMT Lucca • M.Eng • Y!R Barcelona • PhD 5 PhD Student Postdoc Scientist
  6. “Panta rhei”
 (everything flows) -Heraclitus 6

  7. Importance Example: spam detection in comments

    on Yahoo News Trends change in time Need to retrain the model with new data 7
  8. Stream Batch data is a snapshot of streaming data 8

  9. Challenges Operational Need to rerun the pipeline and redeploy the

    model when new data arrives Paradigmatic New data lies in storage without generating new value until new model is retrained 9 Gather Clean Model Deploy
  10. Present of big data Too big to handle 10

  11. Future of big data Drinking from a firehose 11

  12. Evolution of SPEs 12 —2003 —2004 —2005 —2006 —2008 —2010

    —2011 —2013 Aurora STREAM Borealis SPC SPADE Storm S4 1st generation 2nd generation 3rd generation Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003 Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004. Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05 Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06 Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08 Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10 http://storm-project.net Samza http://samza.incubator.apache.org
  13. Actor Model 13 PE PE Input Stream PEI PEI PEI

    PEI PEI Output Stream Event routing
  14. Paradigm Shift 14 Gather Clean Model Deploy + =

  15. Apache SAMOA Scalable Advanced Massive Online Analysis
 G. De Francisci

    Morales, A. Bifet
 JMLR 2015 15
  16. Taxonomy 16 Data Mining Distributed Batch Hadoop Mahout Stream Storm,

    S4, Samza SAMOA Non Distributed Batch R, WEKA, … Stream MOA
  17. What about Mahout? SAMOA = Mahout for streaming But… More

    than JBoA (just a bunch of algorithms) Provides a common platform Easy to port to new computing engines 17
  18. Architecture 18

  19. Status Parallel algorithms Classification (Vertical Hoeffding Tree) Clustering (CluStream)

    Regression (Adaptive Model Rules) Execution engines 19 https://samoa.incubator.apache.org
  25. Is SAMOA useful for you? Only if you need to

    deal with: Large fast data Evolving process (model updates) What is happening now? Use feedback in real-time Adapt to changes faster 20
  26. Advantages (operational) Avoid deploy cycle No need to choose update

    frequency No system downtime No complex backup/update procedures Program once, run everywhere Reuse existing computational infrastructure 21
  27. Advantages (paradigmatic) Model freshness Immediate data value No stream/batch impedance

    mismatch 22
  28. Groupings Key Grouping
 (hashing) Shuffle Grouping
 (round-robin) All Grouping
 (broadcast) [Diagram: PEs and their PEI instances] 23
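The three groupings on the slide can be sketched as routing functions. This is an illustrative sketch (not SAMOA or Storm code), assuming `n` parallel PE instances addressed by integer index:

```python
# Sketch of the three stream groupings: key grouping (hashing),
# shuffle grouping (round-robin), and all grouping (broadcast).

def key_grouping(key, n):
    """Key grouping: hash the key, so the same key always reaches the same PEI."""
    return hash(key) % n

def shuffle_grouping(counter, n):
    """Shuffle grouping: round-robin over PEIs, ignoring the key."""
    return counter % n

def all_grouping(n):
    """All grouping: broadcast the event to every PEI."""
    return list(range(n))
```

Key grouping gives locality per key (needed for stateful operators) but inherits the key distribution's skew; shuffle grouping balances load perfectly but spreads each key's state across workers.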
  38. VHT Vertical Hoeffding Tree
 A. Murdopo, A. Bifet, G. De

    Francisci Morales, N. Kourtellis
 (under submission) 27
  39. Decision Tree Nodes are tests on attributes Branches are possible

    outcomes Leaves are class assignments
 
 28 Class Instance Attributes Road Tested? Mileage? Age? No Yes High ✅ ❌ Low Old Recent ✅ ❌ Car deal?
  40. Hoeffding Tree Sample of stream enough for near optimal decision

    Estimate merit of alternatives from prefix of stream Choose sample size based on statistical principles When to expand a leaf? Let x1 be the most informative attribute,
 x2 the second most informative one Hoeffding bound: split if G(x1, x2) > ε = sqrt(R² ln(1/δ) / (2n)) 29 P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
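The split test above is easy to make concrete. A minimal sketch, where `g_x1` and `g_x2` stand for the estimated merits G of the two best attributes, `R` is the range of the merit function, `delta` the confidence parameter, and `n` the number of observed instances:

```python
import math

def hoeffding_bound(R, delta, n):
    """Hoeffding bound: eps = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_x1, g_x2, R, delta, n):
    """Expand the leaf if the merit gap between the best and
    second-best attribute exceeds the Hoeffding bound."""
    return (g_x1 - g_x2) > hoeffding_bound(R, delta, n)
```

Note how the bound shrinks as n grows: with enough instances, even a small merit gap suffices to decide the split with confidence 1 − δ.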
  41. Parallel Decision Trees Which kind of parallelism? Task Data Horizontal

    Vertical 30 Data Attributes Instances
  47. Horizontal Parallelism Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel

    Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010 31 Stats Stats Stats Stream Histograms Model Instances Model Updates A single attribute is tracked in multiple nodes, so aggregation is needed to compute splits 31
  56. Hoeffding Tree Profiling 32 Other 6% Split 24% Learn 70%

    CPU time for training
 100 nominal and 100 numeric attributes
  57. Vertical Parallelism 33 Single attribute tracked in a single node Stats

    Stats Stats Stream Model Attributes Splits
  67. Advantages of Vertical High number of attributes => high level

    of parallelism
 (e.g., documents) Vs task parallelism Parallelism observed immediately Vs horizontal parallelism Reduced memory usage (no model replication) Parallelized split computation 34
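The core of vertical parallelism is partitioning an instance by attribute rather than by instance. A hypothetical sketch (names are illustrative, not SAMOA's API) of routing each attribute to a fixed stats worker by its index:

```python
# Vertical parallelism sketch: shard one instance's attributes across
# n_workers stats nodes, so each attribute is always tracked by the
# same single node (no model replication, split computation in parallel).

def route_attributes(instance, n_workers):
    """Return, per worker, the slice {attribute_index: value} it receives."""
    shards = {w: {} for w in range(n_workers)}
    for idx, value in enumerate(instance):
        shards[idx % n_workers][idx] = value
    return shards
```

Because attribute index i always maps to worker i % n_workers, the per-attribute statistics for the Hoeffding test never need to be merged across nodes.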
  68. Vertical Hoeffding Tree 35 Control Split Result Source (n) Model

    (n) Stats (n) Evaluator (1) Instance Stream Shuffle Grouping Key Grouping All Grouping
  69. Accuracy 36 [Plot: number of leaf nodes, VHT2 vs. tree-100] Very

    close and very high accuracy
  70. Performance 37 [Chart: execution time (seconds) breakdown — t_calc, t_comm, t_serial — for MHT

    vs. VHT2-par-3] Classifier profiling results for text-10000 with 100000 instances Throughput VHT2-par-3: 2631 inst/sec MHT: 507 inst/sec
  71. PKG Partial Key Grouping
 M. A. Uddin Nasir, G. De

    Francisci Morales, D. Garcia-Soriano, N. Kourtellis, 
 M. Serafini, “The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines”, ICDE 2015 38
  72. Systems Challenges Skewed key distribution [Plot: CCDF of key

    frequency for words in tweets and Wikipedia links] 39
  73. Key Grouping and Skew 40 Source Source Worker Worker Worker

    Stream
  74. Problem Statement Input stream of messages m = ⟨t, k, v⟩ Partitioning function P_t : K → ℕ Load of worker i ∈ W: L_i(t) = |{⟨τ, k, v⟩ : P_τ(k) = i ∧ τ ≤ t}| Imbalance

    of the system: I(t) = max_i(L_i(t)) − avg_i(L_i(t)), for i ∈ W Goal: partitioning function that minimizes imbalance 41
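The imbalance measure is a one-liner once the per-worker loads L_i(t) are known. A minimal sketch:

```python
def imbalance(loads):
    """I(t) = max_i L_i(t) - avg_i L_i(t), given per-worker load counts."""
    return max(loads) - sum(loads) / len(loads)
```

For example, loads of [4, 2, 2, 0] give an imbalance of 2: the busiest worker carries two more messages than the average.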
  75. Shuffle Grouping 42 Source Source Worker Worker Stream Aggr. Worker

  76. Existing Stream Partitioning Key Grouping Memory and communication efficient :)

    Load imbalance :( Shuffle Grouping Load balance :) Additional memory and aggregation phase :( 43
  77. Solution 1: Rebalancing At regular intervals move keys around workers

    Issues How often? Which keys to move? Key migration not supported with Storm/Samza API Many large routing tables (consistency and state) Hard to implement 44
  78. Solution 2: PoTC Balls and bins problem For each ball,

    pick two bins uniformly at random Put the ball in the least loaded one Issues Consensus and state to remember choice Load information in distributed system 45
  79. Solution 3: PKG Fully distributed adaptation of PoTC, handles skew

    Consensus and state to remember choice → Key splitting:
 assign each key independently with PoTC Load information in distributed system → Local load estimation:
 estimate worker load locally at each source 46
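Putting key splitting and local load estimation together, PKG at a single source can be sketched as follows (an illustrative sketch, not the Storm/SAMOA API): each key has two candidate workers given by two hash functions, and each message goes to whichever candidate this source has sent fewer messages to so far.

```python
# Sketch of Partial Key Grouping at one source: two hash choices per
# key (key splitting) plus a local, per-source load estimate instead
# of global load information or consensus.

class PartialKeyGrouping:
    def __init__(self, n_workers):
        self.n = n_workers
        self.load = [0] * n_workers  # local estimate of messages sent per worker

    def _hash(self, key, seed):
        return hash((seed, key)) % self.n

    def route(self, key):
        a, b = self._hash(key, 0), self._hash(key, 1)
        chosen = a if self.load[a] <= self.load[b] else b
        self.load[chosen] += 1
        return chosen
```

Because each key is split over at most two workers, downstream aggregation stays O(1) per key, unlike shuffle grouping's O(W).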
  80. Power of Both Choices 47 Source Source Worker Worker Stream

    Aggr. Worker
  81. Chromatic Balls and Bins Throw m balls with k colors in n bins with

    d choices Ball = msg, bin = worker, color = key, choice = hash Necessary condition: p₁ ≤ d/n Imbalance: I(m) = O((m/n) · ln n / ln ln n) if d = 1; O(m/n) if d ≥ 2 48
  82. Comparison 49 Stream Grouping Pros Cons Key Grouping Memory efficient

    Load imbalance Shuffle Grouping Load balance Memory overhead Aggregation O(W) Partial Key Grouping Memory efficient Load balance Aggregation O(1)
  83. Experimental Design What is the effect of key splitting? How

    does local estimation compare to a global oracle? How does PKG perform in a real system? Measures: imbalance, throughput, latency, memory Datasets: Twitter, Wikipedia, graphs, synthetic 50
  84. Effect of Key Splitting 51 Average Imbalance 0% 1% 2%

    3% 4% Number of workers 5 10 50 100 PKG Off-Greedy PoTC KG
  85. Local vs Global 52 [Fig. 2: Fraction of average imbalance with

    respect to total number of messages, for each dataset (TW, WP, CT, LN) and different numbers of workers and sources]
  86. Throughput vs Memory 53 [Fig. 5: (a) Throughput for PKG, SG and

    KG for different CPU delays. (b) Throughput for PKG and SG vs. average memory for different aggregation periods]
  87. Latency 54 In the second experiment, we fix the CPU

    delay to 0.4ms per key, as it seems to be the limit of saturation for KG. Table 4: Complete latency per message (ms) for different techniques, CPU delays D (ms) and aggregation periods T (s):
          D=0.1  D=0.5  D=1     T=10   T=30   T=60
    PKG   3.81   6.24   11.01   6.93   6.79   6.47
    SG    3.66   6.11   10.82   7.01   6.75   6.58
    KG    3.65   9.82   19.35
  88. Impact Open source https://github.com/gdfm/partial-key-grouping Integrated in Apache Storm 0.10 (STORM-632,

    STORM-637) Plan to integrate it in Samza 55
  89. Conclusions Mining big data streams is an open field Needs

    collaboration between 
 algorithms and systems communities SAMOA: a platform for mining big data streams And for collaboration on distributed stream mining Algorithm-system co-design Promising future direction 56 System Algorithm API
  90. Future Work Algorithms Lift assumptions of ideal systems Systems New

    primitives targeted to mining algorithms Impact Open-source involvement with ASF 57
  91. Thanks! 58 https://samoa.incubator.apache.org @ApacheSAMOA @gdfm7 gdfm@acm.org