Slide 1

Slide 1 text

Big Data Streams The Next Frontier 
 Gianmarco De Francisci Morales
 Aalto University, Helsinki
 [email protected]
 @gdfm7

Slide 2

Slide 2 text

2 The Frontier

Slide 3

Slide 3 text

Vision
 Algorithms & Systems
 Distributed stream mining platform
 Development and collaboration framework for researchers
 Library of state-of-the-art algorithms for practitioners

Slide 4

Slide 4 text

Full Stack
 SAMOA (Scalable Advanced Massive Online Analysis)
 VHT (Vertical Hoeffding Tree) + EVL (Online Evaluation)
 PKG (Partial Key Grouping)
 Layers: System, Algorithm, API

Slide 5

Slide 5 text

“Panta rhei”
 (everything flows) -Heraclitus 5

Slide 6

Slide 6 text

Importance
 Example: spam detection in comments on Yahoo News
 Trends change in time
 Need to retrain model with new data

Slide 7

Slide 7 text

Stream Batch data is a snapshot of streaming data 7

Slide 8

Slide 8 text

Present of big data Too big to handle 8

Slide 9

Slide 9 text

Future of big data Drinking from a firehose 9

Slide 10

Slide 10 text

Evolution of SPEs
 Timeline from 2003 to 2014, spanning 1st, 2nd and 3rd generation engines
 Aurora: Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
 STREAM: Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
 Borealis: Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
 SPC: Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
 SPADE: Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
 S4: Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
 Storm: http://storm-project.net
 Samza: http://samza.apache.org
 Flink: http://flink.apache.org

Slide 11

Slide 11 text

Actor Model
 [Figure labels: Input Stream, PE (x2), PEI (x5), Event routing, Output Stream]

Slide 12

Slide 12 text

Paradigm Shift
 Gather, Clean, Model, Deploy

Slide 13

Slide 13 text

System Algorithm API Apache SAMOA Scalable Advanced Massive Online Analysis
 G. De Francisci Morales, A. Bifet
 JMLR 2015 13

Slide 14

Slide 14 text

Taxonomy
 Data Mining
  Distributed
   Batch (Hadoop): Mahout
   Stream (Storm, S4, Samza): SAMOA
  Non Distributed
   Batch: R, WEKA, …
   Stream: MOA

Slide 15

Slide 15 text

Architecture
 [Figure: SAMOA architecture]

Slide 16

Slide 16 text

Status
 https://samoa.incubator.apache.org

Slide 21

Slide 21 text

Status
 Parallel algorithms
  Classification (Vertical Hoeffding Tree)
  Clustering (CluStream)
  Regression (Adaptive Model Rules)
 Execution engines
 https://samoa.incubator.apache.org

Slide 22

Slide 22 text

Is SAMOA useful for you?
 Only if you need to deal with:
  Large fast data
  Evolving process (model updates)
 What is happening now?
  Use feedback in real-time
  Adapt to changes faster

Slide 23

Slide 23 text

Groupings
 Key Grouping (hashing)
 Shuffle Grouping (round-robin)
 All Grouping (broadcast)
 [Figure labels: PE (x2), PEI (x4)]
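The three groupings can be sketched as routing functions that map an event to the indices of the destination PEIs. This is a minimal illustration, not SAMOA's or Storm's API: the names `key_grouping`, `ShuffleGrouping` and `all_grouping` are hypothetical, and CRC32 is an assumed stand-in for whatever hash function an engine actually uses.

```python
import zlib


def key_grouping(key: str, num_instances: int) -> list[int]:
    """Hashing: every event with the same key goes to the same instance."""
    return [zlib.crc32(key.encode("utf-8")) % num_instances]


class ShuffleGrouping:
    """Round-robin: events are spread evenly, regardless of their key."""

    def __init__(self, num_instances: int) -> None:
        self.num_instances = num_instances
        self._next = 0

    def route(self) -> list[int]:
        target = self._next
        self._next = (self._next + 1) % self.num_instances
        return [target]


def all_grouping(num_instances: int) -> list[int]:
    """Broadcast: every instance receives a copy of the event."""
    return list(range(num_instances))


if __name__ == "__main__":
    shuffle = ShuffleGrouping(4)
    print(key_grouping("user-42", 4))          # same index on every call
    print([shuffle.route() for _ in range(5)])  # cycles through 0..3
    print(all_grouping(4))
```

Key grouping keeps all state for a key in one place, which is why it is sensitive to skewed key distributions; shuffle grouping balances load but scatters that state.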

Slide 33

Slide 33 text

VHT Vertical Hoeffding Tree
 A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis
 (under submission) 22 System Algorithm API

Slide 34

Slide 34 text

Decision Tree
 Nodes are tests on attributes
 Branches are possible outcomes
 Leaves are class assignments
 [Figure: example tree for the class "Car deal?", with tests on the attributes Road Tested? (No / Yes), Mileage? (High / Low) and Age? (Old / Recent)]

Slide 35

Slide 35 text

Hoeffding Tree
 Sample of stream enough for near optimal decision
 Estimate merit of alternatives from prefix of stream
 Choose sample size based on statistical principles
 When to expand a leaf?
 Let x1 be the most informative attribute, x2 the second most informative one
 Hoeffding bound: split if
 $\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
 P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
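The split rule can be checked numerically. The sketch below assumes R is the range of the merit function G (1 for information gain with two classes), delta is the allowed error probability, and n is the number of instances seen at the leaf; the function names are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best: float, g_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the merit gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
    print(f"epsilon after 1000 instances: {eps:.4f}")
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))  # wide gap
    print(should_split(0.30, 0.25, 1.0, 1e-7, 1000))  # narrow gap
```

Because epsilon shrinks as 1/sqrt(n), a leaf with two nearly tied attributes simply waits for more instances before splitting.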

Slide 36

Slide 36 text

Parallel Decision Trees 25

Slide 41

Slide 41 text

Parallel Decision Trees
 Which kind of parallelism?
  Task
  Data
   Horizontal (partition the instances)
   Vertical (partition the attributes)

Slide 42

Slide 42 text

Horizontal Parallelism
 [Figure labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates]
 Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010

Slide 49

Slide 49 text

Horizontal Parallelism
 Single attribute tracked in multiple nodes

Slide 50

Slide 50 text

Horizontal Parallelism
 Aggregation to compute splits

Slide 51

Slide 51 text

Hoeffding Tree Profiling
 CPU time for training (100 nominal and 100 numeric attributes)
  Learn 70%
  Split 24%
  Other 6%

Slide 52

Slide 52 text

Vertical Parallelism
 [Figure labels: Stream, Model, Attributes, Stats (x3), Splits]

Slide 61

Slide 61 text

Vertical Parallelism
 Single attribute tracked in single node
 [Figure labels: Stream, Model, Attributes, Stats (x3), Splits]
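A rough sketch of the idea that each attribute is tracked in exactly one statistics node: the model decomposes an instance into attribute values and routes each one to a fixed node. The slides do not say how attributes are assigned to nodes; the modulo assignment below and the names `LocalStats` and `route_instance` are assumptions made for illustration only.

```python
from collections import defaultdict


class LocalStats:
    """Holds class counts for the attributes assigned to this node."""

    def __init__(self) -> None:
        # (leaf_id, attribute index, attribute value, class label) -> count
        self.counts = defaultdict(int)

    def update(self, leaf_id: int, attr: int, value, label) -> None:
        self.counts[(leaf_id, attr, value, label)] += 1


def route_instance(leaf_id: int, instance: list, label,
                   stats_nodes: list) -> None:
    """Send each attribute value to the single node that owns that attribute."""
    n = len(stats_nodes)
    for attr, value in enumerate(instance):
        stats_nodes[attr % n].update(leaf_id, attr, value, label)


if __name__ == "__main__":
    nodes = [LocalStats(), LocalStats()]
    route_instance(leaf_id=0, instance=[1, 0, 1, 1], label="yes",
                   stats_nodes=nodes)
    for i, node in enumerate(nodes):
        print(i, sorted({key[1] for key in node.counts}))
```

Since no attribute is shared between nodes, the model is not replicated and each node can evaluate candidate splits for its own attributes independently.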

Slide 62

Slide 62 text

Advantages of Vertical
 High number of attributes => high level of parallelism (e.g., documents)
 Vs task parallelism
  Parallelism observed immediately
 Vs horizontal parallelism
  Reduced memory usage (no model replication)
  Parallelized split computation

Slide 63

Slide 63 text

Accuracy
 [Figure: VHT2 on tree-100; axis label: No. Leaf Nodes]
 Very close and very high accuracy

Slide 64

Slide 64 text

Performance
 [Figure: Profiling Results for text-10000 with 100000 instances; Execution Time (seconds) per Classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm, t_serial]
 Throughput
  VHT2-par-3: 2631 inst/sec
  MHT: 507 inst/sec

Slide 65

Slide 65 text

EVL Efficient Online Evaluation 
 of Big Data Stream Classifiers A. Bifet, G. De Francisci Morales, J. Read, G. Holmes, B. Pfahringer
 KDD 2015 32 System Algorithm API

Slide 66

Slide 66 text

Why?
 When is a classifier better than another?
  Statistically
  Online
 [Figure labels: Stream, Classifier 1, Classifier 2, Evaluation]

Slide 67

Slide 67 text

Issues I1: Validation Methodology I2: Statistical Test I3: Performance Measure I4: Forgetting Mechanism 34

Slide 68

Slide 68 text

Evaluation Pipeline
 Source Stream
 Validation Methodology (I1)
 Classifier (fold), k classifiers in parallel
 Performance Measure (I3 + I4)
 Statistical Test (I2)

Slide 70

Slide 70 text

I1: Validation Methodology
 Distributed
 Theory suggests k-fold
  Cross-validation
  Split-validation
  Bootstrap validation
 [Figure: Validation Methodology feeding k Classifier (fold) instances in parallel]

Slide 71

Slide 71 text

I1: Validation Methodology
 Cross-validation
 [Figure: Cross-Validation; four Classifier (fold) boxes marked Train or Test]

Slide 72

Slide 72 text

I1: Validation Methodology
 Split-validation
 [Figure: Split-Validation; four Classifier (fold) boxes marked Train or Test]

Slide 73

Slide 73 text

I1: Validation Methodology
 Bootstrap validation
 Training weights ~ Poisson(1)
 [Figure: Bootstrap Validation; four Classifier (fold) boxes marked Train or Test, with example weights 1, 2, 0, 0]
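Drawing training weights from Poisson(1) can be sketched as follows. The sampler uses Knuth's multiplication method so that only the standard library is needed. Treating a weight of 0 as "use this instance for testing in that fold" is an assumption read off the figure's example weights, and `poisson1`, `bootstrap_weights` and `dispatch` are hypothetical names.

```python
import math
import random


def poisson1(rng: random.Random) -> int:
    """Sample from Poisson(lambda=1) with Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def bootstrap_weights(num_folds: int, rng: random.Random) -> list[int]:
    """One independent Poisson(1) training weight per classifier (fold)."""
    return [poisson1(rng) for _ in range(num_folds)]


def dispatch(weights: list[int]) -> tuple[list[tuple[int, int]], list[int]]:
    """Assumption: weight > 0 means train with that weight, 0 means test."""
    train = [(fold, w) for fold, w in enumerate(weights) if w > 0]
    test = [fold for fold, w in enumerate(weights) if w == 0]
    return train, test


if __name__ == "__main__":
    rng = random.Random(7)
    weights = bootstrap_weights(4, rng)
    print(weights, dispatch(weights))
```

With lambda = 1 a weight of 0 occurs with probability e^-1 (about 37%), so under this assumption each instance trains roughly two thirds of the folds.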

Slide 74

Slide 74 text

I2: Statistical Testing Often misunderstood, hard to do correctly Non-parametric tests McNemar’s test Wilcoxon’s signed-rank test Sign test (omitted for brevity) 40

Slide 75

Slide 75 text

McNemar’s Test
 Example as trial
 a = number of examples where C1 is right & C2 is wrong
 b = number of examples where C1 is wrong & C2 is right
 H0 => $M \sim \chi^2$
 $M = \mathrm{sign}(a - b) \times \frac{(a - b)^2}{a + b}$

 C1 \ C2      Correct   Incorrect
 Correct      -         a
 Incorrect    b         -
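The statistic M is simple to compute from per-example correctness flags. In this sketch the function name and the boolean-list input format are assumptions; 3.841 is the standard 5% critical value of the chi-squared distribution with one degree of freedom.

```python
def mcnemar_statistic(c1_correct: list[bool], c2_correct: list[bool]) -> float:
    """M = sign(a - b) * (a - b)^2 / (a + b), one trial per example."""
    a = sum(1 for x, y in zip(c1_correct, c2_correct) if x and not y)
    b = sum(1 for x, y in zip(c1_correct, c2_correct) if y and not x)
    if a + b == 0:
        return 0.0  # the classifiers never disagree
    sign = (a > b) - (a < b)
    return sign * (a - b) ** 2 / (a + b)


if __name__ == "__main__":
    # a = 6 (only C1 right), b = 2 (only C2 right), 2 examples both right.
    c1 = [True] * 8 + [False] * 2
    c2 = [False] * 6 + [True] * 4
    m = mcnemar_statistic(c1, c2)
    print(m)               # 2.0
    print(abs(m) > 3.841)  # False: H0 is not rejected at the 5% level
```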

Slide 76

Slide 76 text

Wilcoxon’s Test
 Fold as trial
 Rank absolute value of performance difference (ascending)
 Sum ranks of differences with same sign
 Compare minimum sum to critical value (z-score)
 H0 => W ~ Normal

 Table 1: Comparison of two classifiers with Sign test and Wilcoxon’s signed-rank test.
 Class. A   Class. B   Diff    Rank
 77.98      77.91       0.07    4
 72.26      72.27      -0.01    1
 76.95      76.97      -0.02    2
 77.94      76.57       1.37    7
 72.23      71.63       0.60    5
 76.90      75.48       1.42    8
 77.93      75.75       2.18    9
 72.37      71.33       1.04    6
 76.93      74.54       2.39   10
 77.97      77.94       0.03    3
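The ranking steps can be reproduced on the values of Table 1. This sketch assumes there are no tied or zero differences (true for this table, so no rank averaging is needed) and uses the usual normal approximation for the z-score; the function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(scores_a: list[float], scores_b: list[float]):
    """Return (W+, W-, z) for paired scores, assuming no ties or zeros."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Rank the differences by absolute value, ascending, starting at 1.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = w_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            w_plus += rank
        else:
            w_minus += rank
    n = len(diffs)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, w_minus, (w - mean) / std


CLASS_A = [77.98, 72.26, 76.95, 77.94, 72.23, 76.90, 77.93, 72.37, 76.93, 77.97]
CLASS_B = [77.91, 72.27, 76.97, 76.57, 71.63, 75.48, 75.75, 71.33, 74.54, 77.94]

if __name__ == "__main__":
    w_plus, w_minus, z = wilcoxon_signed_rank(CLASS_A, CLASS_B)
    print(w_plus, w_minus)  # 52 3
    print(round(z, 2))
```

Only two of the ten folds favour classifier B, and both carry the smallest ranks, which is why the minimum rank sum is so low.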

Slide 77

Slide 77 text

Experiment: Type I & II Error
 Test for false positives
  Randomized classifiers with different seeds (RF)
 Test for false negatives
  Add random noise filter
  $p = p_0 \times (1 - p_{noise}) + (1 - p_0) \times \frac{p_{noise}}{c}$

Slide 78

Slide 78 text

FP: Type I Error
 Table 2 (cropped on the slide): Average fraction of null hypothesis rejection for different combina… datasets. The first column group concerns Type I errors, and the o…

 No change
                                  bootstrap   cv     split
 McNemar non-prequential          0.71        0.52   0.66
 McNemar prequential              0.77        0.80   0.42
 Sign test non-prequential        0.11        0.11   0.10
 Sign test prequential            0.12        0.12   0.09
 Wilcoxon non-prequential         0.11        0.11   0.14
 Wilcoxon prequential             0.11        0.10   0.19
 Avg. time non-prequential (s)    883         1105   415
 Avg. time prequential (s)        813         1202   109

Slide 79

Slide 79 text

FN: Type II Error
 Table 2 (cropped on the slide): Average fraction of null hypothesis rejection for different combinations of validation procedu… datasets. The first column group concerns Type I errors, and the other two column groups c…

                                  No change                   Change, p_noise = 0.05
                                  bootstrap   cv     split    bootstrap   cv     split
 McNemar non-prequential          0.71        0.52   0.66     0.86        0.80   0.73
 McNemar prequential              0.77        0.80   0.42     0.88        0.94   0.56
 Sign test non-prequential        0.11        0.11   0.10     0.77        0.82   0.44
 Sign test prequential            0.12        0.12   0.09     0.77        0.83   0.44
 Wilcoxon non-prequential         0.11        0.11   0.14     0.79        0.84   0.51
 Wilcoxon prequential             0.11        0.10   0.19     0.80        0.84   0.54
 Avg. time non-prequential (s)    883         1105   415      877         1121   422
 Avg. time prequential (s)        813         1202   109      820         1214   111

Slide 80

Slide 80 text

Lessons Learned
 Statistical units: prefer folds to examples
  Wilcoxon’s ≻ McNemar’s
 Use data wisely
  Cross-validation ≻ Split-validation
  Bootstrap good tradeoff
 Caveat: using dependent folds as independent trials is risky

Slide 81

Slide 81 text

PKG Partial Key Grouping
 M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini, “The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines”, ICDE 2015 47 System Algorithm API

Slide 82

Slide 82 text

Systems Challenges
 Skewed key distribution
 [Figure: CCDF of key frequency (log-log axes) for words in tweets and wikipedia links]

Slide 83

Slide 83 text

Key Grouping and Skew 49 Source Source Worker Worker Worker Stream

Slide 84

Slide 84 text

Problem Statement
 Input stream of messages: $m = \langle t, k, v \rangle$
 Load of worker $i \in W$: $L_i(t) = |\{\langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t\}|$
 Imbalance of the system: $I(t) = \max_i(L_i(t)) - \mathrm{avg}_i(L_i(t))$, for $i \in W$
 Goal: partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
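The imbalance definition translates directly into code. In this sketch each message is represented only by the worker index the partitioning function assigned it to; the function name and that input format are assumptions.

```python
from collections import Counter


def imbalance(assignments: list[int], num_workers: int) -> float:
    """I(t) = max_i L_i(t) - avg_i L_i(t), where L_i counts messages on worker i."""
    loads = Counter(assignments)
    per_worker = [loads.get(i, 0) for i in range(num_workers)]
    return max(per_worker) - sum(per_worker) / num_workers


if __name__ == "__main__":
    # Six messages on three workers: loads are 4, 1 and 1.
    print(imbalance([0, 0, 0, 1, 2, 0], num_workers=3))  # 2.0
    # A perfectly balanced assignment has zero imbalance.
    print(imbalance([0, 1, 2, 0, 1, 2], num_workers=3))  # 0.0
```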

Slide 85

Slide 85 text

Shuffle Grouping 51 Source Source Worker Worker Stream Aggr. Worker

Slide 86

Slide 86 text

Solution 1: Rebalancing
 At regular intervals move keys around workers
 Issues
  How often? Which keys to move?
  Key migration not supported with Storm/Samza API
  Many large routing tables (consistency and state)
  Hard to implement

Slide 87

Slide 87 text

Solution 2: PoTC
 Balls and bins problem
  For each ball, pick two bins uniformly at random
  Put the ball in the least loaded one
 Issues
  Consensus and state to remember choice
  Load information in distributed system

Slide 88

Slide 88 text

Solution 3: PKG
 Fully distributed adaptation of PoTC, handles skew
 Issue: consensus and state to remember choice
  Key splitting: assign each key independently with PoTC
 Issue: load information in distributed system
  Local load estimation: estimate worker load locally at each source
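Key splitting and local load estimation can be combined in a few lines. This is a sketch under stated assumptions rather than the implementation merged into Storm: the class name is hypothetical, CRC32 and Adler-32 stand in for the two hash functions, and each source keeps its own counter of the messages it has sent to every worker.

```python
import zlib


class PartialKeyGrouper:
    """One instance per source; routes each message to one of two candidates."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        # Local estimate only: counts messages sent by this source.
        self.local_load = [0] * num_workers

    def candidates(self, key: str) -> tuple[int, int]:
        """Two hash choices per key (they may coincide for some keys)."""
        data = key.encode("utf-8")
        return (zlib.crc32(data) % self.num_workers,
                zlib.adler32(data) % self.num_workers)

    def route(self, key: str) -> int:
        """Send the message to the candidate this source has loaded least."""
        first, second = self.candidates(key)
        if self.local_load[first] <= self.local_load[second]:
            target = first
        else:
            target = second
        self.local_load[target] += 1
        return target


if __name__ == "__main__":
    grouper = PartialKeyGrouper(num_workers=4)
    stream = ["hot"] * 1000 + ["a", "b", "c"] * 10
    for key in stream:
        grouper.route(key)
    print(grouper.local_load)
```

Because the choice is re-made per message, no routing table has to be remembered, and a key's state lives on at most two workers, which is what keeps the aggregation cost constant.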

Slide 89

Slide 89 text

Power of Both Choices 55 Source Source Worker Worker Stream Aggr. Worker

Slide 90

Slide 90 text

Comparison
 Stream Grouping        Pros                                                Cons
 Key Grouping           Memory efficient                                    Load imbalance
 Shuffle Grouping       Load balance                                        Memory overhead, Aggregation O(W)
 Partial Key Grouping   Memory efficient, Load balance, Aggregation O(1)    -

Slide 91

Slide 91 text

Experimental Design
 What is the effect of key splitting?
 How does local estimation compare to a global oracle?
 How does PKG perform in a real system?
 Measures: imbalance, throughput, memory, latency
 Datasets: Twitter, Wikipedia, graphs, synthetic

Slide 92

Slide 92 text

Effect of Key Splitting
 [Figure: Average Imbalance (0% to 4%) vs Number of workers (5, 10, 50, 100) for PKG, Off-Greedy, PoTC and KG]

Slide 93

Slide 93 text

Local vs Global
 [Figure] Fig. 2: Fraction of average imbalance with respect to total number of messages for each dataset, for different number of workers and number of sources.
 Panels: TW, WP, CT, LN1, LN2; series: G, L5, L10, L15, L20, H; x-axis: workers (5, 10, 50, 100); y-axis: Fraction of Imbalance (log scale)

Slide 94

Slide 94 text

Throughput vs Memory
 [Figure] Fig. 5: (a) Throughput for PKG, SG and KG for different CPU delays. (b) Throughput for PKG and SG vs. average memory for different aggregation periods.
 Axes: Throughput (keys/s) vs CPU delay (ms) in (a) and vs Memory (counters) in (b); aggregation periods shown: 10s, 30s, 60s, 300s, 600s

Slide 95

Slide 95 text

Impact
 Open source
  https://github.com/gdfm/partial-key-grouping
 Integrated in Apache Storm 0.10
  STORM-632, STORM-637
 Integrating it in Kafka (Samza) and Flink
  KAFKA-2091, KAFKA-2092, FLINK-1725

Slide 96

Slide 96 text

Conclusions
 Mining big data streams is an open frontier
 Needs collaboration between algorithms and systems communities
 SAMOA: a platform for mining big data streams
 And for collaboration on distributed stream mining

Slide 97

Slide 97 text

Vision 63 Distributed Streaming

Slide 98

Slide 98 text

Vision 63 Distributed Streaming Big Data Stream Mining


Slide 101

Slide 101 text

Thanks! 64 https://samoa.incubator.apache.org @ApacheSAMOA @gdfm7 [email protected]