Distributed Adaptive Model Rules for Mining Big Data Streams

Anh Thu Vu, Gianmarco De Francisci Morales, Joao Gama, Albert Bifet

Motivation
• Regression: a fundamental machine learning task
• Example: predict how much rain will fall tomorrow
• Applications: trend prediction, click-through rate prediction

Regression
• Input: training examples with a numeric label
• Output: a model that predicts the value of an unlabeled instance x, ŷ = ƒ(x)
• Objective: minimize the error, MSE = (1/n) ∑(y − ŷ)²

Setting
• Big data streams: high velocity, large volume
• Large model
• Concept drift
• Scalable solution

SAMOA
Data mining tools:
• Distributed, batch: Hadoop, Mahout
• Distributed, stream: Storm, S4, Samza, SAMOA
• Non-distributed, batch: R, WEKA, …
• Non-distributed, stream: MOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014) http://samoa-project.net

Rules
• Rules are self-contained, modular, and easy to interpret, with no need to cover the universe
• Each rule keeps sufficient statistics to make predictions, expand the rule, and detect changes and anomalies

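AMRules: Adaptive Model Rules
• Ruleset: an ensemble of rules
• Rule prediction: mean or linear model
• Ruleset prediction: weighted average of the predictions of the rules covering instance x, with weights inversely proportional to each rule's error:
  $\hat{f}(x) = \sum_{R_l \in S(x)} \theta_l \hat{y}_l$
  where S(x) is the set of rules covering x, $\theta_l$ is the weight of rule $R_l$, and $\hat{y}_l$ is its prediction
• The default rule covers instances that no other rule covers
• Example instance: x = [4, 1, 1, 2]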
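The weighted ruleset prediction can be illustrated with a minimal Python sketch. The `Rule` class, its field names, the example rule bodies, and the small `eps` term that guards against division by zero are assumptions made for illustration; they are not taken from the AMRules or SAMOA code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rule:
    """Hypothetical rule: a body (coverage test) plus a head (prediction)."""
    covers: Callable[[Sequence[float]], bool]  # rule body
    prediction: float                          # rule head, e.g. a target mean
    error: float                               # recent error of this rule


def predict(rules: List[Rule], default_prediction: float,
            x: Sequence[float], eps: float = 1e-9) -> float:
    """Weighted average over the rules covering x, weights ~ 1 / error."""
    covering = [r for r in rules if r.covers(x)]
    if not covering:
        # No rule covers x: fall back to the default rule.
        return default_prediction
    weights = [1.0 / (r.error + eps) for r in covering]
    total = sum(weights)
    return sum(w * r.prediction for w, r in zip(weights, covering)) / total


rules = [
    Rule(covers=lambda x: x[0] > 2, prediction=10.0, error=1.0),
    Rule(covers=lambda x: x[1] <= 1, prediction=20.0, error=3.0),
    Rule(covers=lambda x: x[3] > 5, prediction=99.0, error=0.5),
]
x = [4, 1, 1, 2]
print(predict(rules, default_prediction=0.0, x=x))
```

Here the first two rules cover x, and the rule with the smaller error contributes three times the weight of the other.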

AMRules
Algorithm 1: Training AMRules
  Input: S: stream of examples
  begin
    R ← {}, D ← 0
    foreach (x, y) ∈ S do
      foreach rule r ∈ S(x) do
        if ¬IsAnomaly(x, r) then
          if PHTest(error_r, λ) then
            Remove the rule from R
          else
            Update sufficient statistics L_r
            ExpandRule(r)
      if S(x) = ∅ then
        Update L_D
        ExpandRule(D)
        if D expanded then
          R ← R ∪ D
          D ← 0
    return (R, L_D)
Rule Induction
• Rule creation: default rule expansion
• Rule expansion: split on the attribute maximizing σ reduction
• Hoeffding bound: $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
• Expand when $\sigma_{1st} / \sigma_{2nd} < 1 - \epsilon$
• Evict a rule when drift is detected (Page-Hinckley test: error is large)
• Detect and explain local anomalies
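The two statistical tests used during rule induction can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function and class names, the default `threshold` and `alpha` values, and the reading of σ as a post-split deviation where lower is better are all assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observed examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_expand(sigma_best: float, sigma_second: float,
                  value_range: float, delta: float, n: int) -> bool:
    """Expand when sigma_1st / sigma_2nd < 1 - epsilon (assumed reading)."""
    if n == 0 or sigma_second == 0:
        return False
    eps = hoeffding_bound(value_range, delta, n)
    return sigma_best / sigma_second < 1.0 - eps


class PageHinckley:
    """Page-Hinckley test for an increase in the mean of a rule's error."""

    def __init__(self, threshold: float = 50.0, alpha: float = 0.005):
        self.threshold = threshold  # lambda: alarm level (assumed default)
        self.alpha = alpha          # tolerated deviation (assumed default)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of (error - mean - alpha)
        self.minimum = 0.0          # smallest cumulative value seen so far

    def update(self, error: float) -> bool:
        """Feed one error value; return True when drift is signalled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cumulative += error - self.mean - self.alpha
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


ph = PageHinckley()
stable = [ph.update(1.0) for _ in range(100)]    # steady error level
drifted = [ph.update(10.0) for _ in range(20)]   # error jumps upward
print(any(stable), any(drifted))
```

In Algorithm 1 terms, `PageHinckley.update` plays the role of PHTest (a rule is removed when it returns True) and `should_expand` is the check inside ExpandRule.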

DSPEs
[Figure: live streams (Stream 1, Stream 2, Stream 3) feed a graph of processing elements (PEs) connected by event routing, producing Output 1, Output 2, and writes to an external persister.]

Example
Input: status.text:"Introducing #S4: a distributed #stream processing system"
• TopicExtractorPE (PE1) extracts hashtags from status.text
• TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets and regularly emits a report event if the topic count is above a configured threshold
• TopicNTopicPE (PE4) keeps counts for top topics and outputs the top-N topics to an external persister
Events (EV / KEY / VAL):
• RawStatus / null / text="Int..."
• Topic / topic="S4" / count=1
• Topic / topic="stream" / count=1
• Topic / reportKey="1" / topic="S4", count=4
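The three processing elements in this example can be mimicked in a single process with the Python sketch below. It is only an illustration of the data flow: the class and function names are hypothetical, no S4 API is used, and a report is emitted on every event above the threshold rather than on a regular schedule.

```python
import re
from typing import Dict, List, Optional, Tuple


def extract_topics(status_text: str) -> List[str]:
    """PE1 stand-in: pull hashtags out of the status text."""
    return re.findall(r"#(\w+)", status_text)


class TopicCountAndReport:
    """PE2-3 stand-in: count each topic, report counts above a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.counts: Dict[str, int] = {}

    def process(self, topic: str) -> Optional[Tuple[str, int]]:
        self.counts[topic] = self.counts.get(topic, 0) + 1
        count = self.counts[topic]
        return (topic, count) if count > self.threshold else None


class TopNTopics:
    """PE4 stand-in: remember the latest reported counts, return the top n."""

    def __init__(self, n: int):
        self.n = n
        self.latest: Dict[str, int] = {}

    def process(self, report: Tuple[str, int]) -> List[Tuple[str, int]]:
        topic, count = report
        self.latest[topic] = count
        ranked = sorted(self.latest.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[: self.n]


text = "Introducing #S4: a distributed #stream processing system"
counter = TopicCountAndReport(threshold=3)
top = TopNTopics(n=1)
for _ in range(4):                      # the same status arrives four times
    for topic in extract_topics(text):
        report = counter.process(topic)
        if report is not None:
            print(top.process(report))
```

In a real DSPE each topic would be routed by key to its own counter instance, so the counts for a given topic always live in one place.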

Groupings
Events sent from one PE to the next are distributed among its instances (PEIs) by:
• Key grouping (hashing)
• Shuffle grouping (round-robin)
• All grouping (broadcast)
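A minimal Python sketch of the three grouping strategies follows. The function and class names are hypothetical and do not correspond to the API of Storm, S4, Samza, or SAMOA. `zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would break stable routing across workers.

```python
import itertools
import zlib
from typing import List


def key_grouping(key: str, num_instances: int) -> int:
    """Hashing: the same key always goes to the same instance."""
    return zlib.crc32(key.encode("utf-8")) % num_instances


class ShuffleGrouping:
    """Round-robin: spread events evenly across instances."""

    def __init__(self, num_instances: int):
        self._cycle = itertools.cycle(range(num_instances))

    def route(self) -> int:
        return next(self._cycle)


def all_grouping(num_instances: int) -> List[int]:
    """Broadcast: every instance receives the event."""
    return list(range(num_instances))


shuffle = ShuffleGrouping(3)
print([shuffle.route() for _ in range(5)])   # cycles through 0, 1, 2
print(key_grouping("rule-7", 4) == key_grouping("rule-7", 4))
print(all_grouping(3))
```

Key grouping is what lets per-key state (for example, the statistics of one rule) live on exactly one instance.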


VAMR: Vertical AMRules
• Model: rule body + head
• Target mean updated continuously with covered instances, for predictions
• Default rule (creates new rules)
[Figure: a Model Aggregator connected to Learner 1, Learner 2, ..., Learner p; edge labels: Instances, Predictions, New Rules, Rule Updates.]
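Keeping a target mean up to date one instance at a time can be done with a constant-memory running update, sketched below in Python. The `TargetStats` name is hypothetical, and tracking the standard deviation alongside the mean is an added assumption (using Welford's method); the slide itself only mentions the mean.

```python
import math


class TargetStats:
    """Running mean (and, as an assumption, deviation) of covered targets."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, y: float) -> None:
        """Fold one covered instance's target value into the statistics."""
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (y - self.mean)

    @property
    def std(self) -> float:
        """Population standard deviation of the targets seen so far."""
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


stats = TargetStats()
for y in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(y)
print(stats.mean, stats.std)
```

Because no past instances are stored, the rule head can be refreshed on every covered instance without growing the model.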

VAMR
• Learner: statistics
• Vertical: each Learner tracks the statistics of an independent subset of rules
• Each rule is tracked by only one Learner
• Model -> Learner: key grouping on rule ID

HAMR: Hybrid AMRules (Vertical + Horizontal)
• In VAMR, the single model is a bottleneck
• Shuffle among multiple Models for parallelism
[Figure: Model Aggregator 1, Model Aggregator 2, ..., Model Aggregator r, each connected to Learners; edge labels: Instances, Predictions, New Rules, Rule Updates.]

HAMR
• Problem: a distributed default rule decreases performance
• Solution: a separate, dedicated Learner for the default rule
[Figure: Model Aggregators, Learners, and a Default Rule Learner; edge labels: Instances, Predictions, New Rules, Rule Updates.]

Task Overview
• Streams: instances, rules, predictions
• Double line = broadcast
• Source -> Model = shuffle grouping
• Model -> Learner = key grouping
[Figure: topology with Source, Model Aggregator, Learner, Default Rule Learner, and Evaluator.]

Experiments
• 10-node Samza cluster + Kafka; 2 VCPUs, 4 GB RAM
• Metrics: throughput, accuracy, memory usage
• Compared with the sequential algorithm in MOA
Dataset      # instances  # attributes
Airlines     5.8M         10
Electricity  2M           12
Waveform     1M           40

Throughput (Airlines)
[Fig. 6: Throughput of distributed AMRules with airlines. x-axis: parallelism level (1, 2, 4, 8); y-axis: throughput (thousands instances/second); series: MAMR, VAMR, HAMR-1, HAMR-2.]

Throughput (Electricity)
[Fig. 5: Throughput of distributed AMRules with electricity. x-axis: parallelism level (1, 2, 4, 8); y-axis: throughput (thousands instances/second); series: MAMR, VAMR, HAMR-1, HAMR-2.]

Throughput (Waveform)
[Fig. 7: Throughput of distributed AMRules with waveform. x-axis: parallelism level (1, 2, 4, 8); y-axis: throughput (thousands instances/second); series: MAMR, VAMR, HAMR-1, HAMR-2.]

Throughput / Message Size
[Fig. 8: Maximum throughput of HAMR vs message size. x-axis: result message size (B), with ticks at 500, 1000, 2000 and the Airlines, Electricity, and Waveform datasets marked; y-axis: throughput (thousands instances/second); legend: Reference, Max throughput.]

Accuracy (Airlines)
[Figure, panel (b) RMSE. x-axis: parallelism level (1, 2, 4, 8); y-axis: RMSE/(Max-Min); series: MAMR, VAMR, HAMR-1, HAMR-2.]

Accuracy (Electricity)
[Figure, panel (b) RMSE. x-axis: parallelism level (1, 2, 4, 8); y-axis: RMSE/(Max-Min); series: MAMR, VAMR, HAMR-1, HAMR-2.]

Accuracy (Waveform)
[Figure, panel (b) RMSE. x-axis: parallelism level (1, 2, 4, 8); y-axis: RMSE/(Max-Min); series: MAMR, VAMR, HAMR-1, HAMR-2.]
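The accuracy plots report RMSE/(Max-Min). A small Python sketch of that metric is given below, assuming that Max and Min are the largest and smallest true target values in the evaluated data; the slides do not spell this out, and the function name is hypothetical.

```python
import math
from typing import Sequence


def normalized_rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the range of the true targets (assumed Max - Min)."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("target range is zero; the metric is undefined")
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / value_range


print(normalized_rmse([0.0, 10.0], [1.0, 9.0]))
```

Dividing by the target range makes the error comparable across datasets whose targets live on very different scales.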

Memory Usage
TABLE III: Memory consumption (MB) of VAMR for different datasets and parallelism levels.
Dataset      Parallelism  Model Aggregator      Learner
                          Avg.     Std. Dev.    Avg.     Std. Dev.
Electricity  1            266.5    5.6          40.1     4.3
Electricity  2            264.9    2.8          23.8     3.9
Electricity  4            267.4    6.6          20.1     3.2
Electricity  8            273.5    3.9          34.7     29
Airlines     1            337.6    2.8          83.6     4.1
Airlines     2            338.1    1.0          38.7     1.8
Airlines     4            337.3    1.0          38.8     7.1
Airlines     8            336.4    0.8          31.7     0.2
Waveform     1            286.3    5.0          171.7    2.5
Waveform     2            286.8    4.3          119.5    10.4
Waveform     4            289.1    5.9          46.5     12.1
Waveform     8            287.3    3.1          33.8     5.7

Memory Usage
TABLE II: Memory consumption (MB) of MAMR for different datasets.
Dataset      Avg.     Std. Dev.
Electricity  52.4     2.1
Airlines     120.7    51.1
Waveform     223.5    8

Memory Usage (Learner)
[Figure: Memory usage of Learner. x-axis: dataset (Airlines, Electricity, Waveform); y-axis: average memory usage (MB); series: P=1, P=2, P=4, P=8.]

Conclusions
• Distributed streaming algorithm for regression
• Runs on top of distributed stream processing engines
• Up to ~5x increase in throughput
• Accuracy comparable with the sequential algorithm
• Scalable memory usage
