
Distributed Adaptive Model Rules for Mining Big Data Streams

Decision rules are among the most expressive data mining models. We propose the first distributed streaming algorithm to learn decision rules for regression tasks. The algorithm is available in SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams. It uses a hybrid of vertical and horizontal parallelism to distribute Adaptive Model Rules (AMRules) on a cluster. The decision rules built by AMRules are comprehensible models in which the antecedent of a rule is a conjunction of conditions on the attribute values and the consequent is a linear combination of the attributes. Our evaluation shows that this implementation is scalable in terms of CPU and memory consumption. On a small commodity Samza cluster of 9 nodes, it can handle a rate of more than 30,000 instances per second and achieve a speedup of up to 4.7x over the sequential version.


Transcript

  1. Distributed Adaptive Model Rules
    for Mining Big Data Streams

    Anh Thu Vu, Gianmarco De Francisci Morales,
    Joao Gama, Albert Bifet


  2. Motivation
    Regression: fundamental machine learning task
    Predict how much rain tomorrow
    Applications
    Trend prediction
    Click-through rate prediction

  3. Regression
    Input: training examples
    with numeric label
    Output: model that
    predicts value of
    unlabeled instance x
    ŷ=ƒ(x)
    Minimize error

    MSE = (1/n) ∑ᵢ (yᵢ - ŷᵢ)²
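The MSE objective on the slide can be computed directly; a minimal Python sketch (function name and data are illustrative, not from the talk):

```python
# Mean squared error between true labels y and predictions y_hat.
def mse(y, y_hat):
    assert len(y) == len(y_hat) and len(y) > 0
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)

# A perfect model has zero error; otherwise squared errors are averaged.
print(mse([1.0, 2.0], [1.0, 2.0]))            # 0.0
print(mse([3.0, 1.0, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 1.0 + 0.0) / 3
```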

  4. Setting
    Big Data Streams
    High velocity, large volume
    Large model
    Concept drift
    Scalable solution


  5. SAMOA
    Where SAMOA sits among data mining tools:
    Distributed, batch: Hadoop, Mahout
    Distributed, stream: Storm, S4, Samza -> SAMOA
    Non-distributed, batch: R, WEKA, …
    Non-distributed, stream: MOA
    G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
    http://samoa-project.net

  6. Rules
    Rules: self-contained, modular, easy to interpret,
    no need to cover the universe
    Each rule keeps sufficient statistics to:
    make predictions
    expand the rule
    detect changes and anomalies

  7. AMRules
    Adaptive Model Rules
    Ruleset: ensemble of rules
    Rule prediction: mean, linear model
    Ruleset prediction: weighted avg. of predictions of
    rules covering instance x
    Weights inversely proportional to error
    Default rule covers uncovered instances
    E.g., for x = [4, 1, 1, 2]:
    f̂(x) = ∑_{Rl ∈ S(x)} θl ŷl
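The ruleset prediction above (weighted average over covering rules, weights inversely proportional to error, default rule as fallback) can be sketched in a few lines. This illustrates the idea only, not the SAMOA code; all names and the toy rules are hypothetical:

```python
# Each rule covering x contributes its prediction, weighted inversely to
# its running error; the default rule answers when no rule covers x.
def ruleset_predict(x, rules, default_prediction, eps=1e-9):
    covering = [r for r in rules if r["covers"](x)]
    if not covering:
        return default_prediction
    weights = [1.0 / (r["error"] + eps) for r in covering]
    preds = [r["predict"](x) for r in covering]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

rules = [
    {"covers": lambda x: x[0] > 2, "predict": lambda x: 10.0, "error": 1.0},
    {"covers": lambda x: x[1] < 5, "predict": lambda x: 20.0, "error": 3.0},
]
# Both rules cover [4, 1]; the lower-error rule dominates the average.
print(ruleset_predict([4, 1], rules, default_prediction=0.0))   # ~12.5
# No rule covers [0, 10], so the default rule answers.
print(ruleset_predict([0, 10], rules, default_prediction=0.0))  # 0.0
```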

  8. AMRules: Rule Induction
    (from “Ensembles of Adaptive Model Rules from High-Speed Data Streams”)

    Algorithm 1: Training AMRules
    Input: S: stream of examples
    begin
      R ← {}, D ← ∅
      foreach (x, y) ∈ S do
        foreach rule r ∈ S(x) do
          if ¬IsAnomaly(x, r) then
            if PHTest(error_r, λ) then
              remove the rule from R
            else
              update sufficient statistics L_r
              ExpandRule(r)
        if S(x) = ∅ then
          update L_D
          ExpandRule(D)
          if D expanded then
            R ← R ∪ D
            D ← ∅
      return (R, L_D)

    Rule Induction
    • Rule creation: default rule expansion
    • Rule expansion: split on the attribute maximizing σ reduction
    • Hoeffding bound: ε = √(R² ln(1/δ) / (2n))
    • Expand when σ_1st / σ_2nd < 1 - ε
    • Evict a rule when drift is detected (Page-Hinckley test error large)
    • Detect and explain local anomalies
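The Hoeffding-bound expansion test on the slide can be made concrete. A small sketch under the slide's notation, where R is the range of the observed quantity, δ the confidence parameter, and n the number of observations (function names are illustrative):

```python
import math

# Hoeffding bound: eps = sqrt(R^2 * ln(1/delta) / (2n)).
def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# Expand a rule when the best split is confidently better than the
# second best: sigma_1st / sigma_2nd < 1 - eps.
def should_expand(sigma_best, sigma_second, R, delta, n):
    return sigma_best / sigma_second < 1.0 - hoeffding_bound(R, delta, n)

# The bound shrinks as more examples are seen, so close candidate splits
# are eventually separated.
print(hoeffding_bound(1.0, 0.05, 100) > hoeffding_bound(1.0, 0.05, 10000))  # True
```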

  9. DSPEs
    [Diagram: live streams (Stream 1, 2, 3) enter a network of processing
    elements (PEs); event routing connects the PEs; an external persister
    writes Output 1 and Output 2]

  10. Example
    status.text: "Introducing #S4: a distributed #stream processing system"
    [Diagram: a RawStatus event (key=null, value text="Int...") reaches PE1;
    Topic events (topic="S4", count=1) and (topic="stream", count=1) reach
    PE2-3; a Topic event (reportKey="1", topic="S4", count=4) reaches PE4]
    TopicExtractorPE (PE1) extracts hashtags from status.text
    TopicCountAndReportPE (PE2-3) keeps counts for each topic across all
    tweets; regularly emits a report event if a topic count is above a
    configured threshold
    TopicNTopicPE (PE4) keeps counts for top topics and outputs the top-N
    topics to an external persister
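The topic-counting pipeline above can be imitated in a few lines. A single-process sketch only: the real PEs run distributed, and the threshold value here is made up:

```python
from collections import Counter

# Hashtag extraction (PE1's job): pull topics out of a status text.
def extract_topics(text):
    return [w.lstrip("#").rstrip(":,.!?") for w in text.split() if w.startswith("#")]

counts = Counter()   # per-topic counts (PE2-3's job)
THRESHOLD = 2        # made-up report threshold

# Count each topic and emit a report when its count crosses the threshold.
def on_status(text):
    reports = []
    for topic in extract_topics(text):
        counts[topic] += 1
        if counts[topic] >= THRESHOLD:
            reports.append((topic, counts[topic]))
    return reports

print(on_status("Introducing #S4: a distributed #stream processing system"))  # []
print(on_status("#S4 is neat"))  # [('S4', 2)]
```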

  11. Groupings
    Key Grouping (hashing)
    Shuffle Grouping (round-robin)
    All Grouping (broadcast)
    [Diagram: two PEs, each with several PE instances (PEIs), illustrating
    how events are routed between them]
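The three groupings can be sketched as routing functions; a minimal illustration (helper names are hypothetical, and Python's `hash` stands in for whatever hash the engine uses):

```python
import itertools

# Key grouping: route by hash of the key, so equal keys always reach the
# same PE instance.
def key_grouping(key, n_instances):
    return hash(key) % n_instances

# Shuffle grouping: round-robin over PE instances.
def shuffle_grouping(n_instances):
    return itertools.cycle(range(n_instances))

# All grouping: broadcast to every PE instance.
def all_grouping(n_instances):
    return list(range(n_instances))

assert key_grouping("rule-42", 4) == key_grouping("rule-42", 4)
rr = shuffle_grouping(3)
print([next(rr) for _ in range(6)])  # [0, 1, 2, 0, 1, 2]
print(all_grouping(3))               # [0, 1, 2]
```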


  21. VAMR
    Vertical AMRules
    Model: rule body + head
    Target mean updated continuously with covered instances, for predictions
    Default rule (creates new rules)
    [Diagram: a Model Aggregator receives Instances and emits Predictions;
    New Rules and Rule Updates flow between it and Learners 1..p]

  22. VAMR
    Learner: statistics
    Vertical: each Learner tracks the statistics of an independent subset of rules
    One rule is tracked by only one Learner
    Model -> Learner: key grouping on rule ID
    [Diagram: Model Aggregator with Learners 1..p, exchanging Instances,
    Predictions, New Rules, and Rule Updates]

  23. HAMR
    VAMR's single model is a bottleneck
    Hybrid AMRules (Vertical + Horizontal)
    Shuffle among multiple Model Aggregators for parallelism
    [Diagram: Instances are shuffled to Model Aggregators 1..r, each with its
    own set of Learners; Predictions, New Rules, and Rule Updates flow as in VAMR]
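HAMR's two-stage routing (shuffle to Model Aggregators, then key grouping to Learners) can be sketched as a composition of the groupings above; a toy single-process illustration with made-up parallelism settings:

```python
import itertools

R_AGGREGATORS = 2  # parallel Model Aggregators (hypothetical setting)
P_LEARNERS = 4     # Learners behind the key-grouped stage (hypothetical)

_rr = itertools.cycle(range(R_AGGREGATORS))

# Stage 1: shuffle grouping spreads incoming instances over the Model
# Aggregators, removing the single-model bottleneck.
def route_instance():
    return next(_rr)

# Stage 2: key grouping on rule ID sends each rule's statistics to exactly
# one Learner, as in VAMR.
def route_rule_update(rule_id):
    return hash(rule_id) % P_LEARNERS

assigned = {route_instance() for _ in range(10)}
print(sorted(assigned))  # [0, 1] -- every aggregator receives instances
assert route_rule_update("rule-7") == route_rule_update("rule-7")
```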

  24. HAMR
    Problem: a distributed default rule decreases performance
    Solution: a separate, dedicated Learner for the default rule
    [Diagram: Model Aggregators with their Learners, plus a dedicated Default
    Rule Learner receiving instances and emitting New Rules]

  25. Task Overview
    Streams carry Instances, Rules, and Predictions
    Source -> Model Aggregator: shuffle grouping
    Model Aggregator -> Learner: key grouping
    Double line = broadcast
    [Diagram: Source feeds the Default Rule Learner and the Model Aggregator;
    Learners exchange rules with the Model Aggregator, which feeds an Evaluator]

  26. Experiments
    10-node Samza cluster + Kafka
    2 VCPUs, 4 GB RAM per node
    Measured: throughput, accuracy, memory usage
    Compared with the sequential algorithm in MOA

    Dataset      # instances   # attributes
    Airlines     5.8M          10
    Electricity  2M            12
    Waveform     1M            40

  27. Throughput (Airlines)
    [Figure: throughput (thousands of instances/second, 0-35) vs. parallelism
    level (1, 2, 4, 8) for MAMR, VAMR, HAMR-1, and HAMR-2]

  28. Throughput (Electricity)
    [Figure: throughput (thousands of instances/second, 0-35) vs. parallelism
    level (1, 2, 4, 8) for MAMR, VAMR, HAMR-1, and HAMR-2]

  29. Throughput (Waveform)
    [Figure: throughput (thousands of instances/second, 0-35) vs. parallelism
    level (1, 2, 4, 8) for MAMR, VAMR, HAMR-1, and HAMR-2]

  30. Throughput / Message Size
    [Figure: maximum throughput of HAMR (thousands of instances/second) vs.
    result message size (500, 1000, 2000 B) for Airlines, Electricity, and
    Waveform, with a reference line]

  31. Accuracy (Airlines)
    [Figure: RMSE/(Max-Min), 0-0.02, vs. parallelism level (1, 2, 4, 8) for
    MAMR, VAMR, HAMR-1, and HAMR-2]

  32. Accuracy (Electricity)
    [Figure: RMSE/(Max-Min), 0-0.35, vs. parallelism level (1, 2, 4, 8) for
    MAMR, VAMR, HAMR-1, and HAMR-2]

  33. Accuracy (Waveform)
    [Figure: RMSE/(Max-Min), 0-0.4, vs. parallelism level (1, 2, 4, 8) for
    MAMR, VAMR, HAMR-1, and HAMR-2]

  34. Memory Usage
    TABLE III: Memory consumption of VAMR for different datasets and
    parallelism levels.

    Dataset      Parallelism   Model Aggregator (MB)   Learner (MB)
                               Avg.     Std. Dev.      Avg.    Std. Dev.
    Electricity  1             266.5    5.6            40.1    4.3
                 2             264.9    2.8            23.8    3.9
                 4             267.4    6.6            20.1    3.2
                 8             273.5    3.9            34.7    29
    Airlines     1             337.6    2.8            83.6    4.1
                 2             338.1    1.0            38.7    1.8
                 4             337.3    1.0            38.8    7.1
                 8             336.4    0.8            31.7    0.2
    Waveform     1             286.3    5.0            171.7   2.5
                 2             286.8    4.3            119.5   10.4
                 4             289.1    5.9            46.5    12.1
                 8             287.3    3.1            33.8    5.7

  35. Memory Usage
    TABLE II: Memory consumption of MAMR for different datasets.

    Dataset      Avg. (MB)   Std. Dev.
    Electricity  52.4        2.1
    Airlines     120.7       51.1
    Waveform     223.5       8

  36. Memory Usage (Learner)
    [Figure: average memory usage (MB, 0-200) of the Learner for Airlines,
    Electricity, and Waveform at parallelism levels P = 1, 2, 4, 8]

  37. Conclusions
    Distributed streaming algorithm for regression
    Runs on top of distributed stream processing engines
    Up to ~5x increase in throughput
    Accuracy comparable with sequential algorithm
    Scalable memory usage