Mining Big Data Streams: Better Algorithms or Faster Systems?

The rate at which the world produces data is growing steadily, thus creating ever larger streams of continuously evolving data. However, current (de-facto standard) solutions for big data analysis are not designed to mine evolving streams. So, should we find better algorithms to mine data streams, or should we focus on building faster systems?

In this talk, we debunk this false dichotomy between algorithms and systems, and we argue that the data mining and distributed systems communities need to work together to bring about the next revolution in data analysis. In doing so, we introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza.

As a case study, we present one of SAMOA's main algorithms for classification, the Vertical Hoeffding Tree (VHT). Then, we analyze the algorithm from a distributed systems perspective, highlight the issue of load balancing, and describe a generalizable solution to it. Finally, we conclude by envisioning system-algorithm co-design as a promising direction for the future of big data analytics.

Transcript

  1. Mining Big Data Streams
    Better Algorithms or Faster Systems?


    Gianmarco De Francisci Morales

    [email protected]
    @gdfm7


  2. Vision
    Algorithms & Systems
    Distributed stream mining platform
    Development and collaboration framework for researchers
    Library of state-of-the-art algorithms for practitioners
    2

  3. Agenda
    SAMOA (Scalable Advanced Massive Online Analysis)
    VHT (Vertical Hoeffding Tree)
    PKG (Partial Key Grouping)
    3
    (diagram: System / Algorithm / API layers)

  4. Visiting Scientist @Aalto DMG
    Scientist @Yahoo Labs
    PPMC @ Apache SAMOA
    Committer @ Apache Pig
    Contributor to Hadoop, Giraph, Storm, S4, Grafos.ml
    4

  5. What do I work on?
    Topics: Grid Admin, Systems, Distributed Mining, News, Streaming
    (timeline 2008–2015: M.Eng, PhD at IMT Lucca, Y!R Barcelona; roles: PhD Student, Postdoc, Scientist)
    5

  6. “Panta rhei”

    (everything flows)
    -Heraclitus
    6


  7. Importance
    Example: spam detection in comments on Yahoo News
    Trends change in time
    Need to retrain model with new data
    7

  8. Stream
    Batch data is a
    snapshot of
    streaming data
    8


  9. Challenges
    Operational: need to rerun the pipeline and redeploy the model when new data arrives
    Paradigmatic: new data lies in storage without generating new value until the model is retrained
    9
    (pipeline: Gather, Clean, Model, Deploy)

  10. Present of big data
    Too big to handle
    10


  11. Future of big data
    Drinking from a firehose
    11


  12. Evolution of SPEs
    12
    1st generation: Aurora (2003), STREAM (2004), Borealis (2005)
    2nd generation: SPC (2006), SPADE (2008)
    3rd generation: S4 (2010), Storm (2011), Samza (2013)
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
    Storm: http://storm-project.net
    Samza: http://samza.incubator.apache.org

  13. Actor Model
    13
    (diagram: each PE is instantiated as multiple PE instances (PEIs); events from the input stream are routed among PEIs to produce the output stream)
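As a rough sketch of the actor model above, a processing element (PE) is just some state plus a method the engine calls for every event routed to it, and the engine replicates each PE into several PE instances (PEIs). The Event and Processor interfaces below are hypothetical, for illustration only, not the actual SAMOA, Storm, or S4 API.

    // Illustrative actor-style processing element; hypothetical interfaces.
    interface Event { String key(); }

    interface Processor { void process(Event e); }

    // A PE is replicated into PE instances (PEIs); the engine routes each event
    // to one PEI (or broadcasts it to all of them, depending on the grouping).
    class WordCounter implements Processor {
        private final java.util.Map<String, Long> counts = new java.util.HashMap<>();
        public void process(Event e) { counts.merge(e.key(), 1L, Long::sum); }
        public java.util.Map<String, Long> counts() { return counts; }
    }

    public class ActorModelSketch {
        public static void main(String[] args) {
            WordCounter pei = new WordCounter();
            for (String w : new String[]{"stream", "data", "stream"}) {
                pei.process(() -> w);          // Event has a single method, so a lambda works
            }
            System.out.println(pei.counts());  // prints the per-word counts
        }
    }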

  14. Paradigm Shift
    14
    (diagram: Gather, Clean, Model, Deploy combined into a single streaming process)

  15. Apache SAMOA
    Scalable Advanced Massive Online Analysis

    G. De Francisci Morales, A. Bifet

    JMLR 2015
    15


  16. Taxonomy
    16
    Data Mining:
    Distributed, Batch: Hadoop, Mahout
    Distributed, Stream: SAMOA (on Storm, S4, Samza)
    Non-Distributed, Batch: R, WEKA, …
    Non-Distributed, Stream: MOA

  17. What about Mahout?
    SAMOA = Mahout for streaming
    But…
    More than JBoA (just a bunch of algorithms)
    Provides a common platform
    Easy to port to new computing engines
    17


  18. Architecture
    18
    (architecture diagram of SAMOA)

  19.–24. Status
    Parallel algorithms
    Classification (Vertical Hoeffding Tree)
    Clustering (CluStream)
    Regression (Adaptive Model Rules)
    Execution engines
    19
    https://samoa.incubator.apache.org

  25. Is SAMOA useful for you?
    Only if you need to deal with:
    Large fast data
    Evolving process (model updates)
    What is happening now?
    Use feedback in real-time
    Adapt to changes faster
    20


  26. Advantages (operational)
    Avoid deploy cycle
    No need to choose update frequency
    No system downtime
    No complex backup/update procedures
    Program once, run everywhere
    Reuse existing computational infrastructure
    21


  27. Advantages (paradigmatic)
    Model freshness
    Immediate data value
    No stream/batch impedance mismatch
    22


  28.–37. Groupings
    Key Grouping (hashing)
    Shuffle Grouping (round-robin)
    All Grouping (broadcast)
    23
    (diagram: two PEs, each with several PEIs, showing how events are routed under each grouping)
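To make the three groupings concrete, here is a small engine-agnostic sketch that maps messages to worker indices 0..n-1; in Storm these roughly correspond to fieldsGrouping, shuffleGrouping, and allGrouping on the topology builder.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Illustrative routers for the three groupings over workers 0..n-1.
    public class GroupingsSketch {
        // Key grouping: hash the key, so the same key always hits the same worker.
        static int keyGrouping(String key, int numWorkers) {
            return Math.floorMod(key.hashCode(), numWorkers);
        }

        // Shuffle grouping: round-robin, ignoring the key (good load balance).
        private static final AtomicLong counter = new AtomicLong();
        static int shuffleGrouping(int numWorkers) {
            return (int) (counter.getAndIncrement() % numWorkers);
        }

        // All grouping: broadcast, every worker receives the message.
        static List<Integer> allGrouping(int numWorkers) {
            return IntStream.range(0, numWorkers).boxed().collect(Collectors.toList());
        }

        public static void main(String[] args) {
            int n = 4;
            System.out.println(keyGrouping("user-42", n));   // always the same worker for this key
            System.out.println(shuffleGrouping(n));          // 0, then 1, 2, 3, 0, ...
            System.out.println(allGrouping(n));              // [0, 1, 2, 3]
        }
    }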

  38. VHT
    Vertical Hoeffding Tree

    A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis

    (under submission)
    27


  39. Decision Tree
    Nodes are tests on attributes
    Branches are possible outcomes
    Leaves are class assignments
    28
    (example tree for the class “Car deal?”: tests on Road Tested?, Mileage?, Age? over the instance attributes, with ✅/❌ leaves)
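As a minimal sketch of the structure just described, with internal nodes testing one attribute, branches keyed by the outcome, and leaves holding a class label (illustrative only, not the representation MOA or SAMOA actually use):

    import java.util.Map;

    // Minimal decision-tree node: internal nodes test one attribute,
    // branches are the possible outcomes, leaves carry a class label.
    class TreeNode {
        String testAttribute;              // null at a leaf
        Map<String, TreeNode> branches;    // outcome value -> child node
        String classLabel;                 // class assignment at a leaf

        String classify(Map<String, String> instance) {
            if (testAttribute == null) return classLabel;                // leaf: predict
            TreeNode child = branches.get(instance.get(testAttribute));  // follow the branch
            return child == null ? classLabel : child.classify(instance);
        }
    }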

  40. Hoeffding Tree
    Sample of stream enough for near optimal decision
    Estimate merit of alternatives from prefix of stream
    Choose sample size based on statistical principles
    When to expand a leaf?
    Let x1 be the most informative attribute, x2 the second most informative one
    Hoeffding bound: split if G(x1, x2) > ε, where ε = √( R² ln(1/δ) / (2n) )
    29
    P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
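A direct transcription of the split rule, as a sketch: R is the range of the split criterion (for information gain, log2 of the number of classes), δ the allowed failure probability, and n the number of instances seen at the leaf.

    // Hoeffding bound: split when the gain gap between the two best
    // attributes exceeds epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    public class HoeffdingBound {
        static double epsilon(double range, double delta, long n) {
            return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
        }

        static boolean shouldSplit(double bestGain, double secondBestGain,
                                   double range, double delta, long n) {
            return (bestGain - secondBestGain) > epsilon(range, delta, n);
        }

        public static void main(String[] args) {
            // Example: binary labels (range = 1), delta = 1e-7, n = 2000 instances at the leaf.
            System.out.println(epsilon(1.0, 1e-7, 2000));                  // ~0.063
            System.out.println(shouldSplit(0.30, 0.20, 1.0, 1e-7, 2000));  // true
        }
    }

When the observed gap exceeds ε, the observed best attribute is the true best with probability at least 1 - δ, which is why a finite prefix of the stream is enough for a near-optimal decision.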


  41.–46. Parallel Decision Trees
    Which kind of parallelism?
    Task
    Data: Horizontal, Vertical
    30
    (diagram: data matrix of instances × attributes)

  47.–55. Horizontal Parallelism
    Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010
    31
    (diagram: the stream of instances is split across Stats workers, which keep histograms and send model updates to the Model)
    Single attribute tracked in multiple nodes
    Aggregation to compute splits
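In the horizontal scheme each worker sees only a subset of the instances but keeps statistics for every attribute, and the model has to aggregate the per-worker statistics before it can evaluate splits. A simplified sketch of that flow, using plain class counts instead of the histograms of Ben-Haim and Tom-Tov:

    import java.util.HashMap;
    import java.util.Map;

    // Horizontal parallelism, simplified: instances are partitioned across
    // workers, every worker tracks stats for ALL attributes, and the model
    // merges the local stats before computing splits.
    class HorizontalWorker {
        // "attribute=value|class" -> count seen by this worker
        final Map<String, Long> localStats = new HashMap<>();

        void learn(Map<String, String> instance, String label) {
            for (Map.Entry<String, String> a : instance.entrySet()) {
                localStats.merge(a.getKey() + "=" + a.getValue() + "|" + label, 1L, Long::sum);
            }
        }
    }

    class HorizontalModel {
        // Aggregation phase: merge the local stats of every worker.
        static Map<String, Long> aggregate(Iterable<HorizontalWorker> workers) {
            Map<String, Long> merged = new HashMap<>();
            for (HorizontalWorker w : workers) {
                w.localStats.forEach((k, v) -> merged.merge(k, v, Long::sum));
            }
            return merged;  // split decisions are made on the merged counts
        }
    }

This is exactly where the two drawbacks above come from: the same attribute is tracked in multiple nodes, and an extra aggregation step is needed to compute splits.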

  56. Hoeffding Tree Profiling
    32
    CPU time for training (100 nominal and 100 numeric attributes): Learn 70%, Split 24%, Other 6%

  57.–66. Vertical Parallelism
    33
    (diagram: the Model splits the stream by attributes and sends them to Stats workers, which send splits back to the Model)
    Single attribute tracked in single node

  67. Advantages of Vertical
    High number of attributes => high level of parallelism (e.g., documents)
    Vs task parallelism: parallelism observed immediately
    Vs horizontal parallelism: reduced memory usage (no model replication), parallelized split computation
    34

  68. Vertical Hoeffding Tree
    35
    (topology diagram: Source (n), Model (n), Stats (n), Evaluator (1); streams: instance, control, split, result; groupings: shuffle, key, all)
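The core of the vertical scheme is the routing step: the model splits each incoming instance into per-attribute slices and key-groups them by (leaf, attribute), so every attribute of every leaf is tracked by exactly one statistics processor. A hedged sketch of that step; the class and field names are illustrative, not the actual SAMOA VHT classes.

    // Illustrative attribute routing for a vertically parallel Hoeffding tree.
    class AttributeSlice {
        final int leafId, attributeId;
        final double value;
        final String classLabel;

        AttributeSlice(int leafId, int attributeId, double value, String classLabel) {
            this.leafId = leafId; this.attributeId = attributeId;
            this.value = value; this.classLabel = classLabel;
        }

        // Key grouping on (leafId, attributeId): the same attribute of the same
        // leaf always reaches the same Stats processor instance.
        int statsInstance(int numStats) {
            return Math.floorMod(31 * leafId + attributeId, numStats);
        }
    }

    class ModelProcessor {
        // Split one training instance into slices and route each of them.
        void train(int leafId, double[] attributes, String classLabel, int numStats) {
            for (int a = 0; a < attributes.length; a++) {
                AttributeSlice slice = new AttributeSlice(leafId, a, attributes[a], classLabel);
                int target = slice.statsInstance(numStats);
                // emit(slice, target);  // engine-specific send, omitted in this sketch
            }
        }
    }

Because each Stats instance owns a disjoint set of (leaf, attribute) pairs, there is no model replication and the split computation is parallelized, which is what the previous slides list as the advantages of vertical parallelism.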

  69. Accuracy
    36
    (plot: number of leaf nodes, VHT2 vs tree-100; very close and very high accuracy)

  70. Performance
    37
    Profiling results for text-10000 with 100000 instances: execution time (t_calc, t_comm, t_serial) for MHT vs VHT2-par-3
    Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec

  71. PKG
    Partial Key Grouping

    M. A. Uddin Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, 

    M. Serafini, “The Power of Both Choices: Practical Load Balancing for Distributed
    Stream Processing Engines”, ICDE 2015
    38


  72. 10-14
    10-12
    10-10
    10-8
    10-6
    10-4
    10-2
    100
    100 101 102 103 104 105 106 107 108
    CCDF
    key frequency
    words in tweets
    wikipedia links
    Systems Challenges
    Skewed key distribution
    39

    View Slide

  73. Key Grouping and Skew
    40
    (diagram: two Sources route the stream by key to three Workers)

  74. Problem Statement
    Input stream of messages m = ⟨t, k, v⟩
    Partitioning function P_t : K → N
    Load of worker i ∈ W: L_i(t) = |{⟨τ, k, v⟩ : P_τ(k) = i ∧ τ ≤ t}|
    Imbalance of the system: I(t) = max_i(L_i(t)) - avg_i(L_i(t)), for i ∈ W
    Goal: partitioning function that minimizes imbalance
    41
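As a worked version of the definition above, the imbalance at time t is simply the gap between the most loaded worker and the average load:

    // Imbalance I(t) = max_i L_i(t) - avg_i L_i(t) over the workers.
    public class Imbalance {
        static double imbalance(long[] workerLoads) {
            long max = 0, sum = 0;
            for (long load : workerLoads) {
                max = Math.max(max, load);
                sum += load;
            }
            return max - (double) sum / workerLoads.length;
        }

        public static void main(String[] args) {
            // Five workers, one of them overloaded by a hot key: I(t) = 600 - 200 = 400.
            System.out.println(imbalance(new long[]{100, 100, 100, 100, 600}));  // 400.0
        }
    }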

  75. Shuffle Grouping
    42
    (diagram: two Sources shuffle the stream across three Workers, followed by an Aggregator)

  76. Existing Stream Partitioning
    Key Grouping
    Memory and communication efficient :)
    Load imbalance :(
    Shuffle Grouping
    Load balance :)
    Additional memory and aggregation phase :(
    43


  77. Solution 1: Rebalancing
    At regular intervals move keys around workers
    Issues
    How often? Which keys to move?
    Key migration not supported with Storm/Samza API
    Many large routing tables (consistency and state)
    Hard to implement
    44


  78. Solution 2: PoTC
    Balls and bins problem
    For each ball, pick two bins uniformly at random
    Put the ball in the least loaded one
    Issues
    Consensus and state to remember choice
    Load information in distributed system
    45


  79. Solution 3: PKG
    Fully distributed adaptation of PoTC, handles skew
    Issue: consensus and state to remember choice → Key splitting: assign each key independently with PoTC
    Issue: load information in distributed system → Local load estimation: estimate worker load locally at each source
    46
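Putting the two ingredients together, a minimal sketch of a PKG partitioner as each source could run it locally; the two hash functions and the tie-breaking below are illustrative, not the exact choices of the paper or of the implementation later integrated in Storm.

    // Partial Key Grouping, sketched: each key has two candidate workers
    // (two independent hashes); each message goes to whichever candidate this
    // source has sent fewer messages to so far (purely local load estimate).
    public class PkgSketch {
        private final long[] localLoad;   // messages sent per worker, by this source only
        private static final int SEED_1 = 0x9747b28c, SEED_2 = 0x1b873593;

        PkgSketch(int numWorkers) { this.localLoad = new long[numWorkers]; }

        private int hash(String key, int seed) {
            return Math.floorMod(key.hashCode() ^ seed, localLoad.length);
        }

        int route(String key) {
            int w1 = hash(key, SEED_1);                              // first candidate
            int w2 = hash(key, SEED_2);                              // second candidate (key splitting)
            int chosen = localLoad[w1] <= localLoad[w2] ? w1 : w2;   // power of both choices
            localLoad[chosen]++;                                     // update the local estimate
            return chosen;
        }

        public static void main(String[] args) {
            PkgSketch pkg = new PkgSketch(4);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < 6; i++) out.append(pkg.route("hot-key")).append(' ');
            System.out.println(out);  // a hot key is spread over its (at most) two candidates
        }
    }

Since every key lands on at most two workers, downstream aggregation only needs to combine at most two partial results per key, which is the O(1) aggregation cost in the comparison on slide 82.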

  80. Power of Both Choices
    47
    (diagram: two Sources, three Workers, and an Aggregator; each key has two candidate workers)

  81. Chromatic Balls and Bins
    Throw m balls with k colors in n bins with d choices
    Ball = msg, bin = worker, color = key, choice = hash
    Necessary condition: p1 ≤ d/n
    Imbalance: I(m) = O((m/n) · ln n / ln ln n) if d = 1, and O(m/n) if d ≥ 2
    48

  82. Comparison
    49
    Stream Grouping        Pros                             Cons
    Key Grouping           Memory efficient                 Load imbalance
    Shuffle Grouping       Load balance                     Memory overhead, aggregation O(W)
    Partial Key Grouping   Memory efficient, load balance   Aggregation O(1)

  83. Experimental Design
    What is the effect of key splitting?
    How does local estimation compare to a global oracle?
    How does PKG perform in a real system?
    Measures: imbalance, throughput, latency, memory
    Datasets: Twitter, Wikipedia, graphs, synthetic
    50


  84. Effect of Key Splitting
    51
    (plot: average imbalance (0–4%) vs. number of workers (5, 10, 50, 100) for PKG, Off-Greedy, PoTC, KG)

  85. Local vs Global
    52
    Fig. 2: Fraction of average imbalance with respect to total number of messages for each dataset (TW, WP, CT, LN1, LN2), for different numbers of workers (5–100) and numbers of sources (legend: G, L5, L10, L15, L20)

  86. Throughput vs Memory
    53
    Fig. 5: (a) Throughput for PKG, SG and KG for different CPU delays. (b) Throughput for PKG and SG vs. average memory for different aggregation periods (10s–600s)

  87. Latency
    54
    In the second experiment, we fix the CPU delay to 0.4ms per key, as it seems to be the limit of saturation for kg.
    Table 4: Complete latency per message (ms) for different techniques, CPU delays and aggregation periods
             CPU delay D (ms)         Aggregation period T (s)
             D=0.1   D=0.5   D=1      T=10    T=30    T=60
    pkg      3.81    6.24    11.01    6.93    6.79    6.47
    sg       3.66    6.11    10.82    7.01    6.75    6.58
    kg       3.65    9.82    19.35

  88. Impact
    Open source
    https://github.com/gdfm/partial-key-grouping
    Integrated in Apache Storm 0.10
    (STORM-632, STORM-637)
    Plan to integrate it in Samza
    55


  89. Conclusions
    Mining big data streams is an open field
    Needs collaboration between algorithms and systems communities
    SAMOA: a platform for mining big data streams, and for collaboration on distributed stream mining
    Algorithm-system co-design: promising future direction
    56
    (diagram: System / Algorithm / API layers)

  90. Future Work
    Algorithms
    Lift assumptions of ideal systems
    Systems
    New primitives targeted to mining algorithms
    Impact
    Open-source involvement with ASF
    57


  91. Thanks!
    58
    https://samoa.incubator.apache.org
    @ApacheSAMOA
    @gdfm7
    [email protected]
