SAMOA: A Platform for Mining Big Data Streams

Presented in Chile during Hypertext 2014 (August)

Transcript

  1. SAMOA
    A Platform for Mining Big Data Streams

    Gianmarco De Francisci Morales

    Yahoo Labs Barcelona


  2. Agenda
    Streams
    Applications, Model, Tools, Advantages
    SAMOA
    Goal, Example, Challenges

  3. Streams
    “Panta rhei” (everything flows)
    Heraclitus

  4. Importance
    Spam detection in comments on Yahoo! News
    Trends change in time
    Need to retrain model with new data

  5. Spam on Twitter

  6. Applications

  7. Applications
    Personalization

  8. Applications
    Personalization
    Spam detection

  9. Applications
    Personalization
    Spam detection
    Recommendation

  10. Big Data Stream
    Volume + Velocity (+ Variety)
    Too large for single commodity server main memory
    Too fast for single commodity server CPU
    A solution should be:
    Distributed
    Scalable

  11. Examples
    User clicks
    Search queries
    News
    Emails
    Tumblr posts
    Flickr photos
    Finance stocks
    Credit card transactions
    Wikipedia edit logs
    Facebook statuses
    Twitter updates
    Name your own…

  12. Stream
    Batch data is a snapshot of streaming data

  13. Data Science Lifecycle
    Old school’s data mining
    From data to insight
    From insight to model
    From model to value
    And repeat!
    [Cycle diagram: Gather, Clean, Model, Deploy]

  14. Big Data Tools

  15. Problems
    Operational
    Need to rerun the pipeline and redeploy the model when new data arrives
    Paradigmatic
    New data lies in storage without generating new value until the model is retrained

  16. Present of big data
    Too big to handle

  17. Future of big data
    Drinking from a firehose

  18. A Tale of Two Tribes
    [Diagram labels: DB (×6), Data, App (×3), Faster, Larger, Database]
    M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05
    A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009

  22. Evolution of SPEs
    Timeline (1st, 2nd, and 3rd generation):
    2003  Aurora
    2004  STREAM
    2005  Borealis
    2006  SPC
    2008  SPADE
    2010  S4
    2011  Storm
    2013  Samza
    References:
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
    Storm: http://storm-project.net
    Samza: http://samza.incubator.apache.org

  23. Actors Model
    [Diagram labels: Live Streams (Stream 1, Stream 2, Stream 3); PE (×5); Event routing; Output 1, Output 2; External Persister]

  24. S4 Example
    status.text:"Introducing #S4: a distributed #stream processing system"
    Events exchanged between PE1, PE2, PE3, and PE4:
    EV         KEY             VAL
    RawStatus  null            text="Int..."
    Topic      topic="S4"      count=1
    Topic      topic="stream"  count=1
    Topic      reportKey="1"   topic="S4", count=4
    TopicExtractorPE (PE1)
    extracts hashtags from status.text
    TopicCountAndReportPE (PE2-3)
    keeps counts for each topic across
    all tweets. Regularly emits report
    event if topic count is above
    a configured threshold.
    TopicNTopicPE (PE4)
    keeps counts for top topics and outputs
    top-N topics to external persister
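The three PE roles above can be sketched in plain Java without the S4 runtime. This is an illustrative single-process sketch, not S4's API: the class `HashtagTopN`, its method names, and the `threshold` parameter are assumptions made for the example, and the report is issued on every update rather than at regular intervals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical single-process stand-in for the three PE roles; not the S4 API. */
final class HashtagTopN {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    private final Map<String, Integer> counts = new HashMap<>();   // state of the counting PEs
    private final Map<String, Integer> reported = new HashMap<>(); // state of the top-N PE
    private final int threshold;

    HashtagTopN(int threshold) {
        this.threshold = threshold;
    }

    /** Role of PE1: extract hashtags from the status text. */
    static List<String> extractTopics(String text) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    /** Role of PE2-3: count per topic; report once the count is above the threshold. */
    void onStatus(String text) {
        for (String topic : extractTopics(text)) {
            int count = counts.merge(topic, 1, Integer::sum);
            if (count > threshold) {
                reported.put(topic, count); // stands in for the report event
            }
        }
    }

    /** Role of PE4: return the n most frequent reported topics (ties broken by name). */
    List<String> topN(int n) {
        List<String> topics = new ArrayList<>(reported.keySet());
        topics.sort((a, b) -> {
            int byCount = Integer.compare(reported.get(b), reported.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return topics.subList(0, Math.min(n, topics.size()));
    }
}
```

In a distributed engine the counting step would be keyed by topic, so each counter lives on exactly one PE instance; here a single map plays that role.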

  25. But we have Hadoop!
    “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!”
    [J. Lin, in Big Data, 1(1):28–37, 2013]

    “Data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”
    [A. Jacobs, in ACM Queue, 7(6):10, 2009]

  26. Paradigm Shift
    [Figure labels: Gather, Clean, Model, Deploy; "+", "="]

  27. Streaming Model
    Sequence is potentially infinite
    High amount of data, high speed of arrival
    Change over time (concept drift)
    Approximation algorithms (small error with high probability)
    Single pass, one data item at a time
    Sub-linear space and time per data item
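One concrete instance of these properties is approximate frequency counting. The sketch below uses a Count-Min sketch, a technique chosen here for illustration and not named in the deck: it processes each item once, uses a fixed `depth × width` table regardless of stream length, and never underestimates a count. The class name, constructor parameters, and the seeded bit-mixing hash are assumptions; the formal error bound relies on pairwise-independent hash functions, for which the simple mixer below is only a stand-in.

```java
import java.util.Random;

/** Minimal Count-Min sketch: approximate counts in a single pass with fixed memory. */
final class CountMinSketch {
    private final long[][] table; // depth rows of width counters
    private final int[] seeds;    // one hash seed per row
    private final int width;

    CountMinSketch(int depth, int width, long seed) {
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    /** Illustrative seeded hash; a stand-in for a pairwise-independent family. */
    private int bucket(int row, String item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Processes one stream item: one counter update per row. */
    void add(String item) {
        for (int row = 0; row < table.length; row++) {
            table[row][bucket(row, item)]++;
        }
    }

    /** Returns an estimate that is never below the true count of the item. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(row, item)]);
        }
        return min;
    }
}
```

Collisions can only inflate a counter, so taking the minimum across rows gives the tightest available overestimate; widening the table lowers the error, and adding rows lowers the probability of a large error.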

  28. SAMOA
    Scalable Advanced Massive Online Analysis

  29. Concept
    SAMOA is a platform
    Researchers
    Framework for developing distributed stream mining algorithms
    Practitioners
    Library of state-of-the-art distributed stream mining algorithms

  30. Taxonomy
    Data Mining
      Distributed
        Batch: Hadoop, Mahout
        Stream: Storm, S4, Samza, SAMOA
      Non-Distributed
        Batch: R, WEKA
        Stream: MOA

  31. What about Mahout?
    Think SAMOA = Mahout for streaming
    But SAMOA…
    More than JBoA (just a bunch of algorithms)
    Provides a common platform
    Easy to port to new computing engines

  32. Architecture

  33. Status
    Parallel algorithms
    Vertical Hoeffding Tree (classification)
    CluStream (clustering)
    Adaptive Model Rules (regression)
    PARMA (frequent pattern mining) [pending]
    Execution engines
    Storm, S4, Samza (+ Local)
    https://github.com/yahoo/samoa

  34. Is SAMOA useful for you?
    Only if you need to deal with:
    Big fast data
    Evolving data (model updates)
    What is happening now?
    Use feedback in real-time
    Adapt to changes faster

  35. Advantages (operational)
    Program once, run everywhere
    Reuse existing computational infrastructure
    Avoid deploy cycle
    No system downtime
    No complex backup/update procedures
    No need to choose update frequency

  36. Advantages (paradigmatic)
    Model freshness
    Immediate data value
    No stream/batch impedance mismatch

  37. Algorithmic Challenges
    Case study: Vertical Hoeffding Tree
    What kind of parallelism?
    Task
    Data: Horizontal, Vertical
    [Figure labels: Instance, Attributes, Class]

  38. Task Parallelism

  39. Horizontal Parallelism
    [Diagram labels: Stream, Instances, Model, Stats (×3), Histograms, Model Updates]
    Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

  40. Hoeffding Tree Profiling
    Training CPU time (100 nominal and 100 numeric attributes)
    Learn: 70%
    Split: 24%
    Other: 6%

  41. Vertical Parallelism
    [Diagram labels: Stream, Model, Stats (×3), Attributes, Splits]

  42. Vertical Parallelism
    High number of attributes => high level of parallelism (e.g., documents)
    vs. task parallelism
    Parallelism observed immediately
    vs. horizontal parallelism
    Reduced memory usage (no model replication)
    Parallelized split computation
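A minimal sketch of the vertical idea: each instance is cut into per-attribute events, and every attribute is always routed to the same statistics worker, so no worker needs a copy of the whole model. The names `VerticalSplit`, `AttributeEvent`, and `workerFor`, as well as the modulo routing rule, are assumptions for illustration and are not taken from SAMOA's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of vertical partitioning; names are not from SAMOA. */
final class VerticalSplit {
    /** One attribute of one instance, tagged with the instance's class label. */
    static final class AttributeEvent {
        final int attributeIndex;
        final double value;
        final int classLabel;

        AttributeEvent(int attributeIndex, double value, int classLabel) {
            this.attributeIndex = attributeIndex;
            this.value = value;
            this.classLabel = classLabel;
        }
    }

    /** Assumed routing rule: the attribute index is the key, so an attribute always reaches the same worker. */
    static int workerFor(int attributeIndex, int parallelism) {
        return Math.floorMod(attributeIndex, parallelism);
    }

    /** Splits one instance into per-worker batches of attribute events. */
    static List<List<AttributeEvent>> split(double[] attributes, int classLabel, int parallelism) {
        List<List<AttributeEvent>> perWorker = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int i = 0; i < attributes.length; i++) {
            perWorker.get(workerFor(i, parallelism))
                     .add(new AttributeEvent(i, attributes[i], classLabel));
        }
        return perWorker;
    }
}
```

With many attributes (as in text documents) every worker receives a share of each instance, which is why the parallelism is available from the first instance onward.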

  43. Vertical Hoeffding Tree
    [Topology diagram: Source (n), Model (n), Stats (n), Evaluator (1); labels: Instance, Stream, Control, Split, Result]
    Grouping types in the diagram:
    Shuffle Grouping
    Key Grouping
    All Grouping
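The three grouping types decide which parallel instance of a downstream processor receives an event. The helpers below sketch the usual meaning of each in stream processing engines such as Storm; the class and method names are hypothetical, and shuffle grouping is shown as round-robin (engines typically distribute randomly) to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical routing helpers illustrating the three grouping types. */
final class Groupings {
    private int next = 0; // round-robin cursor for shuffle grouping

    /** Shuffle grouping: spread events evenly across instances. */
    int shuffle(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** Key grouping: events with the same key always reach the same instance. */
    static int key(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: every instance receives a copy of the event. */
    static List<Integer> all(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```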


  44. Accuracy
    38
    No. Leaf Nodes VHT2 –
    tree-100
    30
    Very close and
    very high accuracy

    View Slide

  45. Performance
    Profiling Results for text-10000 with 100000 instances
    [Bar chart: Execution Time (seconds) by Classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm, t_serial]
    Throughput
    VHT2-par-3: 2631 inst/sec
    MHT : 507 inst/sec
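As a worked check on these figures, assuming throughput is simply the number of instances divided by total execution time (the slide does not define it), the reported rates imply roughly a fivefold speedup and the following run times for 100000 instances:

```latex
\text{speedup} = \frac{2631}{507} \approx 5.19
\qquad
t_{\text{VHT2-par-3}} \approx \frac{100000}{2631} \approx 38.0\ \text{s}
\qquad
t_{\text{MHT}} \approx \frac{100000}{507} \approx 197.2\ \text{s}
```

Under that assumption both times fall inside the 0 to 250 second range of the chart's axis.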


  46. ML Developer API
    Processing Item
    Processor
    Stream


  47. ML Developer API
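    // Assumes the platform supplies an initialized TopologyBuilder (setup is not shown on the slide).
    TopologyBuilder builder;

    // First source and its output stream.
    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    // Second source and its output stream.
    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    // Join processor: streamOne arrives with shuffle grouping, streamTwo with key grouping.
    Processor join = new JoinProcessor();
    builder.addProcessor(join)
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo);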

  48. Deployment
    SAMOA-API.jar: API. Algorithm developer depends only on this
    SAMOA-S4.jar: S4 bindings
    SAMOA-Storm.jar: Storm bindings
    samoa-s4-deployable.s4r: To S4 cluster
    samoa-storm-deployable.jar: To Storm cluster

  49. Conclusions
    Streaming is the future and is happening now
    SAMOA: A Platform for Mining Big Data Streams
    Runs on existing DSPEs (Storm, Samza, S4)
    Algorithms for classification, regression, clustering
    Available and open-source http://samoa-project.net
    A platform for collaboration and research on distributed stream mining

  50. Open Challenges
    Distributed stream mining algorithms
    Active & semi-supervised learning + crowdsourcing
    Millions of classes (e.g., Wikipedia pages)
    Multi-target learning
    System issues (load balancing, communication)
    Programming paradigms and abstractions

  51. Thanks!
    https://github.com/yahoo/samoa
    @samoa_project
    @gdfm7
