SAMOA @Strata Barcelona 2014

SAMOA: A Platform for Mining Big Data Streams

Gianmarco De Francisci Morales

November 20, 2014

Transcript

  1. SAMOA
    A Platform for Mining Big Data Streams
    Gianmarco De Francisci Morales
    Yahoo Labs Barcelona
    [email protected]
    @gdfm7

  2. Agenda
    Streams
    Applications, Model, Tools
    SAMOA
    Goal, Architecture, Advantages

  3. Research Scientist @ Yahoo Labs
    Web mining & data-intensive scalable computing
    Committer @ Apache Pig
    Contributor for Hadoop, Giraph, S4, Grafos.ml

  4. “Panta rhei” (everything flows)
    Heraclitus

  5. Importance
    Spam detection in comments on Yahoo! News
    Trends change in time
    Need to retrain model with new data

  6. Spam on Twitter

  10. Applications
    Personalization
    Spam detection
    Recommendation

  11. Big Data Stream
    Volume + Velocity (+ Variety)
    Too large for single commodity server main memory
    Too fast for single commodity server CPU
    A solution should be:
    Distributed
    Scalable

  12. Examples
    User clicks
    Search queries
    News
    Emails
    Tumblr posts
    Flickr photos
    Finance stocks
    Credit card transactions
    Wikipedia edit logs
    Facebook statuses
    Twitter updates
    Name your own…

  13. Stream
    Batch data is a snapshot of streaming data

  14. Data Science Lifecycle
    Gather
    Clean
    Model
    Deploy
    Old-school data mining
    From data to insight
    From insight to model
    From model to value
    And repeat!

  15. Big Data Tools

  16. Problems
    Operational
    Need to rerun the pipeline and redeploy the model when new data arrives
    Paradigmatic
    New data lies in storage without generating new value until the model is retrained

  17. Present of big data
    Too big to handle

  18. Future of big data
    Drinking from a firehose

  19. A Tale of Two Tribes
    [Figure labels: DB, Data, App; Faster, Larger Database]
    M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05
    A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009

  23. Evolution of SPEs
    1st generation, 2nd generation, 3rd generation
    2003: Aurora
    2004: STREAM
    2005: Borealis
    2006: SPC
    2008: SPADE
    2010: S4
    2011: Storm
    2013: Samza
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
    Storm http://storm.apache.org
    Samza http://samza.incubator.apache.org

  24. Actors Model
    [Figure labels: Live Streams (Stream 1, Stream 2, Stream 3), PE, Event routing, External Persister (Output 1, Output 2)]

  25. S4 Example
    status.text:"Introducing #S4: a distributed #stream processing system"
    Events exchanged between PE1, PE2, PE3 and PE4:
    EV         KEY             VAL
    RawStatus  null            text="Int..."
    Topic      topic="S4"      count=1
    Topic      topic="stream"  count=1
    Topic      reportKey="1"   topic="S4", count=4
    TopicExtractorPE (PE1) extracts hashtags from status.text
    TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold.
    TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to external persister
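
The first two stages of the S4 example above can be sketched in plain Java. This is not S4 code: the class `TopicCountSketch`, its method names, and the hash-based routing are assumptions made for illustration, and the threshold report and top-N stage are left out. It only shows hashtag extraction followed by keyed counting, where every occurrence of a topic lands on the same counter instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the PE1 and PE2-3 roles; not part of S4. */
final class TopicCountSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** PE1 role: extract hashtags (without the '#') from a status text. */
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    /** Event routing by key: the same topic always maps to the same instance. */
    static int route(String topic, int parallelism) {
        return Math.floorMod(topic.hashCode(), parallelism);
    }

    /** PE2-3 role: one counter map per parallel instance, updated per event. */
    static List<Map<String, Integer>> countTopics(List<String> statuses, int parallelism) {
        List<Map<String, Integer>> counters = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            counters.add(new HashMap<>());
        }
        for (String status : statuses) {
            for (String topic : extractTopics(status)) {
                counters.get(route(topic, parallelism)).merge(topic, 1, Integer::sum);
            }
        }
        return counters;
    }
}
```

Because routing depends only on the key, each counter instance holds the complete count for the topics assigned to it, so no merge step is needed before reporting.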

  26. But we have Hadoop!
    “MapReduce Is Good Enough? If All You Have Is a Hammer, Throw Away Everything That’s Not a Nail!”
    [J. Lin, in Big Data, 1(1):28–37, 2013]
    “Data whose characteristics force us to look beyond the traditional methods that are prevalent at the time”
    [A. Jacobs, in ACM Queue, 7(6):10, 2009]

  27. Paradigm Shift
    [Figure labels: Gather, Clean, Model, Deploy]

  28. Streaming Model
    Sequence is potentially infinite
    High amount of data, high speed of arrival
    Change over time (concept drift)
    Approximation algorithms (small error with high probability)
    Single pass, one data item at a time
    Sub-linear space and time per data item
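
One well-known structure that fits these constraints is the Count-Min Sketch, shown here as a minimal illustration; the deck does not name it. It processes each item once, uses a fixed table regardless of stream length, and returns counts that may overestimate but never underestimate. The class name, the constructor parameters, and the seeded hash mixing are assumptions for this sketch; the usual error guarantee relies on pairwise-independent hash functions, which this simple mixing only approximates.

```java
import java.util.Random;

/** Minimal Count-Min Sketch: approximate per-item counts in fixed space, one pass. */
final class CountMinSketch {
    private final int depth;       // number of hash rows (more rows, lower failure probability)
    private final int width;       // counters per row (more counters, smaller error)
    private final long[][] table;
    private final int[] seeds;

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Illustrative seeded hash; an assumption, not a pairwise-independent family.
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Processes one stream item; called exactly once per item. */
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    /** Returns the minimum counter over all rows; collisions can only inflate it. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Memory stays at depth × width counters no matter how many items arrive, which is the sub-linear space property listed on the slide.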

  29. SAMOA
    Scalable Advanced Massive Online Analysis
    G. De Francisci Morales, A. Bifet
    Journal of Machine Learning Research, 2014

  30. Concept
    SAMOA is a platform
    Researchers
    Framework for developing distributed stream mining algorithms
    Practitioners
    Library of state-of-the-art distributed stream mining algorithms

  31. Taxonomy
    Data Mining
      Distributed
        Batch: Hadoop, Mahout
        Stream: Storm, S4, Samza, SAMOA
      Non-Distributed
        Batch: R, WEKA
        Stream: MOA

  32. What about Mahout?
    Think SAMOA = Mahout for streaming
    But SAMOA…
    More than JBoA (just a bunch of algorithms)
    Provides a common platform
    Easy to port to new computing engines

  33. Status
    Parallel algorithms
    Vertical Hoeffding Tree (classification)
    CluStream (clustering)
    Adaptive Model Rules (regression)
    PARMA (frequent pattern mining) [pending]
    Execution engines
    https://github.com/yahoo/samoa
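
Hoeffding trees take their name from the Hoeffding bound, which tells a streaming learner how many examples are enough before committing to a split. The sketch below computes that bound and applies the usual split test. It illustrates the general idea only and is not SAMOA's Vertical Hoeffding Tree implementation; the class and method names are hypothetical.

```java
/** Hoeffding bound and the split test it enables; names are illustrative. */
final class HoeffdingBound {
    /**
     * With probability 1 - delta, the true mean of a variable with the given
     * range lies within epsilon of the mean observed over n samples:
     * epsilon = sqrt(range^2 * ln(1/delta) / (2n)).
     */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split once the gap between the two best attributes exceeds epsilon. */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }
}
```

The bound shrinks as n grows, so a leaf that cannot decide yet simply waits for more stream items instead of revisiting old ones.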

  37. Architecture
    [Figure: SAMOA architecture]

  38. Is SAMOA useful for you?
    Only if you need to deal with:
    Big fast data
    Evolving data (model updates)
    What is happening now?
    Use feedback in real-time
    Adapt to changes faster

  39. Advantages (operational)
    Program once, run everywhere
    Reuse existing computational infrastructure
    Avoid deploy cycle
    No system downtime
    No complex backup/update procedures
    No need to choose update frequency

  40. Advantages (paradigmatic)
    Model freshness
    No retraining
    Immediate data value
    No stream/batch impedance mismatch
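
"Model freshness" and "no retraining" follow from updating the model on every arriving item. A minimal sketch of that loop, using an online perceptron evaluated test-then-train (each item is first used to score the model, then to update it), is shown below. The perceptron is chosen only for brevity and is not one of the SAMOA algorithms listed in this deck; the class and method names are assumptions.

```java
/** Online perceptron evaluated test-then-train; illustrative only. */
final class OnlinePerceptron {
    private final double[] weights;
    private final double learningRate;
    private double bias;
    private long seen;
    private long correct;

    OnlinePerceptron(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Predicts +1 or -1 for a feature vector. */
    int predict(double[] x) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * x[i];
        }
        return score >= 0 ? 1 : -1;
    }

    /** Scores the item with the current model, then updates the model on a mistake. */
    void testThenTrain(double[] x, int label) {
        int predicted = predict(x);
        seen++;
        if (predicted == label) {
            correct++;
            return;
        }
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * label * x[i];
        }
        bias += learningRate * label;
    }

    /** Fraction of items predicted correctly before being trained on. */
    double accuracy() {
        return seen == 0 ? 0.0 : (double) correct / seen;
    }
}
```

The model is usable after every single item, so there is no separate retraining job and no decision about how often to run one.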

  41. ML Developer API
    Processing Item
    Processor
    Stream

  42. ML Developer API
    // builder is assumed to be initialized elsewhere
    TopologyBuilder builder;

    // first source and its output stream
    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    // second source and its output stream
    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    // join consumes streamOne with shuffle grouping and streamTwo with key grouping
    Processor join = new JoinProcessor();
    builder.addProcessor(join)
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo);
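
The two connection styles in the snippet above differ in how events are assigned to parallel instances of the receiving processor. The sketch below assumes the meaning these groupings commonly have in engines such as Storm: shuffle spreads events evenly regardless of content, while key grouping sends all events with the same key to the same instance. `GroupingSketch` is a hypothetical class for illustration and is not part of the SAMOA API.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical router contrasting shuffle and key grouping. */
final class GroupingSketch {
    private final int parallelism;
    private final AtomicLong next = new AtomicLong();

    GroupingSketch(int parallelism) {
        this.parallelism = parallelism;
    }

    /** Shuffle grouping: round-robin over instances, ignoring event content. */
    int shuffle() {
        return (int) (next.getAndIncrement() % parallelism);
    }

    /** Key grouping: events with equal keys always reach the same instance. */
    int byKey(Object key) {
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Shuffle balances load when any instance can handle any event; key grouping is needed when an instance keeps per-key state, as a join or a counter does.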

  43. Deployment
    SAMOA-API.jar: API. Algorithm developer depends only on this
    SAMOA-S4.jar: S4 bindings
    SAMOA-Storm.jar: Storm bindings
    samoa-s4-deployable.s4r: to S4 cluster
    samoa-storm-deployable.jar: to Storm cluster

  44. Conclusions
    Streaming is the future and is happening now
    SAMOA
    Runs on existing DSPEs (Storm, S4, Samza)
    Algorithms for classification, regression, clustering
    Available and open-source http://samoa-project.net
    A platform for collaboration and research on distributed stream mining

  45. The Team
    Albert Bifet
    Matthieu Morel
    Gianmarco De Francisci Morales
    Arinto Murdopo
    Nicolas Kourtellis
    Olivier Van Laere

  46. Thanks!
    [email protected]
    https://github.com/yahoo/samoa
    @samoa_project
    @gdfm7
