SAMOA: A Platform for Mining Big Data Streams

Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. In this talk, we present SAMOA, an upcoming platform for online mining of big data streams in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as S4 and Storm. SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering.

Gianmarco De Francisci Morales

November 30, 2013

Transcript

  1. SAMOA
    A Platform for Mining Big Data Streams

    Gianmarco De Francisci Morales

    Yahoo Labs Barcelona

    [email protected]

  2. Taxonomy
    Machine Learning
      Distributed
        Batch: Hadoop, Mahout
        Stream: S4, Storm, SAMOA
      Non-Distributed
        Batch: R, WEKA
        Stream: MOA

  3. Research Scientist @ Yahoo Labs
    Committer for Apache Pig.
    Contributor for Apache Hadoop, Giraph, S4.

  4. SAMOA Team

  5. Big Data Stream
    Volume + Velocity (+ Variety)
    Too large for single commodity server main memory
    Too fast for single commodity server CPU
    A solution should be:
    Distributed
    Scalable

  6. Applications

  7. Applications
    Personalization

  8. Applications
    Personalization
    Spam detection

  9. Applications
    Personalization
    Spam detection
    Recommendation

  10. Data Science Lifecycle
    Old-school data mining
    Gather → Clean → Model → Deploy
    From data to insight
    From insight to model
    From model to value
    And repeat!

  11. Big Data Tools

  12. Problems
    Operational
      Need to rerun the pipeline and redeploy the model when new data arrives
    Paradigmatic
      New data lies in storage without generating new value until the new model is retrained

  13. Stream
    Batch data is a snapshot of streaming data

  14. Examples
    User clicks
    Search queries
    News
    Emails
    Tumblr posts
    Flickr photos
    Finance stocks
    Credit card transactions
    Wikipedia edit logs
    Facebook statuses
    Twitter updates
    Name your own…

  15. But we have Hadoop!
    “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” [J. Lin, in Big Data, 1(1):28–37, 2013]
    “Data whose characteristics force us to look beyond the traditional methods that are prevalent at the time” [A. Jacobs, in ACM Queue, 7(6):10, 2009]

  16. Big Data
    Too big to handle

  17. Future of big data
    Like drinking from a firehose

  18. Importance
    Spam detection in comments on Yahoo News
    Trends change over time
    Need to retrain the model with new data

  19. Streaming
    Sequence is potentially infinite
    High amount of data, high speed of arrival
    Change over time (concept drift)
    Approximation algorithms (small error with high probability)
    Single pass, one data item at a time
    Sublinear space and time per data item
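
The streaming constraints listed above (single pass, one item at a time, sublinear space, small error with high probability) can be illustrated with a Count-Min sketch. This is a minimal sketch for illustration only: the technique is not named in the deck, and the class name, hash scheme, and parameters are assumptions. The table has a fixed size regardless of stream length, and an estimate never undercounts.

```java
import java.util.Random;

/** Illustrative Count-Min sketch (not part of SAMOA): approximate counts in one pass. */
final class CountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1, modulus for hashing
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;  // fixed-size state, independent of stream length
    private final long[] a;         // per-row hash multipliers
    private final long[] b;         // per-row hash offsets

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.a = new long[depth];
        this.b = new long[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    /** Maps an item to a counter index in the given row. */
    private int bucket(int row, Object item) {
        long h = Math.floorMod((long) item.hashCode(), PRIME);
        return (int) (((a[row] * h + b[row]) % PRIME) % width);
    }

    /** Processes one stream item: constant time, no access to earlier items. */
    void add(Object item) {
        for (int i = 0; i < depth; i++) {
            counts[i][bucket(i, item)]++;
        }
    }

    /** Returns an estimate that is at least the true count; collisions can only inflate it. */
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counts[i][bucket(i, item)]);
        }
        return min;
    }
}
```

For example, `new CountMinSketch(4, 64, 42L)` keeps 256 counters no matter how many items pass through `add`. Wider rows reduce the overestimate, and more rows reduce the chance of a large one.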

  20. Evolution of SPEs
    Timeline: Aurora (2003), STREAM (2004), Borealis (2005), SPC (2006), SPADE (2008), S4 (2010), Storm (2011), Samza (2013)
    Grouped into 1st generation, 2nd generation, 3rd generation
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
    Storm: http://storm-project.net
    Samza: http://samza.incubator.apache.org

  21. Actors Model
    [Diagram: live streams (Stream 1, Stream 2, Stream 3) flow through a graph of processing elements (PEs) connected by event routing; Output 1 and Output 2 go to an external persister]

  22. S4 Example
    Input: status.text:"Introducing #S4: a distributed #stream processing system"
    Events (EV = event type, KEY = key, VAL = value), handled by PE1, PE2, PE3, PE4:
    EV RawStatus, KEY null, VAL text="Int..."
    EV Topic, KEY topic="S4", VAL count=1
    EV Topic, KEY topic="stream", VAL count=1
    EV Topic, KEY reportKey="1", VAL topic="S4", count=4
    TopicExtractorPE (PE1)
    extracts hashtags from status.text
    TopicCountAndReportPE (PE2-3)
    keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold.
    TopicNTopicPE (PE4)
    keeps counts for top topics and outputs
    top-N topics to external persister
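
The first two stages of the example above (extracting hashtags, then keeping a keyed count per topic) can be sketched in plain Java. This is a hedged illustration, not S4 code: the class and method names are hypothetical, there is no event routing, and the report is checked on every update rather than at regular intervals as the slide describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in for the topic extraction and counting stages (hypothetical names). */
final class TopicCountSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold; // configured report threshold

    TopicCountSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Extraction stage: returns the hashtags found in a status text, without the '#'. */
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    /** Counting stage: increments the count for one topic key; true means "report it". */
    boolean count(String topic) {
        int c = counts.merge(topic, 1, Integer::sum);
        return c > threshold;
    }

    /** Current count for a topic, 0 if never seen. */
    int get(String topic) {
        return counts.getOrDefault(topic, 0);
    }
}
```

In a distributed engine each topic key would be owned by one counting instance, so the map here plays the role of all counting instances combined.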

  23. Paradigm Shift
    [Diagram: Gather, Clean, Model, Deploy]

  24. SAMOA
    Scalable Advanced Massive Online Analysis

  25. Concept
    SAMOA is a platform
    A framework for developing distributed streaming machine learning algorithms for researchers
    A library of state-of-the-art distributed streaming machine learning algorithms for practitioners

  26. Is SAMOA useful for you?
    Only if you need to deal with:
    Big fast data
    Evolving data (model updates)
    What is happening now?
    Use feedback in real-time
    Adapt to changes faster

  27. Architecture
    [Layered diagram: SAMOA algorithms (Classifier Methods, Clustering Methods, Frequent Pattern Mining) on top of the SAMOA platform, which runs on S4, Storm, …]

  28. Advantages
    Program once, run everywhere
    Reuse existing computational infrastructure
    Model is always up to date
    No system downtime
    No complex backup/update procedures
    No need to choose update frequency

  29. What about Mahout?
    Think SAMOA = Mahout for streaming
    But SAMOA…
    More than JBoA (just a bunch of algorithms)
    Provides a common platform
    Easy to port to new computing engines

  30. Current Status
    Parallel algorithms
    Vertical Hoeffding Tree (classification)
    Clustream (clustering)
    PARMA (frequent pattern mining) [pending]
    Platforms
    S4 & Storm (Samza coming soon)
    Alpha version at https://github.com/yahoo/samoa

  31. Long-Term Goals
    Easy to integrate add-ons with packages (like R)
    Most common algorithms implemented (like Mahout)
    Large community in industry & academia (like Hadoop)
    Become reference platform for big data stream mining (like Weka)
    Lively open-source project (Apache Incubator)

  32. Challenges
    Algorithmic
    Platform design
    Implementation

  33. Algorithmic
    Case study: Vertical Hoeffding Tree
    What kind of parallelism?
      Task
      Data: Horizontal, Vertical
    [Diagram labels: Instance, Attributes, Class]

  34. Task Parallelism

  35. Horizontal Parallelism
    [Diagram labels: Stream, Instances, Stats (×3), Histograms, Model]
    Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

  36. Vertical Parallelism
    [Diagram labels: Stream, Attributes, Stats (×3), Splits, Model]

  37. Hoeffding Tree Profiling
    Training CPU time (100 nominal and 100 numeric attributes)
    Learn: 70%
    Split: 24%
    Other: 6%

  38. Vertical Parallelism
    High number of attributes (e.g., documents) results in high level of parallelism
    Parallelism is observed immediately (compared to task parallelism)
    Localized failure handling and model updates (model is kept in one node)
    Less memory usage compared to horizontal
    partitioning (no model replication)
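
The idea behind vertical parallelism can be sketched as follows: each instance is sliced by attribute, and every slice goes to the one statistics worker that owns that attribute, so no worker needs a copy of the model. This is a simplified illustration under assumptions, not SAMOA's implementation; the class names, the modulo ownership rule, and the string-keyed counters are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative vertical partitioning of instances across statistics workers (hypothetical names). */
final class VerticalPartitionSketch {

    /** One statistics worker: holds counts only for the attributes routed to it. */
    static final class StatsWorker {
        // key "attributeIndex=value|classLabel" -> number of occurrences
        final Map<String, Integer> counts = new HashMap<>();

        void update(int attr, String value, String label) {
            counts.merge(attr + "=" + value + "|" + label, 1, Integer::sum);
        }
    }

    final List<StatsWorker> workers = new ArrayList<>();

    VerticalPartitionSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            workers.add(new StatsWorker());
        }
    }

    /** Fixed ownership: an attribute always goes to the same worker. */
    int ownerOf(int attr) {
        return attr % workers.size();
    }

    /** Slices one instance by attribute and sends each slice to its owner. */
    void learn(String[] attributes, String label) {
        for (int attr = 0; attr < attributes.length; attr++) {
            workers.get(ownerOf(attr)).update(attr, attributes[attr], label);
        }
    }
}
```

With many attributes, every worker receives part of each instance, which is why the parallelism shows up from the first instance onward.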

  39. Vertical Hoeffding Tree
    Processors: Source (n), Model (1), Stats (n), Evaluator (1)
    Streams: Instance, Control, Split, Result
    Stream groupings:
    Shuffle Grouping
    Key Grouping
    All Grouping
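
The three groupings differ only in how an event picks its destination among the parallel instances of a processor. The sketch below illustrates the usual semantics of these terms in stream processing engines such as Storm: shuffle spreads events evenly, key sends equal keys to the same instance, and all broadcasts. Names are hypothetical, and round-robin stands in for the engine's own load spreading.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative routing rules for shuffle, key, and all groupings (hypothetical names). */
final class GroupingSketch {

    /** Returns the indices of the destination instances for one event. */
    interface Grouping {
        List<Integer> route(Object key, int parallelism);
    }

    /** Shuffle: one destination per event, spread evenly (round-robin here). */
    static Grouping shuffle() {
        int[] next = {0};
        return (key, n) -> {
            int target = next[0];
            next[0] = (next[0] + 1) % n;
            return Collections.singletonList(target);
        };
    }

    /** Key: one destination per event, chosen by the key's hash, so equal keys meet. */
    static Grouping key() {
        return (key, n) -> Collections.singletonList(Math.floorMod(key.hashCode(), n));
    }

    /** All: every instance receives a copy of the event. */
    static Grouping all() {
        return (key, n) -> {
            List<Integer> targets = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                targets.add(i);
            }
            return targets;
        };
    }
}
```

Key grouping suits per-key state such as per-attribute statistics, while all grouping suits messages that every instance must see.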

  40. Accuracy
    [Plot: VHT2, tree-100; axis label: No. Leaf Nodes]
    Very close and very high accuracy

  41. Performance
    [Bar chart: Profiling Results for text-10000 with 100000 instances; x-axis: Classifier (MHT, VHT2-par-3); y-axis: Execution Time (seconds); series: t_calc, t_comm, t_serial]
    Throughput
    VHT2-par-3: 2631 inst/sec
    MHT: 507 inst/sec

  42. Platform Design
    What is the right level of abstraction?
    Application building
    Computation
    Communication

  43. ML Developer API
    Processing Item
    Processor
    Stream

  44. ML Developer API
    // Two sources feed one join processor
    TopologyBuilder builder; // assumed to be initialized elsewhere
    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    // The join reads streamOne with shuffle grouping and streamTwo with key grouping
    Processor join = new JoinProcessor();
    builder.addProcessor(join)
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo);

  45. Implementation
    How to hide platform differences?
    Deployment
    Runtime
    How to isolate platform-related code?
    Build and release architecture
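
One common way to isolate platform-related code is to have algorithms depend only on a small engine-neutral interface, with one binding class per engine. The sketch below shows that pattern with two in-process stand-ins. It is an assumption-laden illustration, not SAMOA's actual API: every name is hypothetical, and no real S4 or Storm calls are made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Illustrative separation of algorithm code from engine bindings (hypothetical names). */
final class PlatformIsolationSketch {

    /** The only type that algorithm code is allowed to see. */
    interface Engine {
        <I, O> List<O> run(List<I> input, Function<I, O> task);
    }

    /** Stand-in for one engine binding: sequential execution. */
    static final class LocalEngine implements Engine {
        @Override
        public <I, O> List<O> run(List<I> input, Function<I, O> task) {
            List<O> out = new ArrayList<>();
            for (I item : input) {
                out.add(task.apply(item));
            }
            return out;
        }
    }

    /** Stand-in for another engine binding: parallel execution, same contract. */
    static final class ParallelEngine implements Engine {
        @Override
        public <I, O> List<O> run(List<I> input, Function<I, O> task) {
            return input.parallelStream().map(task).collect(Collectors.toList());
        }
    }

    /** Algorithm code: written once against Engine, unaware of which binding runs it. */
    static List<Integer> wordLengths(Engine engine, List<String> words) {
        return engine.run(words, String::length);
    }
}
```

Swapping `LocalEngine` for `ParallelEngine` changes how the work runs but not the algorithm or its result, so only the binding has to be rebuilt for a new engine.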

  46. Deployment
    SAMOA-API.jar: API. Algorithm developer depends only on this
    SAMOA-S4.jar (S4 bindings) → samoa-s4-deployable.s4r → To S4 cluster
    SAMOA-Storm.jar (Storm bindings) → samoa-storm-deployable.jar → To Storm cluster

  47. Conclusions
    SAMOA: A Platform for Mining Big Data Streams
    Runs on existing distributed stream processing engines
    Parallel algorithms for mining data streams
    Pluggable architecture, flexible, extensible
    Open source and available (alpha release)

  48. Thanks!
    [email protected]
    https://github.com/yahoo/samoa
    @samoa_project
