SAMOA: A Platform for Mining Big Data Streams

Analyzing streaming data in real time is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now. It allows organizations to react quickly when problems appear and to detect new trends that help improve their performance. In this talk, we present SAMOA, an upcoming platform for online mining of big data streams in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines, such as S4 and Storm. SAMOA includes algorithms for the most common machine learning tasks, such as classification and clustering.

Gianmarco De Francisci Morales

November 30, 2013

Transcript

  1. SAMOA: A Platform for Mining Big Data Streams
    Gianmarco De Francisci Morales
    Yahoo Labs Barcelona
    [email protected]
  2. Taxonomy: Machine Learning
    Distributed, batch: Hadoop, Mahout
    Distributed, stream: S4, Storm, SAMOA
    Non-distributed, batch: R, WEKA, …
    Non-distributed, stream: MOA
  3. Research Scientist @ Yahoo Labs
    Committer for Apache Pig.
    Contributor to Apache Hadoop, Giraph, S4.
  4. Big Data Stream
    Volume + Velocity (+ Variety)
    Too large for a single commodity server's main memory
    Too fast for a single commodity server's CPU
    A solution should be: distributed, scalable
  5. Data Science Lifecycle
    Old school's data mining
    From data to insight
    From insight to model
    From model to value
    And repeat!
    Stages: Gather, Clean, Model, Deploy
  6. Problems
    Operational: need to rerun the pipeline and redeploy the model when new data arrives
    Paradigmatic: new data lies in storage without generating new value until the new model is retrained
  7. Stream
    Batch data is a snapshot of streaming data
  8. Examples
    User clicks, search queries, news, emails, Tumblr posts, Flickr photos, finance stocks, credit card transactions, Wikipedia edit logs, Facebook statuses, Twitter updates
    Name your own…
  9. But we have Hadoop!
    “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!”
    [J. Lin, in Big Data, 1(1):28–37, 2013]
    “Data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”
    [A. Jacobs, in ACM Queue, 7(6):10, 2009]
  10. Big Data
    Too big to handle
  11. Future of big data
    Like drinking from a firehose
  12. Importance
    Spam detection in comments on Yahoo News
    Trends change in time
    Need to retrain model with new data
  13. Streaming
    Sequence is potentially infinite
    High amount of data, high speed of arrival
    Change over time (concept drift)
    Approximation algorithms (small error with high probability)
    Single pass, one data item at a time
    Sublinear space and time per data item
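The streaming constraints listed on this slide (single pass, one item at a time, bounded memory, small error with high probability) can be illustrated with a minimal Java sketch. It is not taken from the talk or from SAMOA: the class name `StreamingMeanSketch` and its methods are made up for illustration. The error bound is the standard Hoeffding inequality, which also gives the Hoeffding Tree discussed later in the deck its name.

```java
/** Illustrative only: a single-pass mean with constant memory. Not SAMOA code. */
final class StreamingMeanSketch {
    private long n = 0;        // number of items seen so far
    private double mean = 0.0; // running mean; no item is ever stored

    /** Consumes one item and updates the running mean incrementally. */
    void add(double x) {
        n++;
        mean += (x - mean) / n;
    }

    double mean() { return mean; }

    long count() { return n; }

    /**
     * Hoeffding bound: after n independent observations of a variable whose
     * values span the given range, the true mean is no smaller than the
     * observed mean minus epsilon with probability at least 1 - delta.
     */
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        StreamingMeanSketch s = new StreamingMeanSketch();
        for (double x : new double[] {1.0, 2.0, 3.0}) {
            s.add(x);
        }
        System.out.println("mean = " + s.mean());
        System.out.println("epsilon = " + hoeffdingBound(1.0, 0.05, 1000));
    }
}
```

The bound shrinks with the square root of the number of observations, so quadrupling the stream length halves the error margin.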
  14. Evolution of SPEs
    Timeline: Aurora (2003), STREAM (2004), Borealis (2005), SPC (2006), SPADE (2008), S4 (2010), Storm (2011), Samza (2013)
    Grouped on the slide into 1st, 2nd, and 3rd generation
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
    Storm: http://storm-project.net
    Samza: http://samza.incubator.apache.org
  15. Actors Model
    Diagram labels: Live Streams (Stream 1, Stream 2, Stream 3), Event routing, PE (×5), External Persister, Output 1, Output 2
  16. S4 Example
    status.text: "Introducing #S4: a distributed #stream processing system"
    Events passed between PE1, PE2, PE3, and PE4:
    EV         KEY             VAL
    RawStatus  null            text="Int..."
    Topic      topic="S4"      count=1
    Topic      topic="stream"  count=1
    Topic      reportKey="1"   topic="S4", count=4
    TopicExtractorPE (PE1) extracts hashtags from status.text.
    TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits a report event if the topic count is above a configured threshold.
    TopicNTopicPE (PE4) keeps counts for top topics and outputs the top-N topics to an external persister.
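The first two steps of this example (hashtag extraction, then keyed counting with a report threshold) can be mimicked in plain Java. The sketch below is a stand-in written for illustration and does not use the S4 API: the class `TopicCountSketch`, its methods, and the threshold value are all assumptions, and a single map plays the role that per-key PE instances play in S4.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in for the topic-count pipeline; none of this is S4 API. */
final class TopicCountSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Keyed state: S4 would hold one PE instance per topic; here one map does the job.
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold; // hypothetical reporting threshold

    TopicCountSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Role of TopicExtractorPE: pull the hashtags out of a status text. */
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    /** Role of TopicCountAndReportPE: count the topic, return true once it is above the threshold. */
    boolean count(String topic) {
        int c = counts.merge(topic, 1, Integer::sum);
        return c > threshold;
    }

    int countOf(String topic) {
        return counts.getOrDefault(topic, 0);
    }

    public static void main(String[] args) {
        TopicCountSketch counter = new TopicCountSketch(1);
        String status = "Introducing #S4: a distributed #stream processing system";
        for (String topic : extractTopics(status)) {
            boolean report = counter.count(topic);
            System.out.println(topic + " count=" + counter.countOf(topic) + " report=" + report);
        }
    }
}
```

In a distributed engine the `count` step would run on whichever node owns the topic key, which is what the KEY column in the event table expresses.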
  17. Paradigm Shift
    Diagram labels: Gather, Clean, Model, Deploy
  18. Concept
    SAMOA is a platform
    For researchers: a framework for developing distributed streaming machine learning algorithms
    For practitioners: a library of state-of-the-art distributed streaming machine learning algorithms
  19. Is SAMOA useful for you?
    Only if you need to deal with:
    Big fast data
    Evolving data (model updates)
    What is happening now?
    Use feedback in real time
    Adapt to changes faster
  20. Architecture
    Engines: S4, Storm, …
    SAMOA platform on top of the engines
    SAMOA algorithms: classifier methods, clustering methods, frequent pattern mining
  21. Advantages
    Program once, run everywhere
    Reuse existing computational infrastructure
    Model is always up to date
    No system downtime
    No complex backup/update procedures
    No need to choose update frequency
  22. What about Mahout?
    Think SAMOA = Mahout for streaming
    But SAMOA…
    More than JBoA (just a bunch of algorithms)
    Provides a common platform
    Easy to port to new computing engines
  23. Current Status
    Parallel algorithms:
    Vertical Hoeffding Tree (classification)
    Clustream (clustering)
    PARMA (frequent pattern mining) [pending]
    Platforms: S4 & Storm (Samza coming soon)
    Alpha version at https://github.com/yahoo/samoa
  24. Long-Term Goals
    Easy to integrate add-ons with packages (like R)
    Most common algorithms implemented (like Mahout)
    Large community in industry & academia (like Hadoop)
    Become reference platform for big data stream mining (like Weka)
    Lively open-source project (Apache Incubator)
  25. Algorithmic Case Study: Vertical Hoeffding Tree
    What kind of parallelism?
    Task
    Data: horizontal or vertical
    Diagram labels: Instance, Attributes, Class
  26. Horizontal Parallelism
    Diagram labels: Stream, Instances, Stats (×3), Histograms, Model
    Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.
  27. Vertical Parallelism
    Diagram labels: Stream, Attributes, Stats (×3), Splits, Model
  28. Hoeffding Tree Profiling
    Training CPU time, 100 nominal and 100 numeric attributes:
    Learn 70%
    Split 24%
    Other 6%
  29. Vertical Parallelism
    High number of attributes (e.g., documents) results in a high level of parallelism
    Parallelism is observed immediately (compared to task parallelism)
    Localized failure handling and model updates (model is kept in one node)
    Less memory usage compared to horizontal partitioning (no model replication)
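The idea of vertical parallelism can be sketched in a few lines of Java: each incoming instance is split by attribute, and every attribute is always routed to the same statistics worker, so no worker needs a copy of the model. The code below is an illustration under assumptions, not the SAMOA implementation. The names `VerticalPartitionSketch` and `StatsWorker`, the modulo routing rule, and the string-keyed counters are all made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: splitting instances by attribute across statistics workers. */
final class VerticalPartitionSketch {

    /** One statistics worker: counts (attribute, value, class) triples for its attributes. */
    static final class StatsWorker {
        final Map<String, Integer> counts = new HashMap<>();

        void observe(int attribute, String value, String label) {
            counts.merge(attribute + "=" + value + "|" + label, 1, Integer::sum);
        }
    }

    final StatsWorker[] workers;

    VerticalPartitionSketch(int parallelism) {
        workers = new StatsWorker[parallelism];
        for (int i = 0; i < parallelism; i++) {
            workers[i] = new StatsWorker();
        }
    }

    /** Splits one instance by attribute; attribute a always goes to worker (a mod n). */
    void learn(String[] attributeValues, String label) {
        for (int a = 0; a < attributeValues.length; a++) {
            workers[a % workers.length].observe(a, attributeValues[a], label);
        }
    }

    public static void main(String[] args) {
        VerticalPartitionSketch sketch = new VerticalPartitionSketch(2);
        sketch.learn(new String[] {"sunny", "hot", "high"}, "no");
        sketch.learn(new String[] {"sunny", "mild", "high"}, "yes");
        for (int i = 0; i < sketch.workers.length; i++) {
            System.out.println("worker " + i + ": " + sketch.workers[i].counts);
        }
    }
}
```

With many attributes, as in text data, the per-instance work spreads over many workers, which matches the slide's point that a high number of attributes yields a high level of parallelism.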
  30. Vertical Hoeffding Tree
    Components: Source (n), Model (1), Stats (n), Evaluator (1)
    Streams: Instance, Control, Split, Result
    Groupings: Shuffle, Key, All
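The three groupings named on this slide decide which instance of a parallel component receives each event. As a rough sketch of the usual semantics (not SAMOA or Storm code, with `GroupingSketch` and its methods invented for illustration): shuffle spreads events evenly, key sends all events with the same key to the same instance, and all broadcasts to every instance. Round-robin is used here as a deterministic stand-in for shuffle, which engines may implement randomly.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: how shuffle, key, and all grouping pick target instances. */
final class GroupingSketch {
    private int next = 0; // round-robin cursor used by shuffle

    /** Shuffle grouping: spread events evenly (round-robin stand-in). */
    int shuffle(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** Key grouping: events with equal keys always reach the same instance. */
    static int byKey(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: every instance receives the event. */
    static List<Integer> all(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        GroupingSketch g = new GroupingSketch();
        System.out.println("shuffle: " + g.shuffle(3) + ", " + g.shuffle(3));
        System.out.println("key(attribute 7): " + byKey(7, 4));
        System.out.println("all: " + all(3));
    }
}
```

Using `Math.floorMod` rather than `%` keeps the target index non-negative even when a key's hash code is negative.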
  31. Accuracy
    Plot: number of leaf nodes, VHT2, tree-100
    Very close and very high accuracy
  32. Performance
    Chart: profiling results for text-10000 with 100000 instances; execution time (seconds) per classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm, t_serial
    Throughput: VHT2-par-3 2631 inst/sec; MHT 507 inst/sec
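As a quick worked check of the two throughput figures on this slide, the ratio between them gives the relative speedup of VHT2-par-3 over MHT. The ratio is derived here from the reported numbers and is not stated in the talk.

```latex
\[
\text{speedup} \;=\; \frac{2631\ \text{inst/sec}}{507\ \text{inst/sec}} \;\approx\; 5.2
\]
```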
  33. Platform Design
    What is the right level of abstraction?
    Application building
    Computation
    Communication
  34. ML Developer API
    Diagram labels: Processing Item, Processor, Stream
  35. ML Developer API
    // Initialization of the builder is not shown on the slide.
    TopologyBuilder builder;

    // First source and its output stream.
    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    // Second source and its output stream.
    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    // The join reads streamOne with shuffle grouping and streamTwo with key grouping.
    Processor join = new JoinProcessor();
    builder.addProcessor(join)
           .connectInputShuffle(streamOne)
           .connectInputKey(streamTwo);
  36. Implementation
    How to hide platform differences? Deployment, runtime
    How to isolate platform-related code? Build and release architecture
  37. Deployment
    SAMOA-API.jar: API. Algorithm developer depends only on this
    SAMOA-S4.jar: S4 bindings
    SAMOA-Storm.jar: Storm bindings
    samoa-s4-deployable.s4r: to S4 cluster
    samoa-storm-deployable.jar: to Storm cluster
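The packaging on this slide reflects a common design: algorithm code is written against a platform-neutral API, and each engine supplies its own bindings. The Java sketch below illustrates that separation only. The interface `EngineBinding`, the two fake bindings, and `AlgorithmSketch` are hypothetical names invented for this example, and nothing here calls S4, Storm, or SAMOA.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical platform-neutral API: the only type the algorithm code depends on. */
interface EngineBinding {
    /** Registers a named processor and returns a label describing where it would run. */
    String deploy(String processorName);
}

/** Stand-in for engine-specific bindings; it does not talk to S4. */
final class FakeS4Binding implements EngineBinding {
    @Override
    public String deploy(String processorName) {
        return "s4:" + processorName;
    }
}

/** Stand-in for engine-specific bindings; it does not talk to Storm. */
final class FakeStormBinding implements EngineBinding {
    @Override
    public String deploy(String processorName) {
        return "storm:" + processorName;
    }
}

/** Algorithm definition written once, against the API only. */
final class AlgorithmSketch {
    static List<String> build(EngineBinding engine) {
        List<String> deployed = new ArrayList<>();
        deployed.add(engine.deploy("source"));
        deployed.add(engine.deploy("model"));
        return deployed;
    }

    public static void main(String[] args) {
        System.out.println(build(new FakeS4Binding()));
        System.out.println(build(new FakeStormBinding()));
    }
}
```

Because `build` never names a concrete engine, adding another engine only requires one more class implementing the interface, which is the property the slide's jar layout is meant to preserve.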
  38. Conclusions
    SAMOA: A Platform for Mining Big Data Streams
    Runs on existing distributed stream processing engines
    Parallel algorithms for mining data streams
    Pluggable architecture, flexible, extensible
    Open source and available (alpha release)
  39. Thanks!
    [email protected]
    https://github.com/yahoo/samoa
    @samoa_project