SAMOA: A Platform for Mining Big Data Streams

Presented in Chile during Hypertext 2014 (August)

Transcript

  1. SAMOA: A Platform for Mining Big Data Streams. Gianmarco De Francisci Morales, Yahoo Labs Barcelona, gdfm@yahoo-inc.com
  2. Agenda. Streams: Applications, Model, Tools, Advantages. SAMOA: Goal, Example, Challenges.
  3. Streams. “Panta rhei” (everything flows), Heraclitus
  4. Importance. Spam detection in comments on Yahoo! News. Trends change in time. Need to retrain the model with new data.
  5. Spam on Twitter
  9. Applications: Personalization, Spam detection, Recommendation
  10. Big Data Stream. Volume + Velocity (+ Variety). Too large for the main memory of a single commodity server. Too fast for the CPU of a single commodity server. A solution should be distributed and scalable.
  11. Examples: user clicks, search queries, news, emails, Tumblr posts, Flickr photos, finance stocks, credit card transactions, Wikipedia edit logs, Facebook statuses, Twitter updates. Name your own…
  12. Stream. Batch data is a snapshot of streaming data.
  13. Data Science Lifecycle. Old-school data mining: from data to insight, from insight to model, from model to value. And repeat! Stages: Gather, Clean, Model, Deploy.
  14. Big Data Tools
  15. Problems. Operational: need to rerun the pipeline and redeploy the model when new data arrives. Paradigmatic: new data lies in storage without generating new value until a new model is trained.
  16. Present of big data: too big to handle
  17. Future of big data: drinking from a firehose
  18. A Tale of Two Tribes. [Diagram labels: DB (×6), App (×3), Data, Database, Faster, Larger]
    M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05.
    A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009.
  22. Evolution of SPEs. [Timeline with marks at 2003, 2004, 2005, 2006, 2008, 2010, 2011, 2013, grouped into 1st, 2nd, and 3rd generation: Aurora, STREAM, Borealis, SPC, SPADE, Storm, S4, Samza]
    Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003.
    Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004.
    Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05.
    Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06.
    Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08.
    Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10.
    Storm: http://storm-project.net
    Samza: http://samza.incubator.apache.org
  23. Actors Model. [Diagram labels: Live Streams (Stream 1, Stream 2, Stream 3), PE (×5), Event routing, External Persister, Output 1, Output 2]
  24. S4 Example. Input: status.text:"Introducing #S4: a distributed #stream processing system"
    Events (EV / KEY / VAL): RawStatus / null / text="Int..."; Topic / topic="S4" / count=1; Topic / topic="stream" / count=1; Topic / reportKey="1" / topic="S4", count=4.
    TopicExtractorPE (PE1) extracts hashtags from status.text.
    TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits a report event if the topic count is above a configured threshold.
    TopicNTopicPE (PE4) keeps counts for top topics and outputs the top-N topics to an external persister.
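
The three processing elements on the S4 example slide can be approximated in plain Java. This is only a single-process sketch under stated assumptions: it does not use the S4 API, and the class name `TopicCountSketch`, its method names, and the hashtag regular expression are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical single-process stand-in for the S4 pipeline; not the S4 API.
final class TopicCountSketch {
    // Assumption: a hashtag is '#' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private final Map<String, Integer> counts = new HashMap<>();

    // PE1 analogue: extract hashtags from the status text.
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    // PE2-3 analogue: keep a running count per topic across all statuses.
    void observe(String statusText) {
        for (String topic : extractTopics(statusText)) {
            counts.merge(topic, 1, Integer::sum);
        }
    }

    // PE4 analogue: top-N topics whose count is above the threshold,
    // highest count first, ties broken alphabetically.
    List<String> topN(int n, int threshold) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TopicCountSketch pipeline = new TopicCountSketch();
        pipeline.observe("Introducing #S4: a distributed #stream processing system");
        pipeline.observe("More on #S4");
        System.out.println(pipeline.topN(1, 1)); // prints [S4]
    }
}
```

In a distributed engine each stage would run as separate keyed instances; here the map stands in for the per-topic state that the counting PEs would hold.
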
  25. But we have Hadoop!
    “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” [J. Lin, in Big Data, 1(1):28–37, 2013]
    “Data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time” [A. Jacobs, in ACM Queue, 7(6):10, 2009]
  26. Paradigm Shift. [Diagram labels: Gather, Clean, Model, Deploy]
  27. Streaming Model. Sequence is potentially infinite. High amount of data, high speed of arrival. Change over time (concept drift). Approximation algorithms (small error with high probability). Single pass, one data item at a time. Sub-linear space and time per data item.
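
To make the single-pass, bounded-memory constraints concrete, here is a minimal sketch of the Misra-Gries frequent-items summary. It is one well-known streaming approximation chosen for illustration, not an algorithm named on the slide, and its error bound is deterministic rather than probabilistic: with k - 1 counters over n items, each estimate undercounts the true frequency by at most n/k. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Misra-Gries summary: one pass over the stream, at most k - 1 counters.
final class MisraGries {
    private final int capacity; // k - 1 counters
    private final Map<String, Long> counters = new HashMap<>();

    MisraGries(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        this.capacity = k - 1;
    }

    // Process one data item; the item itself is never stored beyond its counter.
    void add(String item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    // Lower-bound estimate of how often the item has been seen.
    long estimate(String item) {
        return counters.getOrDefault(item, 0L);
    }

    public static void main(String[] args) {
        MisraGries summary = new MisraGries(2);
        for (String item : new String[] {"a", "a", "a", "b", "c", "a"}) {
            summary.add(item);
        }
        // True count of "a" is 4; the estimate may be lower by at most n/k = 3.
        System.out.println(summary.estimate("a")); // prints 2
    }
}
```
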
  28. SAMOA: Scalable Advanced Massive Online Analysis
  29. Concept. SAMOA is a platform. For researchers: a framework for developing distributed stream mining algorithms. For practitioners: a library of state-of-the-art distributed stream mining algorithms.
  30. Taxonomy of data mining tools.
    Distributed, batch: Hadoop, Mahout
    Distributed, stream: Storm, S4, Samza, SAMOA
    Non-distributed, batch: R, WEKA, …
    Non-distributed, stream: MOA
  31. What about Mahout? Think SAMOA = Mahout for streaming. But SAMOA is more than JBoA (just a bunch of algorithms): it provides a common platform and is easy to port to new computing engines.
  32. Architecture. [Diagram: SAMOA]
  33. Status. Parallel algorithms: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression), PARMA (frequent pattern mining) [pending]. Execution engines: Storm, S4, Samza (+ Local). https://github.com/yahoo/samoa
  34. Is SAMOA useful for you? Only if you need to deal with: big fast data; evolving data (model updates). What is happening now? Use feedback in real time. Adapt to changes faster.
  35. Advantages (operational). Program once, run everywhere. Reuse existing computational infrastructure. Avoid deploy cycle. No system downtime. No complex backup/update procedures. No need to choose update frequency.
  36. Advantages (paradigmatic). Model freshness. Immediate data value. No stream/batch impedance mismatch.
  37. Algorithmic Challenges. Case study: Vertical Hoeffding Tree. What kind of parallelism? Task parallelism, or data parallelism (horizontal or vertical). [Diagram labels: Instance, Attributes, Class]
  38. Task Parallelism
  39. Horizontal Parallelism. [Diagram labels: Stream, Instances, Stats (×3), Histograms, Model, Model Updates]
    Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.
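
The horizontal scheme in the diagram (workers summarize their share of the instances as histograms, and the model combines them) can be sketched as follows. This is a simplification under stated assumptions: it uses fixed-width bins over a known range, whereas the cited paper builds adaptive histograms, and the class and method names are hypothetical.

```java
// Horizontal parallelism sketch: each worker sees a subset of the instances,
// summarizes one numeric attribute as a histogram, and the model merges them.
final class HistogramMergeSketch {
    // Worker side: fixed-width histogram over [min, max) for one share of the stream.
    static long[] localHistogram(double[] values, double min, double max, int bins) {
        long[] histogram = new long[bins];
        double width = (max - min) / bins;
        for (double v : values) {
            int bin = (int) ((v - min) / width);
            bin = Math.max(0, Math.min(bins - 1, bin)); // clamp out-of-range values
            histogram[bin]++;
        }
        return histogram;
    }

    // Model side: merging is an element-wise sum because all workers share the bins.
    static long[] merge(long[]... parts) {
        long[] total = new long[parts[0].length];
        for (long[] part : parts) {
            for (int i = 0; i < total.length; i++) {
                total[i] += part[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] workerA = localHistogram(new double[] {0.1, 0.6, 0.9}, 0.0, 1.0, 2);
        long[] workerB = localHistogram(new double[] {0.2, 0.3}, 0.0, 1.0, 2);
        long[] merged = merge(workerA, workerB);
        System.out.println(merged[0] + " " + merged[1]); // prints 3 2
    }
}
```
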
  40. Hoeffding Tree Profiling. Training CPU time with 100 nominal and 100 numeric attributes: Learn 70%, Split 24%, Other 6%.
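
As background for the "Split" share of the profile: a Hoeffding tree decides when to split a leaf by comparing the gap between the two best candidate attributes against the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The formula comes from the standard Hoeffding tree literature rather than from these slides, and the class name, method names, and parameter values below are illustrative assumptions.

```java
// Hoeffding bound used in the split decision of a Hoeffding tree.
final class HoeffdingBound {
    // range: range R of the split criterion (e.g., log2(#classes) for information gain)
    // delta: allowed probability that the chosen attribute is not the best one
    // n:     number of instances observed at the leaf
    static double epsilon(double range, double delta, long n) {
        if (range <= 0 || delta <= 0 || delta >= 1 || n <= 0) {
            throw new IllegalArgumentException("need range > 0, 0 < delta < 1, n > 0");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the runner-up by more than epsilon.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        // Illustrative values only: two classes (R = 1), delta = 1e-7, 1000 instances.
        System.out.println(epsilon(1.0, 1e-7, 1000));
        System.out.println(shouldSplit(0.30, 0.10, 1.0, 1e-7, 1000)); // prints true
    }
}
```

Because epsilon shrinks as n grows, a leaf that cannot separate its two best attributes yet simply waits for more instances.
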
  41. Vertical Parallelism. [Diagram labels: Stream, Model, Attributes, Stats (×3), Splits]
  42. Vertical Parallelism. High number of attributes => high level of parallelism (e.g., documents). vs. task parallelism: parallelism observed immediately. vs. horizontal parallelism: reduced memory usage (no model replication), parallelized split computation.
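
A minimal sketch of the vertical idea: one instance is split by attribute, and each attribute always goes to the same statistics worker, so no worker needs a copy of the model. The routing rule (attribute index modulo the number of workers) and all names here are assumptions made for illustration, not SAMOA's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Vertical parallelism sketch: partition the attributes of one instance across workers.
final class VerticalPartitionSketch {
    // Assumed routing rule: the same attribute index always maps to the same worker.
    static int workerFor(int attributeIndex, int parallelism) {
        return Math.floorMod(attributeIndex, parallelism);
    }

    // Split one instance's attribute values into per-worker slices.
    static List<List<Double>> partition(double[] attributes, int parallelism) {
        List<List<Double>> slices = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            slices.add(new ArrayList<>());
        }
        for (int i = 0; i < attributes.length; i++) {
            slices.get(workerFor(i, parallelism)).add(attributes[i]);
        }
        return slices;
    }

    public static void main(String[] args) {
        List<List<Double>> slices = partition(new double[] {1, 2, 3, 4, 5}, 2);
        System.out.println(slices); // prints [[1.0, 3.0, 5.0], [2.0, 4.0]]
    }
}
```

With many attributes (for example, words in a document) the slices stay balanced, which is where the high level of parallelism on the slide comes from.
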
  43. Vertical Hoeffding Tree. [Diagram labels: Source (n), Model (n), Stats (n), Evaluator (1); streams: Instance, Control, Split, Result; groupings: Shuffle Grouping, Key Grouping, All Grouping]
  44. Accuracy. [Chart labels: No. Leaf Nodes, VHT2, tree-100] Very close and very high accuracy.
  45. Performance. [Chart: Profiling Results for text-10000 with 100000 instances; execution time (seconds) per classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm, t_serial] Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec.
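
For reference, the ratio of the two reported throughputs works out as follows (simple arithmetic on the slide's numbers; the label "speedup" is ours):

```latex
\[
\text{speedup} = \frac{2631\ \text{inst/sec}}{507\ \text{inst/sec}} \approx 5.19
\]
```
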
  46. ML Developer API. [Diagram labels: Processing Item, Processor, Stream]
  47. ML Developer API

        TopologyBuilder builder; // initialization not shown on the slide

        // First source and its output stream
        Processor sourceOne = new SourceProcessor();
        builder.addProcessor(sourceOne);
        Stream streamOne = builder.createStream(sourceOne);

        // Second source and its output stream
        Processor sourceTwo = new SourceProcessor();
        builder.addProcessor(sourceTwo);
        Stream streamTwo = builder.createStream(sourceTwo);

        // Join processor: shuffle grouping on streamOne, key grouping on streamTwo
        Processor join = new JoinProcessor();
        builder.addProcessor(join)
               .connectInputShuffle(streamOne)
               .connectInputKey(streamTwo);
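
The API snippet connects inputs with shuffle and key grouping, and the Vertical Hoeffding Tree diagram also uses all grouping. The sketch below illustrates what those three routing policies usually mean in stream processing engines. It is a hypothetical stand-in written for this transcript, not SAMOA code: the class `GroupingSketch`, its methods, and the round-robin and hash-based choices are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of shuffle, key, and all grouping over N processor instances.
final class GroupingSketch {
    private final int parallelism;
    private int next = 0; // round-robin cursor used by shuffle grouping

    GroupingSketch(int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        this.parallelism = parallelism;
    }

    // Shuffle grouping: spread events evenly, ignoring their content.
    List<Integer> shuffle() {
        int target = next;
        next = (next + 1) % parallelism;
        return Collections.singletonList(target);
    }

    // Key grouping: events with equal keys always reach the same instance.
    List<Integer> byKey(String key) {
        return Collections.singletonList(Math.floorMod(key.hashCode(), parallelism));
    }

    // All grouping: broadcast the event to every instance.
    List<Integer> all() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        GroupingSketch grouping = new GroupingSketch(3);
        System.out.println(grouping.shuffle()); // prints [0]
        System.out.println(grouping.shuffle()); // prints [1]
        System.out.println(grouping.byKey("S4").equals(grouping.byKey("S4"))); // prints true
        System.out.println(grouping.all()); // prints [0, 1, 2]
    }
}
```
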
  48. Deployment. SAMOA-API.jar: the API; the algorithm developer depends only on this. SAMOA-S4.jar: S4 bindings, packaged as samoa-s4-deployable.s4r and deployed to the S4 cluster. SAMOA-Storm.jar: Storm bindings, packaged as samoa-storm-deployable.jar and deployed to the Storm cluster.
  49. Conclusions. Streaming is the future and is happening now. SAMOA: A Platform for Mining Big Data Streams. Runs on existing DSPEs (Storm, Samza, S4). Algorithms for classification, regression, clustering. Available and open-source: http://samoa-project.net. A platform for collaboration and research on distributed stream mining.
  50. Open Challenges. Distributed stream mining algorithms. Active & semi-supervised learning + crowdsourcing. Millions of classes (e.g., Wikipedia pages). Multi-target learning. System issues (load balancing, communication). Programming paradigms and abstractions.
  51. Thanks! gdfm@yahoo-inc.com, https://github.com/yahoo/samoa, @samoa_project, @gdfm7