SAMOA: A Platform for Mining Big Data Streams

Gianmarco De Francisci Morales, Yahoo Labs Barcelona

Agenda
Streams: Applications, Model, Tools, Advantages
SAMOA: Goal, Example, Challenges

Streams
“Panta rhei” (everything flows), Heraclitus

Importance
Spam detection in comments on Yahoo! News
Trends change in time
Need to retrain model with new data

Spam on Twitter

Applications
Personalization
Spam detection
Recommendation

Big Data Stream
Volume + Velocity (+ Variety)
Too large for single commodity server main memory
Too fast for single commodity server CPU
A solution should be: distributed, scalable

Examples
User clicks, search queries, news, emails, Tumblr posts, Flickr photos, finance stocks, credit card transactions, Wikipedia edit logs, Facebook statuses, Twitter updates
Name your own…

Stream
Batch data is a snapshot of streaming data

Data Science Lifecycle
Old-school data mining
From data to insight
From insight to model
From model to value
And repeat!
[Figure: Gather, Clean, Model, Deploy]

Big Data Tools

Problems
Operational: need to rerun the pipeline and redeploy the model when new data arrives
Paradigmatic: new data lies in storage without generating new value until the model is retrained

Present of big data
Too big to handle

Future of big data
Drinking from a firehose

A Tale of Two Tribes
[Figure: databases (DB), Data, and applications (App); labels: Faster, Larger, Database]
M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05
A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009

Evolution of SPEs
Timeline across the 1st, 2nd, and 3rd generation: Aurora (2003), STREAM (2004), Borealis (2005), SPC (2006), SPADE (2008), S4 (2010), Storm (2011), Samza (2013)
Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
Storm: http://storm-project.net
Samza: http://samza.incubator.apache.org

Actors Model
[Figure: live streams (Stream 1, Stream 2, Stream 3), a graph of processing elements (PE) connected by event routing, and an external persister (Output 1, Output 2)]

S4 Example
status.text: "Introducing #S4: a distributed #stream processing system"
EV         KEY              VAL
RawStatus  null             text="Int..."
Topic      topic="S4"       count=1
Topic      topic="stream"   count=1
Topic      reportKey="1"    topic="S4", count=4
TopicExtractorPE (PE1) extracts hashtags from status.text.
TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets and regularly emits a report event if the topic count is above a configured threshold.
TopicNTopicPE (PE4) keeps counts for top topics and outputs the top-N topics to the external persister.
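The three PEs in this example can be approximated in plain Java. What follows is a minimal single-process sketch, not S4 code: the class and method names (TopicCountSketch, extractHashtags, count, topTopics) are hypothetical, and the threshold argument stands in for the configured threshold mentioned on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-process illustration of the topic-count pipeline; not the S4 API.
class TopicCountSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private final Map<String, Integer> counts = new HashMap<>();

    // Role of TopicExtractorPE: pull the hashtags out of a status text.
    static List<String> extractHashtags(String text) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    // Role of TopicCountAndReportPE: keep one running count per topic.
    void count(String text) {
        for (String topic : extractHashtags(text)) {
            counts.merge(topic, 1, Integer::sum);
        }
    }

    // Role of the top-N PE: topics above the threshold, by decreasing count, at most n of them.
    List<String> topTopics(int threshold, int n) {
        List<Map.Entry<String, Integer>> hot = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                hot.add(e);
            }
        }
        hot.sort((a, b) -> b.getValue() - a.getValue());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, hot.size()); i++) {
            top.add(hot.get(i).getKey());
        }
        return top;
    }
}
```

In a distributed engine each topic would be handled by its own keyed PE instance; here a single map plays that role.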

But we have Hadoop!
“MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” [J. Lin, in Big Data, 1(1):28–37, 2013]
“Data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time” [A. Jacobs, in ACM Queue, 7(6):10, 2009]

Paradigm Shift
[Figure: Gather, Clean, Model, Deploy]

Streaming Model
Sequence is potentially infinite
High amount of data, high speed of arrival
Change over time (concept drift)
Approximation algorithms (small error with high probability)
Single pass, one data item at a time
Sub-linear space and time per data item
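One well-known algorithm with these properties is the Count-Min sketch, which the slides do not name; it is used here only as an illustration. It processes each item once, uses a fixed-size table regardless of stream length, and never undercounts. With width w and depth d, the standard analysis bounds the overcount by (e/w) times the stream length with probability at least 1 - e^(-d), assuming pairwise-independent hash functions. The hash mixing below is a simplification that only imitates that assumption, and all names are hypothetical.

```java
import java.util.Random;

// Count-Min sketch: approximate per-item counts in a single pass with fixed memory.
class CountMinSketch {
    private final long[][] table;
    private final int[] seeds;
    private final int width;

    CountMinSketch(int depth, int width, long seed) {
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Column of the item in the given row; the mixing steps are an illustrative choice.
    private int column(int row, String item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Constant work per item: one counter increment per row.
    void add(String item) {
        for (int row = 0; row < table.length; row++) {
            table[row][column(row, item)]++;
        }
    }

    // The smallest counter over all rows is the least-collided estimate.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][column(row, item)]);
        }
        return min;
    }
}
```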

SAMOA
Scalable Advanced Massive Online Analysis

Concept
SAMOA is a platform
Researchers: framework for developing distributed stream mining algorithms
Practitioners: library of state-of-the-art distributed stream mining algorithms

Taxonomy
Data Mining
Distributed, Batch: Hadoop, Mahout
Distributed, Stream: Storm, S4, Samza, SAMOA
Non-distributed, Batch: R, WEKA, …
Non-distributed, Stream: MOA

What about Mahout?
Think SAMOA = Mahout for streaming
But SAMOA…
More than JBoA (just a bunch of algorithms)
Provides a common platform
Easy to port to new computing engines

Architecture
[Figure: SAMOA architecture]

Status
Parallel algorithms: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression), PARMA (frequent pattern mining) [pending]
Execution engines: Storm, S4, Samza (+ Local)
https://github.com/yahoo/samoa

Is SAMOA useful for you?
Only if you need to deal with:
Big fast data
Evolving data (model updates)
What is happening now?
Use feedback in real-time
Adapt to changes faster

Advantages (operational)
Program once, run everywhere
Reuse existing computational infrastructure
Avoid deploy cycle
No system downtime
No complex backup/update procedures
No need to choose update frequency

Advantages (paradigmatic)
Model freshness
Immediate data value
No stream/batch impedance mismatch

Algorithmic Challenges
Case study: Vertical Hoeffding Tree
What kind of parallelism?
Task parallelism
Data parallelism: horizontal or vertical
[Figure: an instance made of attributes and a class]

Task Parallelism

Horizontal Parallelism
[Figure: Stream, Instances, Stats (x3), Histograms, Model Updates, Model]
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

Hoeffding Tree Profiling
Training CPU time, 100 nominal and 100 numeric attributes
Learn 70%, Split 24%, Other 6%
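The split share of training time goes to deciding whether a leaf should be split. The slides do not spell out that test. In the standard Hoeffding tree (VFDT) formulation, a leaf splits once the gain of the best attribute exceeds that of the runner-up by more than the Hoeffding bound epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the gain measure, delta the allowed error probability, and n the number of instances seen at the leaf. The sketch below assumes that formulation; the class and method names are hypothetical.

```java
// Hoeffding bound as used by the standard Hoeffding tree (VFDT) split test.
class HoeffdingBound {
    // epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the runner-up by more than epsilon.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }
}
```

Because epsilon shrinks as n grows, a small but real difference between two attributes is eventually detected without storing the instances themselves.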

Vertical Parallelism
[Figure: Stream, Model, Attributes, Stats (x3), Splits]

Vertical Parallelism
High number of attributes => high level of parallelism (e.g., documents)
vs. task parallelism: parallelism observed immediately
vs. horizontal parallelism: reduced memory usage (no model replication), parallelized split computation
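The idea of slicing each instance by attribute can be sketched in a few lines. This is a hypothetical single-process illustration, not SAMOA's Vertical Hoeffding Tree code: each worker keeps statistics only for the attributes it owns (attribute index modulo the number of workers, an assumed partitioning rule), so more attributes mean more work to spread and no worker holds a full copy of the statistics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of vertical parallelism; not SAMOA's VHT implementation.
class VerticalParallelismSketch {
    // One statistics worker: counts of (attribute, value, class) for its share of attributes.
    static class StatsWorker {
        final Map<String, Integer> counts = new HashMap<>();

        void update(int attribute, String value, String label) {
            counts.merge(attribute + "=" + value + "|" + label, 1, Integer::sum);
        }
    }

    private final List<StatsWorker> workers = new ArrayList<>();

    VerticalParallelismSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            workers.add(new StatsWorker());
        }
    }

    // Slice one instance by attribute and send each slice to the worker owning that attribute.
    void learn(String[] attributeValues, String label) {
        for (int a = 0; a < attributeValues.length; a++) {
            workers.get(a % workers.size()).update(a, attributeValues[a], label);
        }
    }

    StatsWorker worker(int index) {
        return workers.get(index);
    }
}
```

Each worker could then evaluate candidate splits for its own attributes independently, which is where the parallelized split computation comes from.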

Vertical Hoeffding Tree
[Figure: topology with Source (n), Model (n), Stats (n), Evaluator (1); streams: Instance, Control, Split, Result; groupings: Shuffle Grouping, Key Grouping, All Grouping]
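The three groupings named in this topology differ only in how a destination instance is chosen for each event. The helpers below are a hypothetical sketch of that choice, not Storm or SAMOA code; in particular, round-robin stands in for the even spreading that a shuffle grouping provides, and hash-modulo is an assumed way to implement key grouping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical routing helpers illustrating the three groupings; not Storm or SAMOA code.
class GroupingSketch {
    private int next = 0;

    // Shuffle grouping: spread events evenly; round-robin is used here for determinism.
    int shuffle(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    // Key grouping: events with the same key always reach the same instance.
    static int key(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // All grouping: every instance receives a copy of the event.
    static List<Integer> all(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```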

Accuracy
[Figure: VHT2, tree-100; axis: No. Leaf Nodes]
Very close and very high accuracy

Performance
[Figure: Profiling results for text-10000 with 100000 instances; execution time (seconds) per classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm, t_serial]
Throughput: VHT2-par-3 2631 inst/sec, MHT 507 inst/sec

ML Developer API
[Figure: Processing Item, Processor, Stream]

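ML Developer API
// The builder is only declared on the slide; it must be initialized before use.
TopologyBuilder builder;
// First source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
// Second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
// Join processor: streamOne arrives with shuffle grouping, streamTwo with key grouping
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);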

Deployment
SAMOA-API.jar: API. Algorithm developer depends only on this
SAMOA-S4.jar: S4 bindings; samoa-s4-deployable.s4r goes to the S4 cluster
SAMOA-Storm.jar: Storm bindings; samoa-storm-deployable.jar goes to the Storm cluster

Conclusions
Streaming is the future and is happening now
SAMOA: A Platform for Mining Big Data Streams
Runs on existing DSPEs (Storm, Samza, S4)
Algorithms for classification, regression, clustering
Available and open-source: http://samoa-project.net
A platform for collaboration and research on distributed stream mining

Open Challenges
Distributed stream mining algorithms
Active & semi-supervised learning + crowdsourcing
Millions of classes (e.g., Wikipedia pages)
Multi-target learning
System issues (load balancing, communication)
Programming paradigms and abstractions

Thanks!
https://github.com/yahoo/samoa
@samoa_project @gdfm7
