Slide 1

Slide 1 text

SAMOA: A Platform for Mining Big Data Streams
 Gianmarco De Francisci Morales
 Yahoo Labs Barcelona
 [email protected]

Slide 2

Slide 2 text

Agenda
Streams
Applications, Model, Tools, Advantages
SAMOA
Goal, Example, Challenges

Slide 3

Slide 3 text

Streams
“Panta rhei” (everything flows)
Heraclitus

Slide 4

Slide 4 text

Importance
Spam detection in comments on Yahoo! News
Trends change in time
Need to retrain model with new data

Slide 5

Slide 5 text

Spam on Twitter

Slide 6

Slide 6 text

Applications

Slide 7

Slide 7 text

Applications
Personalization

Slide 8

Slide 8 text

Applications
Personalization
Spam detection

Slide 9

Slide 9 text

Applications
Personalization
Spam detection
Recommendation

Slide 10

Slide 10 text

Big Data Stream
Volume + Velocity (+ Variety)
Too large for single commodity server main memory
Too fast for single commodity server CPU
A solution should be:
Distributed
Scalable

Slide 11

Slide 11 text

Examples
User clicks
Search queries
News
Emails
Tumblr posts
Flickr photos
Finance stocks
Credit card transactions
Wikipedia edit logs
Facebook statuses
Twitter updates
Name your own…

Slide 12

Slide 12 text

Stream
Batch data is a snapshot of streaming data

Slide 13

Slide 13 text

Data Science Lifecycle
Old-school data mining
From data to insight
From insight to model
From model to value
And repeat!
Diagram labels: Gather, Clean, Model, Deploy

Slide 14

Slide 14 text

Big Data Tools

Slide 15

Slide 15 text

Problems
Operational: Need to rerun the pipeline and redeploy the model when new data arrives
Paradigmatic: New data lies in storage without generating new value until the model is retrained

Slide 16

Slide 16 text

Present of big data
Too big to handle

Slide 17

Slide 17 text

Future of big data
Drinking from a firehose

Slide 18

Slide 18 text

A Tale of Two Tribes
Diagram labels: DB, Data, App, Faster, Larger, Database
M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05
A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009

Slide 22

Slide 22 text

Evolution of SPEs
1st generation, 2nd generation, 3rd generation
2003: Aurora
2004: STREAM
2005: Borealis
2006: SPC
2008: SPADE
2010: S4
2011: Storm
2013: Samza
Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
Storm: http://storm-project.net
Samza: http://samza.incubator.apache.org

Slide 23

Slide 23 text

Actors Model
Diagram labels: Live Streams (Stream 1, Stream 2, Stream 3), PE (×5), Event routing, External Persister, Output 1, Output 2

Slide 24

Slide 24 text

S4 Example
status.text: "Introducing #S4: a distributed #stream processing system"
EV         KEY             VAL
RawStatus  null            text="Int..."
Topic      topic="S4"      count=1
Topic      topic="stream"  count=1
Topic      reportKey="1"   topic="S4", count=4
TopicExtractorPE (PE1) extracts hashtags from status.text
TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold.
TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to external persister
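The three processing steps on this slide can be imitated in a single process. The sketch below is only an illustration: it does not use the S4 API, the class and method names are hypothetical, and the threshold check of PE2-3 is folded into the top-N step for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Single-process imitation of the slide's topology. Class and method names
// are hypothetical; this does not use the S4 API.
class TopicCountSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // PE1 (TopicExtractorPE): extract hashtags from a status text.
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    // PE2-3 (TopicCountAndReportPE): keep one count per topic key.
    static Map<String, Integer> countTopics(List<String> statuses) {
        Map<String, Integer> counts = new HashMap<>();
        for (String status : statuses) {
            for (String topic : extractTopics(status)) {
                counts.merge(topic, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Threshold filter (PE2-3) and top-N selection (PE4), merged here:
    // keep topics whose count is above the threshold and return the n
    // largest, breaking ties alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n, int threshold) {
        List<String> reported = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                reported.add(e.getKey());
            }
        }
        reported.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(reported.subList(0, Math.min(n, reported.size())));
    }

    public static void main(String[] args) {
        List<String> statuses = new ArrayList<>();
        statuses.add("Introducing #S4: a distributed #stream processing system");
        statuses.add("More about #S4");
        System.out.println(topN(countTopics(statuses), 2, 0));
    }
}
```

In a distributed engine each topic key would be owned by a separate PE instance, so the shared map above would be split across instances by key.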

Slide 25

Slide 25 text

But we have Hadoop!
“MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!”
 [J. Lin, in Big Data, 1(1):28–37, 2013]
“Data whose characteristics force us to look beyond the traditional methods that are prevalent at the time”
 [A. Jacobs, in ACM Queue, 7(6):10, 2009]

Slide 26

Slide 26 text

Paradigm Shift
Diagram labels: Gather, Clean, Model, Deploy

Slide 27

Slide 27 text

Streaming Model
Sequence is potentially infinite
High amount of data, high speed of arrival
Change over time (concept drift)
Approximation algorithms (small error with high probability)
Single pass, one data item at a time
Sub-linear space and time per data item
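One well-known algorithm that fits these constraints is the Count-Min sketch, shown below as a minimal illustration. It is not taken from the deck or from SAMOA. It processes each item once, uses a fixed depth × width table of counters regardless of how many distinct items arrive, and never underestimates a count. With a pairwise-independent hash family, a width of about e/ε and a depth of about ln(1/δ) bound the overestimate by εN with probability at least 1 − δ, where N is the number of items seen. The seeded hash used here is a simplified stand-in, so that formal guarantee is not claimed for this code.

```java
import java.util.Random;

// Count-Min sketch: approximate per-item counts in one pass using
// depth * width counters, independent of the number of distinct items.
class CountMinSketch {
    private final int width;
    private final long[][] table;
    private final int[] seeds;

    CountMinSketch(int depth, int width, long seed) {
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = rnd.nextInt();
        }
    }

    // Illustrative seeded hash; a production sketch would use a
    // pairwise-independent hash family to get the formal guarantee.
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Process one stream item: constant work, no item is stored.
    void add(String item) {
        for (int row = 0; row < table.length; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    // Collisions can only inflate a counter, so the minimum over rows
    // is an upper bound on the true count that is as tight as possible.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```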

Slide 28

Slide 28 text

SAMOA
Scalable Advanced Massive Online Analysis

Slide 29

Slide 29 text

Concept
SAMOA is a platform
Researchers: Framework for developing distributed stream mining algorithms
Practitioners: Library of state-of-the-art distributed stream mining algorithms

Slide 30

Slide 30 text

Taxonomy
Data Mining
  Distributed
    Batch: Hadoop, Mahout
    Stream: Storm, S4, Samza; SAMOA
  Non Distributed
    Batch: R, WEKA, …
    Stream: MOA

Slide 31

Slide 31 text

What about Mahout?
Think SAMOA = Mahout for streaming
But SAMOA…
More than JBoA (just a bunch of algorithms)
Provides a common platform
Easy to port to new computing engines

Slide 32

Slide 32 text

Architecture

Slide 33

Slide 33 text

Status
Parallel algorithms
Vertical Hoeffding Tree (classification)
CluStream (clustering)
Adaptive Model Rules (regression)
PARMA (frequent pattern mining) [pending]
Execution engines
Storm, S4, Samza, (+ Local)
https://github.com/yahoo/samoa

Slide 34

Slide 34 text

Is SAMOA useful for you?
Only if you need to deal with:
Big fast data
Evolving data (model updates)
What is happening now?
Use feedback in real-time
Adapt to changes faster

Slide 35

Slide 35 text

Advantages (operational)
Program once, run everywhere
Reuse existing computational infrastructure
Avoid deploy cycle
No system downtime
No complex backup/update procedures
No need to choose update frequency

Slide 36

Slide 36 text

Advantages (paradigmatic)
Model freshness
Immediate data value
No stream/batch impedance mismatch

Slide 37

Slide 37 text

Algorithmic Challenges
Case study: Vertical Hoeffding Tree
What kind of parallelism?
Task
Data (Horizontal, Vertical)
Diagram labels: Instance, Attributes, Class

Slide 38

Slide 38 text

Task Parallelism

Slide 39

Slide 39 text

Horizontal Parallelism
Diagram labels: Stream, Instances, Stats (×3), Histograms, Model, Model Updates
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

Slide 40

Slide 40 text

Hoeffding Tree Profiling
Training CPU time, 100 nominal and 100 numeric attributes
Learn 70%
Split 24%
Other 6%
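For context on the "Split" share of the profile: a Hoeffding tree decides whether to split a leaf by comparing the two best candidate attributes against the Hoeffding bound. The sketch below follows the standard textbook formulation, ε = sqrt(R² ln(1/δ) / (2n)); it is not code from the deck or from SAMOA, and the class and method names are hypothetical.

```java
// Split test used by Hoeffding trees, in its standard textbook form.
// Names are hypothetical; this is not SAMOA code.
class HoeffdingBoundSketch {

    // With probability at least 1 - delta, the true mean of a variable
    // with the given range lies within this distance of the mean
    // observed over n independent samples.
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the runner-up by more than
    // the bound, i.e. when more instances are unlikely to change the choice.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > hoeffdingBound(range, delta, n);
    }

    public static void main(String[] args) {
        // The bound shrinks as the leaf sees more instances.
        System.out.println(hoeffdingBound(1.0, 1e-7, 1000));
        System.out.println(hoeffdingBound(1.0, 1e-7, 100000));
    }
}
```

Evaluating this test requires the gain of every attribute at the leaf, which is why split computation is a natural candidate for parallelization when there are many attributes.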

Slide 41

Slide 41 text

Vertical Parallelism
Diagram labels: Stream, Attributes, Stats (×3), Splits, Model

Slide 42

Slide 42 text

Vertical Parallelism
High number of attributes => high level of parallelism (e.g., documents)
vs. task parallelism
Parallelism observed immediately
vs. horizontal parallelism
Reduced memory usage (no model replication)
Parallelized split computation
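A minimal sketch of the idea, under stated assumptions: each instance is sliced by attribute, and each slice updates the counters of exactly one statistics worker, so no worker needs a copy of the whole model. The routing rule (attribute index modulo the number of workers), the counter layout, and all names are hypothetical and are not taken from SAMOA.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Attribute-sliced (vertical) statistics. Names and routing rule are
// assumptions made for illustration; this is not SAMOA code.
class VerticalStatsSketch {
    // One counter map per statistics worker: "attr=value|class" -> count.
    private final List<Map<String, Integer>> workers = new ArrayList<>();

    VerticalStatsSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            workers.add(new HashMap<>());
        }
    }

    // Assumed routing rule: attribute index modulo the number of workers.
    static int workerFor(int attributeIndex, int parallelism) {
        return attributeIndex % parallelism;
    }

    private static String key(int attributeIndex, String value, String classLabel) {
        return attributeIndex + "=" + value + "|" + classLabel;
    }

    // Slice one instance by attribute; each slice touches a single worker.
    void observe(String[] attributeValues, String classLabel) {
        for (int a = 0; a < attributeValues.length; a++) {
            Map<String, Integer> worker = workers.get(workerFor(a, workers.size()));
            worker.merge(key(a, attributeValues[a], classLabel), 1, Integer::sum);
        }
    }

    int count(int attributeIndex, String value, String classLabel) {
        Map<String, Integer> worker = workers.get(workerFor(attributeIndex, workers.size()));
        return worker.getOrDefault(key(attributeIndex, value, classLabel), 0);
    }

    // Number of distinct counters a worker holds (its share of the memory).
    int countersHeldBy(int worker) {
        return workers.get(worker).size();
    }
}
```

Because each worker owns a disjoint set of attributes, the available parallelism grows with the number of attributes, and each worker can evaluate split candidates for its own attributes independently.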

Slide 43

Slide 43 text

Vertical Hoeffding Tree
Components: Source (n), Model (n), Stats (n), Evaluator (1)
Streams: Instance Stream, Control, Split, Result
Groupings: Shuffle Grouping, Key Grouping, All Grouping
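The three groupings named in the diagram differ only in how an event picks its destination among the parallel instances of a component. The sketch below illustrates that difference with hypothetical names; it is not the Storm or SAMOA API, and round-robin is used as a deterministic stand-in for shuffling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// How each grouping chooses destination instances. Names are
// hypothetical; this is not the Storm or SAMOA API.
class GroupingSketch {
    private int nextInstance = 0;

    // Shuffle grouping: spread events evenly, ignoring their content.
    // Round-robin is used here as a deterministic stand-in.
    List<Integer> shuffleGrouping(int parallelism) {
        int target = nextInstance;
        nextInstance = (nextInstance + 1) % parallelism;
        return Collections.singletonList(target);
    }

    // Key grouping: events with the same key always reach the same instance.
    static List<Integer> keyGrouping(Object key, int parallelism) {
        return Collections.singletonList(Math.floorMod(key.hashCode(), parallelism));
    }

    // All grouping: broadcast the event to every instance.
    static List<Integer> allGrouping(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```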

Slide 44

Slide 44 text

Accuracy
Plot: VHT2 – tree-100; x-axis: No. Leaf Nodes
Very close and very high accuracy

Slide 45

Slide 45 text

Performance
Plot: Profiling Results for text-10000 with 100000 instances; x-axis: Classifier (MHT, VHT2-par-3); y-axis: Execution Time (seconds); series: t_calc, t_comm, t_serial
Throughput
VHT2-par-3: 2631 inst/sec
MHT: 507 inst/sec

Slide 46

Slide 46 text

ML Developer API
Processing Item
Processor
Stream

Slide 47

Slide 47 text

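ML Developer API
// builder is assumed to be initialized by the surrounding task
TopologyBuilder builder;

// First source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);

// Second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);

// Join processor: streamOne arrives via shuffle grouping, streamTwo via key grouping
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);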

Slide 48

Slide 48 text

Deployment
SAMOA-API.jar: API. Algorithm developer depends only on this
SAMOA-S4.jar: S4 bindings
SAMOA-Storm.jar: Storm bindings
samoa-s4-deployable.s4r: To S4 cluster
samoa-storm-deployable.jar: To Storm cluster

Slide 49

Slide 49 text

Conclusions
Streaming is the future and is happening now
SAMOA: A Platform for Mining Big Data Streams
Runs on existing DSPEs (Storm, Samza, S4)
Algorithms for classification, regression, clustering
Available and open-source
http://samoa-project.net
A platform for collaboration and research on distributed stream mining

Slide 50

Slide 50 text

Open Challenges
Distributed stream mining algorithms
Active & semi-supervised learning + crowdsourcing
Millions of classes (e.g., Wikipedia pages)
Multi-target learning
System issues (load balancing, communication)
Programming paradigms and abstractions

Slide 51

Slide 51 text

Thanks!
[email protected]
https://github.com/yahoo/samoa
@samoa_project
@gdfm7