SAMOA: A Platform for Mining Big Data Streams

Slide 1

Slide 1 text

SAMOA A Platform for Mining Big Data Streams Gianmarco De Francisci Morales Yahoo! Research Barcelona [email protected] RGB color version - for online/web use 3D Y-Bang Logo 1

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Web Mining
Yahoo! Research Barcelona

Slide 4

Slide 4 text

Taxonomy
Machine Learning
- Distributed
  - Batch: Hadoop, Mahout
  - Stream: S4, Storm, SAMOA
- Non Distributed
  - Batch: R, WEKA, …
  - Stream: MOA

Slide 5

Slide 5 text

Agenda
- Stream processing engine retrospective
- MapReduce for stream processing
- SAMOA
- Motivation
- Challenges

Slide 6

Slide 6 text

Streaming
- Sequence is potentially infinite
- High amount of data, high speed of arrival
- Change over time (concept drift)
- Approximation algorithms (small error with high probability)
- Single pass, one data item at a time
- Sublinear space and time per data item
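The single-pass, sublinear-space constraints listed on this slide can be illustrated with a small frequent-items summary. The sketch below uses the Misra-Gries algorithm, which is not named in the deck and is chosen here only as a compact example; its error bound is deterministic rather than probabilistic. The class and method names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Misra-Gries frequent-items summary: one pass, at most k-1 counters. */
final class MisraGries {
    private final int capacity; // number of counters kept, k - 1
    private final Map<String, Long> counters = new HashMap<>();

    MisraGries(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be >= 2");
        this.capacity = k - 1;
    }

    /** Processes one stream item; the item is never stored or revisited. */
    void add(String item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those reaching zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Underestimate of the true count, off by at most n/k for a stream of n items. */
    long estimate(String item) {
        return counters.getOrDefault(item, 0L);
    }

    public static void main(String[] args) {
        MisraGries summary = new MisraGries(3);
        for (String s : new String[] {"a", "b", "a", "c", "a", "a", "d", "a"}) {
            summary.add(s);
        }
        System.out.println("estimate(a) = " + summary.estimate("a")); // true count is 5
    }
}
```

Memory is bounded by k regardless of stream length, which is the trade the slide describes: a small, bounded error in exchange for constant space per item.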

Slide 7

Slide 7 text

Big Data
- Volume + Velocity (+ Variety)
- Too large for single commodity server main memory
- Too fast for single commodity server CPU
- A solution should be:
  - Distributed
  - Scalable

Slide 8

Slide 8 text

In the beginning… …it was the Database

Slide 9

Slide 9 text

A tale of two tribes
[Diagram labels: DB, Data, App, Faster, Larger, Database]
- M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05
- A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Evolution of SPEs
Timeline (grouped on the slide into 1st, 2nd, and 3rd generation):
- 2003: Aurora
- 2004: STREAM
- 2005: Borealis
- 2006: SPC
- 2008: SPADE
- 2010: S4
- 2011: Storm
Sources:
- Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
- Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
- Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
- Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
- Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
- Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
- Storm: http://storm-project.net

Slide 14

Slide 14 text

S4 & Storm
Top-k word count example

Slide 15

Slide 15 text

S4 Overview
[Diagram labels: Processing Node 1, 2, 3, each with four Processing Elements (PE, "Business logic goes here"); Event Streams; Events Input; Events Output; Zookeeper (Coordination)]
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed Stream Computing Platform,” in ICDMW ’10: 10th International Conference on Data Mining Workshops, 2010, pp. 170–177.

Slide 16

Slide 16 text

S4 Example
status.text: "Introducing #S4: a distributed #stream processing system"
Events flowing through PE1 to PE4 (EV / KEY / VAL):
- RawStatus / null / text="Int..."
- Topic / topic="S4" / count=1
- Topic / topic="stream" / count=1
- Topic / reportKey="1" / topic="S4", count=4
Processing elements:
- TopicExtractorPE (PE1) extracts hashtags from status.text
- TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold.
- TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to external persister
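The three roles in this example (extract hashtags, count per topic, report the top N) can be sketched in plain Java. This is a single-process illustration under stated assumptions: it does not use the S4 API, and the class name `TopicPipeline` and its methods are hypothetical. In S4 each stage would be a separate PE and events would be routed by key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Single-process sketch of the extractor -> counter -> top-N stages. */
final class TopicPipeline {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private final Map<String, Integer> counts = new HashMap<>();

    /** Extractor role: pull hashtags out of a status text. */
    static List<String> extractTopics(String statusText) {
        List<String> topics = new ArrayList<>();
        Matcher m = HASHTAG.matcher(statusText);
        while (m.find()) topics.add(m.group(1));
        return topics;
    }

    /** Counter role: keep a running count for each topic seen so far. */
    void onStatus(String statusText) {
        for (String topic : extractTopics(statusText)) {
            counts.merge(topic, 1, Integer::sum);
        }
    }

    /** Top-N role: the n most frequent topics whose count is above the threshold. */
    List<String> topN(int n, int threshold) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.removeIf(e -> e.getValue() <= threshold);
        // Highest count first; ties broken alphabetically for a stable result.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) top.add(entries.get(i).getKey());
        return top;
    }

    public static void main(String[] args) {
        TopicPipeline pipeline = new TopicPipeline();
        pipeline.onStatus("Introducing #S4: a distributed #stream processing system");
        pipeline.onStatus("More on #S4");
        System.out.println(pipeline.topN(1, 0)); // prints [S4]
    }
}
```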

Slide 17

Slide 17 text

Storm Overview

Slide 18

Slide 18 text

Storm Example
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Slide 19

Slide 19 text

Actors Model (Active DHTs)
[Diagram labels: Live Streams (Stream 1, Stream 2, Stream 3); Event routing; PE (×5); External Persister; Output 1, Output 2]

Slide 20

Slide 20 text

MapReduce
[Diagram: DFS inputs (Input 1, Input 2, Input 3) → MAP tasks → Partition & Sort → Shuffle → Merge & Group → REDUCE tasks → DFS outputs (Output 1, Output 2)]

Slide 21

Slide 21 text

Shoehorning
- “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” [J. Lin, in Big Data, 1(1):28–37, 2013]
- Can we reuse the MapReduce programming model for stream mining?
- A review of online, streaming, and incremental computation on MapReduce

Slide 22

Slide 22 text

HOP
- Pipelining within and across jobs (however, map from job 2 cannot start until reduce from job 1 has finished)
- Online aggregation for interactive queries
  - Compute reduce function on data received so far at predetermined milestones (every 20%)
  - However, cannot reuse partial computation across reduce invocations
- Continuous queries by combining the 2 techniques
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce Online,” in NSDI ’10
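Online aggregation as described on this slide can be sketched as a reduce function applied to the prefix of the input received so far, once per milestone. The code below is an illustrative, in-memory sketch and not HOP's implementation; `OnlineAggregation` and `snapshots` are hypothetical names. Each snapshot deliberately recomputes from scratch, mirroring the limitation that partial computation is not reused across reduce invocations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Snapshot-style online aggregation over an in-memory list standing in for reducer input. */
final class OnlineAggregation {
    /**
     * Applies reduceFn to the values received so far at each milestone
     * (every stepPercent of the input) and returns one snapshot per milestone.
     */
    static <V, R> List<R> snapshots(List<V> values, int stepPercent, Function<List<V>, R> reduceFn) {
        if (stepPercent <= 0 || stepPercent > 100) {
            throw new IllegalArgumentException("stepPercent must be in 1..100");
        }
        List<R> result = new ArrayList<>();
        for (int pct = stepPercent; pct <= 100; pct += stepPercent) {
            int received = values.size() * pct / 100;
            // The reduce function sees the whole prefix again:
            // nothing is carried over from the previous snapshot.
            result.add(reduceFn.apply(values.subList(0, received)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> partialSums = OnlineAggregation.<Integer, Integer>snapshots(
                values, 20, prefix -> prefix.stream().mapToInt(Integer::intValue).sum());
        System.out.println(partialSums); // prints [3, 10, 21, 36, 55]
    }
}
```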

Slide 23

Slide 23 text

Incoop
- Task level memoization (save results of function calls, similar to dynamic programming)
- Need to have MapReduce calls with same input
  - Map: Incremental HDFS for stable input partitioning
  - Reduce: Contraction (combiner) to reuse partial results (small change in input changes all output)
- Tree-aggregation avoids linear dependencies among different contractions (more reuse)
P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin, “Incoop: MapReduce for Incremental Computations,” in SOCC ’11
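Task-level memoization can be sketched as a cache keyed by the task's input: a task only executes when its input split has not been seen before. The code below is a minimal sketch of that idea, not Incoop's implementation; `MemoizedTasks` is a hypothetical name, the cache lives in memory, and the task is assumed to be deterministic and never to return null. It also shows why stable input partitioning matters: reuse only happens when a split is identical to one from an earlier run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Runs a task per input split, reusing memoized results for splits seen before. */
final class MemoizedTasks<I, O> {
    private final Function<I, O> task;
    private final Map<I, O> memo = new HashMap<>();
    private int executions = 0;

    MemoizedTasks(Function<I, O> task) {
        this.task = task;
    }

    /** Returns one output per split; only unseen splits trigger an execution. */
    List<O> run(List<I> splits) {
        List<O> outputs = new ArrayList<>();
        for (I split : splits) {
            O result = memo.get(split);
            if (result == null) {
                result = task.apply(split); // cache miss: actually run the task
                executions++;
                memo.put(split, result);
            }
            outputs.add(result);
        }
        return outputs;
    }

    /** Number of times the underlying task was really executed. */
    int executions() {
        return executions;
    }

    public static void main(String[] args) {
        MemoizedTasks<String, Integer> wordsPerSplit =
                new MemoizedTasks<>(split -> split.split(" ").length);
        wordsPerSplit.run(Arrays.asList("a b", "c d e"));   // two executions
        wordsPerSplit.run(Arrays.asList("a b", "c d f g")); // only the changed split runs
        System.out.println("executions = " + wordsPerSplit.executions()); // prints executions = 3
    }
}
```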

Slide 24

Slide 24 text

StreamMapReduce
- Backward-compatible extension of MapReduce API
- Map = stateless function
- Reduce
  - Windowed = defined over a window (tumbling, sliding)
  - Stateful = custom definition of state (time triggered)
- Actors model in disguise!
A. Brito, A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, and C. Fetzer, “Scalable and Low-Latency Data Processing with Stream MapReduce,” in CloudCom ’11
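A windowed reduce can be sketched as an accumulator that emits its result and resets each time a window closes. The code below assumes a count-based tumbling window for simplicity (the slide also mentions sliding and time-triggered variants); `TumblingWindowReducer` is a hypothetical name and not part of the StreamMapReduce API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Reduce over a count-based tumbling window: emit and reset every windowSize events. */
final class TumblingWindowReducer<T> {
    private final int windowSize;
    private final BinaryOperator<T> reduceFn;
    private final List<T> emitted = new ArrayList<>();
    private T accumulator;
    private int seen = 0;

    TumblingWindowReducer(int windowSize, BinaryOperator<T> reduceFn) {
        if (windowSize < 1) throw new IllegalArgumentException("windowSize must be >= 1");
        this.windowSize = windowSize;
        this.reduceFn = reduceFn;
    }

    /** Folds one event into the current window; closes the window when it is full. */
    void onEvent(T value) {
        accumulator = (seen == 0) ? value : reduceFn.apply(accumulator, value);
        seen++;
        if (seen == windowSize) {
            emitted.add(accumulator); // window closed: emit its reduced value
            accumulator = null;       // windows do not overlap, so state is discarded
            seen = 0;
        }
    }

    /** Results of all closed windows, in order. */
    List<T> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        TumblingWindowReducer<Integer> sums = new TumblingWindowReducer<>(3, Integer::sum);
        for (int i = 1; i <= 7; i++) sums.onEvent(i);
        System.out.println(sums.emitted()); // prints [6, 15]; the event 7 is still pending
    }
}
```

The per-window state held between events is what makes the reducer behave like an actor with private state, which is the point the slide makes.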

Slide 25

Slide 25 text

Muppet
- MapUpdate = streaming version of MapReduce
- Map = stateless function, Update = stateful function
- Slate = external memory for Update, keyed on events, lazily allocated, persisted in a KV storage
- Workflow is a DAG, nodes are MapUpdate functions, edges are event streams
- Actors model in disguise!
W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan, “Muppet: MapReduce-Style Processing of Fast Data,” VLDB 5(12):1814–1825, 2012.
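The split between a stateless map and a stateful update with per-key slates can be sketched with a streaming word count. This is an illustration only: it does not use Muppet's API, `MapUpdateWordCount` is a hypothetical name, and an in-memory map stands in for the key-value store that would persist the slates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Streaming word count in the MapUpdate style: stateless map, stateful update. */
final class MapUpdateWordCount {
    // Slates: per-key state owned by the update function.
    // Assumption: an in-memory map replaces the persistent KV storage.
    private final Map<String, Integer> slates = new HashMap<>();

    /** Map: stateless, turns one input event into zero or more keyed events. */
    static List<String> map(String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) keys.add(word);
        }
        return keys;
    }

    /** Update: stateful, reads and rewrites the slate for the event's key. */
    int update(String key) {
        // merge allocates the slate lazily on the first event for this key.
        return slates.merge(key, 1, Integer::sum);
    }

    /** Wires map to update, as an edge in the workflow DAG would. */
    void onEvent(String line) {
        for (String key : map(line)) update(key);
    }

    /** Current slate value for a key, or 0 if no slate was ever allocated. */
    int slate(String key) {
        return slates.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        MapUpdateWordCount job = new MapUpdateWordCount();
        job.onEvent("fast data fast");
        job.onEvent("Fast streams");
        System.out.println("fast -> " + job.slate("fast")); // prints fast -> 3
    }
}
```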

Slide 26

Slide 26 text

MapReduce for streams?
- Can be done, but most approaches reinvent the actors model
- 3rd gen SPEs are the natural choice
- Need to rethink algorithms :(
- Opportunities for new algorithms :)

Slide 27

Slide 27 text

SAMOA
Scalable Advanced Massive Online Analysis
- Albert Bifet
- Gianmarco De Francisci Morales
- Nicolas Kourtellis
- Matthieu Morel
- Arinto Murdopo
- Antonio Severien

Slide 28

Slide 28 text

Motivation
- Big Data + Streaming
- What is happening now?
- Use feedback in real-time
- Update models faster: from weeks to seconds
- Adapt to changes, concept drift
- Resist adversarial interactions

Slide 29

Slide 29 text

Importance
- Spam detection in comments on Yahoo! News
- Trends change in time
- Need to retrain model with new data

Slide 30

Slide 30 text

Is SAMOA useful for you?
Only if you need to deal with:
- Big fast data
- Evolving data (model updates)
- Concept drift: discriminative features or class distribution change
Example: Twitter spam detection. Hashtags and their co-occurrences change dramatically and very fast over time.

Slide 31

Slide 31 text

Architecture
[Diagram labels: SAMOA on top of S4, Storm, …; SAMOA components: Classifier Methods, Clustering Methods, Frequent Pattern Mining]

Slide 32

Slide 32 text

Advantages
- Program once, run everywhere
- Reuse existing computational infrastructure
- Model is always up to date
- No system downtime
- No complex backup/update procedures
- No need to choose update frequency

Slide 33

Slide 33 text

What about Mahout?
- Think SAMOA = Mahout for streaming
- But SAMOA…
  - More than JBoA (just a bunch of algorithms)
  - Provides a common platform
  - Easy to port to new computing engines

Slide 34

Slide 34 text

Taxonomy
Machine Learning
- Distributed
  - Batch: Hadoop, Mahout
  - Stream: S4, Storm, SAMOA
- Non Distributed
  - Batch: R, WEKA, …
  - Stream: MOA

Slide 35

Slide 35 text

Short-Term Goals
- Parallel algorithms
  - Hoeffding tree
  - K-means-based clustering
  - Gradient Boosted Decision Trees
- Platforms: S4 & Storm
- First release in July

Slide 36

Slide 36 text

Long-Term Goals
- Easy to integrate add-ons with packages (like R)
- Most common algorithms implemented (like Mahout)
- Large community in industry & academia (like Hadoop)
- Become reference platform for big data stream mining (like Weka)
- Lively open-source project (Apache Incubator)

Slide 37

Slide 37 text

Challenges
- Algorithmic
- Platform design
- Implementation

Slide 38

Slide 38 text

Algorithmic
- Case study: Hoeffding tree
- What kind of parallelism?
  - Task
  - Data
    - Horizontal
    - Vertical
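For context on the case study: a Hoeffding tree decides when to split a leaf by using the Hoeffding bound, which is not spelled out in the deck. In the standard formulation, after n observations of a quantity with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. The sketch below computes that bound and the usual split test; the class and parameter names are assumptions for illustration and this is not SAMOA's code.

```java
/** Hoeffding bound and the split test used by Hoeffding trees (standard formulation). */
final class HoeffdingBound {
    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        if (n <= 0 || delta <= 0.0 || delta >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < delta < 1");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /**
     * Split when the best attribute's gain exceeds the runner-up's by more than
     * epsilon: the ranking then holds on the full stream with probability 1 - delta.
     */
    static boolean shouldSplit(double bestGain, double secondGain,
                               double range, double delta, long n) {
        return bestGain - secondGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        // The bound shrinks as the leaf sees more instances.
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("n=%d epsilon=%.4f%n", n, epsilon(1.0, 1e-7, n));
        }
    }
}
```

Because the decision needs only per-attribute sufficient statistics at the leaf, the statistics can be gathered in parallel, which is what the task, horizontal, and vertical options distribute in different ways.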

Slide 39

Slide 39 text

Task Parallelism

Slide 40

Slide 40 text

Horizontal Parallelism
[Diagram labels: Stream, Instances, Stats (×3), Histograms, Model]
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

Slide 41

Slide 41 text

Vertical Parallelism
[Diagram labels: Stream, Attributes, Stats (×3), Splits, Model]

Slide 42

Slide 42 text

Hoeffding Tree Profiling
Training CPU time breakdown, 100 nominal and 100 numeric attributes:
- Learn: 70%
- Split: 24%
- Other: 6%

Slide 43

Slide 43 text

Vertical Parallelism
- High number of attributes (e.g., documents) results in high level of parallelism
- Parallelism is observed immediately (compared to task parallelism)
- Localized failure handling and model updates (model is kept in one node)
- Less memory usage compared to horizontal partitioning (no model replication)
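Vertical parallelism can be sketched as routing each attribute of an instance to a fixed statistics worker, so that every worker holds counts for its own attributes only and nothing is replicated. The code below is a single-process illustration under stated assumptions: `VerticalPartitioner`, `StatsWorker`, and the modulo routing rule are hypothetical and are not taken from SAMOA.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Splits each labeled instance by attribute and routes the slices to statistics workers. */
final class VerticalPartitioner {
    /** One worker: counts (attribute, value, class) triples for its attributes only. */
    static final class StatsWorker {
        final Map<String, Integer> counts = new HashMap<>();

        void observe(int attribute, String value, String label) {
            counts.merge(attribute + "=" + value + "|" + label, 1, Integer::sum);
        }
    }

    private final List<StatsWorker> workers = new ArrayList<>();

    VerticalPartitioner(int parallelism) {
        if (parallelism < 1) throw new IllegalArgumentException("parallelism must be >= 1");
        for (int i = 0; i < parallelism; i++) workers.add(new StatsWorker());
    }

    /** Attribute i always goes to worker i mod p, so its statistics live in one place. */
    int workerFor(int attribute) {
        return attribute % workers.size();
    }

    /** Routes every attribute value of one instance to the worker that owns it. */
    void route(String[] attributeValues, String label) {
        for (int a = 0; a < attributeValues.length; a++) {
            workers.get(workerFor(a)).observe(a, attributeValues[a], label);
        }
    }

    StatsWorker worker(int index) {
        return workers.get(index);
    }

    public static void main(String[] args) {
        VerticalPartitioner partitioner = new VerticalPartitioner(2);
        partitioner.route(new String[] {"sunny", "hot", "high", "weak"}, "no");
        partitioner.route(new String[] {"sunny", "mild", "high", "strong"}, "no");
        System.out.println(partitioner.worker(0).counts.get("0=sunny|no")); // prints 2
    }
}
```

With more attributes than workers, each worker's memory grows with its share of the attributes rather than with the whole model, which matches the memory argument on the slide.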

Slide 44

Slide 44 text

Platform Design
- What is the right level of abstraction?
  - Application building
  - Computation
  - Communication

Slide 45

Slide 45 text

ML Developer API
[Diagram labels: Processing Item, Processor, Stream]

Slide 46

Slide 46 text

ML Developer API

    // Two sources, each with its own output stream
    ProcessingItem sourceOnePi = builder.createProcessingItem(new SourceProcessor());
    Stream streamOne = builder.createStream(sourceOnePi);

    ProcessingItem sourceTwoPi = builder.createProcessingItem(new SourceProcessor());
    Stream streamTwo = builder.createStream(sourceTwoPi);

    // Join: streamOne is shuffled across instances, streamTwo is routed by key
    String key = "record_id";
    ProcessingItem joinPi = builder.createProcessingItem(new IntermediateProcessor())
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo, key);

Slide 47

Slide 47 text

Implementation
- How to hide platform differences?
  - Deployment
  - Runtime
- How to isolate platform-related code?
  - Build and release architecture

Slide 48

Slide 48 text

Deployment
- SAMOA-API.jar: API. Algorithm developer depends only on this
- SAMOA-S4.jar: S4 bindings
- SAMOA-Storm.jar: Storm bindings
- samoa-s4-deployable.s4r: to S4 cluster
- samoa-storm-deployable.jar: to Storm cluster

Slide 49

Slide 49 text

Runtime
[Diagram: a SAMOA topology of EPIs and PIs mapped onto S4 PEs, and onto Storm Spouts and Bolts]

Slide 50

Slide 50 text

Conclusions
- SAMOA: A Platform for Mining Big Data Streams
- Runs on existing distributed stream processing engines
- Parallel algorithms for machine learning on streams
- Pluggable architecture, flexible, extensible, open source
- Available soon!

Slide 51

Slide 51 text

Thanks!
[email protected]

Slide 52

Slide 52 text

References
[1] D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: a new model and architecture for data stream management,” VLDB Journal, vol. 12, no. 2, pp. 120–139, Aug. 2003.
[2] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: Massive Online Analysis,” The Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.
[3] Gartner, “Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data,” 2011. [Online]. Available: http://www.gartner.com/it/page.jsp?id=1731916.
[4] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08: 34th International Conference on Management of Data, 2008, pp. 1123–1134.
[5] V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu, “DEDUCE: At the Intersection of MapReduce and Stream Processing,” in EDBT ’10: 13th International Conference on Extending Database Technology, 2010, pp. 657–662.
[6] C. Olston, S. Seth, C. Tian, T. ZiCornell, X. Wang, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, and V. Sankarasubramanian, “Nova: Continuous Pig/Hadoop Workflows,” in SIGMOD ’11: 37th International Conference on Management of Data, 2011, pp. 1081–1090.
[7] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed Stream Computing Platform,” in ICDMW ’10: 10th International Conference on Data Mining Workshops, 2010, pp. 170–177.

Slide 53

Slide 53 text

[8] D. J. Abadi, Y. Ahmad, M. Balazinska, M. Cherniack, J. Hwang, W. Lindner, A. S. Maskey, E. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik, “The Design of the Borealis Stream Processing Engine,” in CIDR ’05: 1st Conference on Innovative Data Systems Research, 2005, pp. 277–289.
[9] L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani, “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06: 4th International Workshop on Data Mining Standards, Services and Platforms, 2006, pp. 27–37.
[10] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom, “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004.
[11] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin, “Incoop: MapReduce for Incremental Computations,” in SOCC ’11: 2nd ACM Symposium on Cloud Computing, 2011, pp. 1–14.
[12] A. Brito, A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, and C. Fetzer, “Scalable and Low-Latency Data Processing with Stream MapReduce,” in CloudCom ’11: 3rd International Conference on Cloud Computing Technology and Science, 2011, pp. 48–58.
[13] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “HaLoop: efficient iterative data processing on large clusters,” VLDB Endowment, vol. 3, no. 1–2, pp. 285–296, Sep. 2010.
[14] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce Online,” in NSDI ’10: 7th Conference on Networked Systems Design and Implementation, 2010, p. 21.
[15] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, 2004, pp. 137–150.

Slide 54

Slide 54 text

[16] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina, “On distributing symmetric streaming computations,” ACM Transactions on Algorithms, vol. 6, no. 4, pp. 1–19, 2010.
[17] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software,” SIGKDD Explorations, vol. 11, no. 1, p. 10, 2009.
[18] J. Lin, “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!,” Big Data, vol. 1, no. 1, pp. 28–37, Mar. 2013.
[19] J. Rosen, N. Polyzotis, V. Borkar, Y. Bu, M. J. Carey, M. Weimer, T. Condie, and R. Ramakrishnan, “Iterative MapReduce for Large Scale Machine Learning,” Arxiv, Mar. 2013.
[20] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, “Discretized Streams: an Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters,” in HotCloud ’12: 4th Conference on Hot Topics in Cloud Computing, 2012, p. 10.
[21] M. Stonebraker, U. Çetintemel, and S. Zdonik, “The 8 requirements of real-time stream processing,” ACM SIGMOD Record, vol. 34, no. 4, pp. 42–47, Dec. 2005.
[22] W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan, “Muppet: MapReduce-Style Processing of Fast Data,” VLDB Endowment, vol. 5, no. 12, pp. 1814–1825, Aug. 2012.