Big Data Streams: The Next Frontier

Slide 1

Slide 1 text

Big Data Streams: The Next Frontier
Gianmarco De Francisci Morales
Aalto University, Helsinki
@gdfm7

Slide 2

Slide 2 text

The Frontier

Slide 3

Slide 3 text

Vision
Algorithms & Systems
Distributed stream mining platform
Development and collaboration framework for researchers
Library of state-of-the-art algorithms for practitioners

Slide 4

Slide 4 text

Full Stack
SAMOA (Scalable Advanced Massive Online Analysis)
VHT + EVL (Vertical Hoeffding Tree) (Online Evaluation)
PKG (Partial Key Grouping)
Layers: System, Algorithm, API

Slide 5

Slide 5 text

“Panta rhei” (everything flows) - Heraclitus

Slide 6

Slide 6 text

Importance
Example: spam detection in comments on Yahoo News
Trends change in time
Need to retrain model with new data

Slide 7

Slide 7 text

Stream
Batch data is a snapshot of streaming data

Slide 8

Slide 8 text

Present of big data
Too big to handle

Slide 9

Slide 9 text

Future of big data
Drinking from a firehose

Slide 10

Slide 10 text

Evolution of SPEs
Timeline: Aurora (2003), STREAM (2004), Borealis (2005), SPC (2006), SPADE (2008), S4 (2010), Storm (2011), Samza (2013), Flink (2014)
The timeline is divided into 1st generation, 2nd generation, and 3rd generation systems.
Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003
Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004
Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05
Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06
Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08
Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10
Storm: http://storm-project.net
Samza: http://samza.apache.org
Flink: http://flink.apache.org

Slide 11

Slide 11 text

Actor Model
[Diagram: Input Stream, PEs, event routing to PEIs, Output Stream]

Slide 12

Slide 12 text

Paradigm Shift
Gather, Clean, Model, Deploy

Slide 13

Slide 13 text

Apache SAMOA
Scalable Advanced Massive Online Analysis
G. De Francisci Morales, A. Bifet
JMLR 2015
Layers: System, Algorithm, API

Slide 14

Slide 14 text

Taxonomy
Data Mining
  Distributed
    Batch: Hadoop, Mahout
    Stream: Storm, S4, Samza, SAMOA
  Non Distributed
    Batch: R, WEKA, …
    Stream: MOA

Slide 15

Slide 15 text

Architecture
[Diagram: SAMOA architecture]

Slide 16

Slide 16 text

Status
https://samoa.incubator.apache.org

Slide 21

Slide 21 text

Status
https://samoa.incubator.apache.org
Parallel algorithms
Classification (Vertical Hoeffding Tree)
Clustering (CluStream)
Regression (Adaptive Model Rules)
Execution engines

Slide 22

Slide 22 text

Is SAMOA useful for you?
Only if you need to deal with:
Large fast data
Evolving process (model updates)
What is happening now?
Use feedback in real-time
Adapt to changes faster

Slide 23

Slide 23 text

Groupings
Key Grouping (hashing)
Shuffle Grouping (round-robin)
All Grouping (broadcast)
[Diagram: PEs routing events to PEIs]
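The three groupings named on this slide differ only in how an event picks its destination instance. A minimal sketch, not SAMOA's actual API: the function and class names, the CRC32 hash, and the idea of returning a list of target indices are all assumptions made for illustration.

```python
import zlib
from typing import List


def key_grouping(key: str, n_instances: int) -> List[int]:
    """Hashing: events with the same key always reach the same instance."""
    # CRC32 is an arbitrary stable hash chosen for this sketch.
    return [zlib.crc32(key.encode("utf-8")) % n_instances]


class ShuffleGrouping:
    """Round-robin: spread events evenly, ignoring their content."""

    def __init__(self, n_instances: int) -> None:
        self.n_instances = n_instances
        self._next = 0  # index of the instance that gets the next event

    def route(self) -> List[int]:
        target = self._next
        self._next = (self._next + 1) % self.n_instances
        return [target]


def all_grouping(n_instances: int) -> List[int]:
    """Broadcast: every instance receives a copy of the event."""
    return list(range(n_instances))
```

Key grouping keeps per-key state on one instance, shuffle grouping balances load when no such state is needed, and all grouping suits control messages that every instance must see.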

Slide 33

Slide 33 text

VHT
Vertical Hoeffding Tree
A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis (under submission)
Layers: System, Algorithm, API

Slide 34

Slide 34 text

Decision Tree
Nodes are tests on attributes
Branches are possible outcomes
Leaves are class assignments
[Diagram: example tree for the class “Car deal?”, with attribute tests Road Tested? (No/Yes), Mileage? (High/Low), Age? (Old/Recent), and leaves ✅/❌; an instance is a set of attributes plus a class]
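The node, branch, and leaf roles can be sketched as a small data structure. The exact shape of the car-deal tree is not recoverable from the slide text, so the arrangement of tests below is a hypothetical layout; the class names, attribute spellings, and `classify` helper are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: bool  # class assignment: True means "good deal"


@dataclass
class Node:
    attribute: str  # attribute tested at this node
    # one branch per possible outcome of the test
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], instance: Dict[str, str]) -> bool:
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[instance[tree.attribute]]
    return tree.label


# Hypothetical arrangement of the tests named on the slide.
CAR_TREE = Node("road_tested", {
    "no": Leaf(False),
    "yes": Node("mileage", {
        "low": Leaf(True),
        "high": Node("age", {"recent": Leaf(True), "old": Leaf(False)}),
    }),
})
```

For example, `classify(CAR_TREE, {"road_tested": "yes", "mileage": "low"})` follows two branches and stops at a leaf without ever testing age.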

Slide 35

Slide 35 text

Hoeffding Tree
Sample of stream enough for near optimal decision
Estimate merit of alternatives from prefix of stream
Choose sample size based on statistical principles
When to expand a leaf?
Let x1 be the most informative attribute, x2 the second most informative one
Hoeffding bound: split if
$\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
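The split test can be sketched numerically. Following Domingos and Hulten, this assumes the merit gap is the difference between the merits of the best and second-best attributes, R is the range of the merit function, delta is the allowed probability of choosing the wrong attribute, and n is the number of instances seen at the leaf. The function names are illustrative.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(merit_best: float, merit_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the observed merit gap exceeds the bound."""
    gap = merit_best - merit_second
    return gap > hoeffding_bound(value_range, delta, n)
```

Because epsilon shrinks as 1/sqrt(n), a leaf whose two best attributes are close in merit simply waits for more instances before splitting.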

Slide 41

Slide 41 text

Parallel Decision Trees
Which kind of parallelism?
Task
Data: Horizontal, Vertical
[Diagram: data laid out as Instances by Attributes]
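The two kinds of data parallelism differ in which axis of the data is split. A minimal sketch, assuming horizontal parallelism splits by instance (rows) and vertical parallelism splits by attribute (columns); the function names and the round-robin assignment are arbitrary choices for illustration.

```python
from typing import Dict, List

Instance = Dict[str, float]


def horizontal_partition(instances: List[Instance],
                         n_workers: int) -> List[List[Instance]]:
    """Each worker gets whole instances: a subset of the rows."""
    return [instances[w::n_workers] for w in range(n_workers)]


def vertical_partition(instance: Instance,
                       n_workers: int) -> List[Instance]:
    """Each worker gets a slice of every instance: a subset of the columns."""
    names = sorted(instance)  # fixed order so attributes map to the same worker
    return [{a: instance[a] for a in names[w::n_workers]}
            for w in range(n_workers)]
```

With vertical partitioning, each worker tracks statistics for only its own attributes, which is attractive when instances have many attributes.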

Slide 42

Slide 42 text

Horizontal Parallelism
[Diagram labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates]
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010

Slide 51

Slide 51 text

Hoeffding Tree Profiling
CPU time for training (100 nominal and 100 numeric attributes)
Learn 70%
Split 24%
Other 6%