Mining Big Data Streams: Better Algorithms or Faster Systems?

Slide 1

Slide 1 text

Mining Big Data Streams Better Algorithms or Faster Systems?     Gianmarco De Francisci Morales  [email protected]  @gdfm7

Slide 2

Slide 2 text

Vision Algorithms & Systems Distributed stream mining platform Development and collaboration framework  for researchers Library of state-of-the-art algorithms  for practitioners 2

Slide 3

Slide 3 text

Agenda SAMOA  (Scalable Advanced Massive Online Analysis) VHT  (Vertical Hoeffding Tree) PKG  (Partial Key Grouping) 3 System Algorithm API

Slide 4

Slide 4 text

Visiting Scientist   @Aalto DMG Scientist @Yahoo Labs PPMC @ Apache SAMOA Committer @ Apache Pig Contributor for Hadoop,   Giraph, Storm, S4, Grafos.ml 4

Slide 5

Slide 5 text

What do I work on? Systems Distributed Mining News Streaming Grid Admin —2008 —2009 —2010 —2011 —2012 —2013 -—2014 -—2015 • IMT Lucca • M.Eng • Y!R Barcelona • PhD 5 PhD Student Postdoc Scientist

Slide 6

Slide 6 text

“Panta rhei”  (everything ﬂows) -Heraclitus 6

Slide 7

Slide 7 text

Importance$of$O •  As$spam$trends$change retrain$the$model$with Importance Example: spam detection in comments on Yahoo News Trends change in time Need to retrain model with new data 7

Slide 8

Slide 8 text

Stream Batch data is a snapshot of streaming data 8

Slide 9

Slide 9 text

Challenges Operational Need to rerun the pipeline and redeploy the model when new data arrives Paradigmatic New data lies in storage without generating new value until new model is retrained 9 Gather Clean Model Deploy

Slide 10

Slide 10 text

Present of big data Too big to handle 10

Slide 11

Slide 11 text

Future of big data Drinking from a ﬁrehose 11

Slide 12

Slide 12 text

Evolution of SPEs 12 —2003 —2004 —2005 —2006 —2008 —2010 —2011 —2013 Aurora STREAM Borealis SPC SPADE Storm S4 1st generation 2nd generation 3rd generation Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003 Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004. Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05 Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06 Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08 Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10 http://storm-project.net Samza http://samza.incubator.apache.org

Slide 13

Slide 13 text

Actor Model 13 PE PE Input Stream PEI PEI PEI PEI PEI Output Stream Event routing

Slide 14

Slide 14 text

Paradigm Shift 14 Gather Clean Model Deploy + =

Slide 15

Slide 15 text

Apache SAMOA Scalable Advanced Massive Online Analysis  G. De Francisci Morales, A. Bifet  JMLR 2015 15

Slide 16

Slide 16 text

Taxonomy 16 Data Mining Distributed Batch Hadoop Mahout Stream Storm, S4, Samza SAMOA Non Distributed Batch R, WEKA, … Stream MOA

Slide 17

Slide 17 text

What about Mahout? SAMOA = Mahout for streaming But… More than JBoA (just a bunch of algorithms) Provides a common platform Easy to port to new computing engines 17

Slide 18

Slide 18 text

Architecture 18 SA SAMOA%

Slide 19

Slide 19 text

Status Status 19 https://samoa.incubator.apache.org

Slide 20

Slide 20 text

Status Status Parallel algorithms 19 https://samoa.incubator.apache.org

Slide 21

Slide 21 text

Status Status Parallel algorithms Classiﬁcation (Vertical Hoeffding Tree) 19 https://samoa.incubator.apache.org

Slide 22

Slide 22 text

Status Status Parallel algorithms Classiﬁcation (Vertical Hoeffding Tree) Clustering (CluStream) 19 https://samoa.incubator.apache.org

Slide 23

Slide 23 text

Status Status Parallel algorithms Classiﬁcation (Vertical Hoeffding Tree) Clustering (CluStream) Regression (Adaptive Model Rules) 19 https://samoa.incubator.apache.org

Slide 24

Slide 24 text

Status Status Parallel algorithms Classiﬁcation (Vertical Hoeffding Tree) Clustering (CluStream) Regression (Adaptive Model Rules) Execution engines 19 https://samoa.incubator.apache.org

Slide 25

Slide 25 text

Is SAMOA useful for you? Only if you need to deal with: Large fast data Evolving process (model updates) What is happening now? Use feedback in real-time Adapt to changes faster 20

Slide 26

Slide 26 text

Advantages (operational) Avoid deploy cycle No need to choose update frequency No system downtime No complex backup/update procedures Program once, run everywhere Reuse existing computational infrastructure 21

Slide 27

Slide 27 text

Advantages (paradigmatic) Model freshness Immediate data value No stream/batch impedance mismatch 22

Slide 28

Slide 28 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 23

Slide 29

Slide 29 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 24

Slide 30

Slide 30 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 24

Slide 31

Slide 31 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 24

Slide 32

Slide 32 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shuﬄe Grouping  (round-robin) All Grouping  (broadcast) 25

Slide 33

Slide 33 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shuﬄe Grouping  (round-robin) All Grouping  (broadcast) 25

Slide 34

Slide 34 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shuﬄe Grouping  (round-robin) All Grouping  (broadcast) 25

Slide 35

Slide 35 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 26

Slide 36

Slide 36 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 26

Slide 37

Slide 37 text

PE PE PEI PEI PEI PEI Groupings Key Grouping   (hashing) Shufﬂe Grouping  (round-robin) All Grouping  (broadcast) 26

Slide 38

Slide 38 text

VHT Vertical Hoeffding Tree  A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis  (under submission) 27

Slide 39

Slide 39 text

Decision Tree Nodes are tests on attributes Branches are possible outcomes Leafs are class assignments    28 Class Instance Attributes Road Tested? Mileage? Age? No Yes High ✅ ❌ Low Old Recent ✅ ❌ Car deal?

Slide 40

Slide 40 text

Hoeffding Tree Sample of stream enough for near optimal decision Estimate merit of alternatives from preﬁx of stream Choose sample size based on statistical principles When to expand a leaf? Let x1 be the most informative attribute,  x2 the second most informative one Hoeffding bound: split if 29 G ( x1, x2) > ✏ = r R 2 ln(1 / ) 2 n P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00

Slide 41

Slide 41 text

Parallel Decision Trees 30

Slide 42

Slide 42 text

Parallel Decision Trees Which kind of parallelism? 30

Slide 43

Slide 43 text

Parallel Decision Trees Which kind of parallelism? Task 30

Slide 44

Slide 44 text

Parallel Decision Trees Which kind of parallelism? Task Data 30 Data Attributes Instances

Slide 45

Slide 45 text

Parallel Decision Trees Which kind of parallelism? Task Data Horizontal 30 Data Attributes Instances

Slide 46

Slide 46 text

Parallel Decision Trees Which kind of parallelism? Task Data Horizontal Vertical 30 Data Attributes Instances

Slide 47

Slide 47 text

Horizontal Parallelism Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010 31 Stats Stats Stats Stream Histograms Model Instances Model Updates 31

Slide 48

Slide 48 text

Slide 49

Slide 49 text

Slide 50

Slide 50 text

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Slide 53

Slide 53 text

Slide 54

Slide 54 text

Slide 55

Slide 55 text

Slide 56

Slide 56 text

Hoeffding Tree Proﬁling 32 Other 6% Split 24% Learn 70% CPU time for training  100 nominal and 100 numeric attributes