Slide 1

Regression
Albert Bifet
May 2012

Slide 2

COMP423A/COMP523A Data Stream Mining
Outline:
1. Introduction
2. Stream Algorithmics
3. Concept Drift
4. Evaluation
5. Classification
6. Ensemble Methods
7. Regression
8. Clustering
9. Frequent Pattern Mining
10. Distributed Streaming

Slide 3

Data Streams: Big Data & Real Time

Slide 4

Regression
Definition: given a numeric class attribute, a regression algorithm builds a model that predicts, for every unlabelled instance x, a numeric value y = f(x) as accurately as possible.
Examples: stock-market price prediction, airplane delay prediction.

Slide 5

Evaluation Framework
1. Error estimation: hold-out or prequential (see the sketch below)
2. Performance measures: MSE or MAE
3. Statistical significance validation: Nemenyi test
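
A minimal Python sketch of the prequential (test-then-train) scheme from item 1, assuming a hypothetical model object with predict(x) and learn(x, y) methods and a stream of (x, y) pairs:

```python
# Prequential (interleaved test-then-train) evaluation: every instance is
# first used to test the current model and only then to train it.
def prequential_mae(model, stream):
    total_error, n = 0.0, 0
    for x, y in stream:
        y_hat = model.predict(x)      # test on the instance before learning from it
        total_error += abs(y - y_hat)
        model.learn(x, y)             # then update the model with the true label
        n += 1
    return total_error / n if n else 0.0
```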

Slide 6

2. Performance Measures
Regression mean measures:
Mean square error: MSE = \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)^2
Root mean square error: RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)^2}
Forgetting mechanism for estimating the measures: a sliding window of size w containing the most recent observations.
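
As an illustration of the sliding-window forgetting mechanism, a small Python sketch that keeps only the w most recent squared errors and reports MSE and RMSE over them (class and method names are illustrative):

```python
from collections import deque
from math import sqrt

class WindowedSquaredError:
    """MSE and RMSE computed over the w most recent predictions only."""

    def __init__(self, w):
        self.errors = deque(maxlen=w)  # oldest squared errors drop out automatically

    def add(self, prediction, y):
        self.errors.append((prediction - y) ** 2)

    def mse(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def rmse(self):
        return sqrt(self.mse())
```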

Slide 7

2. Performance Measures
Regression relative measures:
Relative square error: RSE = \frac{\sum_i (f(x_i) - y_i)^2}{\sum_i (\bar{y} - y_i)^2}
Root relative square error: RRSE = \sqrt{RSE} = \sqrt{\frac{\sum_i (f(x_i) - y_i)^2}{\sum_i (\bar{y} - y_i)^2}}
Forgetting mechanism for estimating the measures: a sliding window of size w containing the most recent observations.

Slide 8

2. Performance Measures
Regression absolute measures:
Mean absolute error: MAE = \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - y_i|
Relative absolute error: RAE = \frac{\sum_i |f(x_i) - y_i|}{\sum_i |\bar{y} - y_i|}
Forgetting mechanism for estimating the measures: a sliding window of size w containing the most recent observations.
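
A similar sketch for the absolute measures over a sliding window of size w; here RAE normalises the model's absolute errors by those of a naive predictor that always outputs the window mean of y, and RSE/RRSE follow the same pattern with squared errors (class and method names are illustrative, and a non-empty window is assumed):

```python
from collections import deque

class WindowedAbsoluteError:
    """MAE and RAE computed over the w most recent (prediction, y) pairs."""

    def __init__(self, w):
        self.pairs = deque(maxlen=w)   # (prediction, true y) for the last w instances

    def add(self, prediction, y):
        self.pairs.append((prediction, y))

    def mae(self):
        return sum(abs(p - y) for p, y in self.pairs) / len(self.pairs)

    def rae(self):
        y_bar = sum(y for _, y in self.pairs) / len(self.pairs)
        naive = sum(abs(y_bar - y) for _, y in self.pairs)   # error of predicting the mean
        return sum(abs(p - y) for p, y in self.pairs) / naive
```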

Slide 9

Linear Methods for Regression
Linear least squares fitting.
Linear regression model: f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j, in matrix form f(X) = X\beta
Minimize the residual sum of squares: RSS(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = (y - X\beta)^\top (y - X\beta)
Solution: \hat{\beta} = (X^\top X)^{-1} X^\top y
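
A minimal illustration of this closed-form solution, using NumPy's least-squares solver rather than forming the inverse explicitly; the data values are made up for the example:

```python
import numpy as np

def least_squares(X, y):
    """Return [beta_0, beta_1, ...] for the model y = beta_0 + X @ beta."""
    X = np.column_stack([np.ones(len(X)), X])     # prepend a column of ones for the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically safer than (X^T X)^{-1} X^T y
    return beta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(least_squares(X, y))  # approximately [0.0, 2.03]
```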

Slide 10

Perceptron
[Diagram: attributes 1 to 5 connect through weights w_1, ..., w_5 to the output h_w(x_i)]
Data stream: \langle x_i, y_i \rangle
Classical perceptron: h_w(x_i) = w^\top x_i
Minimize the mean-square error: J(w) = \frac{1}{2} \sum_i (y_i - h_w(x_i))^2

Slide 11

Perceptron
Minimize the mean-square error: J(w) = \frac{1}{2} \sum_i (y_i - h_w(x_i))^2
Stochastic gradient descent: w = w - \eta \nabla J_{x_i}
Gradient of the error function: \nabla J = -\sum_i (y_i - h_w(x_i)) x_i
Weight update rule: w = w + \eta \sum_i (y_i - h_w(x_i)) x_i
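
A short sketch of the resulting streaming perceptron regressor: each arriving instance triggers one stochastic gradient step w <- w + eta (y_i - w^T x_i) x_i; the class name and learning rate are illustrative:

```python
import numpy as np

class PerceptronRegressor:
    """Linear regressor trained online with stochastic gradient descent."""

    def __init__(self, n_features, eta=0.05):
        self.w = np.zeros(n_features)
        self.eta = eta

    def predict(self, x):
        return float(self.w @ x)          # h_w(x) = w^T x

    def learn(self, x, y):
        error = y - self.predict(x)       # (y_i - h_w(x_i))
        self.w += self.eta * error * x    # one stochastic gradient step

model = PerceptronRegressor(n_features=2)
model.learn(np.array([1.0, 2.0]), 5.0)
print(model.predict(np.array([1.0, 2.0])))  # 1.25 after a single gradient step
```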

Slide 12

Fast Incremental Model Tree with Drift Detection (FIMT-DD)
Differences between FIMT-DD and the Hoeffding Tree (HT):
1. Splitting criterion
2. Numeric attribute handling using BINTREE
3. Linear model at the leaves
4. Concept drift handling: Page-Hinckley test
5. Alternate tree adaptation strategy

Slide 13

Splitting Criterion: Standard Deviation Reduction Measure
Classification:
Information gain = Entropy(before split) - Entropy(after split)
Entropy = -\sum_{i=1}^{c} p_i \log p_i
Gini index = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} p_i^2
Regression:
Gain = SD(before split) - SD(after split)
Standard deviation: SD = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\bar{y} - y_i)^2}
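
To make the regression criterion concrete, a small sketch that scores a candidate binary split by the reduction in standard deviation of the target, weighting each partition by its size; the function names and example values are illustrative:

```python
from math import sqrt

def sd(ys):
    """Standard deviation of a list of target values."""
    mean = sum(ys) / len(ys)
    return sqrt(sum((mean - y) ** 2 for y in ys) / len(ys))

def sd_reduction(ys, left, right):
    """SD(before split) minus the size-weighted SD after the split."""
    n = len(ys)
    after = (len(left) / n) * sd(left) + (len(right) / n) * sd(right)
    return sd(ys) - after

# A split that separates the two target clusters yields a large reduction.
ys = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(sd_reduction(ys, ys[:3], ys[3:]))
```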

Slide 14

Numeric Handling Methods
Exhaustive Binary Tree (BINTREE; Gama et al., 2003):
- Closest implementation to a batch method
- Incrementally updates a binary tree as data is observed
- Issues: high memory cost, high cost of the split search, sensitivity to data order
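
A rough sketch of the BINTREE idea (not the exact structure from Gama et al., 2003): each distinct attribute value becomes a node that accumulates the count and the sums of y and y^2 for instances whose value is at or below its key, so every node can later be scored as a candidate split point with the SD reduction above:

```python
class BinTreeNode:
    """One node of an exhaustive binary tree over observed attribute values."""

    def __init__(self, key):
        self.key = key
        self.left_stats = [0, 0.0, 0.0]   # count, sum of y, sum of y^2 for values <= key
        self.left = None
        self.right = None

    def insert(self, value, y):
        if value <= self.key:
            n, s, s2 = self.left_stats
            self.left_stats = [n + 1, s + y, s2 + y * y]
            if value != self.key:
                if self.left is None:
                    self.left = BinTreeNode(value)
                else:
                    self.left.insert(value, y)
        else:
            if self.right is None:
                self.right = BinTreeNode(value)
            else:
                self.right.insert(value, y)
```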

Slide 15

Page-Hinckley Test
The CUSUM test:
g_0 = 0, \quad g_t = \max(0, g_{t-1} + \epsilon_t - \upsilon)
if g_t > h then alarm and g_t = 0
The Page-Hinckley test:
g_0 = 0, \quad g_t = g_{t-1} + (\epsilon_t - \upsilon)
G_t = \min(g_t)
if g_t - G_t > h then alarm and g_t = 0
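
A minimal sketch of the Page-Hinckley detector as defined above, fed with the observed errors; the tolerance upsilon and the threshold h are illustrative defaults:

```python
class PageHinckley:
    """Page-Hinckley change detector over a stream of error values."""

    def __init__(self, upsilon=0.005, h=50.0):
        self.upsilon = upsilon
        self.h = h
        self.g = 0.0       # cumulative sum g_t
        self.g_min = 0.0   # running minimum G_t

    def update(self, error):
        self.g += error - self.upsilon
        self.g_min = min(self.g_min, self.g)
        if self.g - self.g_min > self.h:   # alarm: change detected, reset the statistics
            self.g, self.g_min = 0.0, 0.0
            return True
        return False
```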

Slide 16

Lazy Methods
kNN (k Nearest Neighbours):
1. Predict the mean value of the k nearest neighbours: \hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)
2. Depends on the distance function
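
A small sketch of this lazy prediction rule: return the mean target of the k stored instances closest to the query, with Euclidean distance used here as the example distance function:

```python
import heapq

def knn_predict(query, instances, k=3):
    """instances is a list of (x, y) pairs with x a tuple of numeric attributes."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    # Keep the k instances with the smallest distance to the query.
    nearest = heapq.nsmallest(k, instances, key=lambda xy: dist(xy[0], query))
    return sum(y for _, y in nearest) / len(nearest)

data = [((1.0, 2.0), 3.0), ((1.1, 1.9), 3.2), ((5.0, 5.0), 10.0), ((5.2, 4.8), 9.5)]
print(knn_predict((1.05, 2.0), data, k=2))  # 3.1, the mean of the two closest neighbours
```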