Master's Thesis Defense Slides


I defended my MSc thesis on April 20, 2011.


Gregory Ditzler

July 12, 2013

Transcript

  1. Incremental Learning of Concept Drift from Imbalanced Data
     Master's Thesis Defense. Gregory Ditzler.
     Thesis Committee: Robi Polikar, Ph.D.; Shreekanth Mandayam, Ph.D.; Nancy Tinkham, Ph.D.
     Dept. of Electrical & Computer Engineering
  2. Contents • Introduction • Approach • Experiments • Conclusions
  3. Contents • Introduction • Approach • Experiments • Conclusions
  4. Issues Addressed in this Work
     • Incremental Learning: learning from data over time without access to old datasets
       o An OCR classifier trained on the English language applied to learning different languages
         • Identify characters not in the English language: ç, â, ê, î, ô, û, ë, ï, ü, ÿ, æ
     • Concept Drift: the underlying data distribution changes with time
       o Consumer ad relevance, spam detection, weather prediction
     • Class Imbalance: one or more classes are under-represented in the training data
       o Credit card fraud detection, cancer detection, financial data
     • Incremental Learning + Concept Drift + Class Imbalance
       o Many concept drift scenarios contain class imbalance
         • Weather prediction, credit card fraud detection, ...
  5. Definitions
     Concept Drift: the joint probability distribution p_t(x, ω) changes over time, i.e., p_t(x, ω) ≠ p_{t+1}(x, ω).
     • Drift can be caused by changes in P(ω), p(x|ω), or P(ω|x)
     • Real vs. virtual (perceived) drift
     • Drift severity
       o Slow, fast, abrupt, random, ...
       o We would like an algorithm robust to any change regardless of the severity
     Bayes' theorem: P(ω|x) [posterior] = p(x|ω) [likelihood] × P(ω) [prior] / p(x) [evidence]
  6. Definitions
     • Types of concept drift (S1 and S2 are sources that generate data)
       o Sudden Drift (Concept Change): occurs at a point in time when the source changes from S1 to S2
       o Gradual Drift: data are sampled from multiple sources within a single time stamp; generally, as time passes, the probability of sampling from S1 decreases as the probability of sampling from S2 increases
       o Incremental Drift: data are sampled from a single source at each time stamp, and the source can differ slightly between time stamps; drift can be observed globally
       o Reoccurring Drift: reoccurring concepts appear when several different sources are used to generate data over time (similar to incremental and gradual drift)
     • Concept drift is the combination of a few different research areas: learning from time-series (time-dependent) data, knowledge transfer / transfer learning, and model adaptation
  7. Definitions
     Class Imbalance: one (or more) classes are severely under-represented in the training data
     • The minority class is typically of more importance
     Incremental Learning: learn new knowledge, preserve old knowledge, and require no access to old data
     • The desired algorithm should find a balance between prior knowledge (stability) and new knowledge (plasticity) [2]
     • Ensembles have been shown to provide a good balance between stability and plasticity
     Fig.: benign (majority) vs. malignant (minority) instances plotted in a two-feature space.
  8. Challenges in Machine Learning
     • Traditional Machine Learning Algorithms
       o Assume data are drawn from a fixed yet unknown distribution, and that a balanced dataset is available in its entirety
     • Concept Drift
       o Old knowledge can become irrelevant at a future point in time
       o Learners must dynamically adapt to the environment to remain strong predictors on new and future environments
     • Class Imbalance
       o Learners tend to bias themselves towards the majority class
         • The minority class is typically of great importance
       o Many concept drift algorithms tend to use error, or a figure of merit derived from error, to adapt to a nonstationary environment
     • Incremental Learning
       o If old data become irrelevant, how will the ensemble adapt to new data (environments)?
       o Existing approaches do not adhere to the incremental learning assumption
     • Combined Problem
       o Individual components have been addressed, but the combination of incremental learning, concept drift, and class imbalance has been sparsely researched
  9. Contents • Introduction • Approach • Experiments • Conclusions
  10. Prior Work
     • Learn++.NSE: incremental learning algorithm for concept drift [3,4]
       o Generate a classifier with each new batch of data, compute a pseudo error for each classifier, apply a time-adjusted weighting mechanism, and call a weighted majority vote for the ensemble decision
         • Recent pseudo errors are weighted heavier than old errors
       o Works very well on a broad range of concept drift problems
       o Shortcoming of Learn++.NSE
         • No mechanism to learn a minority class
     • Uncorrelated Bagging (UCB): bagging-inspired approach for learning concept drift from unbalanced data [5]
       o Accumulate old minority data and train classifiers using all the old minority data with a subset of the newest majority class data; call a majority vote for the ensemble decision
       o Shortcomings of UCB
         • What happens when the accumulated data begin to become a "majority" class?
         • Explicit assumption that the minority class does not drift
         • Violates the one-pass learning requirement of incremental learning [9]
  11. Prior Work
     • Selectively Recursive Approaches: select the old minority data that are most "similar" to the current population of minority data [6-8]
       o Like UCB, selectively recursive approaches accumulate old minority class data. Accumulated minority instances are placed into a training set by selecting the instances that are most similar to the newest minority data. Classifiers are trained and combined using a combination rule set forth by the specific approach.
         • The Mahalanobis distance is selected as the measure to quantify similarity
       o Shortcomings of SERA
         • What happens if the mean of the minority data does not change over time?
         • The Mahalanobis distance works well for a Gaussian distribution, but what about non-Gaussian data?
         • Violates the one-pass learning requirement of incremental learning [9]
  12. Learn++ Solution
     • Batch-based incremental learning approaches for learning new knowledge and preserving old knowledge
       o Retaining classifiers increases stability without requiring that old data be accumulated
     • Learn++.CDS [10] {Concept Drift with SMOTE}
       o Apply SMOTE to Learn++.NSE
         • Learn++.NSE works well on problems involving concept drift
         • SMOTE works well at increasing the recall of a minority class
     • Learn++.NIE [11] {Nonstationary and Imbalanced Environments}
       o Classifiers are replaced with sub-ensembles
         • A sub-ensemble is applied to learn a minority class
       o Voting weights are assigned based on figures of merit other than a class-independent error
     • All Learn++-based approaches use weighted majority voting
  13. Learn++.CDS (block diagram)
     On each new batch D(t): evaluate the previous ensemble H(t−1) and form a penalty distribution over D(t); call SMOTE; call BaseClassifier; compute a pseudo error ε_k(t) for k = 1, 2, ..., t; determine time-adjusted weights; combine with a WMV.
  14. Learn++.CDS
     • Evaluate the ensemble when new labeled data are presented
       o Determine instances that have not been learned from past experience
       o Maintain a penalty distribution over the new data
     • Call SMOTE with the minority data in D(t)
       o SMOTE parameters: oversampling percentage and number of nearest neighbors
       o SMOTE reduces imbalance and can provide more robust predictors on the minority class
         • SMOTE can also increase other figures of merit like F-measure or AUC
     • Train a new classifier using D(t) and the synthetic data generated with SMOTE
     Algorithm (Learn++.CDS):
     Input: training data D(t) = {x_i ∈ X; y_i ∈ Ω}, i = 1, 2, ..., m(t); supervised learning algorithm BaseClassifier; sigmoid parameters a and b
     for t = 1, 2, ... do
       1. Compute the error of the existing ensemble: E(t) = (1/m(t)) Σ_i [[H(t−1)(x_i) ≠ y_i]]   (1)
       2. Update and normalize instance weights: w_i(t) = (1/m(t)) · { E(t) if H(t−1)(x_i) = y_i; 1 otherwise }   (2); D_i(t) = w_i(t) / Σ_j w_j(t)   (3)
       3. Call SMOTE on the D(t) minority instances to obtain synthetic data S(t)
       4. Call BaseClassifier with D(t) and S(t) to obtain h_t: X → Ω
       5. Evaluate all existing classifiers on D(t) and obtain the pseudo error: ε_k(t) = Σ_i D_i(t) [[h_k(x_i) ≠ y_i]]   (4); if ε_{k=t}(t) > 1/2, generate a new h_t; if ε_{k<t}(t) > 1/2, set ε_{k<t}(t) = 1/2; β_k(t) = ε_k(t) / (1 − ε_k(t))   (5)
       6. Compute the sigmoid-weighted sum of normalized errors for k = 1, 2, ..., t: ω_k(t) = 1 / (1 + exp(−a(t − k − b)))   (6); ω̄_k(t) = ω_k(t) / Σ_{j=0}^{t−k} ω_k(t − j)   (7); β̄_k(t) = Σ_{j=0}^{t−k} ω̄_k(t − j) β_k(t − j)   (8)
       7. Calculate the voting weight: W_k(t) = log(1 / β̄_k(t))   (9)
       8. Compute the ensemble decision: H(t)(x) = arg max_{ω∈Ω} Σ_k W_k(t) [[h_k(x) = ω]]   (10)
     end for
     Output: call WMV to compute H(t)(x)
  16. SMOTE
     • Synthetic Minority Over-sampling TEchnique [19]
       o Generate "synthetic" instances on the line segment connecting two neighboring minority class instances
       o Avoids issues commonly encountered with random under/over-sampling of the majority/minority class
     • Select one of the k-nearest neighbors of a minority class instance x_i
       o Generate a synthetic instance given by x_i + δ(x_z − x_i), where x_z is the selected nearest neighbor of x_i and δ is the "gap" parameter
       o The gap controls where the synthetic instance lies on the segment between the two nearest neighbors
     • Synthetic samples lie within the convex hull of the original minority class sample
     Algorithm (SMOTE):
     Input: minority data P = {x_i ∈ X}, i = 1, 2, ..., n_min; number of minority instances n_min; SMOTE percentage N; number of nearest neighbors k
     for i = 1, 2, ..., n_min do
       1. Find the k nearest (minority class) neighbors of x_i
       2. Set T = N/100
       while T ≠ 0 do
         1. Select one of the k nearest neighbors; call it x_z
         2. Select a random number δ ∈ [0, 1]
         3. x_s = x_i + δ(x_z − x_i)
         4. Append x_s to the synthetic set S
         5. T = T − 1
       end while
     end for
     Output: return the synthetic data S
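The SMOTE procedure above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function name, the brute-force neighbor search, and the fixed seed are my own choices.

```python
import random

def smote(minority, N=200, k=3, seed=0):
    """Sketch of SMOTE: for each minority instance, create N/100 synthetic
    instances on the segment joining it to one of its k nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbors of x (squared Euclidean distance, excluding x)
        neighbors = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        for _ in range(N // 100):
            xz = rng.choice(neighbors)   # one of the k neighbors
            delta = rng.random()         # the "gap" parameter in [0, 1]
            synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, xz)))
    return synthetic
```

With the four corners of the unit square as minority data and N = 200, each instance yields two synthetic points, all of which stay inside the convex hull of the minority sample, as the slide notes.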
  17. Learn++.CDS
     • Evaluate all classifiers on the new data and compute a pseudo error
       o Apply the penalty distribution D(t) to compute the pseudo error
         • Some instances incur more of a misclassification penalty than others
       o If a new classifier's error is greater than 1/2, generate a new classifier
       o If an old classifier's error is greater than 1/2, set it to 1/2
       o Normalize the pseudo error
     • Compute the age-adjusted weighted sum of a classifier's errors by applying a normalized logistic sigmoid
       o Recent weighted errors are weighted heavier
       o The voting weight is proportional to the weighted sum
     • The final hypothesis is made with a WMV
     Algorithm steps:
       5. Evaluate existing classifiers on D(t) and obtain the pseudo error: ε_k(t) = Σ_i D_i(t) [[h_k(x_i) ≠ y_i]]   (4); if ε_{k=t}(t) > 1/2, generate a new h_t; if ε_{k<t}(t) > 1/2, set ε_{k<t}(t) = 1/2; β_k(t) = ε_k(t) / (1 − ε_k(t))   (5)
       6. Compute the sigmoid-weighted sum of normalized errors for k = 1, 2, ..., t: ω_k(t) = 1 / (1 + exp(−a(t − k − b)))   (6); ω̄_k(t) = ω_k(t) / Σ_{j=0}^{t−k} ω_k(t − j)   (7); β̄_k(t) = Σ_{j=0}^{t−k} ω̄_k(t − j) β_k(t − j)   (8)
       7. Calculate the voting weight: W_k(t) = log(1 / β̄_k(t))   (9)
       8. Compute the ensemble decision: H(t)(x) = arg max_{ω∈Ω} Σ_k W_k(t) [[h_k(x) = ω]]   (10)
     Output: call WMV to compute H(t)(x)
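The time-adjusted weighting and weighted majority vote can be sketched as follows. This is a minimal reading of the slide's equations, not the thesis code: function names, the error clamp at 1/2, and the small floor on the weighted error (to keep the logarithm finite) are my own choices.

```python
import math

def voting_weights(errors, a=0.5, b=10):
    """Sketch of the time-adjusted weighting (eqs. 5-9).
    errors[k-1] holds classifier k's pseudo errors at times k, k+1, ..., t
    (oldest first). Returns one voting weight per classifier."""
    t = len(errors)  # current time stamp; classifiers indexed 1..t
    W = []
    for k, errs in enumerate(errors, start=1):
        # clamp errors above 1/2, then normalize (eq. 5)
        beta = [min(e, 0.5) / (1.0 - min(e, 0.5)) for e in errs]
        # sigmoid ages omega_k(t-j), j = 0..t-k, newest age first (eq. 6)
        omega = [1.0 / (1.0 + math.exp(-a * (t - j - k - b))) for j in range(t - k + 1)]
        s = sum(omega)
        omega = [w / s for w in omega]                           # eq. (7)
        # pair the newest error (end of list) with j = 0 (eq. 8)
        beta_bar = sum(w * bb for w, bb in zip(omega, reversed(beta)))
        W.append(math.log(1.0 / max(beta_bar, 1e-10)))           # eq. (9)
    return W

def wmv(votes, W, classes):
    """Weighted majority vote (eq. 10): votes[k] is classifier k's label."""
    score = {c: 0.0 for c in classes}
    for v, w in zip(votes, W):
        score[v] += w
    return max(classes, key=lambda c: score[c])
```

A classifier with consistently low pseudo errors receives a larger voting weight than one with high errors, so its label dominates the vote.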
  19. (image-only slide; no transcript)
  20. Learn++.NIE
     • Ensembles have been popular for learning unbalanced data
       o Ensemble approaches can increase recall and several other figures of merit when facing an unbalanced data problem
       o BEV [12], SMOTEBoost [13], DataBoost-IM [14], and RAMOBoost [15]
     • Like Learn++.CDS, Learn++.NIE uses many of the fundamental principles for learning in nonstationary environments
       o Ensemble classifier approach
       o Time-adjusted weighting mechanism
       o Classifiers are combined with a weighted majority vote
     • Unlike Learn++.CDS, Learn++.NIE uses several new components to learn concept drift from unbalanced data
       o Multiple classifiers are generated at each time stamp
       o New figures of merit are applied to determine a sub-ensemble voting weight
     • Strategy: track concept drift using figures of merit other than class-independent error to combine sub-ensembles using a time-adjusted weighting scheme
  21. Learn++.NIE (block diagram)
     On each new batch D(t): train classifiers h_1, h_2, ..., h_n and combine them into a sub-ensemble E_t with a simple majority vote (SMV); compute a figure of merit ε_k(t) for each sub-ensemble k = 1, 2, ..., t; determine time-adjusted weights; combine sub-ensembles with a WMV.
  22. Learn++.NIE
     • An ensemble of classifiers is created at each time step
       o Train classifiers on all minority data plus randomly sampled subsets of the newest majority data
       o The sub-ensemble combination rule is a majority vote
     • Compute ε_k(t) as a figure of merit for each sub-ensemble on D(t)
       o Replacement for the pseudo error
       o ε_k(t) should reflect the performance on all classes
     • Learn++.NIE follows Learn++.CDS from this point
     Algorithm (Learn++.NIE):
     Input: training data D(t) = {x_i ∈ X; y_i ∈ Ω}, i = 1, 2, ..., m(t); supervised learning algorithm BaseClassifier; sigmoid parameters a and b; ensemble size n
     for t = 1, 2, ... do
       1. Call E_t = SubEnsemble(BaseClassifier, D(t), n)
       2. Evaluate all existing sub-ensembles on D(t) to produce instance labels for k = 1, 2, ..., t. Determine the sub-ensemble weight measure ε_k(t) using (17), (18), or (19); if ε_{k=t}(t) > 1/2, generate a new sub-ensemble; if ε_{k<t}(t) > 1/2, set ε_{k<t}(t) = 1/2; β_k(t) = ε_k(t) / (1 − ε_k(t))   (11)
       3. Compute the sigmoid-weighted sum of normalized errors for k = 1, 2, ..., t: ω_k(t) = 1 / (1 + exp(−a(t − k − b)))   (12); ω̄_k(t) = ω_k(t) / Σ_{j=0}^{t−k} ω_k(t − j)   (13); β̄_k(t) = Σ_{j=0}^{t−k} ω̄_k(t − j) β_k(t − j)   (14)
       4. Calculate the voting weight: W_k(t) = log(1 / β̄_k(t))   (15)
       5. Compute the ensemble decision: H(t)(x) = arg max_{ω∈Ω} Σ_k W_k(t) [[E_k(x) = ω]]   (16)
     end for
     Output: call WMV to compute H(t)(x)
  23. Computing ε_k(t)
     • F-measure {Learn++.NIE (fm)}
       o Combination of precision and recall
         • Precision: fraction of retrieved documents that are relevant to the search
         • Recall: fraction of relevant documents that were successfully retrieved
       o The F1-measure is implied with "F-measure":
         ε = 1 − 2 (precision × recall) / (precision + recall) = 1 − F1   (17)
     • Weighted Recall Measure {Learn++.NIE (wavg)}
       o Convex combination of the majority class error, ε_maj, and minority class error, ε_min:
         ε = β ε_maj + (1 − β) ε_min   (18)
       o β ∈ [0, 1] controls the weight given to the majority and minority class
     • Geometric Mean {Learn++.NIE (gm)}
       o Classifiers performing poorly on one or more classes will have a low G-mean to reflect this performance:
         ε = 1 − (Π_{c=1}^{C} (1 − ε_c))^{1/C}   (19)
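The three sub-ensemble weight measures (17)-(19) are small enough to write out directly. This is a sketch under my reading of the slide's notation; function names are mine, and the inputs are per-class error rates rather than raw predictions.

```python
def eps_fm(precision, recall):
    """F-measure based weight (eq. 17): eps = 1 - F1."""
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 1.0 - f1

def eps_wavg(err_maj, err_min, beta=0.5):
    """Weighted recall measure (eq. 18): convex combination of class errors."""
    return beta * err_maj + (1.0 - beta) * err_min

def eps_gm(class_errors):
    """Geometric-mean measure (eq. 19): poor performance on any single
    class drives the product down and the measure up."""
    prod = 1.0
    for e in class_errors:
        prod *= (1.0 - e)
    return 1.0 - prod ** (1.0 / len(class_errors))
```

For example, with a 50% error on each of two classes, eps_gm gives 1 − (0.5 × 0.5)^{1/2} = 0.5, while a perfect classifier gives 0 under all three measures.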
  24. Contents • Introduction • Approach • Experiments • Conclusions
  25. Figures of Merit
     • Raw Classification Accuracy: RCA = (1/m) Σ_{i=1}^{m} [[H(x_i) = y_i]] = (TP + TN) / (TP + TN + FP + FN)
     • Precision: precision = TP / (TP + FP)
     • Recall: recall = TP / (TP + FN)
     • Geometric Mean: G-mean = (Π_{c=1}^{C} recall_c)^{1/C}
     • F-measure: F1 = 2 (precision × recall) / (precision + recall)
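For the binary case (minority class taken as positive), all of these figures of merit follow from the four confusion-matrix counts. A minimal sketch; the function name and dictionary keys are my own:

```python
def metrics(tp, tn, fp, fn):
    """Figures of merit from a binary confusion matrix (minority = positive)."""
    rca = (tp + tn) / (tp + tn + fp + fn)        # raw classification accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # minority-class recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    gmean = (recall * specificity) ** 0.5        # two-class geometric mean
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"RCA": rca, "precision": precision, "recall": recall,
            "G-mean": gmean, "F1": f1}
```

Note how imbalance distorts RCA: a classifier that recalls every minority instance but makes a few false positives can have high accuracy and a much lower F1, which is why the thesis tracks several measures at once.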
  26. Figures of Merit
     • Area Under the ROC Curve (AUC)
       o ROC curves depict the tradeoff between false positives and true positives
       o AUC is equivalent to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [16]
       o AUC = 0.5 corresponds to randomly assigned labels
     Fig.: A naïve Bayes classifier with a Gaussian kernel was generated on 10,000 random instances drawn from a standardized Gaussian distribution. The class labels are produced by computing sign(N(0, 1)). The AUC for w1 (left) is 0.50185 and for w2 (right) is 0.50295.
     Fig.: A naïve Bayes classifier with a Gaussian kernel was generated on 10,000 randomly selected instances and tested on 6,000 randomly selected instances. The ROC curve was generated using 200 thresholds. The AUC for w1 (right) is 0.7905 and for w3 (left) is 0.9229.
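The probabilistic interpretation in [16] gives a direct way to compute AUC without tracing the ROC curve: count, over all positive-negative pairs, how often the positive instance is scored higher (ties count one half). A minimal O(n²) sketch, with the function name my own:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive instance is
    scored above a randomly chosen negative one (ties count 1/2)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect ranking yields 1.0, a fully inverted one 0.0, and identical scores for every instance 0.5, matching the "random labels" baseline on the slide.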
  27. Figures of Merit
     • Overall Performance Measure (OPM)
       o OPM is a convex combination of RCA, F1-measure, AUC, and recall
       o For the purpose of this study, α1 = α2 = α3 = α4 = 1/4:
         OPM = α1 × RCA + α2 × F1 + α3 × AUC + α4 × recall
     • Ranking Algorithms
       o The average of RCA, F1-measure, AUC, and recall is computed over the entire experiment
       o Classifiers are ranked from (1) to (k), where k is the number of classifiers used in the comparison
         • Fractional ranks are applied in the event of a tie
         • (1) = best performing; (k) = worst performing
     Example ranking table:
       | | Measure 1 | Measure 2 | ... |
       | Algorithm 1 | 90±1.2 (1) | 85±1.2 (1.5) | ... |
       | Algorithm 2 | 85±1.0 (k) | 85±1.0 (1.5) | ... |
       | ... | ... | ... | ... |
       | Algorithm k | 89±1.5 (2) | 60±1.5 (k) | ... |
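The fractional tie-handling used in the comparison tables can be sketched as follows (a higher score means a better, i.e. lower, rank). A minimal illustration; the function name and the stable tie-breaking of Python's sort are my own choices:

```python
def fractional_ranks(scores):
    """Rank algorithms from 1 (best) to k (worst) by score, averaging the
    positions of tied entries, as in the comparison tables."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while consecutive sorted scores are equal
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2.0   # average of 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks
```

With the slide's Measure 2 column, two algorithms tied at 85 share the rank (1 + 2) / 2 = 1.5, exactly as shown in the example table.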
  28. Datasets Used in Experiments
     Synthetic Datasets
     • Rotating Spiral
     • Rotating Checkerboard
     • Drifting Gaussian Data
     • Shifting Hyperplane
     Real-World Datasets
     • Australia Electricity Pricing
     • NOAA Weather Data
     Image credits:
     [DA] http://hottavainen.deviantart.com/art/Rainy-day-gif-animation-182893258
     [TMI] http://en.wikipedia.org/wiki/File:Three_Mile_Island_(color)-2.jpg
  29. Synthetic Datasets: Rotating Spiral
     • Generated with four spirals belonging to one of two classes
       o Data are generated for 300 time stamps with a reoccurring environment beginning at t = 150
       o Interesting properties: the mean of the data is not changing; reoccurring environments
     • Data are generated such that ≈5% class imbalance is present
  30. Synthetic Datasets: Rotating Checkerboard
     • Two-class problem with a reoccurring environment and a constant drift rate
       o The experiment is carried out over 200 time stamps with the reoccurring environment beginning at t = 100
     • Data are generated such that ≈5% class imbalance is present
  31. Synthetic Datasets: Drifting Gaussian
     • Linear combination of four Gaussian components
       o 3 majority + 1 minority
       o Drift is found in the mean and covariance throughout the duration of the experiment
     • Data are generated such that ≈3% class imbalance is present
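A dataset of this shape is easy to simulate. The sketch below is a toy stand-in, not the thesis generator: the component means, unit variances, drift rate, and batch sizes are all my own illustrative parameters; only the structure (three majority Gaussians plus one drifting minority Gaussian, ≈3% imbalance) mirrors the slide.

```python
import random

def drifting_gaussians(t, n_maj=300, n_min=10, seed=None):
    """Toy drifting-Gaussian batch at time stamp t: three majority
    components and one minority component whose means move with t."""
    rng = random.Random(seed if seed is not None else t)
    # hypothetical means; the 0.02*t terms introduce slow incremental drift
    maj_means = [(-2.0 + 0.02 * t, 0.0), (2.0, 0.02 * t), (0.0, -2.0)]
    min_mean = (0.02 * t, 2.0)
    data = []
    for _ in range(n_maj):  # majority class, label 0
        mx, my = rng.choice(maj_means)
        data.append(((rng.gauss(mx, 1.0), rng.gauss(my, 1.0)), 0))
    for _ in range(n_min):  # minority class, label 1 (~3% of the batch)
        data.append(((rng.gauss(min_mean[0], 1.0), rng.gauss(min_mean[1], 1.0)), 1))
    return data
```

Calling this once per time stamp yields the batch stream D(1), D(2), ... that the incremental learners consume.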
  32. Synthetic Datasets: Shifting Hyperplane
     • The hyperplane changes location at three points in time
       o Three features, only two of which are relevant
       o The class imbalance changes as the plane shifts; thus, both p(x|ω) and P(ω) change
         • Dual change
     • Data are generated such that ≈7-25% class imbalance is present
  33. Real-World Datasets
     • Nebraska Weather Dataset
       o Predict whether it rained on any given day
       o ≈50 years of daily recordings
       o Features: minimum/average/maximum temperature, average/maximum wind speed, visibility, sea level pressure, and dew point
       o Imbalance: ≈30% with a minimum of ≈10%
     • Australia Electricity Pricing Dataset
       o Predict whether the price of electricity went up or down
       o Features: day, period, NSW demand, VIC demand, and the scheduled transfer between the two states
       o Imbalance: ≈5% (achieved through undersampling)
  34. Algorithm Comparisons
     • Proposed approaches: Learn++.NIE(fm), Learn++.NIE(gm), Learn++.NIE(wavg), and Learn++.CDS
     • Streaming Ensemble Algorithm (SEA) [17]
     • Learn++.NSE [3]
     • Selectively Recursive Approach [6]
     • Uncorrelated Bagging [5]
     • Making the comparisons
       o The base classifier is a CART decision tree for all algorithms
       o All algorithm parameters are selected in the same manner and remain constant, unless a parameter must be adjusted for each dataset (e.g., SMOTE depends on the level of imbalance)
       o Specific algorithm parameters have been selected based on conclusions reached in the authors' comments
  35. Key Observations
     1. Learn++.NIE(fm) and Learn++.CDS consistently rank near the top three for OPM on nearly all datasets tested.
        a) Results are significant compared to Learn++.NSE, SERA, and SEA
     2. Learn++.NIE(fm) and Learn++.CDS typically provide a significant increase in recall, AUC, F-measure, and OPM compared to their predecessors.
     3. UCB's increase in recall comes at the cost of the overall accuracy and F-measure.
     4. Learn++.CDS improves the OPM rank over Learn++.NSE on every dataset tested.
     5. Learn++.NIE(fm) typically provides better results than the (gm) or (wavg) variants.
  36. Rotating Spiral Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 97.76±0.11 (1) | 86.13±0.76 (1) | 91.33±0.49 (6) | 76.96±1.17 (6) | 88.05±0.63 (6) | 4.0
     SEA | 96.65±0.12 (4) | 78.97±0.84 (7) | 88.91±0.50 (7) | 69.49±1.15 (7) | 83.51±0.65 (7) | 6.4
     Learn++.NIE(fm) | 97.30±0.13 (2) | 85.87±0.65 (2) | 97.34±0.26 (2) | 89.87±0.73 (3) | 92.60±0.44 (1) | 2.0
     Learn++.NIE(gm) | 96.11±0.16 (6) | 80.57±0.70 (5) | 93.11±0.38 (4) | 87.21±0.80 (4) | 89.25±0.51 (4) | 4.6
     Learn++.NIE(wavg) | 96.08±0.16 (7) | 80.46±0.70 (6) | 93.09±0.39 (5) | 87.20±0.80 (5) | 89.21±0.51 (5) | 5.6
     Learn++.CDS | 96.81±0.15 (3) | 84.15±0.65 (3) | 96.15±0.31 (3) | 91.77±0.71 (2) | 92.22±0.46 (3) | 2.8
     SERA | 92.73±0.32 (8) | 62.67±1.66 (8) | 80.96±1.10 (8) | 66.57±2.17 (8) | 75.73±1.45 (8) | 8.0
     UCB | 96.42±0.16 (5) | 82.57±0.69 (4) | 98.18±0.19 (1) | 92.74±0.65 (1) | 92.48±0.42 (2) | 2.6
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 300 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  37. Rotating Checkerboard Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 97.45±0.17 (1) | 68.25±2.14 (2) | 83.76±1.17 (4) | 56.55±2.48 (7) | 76.50±1.49 (3) | 3.4
     SEA | 87.41±0.63 (7) | 21.93±1.63 (8) | 65.75±1.29 (8) | 31.87±2.18 (8) | 51.74±1.43 (8) | 7.8
     Learn++.NIE(fm) | 95.06±0.47 (3) | 61.45±2.51 (3) | 92.62±0.85 (1) | 74.32±2.20 (3) | 80.86±1.51 (2) | 2.4
     Learn++.NIE(gm) | 90.02±0.51 (5) | 42.11±1.94 (5) | 83.37±1.13 (5) | 66.76±2.20 (5) | 70.57±1.45 (6) | 5.2
     Learn++.NIE(wavg) | 89.89±0.51 (6) | 41.15±1.86 (6) | 82.75±1.12 (6) | 65.91±2.16 (6) | 69.93±1.41 (7) | 6.2
     Learn++.CDS | 97.18±0.21 (2) | 72.93±1.82 (1) | 90.89±0.96 (3) | 74.50±2.19 (2) | 83.88±1.30 (1) | 1.8
     SERA | 92.89±0.43 (4) | 52.57±2.29 (4) | 80.80±1.29 (7) | 67.39±2.55 (4) | 73.41±1.64 (5) | 4.8
     UCB | 85.78±0.51 (8) | 38.26±1.44 (7) | 91.89±0.70 (2) | 82.33±1.75 (1) | 74.57±1.10 (4) | 4.4
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 200 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  38. Drifting Gaussian Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 97.63±0.18 (1) | 66.30±2.62 (4) | 83.65±1.43 (7) | 58.33±3.15 (7) | 76.48±1.85 (7) | 5.2
     SEA | 97.46±0.18 (3) | 64.39±2.44 (5) | 82.97±1.31 (8) | 56.40±2.84 (8) | 75.31±1.69 (8) | 6.4
     Learn++.NIE(fm) | 96.11±0.27 (5) | 67.30±1.95 (3) | 95.80±0.67 (2) | 86.74±2.01 (2) | 86.45±0.99 (2) | 2.8
     Learn++.NIE(gm) | 95.24±0.27 (6) | 63.37±1.86 (7) | 92.12±0.89 (4) | 86.51±1.90 (3) | 84.31±1.23 (4) | 4.8
     Learn++.NIE(wavg) | 95.20±0.28 (8) | 62.93±1.91 (8) | 91.60±0.94 (5) | 85.42±1.97 (4) | 83.79±1.28 (5) | 6.0
     Learn++.CDS | 97.50±0.20 (2) | 74.21±1.90 (1) | 92.19±1.07 (3) | 80.85±2.45 (5) | 86.19±1.41 (3) | 2.8
     SERA | 97.37±0.22 (4) | 70.76±2.28 (2) | 85.99±1.46 (6) | 73.52±2.96 (6) | 81.91±1.73 (6) | 4.8
     UCB | 95.22±0.30 (7) | 63.74±1.94 (6) | 96.84±0.54 (1) | 92.02±1.56 (1) | 86.96±1.09 (1) | 3.2
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 100 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  39. Shifting Hyperplane Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 94.98±0.26 (1) | 71.98±1.57 (2) | 83.30±0.90 (6) | 62.87±1.96 (7) | 78.28±1.17 (5) | 4.2
     SEA | 94.00±0.26 (3) | 68.13±1.48 (3) | 82.00±0.85 (7) | 60.28±1.77 (8) | 76.10±1.09 (7) | 5.6
     Learn++.NIE(fm) | 92.38±0.46 (7) | 67.27±1.62 (6) | 85.93±0.90 (1) | 74.83±1.60 (1) | 80.10±1.15 (2) | 3.4
     Learn++.NIE(gm) | 93.03±0.31 (5) | 67.90±1.36 (5) | 84.51±0.81 (4) | 72.17±1.61 (3) | 79.40±1.02 (3) | 4.0
     Learn++.NIE(wavg) | 93.25±0.30 (4) | 67.94±1.39 (4) | 84.08±0.83 (5) | 70.65±1.65 (4) | 78.98±1.07 (4) | 4.2
     Learn++.CDS | 94.75±0.28 (2) | 72.24±1.46 (1) | 85.16±0.84 (3) | 68.80±1.79 (5) | 80.24±1.09 (1) | 2.4
     SERA | 92.47±0.44 (6) | 63.01±1.84 (7) | 80.11±1.08 (8) | 64.68±2.17 (6) | 75.07±1.38 (8) | 7.0
     UCB | 90.77±0.45 (8) | 62.05±1.44 (8) | 85.84±0.95 (2) | 73.34±1.66 (2) | 78.00±1.13 (6) | 5.2
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 200 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  40. Electricity Pricing Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 90.75±0.86 (2) | 15.40±3.05 (7) | 59.66±2.04 (7) | 16.87±3.31 (7) | 45.67±2.32 (7) | 6.0
     SEA | 92.15±0.60 (1) | 9.37±2.15 (8) | 58.48±1.55 (8) | 10.53±2.19 (8) | 42.63±1.62 (8) | 6.6
     Learn++.NIE(fm) | 82.60±1.80 (6) | 20.79±2.55 (3) | 72.45±2.15 (1) | 38.72±4.93 (3) | 53.64±2.86 (3) | 3.2
     Learn++.NIE(gm) | 83.60±1.30 (5) | 22.29±2.64 (1) | 70.70±2.34 (2) | 38.37±4.68 (4) | 53.74±2.74 (2) | 2.8
     Learn++.NIE(wavg) | 84.70±1.15 (4) | 21.88±2.61 (2) | 69.54±2.23 (4) | 35.61±4.28 (5) | 52.93±2.57 (4) | 3.8
     Learn++.CDS | 88.48±1.12 (3) | 18.09±3.05 (6) | 60.58±2.27 (6) | 22.91±4.07 (6) | 47.52±2.63 (6) | 5.4
     SERA | 76.42±1.70 (7) | 19.91±2.06 (4) | 62.42±2.22 (2) | 46.46±4.70 (2) | 51.30±2.67 (5) | 4.6
     UCB | 68.23±1.72 (8) | 18.68±1.75 (5) | 69.74±2.34 (3) | 58.87±4.47 (1) | 53.88±2.57 (1) | 3.6
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 60 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  41. Weather Dataset
     Algorithm | RCA | F-measure | AUC | Recall | OPM | Mean Rank
     Learn++.NSE | 73.35±0.00 (4) | 51.27±0.00 (5) | 72.08±0.00 (6) | 49.38±0.00 (6) | 61.52±0.00 (5) | 5.2
     SEA | 75.81±0.00 (1) | 50.43±0.00 (6) | 73.37±0.00 (4) | 42.86±0.00 (8) | 60.62±0.00 (6) | 5.0
     Learn++.NIE(fm) | 70.54±1.08 (7) | 59.19±1.31 (3) | 77.84±0.79 (1) | 72.48±2.19 (1) | 70.01±1.34 (2) | 2.8
     Learn++.NIE(gm) | 73.53±0.80 (3) | 60.78±1.12 (2) | 76.83±0.69 (2) | 69.27±1.84 (2) | 70.10±1.11 (1) | 2.0
     Learn++.NIE(wavg) | 74.07±0.74 (2) | 60.94±1.04 (1) | 76.42±0.66 (3) | 68.04±1.71 (3) | 69.87±1.04 (3) | 2.4
     Learn++.CDS | 73.05±0.93 (5) | 52.89±1.74 (4) | 72.91±1.03 (5) | 53.75±2.69 (4) | 63.15±1.60 (4) | 4.6
     SERA | 65.17±1.83 (8) | 48.38±2.30 (7) | 63.54±1.48 (8) | 58.49±4.16 (7) | 58.90±2.44 (7) | 6.8
     UCB | 70.82±1.43 (6) | 46.40±3.18 (8) | 71.07±1.57 (7) | 45.54±4.77 (8) | 58.46±2.74 (8) | 7.2
     Fig.: RCA, F-measure, AUC, and recall vs. time step over 150 time steps for (top) the Learn++.NIE variants and (bottom) UCB, SERA, Learn++.CDS, and Learn++.NIE(fm).
  42. Overall Results
     Table (1): OPM ranks over all datasets
     Algorithm | gauss | checker | spiral | hyper | elec | noaa | mean
     Learn++.NSE | 7 | 3 | 5 | 6 | 7 | 5 | 5.50
     SEA | 8 | 8 | 7 | 7 | 8 | 6 | 7.33
     Learn++.NIE(fm) | 2 | 2 | 2 | 1 | 3 | 2 | 2.00
     Learn++.NIE(gm) | 4 | 6 | 3 | 4 | 2 | 1 | 3.33
     Learn++.NIE(wavg) | 5 | 7 | 4 | 5 | 4 | 3 | 4.67
     Learn++.CDS | 3 | 1 | 1 | 3 | 6 | 4 | 3.00
     SERA | 6 | 5 | 8 | 8 | 5 | 7 | 6.50
     UCB | 1 | 4 | 6 | 2 | 1 | 8 | 3.67
     Table (2): AUC ranks over all datasets
     Algorithm | gauss | checker | spiral | hyper | elec | noaa | mean
     Learn++.NSE | 7 | 4 | 6 | 6 | 7 | 6 | 6.00
     SEA | 8 | 8 | 7 | 7 | 8 | 4 | 7.00
     Learn++.NIE(fm) | 2 | 1 | 2 | 1 | 1 | 1 | 1.33
     Learn++.NIE(gm) | 4 | 5 | 4 | 4 | 2 | 2 | 3.50
     Learn++.NIE(wavg) | 5 | 6 | 5 | 5 | 4 | 3 | 4.67
     Learn++.CDS | 3 | 3 | 3 | 3 | 6 | 5 | 3.83
     SERA | 6 | 7 | 8 | 8 | 5 | 8 | 7.00
     UCB | 1 | 2 | 1 | 2 | 3 | 7 | 2.67
     Table (3): FM ranks over all datasets
     Algorithm | gauss | checker | spiral | hyper | elec | noaa | mean
     Learn++.NSE | 4 | 2 | 1 | 2 | 7 | 5 | 3.50
     SEA | 5 | 8 | 7 | 3 | 8 | 6 | 6.17
     Learn++.NIE(fm) | 3 | 3 | 2 | 6 | 3 | 3 | 3.33
     Learn++.NIE(gm) | 7 | 5 | 5 | 5 | 1 | 2 | 4.17
     Learn++.NIE(wavg) | 8 | 6 | 6 | 4 | 2 | 1 | 4.50
     Learn++.CDS | 1 | 1 | 3 | 1 | 6 | 4 | 2.67
     SERA | 2 | 4 | 8 | 7 | 4 | 7 | 5.33
     UCB | 6 | 7 | 4 | 8 | 5 | 8 | 6.33
  43. Comparing Multiple Classifiers

    • Comparing multiple classifiers on multiple datasets is not a trivial problem
      o Confidence intervals only allow the comparison of multiple classifiers on a single dataset
    • The rank-based Friedman test can determine whether classifiers perform equally across multiple datasets [18]
      o Ranks are applied to the average of each measure on a dataset
      o The standard deviation of the measure is not used in the Friedman test

      χ²_F = [12N / (k(k+1))] · [ Σ_{j=1..k} R̄_j² − k(k+1)²/4 ]

      F_F = (N−1)·χ²_F / [ N(k−1) − χ²_F ]

    • z-scores can be computed from the average ranks in the Friedman test
      o The α-level (critical value) must be adjusted for the multiple comparisons being made
      o The Bonferroni–Dunn procedure adjusts α to α/(k−1) [18]

      z_{i,j} = (R̄_i − R̄_j) / √( k(k+1) / (6N) )

    where N is the number of datasets, k is the number of classifiers, and R̄_j is the average rank of classifier j
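A minimal sketch of these test statistics in Python (helper names are my own): the Friedman χ²_F, the Iman–Davenport correction F_F used by Demšar [18], and the pairwise z-score for post-hoc comparisons.

```python
import math

def friedman_statistics(avg_ranks, N):
    """Friedman chi-square and Iman-Davenport F_F computed from the
    average ranks of k classifiers over N datasets (Demsar, 2006)."""
    k = len(avg_ranks)
    chi2_f = (12.0 * N / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
    return chi2_f, f_f

def pairwise_z(r_i, r_j, k, N):
    """z-score comparing the average ranks of classifiers i and j; with a
    control classifier, compare against the critical value at level
    alpha/(k-1) per the Bonferroni-Dunn procedure."""
    return (r_i - r_j) / math.sqrt(k * (k + 1) / (6.0 * N))
```

For example, with k = 3 classifiers whose average ranks over N = 2 datasets are 1.5, 1.5, and 3.0, this gives χ²_F = 3.0 and F_F = 3.0, which would then be compared against the appropriate χ² and F critical values.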
  44. Friedman Test Results

    Hypothesis tests comparing Learn++.NIE(fm) [◊] and Learn++.CDS [•] to the other algorithms (only significant improvement is marked)

    Significance marks by measure (against L++.NSE, SEA, SERA, UCB):
      RCA:     •   •
      FM:      ◊•  ◊•
      AUC:     ◊•  ◊•  ◊•
      Recall:  ◊•  ◊•  ◊
      OPM:     ◊•  ◊•  ◊•

    • The Friedman test rejects the null hypothesis on all figures of merit
      o Good! But which algorithm(s) are performing better or worse than others?
    • Learn++.CDS and Learn++.NIE(fm) provide a significant improvement over SERA and UCB
      o UCB lacks rejection on several measures; however, UCB does not offer significant improvement over Learn++.CDS or Learn++.NIE
    • Learn++.CDS and Learn++.NIE(fm) offer improvement on several measures compared to the concept drift algorithms
  45. Contents • Introduction • Approach • Experiments • Conclusions
  46. Conclusions

    • Learn++.NIE(fm) and Learn++.CDS provide significant improvement in several figures of merit over the concept drift algorithms
      o Boost in recall, AUC, and OPM
      o No surprise that Learn++.NSE and SEA have strong raw classification accuracy
    • The Learn++.NIE framework improves a few figures of merit compared to SERA and UCB
      o Learn++.NIE improves the F-measure, and Learn++.CDS improves the F-measure and RCA, over UCB
    • Existing literature requires access to old data in order to learn concept drift from imbalanced data
      o Using old data for training can be detrimental to certain performance metrics
        • UCB: trains on all accumulated minority class data
        • SERA: trains on a selected subset of the accumulated minority class data that is most similar to the newest minority class distribution
    • The proposed approaches consistently perform well, as demonstrated on a variety of problems
  47. Future Work

    • Data stream mining
      o Learning massive data streams with imbalanced classes
    • The theory of learning in harsh environments
      o Fewer heuristics, more statistics**
    • Semi-supervised learning in nonstationary environments
      o How can we best utilize unlabeled data to learn from an unknown source?
      o What SSL theory can be applied to help us learn in nonstationary environments?

    ** Inspired by a recent plenary lecture by Dr. Gavin Brown
  48. Publications

    Publications in Submission
    1. G. Ditzler and R. Polikar, "Incremental Learning of Concept Drift from Streaming Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering.

    Publications in Press
    1. G. Ditzler and R. Polikar, "Semi-Supervised Learning in Nonstationary Environments," IEEE/INNS International Joint Conference on Neural Networks, to appear, 2011.
    2. G. Ditzler and R. Polikar, "Hellinger Distance Based Drift Detection Algorithm," in Proceedings of the IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments, pp. 41–48, 2011.
    3. G. Ditzler, J. Ethridge, R. Polikar, and R. Ramachandran, "Fusion Methods for Boosting Performance of Speaker Identification Systems," in Proceedings of the Asia Pacific Conference on Circuits and Systems, pp. 116–119, 2010.
    4. G. Ditzler, R. Polikar, and N. V. Chawla, "An Incremental Learning Algorithm for Non-stationary Environments and Imbalanced Data," in Proceedings of the International Conference on Pattern Recognition, pp. 2997–3000, 2010.
    5. J. Ethridge, G. Ditzler, and R. Polikar, "Optimal ν-SVM Parameter Estimation using Multi-Objective Evolutionary Algorithms," in Proceedings of the IEEE Congress on Evolutionary Computation, pp. 3570–3577, 2010.
    6. G. Ditzler and R. Polikar, "An Incremental Learning Framework for Concept Drift and Class Imbalance," in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, pp. 736–743, 2010.
    7. G. Ditzler, M. Muhlbaier, and R. Polikar, "Incremental Learning of New Classes in Unbalanced Data: Learn++.UDNC," in International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 5997, pp. 33–42, 2010.
  49. Acknowledgements

    Special thanks go out to Robi Polikar, Shreekanth Mandayam, Nancy Tinkham, Loretta Brewer, Ravi Ramachandran, Ryan Elwell, James Ethridge, Mike Russell, George Lecakes, Karl Dyer, Metin Ahiskali, Richard Calvert, my family, Rowan's ECE faculty, the NSF, the anonymous reviewers of my conference publications, and all the other people I forgot to mention
  50. References

    1. R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, Inc., 2001.
    2. S. Grossberg, "Nonlinear neural networks: Principles, mechanisms and architectures," Neural Networks, vol. 1, no. 1, pp. 17–61, 1988.
    3. M. Muhlbaier and R. Polikar, "Multiple classifiers based incremental learning algorithm for learning in nonstationary environments," in IEEE International Conference on Machine Learning and Cybernetics, 2007, pp. 3618–3623.
    4. R. Elwell, "An ensemble-based computational approach for incremental learning in non-stationary environments related to schema and scaffolding-based human learning," Master's thesis, Rowan University, 2010.
    5. J. Gao, W. Fan, J. Han, and P. S. Yu, "A general framework for mining concept-drifting data streams with skewed distributions," in SIAM International Conference on Data Mining, 2007, pp. 203–208.
    6. S. Chen and H. He, "SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining," in International Joint Conference on Neural Networks, 2009, pp. 552–529.
    7. S. Chen, H. He, K. Li, and S. Desai, "MuSERA: Multiple selectively recursive approach towards imbalanced stream data mining," in International Joint Conference on Neural Networks, 2010, pp. 2857–2864.
    8. S. Chen and H. He, "Towards incremental learning of nonstationary imbalanced data streams: a multiple selectively recursive approach," Evolving Systems, in press, 2011.
    9. R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man and Cybernetics, vol. 31, no. 4, pp. 497–508, 2001.
    10. G. Ditzler, N. V. Chawla, and R. Polikar, "An incremental learning algorithm for nonstationary environments and class imbalance," in International Conference on Pattern Recognition, 2010, pp. 2997–3000.
    11. G. Ditzler and R. Polikar, "An incremental learning framework for concept drift and class imbalance," in International Joint Conference on Neural Networks, 2010, pp. 736–743.
    12. C. Li, "Classifying imbalanced data using a bagging ensemble variation (BEV)," in ACMSE, 2007, pp. 203–208.
    13. N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2003, pp. 1–10.
    14. H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," SIGKDD Explorations, vol. 6, no. 1, pp. 30–39, 2004.
    15. S. Chen and H. He, "RAMOBoost: Ranked minority oversampling in boosting," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1624–1642, 2010.
    16. T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, pp. 861–874, 2006.
    17. W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large scale classification," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2001, pp. 377–382.
    18. J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
    19. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
  51. Questions