Gregory Ditzler
July 12, 2013

# Masters Thesis Defense Slides

I defended my MSc thesis on April 20, 2011.


## Transcript

1. Incremental Learning of Concept Drift from Imbalanced Data
Master’s Thesis Defense
Gregory Ditzler
Thesis Committee
Robi Polikar, Ph.D.
Shreekanth Mandayam, Ph.D.
Nancy Tinkham, Ph.D.
Dept. of Electrical & Computer Engineering

2. Contents
• Introduction
• Approach
• Experiments
• Conclusions
Introduction Approach Experiments Conclusions

3. Contents
• Introduction
• Approach
• Experiments
• Conclusions

4. Issues Addressed in this Work
• Incremental Learning: learn from new batches of data without access to previously seen datasets
o OCR classifier trained on the English language applied to learning different languages
• Identify characters not in the English language: ç, â, ê, î, ô, û, ë, ï, ü, ÿ, æ
• Concept Drift: the underlying data distribution changes with time
o Consumer ad relevance, spam detection, weather prediction
• Class Imbalance: one or more classes are under-represented in the
training data
o Credit card fraud detection, cancer detection, financial data
• Incremental Learning + Concept Drift + Class Imbalance
o Many concept drift scenarios contain class imbalance
• weather prediction, credit card fraud detection …

5. Definitions
Concept Drift: the joint probability distribution, $P(\mathbf{x}, \omega)$, changes over time, i.e.,
$P_t(\mathbf{x}, \omega) \neq P_{t+1}(\mathbf{x}, \omega)$
• Drift can be caused by changes in $P(\omega)$, $P(\mathbf{x} \mid \omega)$, or $P(\omega \mid \mathbf{x})$
• Real vs. virtual (perceived) drift
• Drift severity
o Slow, fast, abrupt, random, …
o We would like an algorithm robust to any change regardless of the severity
Bayes Theorem
$\underbrace{P(\omega \mid \mathbf{x})}_{\text{posterior}} = \underbrace{P(\mathbf{x} \mid \omega)}_{\text{likelihood}} \, \underbrace{P(\omega)}_{\text{prior}} \big/ \underbrace{P(\mathbf{x})}_{\text{evidence}}$

6. Definitions
• Types of concept drift ($S_1$ & $S_2$ are sources that generate data)
o Sudden Drift (Concept Change): occurs at a point in time when the source changes from $S_1$ to $S_2$.
o Gradual Drift: data are sampled from multiple sources within a single time stamp. Generally, as time passes the probability of sampling from $S_1$ decreases as the probability of sampling from $S_2$ increases.
o Incremental Drift: data are sampled from a single source at each time stamp, and the sources can be slightly different between multiple time stamps. Drift can be observed globally.
o Reoccurring Drift: reoccurring concepts appear when several different sources are used to generate data over time (similar to incremental and gradual drift).
• Concept drift is the combination of a few different research areas
[Figure: Venn diagram placing concept drift at the intersection of learning from time-series (time-dependent) data and knowledge transfer / transfer learning.]

7. Definitions
Class Imbalance: one (or more) classes are severely under-represented
in the training data
• Minority class is typically of more importance
Incremental Learning: learn new knowledge, preserve old knowledge
• Desired algorithm should find a balance between prior knowledge
(stability) and new knowledge (plasticity) [2]
• Ensembles have been shown to provide a good balance between
stability and plasticity
[Figure: scatter plot of a two-class dataset (benign vs. malignant) over features 1 and 2, illustrating class imbalance.]

8. Challenges in Machine Learning
• Traditional Machine Learning Algorithms
o Assume data are drawn from a fixed yet unknown distribution and a balanced dataset is available in its entirety
• Concept Drift
o Old knowledge can become irrelevant at a future point in time
o Learners must dynamically adapt to the environment to remain strong predictors on new and future environments
• Class Imbalance
o Learners tend to bias themselves towards the majority class
• Minority class is typically of great importance
o Many concept drift algorithms tend to use error, or a figure of merit derived from error, to weight classifiers, which can mask poor performance on a minority class
• Incremental Learning
o If old data become irrelevant, how will the ensemble adapt to new data (environments)?
o Existing approaches do not adhere to the incremental learning assumption
• Combined Problem
o Individual components have been addressed, but the combination of incremental learning, concept drift and class imbalance has been sparsely researched

9. Contents
• Introduction
• Approach
• Experiments
• Conclusions

10. Prior Work
• Learn++.NSE: incremental learning algorithm for concept drift [3,4]
o Generate a classifier with each new batch of data, compute a pseudo error for each
classifier, apply time-adjusted weighting mechanism, and call a weighted majority vote
for an ensemble decision
• Recent pseudo errors are weighted heavier than old errors
o Works very well on a broad range of concept drift problems
o Shortcomings of Learn++.NSE
• No mechanism to learn a minority class
• Uncorrelated Bagging (UCB): bagging inspired approach for
learning concept drift from unbalanced data [5]
o Accumulate old minority data and train classifiers using all the old minority data with a
subset of the newest majority class data. Call a majority vote for the ensemble decision.
o Shortcomings of UCB
• What happens when the accumulated minority data begin to become a “majority” class?
• Explicit assumption the minority class does not drift
• Violates the one-pass learning requirement of incremental learning [9]

11. Prior Work
• Selectively Recursive Approaches: select old minority data that are
most “similar” to population of minority data [6-8]
o Like UCB, selectively recursive approaches accumulate old minority class data.
Accumulated minority instances are placed into a training set by selecting the instances
that are the most similar to the newest minority data. Classifiers are trained and
combined using a combination rule set forth by the specific approach.
• The Mahalanobis distance is selected as the measure to quantify similarity
o Shortcomings of SERA
• What happens if the mean of the minority data does not change over time?
• Mahalanobis distance works well for a Gaussian distribution, but what about
non-Gaussian data?
• Violates the one-pass learning requirement of incremental learning [9]

12. Learn++ Solution
• Batch-based incremental learning approaches for learning new
knowledge and preserving old knowledge
o Retaining classifiers increases stability without requiring old data to be accumulated
• Learn++.CDS[10] {Concept Drift with SMOTE}
o Apply SMOTE to Learn++.NSE
• Learn++.NSE works well on problems involving concept drift
• SMOTE works well at increasing the recall of a minority class
• Learn++.NIE[11] {Nonstationary and Imbalanced Environments}
o Classifiers are replaced with sub-ensembles
• Sub-ensemble is applied to learn a minority class
o Voting weights are assigned based on figures of merit besides a class independent error
• All Learn++ based approaches use weighted majority voting

13. Learn++.CDS
[Figure: Learn++.CDS block diagram — evaluate $H^{(t-1)}$ on the new data $\mathcal{D}^{(t)}$ and form a penalty distribution over $\mathcal{D}^{(t)}$; call SMOTE; call BaseClassifier; compute a pseudo error $\varepsilon_k^{(t)}$ for $k = 1, 2, \ldots, t$; determine time-adjusted voting weights; combine with WMV.]

14. Learn++.CDS
• Evaluate the ensemble when new labeled data are presented
o Determine instances that have not been learned from past experience
o Maintain a penalty distribution over the new data
• Call SMOTE with the minority data in $\mathcal{D}^{(t)}$
o SMOTE percentage and number of nearest neighbors are free parameters
o SMOTE reduces imbalance and can provide more robust predictors on the minority class
• SMOTE can also increase other figures of merit like F-measure or AUC
• Train a new classifier using $\mathcal{D}^{(t)}$ and the synthetic data generated with SMOTE
Input: Training data $\mathcal{D}^{(t)} = \{x_i \in X;\ y_i \in \Omega\}$ where $i = 1, 2, \ldots, m^{(t)}$
Supervised learning algorithm BaseClassifier
Sigmoid parameters: $a$ & $b$
for $t = 1, 2, \ldots$ do
1. Compute the error of the existing ensemble
$E^{(t)} = \frac{1}{m^{(t)}} \sum_{i=1}^{m^{(t)}} \left[\!\left[ H^{(t-1)}(x_i) \neq y_i \right]\!\right]$ (1)
2. Update and normalize the instance weights
$w_i^{(t)} = \frac{1}{m^{(t)}} \cdot \begin{cases} E^{(t)}, & H^{(t-1)}(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$ (2)
$D_i^{(t)} = w_i^{(t)} \big/ \sum_{j=1}^{m^{(t)}} w_j^{(t)}$ (3)
3. Call SMOTE on the $\mathcal{D}^{(t)}$ minority instances to obtain synthetic data $\mathcal{S}^{(t)}$
4. Call BaseClassifier with $\mathcal{D}^{(t)}$ and $\mathcal{S}^{(t)}$ to obtain $h_t : X \to \Omega$
5. Evaluate the existing classifiers on $\mathcal{D}^{(t)}$ and obtain the pseudo error
$\varepsilon_k^{(t)} = \sum_{i=1}^{m^{(t)}} D_i^{(t)} \cdot \left[\!\left[ h_k(x_i) \neq y_i \right]\!\right]$ (4)
if $\varepsilon_{k=t}^{(t)} > \frac{1}{2}$, generate a new $h_t$; end if
if $\varepsilon_{k<t}^{(t)} > \frac{1}{2}$, set $\varepsilon_{k<t}^{(t)} = \frac{1}{2}$; end if
$\beta_k^{(t)} = \varepsilon_k^{(t)} \big/ \left(1 - \varepsilon_k^{(t)}\right)$ (5)
6. Compute the weighted sum of normalized errors, $k = 1, 2, \ldots, t$
$\sigma_k^{(t)} = 1 \big/ \left(1 + \exp(-a(t - k - b))\right)$ (6)
$\omega_k^{(t)} = \sigma_k^{(t)} \big/ \sum_{j=0}^{t-k} \sigma_k^{(t-j)}$ (7)
$\bar{\beta}_k^{(t)} = \sum_{j=0}^{t-k} \omega_k^{(t-j)} \beta_k^{(t-j)}$ (8)
7. Calculate the voting weight
$W_k^{(t)} = \log\left(1 \big/ \bar{\beta}_k^{(t)}\right)$ (9)
8. Compute the ensemble decision
$H^{(t)}(x) = \arg\max_{c \in \Omega} \sum_{k=1}^{t} W_k^{(t)} \left[\!\left[ h_k(x) = c \right]\!\right]$ (10)
end for
Output: Call WMV to compute $H^{(t)}(x)$
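As a rough illustration of the time-adjusted weighting in steps 6–7 (eqs. 6–9), the Python sketch below computes a single classifier's voting weight from its chronological history of normalized pseudo errors. `time_adjusted_weight` is a hypothetical helper name, and the default values of $a$ and $b$ are illustrative rather than the settings used in the thesis:

```python
import math

def time_adjusted_weight(betas, a=0.5, b=10):
    """Time-adjusted voting weight (eqs. 6-9) for one classifier.

    betas: chronological list of the classifier's normalized pseudo
    errors beta_k^{(k)}, ..., beta_k^{(t)}; the last entry is the newest.
    Returns W_k^{(t)} = log(1 / weighted-average beta).
    """
    m = len(betas) - 1  # classifier age, t - k
    # sigmoid weights; age j = 0 (newest error) receives the largest weight
    sig = [1.0 / (1.0 + math.exp(-a * (m - j - b))) for j in range(m + 1)]
    total = sum(sig)
    omega = [s / total for s in sig]  # eq. (7): normalize
    # eq. (8): weighted sum of errors; betas[-1] corresponds to age j = 0
    beta_bar = sum(w * bb for w, bb in zip(omega, reversed(betas)))
    beta_bar = min(max(beta_bar, 1e-10), 1 - 1e-10)  # numerical guard
    return math.log(1.0 / beta_bar)  # eq. (9)
```

With a single error the weight reduces to $\log(1/\beta)$, and a classifier whose recent errors are low outranks one whose recent errors are high, matching the intent of the sigmoid aging.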


16. SMOTE
• Synthetic Minority Over-sampling TEchnique [19]
o Generate “synthetic” instances on the line segment connecting two neighboring minority class instances
o Avoids issues commonly encountered with random under/over-sampling of the majority/minority class
• Select one of the $k$-nearest neighbors of a minority class instance
o Generate a synthetic instance given by $x_i + \delta(\hat{x} - x_i)$, where $\hat{x}$ is the nearest neighbor of $x_i$ and $\delta$ is the “gap” parameter
o The gap controls where the synthetic instance lies on the segment between the two nearest neighbors
• Synthetic samples lie within the convex hull of the original minority class sample
Input: Minority data $\mathcal{S} = \{x_i \in X_{\min}\}$ where $i = 1, 2, \ldots, S$
Number of minority instances ($S$), SMOTE percentage ($N$), number of nearest neighbors ($k$)
for $i = 1, 2, \ldots, S$ do
1. Find the $k$ nearest (minority class) neighbors of $x_i$
2. $\hat{N} = N / 100$
while $\hat{N} \neq 0$ do
1. Select one of the $k$ nearest neighbors, call this $\hat{x}$
2. Select a random number $\delta \in [0, 1]$
3. $x_s = x_i + \delta(\hat{x} - x_i)$
4. Append $x_s$ to $\mathcal{S}'$
5. $\hat{N} = \hat{N} - 1$
end while
end for
Output: Return synthetic data $\mathcal{S}'$
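The pseudocode above can be sketched in plain Python. `smote` below is a hypothetical, brute-force implementation (exact nearest-neighbor search with Euclidean distance), not the reference code:

```python
import random

def smote(minority, N=100, k=3, seed=0):
    """SMOTE sketch: for each minority instance, generate N/100 synthetic
    points along the segment to one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbors of x (squared Euclidean)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))
        neighbors = others[:k]
        for _ in range(N // 100):
            x_hat = rng.choice(neighbors)   # step 1: pick a neighbor
            delta = rng.random()            # step 2: "gap" in [0, 1]
            # step 3: synthetic point on the segment x -> x_hat
            synthetic.append(tuple(a + delta * (b - a)
                                   for a, b in zip(x, x_hat)))
    return synthetic
```

Because every synthetic point lies on a segment between two minority instances, the output stays inside the convex hull of the minority sample, as noted on the slide.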

17. Learn++.CDS
• Evaluate all classifiers on the new data and compute a pseudo error
o Apply the penalty distribution $D^{(t)}$ to compute the pseudo error
• Some instances incur more of a misclassification penalty than others
o If a new classifier’s error is greater than ½, generate a new classifier
o If an old classifier’s error is greater than ½, set it to ½
o Normalize the pseudo error
• Compute an age-adjusted weighted sum of a classifier’s errors
o Apply a normalized logistic sigmoid
o Recent weighted errors are weighted heavier
o Voting weight is proportional to the weighted sum
• Final hypothesis is made with WMV
5. Evaluate the existing classifiers on $\mathcal{D}^{(t)}$ and obtain the pseudo error
$\varepsilon_k^{(t)} = \sum_{i=1}^{m^{(t)}} D_i^{(t)} \cdot \left[\!\left[ h_k(x_i) \neq y_i \right]\!\right]$ (4)
if $\varepsilon_{k=t}^{(t)} > \frac{1}{2}$, generate a new $h_t$; end if
if $\varepsilon_{k<t}^{(t)} > \frac{1}{2}$, set $\varepsilon_{k<t}^{(t)} = \frac{1}{2}$; end if
$\beta_k^{(t)} = \varepsilon_k^{(t)} \big/ \left(1 - \varepsilon_k^{(t)}\right)$ (5)
6. Compute the weighted sum of normalized errors, $k = 1, 2, \ldots, t$
$\sigma_k^{(t)} = 1 \big/ \left(1 + \exp(-a(t - k - b))\right)$ (6)
$\omega_k^{(t)} = \sigma_k^{(t)} \big/ \sum_{j=0}^{t-k} \sigma_k^{(t-j)}$ (7)
$\bar{\beta}_k^{(t)} = \sum_{j=0}^{t-k} \omega_k^{(t-j)} \beta_k^{(t-j)}$ (8)
7. Calculate the voting weight
$W_k^{(t)} = \log\left(1 \big/ \bar{\beta}_k^{(t)}\right)$ (9)
8. Compute the ensemble decision
$H^{(t)}(x) = \arg\max_{c \in \Omega} \sum_{k=1}^{t} W_k^{(t)} \left[\!\left[ h_k(x) = c \right]\!\right]$ (10)
end for
Output: Call WMV to compute $H^{(t)}(x)$


19. Learn++.NIE
• Ensembles have been popular for learning unbalanced data
o Ensemble approaches can increase the recall and several other figures of merit when
facing an unbalanced data problem
o BEV[12], SMOTEBoost[13], DataBoost-IM[14], and RAMOBoost[15]
• Like Learn++.CDS, Learn++.NIE uses many of the fundamental
principles to learn in nonstationary environments
o Ensemble classifier approach
o Classifiers are combined with a weighted majority vote
• Unlike Learn++.CDS, Learn++.NIE uses several new components to
learn concept drift from unbalanced data
o Multiple classifiers are generated at each time stamp
o New figures of merit are applied to determine a sub-ensemble voting weight
• Strategy: track concept drift using figures of merit other than class
independent error to combine sub-ensembles using a time-adjusted
weighting scheme

20. Learn++.NIE
[Figure: Learn++.NIE block diagram — train a sub-ensemble of classifiers $h_{1,t}, h_{2,t}, \ldots$ on $\mathcal{D}^{(t)}$; combine each sub-ensemble with a simple majority vote (SMV); compute $\lambda_k^{(t)}$ for $k = 1, 2, \ldots, t$; determine time-adjusted voting weights; combine sub-ensembles with WMV.]

21. Learn++.NIE
• An ensemble of classifiers is created at each time step
o Train classifiers on all minority data + randomly sampled subsets of the newest majority data
o The sub-ensemble combination rule is a majority vote
• Compute $\lambda_k^{(t)}$ as a figure of merit for each sub-ensemble on $\mathcal{D}^{(t)}$
o Replacement of the pseudo error
o $\lambda_k^{(t)}$ should reflect the performance on all classes
• Learn++.NIE follows Learn++.CDS from this point
Input: Training data $\mathcal{D}^{(t)} = \{x_i \in X;\ y_i \in \Omega\}$ where $i = 1, 2, \ldots, m^{(t)}$
Supervised learning algorithm BaseClassifier
Sigmoid parameters: $a$ & $b$
Ensemble size: $K$
for $t = 1, 2, \ldots$ do
1. Train a new sub-ensemble $E_t$ of $K$ classifiers on $\mathcal{D}^{(t)}$
2. Evaluate all existing sub-ensembles on $\mathcal{D}^{(t)}$ to produce instance labels, and determine the weight measure $\lambda_k^{(t)}$, $k = 1, 2, \ldots, t$, using (17), (18), or (19).
if $\lambda_{k=t}^{(t)} > \frac{1}{2}$, generate a new sub-ensemble; end if
if $\lambda_{k<t}^{(t)} > \frac{1}{2}$, set $\lambda_{k<t}^{(t)} = \frac{1}{2}$; end if
$\beta_k^{(t)} = \lambda_k^{(t)} \big/ \left(1 - \lambda_k^{(t)}\right)$ (11)
3. Compute the weighted sum of normalized errors, $k = 1, 2, \ldots, t$
$\sigma_k^{(t)} = 1 \big/ \left(1 + \exp(-a(t - k - b))\right)$ (12)
$\omega_k^{(t)} = \sigma_k^{(t)} \big/ \sum_{j=0}^{t-k} \sigma_k^{(t-j)}$ (13)
$\bar{\beta}_k^{(t)} = \sum_{j=0}^{t-k} \omega_k^{(t-j)} \beta_k^{(t-j)}$ (14)
4. Calculate the voting weight
$W_k^{(t)} = \log\left(1 \big/ \bar{\beta}_k^{(t)}\right)$ (15)
5. Compute the ensemble decision
$H^{(t)}(x) = \arg\max_{c \in \Omega} \sum_{k=1}^{t} W_k^{(t)} \left[\!\left[ E_k(x) = c \right]\!\right]$ (16)
end for
Output: Call WMV to compute $H^{(t)}(x)$
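The sub-ensemble step (all minority data plus a random majority subset per classifier, combined by simple majority vote) might look like the sketch below. `train_subensemble`, `subensemble_predict`, and the `train` callable are illustrative assumptions, not code from the thesis:

```python
import random
from collections import Counter

def train_subensemble(minority, majority, n_classifiers, train, seed=0):
    """Sketch of the Learn++.NIE sub-ensemble step: every classifier sees
    all of the newest minority data plus a random subset of the newest
    majority data; `train` is any supplied fit function that accepts a
    list of (x, y) pairs and returns a predictor."""
    rng = random.Random(seed)
    sub = []
    for _ in range(n_classifiers):
        # random subset of the majority class, here sized like the minority
        subset = rng.sample(majority, min(len(minority), len(majority)))
        data = [(x, 1) for x in minority] + [(x, 0) for x in subset]
        sub.append(train(data))
    return sub

def subensemble_predict(sub, x):
    """Sub-ensemble combination rule: a simple majority vote."""
    return Counter(h(x) for h in sub).most_common(1)[0][0]
```

Sampling a fresh majority subset for each member keeps every classifier balanced without accumulating old data, which is the property Learn++.NIE relies on.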

22. Computing $\lambda_k^{(t)}$
• F-measure {Learn++.NIE (fm)}
o Combination of precision and recall
• Precision: fraction of retrieved documents relevant to the search
• Recall: fraction of relevant documents that were successfully retrieved
o The $F_1$-measure is implied with F-measure
$\lambda_{fm}^{(t)} = 1 - 2\,\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 1 - F_1$ (17)
• Weighted Recall Measure {Learn++.NIE (wavg)}
o Convex combination of the majority class error, $\varepsilon_{maj}^{(t)}$, and the minority class error, $\varepsilon_{min}^{(t)}$
$\lambda_{wavg}^{(t)} = \alpha\,\varepsilon_{maj}^{(t)} + (1 - \alpha)\,\varepsilon_{min}^{(t)}$ (18)
o $\alpha \in [0, 1]$ controls the weight given to the majority and minority class
• Geometric Mean {Learn++.NIE (gm)}
o Classifiers performing poorly on one or more classes will have a low G-mean to reflect this performance
$\lambda_{gm}^{(t)} = 1 - \left(\prod_{c=1}^{C} \left(1 - \varepsilon_c^{(t)}\right)\right)^{1/C}$ (19)
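The three weight measures follow directly from eqs. (17)–(19); the helper names below are hypothetical, written as a minimal sketch:

```python
def lambda_fm(precision, recall):
    """Eq. (17): one minus the F1-measure."""
    f1 = 2 * precision * recall / (precision + recall)
    return 1 - f1

def lambda_wavg(err_maj, err_min, alpha=0.5):
    """Eq. (18): convex combination of majority/minority class errors."""
    return alpha * err_maj + (1 - alpha) * err_min

def lambda_gm(class_errors):
    """Eq. (19): one minus the geometric mean of per-class accuracies."""
    prod = 1.0
    for e in class_errors:
        prod *= (1 - e)
    return 1 - prod ** (1.0 / len(class_errors))
```

All three return a value in $[0, 1]$ that plays the role of the pseudo error, so a sub-ensemble that neglects the minority class is penalized even when its overall error is small.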

23. Contents
• Introduction
• Approach
• Experiments
• Conclusions

24. Figures of Merit
• Raw Classification Accuracy
$\text{RCA} = \frac{1}{n}\sum_{i=1}^{n}\left[\!\left[ h(x_i) = y_i \right]\!\right] = \frac{TP + TN}{TP + TN + FP + FN}$
• Precision
$\text{precision} = \frac{TP}{TP + FP}$
• Recall
$\text{recall} = \frac{TP}{TP + FN}$
• Geometric Mean
$\text{G-mean} = \left(\prod_{c=1}^{C} \text{recall}_c\right)^{1/C}$
• F-measure
$F_1 = 2\,\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
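Assuming the minority class is taken as the positive class, these measures follow directly from binary confusion counts. `figures_of_merit` is an illustrative helper, with the two-class G-mean computed from minority recall and majority recall (specificity):

```python
def figures_of_merit(tp, tn, fp, fn):
    """Figures of merit from binary confusion counts, with the minority
    class treated as the positive class."""
    rca = (tp + tn) / (tp + tn + fp + fn)   # raw classification accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # minority-class recall
    specificity = tn / (tn + fp)            # majority-class recall
    gmean = (recall * specificity) ** 0.5   # two-class geometric mean
    f1 = 2 * precision * recall / (precision + recall)
    return {"RCA": rca, "precision": precision, "recall": recall,
            "G-mean": gmean, "F1": f1}
```

On an imbalanced stream a high RCA can coexist with a near-zero minority recall, which is exactly why the thesis tracks several measures at once.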

25. Figures of Merit
• Area Under the ROC Curve (AUC)
o ROC curves depict the tradeoff between false positives and true positives
o AUC is equivalent to the probability that a classifier will rank a randomly chosen
positive instance higher than a randomly chosen negative instance [16]
• AUC = 0.5 → labels are no better than random guessing
Fig.: A naïve Bayes classifier with a Gaussian kernel was
generated on 10,000 random instances drawn from a
standardized Gaussian distribution. The class labels are produced
by computing the sign(N(0, 1)). The AUC for w1 (left) is 0.50185
and w2 (right) is 0.50295.
Fig.: A naïve Bayes classifier with a Gaussian kernel was
generated on 10,000 randomly selected instances and tested on
6,000 randomly selected instances. The ROC curve was generated
using 200 thresholds. The AUC for w1 (right) is 0.7905 and w3
(left) is 0.9229.
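The probabilistic interpretation of AUC quoted above suggests a direct, if quadratic-time, sketch; `auc` is a hypothetical helper that scores every positive/negative pair:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive instance is
    scored higher than a randomly chosen negative one (the Mann-Whitney
    statistic); ties count as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

In practice the same value is obtained from a rank-sum in O(n log n), but the pairwise form makes the probability interpretation explicit.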

26. Figures of Merit
• Overall Performance Measure (OPM)
o OPM is a convex combination of RCA, $F_1$-measure, AUC and recall
o For the purpose of this study, $\gamma_1 = \gamma_2 = \gamma_3 = \gamma_4 = \frac{1}{4}$
$\text{OPM} = \gamma_1 \times \text{RCA} + \gamma_2 \times F_1 + \gamma_3 \times \text{AUC} + \gamma_4 \times \text{recall}$
• Ranking Algorithms
o The average of RCA, $F_1$-measure, AUC and recall is computed over the entire experiment
o Classifiers are ranked from (1) to (k), where k is the number of classifiers used in the comparison
• Fractional ranks are applied in the scenario of a tie
• (1) → best performing
• (k) → worst performing

| | Measure 1 | Measure 2 | … |
|---|---|---|---|
| Algorithm 1 | 90±1.2 (1) | 85±1.2 (1.5) | … |
| Algorithm 2 | 85±1.0 (k) | 85±1.0 (1.5) | … |
| ⋮ | ⋮ | ⋮ | … |
| Algorithm k | 89±1.5 (2) | 60±1.5 (k) | … |
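The fractional ranking used in these tables (tied values share the average of their positions) can be sketched as follows; `fractional_ranks` is a hypothetical helper name:

```python
def fractional_ranks(values, higher_is_better=True):
    """Rank scores from (1) best to (k) worst; tied values share the
    average (fractional) rank, as in the ranking tables."""
    order = sorted(values, reverse=higher_is_better)
    # average 1-based position of each distinct value
    rank_of = {}
    for v in set(values):
        positions = [i + 1 for i, o in enumerate(order) if o == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]
```

For example, the scores 90, 85, 85, 60 receive ranks 1, 2.5, 2.5, 4, matching the (1.5)-style ties shown in the table.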

27. Datasets Used in Experiments
Synthetic Datasets
• Rotating Spiral
• Rotating Checkerboard
• Drifting Gaussian Data
• Shifting Hyperplane
Real-World Datasets
• Australia Electricity Pricing
• NOAA Weather Data
[DA] http://hottavainen.deviantart.com/art/Rainy-day-gif-animation-182893258
[TMI] http://en.wikipedia.org/wiki/File:Three_Mile_Island_(color)-2.jpg

28. Synthetic Datasets
Rotating Spiral Dataset
• Generated with four spirals belonging to one of two classes
o Data are generated for 300 time stamps with a reoccurring environment beginning at
t=150
o Interesting properties: mean of the data are not changing, reoccurring environments
• Data are generated such that ≈5% class imbalance is present

29. Synthetic Datasets
Rotating Checkerboard Dataset
• Two class problem with a reoccurring environment and a constant
drift rate
o Experiment is carried out over 200 time stamps with the reoccurring environment
beginning at t=100
• Data are generated such that ≈5% class imbalance is present

30. Synthetic Datasets
Drifting Gaussian Dataset
• Linear combination of four Gaussian components
o 3 majority + 1 minority
o Drift is found in the mean and covariance throughout the duration of the experiment
• Data are generated such that ≈3% class imbalance is present

31. Synthetic Datasets
Shifting Hyperplane
• Hyperplane changes location at three points in time
o Three features, only two of which are relevant
o Class imbalance changes as the plane shifts; thus, $P(\omega \mid \mathbf{x})$ and $P(\omega)$ change
• Dual change
• Data are generated such that ≈7–25% class imbalance is present

32. Real-World Datasets
• NOAA Weather Dataset
o Predict whether it rained on any given day
o ≈50 years of daily recordings
o Features: minimum/average/maximum temperature, average/maximum wind speed, visibility, sea level pressure, and dew point
o Imbalance: ≈30% with a minimum of ≈10%
• Australia Electricity Pricing Dataset
o Predict whether the price of electricity went up or down
o Features: day, period, NSW demand, VIC demand and the scheduled transfer between the two states
o Imbalance: ≈5% (achieved through undersampling)

33. Algorithm Comparisons
• Proposed Approaches
o Learn++.NIE(fm), Learn++.NIE(gm), Learn++.NIE(wavg), and Learn++.CDS
• Streaming Ensemble Algorithm (SEA)[17]
• Learn++.NSE[3]
• Selectively Recursive Approach[6]
• Uncorrelated Bagging[5]
• Making the comparisons
o Base classifier is a CART decision tree algorithm for all algorithms
o All algorithm parameters are selected in the same manner and remain constant, unless the parameter must be adjusted for each dataset, e.g., SMOTE depends on the level of imbalance
o Specific algorithm parameters have been selected based on conclusions reached in the literature

34. Key Observations
1. Learn++.NIE (fm) and Learn++.CDS consistently provide ranks near
the top three for OPM on nearly all datasets tested.
a) Results are significant compared to Learn++.NSE, SERA, and SEA
2. Learn++.NIE (fm) and Learn++.CDS typically provide a significant
increase in recall, AUC, FM, and OPM compared to their
predecessors.
3. UCB’s increase in recall comes at the cost of RCA and FM.
4. Learn++.CDS improves the OPM rank over Learn++.NSE on every
dataset tested
5. Learn++.NIE (fm) typically provides better results than the (gm) or
(wavg).

35. Rotating Spiral Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 97.76±0.11 (1) | 86.13±0.76 (1) | 91.33±0.49 (6) | 76.96±1.17 (6) | 88.05±0.63 (6) | 4.0 |
| SEA | 96.65±0.12 (4) | 78.97±0.84 (7) | 88.91±0.50 (7) | 69.49±1.15 (7) | 83.51±0.65 (7) | 6.4 |
| Learn++.NIE(fm) | 97.30±0.13 (2) | 85.87±0.65 (2) | 97.34±0.26 (2) | 89.87±0.73 (3) | 92.60±0.44 (1) | 2.0 |
| Learn++.NIE(gm) | 96.11±0.16 (6) | 80.57±0.70 (5) | 93.11±0.38 (4) | 87.21±0.80 (4) | 89.25±0.51 (4) | 4.6 |
| Learn++.NIE(wavg) | 96.08±0.16 (7) | 80.46±0.70 (6) | 93.09±0.39 (5) | 87.20±0.80 (5) | 89.21±0.51 (5) | 5.6 |
| Learn++.CDS | 96.81±0.15 (3) | 84.15±0.65 (3) | 96.15±0.31 (3) | 91.77±0.71 (2) | 92.22±0.46 (3) | 2.8 |
| SERA | 92.73±0.32 (8) | 62.67±1.66 (8) | 80.96±1.10 (8) | 66.57±2.17 (8) | 75.73±1.45 (8) | 8.0 |
| UCB | 96.42±0.16 (5) | 82.57±0.69 (4) | 98.18±0.19 (1) | 92.74±0.65 (1) | 92.48±0.42 (2) | 2.6 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–300); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

36. Rotating Checkerboard Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 97.45±0.17 (1) | 68.25±2.14 (2) | 83.76±1.17 (4) | 56.55±2.48 (7) | 76.50±1.49 (3) | 3.4 |
| SEA | 87.41±0.63 (7) | 21.93±1.63 (8) | 65.75±1.29 (8) | 31.87±2.18 (8) | 51.74±1.43 (8) | 7.8 |
| Learn++.NIE(fm) | 95.06±0.47 (3) | 61.45±2.51 (3) | 92.62±0.85 (1) | 74.32±2.20 (3) | 80.86±1.51 (2) | 2.4 |
| Learn++.NIE(gm) | 90.02±0.51 (5) | 42.11±1.94 (5) | 83.37±1.13 (5) | 66.76±2.20 (5) | 70.57±1.45 (6) | 5.2 |
| Learn++.NIE(wavg) | 89.89±0.51 (6) | 41.15±1.86 (6) | 82.75±1.12 (6) | 65.91±2.16 (6) | 69.93±1.41 (7) | 6.2 |
| Learn++.CDS | 97.18±0.21 (2) | 72.93±1.82 (1) | 90.89±0.96 (3) | 74.50±2.19 (2) | 83.88±1.30 (1) | 1.8 |
| SERA | 92.89±0.43 (4) | 52.57±2.29 (4) | 80.80±1.29 (7) | 67.39±2.55 (4) | 73.41±1.64 (5) | 4.8 |
| UCB | 85.78±0.51 (8) | 38.26±1.44 (7) | 91.89±0.70 (2) | 82.33±1.75 (1) | 74.57±1.10 (4) | 4.4 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–200); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

37. Drifting Gaussian Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 97.63±0.18 (1) | 66.30±2.62 (4) | 83.65±1.43 (7) | 58.33±3.15 (7) | 76.48±1.85 (7) | 5.2 |
| SEA | 97.46±0.18 (3) | 64.39±2.44 (5) | 82.97±1.31 (8) | 56.40±2.84 (8) | 75.31±1.69 (8) | 6.4 |
| Learn++.NIE(fm) | 96.11±0.27 (5) | 67.30±1.95 (3) | 95.80±0.67 (2) | 86.74±2.01 (2) | 86.45±0.99 (2) | 2.8 |
| Learn++.NIE(gm) | 95.24±0.27 (6) | 63.37±1.86 (7) | 92.12±0.89 (4) | 86.51±1.90 (3) | 84.31±1.23 (4) | 4.8 |
| Learn++.NIE(wavg) | 95.20±0.28 (8) | 62.93±1.91 (8) | 91.60±0.94 (5) | 85.42±1.97 (4) | 83.79±1.28 (5) | 6.0 |
| Learn++.CDS | 97.50±0.20 (2) | 74.21±1.90 (1) | 92.19±1.07 (3) | 80.85±2.45 (5) | 86.19±1.41 (3) | 2.8 |
| SERA | 97.37±0.22 (4) | 70.76±2.28 (2) | 85.99±1.46 (6) | 73.52±2.96 (6) | 81.91±1.73 (6) | 4.8 |
| UCB | 95.22±0.30 (7) | 63.74±1.94 (6) | 96.84±0.54 (1) | 92.02±1.56 (1) | 86.96±1.09 (1) | 3.2 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–100); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

38. Shifting Hyperplane Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 94.98±0.26 (1) | 71.98±1.57 (2) | 83.30±0.90 (6) | 62.87±1.96 (7) | 78.28±1.17 (5) | 4.2 |
| SEA | 94.00±0.26 (3) | 68.13±1.48 (3) | 82.00±0.85 (7) | 60.28±1.77 (8) | 76.10±1.09 (7) | 5.6 |
| Learn++.NIE(fm) | 92.38±0.46 (7) | 67.27±1.62 (6) | 85.93±0.90 (1) | 74.83±1.60 (1) | 80.10±1.15 (2) | 3.4 |
| Learn++.NIE(gm) | 93.03±0.31 (5) | 67.90±1.36 (5) | 84.51±0.81 (4) | 72.17±1.61 (3) | 79.40±1.02 (3) | 4.0 |
| Learn++.NIE(wavg) | 93.25±0.30 (4) | 67.94±1.39 (4) | 84.08±0.83 (5) | 70.65±1.65 (4) | 78.98±1.07 (4) | 4.2 |
| Learn++.CDS | 94.75±0.28 (2) | 72.24±1.46 (1) | 85.16±0.84 (3) | 68.80±1.79 (5) | 80.24±1.09 (1) | 2.4 |
| SERA | 92.47±0.44 (6) | 63.01±1.84 (7) | 80.11±1.08 (8) | 64.68±2.17 (6) | 75.07±1.38 (8) | 7.0 |
| UCB | 90.77±0.45 (8) | 62.05±1.44 (8) | 85.84±0.95 (2) | 73.34±1.66 (2) | 78.00±1.13 (6) | 5.2 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–200); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

39. Electricity Pricing Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 90.75±0.86 (2) | 15.40±3.05 (7) | 59.66±2.04 (7) | 16.87±3.31 (7) | 45.67±2.32 (7) | 6.0 |
| SEA | 92.15±0.60 (1) | 9.37±2.15 (8) | 58.48±1.55 (8) | 10.53±2.19 (8) | 42.63±1.62 (8) | 6.6 |
| Learn++.NIE(fm) | 82.60±1.80 (6) | 20.79±2.55 (3) | 72.45±2.15 (1) | 38.72±4.93 (3) | 53.64±2.86 (3) | 3.2 |
| Learn++.NIE(gm) | 83.60±1.30 (5) | 22.29±2.64 (1) | 70.70±2.34 (2) | 38.37±4.68 (4) | 53.74±2.74 (2) | 2.8 |
| Learn++.NIE(wavg) | 84.70±1.15 (4) | 21.88±2.61 (2) | 69.54±2.23 (4) | 35.61±4.28 (5) | 52.93±2.57 (4) | 3.8 |
| Learn++.CDS | 88.48±1.12 (3) | 18.09±3.05 (6) | 60.58±2.27 (6) | 22.91±4.07 (6) | 47.52±2.63 (6) | 5.4 |
| SERA | 76.42±1.70 (7) | 19.91±2.06 (4) | 62.42±2.22 (2) | 46.46±4.70 (2) | 51.30±2.67 (5) | 4.6 |
| UCB | 68.23±1.72 (8) | 18.68±1.75 (5) | 69.74±2.34 (3) | 58.87±4.47 (1) | 53.88±2.57 (1) | 3.6 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–60); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

40. Weather Dataset
| | RCA | F-measure | AUC | Recall | OPM | Mean Rank |
|---|---|---|---|---|---|---|
| Learn++.NSE | 73.35±0.00 (4) | 51.27±0.00 (5) | 72.08±0.00 (6) | 49.38±0.00 (6) | 61.52±0.00 (5) | 5.2 |
| SEA | 75.81±0.00 (1) | 50.43±0.00 (6) | 73.37±0.00 (4) | 42.86±0.00 (8) | 60.62±0.00 (6) | 5.0 |
| Learn++.NIE(fm) | 70.54±1.08 (7) | 59.19±1.31 (3) | 77.84±0.79 (1) | 72.48±2.19 (1) | 70.01±1.34 (2) | 2.8 |
| Learn++.NIE(gm) | 73.53±0.80 (3) | 60.78±1.12 (2) | 76.83±0.69 (2) | 69.27±1.84 (2) | 70.10±1.11 (1) | 2.0 |
| Learn++.NIE(wavg) | 74.07±0.74 (2) | 60.94±1.04 (1) | 76.42±0.66 (3) | 68.04±1.71 (3) | 69.87±1.04 (3) | 2.4 |
| Learn++.CDS | 73.05±0.93 (5) | 52.89±1.74 (4) | 72.91±1.03 (5) | 53.75±2.69 (4) | 63.15±1.60 (4) | 4.6 |
| SERA | 65.17±1.83 (8) | 48.38±2.30 (7) | 63.54±1.48 (8) | 58.49±4.16 (7) | 58.90±2.44 (7) | 6.8 |
| UCB | 70.82±1.43 (6) | 46.40±3.18 (8) | 71.07±1.57 (7) | 45.54±4.77 (8) | 58.46±2.74 (8) | 7.2 |

[Figures: (a) RCA, (b) F-measure, (c) AUC and (d) recall vs. time step (0–150); one panel set compares Learn++.NIE (FM), (GM) and (WAVG), the other compares UCB, SERA, Learn++.CDS and Learn++.NIE (FM).]

41. Overall Results

Table (1): OPM ranks over all datasets

| | gauss | checker | spiral | hyper | elec | noaa | mean |
|---|---|---|---|---|---|---|---|
| Learn++.NSE | 7 | 3 | 5 | 6 | 7 | 5 | 5.50 |
| SEA | 8 | 8 | 7 | 7 | 8 | 6 | 7.33 |
| Learn++.NIE (fm) | 2 | 2 | 2 | 1 | 3 | 2 | 2.00 |
| Learn++.NIE (gm) | 4 | 6 | 3 | 4 | 2 | 1 | 3.33 |
| Learn++.NIE (wavg) | 5 | 7 | 4 | 5 | 4 | 3 | 4.67 |
| Learn++.CDS | 3 | 1 | 1 | 3 | 6 | 4 | 3.00 |
| SERA | 6 | 5 | 8 | 8 | 5 | 7 | 6.50 |
| UCB | 1 | 4 | 6 | 2 | 1 | 8 | 3.67 |

Table (2): AUC ranks over all datasets

| | gauss | checker | spiral | hyper | elec | noaa | mean |
|---|---|---|---|---|---|---|---|
| Learn++.NSE | 7 | 4 | 6 | 6 | 7 | 6 | 6.00 |
| SEA | 8 | 8 | 7 | 7 | 8 | 4 | 7.00 |
| Learn++.NIE (fm) | 2 | 1 | 2 | 1 | 1 | 1 | 1.33 |
| Learn++.NIE (gm) | 4 | 5 | 4 | 4 | 2 | 2 | 3.50 |
| Learn++.NIE (wavg) | 5 | 6 | 5 | 5 | 4 | 3 | 4.67 |
| Learn++.CDS | 3 | 3 | 3 | 3 | 6 | 5 | 3.83 |
| SERA | 6 | 7 | 8 | 8 | 5 | 8 | 7.00 |
| UCB | 1 | 2 | 1 | 2 | 3 | 7 | 2.67 |

Table (3): FM ranks over all datasets

| | gauss | checker | spiral | hyper | elec | noaa | mean |
|---|---|---|---|---|---|---|---|
| Learn++.NSE | 4 | 2 | 1 | 2 | 7 | 5 | 3.50 |
| SEA | 5 | 8 | 7 | 3 | 8 | 6 | 6.17 |
| Learn++.NIE (fm) | 3 | 3 | 2 | 6 | 3 | 3 | 3.33 |
| Learn++.NIE (gm) | 7 | 5 | 5 | 5 | 1 | 2 | 4.17 |
| Learn++.NIE (wavg) | 8 | 6 | 6 | 4 | 2 | 1 | 4.50 |
| Learn++.CDS | 1 | 1 | 3 | 1 | 6 | 4 | 2.67 |
| SERA | 2 | 4 | 8 | 7 | 4 | 7 | 5.33 |
| UCB | 6 | 7 | 4 | 8 | 5 | 8 | 6.33 |

42. Comparing Multiple Classifiers
• Comparing multiple classifiers on multiple datasets is not a trivial
problem
o Confidence intervals will only allow for the comparison of multiple classifiers on a single
dataset
• The rank based Friedman test can determine if classifiers are
performing equally across multiple dataset [18]
o Apply ranks to the average of each measure on a dataset
o Standard deviation of the measure is not used in the Friedman test

$$
\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],
\qquad
F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}
$$

where $N$ is the number of datasets, $k$ is the number of classifiers, $R_j$ is the average rank of the $j$th classifier, and $F_F$ is the Iman-Davenport correction of $\chi_F^2$ [18]
• z-scores can be computed from the average ranks in the Friedman test:

$$
z_{i,j} = \frac{R_i - R_j}{\sqrt{k(k+1)/(6N)}}
$$

o The α-level (critical value) must be adjusted for the multiple comparisons being made
o The Bonferroni-Dunn procedure adjusts α to α/(k − 1) [18]
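The test statistics above can be sketched directly from a table of average performances. This is a minimal illustration with a made-up performance matrix (the values are placeholders, not the thesis results, and ties are assumed away):

```python
import numpy as np

# Illustrative average-performance matrix: rows = datasets,
# columns = algorithms. Placeholder values, not the thesis results.
perf = np.array([
    [0.81, 0.74, 0.88, 0.85],
    [0.79, 0.70, 0.86, 0.83],
    [0.66, 0.61, 0.72, 0.69],
    [0.90, 0.84, 0.93, 0.91],
    [0.58, 0.64, 0.55, 0.60],
    [0.77, 0.69, 0.71, 0.74],
])
N, k = perf.shape

# Rank algorithms within each dataset (rank 1 = best; assumes no ties).
ranks = np.argsort(np.argsort(-perf, axis=1), axis=1) + 1
R = ranks.mean(axis=0)                 # average rank per algorithm

# Friedman statistic and its Iman-Davenport F correction.
chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

# Pairwise z-score between two algorithms; with Bonferroni-Dunn against
# a control, the resulting p-value is compared to alpha / (k - 1).
se = np.sqrt(k * (k + 1) / (6 * N))
z = (R.max() - R.min()) / se           # worst vs. best average rank
```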

43. Friedman Test Results
Hypothesis test comparing NIE(fm) [◊] and CDS [●] to other algorithms (only significant improvement is marked)
• Friedman test rejects the null hypothesis on all figures of merit
o Good! But which algorithm(s) are performing better/worse than others?
• Learn++.CDS and Learn++.NIE(fm) provide a significant improvement
over SERA and UCB
o The null hypothesis is not rejected against UCB in several measures; however, UCB does not offer significant
improvement over Learn++.CDS or Learn++.NIE
• Learn++.CDS and Learn++.NIE(fm) offer improvement on several
measures compared to concept drift algorithms
| Measure | L++.NSE | SEA | SERA | UCB |
|---|---|---|---|---|
| RCA | | | ● | ● |
| FM | | | ◊● | ◊● |
| AUC | | ◊● | ◊● | ◊● |
| Recall | | ◊● | ◊● | ◊ |
| OPM | | ◊● | ◊● | ◊● |

44. Contents
• Introduction
• Approach
• Experiments
• Conclusions

45. Conclusions
• Learn++.NIE(fm) and Learn++.CDS provide significant improvement
in several figures of merit over concept drift algorithms
o Boost in recall, AUC and OPM
o It is no surprise that Learn++.NSE and SEA achieve strong raw classification accuracy
• Learn++.NIE framework improves a few figures of merit compared
to SERA and UCB
o Learn++.NIE improves the F-measure and Learn++.CDS improves F-measure and RCA
over UCB
• Existing literature requires access to old data in order to learn
concept drift from imbalanced data
o Using old data for training can be detrimental to certain performance metrics
• UCB: train on all accumulated minority class data
• SERA: train on a selected subset of accumulated minority class data, which
is the most similar to the newest minority class distribution
• Proposed approaches consistently perform well as demonstrated on
a variety of problems

46. Future Work
• Data Stream Mining
o Learning massive data streams with imbalanced classes
• The theory of learning harsh environments
o Fewer heuristics, more statistics**
• Semi-supervised learning in nonstationary environments
o How can we best utilize unlabeled data to learn from an unknown source?
o What SSL theory can be applied to help us learn in nonstationary environments?
** Inspired by a recent plenary lecture by Dr. Gavin Brown

47. Publications
Publications in Submission
1. G. Ditzler and R. Polikar. “Incremental Learning of Concept Drift from Streaming Imbalanced Data." IEEE
Transactions on Knowledge and Data Engineering
Publications in Press
1. G. Ditzler and R. Polikar. “Semi-Supervised Learning in Nonstationary Environments." IEEE/INNS
International Joint Conference on Neural Networks. to appear. 2011.
2. G. Ditzler and R. Polikar. "Hellinger Distance Based Drift Detection Algorithm." in Proceedings of IEEE
Symposium on Computational Intelligence in Dynamic and Uncertain Environments. pp. 41-48. 2011.
3. G. Ditzler, J. Ethridge, R. Polikar, and R. Ramachandran. "Fusion Methods for Boosting Performance of Speaker
Identification Systems." in Proceedings of the Asia Pacific Conference on Circuits and Systems. pp. 116-119. 2010.
4. G. Ditzler, R. Polikar, and N. V. Chawla. "An Incremental Learning Algorithm for Non-stationary Environments
and Imbalanced Data." in Proceedings of the International Conference on Pattern Recognition. pp. 2997-3000. 2010.
5. J. Ethridge, G. Ditzler, and R. Polikar. "Optimal ν-SVM Parameter Estimation using Multi-Objective
Evolutionary Algorithms." in Proceedings of the IEEE Congress on Evolutionary Computing. pp. 3570-3577. 2010.
6. G. Ditzler, and R. Polikar. "An Incremental Learning Framework for Concept Drift and Class Imbalance."
in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks. pp. 736-743. 2010.
7. G. Ditzler, M. Muhlbaier and R. Polikar. "Incremental Learning of New Classes in Unbalanced Data:
Learn++.UDNC." in International Workshop on Multiple Classifier Systems. Lecture Notes in Computer Science. vol
5997. pp. 33-42. 2010.

48. Acknowledgements
Special thanks go out to Robi Polikar, Shreekanth Mandayam,
Nancy Tinkham, Loretta Brewer, Ravi Ramachandran, Ryan Elwell,
James Ethridge, Mike Russell, George Lecakes, Karl Dyer, Metin
Ahiskali, Richard Calvert, my family, Rowan’s ECE faculty, the
NSF, the anonymous reviewers of my conference publications, and
all the other people I forgot to mention

49. References
1. R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, Inc., 2001
2. S. Grossberg, “Nonlinear neural networks: Principles, mechanisms and architectures,” Neural Networks, vol. 1, no. 1, pp.
17-61, 1988.
3. M. Muhlbaier and R. Polikar, “Multiple classifiers based incremental learning algorithm for learning nonstationary
environments,” in IEEE International Conference on Machine Learning and Cybernetics, 2007, pp. 3618–3623.
4. R. Elwell, “An ensemble-based computational approach for incremental learning in non-stationary environments related
to schema and scaffolding-based human learning,” Master’s thesis, Rowan University, 2010.
5. J. Gao, W. Fan, J. Han, and P. S. Yu, “A general framework for mining concept-drifting data streams with skewed
distributions,” in SIAM International Conference on Data Mining, 2007, pp. 203–208.
6. S. Chen and H. He, “SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining,” in
International Joint Conference on Neural Networks, 2009, pp. 522–529.
7. S. Chen, H. He, K. Li, and S. Desai, “MuSERA: Multiple selectively recursive approach towards imbalanced stream data
mining,” in International Joint Conference on Neural Networks, 2010, pp. 2857–2864.
8. S. Chen and H. He, “Towards incremental learning of nonstationary imbalanced data streams: a multiple selectively
recursive approach,” Evolving Systems, in press, 2011.
9. R. Polikar, L. Udpa, S.S. Udpa and V. Honavar, “Learn++: an incremental learning algorithm for supervised neural
networks,” IEEE Transactions on Systems, Man and Cybernetics, vol. 31, no. 4, pp. 497–508, 2001.
10. G. Ditzler, N. V. Chawla, and R. Polikar, “An incremental learning algorithm for nonstationary environments and class
imbalance,” in International Conference on Pattern Recognition, 2010, pp. 2997–3000.
11. G. Ditzler and R. Polikar, “An incremental learning framework for concept drift and class imbalance,” in International Joint
Conference on Neural Networks, 2010, pp. 736–743.
12. C. Li, “Classifying imbalanced data using a bagging ensemble variation (BEV),” in ACMSE, 2007, pp. 203–208.
13. N. V. Chawla, A. Lazarevic, L. O. Hall and K. W. Bowyer, “SMOTEBoost: Improving prediction of the minority class in
boosting,” in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2003, pp. 1–10.
14. H. Guo and H. L. Viktor, “Learning from imbalanced data sets with boosting and data generation: The Databoost-IM
approach,” SIGKDD Explorations, vol. 6, no. 1, pp. 30–39, 2004.
15. S. Chen and H. He, “RAMOBoost: Ranked Minority Oversampling in Boosting,” IEEE Transactions on Neural Networks,
vol. 21, no. 10, pp. 1624-1642, 2010.
16. T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, pp. 861–874, 2006.
17. W. N. Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large scale classification,” in Proceedings to the 7th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2001, pp. 377–382.
18. J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–
30, 2006.
19. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,”
Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

50. Questions