Hellinger Distance Based Drift Detection for Nonstationary Environments

Gregory Ditzler and Robi Polikar
Signal Processing & Pattern Recognition Laboratory
Dept. of Electrical & Computer Engineering, Rowan University
gregory.ditzler@gmail.com, polikar@rowan.edu

This material is based upon work supported by the National Science Foundation under Grant No. 0926159. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
1. Excellence in Graduate Research, Rowan University
2. 2011 Graduate Research/Teaching Fellowship, Drexel University
3. 2009 Graduate Research Assistantship, Rowan University
4. 2008 Award for Service to the IEEE Student Branch, Pennsylvania College of Technology
5. 2008 Award for Service to the College and Community, Pennsylvania College of Technology
• Consider a system that must recommend items that are relevant to a user's interests
o User interests are known to evolve – or drift – with time
o Algorithms must identify the change in a user's interests to be considered useful
o OK, that is a simple example, but where can we find this in practice?
• Yandex.Direct selects ads that reflect the user's current interest
• Google AdSense displays ads related to the content of your webpage
• The above applications are related through contextual advertising
• General examples: electricity demand, financial, climate, epidemiological, and spam data (to name a few)
• Shewhart and CUSUM control charts are commonly used in drift detection (derived from process monitoring) [2-4]
o CI-CUSUM: a pdf-free extension of the CUSUM [5;6]
• ICI-rule based drift detection [7]
o Uses the intersection of confidence intervals (ICI) rule to detect a nonstationary change when the distribution is unknown
• Methods like control charts and the ICI rule may monitor features, but classifier error is another possibility
o EDDM: uses classifier error and the distances between classifier errors to determine when sufficient evidence has accumulated to declare change [8]
• Extension of DDM
• Cieslak & Chawla [12] suggest using the Hellinger distance to measure bias between a training and a testing source
o A framework was developed for monitoring a classifier's failure on a testing dataset
o We use this as a stepping stone to tackle drift detection in an incremental learning scenario
• Characterizing our approach within the taxonomy of [4]:
o Data chunks: batch-based
• as opposed to processing one instance at a time
o Information used: raw features
• as opposed to error rates or features that summarize the data (mean, median, variance, kurtosis, ...)
o Change detection mode: explicit
• as opposed to implicit detection, which adapts without explicitly flagging that drift has been detected
o Classifier-specific vs. classifier-free: classifier-free
• Strategy: use recent raw features from sequential batch data to determine whether there is enough evidence to suggest that the data instances are sampled from different distributions
• Hellinger distance between two batches of data
o $P$ and $Q$ are relative frequency charts with $b$ bins, i.e., histograms of the features
• Rule of thumb: $b = \lfloor\sqrt{N}\rfloor$, where $N$ is the number of instances in the batch
o Computed as the average of the Hellinger distances of the individual features [12], where $d$ is the number of features:

$$\delta_H(P,Q) = \frac{1}{d}\sum_{k=1}^{d}\sqrt{\sum_{i=1}^{b}\left(\sqrt{\frac{P_{i,k}}{\sum_{j=1}^{b}P_{j,k}}}-\sqrt{\frac{Q_{i,k}}{\sum_{j=1}^{b}Q_{j,k}}}\right)^{2}} \qquad (1)$$

• Properties
o $0 \le \delta_H(P,Q) \le \sqrt{2}$
o $\delta_H(P,Q) = \sqrt{2}$ if and only if $P$ and $Q$ are completely divergent, i.e., no overlap
o $\delta_H(P,Q) = 0$ if $P$ and $Q$ are equal
• Why the Hellinger distance?
o Straightforward interpretation of the divergence between two distributions
o No assumption about the distribution of the data
o We can avoid holding on to old data by keeping a histogram of the data over time (see Eq. (1))
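To make Eq. (1) concrete, here is a minimal Python sketch of the feature-averaged Hellinger distance between two batches. The bin count $b = \lfloor\sqrt{N}\rfloor$ follows the rule of thumb above; the shared per-feature bin edges and the function name are illustrative choices, not details given on the slides.

```python
import numpy as np

def hellinger_distance(P, Q):
    """Average Hellinger distance (Eq. 1) between batches P, Q of shape (N, d)."""
    N, d = P.shape
    b = int(np.floor(np.sqrt(N)))           # rule of thumb: b = floor(sqrt(N))
    total = 0.0
    for k in range(d):
        lo = min(P[:, k].min(), Q[:, k].min())
        hi = max(P[:, k].max(), Q[:, k].max())
        edges = np.linspace(lo, hi, b + 1)  # shared edges so bins line up
        p, _ = np.histogram(P[:, k], bins=edges)
        q, _ = np.histogram(Q[:, k], bins=edges)
        p = p / p.sum()                     # relative frequencies per feature
        q = q / q.sum()
        total += np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return total / d                        # value lies in [0, sqrt(2)]
```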
• Preliminary experiment: generate data from two Gaussian components, each belonging to a different class, where
o Centers: $c_1 = [\cos\theta, \sin\theta]$ and $c_2 = -c_1$
o Covariance: $\Sigma_1 = \Sigma_2 = 0.5\,I$
o $\theta = 2\pi\kappa t/N$, where $\kappa$ is the number of cycles, $t$ is the (integer-valued) time stamp that iterates from zero to $N-1$, and $I$ is a 2×2 identity matrix
• Compute the Hellinger distance between the 1st batch and each subsequent batch of data
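A sketch of this rotating-Gaussian generator, reusing `hellinger_distance` from the earlier snippet. The batch size, cycle count $\kappa = 3$, and seed handling are illustrative choices, not values taken from the slides.

```python
import numpy as np

def rotating_gaussian_batch(t, N=600, cycles=3, n_per_class=100, seed=None):
    """Draw one two-class batch at integer time stamp t in {0, ..., N-1}."""
    rng = np.random.default_rng(seed)
    theta = 2.0 * np.pi * cycles * t / N       # theta = 2*pi*kappa*t/N
    c1 = np.array([np.cos(theta), np.sin(theta)])
    cov = 0.5 * np.eye(2)                      # Sigma1 = Sigma2 = 0.5 * I
    X1 = rng.multivariate_normal(c1, cov, n_per_class)    # class 1
    X2 = rng.multivariate_normal(-c1, cov, n_per_class)   # class 2: c2 = -c1
    X = np.vstack([X1, X2])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

# Hellinger distance between the first batch and each subsequent batch
X1, _ = rotating_gaussian_batch(0, seed=0)
distances = [hellinger_distance(X1, rotating_gaussian_batch(t, seed=t)[0])
             for t in range(1, 600)]
```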
[Figure: Hellinger distance computed on a rotating Gaussian centers problem. The Hellinger distance is computed between dataset $D_1$ and $D_t$, where $t = 2, 3, \ldots, 600$, for class $\omega_1$, class $\omega_2$, and all classes. Recurring environments occur when the Hellinger distance is at a minimum.]

[Figure: Hellinger distance computed on a static Gaussian problem. The Hellinger distance is computed between $D_1$ and $D_t$, where $t = 2, 3, \ldots, 600$, for class $\omega_1$, class $\omega_2$, and all classes.]

• The Hellinger distance changes as the Gaussian components drift away from and towards the original component location
• No drift implies no variation in the Hellinger distance
• Natural bias in the sample is responsible for a non-zero Hellinger distance
• Hellinger Distance Drift Detection Method (HDDDM)
o Relies only on the raw data features to estimate whether drift is present in a supervised incremental learning scenario
o The Hellinger distance is used to infer whether drift is present between two batches of data, using an adaptive threshold
o Inspired by Cieslak & Chawla's study of bias and classifier failure [12]
• Assumption
o Data distributions have finite support: $P(X \le x) = 0$ for $x \le T_1$ and $P(X \ge x) = 0$ for $x \ge T_2$, where $T_1 < T_2$ are finite real numbers
• This assumption is primarily required because of our histogram implementation
• HDDDM is designed for truly incremental learning scenarios and can be applied to supervised, i.e., change in $P(\omega|\mathbf{x})$, as well as unsupervised learning scenarios, i.e., change in $P(\mathbf{x})$
• Initialize $D'_\lambda = D_1$ and $\lambda = 1$
o where $\lambda$ indicates the last time change was detected
o $D_1$ is established as the first baseline reference
• Construct histograms $P$ from $D'_\lambda$ and $Q$ from $D_t$
• The Hellinger distance $\delta_H(t)$ between the two histograms $P$ and $Q$ is then computed using Eq. (1)
• Compute $\epsilon(t)$, the difference in divergence between $\delta_H(t)$ and $\delta_H(t-1)$
• Compute the mean, $\hat{\epsilon}(t)$, and standard deviation, $\hat{\sigma}(t)$, of $|\epsilon(i)|$
o where $i = \lambda+1, \ldots, t-1$
o the current time stamp's difference is not included in the mean difference calculation

Algorithm: HDDDM
Input: training data $D_t = \{\mathbf{x}_i(t) \in X;\ \omega_i(t) \in \Omega\}$, where $t = 1, 2, \ldots$
Initialize: $\lambda = 1$; $D'_\lambda = D_1$
for $t = 2, 3, \ldots$ do
• Generate a histogram, $P$, from $D'_\lambda$ and a histogram, $Q$, from $D_t$. Each histogram has $b$ bins.
• Calculate the Hellinger distance between $P$ and $Q$ using Eq. (1). Call this $\delta_H(t)$.
• Compute the difference in Hellinger distance: $\epsilon(t) = \delta_H(t) - \delta_H(t-1)$
• Update the adaptive threshold:
  $\hat{\epsilon}(t) = \frac{1}{t-\lambda-1}\sum_{i=\lambda+1}^{t-1}|\epsilon(i)|$
  $\hat{\sigma}(t) = \sqrt{\frac{\sum_{i=\lambda+1}^{t-1}\left(|\epsilon(i)|-\hat{\epsilon}(t)\right)^2}{t-\lambda-1}}$
• Compute $\beta(t)$ using the standard deviation or the confidence interval method.
• Determine if drift is present:
  if $|\epsilon(t)| > \beta(t)$ then
    $\lambda = t$; reset by setting $D'_\lambda = D_t$; indicate change was detected
  else
    update $D'_\lambda$ with $D_t$: $D'_\lambda = D'_\lambda \cup D_t$
  end if
end for
• Computing the adaptive threshold, $\beta(t)$, at time step $t$
• Deviation test:
$$\beta(t) = \hat{\epsilon}(t) + \gamma\,\hat{\sigma}(t)$$
o where $\gamma$ is a positive real constant indicating how many standard deviations of change around the mean we accept as "different enough"
• Confidence test:
$$\beta(t) = \hat{\epsilon}(t) + t_{\alpha/2,\,t-\lambda-1}\,\frac{\hat{\sigma}(t)}{\sqrt{t-\lambda-1}}$$
o where $t_{\alpha/2,\,t-\lambda-1}$ is the two-tailed $t$-statistic with $t-\lambda-1$ degrees of freedom
• We use $+$ and not $\pm$
o flag drift only when the magnitude of the change is significantly greater than the average change in recent time
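Both threshold rules take only a few lines; the sketch below assumes `eps_history` holds the past magnitudes $|\epsilon(i)|$ since the last detection, and the $\hat{\sigma}/\sqrt{n}$ factor in the confidence test reflects our reading of the slide (the standard form of a confidence bound on a mean).

```python
import numpy as np
from scipy import stats

def beta_deviation(eps_history, gamma=1.0):
    """Deviation test: beta(t) = eps_hat(t) + gamma * sigma_hat(t)."""
    eps_hat = np.mean(eps_history)
    sigma_hat = np.std(eps_history)       # std over the window since lambda
    return eps_hat + gamma * sigma_hat

def beta_confidence(eps_history, alpha=0.1):
    """Confidence test: beta(t) = eps_hat + t_{alpha/2,n} * sigma_hat / sqrt(n)."""
    n = len(eps_history)                  # degrees of freedom: t - lambda - 1
    eps_hat = np.mean(eps_history)
    sigma_hat = np.std(eps_history)
    return eps_hat + stats.t.ppf(1 - alpha / 2, n) * sigma_hat / np.sqrt(n)
```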
• If $|\epsilon(t)| > \beta(t)$, then we signal that change has been detected
o reset $\lambda = t$ and $D'_\lambda = D_t$
• If drift is not detected, then we update the distribution $D'_\lambda$ with the new data
• The histogram can be updated or reset using the following rule:
$$P_{i,k} \leftarrow P_{i,k} + Q_{i,k} \quad \text{if drift is not detected}$$
$$P_{i,k} \leftarrow Q_{i,k} \quad \text{if drift is detected}$$
• We need only retain the histogram of $D'_\lambda$ rather than the data
o HDDDM fulfills the incremental learning requirement
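Putting the pieces together, here is a compact sketch of the full HDDDM loop: a baseline histogram $P$, a per-batch histogram $Q$, a thresholded change in Hellinger distance, and the merge-or-reset rule above. Freezing the bin edges from the first baseline is our simplification (loosely justified by the finite-support assumption); the slides do not specify how edges evolve.

```python
import numpy as np

class HDDDM:
    def __init__(self, baseline, gamma=1.0):
        self.gamma = gamma
        N, self.d = baseline.shape
        self.b = int(np.floor(np.sqrt(N)))    # b = floor(sqrt(N)) bins
        # per-feature bin edges fixed from the baseline batch (assumption)
        self.edges = [np.linspace(baseline[:, k].min(), baseline[:, k].max(),
                                  self.b + 1) for k in range(self.d)]
        self.P = self._hist(baseline)         # baseline histogram of D'_lambda
        self.prev_delta = None                # delta_H(t-1)
        self.eps_history = []                 # |eps(i)| since last detection

    def _hist(self, X):
        return np.array([np.histogram(X[:, k], bins=self.edges[k])[0]
                         for k in range(self.d)], dtype=float)

    def _hellinger(self, P, Q):
        p = P / (P.sum(axis=1, keepdims=True) + 1e-12)  # relative frequencies
        q = Q / (Q.sum(axis=1, keepdims=True) + 1e-12)
        return np.mean(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=1)))

    def add_batch(self, X):
        """Process batch D_t; return True if drift is detected."""
        Q = self._hist(X)
        delta = self._hellinger(self.P, Q)    # delta_H(t) via Eq. (1)
        drift = False
        if self.prev_delta is not None:
            eps = delta - self.prev_delta     # eps(t) = delta_H(t) - delta_H(t-1)
            if self.eps_history:              # need past |eps(i)| for the threshold
                eps_hat = np.mean(self.eps_history)
                sigma_hat = np.std(self.eps_history)
                drift = abs(eps) > eps_hat + self.gamma * sigma_hat
            self.eps_history.append(abs(eps)) # current eps excluded from its own test
        if drift:                             # reset: P <- Q, restart statistics
            self.P, self.prev_delta, self.eps_history = Q, None, []
        else:                                 # merge: P <- P + Q
            self.P += Q
            self.prev_delta = delta
        return drift
```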
• Seven datasets were used to evaluate the HDDDM algorithm
o Four synthetic and three real-world
o Drift is synthetically injected into the MAGIC database from UCI [18]
• Sort by a feature in ascending order and divide the entire dataset into bins
o Train on the current bin; test on the next (future) bin
• Missing at Random (MAR) bias
• Experimental setup
o Use histograms in HDDDM to model $P(\mathbf{x}|\omega)$ and $P(\mathbf{x})$

Dataset             Instances   Source                     Drift type
Checkerboard [15]   800,000     Synthetically generated    Synthetic
Elec                27,549^4    Web source^1               Natural
SEA [5%]            400,000     Synthetically generated    Synthetic
RandGauss           812,500     Synthetically generated    Synthetic
Magic               13,065      UCI^2                      Synthetic
NOAA                18,159      NOAA^3                     Natural
GaussCir            950,000     Synthetically generated    Synthetic

1. http://www.liaad.up.pt/~jgama/ales/ales_5.html
2. UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/)
3. National Oceanic and Atmospheric Administration (www.noaa.gov)
4. All instances with missing features have been removed
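The drift-injection protocol for a static dataset such as MAGIC can be sketched as follows; the choice of feature index and number of bins is illustrative.

```python
import numpy as np

def inject_drift(X, y, feature=0, n_bins=50):
    """Yield (train, test) batch pairs after sorting on one feature,
    so consecutive bins come from gradually shifting regions of the space."""
    order = np.argsort(X[:, feature])          # ascending sort on the feature
    Xs, ys = X[order], y[order]
    batches = np.array_split(np.arange(len(ys)), n_bins)
    for i in range(n_bins - 1):
        tr, te = batches[i], batches[i + 1]    # train on bin i, test on bin i+1
        yield (Xs[tr], ys[tr]), (Xs[te], ys[te])
```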
• Train two classifiers on the first database; call them $C_1$ and $C_2$
for $t = 2, 3, \ldots$ do
• Begin processing new databases for the presence of concept drift using HDDDM
o $C_1$ is our target classifier, which is updated or reset based on the HDDDM decision. If drift is detected using HDDDM, reset $C_1$ and train $C_1$ only on the new data; otherwise, incrementally update $C_1$ with the new data.
o Regardless of whether or not drift is detected, $C_2$ is incrementally trained with the new data. $C_2$ is therefore the control classifier that is not subject to HDDDM intervention.
o Compute the error of $C_1$ and $C_2$ on new test data
end for
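A sketch of this evaluation loop, with scikit-learn's `GaussianNB` standing in for the base classifier (the slides do not name one here) and the `HDDDM` class from the earlier sketch serving as the detector.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def run_protocol(batches, classes, gamma=1.0):
    """batches: list of (X, y) pairs; returns per-batch errors of C1 and C2."""
    X0, y0 = batches[0]
    C1 = GaussianNB().partial_fit(X0, y0, classes=classes)  # target classifier
    C2 = GaussianNB().partial_fit(X0, y0, classes=classes)  # control classifier
    detector = HDDDM(X0, gamma=gamma)          # HDDDM sketch from earlier
    err1, err2 = [], []
    for X, y in batches[1:]:
        err1.append(np.mean(C1.predict(X) != y))   # test before training
        err2.append(np.mean(C2.predict(X) != y))
        if detector.add_batch(X):                  # HDDDM flags drift
            C1 = GaussianNB()                      # reset the target classifier
        C1.partial_fit(X, y, classes=classes)      # train C1 on the new data
        C2.partial_fit(X, y, classes=classes)      # control is always updated
    return err1, err2
```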
• We have presented HDDDM, a method that determines whether drift is present in an incremental learning scenario
o The Hellinger distance is used as a measure to quantify the "overlap" between a baseline and the current distribution
o An adaptive threshold is used in conjunction with the difference in Hellinger distance to determine when drift is present
o Two formulations of the adaptive threshold have been presented and analyzed
• Our HDDDM approach can be used in conjunction with a classifier to decrease the error of the classification system
o The control classifier is updated all the time
o HDDDM+OLC resets a classifier when HDDDM detects drift; otherwise the classifier is trained on the new data
o Results indicate that HDDDM+OLC can efficiently detect drift in controlled experiments
• Future work: integration of other detection procedures to reinforce the decision of HDDDM, as well as addressing false alarms that occur in only one class
1. A. Tsymbal, "The problem of concept drift: definitions and related work," Trinity College, Dublin, Ireland, TCD-CS-2004-15, 2004.
2. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, Inc., 2001.
3. C. Alippi, G. Boracchi, and M. Roveri, "Just in time classifiers: managing the slow drift case," International Joint Conference on Neural Networks, pp. 114-120, 2009.
4. L. I. Kuncheva, "Using control charts for detecting concept change in streaming data," School of Computer Science, Bangor University, BCS-TR-001-2009, 2009.
5. B. Manly and D. MacKenzie, "A cumulative sum type of method for environmental monitoring," Environmetrics, vol. 11, pp. 151-166, 2000.
6. C. Alippi and M. Roveri, "Just-in-time adaptive classifiers - part I: detecting nonstationary changes," IEEE Transactions on Neural Networks, vol. 19, no. 7, pp. 1145-1153, 2008.
7. C. Alippi and M. Roveri, "Just-in-time adaptive classifiers - part II: designing the classifier," IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2053-2064, 2008.
8. C. Alippi, G. Boracchi, and M. Roveri, "Change detection tests using the ICI rule," World Congress of Computational Intelligence, pp. 1190-1196, 2010.
9. M. Baena-Garcia, J. del Campo-Ávila, R. Fidalgo, and A. Bifet, "Early drift detection method," 4th ECML PKDD International Workshop on Knowledge Discovery from Data Streams, pp. 77-86, 2006.
10. C. Mesterharm, "Tracking linear-threshold concepts with Winnow," Journal of Machine Learning Research, vol. 4, pp. 819-838, 2003.
11. J. Gama, P. Medas, G. Castillo, and P. Rodrigues, "Learning with drift detection," SBIA Brazilian Symposium on Artificial Intelligence, vol. 3741, pp. 286-295, 2004.
12. D. A. Cieslak and N. V. Chawla, "A framework for monitoring classifiers' performance: when and why failure occurs?," Knowledge and Information Systems, vol. 18, no. 1, pp. 83-108, 2009.
13. F. J. Massey, "The Kolmogorov-Smirnov test for goodness of fit," Journal of the American Statistical Association, vol. 46, pp. 68-78, 1951.
14. R. Polikar, L. Upda, S. S. Upda, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 31, no. 4, pp. 497-508, 2001.
15. L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Inc., 2004.
16. W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," Seventh ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 377-382, 2001.
17. C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
18. F. H. Hamker, "Life-long learning cell structures - continuously learning without catastrophic interference," Neural Networks, vol. 14, no. 5, pp. 551-573, 2001.
19. M. Muhlbaier and R. Polikar, "An ensemble approach for incremental learning in nonstationary environments," 7th International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 4472, Springer, pp. 490-500, 2007.
20. R. Elwell and R. Polikar, "Incremental learning in nonstationary environments with controlled forgetting," International Joint Conference on Neural Networks, pp. 771-778, 2009.