IJCNN2011


Slides for the IEEE/INNS International Joint Conference on Neural Networks


Gregory Ditzler

July 12, 2013

Transcript

  1. Semi-supervised Learning in Nonstationary Environments

    Gregory Ditzler (1,2) and Robi Polikar (2)
    (1) Drexel University, Biochemical Signal Processing Laboratory, Dept. of Electrical & Computer Engineering, Philadelphia, PA, USA
    (2) Rowan University, Signal Processing & Pattern Recognition Laboratory, Dept. of Electrical & Computer Engineering, Glassboro, NJ, USA
    gregory.ditzler@gmail.com, polikar@rowan.edu
    International Joint Conference on Neural Networks, 2011
    This material is based upon work supported by the National Science Foundation under Grant No 0926159. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
  2. Contents

    • Introduction
      o Motivation for the weight estimation algorithm
      o Prior work in concept drift and unlabeled data
    • Approach
      o Assumptions and modeling techniques
      o The weight estimation algorithm
    • Experiments
      o Empirical results on two synthetic datasets
      o Comparison with Learn++.NSE
    • Conclusions
      o Summarizing remarks
      o Discussion
  4. Where is concept drift found in real-world scenarios?

    • Specific example: predict ads that are relevant to a user's interests
      o User interests are known to evolve (or drift) with time
      o Algorithms must identify the change in a user's interests to be considered useful
      o OK, that is a simple example, but where can I find this in practice?
        • Yandex.Direct: selects ads which reflect the user's current interest
        • Google AdSense: displays ads related to the content of your webpage
        • The above applications are related through contextual advertising
    • General examples: electricity demand, financial, climate, epidemiological, and spam data (to name a few)
  5. Background

    • Concept drift: a change over time in the joint probability distribution of the data and classes (i.e., P_{t-1}(X,ω) ≠ P_t(X,ω))
      o Categories: incremental, gradual, reoccurring
        • Reoccurring drift is a special case of either incremental or gradual drift, as is abrupt change
        • What is actually changing?
    • How has concept drift been handled in the past?
      o Ensemble classifiers (L++.NSE [1], ASHT [2], DWM [3]), adaptive sliding windows (FLORA [4], ADWIN [5]), drift detection (DDM [6], EDDM [7], JIT [8,9]), instance selection (FISH [10])
      o See Žliobaitė's concept drift review paper for an excellent introduction to and background on concept drift [11]
    • Ensembles that process data in batches tend to build a classifier and update the classifier weights when new data are presented
      o Pro: if the drift is evolutionary in nature, then the voting weights should be good enough to reduce the influence of poorly performing classifiers and increase the weight of strong predictors
      o Con: if the weights are computed on P_{t-1}(X,ω) and tested on P_t(X,ω), then the weights will be biased; they could be improved if we knew something about P_t(X,ω)
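    One way to approach the question "What is actually changing?" is to factor the joint distribution. The slide does not spell this out; the identity below is the standard product rule, added here only as a reading aid.

```latex
P_t(X,\omega) \;=\; P_t(\omega \mid X)\,P_t(X) \;=\; P_t(X \mid \omega)\,P_t(\omega)
```

    Under this view, P_{t-1}(X,ω) ≠ P_t(X,ω) can arise from a change in the data distribution P_t(X), in the class priors P_t(ω), in the posteriors P_t(ω|X), or in any combination of these.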
  7. Approach

    • Can we use unlabeled data to infer P_t(X,ω)?
      o Well, determining P_t(X) is not too bad
        • Consider generating Gaussian mixture models (GMMs) from the data sampled from P_t(X,ω)
        • A GMM generated from the unlabeled data gives us a model of P_t(X), but still no class information
      o What if we use a GMM generated from P_{t-1}(X,ω) to infer the class membership of the components within the GMM generated from P_t(X)?
        • This provides a potentially accurate estimate of P_t(X,ω)
        • Thus an estimated error of a classifier can be computed for P_t(X,ω)
    • OK, what assumptions do we need to make?
      o Assumption: drift is evolutionary in nature and smooth from a global perspective
      o This is not a very limiting assumption, as much streaming data changes in this manner (e.g., weather data changing throughout the year)
  8. Weight Estimation Algorithm

    • Produce a classifier from the labeled data
    • Generate a GMM with $K_c$ components for each class, $c$, in the labeled dataset
    • Wait for unlabeled data to arrive
      o Generate a second GMM with $K$ components from the unlabeled data
      o No class is associated with any of its components
    • Compute the Bhattacharyya distance between each component in the unlabeled GMM and all components in the GMMs generated with class information
      o Temporarily assign the unknown component, $k$, the label of the component with the minimum distance, i.e., the class $c$ attaining

        $$\min_{c\in\Omega}\,\min_{k'\in K_c}\left\{\frac{1}{8}\left(\mu_k-\mu_{k'}^{c}\right)^{T}\left(\frac{\Sigma_k+\Sigma_{k'}^{c}}{2}\right)^{-1}\left(\mu_k-\mu_{k'}^{c}\right)+\frac{1}{2}\log\frac{\left|\frac{\Sigma_k+\Sigma_{k'}^{c}}{2}\right|}{\sqrt{\left|\Sigma_k\right|\left|\Sigma_{k'}^{c}\right|}}\right\}$$

    Input:
      Labeled training data $\mathcal{D}^{(t)} = \{x_i \in X;\ y_i \in \Omega\}$ where $i = 1, \dots, m^{(t)}$
      Unlabeled field data $\mathcal{B}^{(t)} = \{x_j \in X\}$ where $j = 1, \dots, n^{(t)}$
      $K_c$: number of centers for the $c$th class in a GMM
      $q^{(t)}$: number of instances generated to estimate the classifier error
      BaseClassifier learning algorithm
    for $t = 1, 2, \dots$ do
      1. Call BaseClassifier on $\mathcal{D}^{(t)}$ to generate $h_t : X \to \Omega$
      2. Generate a GMM for each class with $K_c$ centers in the labeled $\mathcal{D}^{(t)}$. Refer to these mixture models as $\mathcal{M}_c^{(t)}$.
      3. Generate a GMM with $\sum_c K_c$ centers from the unlabeled $\mathcal{B}^{(t)}$. Refer to this mixture model as $\mathcal{N}^{(t)}$.
      4. Compute the Bhattacharyya distance between the components in $\mathcal{N}^{(t)}$ and the components in $\mathcal{M}_c^{(t)}$. Assign each component in $\mathcal{N}^{(t)}$ the label of the closest component in $\mathcal{M}_c^{(t)}$. Refer to this labeled mixture as $\widehat{\mathcal{N}}^{(t)}$.
      5. Generate synthetic data from $\widehat{\mathcal{N}}^{(t)}$ and compute the error of each classifier on the synthetic data:
           $$\hat{\varepsilon}_k^{(t)} = \frac{1}{q^{(t)}}\sum_{j=1}^{q^{(t)}} \left[\!\left[ h_k(x_j) \neq y_j \right]\!\right]$$
         where $q^{(t)}$ is the number of synthetic instances generated and $k = 1, 2, \dots, t$
         if $\hat{\varepsilon}_k^{(t)} > 1/2$ then $\hat{\varepsilon}_k^{(t)} = 1/2$ end if
      6. Compute the classifier voting weights for the field data:
           $$W_k^{(t)} \propto \log\frac{1-\hat{\varepsilon}_k^{(t)}}{\hat{\varepsilon}_k^{(t)}}$$
      7. Classify the field data in $\mathcal{B}^{(t)}$:
           $$H^{(t)}\left(x_j \in \mathcal{B}^{(t)}\right) = \arg\max_{c\in\Omega} \sum_{k=1}^{t} W_k^{(t)} \left[\!\left[ h_k(x_j) = c \right]\!\right]$$
    end for
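    The component-labeling step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes diagonal covariances (so the Bhattacharyya distance reduces to a per-dimension sum), and the function names `bhattacharyya_diag` and `label_components`, as well as the toy component parameters, are hypothetical.

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two Gaussians.

    Assumption: both covariance matrices are diagonal, given as
    per-dimension variances, so the general formula splits by dimension.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = 0.5 * (v1 + v2)                              # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / s               # mean-separation term
        dist += 0.5 * math.log(s / math.sqrt(v1 * v2))   # covariance-mismatch term
    return dist


def label_components(unlabeled, labeled):
    """Give each unlabeled component the class of its closest labeled component.

    unlabeled: list of (mu, var) tuples
    labeled:   dict mapping class label -> list of (mu, var) tuples
    """
    labels = []
    for mu, var in unlabeled:
        # Minimize over classes and over the components within each class.
        _, best_class = min(
            ((bhattacharyya_diag(mu, var, m, v), c)
             for c, comps in labeled.items()
             for (m, v) in comps),
            key=lambda pair: pair[0],
        )
        labels.append(best_class)
    return labels


# Toy example (hypothetical numbers): one labeled component per class,
# two slightly drifted unlabeled components.
labeled = {0: [((0.0, 0.0), (1.0, 1.0))], 1: [((4.0, 4.0), (1.0, 1.0))]}
unlabeled = [((0.5, 0.3), (1.2, 0.9)), ((3.6, 4.2), (1.0, 1.1))]
print(label_components(unlabeled, labeled))
```

    With full covariance matrices the same structure applies, but the per-dimension sum would be replaced by the matrix form of the distance (inverse and determinant of the averaged covariance).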
  9. Weight Estimation Algorithm

    • Generate synthetic data from the GMM with temporary labels
      o Synthetic data should reasonably approximate the unknown distribution, assuming limited drift in the structure of the data
    • An estimate of the error on the unlabeled data is computed using the synthetic data
      o Again, assuming limited drift and a reasonable selection of GMM parameters, this should give at least an estimate of the classifier error
    • Classifier weights decrease as the estimated error of a classifier increases (proportional to the log of the ratio of one minus the error to the error)
    • Ensemble decision is made via a weighted majority vote
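    The synthetic-data, error-estimation, weighting, and voting steps described on this slide can be sketched as follows. This is a hedged illustration under stated assumptions: components have diagonal covariances, classifiers are plain Python callables, and the helper names (`sample_labeled_gmm`, `estimated_error`, `voting_weight`, `weighted_vote`) and the `eps` floor are hypothetical choices that do not come from the slides.

```python
import math
import random
from collections import defaultdict


def sample_labeled_gmm(components, n, seed=0):
    """Draw n (x, label) pairs from a GMM whose components carry labels.

    components: list of (mixing_weight, mu, var, label); diagonal
    covariances are an assumption of this sketch.
    """
    rng = random.Random(seed)
    mix = [c[0] for c in components]
    samples = []
    for _ in range(n):
        _, mu, var, label = rng.choices(components, weights=mix, k=1)[0]
        x = tuple(rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var))
        samples.append((x, label))
    return samples


def estimated_error(classifier, samples):
    """Misclassification rate on the synthetic samples, capped at 1/2."""
    wrong = sum(1 for x, y in samples if classifier(x) != y)
    return min(wrong / len(samples), 0.5)


def voting_weight(error, eps=1e-6):
    """log((1 - e) / e); eps is an assumed floor that avoids dividing by zero."""
    e = max(error, eps)
    return math.log((1.0 - e) / e)


def weighted_vote(classifiers, weights, x):
    """Weighted majority vote: the class with the largest total weight wins."""
    votes = defaultdict(float)
    for h, w in zip(classifiers, weights):
        votes[h(x)] += w
    return max(votes, key=votes.get)


# Toy ensemble (hypothetical): two threshold rules on 2-D points.
components = [(0.5, (0.0, 0.0), (1.0, 1.0), 0), (0.5, (4.0, 4.0), (1.0, 1.0), 1)]
synthetic = sample_labeled_gmm(components, n=200)
ensemble = [lambda x: int(x[0] + x[1] > 4.0), lambda x: int(x[0] > 3.5)]
weights = [voting_weight(estimated_error(h, synthetic)) for h in ensemble]
print(weighted_vote(ensemble, weights, (3.0, 3.0)))
```

    Capping the error at 1/2 gives a classifier that is no better than chance a weight of zero, so it simply drops out of the vote instead of contributing a negative weight.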
  11. Datasets & Comparisons

    • Synthetic datasets
      o Circular Gaussian Drift: two-class problem consisting of two Gaussian components drifting in a circular pattern
      o Triangular Gaussian Drift: three-class problem consisting of three Gaussian components drifting in a triangular pattern
    • WEA is compared to Learn++.NSE [1]
      o Unlike WEA, Learn++.NSE uses a classifier's previous errors as a means to infer bias on a future dataset
    • Experiments are run over 50 independent trials
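    To make the circular drift concrete, the sketch below generates batches from two Gaussians whose means move around a circle. The slides do not give the generator's parameters, so the radius, center, standard deviation, period, and the function names `circular_means` and `sample_batch` are all assumptions made for illustration; this is one plausible reading of "drifting in a circular pattern", not the authors' dataset.

```python
import math
import random


def circular_means(t, period, n_classes=2, radius=2.0, center=(5.0, 5.0)):
    """Class means at time step t (all parameter values are assumed).

    Each mean completes one revolution every `period` steps; classes are
    spaced evenly around the circle.
    """
    means = []
    for c in range(n_classes):
        angle = 2.0 * math.pi * (t / period + c / n_classes)
        means.append((center[0] + radius * math.cos(angle),
                      center[1] + radius * math.sin(angle)))
    return means


def sample_batch(t, period, n_per_class, sigma=1.0, seed=None):
    """Draw a labeled batch of isotropic Gaussian points at time step t."""
    rng = random.Random(seed)
    batch = []
    for label, (mx, my) in enumerate(circular_means(t, period)):
        for _ in range(n_per_class):
            batch.append(((rng.gauss(mx, sigma), rng.gauss(my, sigma)), label))
    return batch


# A labeled training batch at step 10 and an unlabeled field batch drawn one
# step later (the one-step offset is an illustrative choice).
labeled = sample_batch(t=10, period=100, n_per_class=50, seed=1)
unlabeled = [x for x, _ in sample_batch(t=11, period=100, n_per_class=50, seed=2)]
```

    A larger gap between the two time steps would make the labeled and unlabeled distributions differ more, which is the situation the weight estimation step is meant to handle.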
  12. Results

    [Figure: Learn++.NSE vs. WEA on GaussCir, panels (a) to (g) for Bias = 0, 1, 3, 5, 7, 10, 13; horizontal axis from 0 to 100]
    Figure annotations: approximately equal accuracy; WEA remains an accurate predictor even as the bias increases; WEA fails as the bias becomes too large
  13. Results

    [Figure: Learn++.NSE vs. WEA on GaussTri, panels (a) to (g) for Bias = 0, 1, 3, 5, 7, 10, 13; horizontal axis from 0 to 200]
    Figure annotations: approximately equal accuracy; WEA remains an accurate predictor even as the bias increases; WEA fails as the bias becomes too large
  14. Results

    • WEA and Learn++.NSE perform approximately equally when no bias is present
      o Little or no statistically significant difference is found at many of the time stamps throughout each experiment
    • Both algorithms experience a significant boost in accuracy in the presence of reoccurring environments
    • WEA maintains nearly the same accuracy as it did without any bias when the bias is increased
      o Learn++.NSE's accuracy begins to drop off rapidly as the bias increases
      o WEA becomes slightly less stable when the bias increases further
      o WEA maintains a higher accuracy than Learn++.NSE until the bias in the data becomes large
    • WEA should perform well on a variety of problems where the distribution can be modeled with GMMs
  16. Conclusions

    • We have presented a weight estimation algorithm (WEA) for determining classifier voting weights in the presence of concept drift
      o Ensemble-based algorithm that uses labeled data to build classifiers and unlabeled data to aid in the calculation of the classifier voting weights before the data are classified
    • WEA showed significant improvement when bias was present between the distributions of labeled and unlabeled batches
      o Empirical results indicate that WEA performs similarly to Learn++.NSE when there is no bias between the labeled and unlabeled data
      o A clear violation of the limited drift assumption can result in poor overall accuracy compared to state-of-the-art approaches like Learn++.NSE
      o The violation of the limited drift assumption may not be as detrimental to the overall accuracy if the GMMs contain a large number of mixture components
  17. Future Work

    • Develop a theoretical foundation, not necessarily for WEA, but for a more general algorithm that infers classifier bias in concept drift problems
      o What is the most effective way to use unlabeled data from a different distribution?
      o Should an active learning approach be pursued rather than a transductive learning approach?
    • Develop general error bounds for such a framework
  18. References

    1. Muhlbaier, M. & Polikar, R. Multiple classifiers based incremental learning algorithm for learning nonstationary environments. IEEE International Conference on Machine Learning and Cybernetics, 2007, 3618-3623
    2. Bifet, A.; Holmes, G.; Pfahringer, B.; Kirkby, R. & Gavaldà, R. New Ensemble Methods For Evolving Data Streams. Knowledge and Data Discovery, 2009
    3. Kolter, J. & Maloof, M. Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts. Journal of Machine Learning Research, 2007, 8, 2755-2790
    4. Widmer, G. & Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 1996, 23, 69-101
    5. Bifet, A. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. Frontiers in Artificial Intelligence and Applications, 2010
    6. Gama, J.; Medas, P.; Castillo, G. & Rodrigues, P. Learning with Drift Detection. Lecture Notes in Computer Science, 2004, 3741, 286-295
    7. Baena-Garcia, M.; del Campo-Avila, J.; Fidalgo, R.; Bifet, A.; Gavaldà, R. & Morales-Bueno, R. Early Drift Detection Method. International Workshop on Knowledge Discovery from Data Streams, 2006
    8. Alippi, C. & Roveri, M. Just-in-Time Adaptive Classifiers, Part I: Detecting Nonstationary Changes. IEEE Transactions on Neural Networks, 2008, 19, 1145-1153
    9. Alippi, C. & Roveri, M. Just-in-Time Adaptive Classifiers, Part II: Designing the Classifier. IEEE Transactions on Neural Networks, 2008, 19, 2053-2064
    10. Žliobaitė, I. Combining similarity in time and space for training set formulation under concept drift. Intelligent Data Analysis, 2010, 14, 4, to appear
    11. Žliobaitė, I. Learning under Concept Drift: An Overview. Vilnius University, Technical Report, 2009
  19. This material is based upon work supported by the National

    Science Foundation under Grant No ECCS-0926159. National Science Foundation “America’s investment in future.”