IJCNN2011

Slides for the IEEE/INNS International Joint Conference on Neural Networks

Gregory Ditzler

July 12, 2013
Transcript

  1. Semi-supervised Learning in
    Nonstationary Environments
    Gregory Ditzler1,2 and Robi Polikar2
    This material is based upon work supported by the National Science Foundation under Grant No
    0926159. Any opinions, findings, and conclusions or recommendations expressed in this
    material are those of the author(s) and do not necessarily reflect the views of the National
    Science Foundation.
    International Joint Conference on Neural Networks, 2011
    1Drexel University
    Biochemical Signal Processing Laboratory
    Dept. of Electrical & Computer Engineering
    Philadelphia, PA, USA
    2Rowan University
    Signal Processing & Pattern Recognition Laboratory
    Dept. of Electrical & Computer Engineering
    Glassboro, NJ, USA
    [email protected], [email protected]


  2. Contents
    • Introduction
    o Motivation for the weight estimation algorithm
    o Prior work in concept drift and unlabeled data
    • Approach
    o Assumptions and modeling techniques
    o The weight estimation algorithm
    • Experiments
    o Empirical results on two synthetic datasets
    o Comparison with Learn++.NSE
    • Conclusions
    o Summarizing remarks
    o Discussion
    Introduction Approach Experiments Conclusions


  3. Contents
    • Introduction
    o Motivation for the weight estimation algorithm
    o Prior work in concept drift and unlabeled data
    • Approach
    o Assumptions and modeling techniques
    o The weight estimation algorithm
    • Experiments
    o Empirical results on two synthetic datasets
    o Comparison with Learn++.NSE
    • Conclusions
    o Summarizing remarks
    o Discussion
    Introduction Approach Experiments Conclusions


  4. Where is concept drift
    found in real-world scenarios?
    • Specific Example: predict ads that are relevant to a user’s interests
    o User interests are known to evolve – or drift – with time
    o Algorithms must identify the change in a user’s interest to be considered useful
    o Ok, that is a simple example, but where can I find this in practice?
    • Yandex.Direct → selects ads that reflect the user’s current interest
    • Google AdSense → displays ads related to the content of your webpage
    • The above applications are related through contextual advertising
    • General Examples: electricity demand, financial, climate,
    epidemiological, and spam (to name a few)
    Introduction Approach Experiments Conclusions


  5. Background
    • Concept drift: a change with time in the joint probability distribution between the
    data and classes (i.e., Pt-1(X,ω) ≠ Pt(X,ω))
    o Categories: incremental, gradual, reoccurring
    • Reoccurring drift is a special case of either incremental or gradual drift, as is abrupt change
    • What is actually changing?
    • How has concept drift been handled in the past?
    o Ensemble classifiers (L++.NSE[1], ASHT[2], DWM[3]), adaptive sliding windows (FLORA[4],
    ADWIN[5]), drift detection (DDM[6], EDDM[7], JIT[8,9]), instance selection (FISH[10])
    o See Žliobaitė’s concept drift review paper for an excellent introduction to and background on
    concept drift [11]
    • Ensembles that process data in batches tend to build a classifier and
    update classifier weights when new data is presented
    o Pro: if the drift is evolutionary in nature, then the voting weights should be good enough to
    reduce the influence of poorly performing classifiers and increase weight of strong predictors
    o Con: if the weights are computed on Pt-1(X,ω) and tested on Pt(X,ω), then the weights will be
    biased and could be improved upon if we knew something about Pt(X,ω)
    Introduction Approach Experiments Conclusions
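The batch-ensemble pattern described in the Pro/Con bullets above can be sketched in a few lines. This is a hypothetical illustration, not the paper's algorithm: the class and method names are mine, and `GaussianNB` is just a stand-in base learner. Note that the weights are recomputed on the most recent *labeled* batch, i.e., on Pt-1(X,ω), which is exactly the source of bias the "Con" points out.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class BatchEnsemble:
    """Hypothetical sketch: one classifier per labeled batch, with
    log-odds voting weights recomputed on the newest labeled batch."""

    def __init__(self):
        self.models, self.weights = [], []

    def update(self, X, y):
        # Train a new classifier on the incoming labeled batch.
        self.models.append(GaussianNB().fit(X, y))
        # Re-weight every classifier using its error on this batch,
        # clipping the error to (0, 1/2] so weights stay non-negative.
        errs = [max(1e-10, min(0.5, 1.0 - m.score(X, y))) for m in self.models]
        self.weights = [np.log((1 - e) / e) for e in errs]

    def predict(self, X):
        votes = np.array([m.predict(X) for m in self.models])  # (k, n)
        out = []
        for col in votes.T:
            scores = {}
            for p, w in zip(col, self.weights):
                scores[p] = scores.get(p, 0.0) + w
            out.append(max(scores, key=scores.get))
        return np.array(out)
```

When the drift is gradual, the newest batch is a reasonable proxy for the next one, so these weights tend to suppress stale classifiers; when a bias exists between training and field distributions, they do not, which motivates the weight estimation algorithm that follows.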


  6. Contents
    • Introduction
    o Motivation for the weight estimation algorithm
    o Prior work in concept drift and unlabeled data
    • Approach
    o Assumptions and modeling techniques
    o The weight estimation algorithm
    • Experiments
    o Empirical results on two synthetic datasets
    o Comparison with Learn++.NSE
    • Conclusions
    o Summarizing remarks
    o Discussion
    Introduction Approach Experiments Conclusions


  7. Approach
    • Can we use unlabeled data to infer upon Pt(X,ω)?
    o Well, determining Pt(X) is not too bad
    • Consider generating Gaussian mixture models (GMMs) from the data
    sampled from Pt(X,ω)
    • A GMM generated from this data can give us a model of Pt(X), but still no class
    information
    o What if we use a GMM generated from Pt-1(X,ω) to infer the class membership of the
    components within the GMM generated from Pt(X)?
    • This provides a potentially accurate estimate of Pt(X,ω)
    • Thus an estimated error of a classifier can be computed for Pt(X,ω)
    • O.k., what assumptions do we need to make?
    o Assumption: drift is evolutionary in nature and smooth from a global perspective
    o This is not a very limiting assumption, as streaming data often changes in this manner
    (e.g., weather data changing throughout the year)
    Introduction Approach Experiments Conclusions
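The modeling step above can be sketched with scikit-learn's `GaussianMixture`. This is a minimal sketch under assumed toy data (the blob locations and component counts are mine; the paper does not prescribe a library): per-class GMMs model Pt-1(X | c) from the labeled batch, and a second GMM models Pt(X) from the unlabeled batch, which carries no class information yet.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Labeled batch drawn from P_{t-1}(X, w): one Gaussian blob per class.
X_prev = np.vstack([rng.normal(-2, 0.5, (100, 2)),
                    rng.normal(+2, 0.5, (100, 2))])
y_prev = np.array([0] * 100 + [1] * 100)

# Per-class GMMs model P_{t-1}(X | c); here K_c = 1 component per class.
class_gmms = {c: GaussianMixture(n_components=1, random_state=0)
                 .fit(X_prev[y_prev == c]) for c in (0, 1)}

# Unlabeled batch from the slightly drifted P_t(X).
X_new = np.vstack([rng.normal(-1.8, 0.5, (100, 2)),
                   rng.normal(+2.2, 0.5, (100, 2))])

# This GMM models P_t(X) but has no class labels; its components are
# matched against class_gmms to inherit temporary labels (next slide).
unlabeled_gmm = GaussianMixture(n_components=2, random_state=0).fit(X_new)
```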


  8. Weight Estimation Algorithm
    • Produce a classifier from the labeled
    data
    • Generate a GMM with Kc
    components for each class, c, in the
    labeled dataset
    • Wait for unlabeled data to arrive
    o Generate a 2nd GMM with K components from
    the unlabeled data
    o No class association with any component
    • Compute the Bhattacharyya distance
    between each component in the
    unlabeled GMM and all components
    in the GMMs generated with class
    information
    o Temporarily assign each unknown component the label of the class
    with the minimum distance
    Introduction Approach Experiments Conclusions
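The component matching described in the bullets above uses the Bhattacharyya distance between two Gaussian components, which can be written directly in NumPy. This is a sketch (function and variable names are mine); each unlabeled component would be compared against every class component and assigned the label of the nearest one.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between Gaussians N(mu1, cov1), N(mu2, cov2):
    (1/8)(mu1-mu2)^T S^{-1} (mu1-mu2) + (1/2) log(|S| / sqrt(|cov1||cov2|)),
    where S = (cov1 + cov2)/2."""
    S = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2
```

The distance is zero for identical components and symmetric in its arguments, which is why a simple minimum over class components suffices for the temporary label assignment.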
    Input: Labeled training data 𝒟(t) = {xi ∈ X; yi ∈ Ω} where i = 1, …, m(t)
    Unlabeled field data ℬ(t) = {xj} where j = 1, …, n(t)
    Kc: number of centers for the cth class in a GMM
    N(t): number of instances generated to estimate the classifier error
    BaseClassifier learning algorithm
    for t = 1, 2, … do
    1. Call BaseClassifier on 𝒟(t) to generate ht: X → Ω
    2. Generate a GMM for each class with Kc centers from the labeled 𝒟(t). Refer to these
    mixture models as ℳc(t).
    3. Generate a GMM with Σc Kc centers from the unlabeled ℬ(t). Refer to this mixture
    model as 𝒰(t).
    4. Compute the Bhattacharyya distance between the components in 𝒰(t) and the
    components in ℳc(t). Assign each component in 𝒰(t) the label of the closest
    component in ℳc(t). Refer to this labeled mixture as 𝒰̂(t).
    5. Generate synthetic data from 𝒰̂(t) and compute the error of each classifier on the
    synthetic data:
    ε̂k(t) = (1/N(t)) Σj=1…N(t) ⟦hk(xj) ≠ ŷj⟧
    where N(t) is the number of synthetic instances generated and k = 1, 2, …, t
    if ε̂k(t) > 1/2 then ε̂k(t) = 1/2 end if
    6. Compute the classifier voting weights for the field data:
    Wk(t) ∝ log((1 − ε̂k(t)) / ε̂k(t))
    7. Classify the field data in ℬ(t):
    H(t)(x ∈ ℬ(t)) = argmax_{c∈Ω} Σk=1…t Wk(t) ⟦hk(x) = c⟧
    end for
    Component label assignment by minimum Bhattacharyya distance (component k of 𝒰(t)
    against component kc of ℳc(t)):
    c′k = argmin_{c∈Ω} min_{kc∈Kc} [ (1/8)(μk − μkc)ᵀ((Σk + Σkc)/2)⁻¹(μk − μkc)
    + (1/2) log( |(Σk + Σkc)/2| / √(|Σk||Σkc|) ) ]


  9. Weight Estimation Algorithm
    • Generate synthetic data from the
    GMM with temporary labels
    o Synthetic data should reasonably approximate
    the unknown distribution assuming a limited
    drift in the structure of the data
    • An estimate of the error on the
    unlabeled data is computed using
    the synthetic data
    o Again, assuming limited drift and a reasonable selection of GMM
    parameters, this should give at least a rough estimate of the
    classifier error
    • Classifier weights are inversely
    proportional to the error of a
    classifier
    • Ensemble decision is made via a
    weighted majority vote
    Introduction Approach Experiments Conclusions
    (The WEA pseudocode from the previous slide is repeated here alongside these remarks.)
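Steps 5–7 of the listing (clip the estimated error at 1/2, form log-odds voting weights, and take a weighted majority vote) can be sketched as follows. Function names and the tiny floor `eps` are mine, added only to keep the log finite.

```python
import numpy as np

def voting_weight(err, eps=1e-10):
    """WEA-style voting weight: clip the estimated error into (0, 1/2],
    then weight inversely via the log-odds W = log((1 - e) / e)."""
    e = min(max(err, eps), 0.5)
    return np.log((1.0 - e) / e)

def weighted_majority(predictions, weights, classes):
    """Weighted majority vote over classifier predictions for one instance."""
    scores = {c: sum(w for p, w in zip(predictions, weights) if p == c)
              for c in classes}
    return max(scores, key=scores.get)

# Errors of three classifiers on the synthetic data; 0.6 is clipped to 1/2,
# so the worst classifier receives zero voting weight.
w = [voting_weight(e) for e in (0.1, 0.4, 0.6)]
label = weighted_majority([1, 0, 0], w, classes=(0, 1))  # class 1 wins
```

The clip at 1/2 ensures a classifier that is no better than chance never votes against the ensemble with negative weight.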


  10. Contents
    • Introduction
    o Motivation for the weight estimation algorithm
    o Prior work in concept drift and unlabeled data
    • Approach
    o Assumptions and modeling techniques
    o The weight estimation algorithm
    • Experiments
    o Empirical results on two synthetic datasets
    o Comparison with Learn++.NSE
    • Conclusions
    o Summarizing remarks
    o Discussion
    Introduction Approach Experiments Conclusions


  11. Datasets & Comparisons
    • Synthetic Datasets
    o Circular Gaussian Drift: two-class problem consisting of two Gaussian components
    drifting in a circular pattern
    o Triangular Gaussian Drift: three-class problem consisting of three Gaussian components
    drifting in a triangular pattern
    • WEA is compared to Learn++.NSE[1]
    o Unlike WEA, Learn++.NSE uses a classifier’s previous errors as a means to infer bias on
    a future dataset
    • Experiments are run over 50 independent trials
    Introduction Approach Experiments Conclusions
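The circular Gaussian drift data can be approximated as below. This is a sketch: the radius, σ, batch size, and period are assumed values, not the paper's exact settings. The `bias` argument shifts the field batch ahead of the labeled batch in time, mimicking the distributional bias varied in the experiments.

```python
import numpy as np

def gauss_cir_batch(t, T=100, n=50, radius=2.0, sigma=0.4, bias=0, seed=None):
    """One labeled batch of a circular Gaussian drift problem: two class
    means move around a circle of the given radius, 180 degrees apart,
    completing one revolution every T time steps."""
    rng = np.random.default_rng(seed)
    angle = 2 * np.pi * (t + bias) / T
    mu = radius * np.array([[np.cos(angle), np.sin(angle)],
                            [np.cos(angle + np.pi), np.sin(angle + np.pi)]])
    X = np.vstack([rng.normal(mu[c], sigma, (n, 2)) for c in (0, 1)])
    y = np.repeat([0, 1], n)
    return X, y

X_train, y_train = gauss_cir_batch(t=10, seed=0)    # labeled batch at time t
X_field, _ = gauss_cir_batch(t=10, bias=3, seed=1)  # biased unlabeled batch
```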


  12. Results
    [Figure: panels (a)–(g) plot ensemble accuracy versus time step on GaussCir for
    Bias = 0, 1, 3, 5, 7, 10, and 13, comparing Learn++.NSE and WEA. With no bias the
    two algorithms achieve approximately equal accuracy; WEA remains an accurate
    predictor even as the bias increases, but fails as the bias becomes too large.]
    Introduction Approach Experiments Conclusions


  13. Results
    [Figure: panels (a)–(g) plot ensemble accuracy versus time step on GaussTri for
    Bias = 0, 1, 3, 5, 7, 10, and 13, comparing Learn++.NSE and WEA. With no bias the
    two algorithms achieve approximately equal accuracy; WEA remains an accurate
    predictor even as the bias increases, but fails as the bias becomes too large.]
    Introduction Approach Experiments Conclusions


  14. Results
    • WEA and Learn++.NSE perform approximately equally when no bias
    is present
    o Little or no statistical significance is found at many of the time stamps throughout each
    experiment
    • Both algorithms experience a significant boost in accuracy in the
    presence of reoccurring environments
    • WEA maintains nearly the same accuracy as it did without any bias
    as the bias increases
    o Learn++.NSE’s accuracy begins to drop off rapidly as the bias increases
    o WEA becomes slightly less stable when the bias increases further
    o WEA maintains a clear accuracy advantage over Learn++.NSE until the bias in the data
    becomes large
    • WEA should perform well on a variety of problems where the
    distribution can be modeled with GMMs
    Introduction Approach Experiments Conclusions


  15. Contents
    • Introduction
    o Motivation for the weight estimation algorithm
    o Prior work in concept drift and unlabeled data
    • Approach
    o Assumptions and modeling techniques
    o The weight estimation algorithm
    • Experiments
    o Empirical results on two synthetic datasets
    o Comparison with Learn++.NSE
    • Conclusions
    o Summarizing remarks
    o Discussion
    Introduction Approach Experiments Conclusions


  16. Conclusions
    • We have presented a weight estimation algorithm (WEA) for
    determining classifier-voting weights in the presence of concept
    drift
    o Ensemble-based algorithm that uses labeled data to build classifiers and unlabeled data to
    aid in the calculation of the classifier voting weights before the data are classified
    • WEA showed significant improvement when bias was present
    between the distributions of labeled and unlabeled batches
    o Empirical results indicate that WEA performs similarly to Learn++.NSE when there is no
    bias between the labeled and unlabeled data
    o A clear violation of the limited drift assumption can result in poor overall accuracy
    compared to state-of-the-art approaches like Learn++.NSE
    o The violation of the limited drift assumption may not be as detrimental to the overall
    accuracy if the GMMs contain a large number of mixtures
    Introduction Approach Experiments Conclusions


  17. Future Work
    • Develop a theoretical foundation, not necessarily for WEA, but a
    more general algorithm that infers upon classifier bias in concept
    drift problems
    o What is the most effective way we can use the unlabeled data from a different
    distribution?
    o Should an active learning rather than transductive learning approach be pursued?
    • Develop general error bounds for such a framework
    Introduction Approach Experiments Conclusions


  18. References
    1. Muhlbaier, M. & Polikar, R.
    Multiple classifiers based incremental learning algorithm for learning nonstationary environments
    IEEE International Conference on Machine Learning and Cybernetics, 2007, 3618-3623
    2. Bifet, A.; Holmes, G.; Pfahringer, B.; Kirkby, R. & Gavaldà, R.
    New Ensemble Methods for Evolving Data Streams
    ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009
    3. Kolter, J. & Maloof, M.
    Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
    Journal of Machine Learning Research, 2007, 8, 2755-2790
    4. Widmer, G. & Kubat, M.
    Learning in the presence of concept drift and hidden contexts
    Machine Learning, 1996, 23, 69-101
    5. Bifet, A.
    Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams
    Frontiers in Artificial Intelligence and Applications, 2010
    6. Gama, J.; Medas, P.; Castillo, G. & Rodrigues, P.
    Learning with Drift Detection
    Lecture Notes in Computer Science, 2004, 3741, 286-295
    7. Baena-García, M.; del Campo-Ávila, J.; Fidalgo, R.; Bifet, A.; Gavaldà, R. & Morales-Bueno, R.
    Early Drift Detection Method
    International Workshop on Knowledge Discovery from Data Streams, 2006
    8. Alippi, C. & Roveri, M.
    Just-in-Time Adaptive Classifiers--Part I: Detecting Nonstationary Changes
    IEEE Transactions on Neural Networks, 2008, 19, 1145-1153
    9. Alippi, C. & Roveri, M.
    Just-in-Time Adaptive Classifiers--Part II: Designing the Classifier
    IEEE Transactions on Neural Networks, 2008, 19, 2053-2064
    10. Žliobaitė, I.
    Combining similarity in time and space for training set formulation under concept drift
    Intelligent Data Analysis, 2010, 14, 4, to appear
    11. Žliobaitė, I.
    Learning under Concept Drift: An Overview
    Vilnius University, Technical Report, 2009


  19. This material is based upon work supported by the
    National Science Foundation under Grant No ECCS-0926159.
    National Science Foundation
    “America’s investment in the future.”
