
An Analysis of Active Learning Strategies for Sequence Labeling Tasks

Yemane
November 30, 2016

An Analysis of Active Learning Strategies for Sequence Labeling Tasks

Burr Settles and Mark Craven
University of Wisconsin

Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079,
Honolulu, October 2008.
© 2008 Association for Computational Linguistics

Transcript

  1. An Analysis of Active Learning Strategies for Sequence Labeling Tasks

    Burr Settles and Mark Craven, University of Wisconsin
    Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, Honolulu, October 2008
  2. Abstract

    • Aims to shed light on the best active learning approaches for sequence labeling tasks
    • Surveys previous query selection strategies for sequence models and performs a large-scale empirical comparison
    • Proposes several novel algorithms
  3. Introduction

    • Unlabeled data may be readily available, but obtaining training labels can be expensive
    • Active learning: select which instances to label and add to the training set
    • Example tasks: part-of-speech tagging, information extraction, document segmentation
    • Most prior active learning work has focused on classification tasks rather than sequence labeling
    • Linear-chain conditional random fields (CRFs) are used for the sequence labeling experiments
  4. Active Learning with Sequence Models

    • Instance x drawn from an unlabeled pool U
    • Set of labeled examples L
    • Query strategy φ(x) determines how informative each instance is
    • x*: the most informative instance according to some φ(x)
  5. Query Strategies: (1) Uncertainty Sampling

    • Select the instance whose labeling the model is most uncertain about
    • Least confidence (LC): φ_LC(x) = 1 − P(y* | x; θ), where y* is the Viterbi parse and θ the model parameters
    • The other uncertainty-sampling methods are based on entropy
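A minimal sketch of the LC score, assuming the model can report the probability of its Viterbi parse (the function name is mine, not from the paper):

```python
def least_confidence(posterior_best):
    """Least-confidence score: 1 minus the probability the model
    assigns to its most likely (Viterbi) labeling of the sequence."""
    return 1.0 - posterior_best

# Toy example: the model gives probability 0.6 to its best parse.
score = least_confidence(0.6)
# A sequence whose best parse has lower probability scores higher,
# i.e. the model is less confident and the instance more informative.
```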
  6. Query Strategies: (1) Uncertainty Sampling

    1.1 Token (label) entropy (TE): the entropy of the model's posteriors over its labeling, where P(y_t = m) is the marginal probability that m is the label at position t in the sequence and T is the sequence length; the 1/T factor normalizes away the bias toward querying long sequences
    1.2 Total token entropy (TTE) [proposed]: drops the 1/T normalization, so longer sequences are queried when they contain more information
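The TE/TTE pair can be sketched from per-token marginals; the `normalize` flag switches between the two (a toy illustration, not the paper's implementation):

```python
import math

def token_entropy(marginals, normalize=True):
    """Sum of per-token label entropies.  marginals[t][m] = P(y_t = m | x).
    With normalize=True this is token entropy (TE); without the 1/T
    factor it is total token entropy (TTE), which favors long sequences."""
    total = 0.0
    for token_dist in marginals:
        total -= sum(p * math.log(p) for p in token_dist if p > 0)
    return total / len(marginals) if normalize else total

# Two-token toy sequence over three labels.
marginals = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
te = token_entropy(marginals)                    # length-normalized
tte = token_entropy(marginals, normalize=False)  # grows with length
```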
  7. Query Strategies: (1) Uncertainty Sampling

    1.3 Sequence entropy (SE): considers the entropy of the label sequence y as a whole; the number of possible labelings grows exponentially with the length of x
    1.4 N-best SE: approximates SE using N = {y1, . . . , yN}, the set of the N most likely parses
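The N-best approximation above can be sketched by renormalizing the probabilities of the N most likely parses and taking their entropy (a simplified illustration):

```python
import math

def nbest_sequence_entropy(nbest_probs):
    """Approximate sequence entropy over only the N most likely parses.
    The parse probabilities are renormalized to sum to one, sidestepping
    the exponentially large space of all possible labelings."""
    z = sum(nbest_probs)
    return -sum((p / z) * math.log(p / z) for p in nbest_probs if p > 0)

# Toy N = 3 case: probabilities of the three best parses.
se = nbest_sequence_entropy([0.5, 0.3, 0.1])
```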
  8. Query Strategies: (2) Query-By-Committee (QBC)

    • A committee of models C = {θ(1), . . . , θ(C)} represents C different hypotheses
    • The most informative query is the instance over which the committee disagrees most about how to label
    2.1 Vote entropy (VE): V(y_t, m) is the number of "votes" label m receives across the committee members' Viterbi labelings at sequence position t
    • Other variants: (2.2) Sequence VE [proposed], (2.3) Kullback–Leibler (KL) divergence, (2.4) Sequence KL [proposed]
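Token-level vote entropy can be sketched directly from the committee members' Viterbi labelings (a toy illustration):

```python
import math
from collections import Counter

def vote_entropy(committee_labelings):
    """Average per-position entropy of the committee's label votes.
    committee_labelings[c][t] is the label that committee member c
    assigns to position t of the sequence."""
    C = len(committee_labelings)
    T = len(committee_labelings[0])
    total = 0.0
    for t in range(T):
        votes = Counter(member[t] for member in committee_labelings)
        total -= sum((v / C) * math.log(v / C) for v in votes.values())
    return total / T

# Three committee members agree on the first token, split on the second.
labelings = [["B", "I"], ["B", "O"], ["B", "I"]]
score = vote_entropy(labelings)
```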
  9. Query Strategies: (3) Expected Gradient Length (EGL)

    • Query the instance that would cause the greatest change to the current model if its label were known
    • Measured via the gradient of the log-likelihood with respect to the model parameters: the new gradient that would be obtained by adding the training tuple (x, y) to L
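Since the true label is unknown, EGL marginalizes the gradient norm over the model's posterior. A toy numeric sketch (the gradient norms here are hypothetical placeholders, not values from any real model):

```python
def expected_gradient_length(label_probs, gradient_norms):
    """EGL sketch: expected norm of the training gradient, with the
    unknown labeling marginalized over the model's posterior.
    label_probs[i]   = P(y_i | x) for candidate labeling y_i
    gradient_norms[i] = norm of the gradient if (x, y_i) were added to L."""
    return sum(p * g for p, g in zip(label_probs, gradient_norms))

# Hypothetical posterior over three candidate labelings, and the
# gradient norm each would induce on the current model.
egl = expected_gradient_length([0.6, 0.3, 0.1], [0.8, 2.0, 3.5])
```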
  10. Problems with Existing Methods

    • Uncertainty sampling, QBC, and EGL have all been suggested to be prone to querying outliers
    • E.g., the least certain instance may lie on the classification boundary but not be "representative" of other instances in the distribution
  11. Query Strategies: (4) Information Density

    • The informativeness of x is weighted by its average similarity to all other sequences in U
    • A parameter β controls the relative importance of the density term
    • Sequence entropy (SE) measures the "base" informativeness
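The weighting scheme above can be sketched as base informativeness times the average pool similarity raised to β (a simplified illustration; the similarity values here are placeholders):

```python
def information_density(base_info, similarities, beta=1.0):
    """Information density: base informativeness (e.g. sequence entropy)
    weighted by the instance's average similarity to the unlabeled pool,
    raised to the power beta."""
    avg_sim = sum(similarities) / len(similarities)
    return base_info * (avg_sim ** beta)

# An uncertain outlier (high entropy, low density) can score below a
# slightly less uncertain but representative instance.
outlier = information_density(1.2, [0.1, 0.2, 0.1])
typical = information_density(1.0, [0.8, 0.7, 0.9])
```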
  12. (4) Information Density: Density Measure

    1. Simplify the token features into a single kernel vector, where fj(x_t) is the value of feature fj for token x_t
    2. Apply cosine similarity to the simplified representation
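The two steps can be sketched as follows, assuming each token is already a numeric feature vector (the aggregation by summing is one plausible reading of the simplification):

```python
import math

def sequence_vector(token_features):
    """Collapse per-token feature vectors into one vector per sequence
    by summing feature values across tokens."""
    dim = len(token_features[0])
    return [sum(tok[j] for tok in token_features) for j in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two sequence-level feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```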
  13. Query Strategies: (5) Fisher Information

    • A query selection strategy for sequence models based on Fisher information [Zhang and Oles (2000)]
    • The Fisher information I(θ) represents the overall uncertainty about the estimated model parameters
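As intuition only, here is the textbook one-parameter case: the Fisher information of a Bernoulli parameter. This is not the CRF formulation used in the paper, just an illustration of the quantity involved:

```python
def bernoulli_fisher_information(theta, n=1):
    """Fisher information of a Bernoulli parameter from n observations:
    I(theta) = n / (theta * (1 - theta)).  Higher information means an
    observation constrains the parameter estimate more tightly."""
    return n / (theta * (1.0 - theta))

# Information is smallest at theta = 0.5 and grows toward the extremes.
i_mid = bernoulli_fisher_information(0.5)
i_edge = bernoulli_fisher_information(0.1)
```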
  14. Evaluation Settings

    • Baselines: random instance selection (i.e., passive learning) and naively querying the longest sequence in terms of tokens
    • Features include words, orthographic patterns, part-of-speech tags, lexicons, etc.
    • N-best approximation: N = 15
    • QBC methods: committee size C = 3
    • Information density: β = 1 (i.e., the information and density terms have equal weight)
    • L initialized with five random labeled instances
    • 150 queries selected from U in batches of size B = 5
    • Five-fold cross-validation
  15. Results

    • Reported metric: area under the F1 curve over the 150 queries (maximum score 150)
    • Best method: boxed; second best: underlined; third best: bold
  16. Discussion

    1. No single strategy is best across the board
       • Information density (ID) performs well for the most part: it is effective on large corpora, never performs poorly, and has the highest overall AUC
    2. By type of informativeness measure:
       • Among uncertainty-sampling methods, sequence entropy (SE) and least confidence (LC) perform best
       • Among QBC methods, sequence vote entropy (SVE) is best
       • These three measures are the best choices as base information measures for information density (ID)
  17. Discussion

    3. Query strategies that evaluate the entire sequence (SE, SVE, SKL) outperform those that aggregate token-level information
    4. Fisher information gives unpredictable results
  18. Conclusion

    • Presented an analysis of active learning for sequence labeling tasks
    • Proposed several novel strategies that address shortcomings of previous work
    • Conducted a large-scale empirical evaluation and showed which methods advance the state of the art in active learning
    • These methods include information density (recommended), sequence vote entropy, and Fisher information