• Training labels can be expensive to obtain
• Active learning selects which instances to label and add to the training set
• Example tasks: part-of-speech tagging, information extraction, document segmentation
• Most active learning attention has been given to classification tasks rather than sequence labeling
• Linear-chain CRFs (conditional random fields) are used for the sequence labeling experiments
• Pool-based setting: a large pool of unlabeled sequences U and a set of labeled examples L
• A query strategy φ(x) determines how informative each instance is
• x* is the most informative instance in U according to some φ(x); it is labeled and added to L
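A minimal sketch of the pool-based loop this slide describes; `train`, `phi`, and `oracle` are hypothetical callbacks, not names from the paper:

```python
def pool_based_active_learning(U, L, train, phi, oracle, num_batches, batch_size=1):
    """Generic pool-based active learning loop (illustrative sketch).

    U: unlabeled pool (list), L: labeled set (list of (x, y) pairs),
    train(L): fits a model on L, phi(model, x): informativeness score,
    oracle(x): returns the true labeling of x.
    """
    model = train(L)
    for _ in range(num_batches):
        # Rank the pool by informativeness and query the top instances.
        ranked = sorted(U, key=lambda x: phi(model, x), reverse=True)
        for x_star in ranked[:batch_size]:
            L.append((x_star, oracle(x_star)))   # annotator labels x*
            U.remove(x_star)
        model = train(L)                         # retrain after each batch
    return model
```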
Uncertainty sampling: query the instance the model is most uncertain how to label
• Least confidence (LC): φ_LC(x) = 1 − P(y* | x; θ), where y* is the Viterbi (most likely) parse and θ are the model parameters
• The other uncertainty measures are based on entropy
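A one-line sketch of LC, assuming the probability of the Viterbi parse is already available (e.g., from a CRF decoder):

```python
def least_confidence(viterbi_prob):
    """phi_LC(x) = 1 - P(y* | x; theta).

    viterbi_prob: probability the model assigns to its own most
    likely (Viterbi) labeling of x.
    """
    return 1.0 - viterbi_prob

# A parse the model is 95% sure of makes a poor query (score 0.05);
# one it is only 40% sure of is far more informative (score 0.60).
```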
• Token entropy (TE): the entropy of the model’s posteriors over its labeling, φ_TE(x) = −(1/T) Σ_t Σ_m P(y_t = m) log P(y_t = m)
• P(y_t = m) is the marginal probability that m is the label at position t in the sequence; T is the sequence length
• The 1/T term normalizes for length, avoiding a bias toward querying long sequences
• 1.2 Total token entropy (TTE) [proposed]: drop the 1/T normalizer and query longer sequences if they contain more information
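A sketch of both variants, assuming per-token marginals are available as a (T, M) array (e.g., from forward-backward inference):

```python
import numpy as np

def token_entropy(marginals, normalize=True):
    """TE / TTE over per-token posteriors.

    marginals: array of shape (T, M) with marginals[t, m] = P(y_t = m).
    normalize=True divides by T (TE); normalize=False is TTE, which
    lets longer, more informative sequences score higher.
    """
    eps = 1e-12                                   # guard against log(0)
    H = -np.sum(marginals * np.log(marginals + eps))
    return H / len(marginals) if normalize else H

# Uniform marginals over 3 labels for a 4-token sequence:
m = np.full((4, 3), 1 / 3)
print(token_entropy(m))                   # ~log(3) per token
print(token_entropy(m, normalize=False))  # ~4 * log(3) in total
```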
• Sequence entropy (SE): the entropy of the label sequence y as a whole, φ_SE(x) = −Σ_y P(y | x; θ) log P(y | x; θ), summing over all possible labelings
• Computing this exactly is intractable: the number of possible labelings grows exponentially with the length of x
• 1.4 N-best SE: approximate the sum using N = {y1, . . . , yN}, the set of the N most likely parses, with their probabilities renormalized
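A sketch of the N-best approximation, assuming the decoder supplies the probabilities of the N most likely parses:

```python
import numpy as np

def nbest_sequence_entropy(nbest_probs):
    """Sequence entropy approximated over the N-best parses.

    nbest_probs: P(y_hat | x; theta) for the N most likely labelings.
    The probabilities are renormalized over the N-best set, since
    summing over all possible labelings is intractable.
    """
    p = np.asarray(nbest_probs, dtype=float)
    p = p / p.sum()                         # renormalize over N-best
    return -np.sum(p * np.log(p + 1e-12))

print(nbest_sequence_entropy([0.5, 0.3, 0.1]))  # N = 3 example
```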
Query-by-committee (QBC): a committee C = {θ(1), . . . , θ(C)} represents C different hypotheses
• The most informative query: the instance over which the committee disagrees most about how to label
• 2.1 Vote entropy (VE): φ_VE(x) = −(1/T) Σ_t Σ_m (V(y_t, m)/C) log (V(y_t, m)/C), where V(y_t, m) is the number of “votes” label m receives from all the committee members’ Viterbi labelings at sequence position t
• Other variants: (2.2) Sequence VE [proposed], (2.3) Kullback-Leibler (KL) divergence, (2.4) Sequence KL [proposed]
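A sketch of token-level vote entropy, assuming each committee member’s Viterbi labeling is given as a list of integer label ids:

```python
import numpy as np

def vote_entropy(committee_labelings, num_labels):
    """phi_VE(x): average per-position entropy of the vote distribution.

    committee_labelings: C Viterbi labelings, each of length T.
    V(y_t, m) is the number of committee votes for label m at position t.
    """
    C = len(committee_labelings)
    T = len(committee_labelings[0])
    H = 0.0
    for t in range(T):
        votes = np.bincount([y[t] for y in committee_labelings],
                            minlength=num_labels)
        p = votes / C                 # vote distribution at position t
        p = p[p > 0]                  # skip labels with zero votes
        H -= np.sum(p * np.log(p))
    return H / T

# Committee of C = 3 labelings for a 4-token sequence; positions where
# all members agree (t = 0 and t = 3) contribute zero entropy.
committee = [[0, 1, 2, 1], [0, 1, 1, 1], [0, 2, 2, 1]]
print(vote_entropy(committee, num_labels=3))
```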
Expected gradient length (EGL): query the instance that would provide the greatest change to the current model if we knew its label
• Change is measured by the gradient of the log likelihood with respect to the model parameters
• Specifically, the norm of the new gradient that would be obtained by adding the training tuple (x, y) to L, taken in expectation over the model’s belief about y
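A sketch of EGL over the N-best labelings; `grad_fn` is a hypothetical callback (closing over x) that returns the log-likelihood gradient for one candidate labeling:

```python
import numpy as np

def expected_gradient_length(nbest, grad_fn):
    """phi_EGL(x): expected gradient norm over the N-best labelings.

    nbest: (labeling, probability) pairs; probabilities renormalized.
    grad_fn(y_hat): gradient of the log likelihood w.r.t. the model
    parameters for the tuple (x, y_hat). Near convergence the gradient
    over L is ~0, so this single tuple approximates the new gradient.
    """
    total = sum(p for _, p in nbest)
    return sum((p / total) * np.linalg.norm(grad_fn(y_hat))
               for y_hat, p in nbest)
```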
Uncertainty sampling, QBC, and EGL are prone to querying outliers
• e.g., the least certain instance lies on the classification boundary, but is not “representative” of other instances in the distribution
Information density (ID): weight the informativeness of x by its average similarity to all other sequences in U
• φ_ID(x) = φ_SE(x) × ((1/U) Σ_u sim(x, x(u)))^β
• The parameter β controls the relative importance of the density term
• Sequence entropy (SE) measures the “base” informativeness
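A sketch of ID given precomputed base scores and a pairwise similarity matrix over the pool; how `sim` is defined is model-specific and left open here:

```python
import numpy as np

def information_density(base_scores, sim_matrix, beta=1.0):
    """phi_ID(x) = phi_SE(x) * (average similarity of x to U)^beta.

    base_scores: base informativeness (e.g., sequence entropy) per
    pool sequence. sim_matrix[i, j]: similarity between sequences i, j.
    beta = 0 recovers the base measure; larger beta favors instances
    that are representative of dense regions of the pool.
    """
    density = sim_matrix.mean(axis=1) ** beta
    return np.asarray(base_scores) * density

# Three pool sequences: the second is both informative and central.
sims = np.array([[1.0, 0.2, 0.1],
                 [0.2, 1.0, 0.8],
                 [0.1, 0.8, 1.0]])
print(information_density([0.9, 0.8, 0.3], sims, beta=1.0))
```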
• Based on Fisher information [Zhang and Oles (2000)]
• The Fisher information I(θ) represents the overall uncertainty about the estimated model parameters
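One way to make this concrete is a Fisher information ratio with a diagonal approximation of I(θ), which keeps the computation tractable for models with many parameters; this sketch assumes the diagonals are already estimated, and the smoothing constant is an assumption:

```python
import numpy as np

def fisher_information_ratio(I_x_diag, I_U_diag, eps=1e-6):
    """phi_FIR(x) = -tr(I_x(theta)^-1 I_U(theta)), diagonal form.

    I_x_diag: diagonal of the Fisher information estimated from x alone.
    I_U_diag: diagonal estimated from the whole pool U.
    eps: hypothetical smoothing term to keep the division stable.
    Maximizing phi_FIR prefers instances whose information "covers"
    the pool's overall parameter uncertainty.
    """
    return -np.sum(I_U_diag / (I_x_diag + eps))
```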
• Baselines: random selection (passive learning), and naively querying the longest sequence in terms of tokens
• Features include words, orthographic patterns, part-of-speech tags, lexicons, etc.
• N-best approximation with N = 15
• QBC methods use a committee of size C = 3
• For information density, β = 1 (i.e., the information and density terms have equal weight)
• L is initialized with five random labeled instances
• 150 queries are selected from U in batches of size B = 5
• Five-fold cross-validation
1. Information density (ID) performs well for the most part
• Effective on large corpora, never performs poorly, and has the highest AUC
2. Grouped by informativeness-measure type
• Among uncertainty sampling: sequence entropy (SE) and least confidence (LC) perform best
• Among QBC: sequence vote entropy (SVE) is the best
• These three measures are the best choices as base informativeness measures for information density (ID)
• Proposed several novel strategies to address shortcomings of previous work
• Conducted a large-scale empirical evaluation and showed which methods advance the state of active learning
• The best-performing methods include information density (recommended), sequence vote entropy, and Fisher information