Slide 1

An Analysis of Active Learning Strategies for Sequence Labeling Tasks
Burr Settles and Mark Craven, University of Wisconsin
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, Honolulu, October 2008. © 2008 Association for Computational Linguistics

Slide 2

Abstract
• Aims to shed light on the best active learning approaches for sequence labeling tasks
• Surveys previous query selection strategies for sequence models and performs a large-scale empirical comparison
• Proposes several novel algorithms

Slide 3

Introduction
• Unlabeled data may be readily available, but obtaining training labels can be expensive
• Active learning: select which instances to label and add to the training set
• Example tasks: part-of-speech tagging, information extraction, document segmentation
• Most prior attention in active learning has been given to classification tasks
• Linear-chain CRFs (conditional random fields) are used for the sequence labeling experiments

Slide 4

Active Learning with Sequence Models
• Instance x, drawn from an unlabeled pool U
• Labeled training set L
• Query strategy φ(x): determines how informative each instance is
• x*: the most informative instance according to some φ(x)
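
The greedy query rule implied by these definitions (the equation on the slide did not survive the text export; this is the standard pool-based formulation):

\[
x^{*} \;=\; \operatorname*{argmax}_{x \in \mathcal{U}} \; \phi(x)
\]

That is, the learner repeatedly asks an annotator to label the sequence in U that maximizes φ, adds it to L, and retrains the model.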

Slide 5

Active Learning with Sequence Models

Slide 6

Query Strategies: (1) Uncertainty Sampling
1) Uncertainty sampling: select the instance the model is most uncertain about how to label
• Least confidence (LC): y* is the Viterbi parse, θ the model parameters
• Other uncertainty methods are based on entropy
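
The least-confidence score referenced above (its equation was an image on the slide; this follows the paper's standard formulation):

\[
\phi^{LC}(x) \;=\; 1 - P(y^{*} \mid x; \theta)
\]

where y* is the most likely (Viterbi) labeling of x under the current parameters θ.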

Slide 7

Query Strategies: (1) Uncertainty Sampling
1.1 Token (label) entropy (TE): the entropy of the model's posteriors over its labeling
• P(y_t = m): marginal probability that m is the label at position t in the sequence; T: sequence length
• The 1/T factor normalizes for length, which discourages querying long sequences
1.2 Total token entropy (TTE) [proposed]: drops the normalization, so long sequences are queried when they carry more information
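
A reconstruction of the two entropy measures (the slide's equations were images; this is the usual formulation):

\[
\phi^{TE}(x) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} P_{\theta}(y_t = m) \log P_{\theta}(y_t = m),
\qquad
\phi^{TTE}(x) \;=\; T \cdot \phi^{TE}(x)
\]

where M is the number of possible labels.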

Slide 8

Query Strategies: (1) Uncertainty Sampling
1.3 Sequence entropy (SE): considers the entropy of the label sequence y as a whole; the number of possible labelings grows exponentially with the length of x
1.4 N-best sequence entropy (NSE): restricts attention to N = {y_1, . . . , y_N}, the set of the N most likely parses
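
The sequence entropy referenced above, in its standard form (a reconstruction; the slide's equation was an image):

\[
\phi^{SE}(x) \;=\; -\sum_{\hat{y}} P(\hat{y} \mid x; \theta) \log P(\hat{y} \mid x; \theta)
\]

where ŷ ranges over all possible labelings of x; N-best SE approximates this by summing only over the N most likely parses (typically with the probabilities renormalized over that set).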

Slide 9

2. Query-By-Committee
• A committee of models C = {θ^(1), . . . , θ^(C)} represents C different hypotheses
• The most informative query: the instance over which the committee is in most disagreement about how to label
2.1 Vote entropy (VE)
• V(y_t, m) is the number of "votes" label m receives from all the committee members' Viterbi labelings at sequence position t
• Other variants: (2.2) Sequence VE [proposed], (2.3) Kullback-Leibler (KL) divergence, (2.4) Sequence KL [proposed]
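
A reconstruction of the vote entropy measure (standard formulation; the slide's equation was an image):

\[
\phi^{VE}(x) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \sum_{m} \frac{V(y_t, m)}{C} \log \frac{V(y_t, m)}{C}
\]

where C is the number of committee members.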

Slide 10

3. Expected Gradient Length (EGL)
• Query the instance that would provide the greatest change to the current model if we knew its label
• The change is measured by the gradient of the log-likelihood with respect to the model parameters, i.e., the new gradient that would be obtained by adding the training tuple (x, y) to L
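
A sketch of the EGL criterion (a reconstruction following the usual formulation; the slide's equation was an image):

\[
\phi^{EGL}(x) \;=\; \sum_{\hat{y}} P(\hat{y} \mid x; \theta)\, \bigl\| \nabla \ell\bigl(\mathcal{L} \cup \langle x, \hat{y} \rangle; \theta\bigr) \bigr\|
\]

i.e., the expected norm of the new training-set gradient, with the expectation taken over the model's current posterior over labelings (in practice approximated with the N-best parses).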

Slide 11

Problems with Existing Methods
• There are suggestions that uncertainty sampling, QBC, and EGL are prone to querying outliers
• e.g., the least certain instance lies near the classification boundary but is not "representative" of other instances in the distribution

Slide 12

4. Information Density
• The informativeness of x is weighted by its average similarity to all other sequences in U
• A parameter β controls the relative importance of the density term
• Sequence entropy (SE) measures the "base" informativeness
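
The information density formula described above, reconstructed in its standard form (the slide's equation was an image):

\[
\phi^{ID}(x) \;=\; \phi^{SE}(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}\bigl(x, x^{(u)}\bigr) \right)^{\beta}
\]

where U here denotes the number of sequences in the unlabeled pool.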

Slide 13

4. Information Density: Density Measure
1. Simplify each sequence of feature tokens into a single kernel vector; f_j(x_t) is the value of feature f_j for token x_t
2. Apply cosine similarity to the simplified representation
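
A sketch of this density measure (a reconstruction; the exact aggregation shown on the slide was an image):

\[
\bar{x} \;=\; \Bigl[\, \textstyle\sum_{t} f_1(x_t),\; \ldots,\; \textstyle\sum_{t} f_J(x_t) \,\Bigr],
\qquad
\mathrm{sim}_{\cos}\bigl(x, x^{(u)}\bigr) \;=\; \frac{\bar{x} \cdot \bar{x}^{(u)}}{\|\bar{x}\|\,\|\bar{x}^{(u)}\|}
\]

where the kernel vector x̄ aggregates each of the J features over the tokens of the sequence.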

Slide 14

5. Fisher Information
• A query selection strategy for sequence models based on Fisher information [Zhang and Oles (2000)]
• The Fisher information matrix I(θ) represents the overall uncertainty about the estimated model parameters
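
A sketch of the Fisher information ratio criterion of Zhang and Oles (2000), which the paper adapts to sequence models (this is a reconstruction of the general form, not necessarily the paper's exact approximation):

\[
\phi^{FIR}(x) \;=\; -\,\mathrm{tr}\!\left( \mathcal{I}_x(\theta)^{-1}\, \mathcal{I}_{\mathcal{U}}(\theta) \right)
\]

where I_x(θ) is the Fisher information contributed by a single candidate sequence and I_U(θ) is that of the whole unlabeled pool; querying the maximizer of φ most reduces the overall uncertainty about the parameters.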

Slide 15

Evaluation
- Fifteen query selection strategies
- CRF models
- Eight data sets

Slide 16

Evaluation Settings
• Baselines
  • Random instance selection (i.e., passive learning)
  • Naively querying the longest sequence in terms of tokens
• Features include words, orthographic patterns, part-of-speech tags, lexicons, etc.
• N-best approximation: N = 15
• QBC methods: committee size C = 3
• Information density: β = 1 (i.e., the information and density terms have equal weight)
• L initialized with five randomly selected labeled instances
• 150 queries are selected from U in batches of size B = 5
• Five-fold cross-validation

Slide 17

Results
• Reported measure: area under the F1 curve (maximum possible score: 150)
• Best method shown inside a rectangle; second best underlined; third best in bold

Slide 18

Discussion
1. No single best method among the strategies
  • Information density (ID) performs well for the most part
  • It is effective on large corpora, does not perform poorly anywhere, and has the highest overall AUC
2. Grouped by type of informativeness measure
  • Among uncertainty sampling methods, sequence entropy (SE) and least confidence (LC) perform best
  • Among QBC methods, sequence vote entropy (SVE) is the best
  • These three measures are also the best choices as base informativeness measures for information density (ID)

Slide 19

Discussion
3. Query strategies that evaluate the entire sequence (SE, SVE, SKL) perform better than those that aggregate token-level information
4. Fisher information: unpredictable results

Slide 20

Learning Curves

Slide 21

Conclusion
• Presented an analysis of active learning for sequence labeling tasks
• Proposed several novel strategies to address shortcomings of previous work
• Conducted a large-scale empirical evaluation and showed which methods advance the state of active learning
• These methods include information density (recommended), sequence vote entropy, and Fisher information