(Re)connecting Complex Lexical Data: Updating the Historical Thesaurus of English

Fraser Dallachy, Brian Aitken, and Marc Alexander
University of Glasgow

Updating the Historical Thesaurus
• New data produced through OED3 research
  ‣ New headwords/senses
  ‣ More accurate dating of senses
• HT project provides categorisation expertise for new data

The Problem
• Primary key fields created after original data sharing between HT and OED
• On most occasions it should be safe to assume that an OED thesaurus category will have an HT counterpart
• Categories should exist in the same place in the hierarchical structure
• However, the databases have been adjusted by the HT and OED teams separately and have thus diverged

oed.com/thesaurus

The Problem
• Lexical data largely consists of text strings
  ‣ Lexemes (i.e. words)
  ‣ Quasi-text category numbers (e.g. 01.02.05)
  ‣ Labels (e.g. dict, arch)
• Some associated numerical data
  ‣ Attestation dates

The Problem
• Even minor differences between text strings can stymie automatic matching

Some Numbers
• In OED data:
  ‣ 237,734 categories
  ‣ 229,325 with part of speech and not empty
  ‣ 715,546 lexemes

Techniques
• Link categories first, then lexemes
• Adjustment of HT v1 category numbering to match expected OED category numbering
• Check categories with same numbers are correct matches

Category Matching
• ‘Stripped’ form of category headings
  ‣ Remove punctuation, incl. spaces
  ‣ Replace ‘/’ with ‘or’
  ‣ List of variant forms
• Lexeme count + ‘stripped’ last word
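The stripping step on this slide could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `strip_heading` and `count_and_last_word_key` are hypothetical, and lower-casing the heading is an assumption that the slide does not state.

```python
def strip_heading(heading: str) -> str:
    """Return a 'stripped' heading: '/' becomes 'or', punctuation and spaces go.

    Lower-casing is an assumption added for this sketch.
    """
    text = heading.lower().replace("/", " or ")
    # Keep only letters and digits, which drops punctuation and whitespace.
    return "".join(ch for ch in text if ch.isalnum())


def count_and_last_word_key(heading: str, lexeme_count: int) -> tuple:
    """Fallback key: number of lexemes plus the stripped last word of the heading."""
    words = heading.replace("/", " or ").split()
    last_word = strip_heading(words[-1]) if words else ""
    return (lexeme_count, last_word)


if __name__ == "__main__":
    print(strip_heading("Love/affection"))
    print(count_and_last_word_key("Types of dog", 12))
```

Two headings are treated as candidates for a match when their stripped forms (or their fallback keys) are equal, so small differences in punctuation or spacing no longer block the comparison.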

Pattern Matching
• Uses known variations between HT and OED expressions
  ‣ Pertaining to -> relating to
  ‣ Inhabitant of -> inhabitant
  ‣ spec. -> specific
  ‣ -ize -> -ise
  ‣ January -> jan.
  ‣ Remove ‘a ’ and ‘to ’ at beginning of string
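One way to apply the variations listed on this slide is a small table of regular-expression rewrites run over both datasets before comparison. The sketch below is an assumption about how that might look: the `RULES` table and the `normalise` function are hypothetical names, the rules are applied to lower-cased text, and the real rule list is presumably longer.

```python
import re

# (pattern, replacement) pairs taken from the slide, applied to lower-cased text.
RULES = [
    (r"\bpertaining to\b", "relating to"),
    (r"\binhabitant of\b", "inhabitant"),
    (r"\bspec\.", "specific"),
    (r"ize\b", "ise"),  # word-final -ize only; crude, it also rewrites 'size'
    (r"\bjanuary\b", "jan."),
]


def normalise(heading: str) -> str:
    """Rewrite a heading into one canonical form using the known variations."""
    text = heading.lower().strip()
    # Remove a leading 'a ' or 'to ' (article or infinitive marker).
    text = re.sub(r"^(a|to) ", "", text)
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    print(normalise("Pertaining to January"))
    print(normalise("To organize"))
```

Because the same function is run over both the HT and the OED heading, an over-eager rule changes both sides in the same way; the main risk is an occasional false match, which is why results from a step like this would still need checking.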

Sibling Matching
• If a category is unmatched…
• Is the parent category matched?
• If yes…
• Find unmatched siblings belonging to same parent…
• Look for category heading amongst these
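The sibling-matching steps on this slide can be sketched as a single lookup. Everything about the data layout here is assumed for illustration: categories are plain dicts with hypothetical `id`, `parent` and `heading` keys, `oed_to_ht` maps already-matched OED ids to HT ids, and headings are compared after lower-casing by default.

```python
def sibling_match(oed_cat, oed_to_ht, ht_children, ht_matched, key=str.lower):
    """Return the HT id matching oed_cat via its matched parent, or None.

    oed_cat     -- dict with 'id', 'parent' and 'heading' (hypothetical layout)
    oed_to_ht   -- dict mapping matched OED category ids to HT category ids
    ht_children -- dict mapping an HT parent id to a list of child dicts
    ht_matched  -- set of HT ids that already have an OED counterpart
    key         -- normalisation applied to headings before comparison
    """
    ht_parent = oed_to_ht.get(oed_cat["parent"])
    if ht_parent is None:
        return None  # parent is unmatched, so there are no siblings to search
    candidates = [
        child["id"]
        for child in ht_children.get(ht_parent, [])
        if child["id"] not in ht_matched
        and key(child["heading"]) == key(oed_cat["heading"])
    ]
    # Accept only an unambiguous hit; otherwise leave it for manual checking.
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    children = {"h1": [{"id": "h2", "heading": "Wild dog"},
                       {"id": "h3", "heading": "Domestic dog"}]}
    cat = {"id": "o2", "parent": "o1", "heading": "domestic dog"}
    print(sibling_match(cat, {"o1": "h1"}, children, ht_matched={"h2"}))
```

Restricting the search to unmatched children of one matched parent keeps the candidate set small, which is what makes a looser heading comparison reasonably safe at this stage.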

Dating Profile

Dating Profile
• Select date of first attestation for each lexeme in potentially matched categories
• Build list of these for OED and HT categories
• Compare
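The slide does not say how the two date lists are compared, so the following is only one plausible sketch: treat each category's first-attestation dates as a multiset and measure how much of the larger profile is shared. The function name `profile_similarity` and the scoring formula are assumptions made for this example.

```python
from collections import Counter


def profile_similarity(oed_dates, ht_dates):
    """Share of first-attestation dates common to both categories (0.0 to 1.0).

    Each argument is a list of first-attestation years, one per lexeme.
    Dates are compared as multisets, so repeated years count separately.
    """
    oed_profile, ht_profile = Counter(oed_dates), Counter(ht_dates)
    shared = sum((oed_profile & ht_profile).values())  # multiset intersection
    largest = max(sum(oed_profile.values()), sum(ht_profile.values()))
    return shared / largest if largest else 0.0


if __name__ == "__main__":
    print(profile_similarity([1300, 1450, 1450, 1820], [1300, 1450, 1450, 1820]))
    print(profile_similarity([1300, 1450, 1820], [1300, 1450, 1900, 1950]))
```

A score near 1.0 suggests the two categories hold the same set of words even when no heading or lexeme text is compared, which fits the later observation that date information is useful on its own.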

Monosemous Pairing
• Search OED data for unmatched lexemes which only appear once (single sense or single unmatched sense)
• Do the same for HT data
• Look for matching single-sense item in both datasets
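Monosemous pairing as described on this slide might be sketched like this. The input layout, lists of `(sense_id, lexeme)` tuples for the still-unmatched senses on each side, and the function name `monosemous_pairs` are assumptions for illustration.

```python
from collections import Counter


def monosemous_pairs(oed_unmatched, ht_unmatched):
    """Pair lexemes that occur exactly once among the unmatched senses of each dataset.

    Each argument is a list of (sense_id, lexeme) tuples.
    Returns a sorted list of (oed_sense_id, ht_sense_id) pairs.
    """
    def singles(rows):
        counts = Counter(lexeme for _, lexeme in rows)
        # Keep only lexemes with a single unmatched sense.
        return {lexeme: sense_id for sense_id, lexeme in rows if counts[lexeme] == 1}

    oed_single = singles(oed_unmatched)
    ht_single = singles(ht_unmatched)
    shared = oed_single.keys() & ht_single.keys()
    return sorted((oed_single[lexeme], ht_single[lexeme]) for lexeme in shared)


if __name__ == "__main__":
    oed = [(1, "aardvark"), (2, "set"), (3, "set")]
    ht = [(10, "aardvark"), (11, "set"), (12, "zebra")]
    print(monosemous_pairs(oed, ht))
```

A pair produced this way rests only on the word form being unique on both sides, so it can link two unrelated senses of the same spelling; that is one reason such matches would need close monitoring.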

[Chart: Categories remaining unmatched after each process. Y-axis 0 to 230,000; x-axis: Start, then processes 1 to 11]
1: Stripped headings
2: Pattern matching
3: Levenshtein distance of 1
4: Sibling matching
5: Noun heading in other POS
6: Content (lexeme + date) matching
7: Correction of misaligned categories
8: Date profiles
9: Monosemous matching
10: Monosemous siblings
11: Gap matching

[Chart: Categories remaining unmatched after each process, rescaled. Y-axis 0 to 30,000; x-axis: processes 1 to 11, numbered as in the preceding chart, without the Start bar]

Lexeme Matching
• Uses techniques developed for category matching
  ‣ Stripped text strings + dates
  ‣ Pattern matching and Levenshtein edit distances
  ‣ Monosemous pairing
  ‣ Variant forms list
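The Levenshtein edit distance mentioned on this slide counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below implements the standard dynamic-programming version; the `lexemes_match` helper, its `(word, first_date)` tuples, and the requirement that dates be identical are assumptions added for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or keep
            ))
        previous = current
    return previous[-1]


def lexemes_match(oed, ht, max_distance=1):
    """Compare two (word, first_date) tuples: same date, near-identical text."""
    (oed_word, oed_date), (ht_word, ht_date) = oed, ht
    if oed_date != ht_date:
        return False
    return levenshtein(oed_word.lower(), ht_word.lower()) <= max_distance


if __name__ == "__main__":
    print(levenshtein("colour", "color"))
    print(lexemes_match(("colour", 1290), ("color", 1290)))
```

Pairing the loose text comparison with a date check is a design choice in this sketch: a distance of 1 alone would also link unrelated short words, such as 'cat' and 'cut'.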

[Chart: Lexemes remaining unmatched after each process. Y-axis 0 to 800,000]
Start: 751,156
Stripped text + date: 123,293
Exact text: 92,100
Monosemous pairing: 91,961

[Chart: Lexemes remaining unmatched after each process, rescaled. Y-axis 0 to 130,000, without the Start bar]
Stripped text + date: 123,293
Exact text: 92,100
Monosemous pairing: 91,961

Lessons
• Stripped text can take you a long way
• Date information (even without lexeme text) very useful
• Monosemous pairing helps but needs close monitoring
• Iterative checking with distinct confidence levels essential

Extensibility
• Dictionaries/thesauri share lexis + dates structure
• Links between resources are valuable for lexicographers, lexicologists, semanticists, and the interested general public
• Methods can be applied to/adapted for other lexical resources
  ‣ Thesaurus of the Scots Language pilot project (Carnegie Research Incentive Grant)

ht.ac.uk thesaurus.ac.uk
