Marc Alexander
July 10, 2019

(Re)connecting Complex Lexical Data: Updating the Historical Thesaurus of English

Presented by Fraser Dallachy at Digital Humanities 2019, University of Utrecht


Transcript

  1. (Re)connecting Complex Lexical Data: Updating the Historical Thesaurus of English
    Fraser Dallachy, Brian Aitken, and Marc Alexander
    University of Glasgow
  2. Updating the Historical Thesaurus
    • New data produced through OED3 research
      ‣ New headwords/senses
      ‣ More accurate dating of senses
    • HT project provides categorisation expertise for new data
  3. The Problem
    • Primary key fields created after original data sharing between HT and OED
    • On most occasions it should be safe to assume that an OED thesaurus category will have an HT counterpart
    • Categories should exist in same place in hierarchical structure
    • However, databases have been adjusted by HT and OED teams separately and have thus diverged
  4. The Problem
    • Lexical data largely consists of text strings
      ‣ Lexemes (i.e. word)
      ‣ Quasi-text category numbers (e.g. 01.02.05)
      ‣ Labels (e.g. dict, arch)
    • Some associated numerical data
      ‣ Attestation dates
  5. Some Numbers
    • In OED data:
      ‣ 237,734 categories
      ‣ 229,325 with part of speech and not empty
      ‣ 715,546 lexemes
  6. Techniques
    • Link categories first, then lexemes
    • Adjustment of HT v1 category numbering to match expected OED category numbering
    • Check categories with same numbers are correct matches
  7. Category Matching
    • ‘Stripped’ form of category headings
      ‣ Remove punctuation, incl. spaces
      ‣ Replace ‘/’ with ‘or’
      ‣ List of variant forms
    • Lexeme count + ‘stripped’ last word
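
The "stripped" form on slide 7 can be sketched roughly as below. This is only an illustration: the function name `strip_heading` is hypothetical, and lower-casing is an assumption (the slide mentions only punctuation, spaces, and the "/" replacement). The variant-forms list is not shown on the slide and is left out.

```python
import re


def strip_heading(heading: str) -> str:
    """Reduce a category heading to a comparison key (hypothetical helper)."""
    text = heading.lower()           # assumption: case is ignored when comparing
    text = text.replace("/", "or")   # slide 7: replace '/' with 'or'
    # Drop every character that is not a letter or digit,
    # which removes punctuation and spaces in one pass.
    return re.sub(r"[\W_]+", "", text)


print(strip_heading("Love/affection (of kin)"))
print(strip_heading("love or affection, of kin"))
```

Both calls yield the same key, so two headings that differ only in punctuation, spacing, or "/" versus "or" would be treated as candidates for the same category.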
  8. Pattern Matching
    • Uses known variations between HT and OED expressions
      ‣ Pertaining to -> relating to
      ‣ Inhabitant of -> inhabitant
      ‣ spec. -> specific
      ‣ -ize -> -ise
      ‣ January -> jan.
      ‣ Remove ‘a ’ and ‘to ’ at beginning of string
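
One way to picture the substitutions on slide 8 is as an ordered list of rewrite rules applied to both headings before they are compared. The sketch below is an assumption about how such rules could be encoded: the regular expressions, the `RULES` table, and the `normalise` function are all hypothetical, and the `-ize` rule is deliberately crude (it would also rewrite a word such as "size", which is harmless only because both sides receive the same treatment).

```python
import re

# (pattern, replacement) pairs taken from slide 8; the regex wording is an assumption.
RULES = [
    (r"^(?:a|to) ", ""),                    # remove 'a ' and 'to ' at beginning of string
    (r"\bpertaining to\b", "relating to"),
    (r"\binhabitant of\b", "inhabitant"),
    (r"\bspec\.", "specific"),
    (r"ize\b", "ise"),                      # crude: only handles a word-final -ize
    (r"\bjanuary\b", "jan."),
]


def normalise(heading: str) -> str:
    """Apply every rule in order to a lower-cased heading (hypothetical helper)."""
    text = heading.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


def headings_agree(ht_heading: str, oed_heading: str) -> bool:
    """Two headings match if they are identical after normalisation."""
    return normalise(ht_heading) == normalise(oed_heading)
```

For example, `headings_agree("To civilize", "civilise")` holds under these rules, as does `headings_agree("Pertaining to January", "relating to Jan.")`.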
  9. Sibling Matching
    • If a category is unmatched…
    • Is the parent category matched?
    • If yes…
    • Find unmatched siblings belonging to same parent…
    • Look for category heading amongst these
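
The sibling-matching steps on slide 9 might look something like the following. The `Category` record, its field names, and the rule of accepting a candidate only when it is unique are all assumptions made for this sketch; the slides do not describe the database schema or how ties were handled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Category:
    """Hypothetical category record; field names are illustrative only."""
    cat_id: int
    parent_id: Optional[int]
    heading: str
    match_id: Optional[int] = None  # id of the linked category in the other dataset


def sibling_match(cat: Category,
                  own: Dict[int, Category],
                  other: Dict[int, Category]) -> Optional[Category]:
    """Look for cat's heading among the unmatched children of its parent's match."""
    if cat.match_id is not None or cat.parent_id is None:
        return None                      # already matched, or no parent to anchor on
    parent = own[cat.parent_id]
    if parent.match_id is None:
        return None                      # parent unmatched: nothing to search under
    candidates = [
        c for c in other.values()
        if c.parent_id == parent.match_id
        and c.match_id is None
        and c.heading.lower() == cat.heading.lower()
    ]
    # Assumption: only an unambiguous single candidate is accepted.
    return candidates[0] if len(candidates) == 1 else None
```

Restricting the search to children of an already-matched parent keeps the candidate set small, which is why a plain heading comparison can be trusted more here than across the whole hierarchy.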
  10. Dating Profile
    • Select date of first attestation for each lexeme in potentially matched categories
    • Build list of these for OED and HT categories
    • Compare
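
Slide 10 does not say how the two date lists were compared. As one possible reading, the sketch below treats each category's first-attestation dates as a multiset and scores the overlap; the function name `profile_similarity` and the scoring formula are assumptions, not the project's stated method.

```python
from collections import Counter
from typing import Iterable


def profile_similarity(oed_dates: Iterable[int], ht_dates: Iterable[int]) -> float:
    """Share of first-attestation dates common to both categories (0.0 to 1.0)."""
    oed_profile = Counter(oed_dates)
    ht_profile = Counter(ht_dates)
    # Multiset intersection: a date counts once per occurrence on both sides.
    shared = sum((oed_profile & ht_profile).values())
    largest = max(sum(oed_profile.values()), sum(ht_profile.values()))
    return shared / largest if largest else 0.0


# Two of the three dates coincide, so the score is two thirds.
print(profile_similarity([1300, 1450, 1600], [1300, 1450, 1602]))
```

A score like this uses no lexeme text at all, which fits the later observation on the Lessons slide that date information is useful even without it.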
  11. Monosemous Pairing
    • Search OED data for unmatched lexemes which only appear once (single sense or single unmatched sense)
    • Do the same for HT data
    • Look for matching single-sense item in both datasets
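
A minimal sketch of monosemous pairing, assuming each unmatched lexeme is available as an `(id, word)` pair; that input shape and the function name `monosemous_pairs` are hypothetical. A word is paired only if it occurs exactly once among the unmatched lexemes on both sides.

```python
from collections import Counter
from typing import Dict, List, Tuple

Lexeme = Tuple[int, str]  # (lexeme id, word form); an assumed shape


def _singletons(rows: List[Lexeme]) -> Dict[str, int]:
    """Map each word that occurs exactly once to its lexeme id."""
    counts = Counter(word for _, word in rows)
    return {word: lex_id for lex_id, word in rows if counts[word] == 1}


def monosemous_pairs(oed_unmatched: List[Lexeme],
                     ht_unmatched: List[Lexeme]) -> Dict[int, int]:
    """Pair OED and HT lexeme ids for words that are unique in both datasets."""
    oed_single = _singletons(oed_unmatched)
    ht_single = _singletons(ht_unmatched)
    shared_words = oed_single.keys() & ht_single.keys()
    return {oed_single[word]: ht_single[word] for word in shared_words}


oed = [(1, "aardvark"), (2, "set"), (3, "set")]
ht = [(10, "aardvark"), (11, "set")]
print(monosemous_pairs(oed, ht))  # "set" is skipped: it is not unique in the OED rows
```

Because nothing but the word form links the pair, results of this kind would need the close monitoring that the Lessons slide calls for.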
  12. [Chart: Categories remaining unmatched after each process; y-axis 0 to 230,000; x-axis Start, then processes 1 to 11]
    1: Stripped headings
    2: Pattern matching
    3: Levenshtein distance of 1
    4: Sibling matching
    5: Noun heading in other POS
    6: Content (lexeme + date) matching
    7: Correction of misaligned categories
    8: Date profiles
    9: Monosemous matching
    10: Monosemous siblings
    11: Gap matching
  13. [Chart: Categories remaining unmatched after each process; y-axis 0 to 30,000; x-axis processes 1 to 11]
    1: Stripped headings
    2: Pattern matching
    3: Levenshtein distance of 1
    4: Sibling matching
    5: Noun heading in other POS
    6: Content (lexeme + date) matching
    7: Correction of misaligned categories
    8: Date profiles
    9: Monosemous matching
    10: Monosemous siblings
    11: Gap matching
  14. Lexeme Matching
    • Uses techniques developed for category matching
      ‣ Stripped text strings + dates
      ‣ Pattern matching and Levenshtein edit distances
      ‣ Monosemous pairing
      ‣ Variant forms list
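
For reference, Levenshtein edit distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the standard dynamic-programming formulation, not code from the project; `near_match` and its default threshold of 1 (echoing the "Levenshtein distance of 1" step listed for categories) are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def near_match(a: str, b: str, max_distance: int = 1) -> bool:
    """True if the strings are within max_distance edits of each other."""
    return levenshtein(a, b) <= max_distance


print(levenshtein("colour", "color"))      # 1
print(near_match("organise", "organize"))  # True
```

A threshold of 1 catches single-letter spelling differences while keeping false positives among short words manageable, though short forms would still need checking by hand.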
  15. [Chart: Lexemes remaining unmatched after each process; y-axis 0 to 800,000]
    Start: 751,156
    Stripped text + date: 123,293
    Exact text: 92,100
    Monosemous pairing: 91,961
  16. [Chart: Lexemes remaining unmatched after each process; y-axis 0 to 130,000]
    Stripped text + date: 123,293
    Exact text: 92,100
    Monosemous pairing: 91,961
  17. Lessons
    • Stripped text can take you a long way
    • Date information (even without lexeme text) very useful
    • Monosemous pairing helps but needs close monitoring
    • Iterative checking with distinct confidence levels essential
  18. Extensibility
    • Dictionaries/thesauri share lexis + dates structure
    • Links between resources are valuable for lexicographers, lexicologists, semanticists, the general interested public
    • Methods can be applied to/adapted for other lexical resources
      ‣ Thesaurus of the Scots Language pilot project (Carnegie Research Incentive Grant)