Lexicalization Pressure

Marc Alexander
August 29, 2018

Presented at ICEHL XX, Edinburgh, Scotland


Transcript

  1. Lexicalization Pressure. ICEHL 20, August 2018. Marc Alexander.
    Work also by Brian Aitken, Fraser Dallachy, and Christian Kay. University of Glasgow.
  2. […] the primacy of meaning, and the analysis of meaning, as the essential tool and criterion for the study of any language [and] nothing less than classification, by meaning, of the whole if one is to begin to understand the parts.
    Kay et al. 2017: ‘Why a Historical Thesaurus’ https://ht.ac.uk/why/
  3. “Time and practice are the only ways to learn. Do not get bogged down, confused and bewildered.”
    Christian Kay, Classifying [internal notes], c1980s
  4. ‣ 793,742 words
    ‣ 225,131 categories (= meanings)
    ‣ Approximately 3.5 words for each concept, on average
    ‣ Largest categories:
      • 01.05.06.08.02 av (264 synonyms) “Immediately”
      • 02.01.09.03 aj (248 synonyms) “Dull, stupid”
      • 02.06.01.06 (224 synonyms) “Excellent”
      • 01.02.03 (213 synonyms) “Die”
      • 02.01.09.06.01 (203 synonyms) “Stupid person, dolt, blockhead”
  5. v.1: Print HTOED
    v.3: Small-scale moves, remove “00” categories, fix +_. 99 subcategories
    v.4: Large-scale renumbering
    v.4.2: Thematic dataset
    v.5: OED3 updates/sync; approx 35% of entries from v.1 re-dated, 20,000 new words added, changes to 39% of categories
  6. Thematic categories
    ‣ HT Thematic Dataset
    ‣ ‘Human scale’ categories
    ‣ for example: aq03b Ghost/phantom
    ‣ 4,034 categories (as opposed to 797k)
    ‣ Useful for:
      ‣ browsing
      ‣ users finding search terms
      ‣ an aggregation level for data
      ‣ larger categories for analysis
  7. [Chart: x-axis 1050 to 2000 in 50-year steps; y-axis 80,000 to 400,000; periods marked Old English, Middle English, Early Modern English, Later Modern English. Data labels in order: 15,343; 15,343; 15,405; 18,257; 21,841; 30,857; 37,408; 67,229; 75,396; 85,249; 106,314; 152,212; 184,602; 199,224; 205,892; 220,539; 248,448; 278,415; 334,064; 363,039]
  8. [Chart: 01.10.09.07 Named colours; x-axis 1050 to 1950; y-axis 0 to 400; periods marked Old English, Middle English, Early Modern English, Later Modern English]
  9. [Timeline charts, 1100 to 2000 in 25-year steps: 01.01.05.04 Fountain; 01.01.10.13 Astrology; 01.03.06.01 Inodorousness]
  10. [Timeline charts, 1100 to 2000 in 25-year steps: 02.02.18.03 Meekness; 02.02.28.03 Arrogance; 01.03.05.01 Insipidity]
  11. [Timeline charts, 1100 to 2000 in 25-year steps: 01.04.01 Alchemy; 01.04.02.01 Chemistry]
  12. [Two timeline charts, 1100 to 2000 in 25-year steps; no panel titles in the transcript]
  13. [Two timeline charts, 1100 to 2000 in 25-year steps; no panel titles in the transcript]
  14. [Timeline charts, 1100 to 2000 in 25-year steps: 01.01.05.04 Fountain; 01.01.10.13 Astrology; 01.03.06.01 Inodorousness]
  15. Lexicalization
    ‣ Loosely, assigning a word form to a meaning
    ‣ Across a Thesaurus category, patterns of lexicalization show increases and decreases in the word senses available to describe the category’s concept
    ‣ Such patterns show:
      ‣ a ‘pressure’ to increase the word-stock of a particular concept
      ‣ where significant unusual attention has been paid to a concept over time
      ‣ unusual places where the language develops out of line with general trends
  16. Thematic Heading: bk04h03b Organ
    Total Size: 284
    Average Size: 37.2
    Average New Words: 2.84
    Average Words Falling Out of Use: 1.3
    Average Difference between Decades: 1.54
    Average Churn: 2%
    Birth-Rate Average: 7%
    Average Variation from Overall Rate of Change: 0.008369565
    Largest Churn: 22%
    Largest Churn Post-1500: 22%
    Peak Decade for Churn: 1850s
    Peak Post-1500 Decade for Churn: 1850s
    Standard Deviation of Size: 55.2612
    Largest Size: 184
    Difference between Largest Size and End Size: -30
    Percent Difference between Largest and End Size: -16%
    Peak Decade for Size: 1880s
    Modal Size: 154
    Frequency of Modal Size: 8
    Frequency within 5% of Modal Size: 14
    Largest Increase: 58
    Largest Fall: -27
    % Increase as a Percentage of Size: 156%
    % Fall as a Percentage of Size: -73%
    For each decade:
      Count: 184
      New Senses: 44
      Senses Falling out of Use: 37
      Difference to Next Decade: -27
      Churn: 20%
      Rate of Birth: 24%
      Variation from Overall Average: -15%
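The per-decade figures on this slide are consistent with churn being the senses falling out of use divided by the category size, and rate of birth being the new senses divided by the category size (37/184 and 44/184). A minimal sketch under that assumption; the function and key names are hypothetical and not taken from the Thesaurus data.

```python
def decade_stats(count, new_senses, lost_senses):
    """Churn and birth rate for one decade of one category.

    Assumes churn = senses lost / size and birth rate = new senses / size,
    which reproduces the percentages shown on the slide.
    """
    if count <= 0:
        raise ValueError("category must contain at least one sense")
    return {
        "churn": lost_senses / count,
        "birth_rate": new_senses / count,
    }


# Figures from the "For each decade" block of the slide.
stats = decade_stats(count=184, new_senses=44, lost_senses=37)
print(f"churn {stats['churn']:.0%}, birth rate {stats['birth_rate']:.0%}")
# prints: churn 20%, birth rate 24%
```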
  17. Definition: peak
    ‣ Standard deviation of the 1000-2000 decade ‘chunks’ greater than 30; make sure that the categories have enough variation to generate a peak
    ‣ The difference between the largest decade and the present is 10% or more of the peak size; make sure that the peak is pronounced
    ‣ Exclude the OE dates (a long period; not comparable to a decade)
    ‣ Exclude peaks in the 1900s-2000s; ignore peaks which are in line with the general increase
    ‣ Using these criteria, there are 464 Thematic Categories (10%) displaying some sort of ‘peak’
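The peak criteria above can be sketched as a filter over one category's decade sizes. This is an illustration only: the data layout (a dict from decade start year to size), the function name, and the use of a population standard deviation are assumptions, and Old English is assumed to be excluded by dropping any key before 1000.

```python
from statistics import pstdev


def find_peak(sizes, min_sd=30.0, min_drop=0.10,
              first_decade=1000, modern_from=1900):
    """Return the peak decade of a category, or None if it has no peak.

    sizes: dict mapping decade start year -> number of senses (assumed layout).
    """
    # Exclude anything before the first decade chunk (the OE period).
    decades = sorted(d for d in sizes if d >= first_decade)
    if len(decades) < 2:
        return None
    values = [sizes[d] for d in decades]
    # Criterion 1: enough variation to generate a peak.
    if pstdev(values) <= min_sd:
        return None
    peak_decade = max(decades, key=lambda d: sizes[d])
    peak_size = sizes[peak_decade]
    end_size = sizes[decades[-1]]
    # Criterion 2: the present is at least 10% of the peak size below the peak.
    if peak_size - end_size < min_drop * peak_size:
        return None
    # Criterion 3: ignore peaks in the 1900s-2000s.
    if peak_decade >= modern_from:
        return None
    return peak_decade


# Invented toy series, sampled sparsely for brevity.
toy_peaked = {1700: 30, 1750: 60, 1800: 120, 1850: 200,
              1900: 170, 1950: 150, 2000: 140}
toy_growing = {1700: 10, 1800: 50, 1900: 120, 2000: 200}
print(find_peak(toy_peaked), find_peak(toy_growing))
```

The second series grows steadily to the present, so it fails the second criterion and is not reported as a peak.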
  18. Peaks reflecting changing society?
    ‣ ac01c03 Disorders of horses; peak 1720
    ‣ ag03h Hawking; peak 1610
    ‣ ai10e Tobacco; peak 1880
    ‣ bg04g01 Heraldic devices collectively; peak 1820 (modern English 473, at its largest 700)
    ‣ bk05g04 Architectural ornament; peak 1850
  19. Peaks reflecting changing technology (and society)?
    ‣ ai16e01 Privy/latrine; word count in the 30s until the 1830s, then increases to 133 by the 1840s
    ‣ ba15b Armour; 1610
    ‣ bh14k17 Shipbuilding and repairing; 1860s
    ‣ bh14 Sailing; in general, tends to peak 1860s
    ‣ bg07e Handwriting/style of; 1880
    ‣ ba03 Victory; 1600
    ‣ ag01x02b Excessive consumption of food/drink; 1600
    ‣ ah02d01 Undressing/removing clothing; 1650 and 1840
    ‣ ai12e Sourness/acidity; 1670
    ‣ am02a Eternity/infinite duration; 1670 (99, ModE 77)
  20. Peaks reflecting…?
    ‣ aa11b Bad weather; 1850
    ‣ The Supernatural:
      ‣ aq02b Sorcery/witchcraft/magic; 1650
      ‣ aq03a Evil spirit/demon; 1840
      ‣ aq03c Fairy/elf; 1880
  21. Peaks
    ‣ Two periods come up a lot: 1640s-1650s, 1840s-1860s
    ‣ From Charlotte Brewer’s Examining the OED:
  22. Peaks
    ‣ Two periods come up a lot: 1640s-1650s, 1840s-1860s
    ‣ From Charlotte Brewer’s Examining the OED:
  23. Initial definition: plateau
    ‣ Across the decade chunks, find the mode; get the most frequent size of the category
    ‣ Find how often the category size is within 5% of that mode; get a rough idea of the plateau period
    ‣ Find all those categories where more than 30 decades are within 5% of the mode; get categories where there is a significant period of plateau
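The plateau test above can be sketched as follows. This is a rough illustration, assuming each category is available as a plain list of sizes, one per decade; the function names are hypothetical, and ties for the mode are broken by first occurrence, which the slides do not specify.

```python
from collections import Counter


def plateau_length(sizes, tolerance=0.05):
    """Return (modal size, number of decades within `tolerance` of the mode)."""
    mode, _ = Counter(sizes).most_common(1)[0]
    within = sum(1 for s in sizes if abs(s - mode) <= tolerance * mode)
    return mode, within


def has_plateau(sizes, min_decades=30, tolerance=0.05):
    """True when more than `min_decades` decades sit within 5% of the mode."""
    _, within = plateau_length(sizes, tolerance)
    return within > min_decades


# Invented toy series: a short rise, a long flat stretch near 60, a late jump.
toy = [10] * 5 + [20, 40] + [60] * 35 + [61, 62, 64, 65, 90, 120]
print(plateau_length(toy), has_plateau(toy))
# prints: (60, 37) True
```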
  24. Plateaus reflecting…?
    ‣ ao Action plateaus heavily; etymological makeup
    ‣ Society: ay06a03a Title/form of address for persons of rank; plateaus from 1590s
    ‣ Society/data: au28b Kiss; two plateaus: begins with 10 words, then triples between 1540-1600, then plateaus until 1930 and then doubles from 1930-1980
    ‣ ?: ap06b Sufficient quantity/amount/degree; plateaus from 1570 onwards, at around 60 words
  25. A stab at trauma
    ‣ Calculate the difference in size between each decade chunk
    ‣ Of the falls or rises, take only those where 10 or more words are lost in a decade; find big falls
    ‣ Of these, find all those categories where the fall compared to the average size of that category is greater than 5%; filter out enormous categories
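The two filters above can be sketched as a single pass over consecutive decades. This is a minimal illustration with assumed inputs: a chronological list of (decade, size) pairs, with each fall attributed to the later of the two decades. The function name and the toy numbers are invented.

```python
def big_falls(sizes, min_loss=10, min_share=0.05):
    """Return (decade, change) for each large fall in one category.

    sizes: chronological list of (decade, size) pairs (assumed layout).
    A fall qualifies if at least `min_loss` words are lost and the loss
    exceeds `min_share` of the category's average size.
    """
    if not sizes:
        return []
    average = sum(size for _, size in sizes) / len(sizes)
    falls = []
    for (_, previous), (decade, current) in zip(sizes, sizes[1:]):
        change = current - previous
        if change <= -min_loss and -change > min_share * average:
            falls.append((decade, change))
    return falls


# Invented toy series: one sharp fall, one small fall, one small rise.
toy = [(1850, 184), (1860, 157), (1870, 150), (1880, 152)]
print(big_falls(toy))
# prints: [(1860, -27)]
```

The second filter matters for very large categories, where a loss of ten words is a small fraction of the whole and is therefore dropped.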
  26. A stab at trauma
    ‣ am06b Watch
    ‣ bi06j Mining
    ‣ bk01d03 Spotting trains, watching birds, etc.
    ‣ bk07k05 Printing
    ‣ ar06g01 Phrenology
    ‣ bf21a03 Doctrines concerning the soul
    ‣ bf24b01 Meeting for observance
  27. Churn
    ‣ These previous measures focus on size and do not look at rates of churn (a net +2 between two decades could be +2, or could be -10 and +12)
    ‣ So take the number of synonyms lost in each decade (that is, which have a final citation date in that decade) and divide it by the size of the category during that decade for a measure of churn
    ‣ Here, categories bigger than 10 synonyms with an overall churn greater than 4%
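The churn measure described above can be sketched from first and last citation dates. This is an illustration under stated assumptions: each sense is a (first_year, last_year) pair, "overall churn" is taken to be the mean of the per-decade values, and "bigger than 10 synonyms" is taken to mean the total number of senses in the category. None of these choices is confirmed by the slides, and all names are hypothetical.

```python
def decade_of(year):
    """Start year of the decade containing `year`."""
    return year // 10 * 10


def churn_by_decade(senses, decades):
    """Map each decade to (senses lost in it) / (senses live in it)."""
    result = {}
    for d in decades:
        live = [s for s in senses if decade_of(s[0]) <= d <= decade_of(s[1])]
        lost = [s for s in live if decade_of(s[1]) == d]
        result[d] = len(lost) / len(live) if live else 0.0
    return result


def high_churn(senses, decades, min_size=10, min_churn=0.04):
    """Apply the slide's filter under the assumptions named above."""
    churn = churn_by_decade(senses, decades)
    overall = sum(churn.values()) / len(churn) if churn else 0.0
    return len(senses) > min_size and overall > min_churn


# Invented toy category: two senses die in the 1820s, one is born in the 1830s.
toy = [(1800, 1825), (1800, 1829), (1810, 1900), (1820, 1900), (1830, 1900)]
print(churn_by_decade(toy, [1800, 1810, 1820, 1830]))
# prints: {1800: 0.0, 1810: 0.0, 1820: 0.5, 1830: 0.0}
```

In the toy data the category shrinks by only one sense between the 1820s and 1830s, yet half of its 1820s senses are lost, which is the distinction between size and churn that the slide draws.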
  28. Churn
    ‣ aa05d Africa
    ‣ ab14 Biological theories
    ‣ ac02k Psychiatry
    ‣ ae14a Subclass Marsupialia (marsupials)
    ‣ aj02a Chemistry as a science
    ‣ aj07e03 Particle physics
    ‣ aj07f01 Radioactivity
    ‣ aq02a Spiritualism
    ‣ ar42c01 A charlatan, fraudster
    ‣ ar53e Idealism
    ‣ bb06k03 Party politics
    ‣ bf24a Kinds of worship
    ‣ bg08i Printing machine/press
  29. What can we tell from this?
    ‣ If we agree that lexicalization measures demonstrate something in accordance with cultural trends, and mirror historical facts, then we have a rough measure of attention across the history of English which is primarily independent of frequency
    ‣ It shows us if our investigation is in a semantic field with unusual overall patterning in a particular period
    ‣ It connects historical linguistics more closely to the history of ideas
  30. Lexicalization pressure
    ‣ Three sets of data:
      ‣ Raw data (email me; we may make an API if there’s interest)
      ‣ Sparklines for ‘shape’ of a semantic field’s behaviour (ht.ac.uk/sparklines)
      ‣ Heatmap for contextual differences in lexical growth (ht.ac.uk/heatmap)
    ‣ Evidence for the ways in which lexicalization of a semantic field can reflect the linguistic record or the context of production
  31. Density
    ‣ Hypothesis:
      ‣ The primary semantic value of a word form in a language is the form most likely to bud off further meanings (that is, that will generate re-use of the same word form in the same semantic field)
  32. Density: run
    10  an04a  Swiftness
     8  bk08v  Ball game
     7  aj05h  Action/process of flowing
     7  bh11f  Transport/conveyance in a vehicle
     7  bk08q  Racing/race
     5  aj05g  Liquid which has been emitted
     5  bk01d  A specific form of amusement/a pastime
  33. Density: run
    COBUILD first senses:
      1. Move more quickly than you walk
      2. Run in a race in competition with other people/horses
      3. Course or position (e.g. wire or river)
  34. Density: run
    10  an04a  Swiftness
     8  bk08v  Ball game
     7  aj05h  Action/process of flowing
     7  bh11f  Transport/conveyance in a vehicle
     7  bk08q  Racing/race
     5  aj05g  Liquid which has been emitted
     5  bk01d  A specific form of amusement/a pastime
  35. Density: run, 1700s
    7  an04a  Swiftness
    5  aj05h  Action/process of flowing
    5  bk08q  Racing/race
    4  an03   Progressive motion
    4  bh     Travel and travelling
    3  an05k  Going/coming out
    3  an05m  Going away
  36. Density: strike
    9  an08a  Striking
    8  ba15e  Operation and use of weapons
    6  ac01e  Injury
    5  ag02c  Cultivation/tillage
    5  aj03c  Temperature
    5  al03g  Flatness/levelness
    5  al08   Impact
  37. Density: strike, 1700s
    9  an08a  Striking
    7  ba15e  Operation and use of weapons
    5  an08   Impact
    5  au01   Emotions, mood
    4  ac01e  Injury
    4  aj08   Light
    4  al03g  Flatness/levelness
  38. Density
    ‣ Where a word form is polysemous, density of the word form’s location in the Thesaurus hierarchy can supplement our information about where the word form’s primary semantic meaning is
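The density counts shown for run and strike can be sketched as a tally of one word form's senses per thematic category, optionally restricted to senses current in a given year. This is an illustration only: the record layout (category code, first year, last year) and the function name are assumptions, and the dates in the toy data are invented rather than taken from the Thesaurus.

```python
from collections import Counter


def density(senses, year=None):
    """Count a word form's senses per category, densest category first.

    senses: list of (category_code, first_year, last_year) for one word form.
    year: if given, count only senses attested across that year.
    """
    counts = Counter(
        category
        for category, first, last in senses
        if year is None or first <= year <= last
    )
    return counts.most_common()


# Invented toy senses; the category codes are borrowed from the slides.
toy = [
    ("an04a", 1200, 2000), ("an04a", 1500, 2000), ("an04a", 1850, 2000),
    ("bk08v", 1860, 2000), ("bk08v", 1900, 2000),
    ("aj05h", 1300, 1750),
]
print(density(toy))
# prints: [('an04a', 3), ('bk08v', 2), ('aj05h', 1)]
print(density(toy, year=1700))
# prints: [('an04a', 2), ('aj05h', 1)]
```

Under the hypothesis on the slides, the category at the head of the list would be read as the likely primary sense of the word form at that date.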
  39. What can we tell from this?
    ‣ If we agree that lexicalization measures demonstrate something in accordance with cultural trends, and mirror historical facts, then we have a rough measure of attention
    ‣ If we agree that density of a widespread word form in a semantic category indicates the primary sense of that word form at that time, then we have another rough measure of associations of that word form
    ‣ Frequency is a proxy; we can supplement that proxy
  40. Health warnings
    ‣ Lexicographical data is susceptible, very heavily, to biases in the lexicographical record, not least that of OED; ‘any dictionary dates should be treated with a certain amount of caution’ (Durkin)
    ‣ This data can supplement frequency but should not aim to replace it
  41. ‘For dealing historically with the phonological and grammatical systems of English, we have inherited from our predecessors a legacy which can and should be used fruitfully [...] In the field of lexis – apart from alphabetical dictionaries – we are comparatively poorly equipped, and the way before us is a long one.’
    Samuels (1965: 40)