Semantic Tagging and Early Modern Collocates

Marc Alexander
July 23, 2015

Presented at Corpus Linguistics 2015

Authors (asterisk indicates presenting authors):
* Marc Alexander, University of Glasgow
Alistair Baron, Lancaster University
* Fraser Dallachy, University of Glasgow
* Scott Piao, Lancaster University
Paul Rayson, Lancaster University
Stephen Wattam, Lancaster University

Transcript

  1. Semantic Tagging and Early Modern Collocates

    Marc Alexander*, Alistair Baron†, Fraser Dallachy*, Scott Piao†, Paul Rayson†, and Stephen Wattam†
    * University of Glasgow, UK
    † Lancaster University, UK
  2. Project team and partners

    • Dr Marc Alexander, University of Glasgow
    • Jean Anderson, University of Glasgow
    • Professor Dawn Archer, University of Central Lancashire
    • Dr Alistair Baron, Lancaster University
    • Professor Jonathan Hope, University of Strathclyde
    • Professor Lesley Jeffries, University of Huddersfield
    • Professor Christian Kay, University of Glasgow
    • Dr Paul Rayson, Lancaster University
    • Dr Brian Walker, University of Huddersfield
    • Brian Aitken, University of Glasgow
    • Dr Fraser Dallachy, University of Glasgow
    • Dr Jane Demmen, University of Huddersfield
    • Bethan Malory, University of Central Lancashire
    • Dr Scott Piao, Lancaster University
    • Stephen Wattam, Lancaster University
    • Professor Mark Davies, Brigham Young University
    • Professor Anthony Johnson, Åbo Akademi University
    • Ilkka Juuso, University of Oulu
    • Professor Tapio Seppänen, University of Oulu

    Other partners: Oxford University Press, the University of Wisconsin-Madison and the Folger Shakespeare Library.
  3. “Words, words. They’re all we have to go on.”

    Stoppard, Tom. 1967. Rosencrantz and Guildenstern are Dead.
  4. 62% of English word forms refer to more than one meaning

    Of the 793,742 entries in the Historical Thesaurus of English there are 370,011 non-Old-English word forms, of which:
    • 67 have more than 100 possible meanings
    • 464 have more than 50 possible meanings
    • 2,580 have more than 20 possible meanings
    • 7,554 have more than 10 possible meanings
    • 111,127 have more than 1 possible meaning
    • 258,883 have just 1 possible meaning
  5. Historical Thesaurus entries for "strike"

    • 01.03.01.05.02|03.05 n
      Health and disease :: Disorders of cattle/horse/sheep :: disorders of cattle/sheep :: other disorders
      strike (1933–)
    • 01.03.03.04.14|02 vt
      Make healthy :: Practise physiotherapy :: rub/stroke with hands
      strike (1400 + 1611 + 1886 dial.)
    • 01.02.04.04.03|07 vt
      Come by death :: Kill by specific method :: by poisoning
      strike (1592–1621)
    • 01.06.10.08|01 vi
      Plant :: Be a root :: grow (as root)
      strike (1682–)
    • 01.05.17.05.02|04 n
      Animals :: Suborder Ophidia (snakes) :: act of darting at prey
      strike (1879–)
    • 01.10.03.03.02.03|12.06 vt
      Burn/consume by fire :: kindle/set alight :: produce (fire/spark) by striking
      strike (c1450– also fig.)
    • 01.10.09.03.03|01 vi
      Dye :: sink in
      strike (c1790–)
    • 01.13.06.01|04 vt
      Time :: Clock :: strike
      strike (1417–)
    • 03.11.04.03|05 vi
      Carry on an occupation/work :: Participate in labour relations :: strike
      strike (1768–)
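The entries on this slide share a regular shape: a dotted category code, a second dotted number after "|", and a part-of-speech label. A minimal Python sketch of splitting such a string into its parts follows. It assumes the number after "|" identifies a subcategory within the main category; `HTEntry` and `parse_ht_entry` are hypothetical names used only for illustration, not part of any HT or HTST tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTEntry:
    main: tuple  # main category levels, e.g. ("01", "03", "01", "05", "02")
    sub: tuple   # levels after "|" (assumed to be a subcategory); may be empty
    pos: str     # part-of-speech label as printed on the slide, e.g. "n", "vt"


def parse_ht_entry(text):
    """Split a string such as '01.13.06.01|04 vt' into code levels and POS."""
    code, _, pos = text.strip().partition(" ")
    main, _, sub = code.partition("|")
    sub_levels = tuple(sub.split(".")) if sub else ()
    return HTEntry(tuple(main.split(".")), sub_levels, pos.strip())


# Three of the "strike" entries from the slide.
entries = [
    parse_ht_entry(s)
    for s in ("01.03.01.05.02|03.05 n", "01.13.06.01|04 vt", "03.11.04.03|05 vi")
]

# The first level of the main code gives the top-level division.
top_level = sorted({e.main[0] for e in entries})
```

Keeping the levels as tuples makes prefix comparisons straightforward, for example checking whether two senses fall under the same higher-level category.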
  6. Access

    • Web demo site: http://is.gd/semtag
      – Quick access
    • A more convenient GUI tool for processing multiple texts
      – Available at www.glasgow.ac.uk/samuels
    • (Soon) Access via WMatrix API
      – http://ucrel.lancs.ac.uk/wmatrix
  7. Historical Thesaurus Based Semantic Tagger (HTST)

    • HTST is built on the Lancaster UCREL corpus annotation tools, including the Semantic Annotation System (USAS), CLAWS and VARD.
    • It incorporates the Historical Thesaurus (HT) and annotates words with HT semantic categories.
    • It mainly produces three layers of annotation:
      – USAS semantic tags
      – Full HT categories, e.g. 03.12.20.02-07.10 (cost of living)
      – Broader thematic categories, e.g. BJ.01.y.02 (Expenditure)
    • Time-sensitive semantic annotation.
      – A time parameter can be used to produce more accurate annotation, particularly for historical data.
    • OED sense definition data is used to improve the accuracy of annotation.
  8. Architecture of HTST

    [Architecture diagram. Labels recoverable from the slide:]
    • Input raw text; Annotated text
    • Components: VARD, CLAWS, USAS, HT sense tagger, HT sense disambiguator
    • Supporting data: Spelling train model; USAS NLP lexicon resources; Context feature data from OED defs & examples
    • [HT-related resources]: Historical Thesaurus; Higher-level HT categories; Linked HT categories; Highly polysemous words; Z-category words; Polyseme density list
  9. Core Lexical Resources of HTST

    • The most important part of HTST is a set of resources, including:
      – USAS semantic lexicons.
      – Historical Thesaurus of English (University of Glasgow) and auxiliary sub-lexicons.
      – Statistical correlation data between USAS and HT semantic tags extracted from OED sense definitions.
  10. USAS Lexical resources

    • Lexicon of over 56,300 items
      – presentation NN1 Q2.2 A8 S1.1.1 K4
    • MWE list of about 18,970 items (including templates)
      – travel_NN1 card*_NN* M3/Q1.2
    • A small wildcard lexicon
      – *kg NNU N3.5
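The three example entries above can be read mechanically. The following Python sketch parses the single-word and MWE formats as they appear on the slide and treats "*" as an "any characters" wildcard using the standard-library `fnmatch` module. The function names are hypothetical, and the real USAS matching rules may differ from this simplified reading.

```python
from fnmatch import fnmatchcase


def parse_lexicon_line(line):
    """'presentation NN1 Q2.2 A8 S1.1.1 K4' -> (word, POS tag, semantic tags)."""
    word, pos, *tags = line.split()
    return word, pos, tags


def parse_mwe_template(line):
    """'travel_NN1 card*_NN* M3/Q1.2' -> ([(word pattern, POS pattern), ...], tag)."""
    *slots, tag = line.split()
    return [tuple(slot.rsplit("_", 1)) for slot in slots], tag


def mwe_matches(pattern, tokens):
    """True if every (word, POS) token matches the corresponding template slot."""
    return len(pattern) == len(tokens) and all(
        fnmatchcase(word, word_pat) and fnmatchcase(pos, pos_pat)
        for (word_pat, pos_pat), (word, pos) in zip(pattern, tokens)
    )


word, pos, tags = parse_lexicon_line("presentation NN1 Q2.2 A8 S1.1.1 K4")
pattern, mwe_tag = parse_mwe_template("travel_NN1 card*_NN* M3/Q1.2")

# Tokens are (word, POS) pairs, as a POS tagger such as CLAWS would supply them.
is_travel_card = mwe_matches(pattern, [("travel", "NN1"), ("cards", "NN2")])

# A wildcard lexicon entry such as '*kg' can be checked the same way.
is_kg_unit = fnmatchcase("5kg", "*kg")
```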
  11. USAS Main Semantic Categories

    A  General and abstract terms
    B  The body and the individual
    C  Arts and crafts
    E  Emotion
    F  Food and farming
    G  Government and public
    H  Architecture, housing and the home
    I  Money and commerce in industry
    K  Entertainment, sports and games
    L  Life and living things
    M  Movement, location, travel and transport
    N  Numbers and measurement
    O  Substances, materials, objects and equipment
    P  Education
    Q  Language and communication
    S  Social actions, states and processes
    T  Time
    W  World and environment
    X  Psychological actions, states and processes
    Y  Science and technology
    Z  Names and grammar

    (For further details, see http://ucrel.lancs.ac.uk/usas/)
  12. Fine-grained Large-scale HT Data (http://historicalthesaurus.arts.gla.ac.uk/)

    • 793,742 word forms arranged into about 225,131 semantic categories.
    • Three primary divisions:
      – I The External World
      – II The Mental World
      – III The Social World
    • Organized in a hierarchical structure under these three top categories.
    • The HT semantic categories are mapped to 4,028 thematic-level categories.
    • Each category is given a numerical reference code, e.g. "01.02.08.02.02.06.01" for the category “Whisky”.
  13. Context feature data from OED

    • Association data between main HT categories and USAS codes is extracted from sense definitions.
    • Altogether, 198,783 co-occurrence pairs (f >= 3) are collected.
    • Log-likelihood correlation scores are calculated for these pairs.
    • This data is used to improve the context-based HT sense disambiguation method.

    co-occur. pair                       co-occur. freq   HT code freq   USAS code freq   log-likelihood
    01.01.11.02.07_W4                    387              466            4815             2435.92859
    01.11.02.02_A1.1.2                   414              727            8489             1925.78816
    01.09.10.02_B5                       502              626            17490            1895.23302
    01.15.18.01_A1.4-                    180              301            422              1862.89737
    03.06.05.07.02.04_S3.2+/I3.1/S2mf    172              416            282              1792.09625
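The slide does not say which log-likelihood formulation was used, nor does it give the total number of observations, so the scores in the table cannot be reproduced from the figures shown. As a rough illustration only, the following Python sketch computes the common 2×2 contingency-table log-likelihood (G²) from a co-occurrence frequency and the two code frequencies. `HYPOTHETICAL_TOTAL` is an invented corpus size, not a figure from the talk.

```python
import math


def log_likelihood(cooc, ht_freq, usas_freq, total):
    """G^2 statistic for a 2x2 table built from pair and marginal frequencies."""
    observed = [
        [cooc, ht_freq - cooc],                                   # HT code present
        [usas_freq - cooc, total - ht_freq - usas_freq + cooc],   # HT code absent
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Assumption: the real number of observations is not stated on the slide.
HYPOTHETICAL_TOTAL = 1_000_000

# Frequencies from the first table row (01.01.11.02.07_W4).
score = log_likelihood(387, 466, 4815, HYPOTHETICAL_TOTAL)
```

The score is zero when the two codes co-occur exactly as often as independence would predict, and grows as the observed co-occurrence departs from that expectation.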
  14. Outline of Main HTST Disambiguation Methods

    • Pre-processing: process the text using VARD, CLAWS and the USAS tagger.
    • If multiple candidate semantic categories exist, use the headings (brief definitions) of the HT semantic categories as the feature set and search for the categories closest to the given context.
    • The Jaccard distance is used to measure that closeness.
    • Multiword expressions (MWEs), e.g. kick the bucket, also help to disambiguate the annotation.
    • OED sense definitions are used to estimate the log-likelihood correlation between USAS and HT semantic categories, which is used to identify the most likely HT tags for a given context.
    • Time filtering
      – Filter out word senses whose recorded usage in the HT falls outside a given time window.
      – Users can set upper and lower time boundaries (in years) to increase the relevance of the HT categories to the given time.
        • E.g. if a text was published in 1800, the time filter ignores word senses that appear after that era.
      – Particularly useful for tagging historical data.
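Two of the steps above, Jaccard distance against category headings and time filtering, can be sketched together. The Python below is a simplified illustration, not the HTST implementation: the tokenizer, the tie-breaking and the way context is gathered are all assumptions, and the function names are hypothetical. The three candidate senses of "strike" and their dates are taken from the earlier slide.

```python
import re


def words(text):
    """Lowercased alphabetic tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def in_window(first, last, lower, upper):
    """True if a sense attested from `first` to `last` overlaps [lower, upper]."""
    end = float("inf") if last is None else last  # None: still in use
    return first <= upper and end >= lower


SENSES = [
    {"code": "01.13.06.01|04", "first": 1417, "last": None,
     "heading": "Time :: Clock :: strike"},
    {"code": "01.02.04.04.03|07", "first": 1592, "last": 1621,
     "heading": "Come by death :: Kill by specific method :: by poisoning"},
    {"code": "03.11.04.03|05", "first": 1768, "last": None,
     "heading": "Carry on an occupation/work :: Participate in labour relations :: strike"},
]


def pick_sense(context, senses, lower, upper):
    """Drop senses outside the time window, then pick the closest heading."""
    ctx = words(context)
    live = [s for s in senses if in_window(s["first"], s["last"], lower, upper)]
    return min(live, key=lambda s: jaccard_distance(ctx, words(s["heading"])),
               default=None)


early = pick_sense("the clock began to strike the hour", SENSES, 1600, 1700)
modern = pick_sense("workers strike over labour relations", SENSES, 1900, 2000)
```

With a 1600 to 1700 window the labour-relations sense (first recorded 1768) is never considered, which is the effect the time filter is meant to have on historical texts.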
  15. HTST Performance and Access

    • On average, about 82% precision is expected.
    • With proper parameter setting, thematic code tagging can reach nearly 88%.
    • Access to HTST:
      – A demo website: http://phlox.lancs.ac.uk/ucrel/semtagger/english
      – Graphical client tool (GUI) for processing a collection of texts using the HTST service.
      – Access via Wmatrix API.
  16. HT Codes and Thematic Codes

    AW.16.c.01 – Piracy
    • AW – Possession/ownership
    • AW.16 – Taking surreptitiously
    • AW.16.c – Robbery, piracy, raiding
    • AW.16.c.01 – Piracy

    Corresponds roughly to HT 02.06.13.05.05.02 - Piracy
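Because each dotted segment of a thematic code adds one level, the chain of broader categories can be recovered by taking successive prefixes. A small Python sketch follows, using only the four labels shown on this slide. `ancestors` is a hypothetical helper, and it handles the dotted form only, not the undotted variants such as AW16 that appear elsewhere in the deck.

```python
def ancestors(code):
    """Return the code's prefixes from broadest to most specific, itself included."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


# Labels as given on the slide.
LABELS = {
    "AW": "Possession/ownership",
    "AW.16": "Taking surreptitiously",
    "AW.16.c": "Robbery, piracy, raiding",
    "AW.16.c.01": "Piracy",
}

# Pair each level of the chain with its label.
path = [(code, LABELS[code]) for code in ancestors("AW.16.c.01")]
```

Grouping tagged words by a shorter prefix (for instance AW.16) is one way to aggregate fine-grained tags into broader themes before looking at collocates.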
  17. Thematic Collocations in CQPweb (Specialised for SAMUELS)
  18. Semantic EEBO Collocates

    • ‘Taking surreptitiously/Theft’ (AW16)
    • ‘Birds’ (AE13)
    • ‘Order Rodentia (rodents)’ (AE14h)
    • ‘Nations’ (AD15)
    • ‘Food’ (AG01)
    • ‘Authority’ (BB)
    • ‘Strength’ (AJ04e)
    • ‘Position’ (AL04a)
    • ‘Possession/ownership’ (AW01a)
  19. Insipid as without taste

    There are also some Apples that are insipid, or without taste: they are of a waterish substance, altogether vnpleasant to the stomack, and vnprofitable for meat.
    Venner, Tobias. 1620. Via recta ad vitam longam. London: Richard Moore.

    The vertues therefore of Baths coming not from insipid water, but from those most subtile, volatile, sulphureous, and salt spirits.
    Glauber, Johann Rudolf. 1651. A description of new philosophical furnaces, or A new art of distilling. London: Tho: Williams.
  20. Insipid as lacking interest/attractive qualities

    Fifthly, consider, The longer we enjoy any worldly thing, the more flat and insipid doth it grow: We are soon at the bottom, and find nothing but dregs there.
    Hopkins, Ezekiel. 1668. The vanity of the vvorld by Ezekiel Hopkins. London: Nathaniel Ranew and Jonathan Robinson.

    It is certainly true, that Women are caught for the most part in such weak Nets as these, that the most shallow, the most insipid, nay, the uglyest of Men have been the most successful in gaining an ascendant over the hearts of poor Women.
    Pallavicino, Ferrante. 1683. The whore’s rhetorick calculated to the meridian of London, and conformed to the rules of art. London: George Shell.
  21. Insipid specifically of speech or writing

    Thy Poetrie’s insipid, none can taste it: Thou art a wordyfoolish Scribler, who Writ’st nothing but high-sounding frothy stuff.
    Shadwell, Thomas. 1678. The history of Timon of Athens, the man-hater as it is acted at the Dukes Theatre: made into a play. London: Henry Herringman.

    For I know some will say, why does he treat us with insipid descriptions of Weeds, and make us hobble after him over broken stones, decayed buildings, and old rubbish?
    Wheler, George. 1682. A journey into Greece by George Wheler, Esq., in company of Dr. Spon of Lyons. London: William Cademan, Robert Kettlewell, and Awnsham Churchill.
  22. Insipid as chemically unreactive

    Thus Quicksilver, that is insipid, will in the cold dissolve Gold, which Aqua Fortis it self, though assisted by exeternal heat will not work upon.
    Boyle, Robert. 1685. Of the reconcileableness of specifick medicines to the corpuscular philosophy. London: Sam. Smith.