Challenges of presenting and analyzing etymological data of South-East Asian languages

Talk presented at the 46th Poznan Linguistic Meeting (2016/09/15-17, Poznan, Poland)

Johann-Mattis List

September 15, 2016

Transcript

  1. Beyond Cognacy: Challenges of Representing and Analyzing Etymological Data of

    South-East Asian Languages Nathan W. Hill and Johann-Mattis List
  2. Traditional Approaches to Etymological Data • traditional etymology looks back

    on a great success story • huge dictionaries have been published (Pokorny 1959, Kluge 1883, Mayrhofer 1986–2001, Meyer-Luebke 1911) • thousands of word histories have been reconstructed “Chaque mot a son histoire!” (attr. to Jules Gilliéron, 1854-1926)
  3. Drawbacks of Traditional Approaches Etymological dictionaries are: 1. extremely

    time-consuming to produce and to use 2. insufficiently formalized, non-transparent, and idiosyncratic 3. difficult, if not unrealistic, to produce for understudied languages
  4. Quantitative Approaches to Etymological Data Current approaches to quantitative historical

    linguistics: • rely on wordlists of basic vocabulary (Atkinson & Gray 2006) • show an increase in breadth (more languages) • show a decrease in depth (fewer words per language) • usually ignore morphology (important in traditional approaches) • rarely make the motivation for cognate judgments transparent • almost never reach the etymological dictionary level of annotation
  5. Challenges for Computers “the belly” in Chinese and Tibetan: •

    Old Chinese *puk • Written Tibetan (1) grod-pa • Written Tibetan (2) gsus-pa • Lhasa Tibetan [tʂʰo¹³-ko ²]
  8. Challenges for Humans “There is a severe imbalance of being

    data-rich and theory-poor.” (William S.-Y. Wang, 1996) • many datasets on South-East Asian languages have been published (Sidwell 2015, Wang 2004, Huang 1992, etc.) • large digitized collections have been made available via the STEDT project (Matisoff 2011) • but the majority of these data are unprocessed (not further checked by linguists) and lack etymologies, cognate judgments, phonetic transcriptions, or concept annotations
  9. The Best of Two Worlds? Can we combine the advantages

    of traditional and quantitative approaches to profit from computational efficiency and human insight? Which challenges do we face when pursuing integrated frameworks in South-East Asian languages?
  10. Etymological Database of Burmish Languages Background: • part of ERC

    synergy grant 'Beyond Boundaries' (SOAS, British Museum, British Library) Goal: • creating a classical etymological dictionary that takes full advantage of computational approaches, with an openly published online database
  11. Previous Research • Burling (1967): pioneering and rigorous, but data too

    sparse • Bradley (1979): inexplicit, not Burmish (Loloish) • Mann (1998): no use of Old Burmese, no relative chronology of changes, morphemes as cognate sets • Nishi (1999): no reconstruction, very clean organization into cognate sets, larger dataset than predecessors
  12. Challenges: Suprasegmental Correspondences Not all sound correspondences occur between sounds

    which are in the same prosodic position of a word. Notably, processes like tonogenesis and various patterns of aspiration and voicing often co-occur with other sound changes. In these cases, a simple alignment of the words under consideration is usually not enough; the patterns of sound correspondences themselves need to be analyzed.
  13. Challenges: Suprasegmental Correspondences (1) Burmish tonal split in checked syllables:

    Aspirate initials corresponding to Lashi tone [55]: • WBur. khwak 'bowl', Lashi khuʔ • OBur. khlup 'sew', Lashi khju:p • WBur. khrok 'six', Lashi khjuk
  14. Challenges: Suprasegmental Correspondences (1) Burmish tonal split in checked syllables:

    Non-aspirated initials corresponding to Lashi tone [31]: • WBur. kok 'paddy rice', Lashi kuk³¹ • WBur. klyap 'ten cents; kyat', Lashi kjɔ³¹ • WBur. krok 'be afraid', Lashi kju:k³¹
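
The correspondence on the last two slides can be written down as a tiny rule. This is only a sketch under loud assumptions: the function name `expected_lashi_tone` and the list of aspirated onsets are hypothetical, aspiration is assumed to be spelled as a stop followed by `h` in the romanization, and the rule is restricted to checked syllables. It deliberately ignores any other conditioning factors, such as the preglottalized series discussed on the following slides.

```python
# Hypothetical list of aspirated onsets in the romanization (assumption,
# not taken from the slides).
ASPIRATED_ONSETS = ("kh", "ph", "th", "ch")


def expected_lashi_tone(burmese_form):
    """Predict the Lashi tone of a checked syllable from the Burmese initial.

    Aspirated initial -> tone 55, non-aspirated initial -> tone 31.
    """
    form = burmese_form.lstrip("-")
    return "55" if form.startswith(ASPIRATED_ONSETS) else "31"


# Forms quoted on the slides
for form in ("khwak", "khlup", "khrok", "kok", "klyap", "krok"):
    print(form, expected_lashi_tone(form))
```

A rule like this is useful mainly as a consistency check: forms whose attested Lashi tone differs from the prediction are the ones that need a closer look.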
  15. Challenges: Suprasegmental Correspondences (2) Burling's (1967) law (loss of preglottalization

    consonants): Examples to be reconstructed *C-: OBur. -khuiwḥ < *kuiwḥ 'smoke', Lashi -khouH OBur. khlyap < *klap 'flat object', Lashi khjapH
  16. Challenges: Suprasegmental Correspondences (2) Burling's (1967) law (loss of preglottalization

    consonants): Examples to be reconstructed *ˀC-: OBur. khat < *ˀkat 'put in (to); pack', Lashi kạ:tH OBur. khraṅ < *ˀkraŋ 'mosquito', Lashi kjạŋ
  17. Challenges: Partial Cognates ‘Cognacy is not a binary relation which

    is either present or not. Instead, we can distinguish different subtypes of cognacy, just as biologists can identify specific types of homology between genes.’ (List 2016: 133) partial cognacy in Chinese dialects (List et al. 2016)
  18. Challenges: Partial Cognates Binarisation of Partial Cognate Relations: • strict

    (only fully identical words are considered cognate) • loose (words sharing a cognate morpheme are cognate) Problems of Binarisation: • not realistic with respect to lexical change • over- or underestimates the number of shared cognates
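
The two binarisation schemes can be made concrete with a small sketch. The variety names `A` to `D` and the morpheme-level cognate IDs are invented for illustration, and the function names are hypothetical; each word is represented as a tuple of cognate IDs, one per morpheme.

```python
from itertools import combinations

# Hypothetical partial cognate annotation: one cognate ID per morpheme.
words = {
    "A": (1, 2),
    "B": (1, 2),
    "C": (1, 3),
    "D": (4,),
}


def strict_cognate(a, b):
    """Strict coding: cognate only if all morphemes match, in the same order."""
    return a == b


def loose_cognate(a, b):
    """Loose coding: cognate as soon as one morpheme is shared."""
    return bool(set(a) & set(b))


def cognate_pairs(words, judge):
    """Return all pairs of varieties judged cognate under the given coding."""
    return {
        (x, y)
        for x, y in combinations(sorted(words), 2)
        if judge(words[x], words[y])
    }


strict_pairs = cognate_pairs(words, strict_cognate)
loose_pairs = cognate_pairs(words, loose_cognate)
print(sorted(strict_pairs))
print(sorted(loose_pairs))
```

Under strict coding only A and B count as cognate, although C shares a morpheme with both; under loose coding A, B, and C are all lumped together, although C differs in its second morpheme. Neither binary answer reflects the partial relation itself.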
  19. Challenges: Partial Cognates Under-Estimation in Strict Encoding: “yesterday” • Bola

    [a³¹ŋji³ nɛʔ³¹] • Lashi [a³¹ŋjei nap³¹] • Rangoon [mɑ ³ne ³kɑ ³] • Xiandao [n³¹m ̥ an³ ]
  21. Challenges: Partial Cognates Over-Estimation in Loose Encoding: “the leaf” •

    Bola [sak faʔ ] • Lashi [a³¹fu ̱ ʔ ] • Rangoon [ɑ ³jwɛʔ ] • Xiandao [a³¹xʐoʔ ]
  23. Challenges: Language-Internal Cognates When using alignments to derive statistics from

    sound correspondences, dependencies inside a language need to be taken into account to avoid an overscoring of regularities. Language-internal cognates are invaluable evidence in classical cognate judgments and reconstruction. Current computational approaches ignore them completely.
  24. Challenges: Language-Internal Cognates prefix a- in Old Burmese • “the

    branch” a khak • “the mother” a miy • “the flower” a po₁ṅʔ • “the feather” a muyḥ • “the father” a phiy • “the leaf” a ro₁k “the dog” in Atsi • “the wolf” [vam ¹kʰui²¹mo ] • “the dog” [kʰui²¹] • “the fox” [tan kʰui²¹]
  25. Algorithms LingPy (Python library for historical linguistics) • automatic cognate

    detection (2014) • automatic detection of partial cognates (List et al. 2016) • multiple phonetic alignment (2014)
  26. Tools EDICTOR (Etymological Dictionary Editor) • data editing (X-Sampa input,

    automatic consistency tests) • data annotation (cognates, partial cognates, alignments, morpheme tagging) • data inspection (frequency analysis, structural analysis)
  27. Materials • data taken from Huang (1992) • currently 8

    Burmish varieties • 248 concepts selected (basic vocabulary, and etymologically important words) • partial cognates were automatically inferred and then manually corrected • alignments were automatically computed (will be manually corrected)
  28. Sound Correspondences: Searching for Patterns What shall we do with

    morpheme alignments? • if the Neogrammarians are right, ◦ a given proto-form in a given context should always yield the same reflex in a given descendant language • this means, ◦ compatible patterns in aligned cognate sets will hint at specific proto-sounds or proto-sounds in specific contexts
  29. Sound Correspondences: Searching for Patterns What is compatibility? • take

    all alignments for a given dataset, and select one common sound position (e.g., the initial of each morpheme) • tabulating, for each cognate set, which sound each language in our sample shows in that position gives us patterns that we can then compare
  30. Sound Correspondences: Searching for Patterns What is compatibility? Compatible:

    Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
    morpheus-1   p   p   p   Ø   f   f   Ø   p
    morpheus-2   p   Ø   p   p   Ø   f   p   p
  31. Sound Correspondences: Searching for Patterns What is compatibility? Compatible:

    Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
    morpheus-1   p   p   p   Ø   f   f   Ø   p
    morpheus-3   Ø   p   p   p   f   f   p   p
  32. Sound Correspondences: Searching for Patterns What is compatibility? NOT COMPATIBLE:

    Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
    morpheus-1   p   p   p   Ø   f   f   Ø   p
    morpheus-4   Ø   p   f   p   f   p   p   p
  33. Sound Correspondences: Searching for Patterns What is compatibility? • compatibility

    of two identical positions in different alignments is a necessary requirement to assume that the two alignments represent a common proto-sound in a common proto-context • it is not sufficient, as we have to deal with missing data, which may considerably blur the picture
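
The compatibility check illustrated on the preceding slides can be sketched in a few lines. This is a minimal reading of the slides, not the authors' implementation: the function name `compatible` is hypothetical, and `Ø` is treated as missing data that matches any sound.

```python
GAP = "Ø"  # marks a language without a reflex in this cognate set


def compatible(site_a, site_b, gap=GAP):
    """Two alignment sites are compatible if no language shows two
    different sounds; a gap in either site is ignored."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same languages")
    return all(a == b or gap in (a, b) for a, b in zip(site_a, site_b))


# The four example sites from the slides (languages L1 to L8)
m1 = "p p p Ø f f Ø p".split()
m2 = "p Ø p p Ø f p p".split()
m3 = "Ø p p p f f p p".split()
m4 = "Ø p f p f p p p".split()

print(compatible(m1, m2))  # True
print(compatible(m1, m3))  # True
print(compatible(m1, m4))  # False: L3 has p vs. f, L6 has f vs. p
```

Because gaps match anything, two sites that never overlap in a filled position are trivially compatible, which is one way to see why compatibility is necessary but not sufficient.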
  34. Sound Correspondences: Searching for Patterns Building a compatibility network of

    aligned cognate sets: • take the same position (e.g., initial consonant) in all alignments (called a “site” of the alignment) • make a network in which the alignment sites are nodes • draw an edge between two nodes if the two sites are compatible with each other • weight each edge by the number of positions (languages) in which neither alignment site has a gap
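
The network construction described here can be sketched with plain dictionaries, using the four example sites from the compatibility slides. The helper names (`compatible`, `shared_positions`, `build_network`) are hypothetical, and storing the network as a dictionary of weighted edges is an assumption made for brevity.

```python
from itertools import combinations

GAP = "Ø"

# Alignment sites (initial consonants) for languages L1 to L8
sites = {
    "morpheus-1": "p p p Ø f f Ø p".split(),
    "morpheus-2": "p Ø p p Ø f p p".split(),
    "morpheus-3": "Ø p p p f f p p".split(),
    "morpheus-4": "Ø p f p f p p p".split(),
}


def compatible(a, b):
    """No language may show two different sounds; gaps are ignored."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))


def shared_positions(a, b):
    """Count the languages in which both sites are filled (no gap)."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP)


def build_network(sites):
    """Return {(site_a, site_b): weight} for all compatible pairs."""
    edges = {}
    for (name_a, a), (name_b, b) in combinations(sorted(sites.items()), 2):
        if compatible(a, b):
            edges[(name_a, name_b)] = shared_positions(a, b)
    return edges


network = build_network(sites)
for edge, weight in sorted(network.items()):
    print(edge, weight)
```

In this toy example, morpheus-4 stays isolated, while the other three sites are connected to each other. Pairs without any shared filled position would receive weight 0; whether to keep such edges is a design decision the slides do not spell out.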
  35. Sound Correspondences: Searching for Patterns Search for maximal cliques to

    increase cluster compatibility: • A clique in a network is a group of nodes which are all connected with each other. • Cliques of compatible alignment sites represent the strongest evidence for a coherent group of regular correspondences pointing to the same proto-sound in a given context. • We use a simple method to search for non-overlapping cliques by maximizing their size, which is done in an iterative manner.
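
One possible reading of this iterative search is a greedy procedure: find a largest clique, remove its nodes, and repeat on what is left. The sketch below follows that reading and should not be taken as the authors' exact method; the function names are hypothetical, edge weights are ignored, and the edges are the ones that follow from the four example sites on the compatibility slides.

```python
def largest_clique(nodes, adjacent):
    """Return one largest clique among `nodes` (basic Bron-Kerbosch search)."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # clique cannot be extended any further
            if len(clique) > len(best):
                best = sorted(clique)
            return
        for node in sorted(candidates):
            neighbours = adjacent[node]
            expand(clique | {node}, candidates & neighbours, excluded & neighbours)
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(nodes), set())
    return best


def partition_into_cliques(nodes, edges):
    """Greedily split the network into non-overlapping cliques,
    always taking the largest remaining clique first."""
    adjacent = {node: set() for node in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    remaining = set(nodes)
    cliques = []
    while remaining:
        clique = largest_clique(remaining, adjacent)
        cliques.append(clique)
        remaining -= set(clique)
    return cliques


nodes = ["morpheus-1", "morpheus-2", "morpheus-3", "morpheus-4"]
edges = [
    ("morpheus-1", "morpheus-2"),
    ("morpheus-1", "morpheus-3"),
    ("morpheus-2", "morpheus-3"),
]
cliques = partition_into_cliques(nodes, edges)
print(cliques)
```

The exhaustive clique search is exponential in the worst case, which is acceptable for a few hundred alignment sites but would need a heuristic on much larger networks. Ties between equally large cliques are broken by sort order here, which is an arbitrary choice.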
  36. Sound Correspondences: Searching for Patterns Search for maximal cliques to

    increase cluster compatibility: • What looks chaotic is less scary if we look at the patterns! • Of 638 cognate sets: ◦ 337 occur in at least 3 taxa ◦ 317 start with an initial consonant ◦ 234 could be assigned to 35 transitive groups of minimal size 2 ◦ 74% (234 / 317) of the cognate sets with an initial consonant can be seen as “regular”. The remaining cognate sets will be checked and either corrected or their incompatibility will be explained.
  37. Sound Correspondences: Searching for Patterns Be careful with the interpretation

    of compatibility networks: • remember, one clique in our network does not necessarily correspond to one proto-sound in the proto-language (maybe, our alignments are wrong, our cognates are wrong, the words are borrowed…) • but: if a proto-sound in a certain number of identical contexts has derived regularly from proto-language to descendant languages, it should form a clique in our data! • compatibility networks of prosodically similar alignment sites are just a first step towards computer-assisted language reconstruction
  38. Language-Internal Cognates: Improving Annotation Atsi • “the moon” [lo ̱

    mo ] • “the tiger” [lo²¹mo ] • “the wolf” [vam ¹kʰui²¹mo ] • “the dog” [kʰui²¹] • “the fox” [tan kʰui²¹] Bola • “the wolf” [mjaŋ kʰui³ ] • “the dog” [kʰui³ ] • “the thunder” [mau³¹mjaŋ ] • “the sky” [mau³¹kʰauŋ ]
  39. Language-Internal Cognates: Improving Annotation

    Language | Concept   | Word          | Motivation
    Atsi     | the moon  | lo ̱ mo        | moon m-suffix
    Atsi     | the tiger | lo²¹mo        | tiger m-suffix
    Atsi     | the wolf  | vam ¹kʰui²¹mo | bear dog m-suffix
    Atsi     | the dog   | kʰui²¹        | dog
    Atsi     | the fox   | tan kʰui²¹    | fox(?) dog
  40. Language-Internal Cognates: Improving Annotation Lexeme motivation annotation in the EDICTOR:

    • simplified schema, all glosses are allowed, as long as they do not contain white-space • question marks can be used to express doubt • identically annotated morphemes can be inspected (they indicate language-internal cognates) • partial and full colexifications can be used to assist the morphological analysis • data can be visualized with the help of partial colexification graphs
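
How identically annotated morphemes point to language-internal cognates can be sketched with the Atsi annotations from the previous slide. This does not use the EDICTOR itself: the data layout (language, concept, motivation string) and the function names are assumptions made for illustration, with one white-space-free gloss per morpheme and `?` marking doubt.

```python
from collections import defaultdict

# Motivation annotations as shown on the slide (word forms omitted)
entries = [
    ("Atsi", "the moon", "moon m-suffix"),
    ("Atsi", "the tiger", "tiger m-suffix"),
    ("Atsi", "the wolf", "bear dog m-suffix"),
    ("Atsi", "the dog", "dog"),
    ("Atsi", "the fox", "fox(?) dog"),
]


def parse_motivation(annotation):
    """Split an annotation into (gloss, doubtful) pairs, one per morpheme."""
    parsed = []
    for gloss in annotation.split():
        doubtful = "?" in gloss
        clean = gloss.replace("(?)", "").replace("?", "")
        parsed.append((clean, doubtful))
    return parsed


def internal_cognates(entries):
    """Group concepts by (language, gloss); keep glosses used more than once."""
    groups = defaultdict(list)
    for language, concept, annotation in entries:
        for gloss, _doubtful in parse_motivation(annotation):
            groups[(language, gloss)].append(concept)
    return {key: concepts for key, concepts in groups.items() if len(concepts) > 1}


result = internal_cognates(entries)
for (language, gloss), concepts in result.items():
    print(language, gloss, concepts)
```

Here the gloss `dog` links "the wolf", "the dog", and "the fox", and `m-suffix` links "the moon", "the tiger", and "the wolf", which are the language-internal cognate relations the annotation is meant to expose.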
  41. Availability • Our GitHub (listing issues, plans, and software) ◦

    https://github.com/digling/burmish • Our database in the EDICTOR (current state, constantly evolving): ◦ https://dighl.github.io/burmish/ • Results of our Alignment Site Analysis: ◦ https://dighl.github.io/burmish/plot-corrs.html • For more questions, email us: ◦ [email protected]
  42. Outlook Chances are there! • working with partial cognates increases

    the realism of our analyses • working with alignments opens new horizons for computer-assisted consistency analysis • annotation of language-internal cognacy opens exciting research avenues for the investigation of semantic change, lexical typology, and language relationship
  43. Outlook Challenges remain! • How can we get from compatibility

    clusters in our alignments to first reconstructions? • Can we use compatibility to test the consistency of reconstruction systems? • How can we formalize the assignment of cross-semantic cognates in our wordlists?
  44. Thanks for Your Attention! Thanks to: • the Équipe AIRE

    (UPMC, Paris) for inspiration and help with network analyses • Guillaume Jacques and Laurent Sagart for helpful feedback on the EDICTOR tool and active help in creating the first concept list • Doug Cooper for helping out with parts of the data that we used
  45. References Bradley, David (1979). Proto-Loloish. London: Curzon Press. Burling, Robbins

    (1967). Proto-Lolo-Burmese. Bloomington: Indiana University. Hill, Nathan W. (2013). 'The merger of Proto-Burmish *ts and *č in Burmese.' SOAS Working Papers in Linguistics 16: 334-345. Kluge, Friedrich (1883). Etymologisches Wörterbuch der deutschen Sprache. Strassburg: K. J. Trübner. List, Johann-Mattis (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.
  46. References List, Johann-Mattis, Philippe Lopez, and Eric Bapteste (2016). Using

    sequence similarity networks to identify partial cognates in multilingual wordlists. Proceedings of the Annual Meeting of the ACL. Berlin: Association for Computational Linguistics. 599–605. Mann, Noel Walter (1998). A phonological reconstruction of Proto Northern Burmic. Unpublished thesis. Arlington: The University of Texas. Mayrhofer, Manfred (1986-2001). Etymologisches Wörterbuch des Altindoarischen. Heidelberg: Carl Winter.
  47. References Meyer-Luebke, Wilhelm (1911). Romanisches etymologisches Wörterbuch. Heidelberg: Winter. Nishi,

    Yoshio (1999). Four Papers on Burmese: Toward the history of Burmese (the Myanmar language). Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies. Pokorny, Julius (1959). Indogermanisches etymologisches Wörterbuch. Bern and Munich: Francke.