Challenges of presenting and analyzing etymological data of South-East Asian languages

Talk presented at the 46th Poznan Linguistic Meeting (2016/09/15-17, Poznan, Poland)

Johann-Mattis List

September 15, 2016

Transcript

  1. Beyond Cognacy
    Challenges of Representing and Analyzing Etymological Data
    of South-East Asian Languages
    Nathan W. Hill and Johann-Mattis List

  2. Background

  3. Traditional Approaches to Etymological Data
    ● traditional etymology looks back on a great success story
    ● huge dictionaries have been published (Pokorny 1959,
    Kluge 1883, Mayrhofer 1986–2001, Meyer-Luebke 1911)
    ● thousands of word histories have been reconstructed
    “Chaque mot a son histoire!” (“Every word has its own history!”)
    (attr. to Jules Gilliéron, 1854-1926)

  4. Drawbacks of Traditional Approaches
    Etymological dictionaries are:
    1. extremely time-consuming to produce and to use
    2. insufficiently formalized, opaque, and idiosyncratic
    3. difficult, if not unrealistic, to produce for understudied
    languages

  5. Quantitative Approaches to Etymological Data
    Current approaches to quantitative historical linguistics:
    ● rely on wordlists of basic vocabulary (Atkinson & Gray 2006)
    ● show an increase in breadth (more languages)
    ● show a decrease in depth (fewer words per language)
    ● usually ignore morphology (important in traditional approaches)
    ● rarely make the motivation for cognate judgments transparent
    ● rarely reach the etymological dictionary level of annotation

  6. Challenges for Computers
    “the belly” in Chinese and Tibetan:
    ● Old Chinese *puk
    ● Written Tibetan (1) grod-pa
    ● Written Tibetan (2) gsus-pa
    ● Lhasa Tibetan [tʂʰo¹³-ko ²]

  9. Challenges for Humans
    “There is a severe imbalance of
    being data-rich and theory-poor.”
    (William S.-Y. Wang, 1996)
    ● many datasets on South-East Asian languages have been published
    (Sidwell 2015, Wang 2004, Huang 1992, etc.)
    ● large digitized collections have been made available via the STEDT
    project (Matisoff 2011)
    ● but most of these data are unprocessed (not further checked
    by linguists) and lack etymologies, cognate judgments, phonetic
    transcriptions, or concept annotations

  10. The Best of Two Worlds?
    Can we combine the advantages
    of traditional and quantitative
    approaches to profit from
    computational efficiency and
    human insight?
    Which challenges do we face when
    pursuing integrated frameworks in
    South-East Asian languages?

  11. Etymological Database of Burmish Languages
    Background:
    ● part of ERC synergy grant 'Beyond Boundaries' (SOAS,
    British Museum, British Library)
    Goal:
    ● creating a classical etymological dictionary, taking full
    advantage of computational approaches with an openly
    published database online

  12. The Burmish Family
    classification following Hill 2013
    geographic distribution
    (Hammarström et al. 2015)

  13. Previous Research
    Burling (1967): pioneering and rigorous, but data too sparse
    Bradley (1979): inexplicit, not Burmish (Loloish)
    Mann (1998): no use of Old Burmese, no relative
    chronology of changes, morphemes as cognate sets
    Nishi (1999): no reconstruction, very clean organization into
    cognate sets, larger dataset than predecessors

  14. Challenges

  15. Challenges: Suprasegmental Correspondences
    Not all sound correspondences hold between sounds that
    occupy the same prosodic position in a word. Notably,
    processes like tonogenesis and various patterns of aspiration
    and voicing often co-occur with other sound changes. In
    these cases, a simple alignment of the words under
    consideration is usually not enough; the patterns of sound
    correspondences themselves need to be analyzed.

  16. Challenges: Suprasegmental Correspondences (1)
    Burmish tonal split in checked syllables:
    Aspirate initials corresponding to Lashi tone [55]
    WBur. khwak 'bowl', Lashi khuʔ
    OBur. khlup 'sew', Lashi khju:p
    WBur. khrok 'six', Lashi khjuk

  17. Challenges: Suprasegmental Correspondences (1)
    Burmish tonal split in checked syllables:
    Non-aspirated initials corresponding to Lashi [31]
    WBur. kok 'paddy rice', Lashi kuk³¹
    WBur. klyap 'ten cents; kyat', Lashi kjɔ³¹
    WBur. krok 'be afraid', Lashi kju:k³¹
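
The two correspondence sets of the tonal split can be read as a simple conditional rule. The following is a minimal sketch only: the helper names are hypothetical, and it assumes that aspiration can be read off the romanization as an initial stop or affricate followed by "h", which is an illustrative simplification and not part of the talk.

```python
# Hypothetical helpers: predict the Lashi tone of a checked syllable from the
# Burmese initial, following the two correspondence sets listed above.
ASPIRATED_ONSETS = ("kh", "ph", "th", "ch")  # assumption: romanized aspirates


def is_aspirated(burmese_form: str) -> bool:
    """True if the romanized form starts with an aspirated onset."""
    return burmese_form.lstrip("-").startswith(ASPIRATED_ONSETS)


def expected_lashi_tone(burmese_form: str) -> str:
    """Aspirated initials correspond to tone 55, plain initials to tone 31."""
    return "55" if is_aspirated(burmese_form) else "31"


examples = ["khwak", "khlup", "khrok", "kok", "klyap", "krok"]
predictions = {form: expected_lashi_tone(form) for form in examples}
```

Forms whose attested Lashi tone differs from the prediction would be candidates for manual inspection.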

  18. Challenges: Suprasegmental Correspondences (2)
    Burling's (1967) law (loss of preglottalization in consonants):
    Examples to be reconstructed with *C-:
    OBur. -khuiwḥ < *kuiwḥ 'smoke', Lashi -khouH
    OBur. khlyap < *klap 'flat object', Lashi khjapH

  19. Challenges: Suprasegmental Correspondences (2)
    Burling's (1967) law (loss of preglottalization in consonants):
    Examples to be reconstructed with *ˀC-:
    OBur. khat < *ˀkat 'put in (to); pack', Lashi kạ:tH
    OBur. khraṅ < *ˀkraŋ 'mosquito', Lashi kjạŋ

  20. Challenges: Partial Cognates
    ‘Cognacy is not a binary relation which is either present or not. Instead, we
    can distinguish different subtypes of cognacy, just as biologists can identify
    specific types of homology between genes.’ (List 2016: 133)
    partial cognacy in Chinese dialects (List et al. 2016)

  21. Challenges: Partial Cognates
    Binarisation of Partial Cognate Relations:
    ● strict (only words that match in all of their morphemes are considered cognate)
    ● loose (words sharing at least one cognate morpheme are considered cognate)
    Problems of Binarisation:
    ● not realistic with respect to lexical change
    ● over- or underestimates the number of shared cognates
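
The difference between the two codings can be illustrated with a small sketch. The input data and function names are hypothetical: each word is represented by a tuple of morpheme-level cognate IDs, and the loose coding is implemented here as connected components over shared IDs, which is one plausible reading of "sharing a cognate morpheme" and not necessarily the procedure used for the talk.

```python
# Hypothetical input: language -> tuple of partial (morpheme-level) cognate IDs.
words = {
    "A": (1, 2),
    "B": (1, 2),
    "C": (1, 3),
    "D": (4,),
}


def strict_coding(words):
    """Words are cognate only if all of their morphemes are cognate."""
    ids, out = {}, {}
    for lang, morphemes in words.items():
        key = tuple(sorted(morphemes))
        ids.setdefault(key, len(ids) + 1)
        out[lang] = ids[key]
    return out


def loose_coding(words):
    """Words are cognate if linked through at least one shared morpheme."""
    label, next_id = {}, 1
    for lang in words:
        if lang in label:
            continue
        label[lang] = next_id
        stack = [lang]
        while stack:  # flood-fill over words sharing a morpheme ID
            current = stack.pop()
            for other in words:
                if other not in label and set(words[current]) & set(words[other]):
                    label[other] = next_id
                    stack.append(other)
        next_id += 1
    return label


strict = strict_coding(words)
loose = loose_coding(words)
```

In this toy example the strict coding separates C from A and B although they share morpheme 1, while the loose coding merges all three into one set.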

  22. Challenges: Partial Cognates
    Under-Estimation in Strict Encoding: “yesterday”
    ● Bola [a³¹ŋji³ nɛʔ³¹]
    ● Lashi [a³¹ŋjei nap³¹]
    ● Rangoon [mɑ ³ne ³kɑ ³]
    ● Xiandao [n³¹m̥an³ ]

  24. Challenges: Partial Cognates
    Over-Estimation in Loose Encoding: “the leaf”
    ● Bola [sak faʔ ]
    ● Lashi [a³¹fu̱ʔ ]
    ● Rangoon [ɑ ³jwɛʔ ]
    ● Xiandao [a³¹xʐoʔ ]

  26. Challenges: Language-Internal Cognates
    When using alignments to derive statistics from sound
    correspondences, dependencies inside a language need to
    be taken into account to avoid an overscoring of regularities.
    Language-internal cognates are invaluable evidence in
    classical cognate judgments and reconstruction. Current
    computational approaches ignore them completely.

  27. Challenges: Language-Internal Cognates
    prefix a- in Old Burmese
    ● “the branch” a khak
    ● “the mother” a miy
    ● “the flower” a po₁ṅʔ
    ● “the feather” a muyḥ
    ● “the father” a phiy
    ● “the leaf” a ro₁k
    “the dog” in Atsi
    ● “the wolf” [vam ¹kʰui²¹mo ]
    ● “the dog” [kʰui²¹]
    ● “the fox” [tan kʰui²¹]

  28. Preliminary Solutions

  29. Algorithms
    LingPy (Python library for historical linguistics)
    ● automatic cognate detection (List 2014)
    ● automatic detection of partial cognates (List et al. 2016)
    ● multiple phonetic alignment (List 2014)

  30. Tools
    EDICTOR (Etymological Dictionary Editor)
    ● data editing (X-SAMPA input, automatic consistency tests)
    ● data annotation (cognates, partial cognates, alignments,
    morpheme tagging)
    ● data inspection (frequency analysis, structural analysis)

  32. Materials
    ● data taken from Huang (1992)
    ● currently 8 Burmish varieties
    ● 248 concepts selected (basic vocabulary and
    etymologically important words)
    ● partial cognates were automatically inferred and then
    manually corrected
    ● alignments were automatically computed (will be
    manually corrected)

  33. Partial Cognates: Annotation with the EDICTOR

  39. Partial Cognates and Strict vs. Loose Coding

  40. Partial Cognates: Strict versus Loose Coding

  42. Sound Correspondences: Searching for Patterns
    patterns of initial sounds in 337 aligned partial cognate sets:

  47. Sound Correspondences: Searching for Patterns
    What shall we do with morpheme alignments?
    ● if the Neogrammarians are right,
    ○ a given proto-form in a given context should always
    yield the same reflex in a given descendant language
    ● this means,
    ○ compatible patterns in aligned cognate sets will point to
    specific proto-sounds or proto-sounds in specific
    contexts

  48. Sound Correspondences: Searching for Patterns
    What is compatibility?
    ● take all alignments for a given dataset, and select one
    common sound position (e.g., initial of each morpheme)
    ● by tabulating, for each language in our sample, which
    sound occurs at that position in a given cognate set, we
    can take a first step toward comparing these patterns
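
The tabulation step can be sketched as follows. The alignments, language labels, and the function name `site` are hypothetical placeholders, and the symbol Ø is assumed to mark a language without a reflex in the cognate set.

```python
GAP = "Ø"  # assumption: marks a language with no reflex in the cognate set
languages = ["L1", "L2", "L3"]

# Hypothetical aligned cognate sets: language -> list of aligned segments.
alignments = {
    "set-1": {"L1": ["p", "a"], "L2": ["f", "a"]},
    "set-2": {"L1": ["p", "u"], "L3": ["p", "o"]},
}


def site(alignment, position, languages):
    """Return one alignment column as a tuple ordered by language."""
    return tuple(
        alignment[lang][position] if lang in alignment else GAP
        for lang in languages
    )


# Select the initial position (index 0) of every aligned cognate set.
initial_sites = {name: site(aln, 0, languages) for name, aln in alignments.items()}
```

Each resulting tuple is one row of the kind of table used for comparing patterns across cognate sets.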

  49. Sound Correspondences: Searching for Patterns
    What is compatibility?
    compatible
    Cognate set L1 L2 L3 L4 L5 L6 L7 L8
    morpheus-1 p p p Ø f f Ø p
    morpheus-2 p Ø p p Ø f p p

  50. Sound Correspondences: Searching for Patterns
    What is compatibility?
    compatible
    Cognate set L1 L2 L3 L4 L5 L6 L7 L8
    morpheus-1 p p p Ø f f Ø p
    morpheus-3 Ø p p p f f p p

  51. Sound Correspondences: Searching for Patterns
    What is compatibility?
    NOT COMPATIBLE
    Cognate set L1 L2 L3 L4 L5 L6 L7 L8
    morpheus-1 p p p Ø f f Ø p
    morpheus-4 Ø p f p f p p p

  52. Sound Correspondences: Searching for Patterns
    What is compatibility?
    ● compatibility of two identical positions in different
    alignments is a necessary requirement to assume that the
    two alignments represent a common proto-sound in a
    common proto-context
    ● it is not sufficient, as we have to deal with missing data,
    which may considerably blur the picture
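
The compatibility test from the three example tables can be written down in a few lines. This is a sketch under one stated assumption: two sites that never overlap in a filled position are treated as trivially compatible, which the slides do not address explicitly.

```python
GAP = "Ø"


def compatible(site_a, site_b, gap=GAP):
    """Two sites are compatible if they never disagree where both are filled."""
    return all(a == b for a, b in zip(site_a, site_b) if a != gap and b != gap)


# Rows copied from the example tables (languages L1 to L8).
morpheus_1 = "p p p Ø f f Ø p".split()
morpheus_2 = "p Ø p p Ø f p p".split()
morpheus_3 = "Ø p p p f f p p".split()
morpheus_4 = "Ø p f p f p p p".split()

results = {
    "1-2": compatible(morpheus_1, morpheus_2),
    "1-3": compatible(morpheus_1, morpheus_3),
    "1-4": compatible(morpheus_1, morpheus_4),
}
```

The last pair fails because L3 (p versus f) and L6 (f versus p) disagree where both sites are filled.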

  53. Sound Correspondences: Searching for Patterns
    Building a compatibility network of aligned cognate sets:
    ● take the same position (e.g., initial consonant) in all
    alignments (called a “site” of the alignment)
    ● make a network in which the alignment sites are nodes
    ● edges in the network are drawn between two nodes if
    these are compatible with each other
    ● edge weights are determined by counting the positions
    that are filled (not a gap) in both alignment sites
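
The construction of the network can be sketched with the standard library alone. The function names and the edge representation (a dictionary keyed by node pairs) are illustrative choices, and requiring at least one jointly filled position before drawing an edge is an assumption made here.

```python
from itertools import combinations

GAP = "Ø"

# Alignment sites (nodes), reusing the rows of the example tables.
sites = {
    "morpheus-1": "p p p Ø f f Ø p".split(),
    "morpheus-2": "p Ø p p Ø f p p".split(),
    "morpheus-3": "Ø p p p f f p p".split(),
    "morpheus-4": "Ø p f p f p p p".split(),
}


def overlap(a, b):
    """Count the positions that are filled in both sites."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP)


def compatible(a, b):
    """True if the sites agree wherever both are filled."""
    return all(x == y for x, y in zip(a, b) if x != GAP and y != GAP)


def compatibility_network(sites):
    """Return weighted edges between all pairs of compatible sites."""
    edges = {}
    for (n1, s1), (n2, s2) in combinations(sites.items(), 2):
        weight = overlap(s1, s2)
        if weight > 0 and compatible(s1, s2):  # assumption: need some overlap
            edges[frozenset((n1, n2))] = weight
    return edges


edges = compatibility_network(sites)
```

Here morpheus-1, morpheus-2, and morpheus-3 are mutually connected, while morpheus-4 stays isolated.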

  56. Sound Correspondences: Searching for Patterns
    Search for maximal cliques to increase cluster compatibility:
    ● A clique in a network is a group of nodes which are all connected with
    each other.
    ● Cliques of compatible alignment sites represent the strongest
    evidence for a coherent group of regular correspondences pointing to
    the same proto-sound in a given context.
    ● We use a simple method to search for non-overlapping cliques by
    maximizing their size, which is done in an iterative manner.
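
One way to realize such an iterative search is sketched below. The slides do not spell out the exact procedure, so this is an assumption: the largest maximal clique is picked (found with a plain Bron-Kerbosch enumeration), its nodes are removed, and the step is repeated. The toy network and all names are hypothetical.

```python
def maximal_cliques(adj):
    """Yield all maximal cliques (Bron-Kerbosch without pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def greedy_clique_partition(adj):
    """Repeatedly take the largest clique and remove its nodes."""
    adj = {n: set(nb) for n, nb in adj.items()}
    groups = []
    while adj:
        # ties are broken by node names to keep the result deterministic
        best = max(maximal_cliques(adj), key=lambda c: (len(c), sorted(c)))
        groups.append(best)
        adj = {n: nb - best for n, nb in adj.items() if n not in best}
    return groups


# Hypothetical compatibility network: node -> set of compatible nodes.
network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
groups = greedy_clique_partition(network)
regular = [g for g in groups if len(g) >= 2]  # keep groups of size 2 or more
```

Enumerating all maximal cliques is exponential in the worst case, so this sketch is only meant for networks of a few hundred nodes.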

  58. Sound Correspondences: Searching for Patterns
    Search for maximal cliques to increase cluster compatibility:
    ● What looks chaotic is less scary if we look at the patterns!
    ● Of 638 cognate sets:
    ○ 337 occur in at least 3 taxa
    ○ 317 start with an initial consonant
    ○ 234 could be assigned to 35 transitive groups of minimum size 2
    ○ 74% (234 / 317) of the consonant-initial cognate sets can be seen as
    “regular”. The remaining cognate sets will be checked; they will either
    be corrected, or their incompatibility will be explained.

  65. Sound Correspondences: Searching for Patterns
    Proto-Burmish *k-

  66. Sound Correspondences: Searching for Patterns
    Proto-Burmish *ˀk-
    Proto-Burmish *kh-

  67. Sound Correspondences: Searching for Patterns
    Be careful with the interpretation of compatibility networks:
    ● remember, one clique in our network does not necessarily correspond
    to one proto-sound in the proto-language (maybe, our alignments are
    wrong, our cognates are wrong, the words are borrowed…)
    ● but: if a proto-sound in a certain number of identical contexts has
    derived regularly from proto-language to descendant languages, it
    should form a clique in our data!
    ● compatibility networks of prosodically similar alignment sites are just a
    first step towards computer-assisted language reconstruction

  68. Language-Internal Cognates: Improving Annotation
    Atsi
    ● “the moon” [lo̱ mo ]
    ● “the tiger” [lo²¹mo ]
    ● “the wolf” [vam ¹kʰui²¹mo ]
    ● “the dog” [kʰui²¹]
    ● “the fox” [tan kʰui²¹]
    Bola
    ● “the wolf” [mjaŋ kʰui³ ]
    ● “the dog” [kʰui³ ]
    ● “the thunder” [mau³¹mjaŋ ]
    ● “the sky” [mau³¹kʰauŋ ]

  69. Language-Internal Cognates: Improving Annotation
    Language  Concept    Word            Motivation
    Atsi      the moon   lo̱ mo           moon m-suffix
    Atsi      the tiger  lo²¹mo          tiger m-suffix
    Atsi      the wolf   vam ¹kʰui²¹mo   bear dog m-suffix
    Atsi      the dog    kʰui²¹          dog
    Atsi      the fox    tan kʰui²¹      fox(?) dog

  70. Language-Internal Cognates: Improving Annotation
    Lexeme motivation annotation in the EDICTOR:
    ● simplified schema: all glosses are allowed, as long as
    they do not contain whitespace
    ● question marks can be used to express doubt
    ● identically annotated morphemes can be inspected
    (they indicate language-internal cognates)
    ● partial and full colexifications can be used to
    assist the morphological analysis
    ● data can be visualized with the help of partial
    colexification graphs
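
How identically annotated morphemes reveal language-internal cognates can be shown with the Atsi motivation annotations from the preceding slide. This is a sketch independent of the EDICTOR itself: the function name is hypothetical, and stripping the doubt marker "(?)" before comparing glosses is an assumption made for illustration.

```python
from collections import defaultdict

# Motivation annotations for Atsi, copied from the table on the preceding slide.
motivations = {
    "the moon": "moon m-suffix",
    "the tiger": "tiger m-suffix",
    "the wolf": "bear dog m-suffix",
    "the dog": "dog",
    "the fox": "fox(?) dog",
}


def internal_cognates(motivations):
    """Group concepts by the morpheme glosses they share."""
    by_gloss = defaultdict(list)
    for concept, annotation in motivations.items():
        # glosses contain no whitespace, so splitting on it is safe
        for gloss in annotation.split():
            by_gloss[gloss.replace("(?)", "")].append(concept)
    # a gloss used by more than one concept marks language-internal cognates
    return {g: c for g, c in by_gloss.items() if len(c) > 1}


shared = internal_cognates(motivations)
```

The concept pairs collected under each shared gloss correspond to the edges of a partial colexification graph.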

  71. Language-Internal Cognates: Improving Annotation
    “black”, “dark”, and “the mouse or rat” in Bola

  72. Language-Internal Cognates: Improving Annotation
    “black”, “dark”, “early”, and “the mouse or rat” in Atsi

  73. Language-Internal Cognates: Improving Annotation
    “bad”, “good”, and “the man” in Bola

  74. Language-Internal Cognates: Improving Annotation
    “good”, “the man”, and “the son” in Atsi

  75. Language-Internal Cognates: Improving Annotation
    automatically computed
    partial colexification graph
    for “noʔ²¹” in Atsi

  76. Language-Internal Cognates: Improving Annotation
    partial colexification graph
    for manually annotated
    cluster of “tree” in Bola

  77. Availability
    ● Our GitHub (listing issues, plans, and software)
    ○ https://github.com/digling/burmish
    ● Our database in the EDICTOR (current state,
    constantly evolving):
    ○ https://dighl.github.io/burmish/
    ● Results of our Alignment Site Analysis:
    ○ https://dighl.github.io/burmish/plot-corrs.html
    ● For more questions, email us:
    [email protected]

  78. Outlook

  79. Outlook
    Opportunities are there!
    ● working with partial cognates increases the realism of our analyses
    ● working with alignments opens new horizons for computer-assisted
    consistency analysis
    ● annotation of language-internal cognacy opens exciting research
    avenues for the investigation of semantic change, lexical typology,
    and language relationship

  80. Outlook
    Challenges remain!
    ● How can we get from compatibility clusters in our alignments to first
    reconstructions?
    ● Can we use compatibility to test the consistency of reconstruction
    systems?
    ● How can we formalize the assignment of cross-semantic cognates in
    our wordlists?

  81. Thanks for Your Attention!
    Thanks to:
    ● the Équipe AIRE (UPMC, Paris) for inspiration and help with network analyses
    ● Guillaume Jacques and Laurent Sagart for helpful feedback on the EDICTOR tool
    and active help in creating the first concept list
    ● Doug Cooper for helping out with parts of the data that we used

  82. References
    Bradley, David (1979). Proto-Loloish. London: Curzon Press.
    Burling, Robbins (1967). Proto-Lolo-Burmese. Bloomington: Indiana University.
    Hill, Nathan W. (2013). 'The merger of Proto-Burmish *ts and *č in Burmese.'
    SOAS Working Papers in Linguistics 16: 334-345.
    Kluge, Friedrich (1883). Etymologisches Wörterbuch der deutschen Sprache.
    Strassburg: K. J. Trübner.
    List, Johann-Mattis (2014). Sequence comparison in historical linguistics.
    Düsseldorf: Düsseldorf University Press.

  83. References
    List, Johann-Mattis, Philippe Lopez, and Eric Bapteste (2016). Using sequence
    similarity networks to identify partial cognates in multilingual wordlists.
    Proceedings of the Annual Meeting of the ACL. Berlin: Association for
    Computational Linguistics. 599–605.
    Mann, Noel Walter (1998). A phonological reconstruction of Proto Northern
    Burmic. Unpublished thesis. Arlington: The University of Texas.
    Mayrhofer, Manfred (1986-2001). Etymologisches Wörterbuch des
    Altindoarischen. Heidelberg: Carl Winter.

  84. References
    Meyer-Luebke, Wilhelm (1911). Romanisches etymologisches Wörterbuch.
    Heidelberg: Winter.
    Nishi, Yoshio (1999). Four Papers on Burmese: Toward the history of Burmese
    (the Myanmar language). Tokyo: Institute for the study of languages and cultures
    of Asia and Africa, Tokyo University of Foreign Studies.
    Pokorny, Julius (1959). Indogermanisches etymologisches Wörterbuch. Bern and
    Munich: Francke.
