Challenges of presenting and analyzing etymological data of South-East Asian languages

Slide 1

Slide 1 text

Beyond Cognacy: Challenges of Representing and Analyzing Etymological Data of South-East Asian Languages. Nathan W. Hill and Johann-Mattis List

Slide 2

Slide 2 text

Background

Slide 3

Slide 3 text

Traditional Approaches to Etymological Data ● traditional etymology looks back on a great success story ● huge dictionaries have been published (Pokorny 1959, Kluge 1883, Mayrhofer 1986–2001, Meyer-Lübke 1911) ● thousands of word histories have been reconstructed “Chaque mot a son histoire!” (attr. to Jules Gilliéron, 1854–1926)

Slide 4

Slide 4 text

Drawbacks of Traditional Approaches Etymological dictionaries are: 1. extremely time-consuming to produce and to use 2. insufficiently formalized, non-transparent, and idiosyncratic 3. difficult, if not unrealistic, to produce for understudied languages

Slide 5

Slide 5 text

Quantitative Approaches to Etymological Data Current approaches to quantitative historical linguistics: ● rely on wordlists of basic vocabulary (Atkinson & Gray 2006) ● show an increase in breadth (more languages) ● show a decrease in depth (fewer words per language) ● usually ignore morphology (important in traditional approaches) ● leave the motivation for cognate judgements opaque ● rarely reach the level of annotation found in etymological dictionaries

Slide 6

Slide 6 text

Challenges for Computers “the belly” in Chinese and Tibetan: ● Old Chinese *puk ● Written Tibetan (1) grod-pa ● Written Tibetan (2) gsus-pa ● Lhasa Tibetan [tʂʰo¹³-ko ²]

Slide 9

Slide 9 text

Challenges for Humans “There is a severe imbalance of being data-rich and theory-poor.” (William S.-Y. Wang, 1996) ● many datasets on South-East Asian languages have been published (Sidwell 2015, Wang 2004, Huang 1992, etc.) ● large digitized collections have been made available via the STEDT project (Matisoff 2011) ● but most of these data are unprocessed (not further checked by linguists) and lack etymologies, cognate judgments, phonetic transcriptions, or concept annotations

Slide 10

Slide 10 text

The Best of Two Worlds? Can we combine the advantages of traditional and quantitative approaches to profit from computational efficiency and human insight? Which challenges do we face when pursuing integrated frameworks in South-East Asian languages?

Slide 11

Slide 11 text

Etymological Database of Burmish Languages Background: ● part of the ERC Synergy Grant 'Beyond Boundaries' (SOAS, British Museum, British Library) Goal: ● creating a classical etymological dictionary that takes full advantage of computational approaches, with an openly published online database

Slide 12

Slide 12 text

The Burmish Family ● classification following Hill 2013 ● geographic distribution (Hammarström et al. 2015)

Slide 13

Slide 13 text

Previous Research ● Burling (1967): pioneering and rigorous, but data too sparse ● Bradley (1979): inexplicit, not Burmish (Loloish) ● Mann (1998): no use of Old Burmese, no relative chronology of changes, morphemes as cognate sets ● Nishi (1999): no reconstruction, very clean organization into cognate sets, larger dataset than predecessors

Slide 14

Slide 14 text

Challenges

Slide 15

Slide 15 text

Challenges: Suprasegmental Correspondences Not all sound correspondences hold between sounds in the same prosodic position of a word. Notably, processes such as tonogenesis and various patterns of aspiration and voicing often co-occur with other sound changes. In these cases, simply aligning the words under consideration is usually not enough; the patterns of sound correspondences themselves need to be analyzed.

Slide 16

Slide 16 text

Challenges: Suprasegmental Correspondences (1) Burmish tonal split in checked syllables: aspirated initials correspond to Lashi tone [55] ● WBur. khwak 'bowl', Lashi khuʔ ● OBur. khlup 'sew', Lashi khju:p ● WBur. khrok 'six', Lashi khjuk

Slide 17

Slide 17 text

Challenges: Suprasegmental Correspondences (1) Burmish tonal split in checked syllables: non-aspirated initials correspond to Lashi tone [31] ● WBur. kok 'paddy rice', Lashi kuk³¹ ● WBur. klyap 'ten cents; kyat', Lashi kjɔ³¹ ● WBur. krok 'be afraid', Lashi kju:k³¹

Slide 18

Slide 18 text

Challenges: Suprasegmental Correspondences (2) Burling's (1967) law (loss of preglottalized consonants): examples to be reconstructed with *C-: ● OBur. -khuiwḥ < *kuiwḥ 'smoke', Lashi -khouH ● OBur. khlyap < *klap 'flat object', Lashi khjapH

Slide 19

Slide 19 text

Challenges: Suprasegmental Correspondences (2) Burling's (1967) law (loss of preglottalized consonants): examples to be reconstructed with *ˀC-: ● OBur. khat < *ˀkat 'put in (to); pack', Lashi kạ:tH ● OBur. khraṅ < *ˀkraŋ 'mosquito', Lashi kjạŋ

Slide 20

Slide 20 text

Challenges: Partial Cognates ‘Cognacy is not a binary relation which is either present or not. Instead, we can distinguish different subtypes of cognacy, just as biologists can identify specific types of homology between genes.’ (List 2016: 133) partial cognacy in Chinese dialects (List et al. 2016)

Slide 21

Slide 21 text

Challenges: Partial Cognates Binarisation of Partial Cognate Relations: ● strict (only words whose morphemes are all cognate count as cognate) ● loose (words sharing at least one cognate morpheme count as cognate) Problems of Binarisation: ● not realistic with respect to lexical change ● over- or underestimates the number of shared cognates
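The two binarisation schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the morpheme-level cognate IDs are invented, the function names are hypothetical, and loose coding is treated here as a pairwise relation (in practice it is often extended transitively to whole clusters).

```python
from itertools import combinations

# Hypothetical data: each word is a tuple of morpheme-level cognate IDs.
# A and B share both morphemes; C shares only the first one; D shares none.
words = {
    "A": (1, 2),
    "B": (1, 2),
    "C": (1, 3),
    "D": (4,),
}

def strict_cognate(w1, w2):
    """Strict coding: cognate only if all morphemes match, in order."""
    return w1 == w2

def loose_cognate(w1, w2):
    """Loose coding: cognate if at least one morpheme is shared."""
    return bool(set(w1) & set(w2))

def cognate_pairs(words, judge):
    """Return all language pairs judged cognate under the given scheme."""
    return {
        (a, b)
        for a, b in combinations(sorted(words), 2)
        if judge(words[a], words[b])
    }

strict_pairs = cognate_pairs(words, strict_cognate)
loose_pairs = cognate_pairs(words, loose_cognate)
print(sorted(strict_pairs))
print(sorted(loose_pairs))
```

In this toy example strict coding misses the shared first morpheme of C, while loose coding treats A, B, and C as equally cognate, which is the under- and over-estimation problem named on the slide.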

Slide 22

Slide 22 text

Challenges: Partial Cognates Under-Estimation in Strict Encoding: “yesterday” ● Bola [a³¹ŋji³ nɛʔ³¹] ● Lashi [a³¹ŋjei nap³¹] ● Rangoon [mɑ ³ne ³kɑ ³] ● Xiandao [n³¹m ̥ an³ ]

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Challenges: Partial Cognates Over-Estimation in Loose Encoding: “the leaf” ● Bola [sak faʔ ] ● Lashi [a³¹fu ̱ ʔ ] ● Rangoon [ɑ ³jwɛʔ ] ● Xiandao [a³¹xʐoʔ ]

Slide 26

Slide 26 text

Challenges: Language-Internal Cognates When using alignments to derive statistics from sound correspondences, dependencies inside a language need to be taken into account to avoid an overscoring of regularities. Language-internal cognates are invaluable evidence in classical cognate judgments and reconstruction. Current computational approaches ignore them completely.

Slide 27

Slide 27 text

Challenges: Language-Internal Cognates Prefix a- in Old Burmese: ● “the branch” a khak ● “the mother” a miy ● “the flower” a po₁ṅʔ ● “the feather” a muyḥ ● “the father” a phiy ● “the leaf” a ro₁k “The dog” in Atsi: ● “the wolf” [vam ¹kʰui²¹mo ] ● “the dog” [kʰui²¹] ● “the fox” [tan kʰui²¹]

Slide 28

Slide 28 text

Preliminary Solutions

Slide 29

Slide 29 text

Algorithms LingPy (Python library for historical linguistics) ● automatic cognate detection (2014) ● automatic detection of partial cognates (List et al. 2016) ● multiple phonetic alignment (2014)

Slide 30

Slide 30 text

Tools EDICTOR (Etymological Dictionary Editor) ● data editing (X-SAMPA input, automatic consistency tests) ● data annotation (cognates, partial cognates, alignments, morpheme tagging) ● data inspection (frequency analysis, structural analysis)

Slide 31

Slide 31 text

Tools

Slide 32

Slide 32 text

Materials ● data taken from Huang (1992) ● currently 8 Burmish varieties ● 248 concepts selected (basic vocabulary, and etymologically important words) ● partial cognates were automatically inferred and then manually corrected ● alignments were automatically computed (will be manually corrected)

Slide 33

Slide 33 text

Partial Cognates: Annotation with the EDICTOR

Slide 39

Slide 39 text

Partial Cognates and Strict vs. Loose Coding

Slide 40

Slide 40 text

Partial Cognates: Strict versus Loose Coding

Slide 42

Slide 42 text

Sound Correspondences: Searching for Patterns patterns of initial sounds in 337 aligned partial cognate sets:

Slide 43

Slide 43 text

Sound Correspondences: Searching for Patterns patterns of initial sounds in aligned partial cognate sets:

Slide 47

Slide 47 text

Sound Correspondences: Searching for Patterns What shall we do with morpheme alignments? ● if the Neogrammarians are right, ○ a given proto-form in a given context should always yield the same reflex in a given descendant language ● this means that ○ compatible patterns in aligned cognate sets hint at specific proto-sounds, or at proto-sounds in specific contexts

Slide 48

Slide 48 text

Sound Correspondences: Searching for Patterns What is compatibility? ● take all alignments for a given dataset and select one common sound position (e.g., the initial of each morpheme) ● record, for each language in the sample, which sound occurs in that position in each cognate set ● comparing the resulting patterns across cognate sets is a first step towards finding compatible ones
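Extracting one position from every alignment might look like the following sketch. The data layout (a dictionary from language to aligned segments), the toy forms, and the names `site` and `MISSING` are assumptions made for illustration; they do not reproduce the EDICTOR or LingPy data model.

```python
# Hypothetical aligned cognate sets: language -> list of aligned segments.
# "-" is an alignment gap inside a word; a language absent from a set has
# no reflex at all in that cognate set.
alignments = {
    "set-1": {"L1": ["p", "a", "k"], "L2": ["p", "a", "ʔ"], "L3": ["f", "a", "-"]},
    "set-2": {"L1": ["p", "u", "-"], "L3": ["f", "u", "k"]},
}
languages = ["L1", "L2", "L3"]
MISSING = "Ø"  # marks a language without a reflex in the cognate set

def site(alignment, position, languages):
    """Return the sounds found at one alignment position, one per language."""
    return tuple(
        alignment[lang][position] if lang in alignment else MISSING
        for lang in languages
    )

# Pattern of initial sounds for every cognate set.
initials = {name: site(alm, 0, languages) for name, alm in alignments.items()}
for name, pattern in initials.items():
    print(name, " ".join(pattern))
```

Each resulting tuple is one row of the kind of table shown on the following slides, with one column per language.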

Slide 49

Slide 49 text

Sound Correspondences: Searching for Patterns What is compatibility? Compatible:
Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
morpheus-1   p   p   p   Ø   f   f   Ø   p
morpheus-2   p   Ø   p   p   Ø   f   p   p

Slide 50

Slide 50 text

Sound Correspondences: Searching for Patterns What is compatibility? Compatible:
Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
morpheus-1   p   p   p   Ø   f   f   Ø   p
morpheus-3   Ø   p   p   p   f   f   p   p

Slide 51

Slide 51 text

Sound Correspondences: Searching for Patterns What is compatibility? Not compatible:
Cognate set  L1  L2  L3  L4  L5  L6  L7  L8
morpheus-1   p   p   p   Ø   f   f   Ø   p
morpheus-4   Ø   p   f   p   f   p   p   p
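The three examples suggest a simple pairwise test, sketched below under two assumptions that the slides do not state explicitly: Ø marks missing data (no reflex in that language), and two patterns are compatible whenever they never disagree in a language where both have a sound. The function name `compatible` is hypothetical, and no minimum number of shared languages is required here.

```python
MISSING = "Ø"

def compatible(site_a, site_b, missing=MISSING):
    """True if the two patterns never conflict where both are attested."""
    if len(site_a) != len(site_b):
        raise ValueError("patterns must cover the same languages")
    return all(a == b or missing in (a, b) for a, b in zip(site_a, site_b))

# The four patterns of initial sounds shown on the slides (L1 to L8).
m1 = "p p p Ø f f Ø p".split()
m2 = "p Ø p p Ø f p p".split()
m3 = "Ø p p p f f p p".split()
m4 = "Ø p f p f p p p".split()

print(compatible(m1, m2))  # compatible on the slide
print(compatible(m1, m3))  # compatible on the slide
print(compatible(m1, m4))  # conflicts in L3 (p vs. f) and L6 (f vs. p)
```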

Slide 52

Slide 52 text

Sound Correspondences: Searching for Patterns What is compatibility? ● compatibility of the same position in two different alignments is a necessary condition for assuming that the two alignments reflect a common proto-sound in a common proto-context ● it is not a sufficient condition, since we have to deal with missing data, which may significantly blur the picture

Slide 53

Slide 53 text

Sound Correspondences: Searching for Patterns Building a compatibility network of aligned cognate sets: ● take the same position (e.g., the initial consonant) in all alignments (called a “site” of the alignment) ● build a network in which the alignment sites are nodes ● draw an edge between two nodes if they are compatible with each other ● weight each edge by the number of positions in which neither alignment site has a gap
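The construction steps above can be sketched with plain Python dictionaries. This is an illustration under stated assumptions rather than the authors' code: Ø is read as a gap or missing reflex, the network is stored as a mapping from node pairs to weights, and the helper names are hypothetical. The sites are the four example patterns from the preceding slides.

```python
from itertools import combinations

MISSING = "Ø"

# Alignment sites (initial sounds in languages L1 to L8) from the slides.
sites = {
    "morpheus-1": "p p p Ø f f Ø p".split(),
    "morpheus-2": "p Ø p p Ø f p p".split(),
    "morpheus-3": "Ø p p p f f p p".split(),
    "morpheus-4": "Ø p f p f p p p".split(),
}

def compatible(a, b):
    """No conflicting sounds in any language where both sites are filled."""
    return all(x == y or MISSING in (x, y) for x, y in zip(a, b))

def shared_positions(a, b):
    """Number of languages in which neither site has a gap."""
    return sum(1 for x, y in zip(a, b) if MISSING not in (x, y))

def compatibility_network(sites):
    """Return weighted edges {(node_a, node_b): weight} between compatible sites."""
    edges = {}
    for (name_a, a), (name_b, b) in combinations(sites.items(), 2):
        if compatible(a, b):
            edges[(name_a, name_b)] = shared_positions(a, b)
    return edges

network = compatibility_network(sites)
for (a, b), weight in sorted(network.items()):
    print(a, b, weight)
```

With these four sites, morpheus-4 remains an isolated node, while the other three form a triangle; groups of mutually compatible sites are candidates for a shared proto-sound.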