Computer-Assisted Approaches to Linguistic Reconstruction A Case Study from the Burmish Languages Johann-Mattis List¹ and Nathan W. Hill² 2017-07-21 ¹ Max Planck Institute for the Science of Human History, ² SOAS, University of London

Outline
1. Introduction
2. Phonological Reconstruction
3. Computer-Assisted Phonological Reconstruction
4. Examples
5. Outlook

Etymology 2.0?
Historical linguistics after the quantitative turn:
• Quantitative methods in historical linguistics have received much attention of late,
• but only a few (if any) of the new methods have addressed long-standing problems of classical linguistics,
• and as a result, many classical linguists are very sceptical of the new approaches.

Etymology 3.0?
Towards a “qualitative turn” in computational historical linguistics:
• Instead of blaming computers for our misery (no funding, institutes being shut down, etc.), we should start seeing computers as a chance to address the important questions which we have not solved in 200 years of research.
• But we don’t need a framework in which computers do our work for us; instead, we need a framework in which we tell computers to do some work for us, in order to render our research more explicit, more efficient, and more rigorous.

Computer-Assisted Language Comparison
CALC (MPI-SHH, Jena)
• computational formalization of the classical methods for historical language comparison
• establish a close collaboration between computational and classical historical linguistics by providing data in human- and machine-readable form
ASIA (SOAS, London)
• reconstruction of Proto-Burmish

The Burmish Etymological Database (BED)
Problem: current etymological accounts of Proto-Burmish have many problems (no lexical reconstruction, insufficient phonological reconstruction, unclear data, non-transparent methodology)
Goal: compile an etymological database of Proto-Burmish
Procedure: BED as litmus test for CALC:
• make the Proto-Burmish Etymological Database project a first test for the CALC framework
• use existing computational methods to pre-analyze the data
• develop interfaces to allow for correction and inspection by the experts

Where are you now with BED?
⊠ Develop computer-assisted workflows to create and curate data in human- and machine-readable form.
⊠ Develop computer-assisted workflow for partial cognate detection and alignments.
⊟ Develop methods for automatic phonological reconstruction and workflows for the correction of the results by experts (→ THIS TALK).
□ Develop methods for lexical reconstruction (first ideas, not shown in this talk).
□ Write a data-driven etymological dictionary (table of contents is half-written).

What is phonological reconstruction?
• Phonological reconstruction is primarily understood as the reconstruction of the sound system of a language not reflected in written sources.
• More specifically, however, we see phonological reconstruction as the task of reconstructing major patterns of sound change which allow us to reconstruct tentative proto-forms from cognate sets, regardless of whether those words were really present in the Ursprache or what those forms meant.
• The task of lexical reconstruction follows phonological reconstruction in projecting full lexemes to the Ursprache, thereby also assessing their meaning, and whether it is reasonable to reconstruct them at all.

Classical Workflow (Comparative Method)
• assemble cognate sets and sound correspondences by comparing data on different languages
• infer sound change processes (“sound laws”) from the inferred sound correspondence patterns
• explain exceptions by:
  • refining inferred sound change processes (cf. “Verner’s law”)
  • borrowing (“substratum influence”)
  • analogy (leftovers)

General Workflow
Preliminary steps:
• partial cognate detection and partial phonetic alignment (List et al. 2016) with manual refinement
• preliminary identification of cross-semantic cognates based on partial colexifications (Hill and List forthcoming)
Phonological reconstruction:
• sound correspondence pattern identification (List et al. in prep.)
• automatic reconstruction using weighted directed networks (→ this talk)

Detailed Workflow: Preliminary Steps
[Figure: partial cognate detection, panels A to D. Languages: Fúzhōu, Měixiàn, Wēnzhōu, Běijīng. Morphemes compared: ŋuoʔ⁵, ŋiat⁵, kuoŋ⁴⁴, ȵy²¹, kuɔ³⁵, vai¹³, yɛ⁵¹, liɑŋ¹, with a matrix of pairwise distances between them.]
Partial cognate detection: List, Lopez, and Bapteste (2016)
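The general idea behind the figure (pairwise distances between morphemes, followed by clustering) can be sketched as follows. This is a deliberately simplified stand-in and not the method of List, Lopez, and Bapteste (2016): it assumes plain normalized edit distance over pre-segmented morphemes, ignores tones, and groups morphemes by single linkage under a fixed threshold. The function names and the threshold of 0.5 are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_morphemes(morphemes, threshold=0.5):
    """Single-linkage grouping: link two morphemes if distance <= threshold."""
    parent = {key: key for key in morphemes}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for a, b in combinations(morphemes, 2):
        if normalized_distance(morphemes[a], morphemes[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for key in morphemes:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


# Three morphemes from the figure, segmented by hand and without tones
# (an assumption made for this illustration only).
morphemes = {
    "kuoŋ": ["k", "u", "o", "ŋ"],
    "kuɔ": ["k", "u", "ɔ"],
    "vai": ["v", "a", "i"],
}
clusters = cluster_morphemes(morphemes)
```

Plain edit distance is a crude measure: it treats every sound substitution as equally costly, so a realistic workflow would replace `normalized_distance` with a linguistically informed distance and would leave the final decision to manual refinement.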

Detailed Workflow: Preliminary Steps
EDICTOR tool: List (2017)

Detailed Workflow: Preliminary Steps
Language   'mountain'   'dog'            'thunder'         'wolf'                  'bear (n.)'
Atsi       pum⁵¹        kʰui²¹           mau²¹ mjiŋ⁵¹      vam⁵¹ kʰui²¹ mo⁵⁵       vam⁵¹
           mountain     dog              sky + thunder     bear + dog + m-suff.    bear
Bola       pam⁵⁵        kʰui³⁵           mau³¹ mjaŋ⁵⁵      mjaŋ⁵⁵ kʰui³⁵           vɛ⁵ ⁵⁵
           mountain     dog              sky + thunder     thunder + dog           bear
Lashi      pɔm³¹        kʰui⁵⁵           mou³³ kɔm³³       wɔm³¹ kʰui⁵⁵            wɔm³¹
           mountain     dog              sky + thunderB    bear + dog              bear
Maru       pam³¹        lə¹ ³¹ kʰa³⁵     muk⁵⁵ kum³¹       mjaŋ³¹ kʰa³⁵            vɛ⁵ ³¹
           mountain     ? + dog          sky + thunderB    thunder + dog           bear
Achang     pum⁵⁵        xui³¹            mau³¹ ʐau³¹       pum⁵⁵ xui³¹             ɔm⁵⁵
           mountain     dog              sky + thunderC    mountain + dog          bear
Morpheme Annotation (Hill and List forthc.)
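To illustrate how morpheme glosses of this kind expose cross-semantic cognates, the sketch below stores a few of the annotated forms and lists every morpheme that recurs in words for more than one concept within a language. The data structure and the function name `cross_semantic_morphemes` are assumptions made for this example; they are not the format used in the actual database.

```python
from collections import defaultdict

# (language, concept) -> list of (morpheme form, morpheme gloss),
# copied from the Atsi and Achang rows of the annotation table.
annotated = {
    ("Atsi", "dog"): [("kʰui²¹", "dog")],
    ("Atsi", "wolf"): [("vam⁵¹", "bear"), ("kʰui²¹", "dog"), ("mo⁵⁵", "m-suff.")],
    ("Atsi", "bear (n.)"): [("vam⁵¹", "bear")],
    ("Achang", "mountain"): [("pum⁵⁵", "mountain")],
    ("Achang", "dog"): [("xui³¹", "dog")],
    ("Achang", "wolf"): [("pum⁵⁵", "mountain"), ("xui³¹", "dog")],
}


def cross_semantic_morphemes(annotated):
    """Return morphemes that occur in the words for more than one concept."""
    concepts = defaultdict(set)
    for (language, concept), morphemes in annotated.items():
        for form, gloss in morphemes:
            concepts[(language, gloss, form)].add(concept)
    return {key: sorted(found) for key, found in concepts.items() if len(found) > 1}


shared = cross_semantic_morphemes(annotated)
for (language, gloss, form), found in sorted(shared.items()):
    print(language, form, f"'{gloss}'", "occurs in", found)
```

Keying on the gloss as well as the form keeps accidental homophones apart; in practice the glosses themselves are the expert's judgment and remain open to revision.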

Detailed Workflow: Sound Correspondence Pattern Inference
Clackson (2007: 37)

Detailed Workflow: Sound Correspondence Pattern Inference
Sound Correspondence Patterns and Phonological Reconstruction:
• the most traditional way to reconstruct in the comparative-method framework is to infer patterns of regular sound correspondences across a set of languages and then assign proto-forms for each distinct pattern
• correspondence patterns are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds)
• the main problem of correspondence pattern identification is the handling of missing data, since not all cognate sets will necessarily contain reflexes from each of the languages under investigation

Detailed Workflow: Sound Correspondence Pattern Inference
Cognate Set L1 L2 L3 L4 L5 L6 L7 L8
“hand-1” p p p f f p
“foot-1” p p p p f f p p
⊠ compatible □ incompatible
Cognate Set L1 L2 L3 L4 L5 L6 L7 L8
“hand-1” p p p f f p
“leg-1” p p f pf f f p p
□ compatible ⊠ incompatible
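The compatibility test illustrated above can be sketched in a few lines. The sketch assumes that two correspondence sets are compatible when they agree in every language for which both have a reflex, and that they must share at least one such language; both conditions, the names `compatible` and `merge`, and the five-language example data are assumptions made for illustration and are not taken from the slides.

```python
MISSING = None  # marks a language without a reflex in the cognate set


def compatible(pattern_a, pattern_b):
    """True if the patterns agree wherever both have a reflex."""
    shared = [(a, b) for a, b in zip(pattern_a, pattern_b)
              if a is not MISSING and b is not MISSING]
    # Require at least one language attested in both sets (an assumption).
    return bool(shared) and all(a == b for a, b in shared)


def merge(pattern_a, pattern_b):
    """Fill the gaps of one pattern with the reflexes of a compatible one."""
    if not compatible(pattern_a, pattern_b):
        raise ValueError("patterns are incompatible")
    return [a if a is not MISSING else b for a, b in zip(pattern_a, pattern_b)]


# Invented reflexes for five hypothetical languages.
set_a = ["p", MISSING, "p", "f", MISSING]
set_b = ["p", "p", "p", "f", "p"]
set_c = ["p", "p", "f", "f", "p"]

print(compatible(set_a, set_b))  # agree in all three shared languages
print(compatible(set_a, set_c))  # disagree in the third language
print(merge(set_a, set_b))
```

Merging compatible sets is what lets a pattern with gaps borrow evidence from a fuller one, which is why the treatment of missing data dominates the difficulty of the task.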

Detailed Workflow: Sound Correspondence Pattern Inference
[Figure: grid of reflexes for the inferred sound correspondence patterns; the reflexes shown include s, ʃ, x, ɣ, k, kʰ, t, ts, tʃ, and n, with “-” marking gaps.]

General Findings
Basic Statistics:
• 8 languages
• 240 concepts
• 855 partial cognate sets
• 728 cross-semantic partial cognate sets
• 218 valid cognate sets (with more than two reflexes)
• 104 initial consonant patterns (48 with more than one reflex, the rest highly irregular)
• well-reconstructed proto-sounds:
  • stops and affricates: *k, *kʰ, *t, *tʰ, *tʃ, *tʃʰ, *ts, *tsʰ, *p, *pʰ
  • fricatives: *s, *ʃ, *x
  • liquids and j: *r, *l, *j
  • nasals: *ŋ, *n, *m

Specific Findings: “black” and “dark”
Language        black         dark
Old Burmese     n a k -       ∅
Rangoon         n ɛ ʔ ⁴       ∅
Achang          l ɔ k ⁵⁵      ∅
Xiandao         n ɔ ʔ ⁵⁵      ∅
Atsi            n o ʔ ²¹      n o ʔ ²¹
Bola            n a ʔ ³¹      n a ʔ ³¹
Lashi           n ɔː ʔ ³¹     ∅
Maru            n ɔ ʔ ³¹      n ɔ ʔ ³¹
Proto-Burmish   n *a k ³¹     *n *a|*ṵ ʔ|*k ³¹
[i] The discrepancy between the reconstructions of these two forms, which were regularly recognized as cognate in all languages, is due to a shortcoming of the prototype: it reconstructs each correspondence set independently, rather than summarizing all possible reflexes for accepted cross-semantic cognate sets.
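One possible remedy for the shortcoming described in the note is to pool the reflexes of accepted cross-semantic cognate sets into a single reflex table before any reconstruction step is run. The sketch below shows only this pooling step, using the forms from the table; the function name `pool_reflexes` and the conflict handling are assumptions, and the reconstruction itself is not implemented here.

```python
def pool_reflexes(cognate_sets):
    """Merge several cognate sets of one morpheme into one reflex per language.

    Returns the pooled reflexes and a list of conflicts, i.e. languages
    whose reflexes differ between the sets and need expert inspection.
    """
    pooled, conflicts = {}, []
    for reflexes in cognate_sets.values():
        for language, form in reflexes.items():
            if form is None:  # no reflex attested in this set
                continue
            if language in pooled and pooled[language] != form:
                conflicts.append((language, pooled[language], form))
            pooled.setdefault(language, form)
    return pooled, conflicts


# Forms as given in the table; None stands for ∅.
cognate_sets = {
    "black": {
        "Old Burmese": "n a k -", "Rangoon": "n ɛ ʔ ⁴", "Achang": "l ɔ k ⁵⁵",
        "Xiandao": "n ɔ ʔ ⁵⁵", "Atsi": "n o ʔ ²¹", "Bola": "n a ʔ ³¹",
        "Lashi": "n ɔː ʔ ³¹", "Maru": "n ɔ ʔ ³¹",
    },
    "dark": {
        "Old Burmese": None, "Rangoon": None, "Achang": None, "Xiandao": None,
        "Atsi": "n o ʔ ²¹", "Bola": "n a ʔ ³¹", "Lashi": None, "Maru": "n ɔ ʔ ³¹",
    },
}

pooled, conflicts = pool_reflexes(cognate_sets)
print(len(pooled), "languages pooled;", len(conflicts), "conflicts")
```

With the pooled table as input, "black" and "dark" would receive one reconstruction based on all eight languages instead of two reconstructions based on eight and three.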

Specific Findings: “middle” and “outside”
Language        middle        outside (“out-middle”)
Old Burmese     ∅             ∅
Rangoon         ∅             ∅
Achang          k u - ŋ ⁵⁵    ∅
Xiandao         k o - ŋ ⁵⁵    ∅
Atsi            k u - ŋ ²¹    ∅
Bola            k a u ŋ ³¹    k a u ŋ ³¹
Lashi           k u - ŋ ³¹    ∅
Maru            k a u ŋ ³⁵    k a u ŋ ³⁵
Proto-Burmish   *k u - *ŋ ⁵⁵  ∅
[i] The algorithm refuses to reconstruct the morpheme “middle” in the word “outside”, as it occurs only twice there. If the prototype, again, reconstructed only once per cross-semantic cognate set, the results would be the same.
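The refusal described in the note amounts to a minimum-evidence filter. The sketch below assumes a threshold of three attested reflexes, in line with the "more than two reflexes" criterion for valid cognate sets in the basic statistics; the constant `MIN_REFLEXES` and the function names are hypothetical, and the prototype may apply its threshold differently.

```python
MIN_REFLEXES = 3  # assumption: "more than two reflexes" counts as valid


def attested(reflexes):
    """Keep only the languages that actually have a reflex (None stands for ∅)."""
    return {language: form for language, form in reflexes.items() if form is not None}


def can_reconstruct(reflexes, min_reflexes=MIN_REFLEXES):
    """True if enough languages attest the morpheme to attempt a reconstruction."""
    return len(attested(reflexes)) >= min_reflexes


# Forms as given in the table.
middle = {
    "Old Burmese": None, "Rangoon": None, "Achang": "k u - ŋ ⁵⁵",
    "Xiandao": "k o - ŋ ⁵⁵", "Atsi": "k u - ŋ ²¹", "Bola": "k a u ŋ ³¹",
    "Lashi": "k u - ŋ ³¹", "Maru": "k a u ŋ ³⁵",
}
outside = {
    "Old Burmese": None, "Rangoon": None, "Achang": None, "Xiandao": None,
    "Atsi": None, "Bola": "k a u ŋ ³¹", "Lashi": None, "Maru": "k a u ŋ ³⁵",
}

print("middle:", can_reconstruct(middle))    # six reflexes
print("outside:", can_reconstruct(outside))  # two reflexes
```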

We are only beginning to explore the potential of sound correspondence pattern analysis as a backbone for automatic linguistic reconstruction. Even now, it is straightforward to manually annotate all different sound correspondence patterns which we could infer from the data. To test the full potential of the approach, we will have to drastically increase the number of lexical items in our data, but even in this state, the approach is promising, and can serve as a starting point for a classical phonological reconstruction analysis.

