Computer-Assisted Approaches to Linguistic Reconstruction: A Case Study from the Burmish Languages
Johann-Mattis List¹ and Nathan W. Hill²
2017-07-21
¹ Max Planck Institute for the Science of Human History, ² SOAS, University of London
Etymology 2.0? Historical linguistics after the quantitative turn:
• Quantitative methods in historical linguistics have received much attention of late,
• but only a few (if any) of the new methods have addressed long-standing problems of classical linguistics,
• and as a result, many classical linguists are very sceptical of the new approaches.
Etymology 3.0? Towards a “qualitative turn” in computational historical linguistics:
• Instead of blaming computers for our misery (no funding, institutes being shut down, etc.), we should start seeing computers as a chance to address the important questions which we have not solved in 200 years of research...
• But we don’t need a framework in which computers do our work for us; instead, we need a framework in which we tell computers to do some work for us, in order to render our research more explicit, more efficient, and more rigorous.
Computer-Assisted Language Comparison
CALC (MPI-SHH, Jena)
• computational formalization of the classical methods for historical language comparison
• establish a close collaboration between computational and classical historical linguistics by providing data in human- and machine-readable form
ASIA (SOAS, London)
• reconstruction of Proto-Burmish
The Burmish Etymological Database (BED)
problem: current etymological accounts of Proto-Burmish have many problems (no lexical reconstruction, insufficient phonological reconstruction, unclear data, non-transparent methodology)
goal: compile an etymological database of Proto-Burmish
procedure: BED as litmus test for CALC:
• make the Proto-Burmish Etymological Database project a first test for the CALC framework
• use existing computational methods to pre-analyze the data
• develop interfaces to allow for correction and inspection by the experts
Where are you now with BED?
⊠ Develop computer-assisted workflows to create and curate data in human- and machine-readable form.
⊠ Develop computer-assisted workflow for partial cognate detection and alignments.
⊟ Develop methods for automatic phonological reconstruction and workflows for the correction of the results by experts (→ THIS TALK).
□ Develop methods for lexical reconstruction (first ideas, not shown in this talk).
□ Write a data-driven etymological dictionary (table of contents is half-written).
What is phonological reconstruction?
• Phonological reconstruction is primarily understood as the reconstruction of the sound system of a language not reflected in written sources.
• More specifically, however, we see phonological reconstruction as the task of reconstructing major patterns of sound change which allow us to reconstruct tentative proto-forms from cognate sets, regardless of whether those words were really present in the Ursprache or what those forms meant.
• The task of lexical reconstruction follows phonological reconstruction in projecting full lexemes to the Ursprache, thereby also assessing their meaning, and whether it is reasonable to reconstruct them at all.
Classical Workflow (Comparative Method)
• assemble cognate sets and sound correspondences by comparing data on different languages
• infer sound change processes (“sound laws”) from the inferred sound correspondence patterns
• explain exceptions by:
Computer-Based Automatic Approaches
Problems of computer-based approaches:
(a) fail to model sound change as a systemic process (each column of an alignment is counted independently)
(b) fail to make use of linguistic knowledge on the directionality of sound change processes and have to rely on phylogenies
(c) fail to handle unattested sounds, as only sounds which are in the data can be reconstructed
General Workflow
Preliminary steps:
• partial cognate detection and partial phonetic alignment (List et al. 2016) with manual refinement
• preliminary identification of cross-semantic cognates based on partial colexifications (Hill and List forthcoming)
Phonological reconstruction:
• sound correspondence pattern identification (List et al. in prep.)
• automatic reconstruction using weighted directed networks (→ this talk)
Detailed Workflow: Sound Correspondence Pattern Inference
Sound Correspondence Patterns and Phonological Reconstruction:
• the most traditional way to reconstruct in the comparative-method framework is to infer patterns of regular sound correspondences across a set of languages and then assign proto-forms for each distinct pattern
• correspondence patterns are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds)
• the main problem of correspondence pattern identification is the handling of missing data, since not all cognate sets will necessarily contain reflexes from each of the languages under investigation
Detailed Workflow: Sound Correspondence Pattern Inference
Graphs of Compatible Correspondence Sets:
• the main idea for the correspondence pattern inference algorithm is to derive a graph from correspondence sets in which each individual correspondence set (a site in an aligned cognate set) is a node, and links between nodes are drawn between compatible correspondence sets
• if two correspondence sets are compatible, this means that they have identical non-missing values for at least one language and no conflicting data for any of the languages
• if two or more correspondence sets are compatible, we can impute missing values by combining them
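The compatibility criterion and the imputation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the names `compatible` and `merge`, the use of `None` for a missing reflex, and the positions of the gaps in the hypothetical “hand-1” row are all assumptions made for the example.

```python
from typing import List, Optional, Sequence

# A correspondence set: one sound per language, None where the reflex is missing.
Pattern = Sequence[Optional[str]]


def compatible(a: Pattern, b: Pattern) -> bool:
    """True if a and b share at least one identical non-missing value
    and never conflict in any language."""
    shared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # missing data can neither confirm nor contradict
        if x != y:
            return False  # conflicting reflexes in the same language
        shared += 1
    return shared >= 1


def merge(a: Pattern, b: Pattern) -> List[Optional[str]]:
    """Combine two compatible correspondence sets, imputing missing values."""
    if not compatible(a, b):
        raise ValueError("cannot merge incompatible correspondence sets")
    return [x if x is not None else y for x, y in zip(a, b)]


# Hypothetical data for eight languages; the gap positions are assumed.
hand = ["p", None, "p", "p", "f", "f", None, "p"]
foot = ["p", "p", "p", "p", "f", "f", "p", "p"]
leg = ["p", "p", "f", "pf", "f", "f", "p", "p"]

print(compatible(hand, foot))  # no conflict, several shared values
print(compatible(hand, leg))   # conflict in the third and fourth language
print(merge(hand, foot))       # gaps in "hand" filled from "foot"
```

Requiring at least one shared value keeps two sets that never overlap in any language from being linked by default.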
Detailed Workflow: Sound Correspondence Pattern Inference
Cognate Set   L1 L2 L3 L4 L5 L6 L7 L8
“hand-1”      p p p f f p
“foot-1”      p p p p f f p p
⊠ compatible □ incompatible
Cognate Set   L1 L2 L3 L4 L5 L6 L7 L8
“hand-1”      p p p f f p
“leg-1”       p p f pf f f p p
□ compatible ⊠ incompatible
Detailed Workflow: Sound Correspondence Pattern Inference
[Figure: network of correspondence sets; node labels are sound segments such as s, k, kʰ, ʃ, x, tʃ, ts, and t]
Detailed Workflow: Sound Correspondence Pattern Inference
Sound Correspondence Pattern Inference as a Clique Cover Problem:
• The clique cover problem (also called clique partitioning problem, see Bhasker 1991) is closely related to the famous graph coloring problem (a clique cover of a graph is a coloring of its complement graph) and has been shown to be NP-hard.
• The goal of the problem is to split a graph into the smallest number of cliques such that each node belongs to exactly one clique.
• We assume (but we cannot formally prove it) that the clique cover of our graph of compatible correspondence sets will correspond to the optimal set of sound correspondence patterns in our data.
• By applying an approximation algorithm to infer a near-optimal clique cover of our data of aligned cognate sets, we can infer the most frequently recurring correspondence patterns in our data.
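To make the clique cover idea concrete, here is a minimal sketch of one simple greedy heuristic. It is not necessarily the approximation algorithm used in the project, and it gives no optimality guarantee; the function name `greedy_clique_cover` and the toy graph are assumptions for illustration. Nodes stand for correspondence sets and edges for pairwise compatibility.

```python
from typing import Dict, Iterable, List, Set, Tuple


def greedy_clique_cover(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Partition the nodes into cliques with a simple greedy heuristic.

    Nodes are visited from highest to lowest degree; each node joins the
    first existing clique whose members are all its neighbours, or opens
    a new clique if no such clique exists.
    """
    nodes = list(nodes)
    adj: Dict[str, Set[str]] = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    cliques: List[List[str]] = []
    # Ties in degree are broken alphabetically to keep the result deterministic.
    for n in sorted(nodes, key=lambda n: (-len(adj[n]), n)):
        for clique in cliques:
            if all(member in adj[n] for member in clique):
                clique.append(n)
                break
        else:
            cliques.append([n])
    return cliques


# Toy graph: a, b, c are mutually compatible; d and e are compatible with each other.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
cover = greedy_clique_cover(nodes, edges)
print(cover)
```

Each resulting clique would then be read as one candidate sound correspondence pattern, to be inspected by the expert.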
Detailed Workflow: Automatic Reconstruction
We can do without trees!
• Phonological reconstruction in the comparative-method framework usually starts from correspondence patterns.
• Apart from very few exceptions, it does not require the knowledge of any specific phylogeny for the language family under investigation (at least not for most consonants).
• What it requires, however, is to know the major sound change transitions, which have strong directional preferences for consonants (much less for vowels and tones).
• We can use this knowledge as a proxy to select which of the sounds in a given correspondence pattern is the best candidate for the proto-sound.
Detailed Workflow: Automatic Reconstruction
Automatic Reconstruction Strategy:
1. extract the sub-graph from the sound-change graph for each distinct sound in a given correspondence pattern,
2. search for a potential source in the sub-graph, i.e., a sound that has no ancestor,
3. if
   • there is a source, select it as proto-form,
   • there are multiple sources, select all as proto-form,
   • the graph is disconnected or no source can be found (loops in the graph), select the most frequently recurring form as a potential proto-form (“majority rules”),
4. label the “quality” of the respective proto-form, specifically marking correspondence patterns which occur only one time in the data,
5. have the expert clean up the mess.
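Steps 1 to 4 of the strategy can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: edge weights are ignored, the sound-change graph is a plain set of hypothetical `(source, target)` pairs, and the function names and quality labels (`"source"`, `"majority"`, `"-singleton"`) are invented for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (earlier sound, later sound)


def weakly_connected(sounds: Set[str], edges: Set[Edge]) -> bool:
    """True if the sounds form a single component when direction is ignored."""
    if not sounds:
        return True
    neighbours = {s: set() for s in sounds}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(sounds))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == sounds


def reconstruct(
    pattern: Sequence[Optional[str]], sound_changes: Set[Edge], frequency: int
) -> Tuple[List[str], str]:
    """Propose proto-sound candidates for one correspondence pattern."""
    reflexes = [s for s in pattern if s is not None]
    if not reflexes:
        raise ValueError("pattern contains no reflexes")
    sounds = set(reflexes)
    # Step 1: sub-graph induced by the sounds attested in the pattern.
    edges = {(a, b) for a, b in sound_changes
             if a in sounds and b in sounds and a != b}
    # Step 2: sources are sounds without an ancestor in the sub-graph.
    sources = sorted(sounds - {b for _, b in edges})
    # Step 3: take the source(s), or fall back to "majority rules".
    if sources and weakly_connected(sounds, edges):
        proto, how = sources, "source"
    else:
        counts = Counter(reflexes)
        best = max(counts.values())
        proto = sorted(s for s, c in counts.items() if c == best)
        how = "majority"
    # Step 4: flag patterns that occur only once in the data.
    quality = how + ("-singleton" if frequency == 1 else "")
    return proto, quality


# Hypothetical directed sound changes (assumed for the example).
changes = {("p", "f"), ("p", "pf"), ("pf", "f")}
print(reconstruct(["p", "p", "p", "p", "f", "f", "p", "p"], changes, frequency=3))
```

The output is a list of candidates plus a label, so that step 5, the correction by the expert, can start from the cases flagged as majority decisions or singletons.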
Detailed Workflow: Automatic Reconstruction
Advantage of the Approach:
(a) ⊠ systemic aspects of sound change are integrated into the correspondence pattern detection algorithm
(b) ⊠ linguistic knowledge (even language-specific knowledge) is exhaustively used to construct the sound-change networks
(c) □ unattested sounds need to be manually handled by assigning them to specific correspondence patterns
Specific Findings: “black” and “dark”
Language        black           dark
Old Burmese     n a k -         ∅
Rangoon         n ɛ ʔ ⁴         ∅
Achang          l ɔ k ⁵⁵        ∅
Xiandao         n ɔ ʔ ⁵⁵        ∅
Atsi            n o ʔ ²¹        n o ʔ ²¹
Bola            n a ʔ ³¹        n a ʔ ³¹
Lashi           n ɔː ʔ ³¹       ∅
Maru            n ɔ ʔ ³¹        n ɔ ʔ ³¹
Proto-Burmish   n *a k ³¹       *n *a|*ṵ ʔ|*k ³¹
[i] The discrepancy in the reconstructions for these two forms, which were regularly recognized as cognate in all languages, is due to the insufficient reconstruction by the prototype, which takes each correspondence set independently rather than summarizing all possible reflexes for accepted cross-semantic cognate sets.
Specific Findings: “middle” and “outside”
Language        middle          outside (“out-middle”)
Old Burmese     ∅               ∅
Rangoon         ∅               ∅
Achang          k u - ŋ ⁵⁵      ∅
Xiandao         k o - ŋ ⁵⁵      ∅
Atsi            k u - ŋ ²¹      ∅
Bola            k a u ŋ ³¹      k a u ŋ ³¹
Lashi           k u - ŋ ³¹      ∅
Maru            k a u ŋ ³⁵      k a u ŋ ³⁵
Proto-Burmish   *k u - *ŋ ⁵⁵    ∅
[i] The algorithm refuses to reconstruct the morpheme “middle” in the word “outside”, as it only occurs twice. If the prototype, again, only reconstructed one time per cross-semantic cognate set, the results would be the same.
Specific Findings: “tree” and “wood”
Language        tree            wood
Old Burmese     s a ts -        s a ts -
Rangoon         tθ i ʔ ⁴        tθ i ʔ ⁴
Achang          s a ŋ ³¹        ʂ ə k ⁵⁵
Xiandao         ʂ ɯ k ⁵⁵        ∅
Atsi            s i k ⁵⁵        s i k ⁵⁵
Bola            s a k ⁵⁵        s a k ⁵⁵
Lashi           s ə̰ k ⁵⁵        s ə̰ k ⁵⁵
Maru            s a̰ k ⁵⁵        s a̰ k ⁵⁵
Proto-Burmish   s a k ⁵⁵        *s ə̰ *k ⁵⁵
[i] Apart from the vowel, which is marked as irregular in the reconstruction for “wood”, this reconstruction is regular and also easy to compare with other Sino-Tibetan reflexes. The reconstruction for “tree”, on the other hand, is irregular, due to the wrong cognate assignment for Achang.
We are only beginning to explore the potential of sound correspondence pattern analysis as a backbone for automatic linguistic reconstruction. Even now, it is straightforward to manually annotate all different sound correspondence patterns which we could infer from the data. To test the full potential of the approach, we will have to drastically increase the number of lexical items in our data, but even in this state, the approach is promising, and can serve as a starting point for a classical phonological reconstruction analysis.