Burmish Languages Johann-Mattis List¹ and Nathan W. Hill² 2017-07-21 ¹ Max Planck Institute for the Science of Human History, ² SOAS, University of London
methods in historical linguistics have received much attention of late,
• but only a few (if any) of the new methods have addressed long-standing problems of classical linguistics,
• and as a result, many classical linguists are very sceptical of the new approaches.
• Instead of blaming computers for our misery (no funding, institutes being shut down, etc.), we should start seeing computers as a chance to address the important questions which we have not solved in 200 years of research...
• But we don’t need a framework in which computers do our work for us; instead, we need a framework in which we tell computers to do some work for us, in order to render our research more explicit, more efficient, and more rigorous.
the classical methods for historical language comparison
• establish a close collaboration between computational and classical historical linguistics by providing data in human- and machine-readable form
ASIA (SOAS, London)
• reconstruction of Proto-Burmish
Proto-Burmish have many problems (no lexical reconstruction, insufficient phonological reconstruction, unclear data, non-transparent methodology)
goal: compile an etymological database of Proto-Burmish
procedure: BED as litmus test for CALC:
• make the Proto-Burmish Etymological Database project a first test for the CALC framework
• use existing computational methods to pre-analyze the data
• develop interfaces to allow for correction and inspection by the experts
to create and curate data in human- and machine-readable form.
⊠ Develop computer-assisted workflow for partial cognate detection and alignments.
⊟ Develop methods for automatic phonological reconstruction and workflows for the correction of the results by experts (→ THIS TALK).
□ Develop methods for lexical reconstruction (first ideas, not shown in this talk).
□ Write a data-driven etymological dictionary (table of contents is half-written).
as the reconstruction of the sound system of a language not reflected in written sources.
• More specifically, however, we see phonological reconstruction as the task of reconstructing major patterns of sound change which allow us to reconstruct tentative proto-forms from cognate sets, regardless of whether those words were really present in the Ursprache or what those forms meant.
• The task of lexical reconstruction follows phonological reconstruction in projecting full lexemes to the Ursprache, thereby also assessing their meaning, and whether it is reasonable to reconstruct them at all.
correspondences by comparing data on different languages
• infer sound change processes (“sound laws”) from the inferred sound correspondence patterns
• explain exceptions by:
model sound change as a systemic process (each column of an alignment is counted independently)
(b) fail to make use of linguistic knowledge on the directionality of sound change processes and have to rely on phylogenies
(c) fail to handle unattested sounds, as only sounds which are in the data can be reconstructed
phonetic alignment (List et al. 2016) with manual refinement
• preliminary identification of cross-semantic cognates based on partial colexifications (Hill and List forthcoming)
Phonological reconstruction:
• sound correspondence pattern identification (List et al. in prep.)
• automatic reconstruction using weighted directed networks (→ this talk)
Phonological Reconstruction:
• the most traditional way to reconstruct in the comparative-method framework is to infer patterns of regular sound correspondences across a set of languages and then assign proto-forms for each distinct pattern
• correspondence patterns are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds)
• the main problem of correspondence pattern identification is the handling of missing data, since not all cognate sets will necessarily contain reflexes from each of the languages under investigation
Sets:
• the main idea for the correspondence pattern inference algorithm is to derive a graph from correspondence sets in which each individual correspondence set (a site in an aligned cognate set) is a node, and links are drawn between compatible correspondence sets
• two correspondence sets are compatible if they have identical non-missing values for at least one language and no conflicting data for any of the languages
• if two or more correspondence sets are compatible, we can impute missing values by combining them
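The compatibility criterion and the imputation step can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the project: the names `compatible` and `merge`, the use of `None` for a missing reflex, and the gap positions in the toy data are all assumptions.

```python
from typing import List, Optional, Sequence

# Assumption: a correspondence set (alignment site) is a sequence with one
# slot per language; None marks a language without a reflex.
Site = Sequence[Optional[str]]


def compatible(a: Site, b: Site) -> bool:
    """True if the sites agree on at least one language and conflict on none."""
    shared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # missing data never counts as a conflict
        if x != y:
            return False  # conflicting reflexes in the same language
        shared += 1
    return shared >= 1


def merge(a: Site, b: Site) -> List[Optional[str]]:
    """Impute missing values by combining two compatible sites."""
    if not compatible(a, b):
        raise ValueError("sites conflict and cannot be combined")
    return [x if x is not None else y for x, y in zip(a, b)]


# Toy data for eight languages; the gap positions are invented.
hand = ["p", None, "p", "p", "f", "f", None, "p"]
foot = ["p", "p", "p", "p", "f", "f", "p", "p"]
leg = ["p", "p", "f", "pf", "f", "f", "p", "p"]

print(compatible(hand, foot), compatible(hand, leg))
print(merge(hand, foot))
```

Requiring at least one shared language keeps two sites with non-overlapping data from being linked purely because nothing contradicts them.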
[Figure: two pairwise comparisons of correspondence sets across languages L1 to L8. “hand-1” (p p p f f p, with gaps) against “foot-1” (p p p p f f p p): compatible. “hand-1” (p p p f f p, with gaps) against “leg-1” (p p f pf f f p p): incompatible.]
[Figure: correspondence sets from the data, one row per alignment site, involving s, ʃ, x, k, kʰ, t, ts, and tʃ]
as a Clique Cover Problem:
• The clique cover problem (also called clique partitioning problem, see Bhasker 1991) is the inverse of the famous graph coloring problem and has been shown to be NP-hard.
• The goal of the problem is to split a graph into the smallest number of cliques such that each node belongs to exactly one clique.
• We assume (but we cannot formally prove it) that the clique cover of our graph of compatible correspondence sets will correspond to the optimal set of sound correspondence patterns in our data.
• By applying an approximation algorithm to infer a near-optimal clique cover of our data of aligned cognate sets, we can infer the most frequently recurring correspondence patterns in our data.
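One simple way to approximate a clique cover is a greedy heuristic, sketched below. The slides do not say which approximation algorithm is used, so this is an assumed stand-in for illustration only; `greedy_clique_cover`, the `None` marker for missing reflexes, and the toy sites are all hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """At least one shared, identical reflex and no conflicting reflex."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return bool(pairs) and all(x == y for x, y in pairs)


def greedy_clique_cover(sites):
    """Partition the compatibility graph into cliques with a greedy heuristic."""
    n = len(sites)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(sites[i], sites[j]):
            adj[i].add(j)
            adj[j].add(i)
    unassigned = set(range(n))
    cliques = []
    # Seed cliques from the best-connected nodes first.
    for seed in sorted(range(n), key=lambda i: (-len(adj[i]), i)):
        if seed not in unassigned:
            continue
        clique = [seed]
        unassigned.discard(seed)
        for cand in sorted(adj[seed] & unassigned):
            # A candidate joins only if it is linked to every current member.
            if all(cand in adj[m] for m in clique):
                clique.append(cand)
                unassigned.discard(cand)
        cliques.append(sorted(clique))
    return cliques


# Toy alignment sites for four languages (invented data).
sites = [
    ["p", "p", None, "f"],
    ["p", None, "p", "f"],
    [None, "p", "p", "f"],
    ["k", "k", "k", None],
    ["k", None, "k", "x"],
    ["p", "p", "f", "f"],
]
print(greedy_clique_cover(sites))
```

Each resulting clique is a candidate correspondence pattern. A greedy pass gives no guarantee of a minimal cover, which is consistent with treating the result as near-optimal and leaving it open to expert correction.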
Phonological reconstruction in the comparative-method framework usually starts from correspondence patterns.
• Apart from very few exceptions, it does not require the knowledge of any specific phylogeny for the language family under investigation (at least not for most consonants).
• What it requires, however, is to know the major sound change transitions, which have strong directional preferences for consonants (much less for vowels and tones).
• We can use this knowledge as a proxy to select which of the sounds in a given correspondence pattern is the best candidate for the proto-sound.
[Figure: sound-change graph whose nodes are the vowels, nasals, fricatives, liquids, affricates, and stops attested in the data]
sub-graph from the sound-change graph for each distinct sound in a given correspondence pattern,
2. search for a potential source in the sub-graph, i.e., a sound that has no ancestor,
3. if
   • there is a source, select it as proto-form,
   • there are multiple sources, select all as proto-form,
   • the graph is disconnected or no source can be found (loops in the graph), select the most frequently recurring form as a potential proto-form (“majority rules”),
4. label the “quality” of the respective proto-form, specifically marking correspondence patterns which occur only one time in the data,
5. have the expert clean up the mess.
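The selection procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: the graph `CHANGES` is a made-up, unweighted toy graph, and `reconstruct`, `is_connected`, and the label strings are hypothetical names.

```python
from collections import Counter

# Hypothetical directed sound-change graph: sound -> sounds it may change into.
CHANGES = {
    "p": {"f", "pf"},
    "pf": {"f"},
    "k": {"x", "ʔ"},
    "t": {"ʔ"},
    "s": {"ʃ"},
    "ʃ": {"s"},  # toy loop, so neither sound is a source
}


def is_connected(nodes, edges):
    """Check weak connectivity, ignoring edge direction."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return seen == set(nodes)


def reconstruct(pattern, occurrences=1, changes=CHANGES):
    """Return (candidate proto-sounds, rule applied, quality label)."""
    sounds = [s for s in pattern if s is not None]
    distinct = set(sounds)
    # Sub-graph induced by the sounds attested in this pattern.
    edges = {(a, b) for a in distinct for b in changes.get(a, ()) if b in distinct}
    # Sources are sounds without an ancestor in the sub-graph.
    has_ancestor = {b for _, b in edges}
    sources = sorted(distinct - has_ancestor)
    quality = "singleton" if occurrences == 1 else "recurring"
    if sources and is_connected(distinct, edges):
        rule = "source" if len(sources) == 1 else "multiple sources"
        return sources, rule, quality
    # Disconnected graph or loop: fall back to the most frequent sound.
    majority = Counter(sounds).most_common(1)[0][0]
    return [majority], "majority", quality


print(reconstruct(["p", "p", "f", "pf", "f", "f", "p", "p"], occurrences=3))
print(reconstruct(["s", "ʃ", "s", "s"]))
print(reconstruct(["k", "t", "t"]))
```

The returned rule and quality labels are meant to support the final step, where the expert reviews the proposals. Edge weights, which the talk mentions for the networks, are left out of this sketch.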
systemic aspects of sound change are integrated into the correspondence pattern detection algorithm
(b) ⊠ linguistic knowledge (even language-specific knowledge) is exhaustively used to construct the sound-change networks
(c) □ unattested sounds need to be manually handled by assigning them to specific correspondence patterns
                 n a k -         ∅
Rangoon          n ɛ ʔ ⁴         ∅
Achang           l ɔ k ⁵⁵        ∅
Xiandao          n ɔ ʔ ⁵⁵        ∅
Atsi             n o ʔ ²¹        n o ʔ ²¹
Bola             n a ʔ ³¹        n a ʔ ³¹
Lashi            n ɔː ʔ ³¹       ∅
Maru             n ɔ ʔ ³¹        n ɔ ʔ ³¹
Proto-Burmish    n *a k ³¹       *n *a|*ṵ ʔ|*k ³¹

[i] The discrepancy in the reconstructions for these two forms, which were regularly recognized as cognate in all languages, is due to the insufficient reconstruction by the prototype, which takes each correspondence set independently rather than summarizing all possible reflexes for accepted cross-semantic cognate sets.
Burmese          ∅               ∅
Rangoon          ∅               ∅
Achang           k u - ŋ ⁵⁵      ∅
Xiandao          k o - ŋ ⁵⁵      ∅
Atsi             k u - ŋ ²¹      ∅
Bola             k a u ŋ ³¹      k a u ŋ ³¹
Lashi            k u - ŋ ³¹      ∅
Maru             k a u ŋ ³⁵      k a u ŋ ³⁵
Proto-Burmish    *k u - *ŋ ⁵⁵    ∅

[i] The algorithm refuses to reconstruct the morpheme “middle” in the word “outside”, as it only occurs two times. If the prototype reconstructed only once per cross-semantic cognate set, the results would be the same.
                 s a ts -        s a ts -
Rangoon          tθ i ʔ ⁴        tθ i ʔ ⁴
Achang           s a ŋ ³¹        ʂ ə k ⁵⁵
Xiandao          ʂ ɯ k ⁵⁵        ∅
Atsi             s i k ⁵⁵        s i k ⁵⁵
Bola             s a k ⁵⁵        s a k ⁵⁵
Lashi            s ə̰ k ⁵⁵        s ə̰ k ⁵⁵
Maru             s a̰ k ⁵⁵        s a̰ k ⁵⁵
Proto-Burmish    s a k ⁵⁵        *s ə̰ *k ⁵⁵

[i] Apart from the vowel, which is marked as irregular in the reconstruction for “wood”, this reconstruction is regular and also easy to compare with other Sino-Tibetan reflexes. The reconstruction for “tree”, on the other hand, is irregular, due to the wrong cognate assignment for Achang.
correspondence pattern analysis as a backbone for automatic linguistic reconstruction. Even now, it is straightforward to manually annotate all different sound correspondence patterns which we could infer from the data. To test the full potential of the approach, we will have to drastically increase the number of lexical items in our data, but even in this state, the approach is promising, and can serve as a starting point for a classical phonological reconstruction analysis.