Using networks to infer sound-correspondence patterns across multiple languages
Johann-Mattis List
Research Group “Computer-Assisted Language Comparison”
Department of Linguistic and Cultural Evolution
Max-Planck Institute for the Science of Human History
Jena, Germany
2017-10-24

Comparative Linguistics
"All languages change, as long as they exist." (August Schleicher 1863)
[Figure: walkman and iPod; tree with Germanic branching into German and English]
Indo-European | Germanic | Old English | English
p | f | f | f
ə | a | æ | ɑː
t | d | d | ð
eː | eː | e | ə
r | r | r | r

Historical Language Comparison
Sequences in Biology and Linguistics
Alphabets in Biology and Linguistics
• Biology: universal, limited
• Linguistics: language-specific, widely varying

Homolog Detection
Inferring Homologs
Workflow: Cognate List, Alignment, Correspondences
Language | Concept | Segments
Bola | six | kʰ j a u ʔ ⁵⁵
Maru | six | kʰ j a u k ⁵⁵
Bola | lip | k a̰ u ʔ ⁵⁵
Maru | lip | k a̰ u k ⁵⁵
Bola | man | j a u ʔ ³¹
Maru | man | j a u k ³¹
Correspondences
Bola | Maru | Freq.
a(a̰) | a(a̰) | 3 x
u | u | 3 x
ʔ | k | 3 x
j | j | 2 x
k(ʰ) | k(ʰ) | 2 x
⁵⁵ | ⁵⁵ | 2 x
³¹ | ³¹ | 1 x

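The frequency table on this slide can be reproduced by pairing the Bola and Maru segments position by position. The following is a minimal sketch, not the author's implementation: the names `COGNATES`, `MERGE` and `count_correspondences` are made up for illustration, and merging kʰ with k and a̰ with a is an assumption read off the bracketed notation k(ʰ) and a(a̰).

```python
from collections import Counter

CREAKY_A = "a\u0330"  # "a" with a tilde below (creaky voice), as on the slide

# Segmented word pairs (Bola, Maru) copied from the slide.
COGNATES = {
    "six": (["kʰ", "j", "a", "u", "ʔ", "⁵⁵"], ["kʰ", "j", "a", "u", "k", "⁵⁵"]),
    "lip": (["k", CREAKY_A, "u", "ʔ", "⁵⁵"], ["k", CREAKY_A, "u", "k", "⁵⁵"]),
    "man": (["j", "a", "u", "ʔ", "³¹"], ["j", "a", "u", "k", "³¹"]),
}

# Assumption: aspiration and creaky voice are ignored when counting,
# mirroring the bracketed notation in the slide's table.
MERGE = {"kʰ": "k", CREAKY_A: "a"}


def count_correspondences(cognates, merge=MERGE):
    """Count how often each (Bola, Maru) sound pair occurs at the same position."""
    counts = Counter()
    for bola, maru in cognates.values():
        # Position-wise pairing is only valid if both words have equal length.
        if len(bola) != len(maru):
            raise ValueError("words must be aligned to equal length")
        for b, m in zip(bola, maru):
            counts[(merge.get(b, b), merge.get(m, m))] += 1
    return counts


if __name__ == "__main__":
    for (b, m), freq in count_correspondences(COGNATES).most_common():
        print(f"{b}\t{m}\t{freq} x")
```

Pairing by position works here only because the two words of each cognate set happen to have the same number of segments; in general an alignment step with gap symbols is needed first.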
Correspondence Patterns
Inferring Patterns
Sound | Bola | Maru | Rangoon
d | Ø | Ø | d
t | t | t | t
tʰ | tʰ | tʰ | tʰ
tθ | Ø | Ø | tθ
ts | ts | ts | Ø
tsʰ | tsʰ | tsʰ | Ø
tʃ | tʃ | tʃ | Ø
tʃʰ | tʃʰ | tʃʰ | Ø
s | s | s | s
sʰ | Ø | Ø | sʰ
ɕ | Ø | Ø | ɕ
ʃ | ʃ | ʃ | Ø
Examples
"tooth": Bola t u i ⁵⁵, Maru ts ɔ i ³¹, Rangoon tθ w a ⁵⁵ (initials: t, ts, tθ)
"wing": Bola t a u ŋ ⁵⁵, Maru t a u ŋ ³¹, Rangoon t ɑ u ∼ ²² (initials: t, t, t)

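The step from aligned words to patterns can be sketched by reading the alignment column by column. This is an illustration only: `ALIGNMENTS` and `alignment_sites` are hypothetical names, and the code assumes that the segment lists shown on the slide are already aligned, so that equal positions belong together.

```python
LANGUAGES = ("Bola", "Maru", "Rangoon")

# Aligned words copied from the slide, one list of segments per language,
# in the order given by LANGUAGES.
ALIGNMENTS = {
    "tooth": [
        ["t", "u", "i", "⁵⁵"],
        ["ts", "ɔ", "i", "³¹"],
        ["tθ", "w", "a", "⁵⁵"],
    ],
    "wing": [
        ["t", "a", "u", "ŋ", "⁵⁵"],
        ["t", "a", "u", "ŋ", "³¹"],
        ["t", "ɑ", "u", "∼", "²²"],
    ],
}


def alignment_sites(alignment):
    """Return the columns (sites) of an alignment, one sound per language."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return list(zip(*alignment))


# The first site of each alignment holds the initial consonants.
INITIALS = {concept: alignment_sites(rows)[0] for concept, rows in ALIGNMENTS.items()}

if __name__ == "__main__":
    for concept, site in INITIALS.items():
        print(concept, dict(zip(LANGUAGES, site)))
```

Each such column is one candidate correspondence set; the two initials shown here, (t, ts, tθ) and (t, t, t), are the ones highlighted on the slide.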
Inferring Correspondence Patterns
Sound Correspondence Patterns
PIE | Hittite | Sanskrit | Avestan | Greek | Latin | Gothic | Old Church Slavonic | Lithuanian | Old Irish | Armenian | Tocharian
*p | p | p | p, f | p | p | f, b | p | p | Ø | h, w, Ø | p
*b | b, p | b | b, β | b | b | p | b | b | b | p | p
*bʰ | b, p | bʱ/bh | b, β | pʰ/ph | f, b | b | b | b | b | b | p
*t | t | t | t, θ | t | t | θ/þ, d | t | t | t | tʼ, j/y | t, tʃ/c
*d | d, t | d | d, ð | d | d | t | d | d | d | t | ts, ʃ/ś
*dʰ | d, t | dʰ/dh, h | d, ð | tʰ/th | f, d, b | d | d | d | d | t | t, tʃ/c
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
*kʷ | kʷ/ku | k, c | k, c | k, p, t | kʷ/qu | hʷ/hw, g | k, tʃ/č | k | c | kʼ, tʃʼ/čʼ | k, ʃʲ/ś
*gʷ | kʷ/u | g, j | g, j | g, b, d | gʷ/gu, u | q | g, ʒ/ž, z | g | b | k | k, ś
*gʷʰ | kʷ/ku, gʷ/gu | gʱ/gh, h | g, j | pʰ/ph, tʰ/th, kʰ/kh | f, gʷ/gu, u | g, b | g, ʒ/ž, z | g | g | g, dʒ/ǰ | k, ʃʲ/ś
Clackson (2007: 37)

Sound Correspondence Patterns
• Correspondence patterns in linguistics are a way to encode mappings across several different alphabets.
• They are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds).
• The main problem of correspondence pattern identification is the handling of missing data, since not all cognate sets will necessarily contain reflexes from each of the languages under investigation.

Inference of Correspondence Patterns
• The main idea of the correspondence pattern inference algorithm is to derive a graph from correspondence sets: each individual correspondence set (a site in an aligned cognate set) is a node, and links are drawn between compatible correspondence sets.
• Two correspondence sets are compatible if they have identical non-missing values for at least one language and no conflicting values for any of the languages.
• If two or more correspondence sets are compatible, we can impute missing values by combining them.

Compatibility of Alignment Sites
Cognate Set | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8
“hand-1” | p | p | p | Ø | f | f | Ø | p
“foot-1” | p | p | p | p | f | f | p | p
Result: compatible
Cognate Set | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8
“hand-1” | p | p | p | Ø | f | f | Ø | p
“leg-1” | p | p | f | pf | f | f | p | p
Result: incompatible (L3 has p in “hand-1” but f in “leg-1”)

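The compatibility criterion and the imputation step can be sketched in a few lines, using the three alignment sites from the slide. This is an illustrative sketch under the definition given in the slides; the function names `compatible`, `impute` and `compatibility_edges` are assumptions and do not refer to an existing library.

```python
from itertools import combinations

MISSING = "Ø"

# Alignment sites copied from the slide (languages L1 to L8).
SITES = {
    "hand-1": ["p", "p", "p", "Ø", "f", "f", "Ø", "p"],
    "foot-1": ["p", "p", "p", "p", "f", "f", "p", "p"],
    "leg-1": ["p", "p", "f", "pf", "f", "f", "p", "p"],
}


def compatible(a, b, missing=MISSING):
    """True if a and b share at least one identical non-missing value and never conflict."""
    shared = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # missing data can neither support nor contradict
        if x != y:
            return False  # conflicting values for the same language
        shared += 1
    return shared >= 1


def impute(a, b, missing=MISSING):
    """Combine two compatible sites, filling gaps in one with values from the other."""
    if not compatible(a, b, missing):
        raise ValueError("only compatible sites can be combined")
    return [y if x == missing else x for x, y in zip(a, b)]


def compatibility_edges(sites):
    """List all pairs of site names that would be linked in the compatibility graph."""
    return [(s, t) for s, t in combinations(sites, 2) if compatible(sites[s], sites[t])]


if __name__ == "__main__":
    print(compatibility_edges(SITES))
    print(impute(SITES["hand-1"], SITES["foot-1"]))
```

With the slide's data, only “hand-1” and “foot-1” are linked, and combining them fills the two gaps of “hand-1” in L4 and L7 with p.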
Compatibility Graphs
• 8 Burmish languages (spoken in China and Myanmar, taken from Hill and List 2017)
• 240 concepts
• 855 partial cognate sets
• 728 cross-semantic partial cognate sets (covering one or more concepts)
• 218 valid cognate sets (residues in more than one language)

Compatibility Graphs
[Figure: compatibility graph of the correspondence sets; nodes are labelled with sounds such as pʰ, tʰ, ŋ, tsʰ, ʃ. Legend: good correspondence set, bad correspondence set.]
Only fully compatible clusters (i.e., only cliques in our network of correspondence sets) can represent true sound correspondence patterns (if sound change is regular).

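Whether a cluster in the compatibility graph is fully compatible can be checked by testing if a connected component is a clique. The sketch below uses a small hypothetical graph (`GRAPH`, with made-up node names A to F) stored as a dictionary of neighbor sets; it does not use the Burmish data.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(graph, nodes):
    """True if every node in `nodes` is linked to every other node in `nodes`."""
    return all(graph[n] >= (nodes - {n}) for n in nodes)


# Hypothetical toy graph: A, B, C are mutually compatible; D and F are both
# compatible with E but not with each other.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D", "F"},
    "F": {"E"},
}

if __name__ == "__main__":
    for component in connected_components(GRAPH):
        print(sorted(component), "clique" if is_clique(GRAPH, component) else "not a clique")
```

In the toy graph, the component A, B, C could stand for one regular pattern, while D, E, F cannot, because D and F conflict even though both are compatible with E.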
Correspondence Pattern Inference as Clique Cover Problem
• The clique cover problem (also called clique partitioning problem, see Bhasker 1991) is closely related to the famous graph coloring problem (a clique cover of a graph corresponds to a coloring of its complement graph) and has been shown to be NP-hard.
• The goal is to partition a graph into the smallest number of cliques such that each node belongs to exactly one clique.
• We assume (but cannot formally prove) that the minimum clique cover of our graph of compatible correspondence sets will correspond to the optimal set of sound correspondence patterns in our data.
• By applying an approximation algorithm to infer a near-optimal clique cover of our data of aligned cognate sets, we can infer the most frequently recurring correspondence patterns in our data.

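One simple way to approximate a clique cover is a greedy pass over the nodes. The sketch below is a generic greedy heuristic chosen for illustration; the slides do not specify which approximation algorithm was used, so this should not be read as the author's method. The graph `GRAPH` and the function name `greedy_clique_cover` are hypothetical.

```python
def greedy_clique_cover(graph):
    """Greedily partition the nodes of an undirected graph into cliques.

    `graph` maps each node to the set of its neighbors (no self-loops).
    The result is a valid clique cover but not necessarily a minimum one.
    """
    cliques = []
    # Visit well-connected nodes first; ties are broken by node name so the
    # result is reproducible.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        for clique in cliques:
            # The node may join a clique only if it is linked to all its members.
            if clique <= graph[node]:
                clique.add(node)
                break
        else:
            cliques.append({node})
    return cliques


# Hypothetical toy graph: a triangle A, B, C and a path D - E - F.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D", "F"},
    "F": {"E"},
}

if __name__ == "__main__":
    for clique in greedy_clique_cover(GRAPH):
        print(sorted(clique))
```

Because each node joins the first clique that accepts it, the outcome depends on the visiting order. This order dependence is the kind of greediness that can produce unintuitive cliques.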
Graph with Clique Cover
[Figure: the compatibility graph of correspondence sets partitioned into cliques; nodes are labelled with sounds.]

Summary of Results
• 104 initial consonant patterns (48 with more than one reflex, the rest highly irregular)
• Many patterns correspond to well-known proto-sounds in the data.
• Some cliques are unintuitive.

Summary of Problems
• Irregular patterns:
  • result in part from problems in cognate coding (homology assessment)
  • result in part from sparseness of data
  • result in part from the exactness of the algorithm
• Unintuitive patterns:
  • result in part from the greediness of the algorithm

Possible Improvements
• Counter the greediness of the algorithm by adding a secondary check of cliques (calculate the consensus, re-assign all alignment sites to all compatible consensus sequences, count the instances).
• Allow for non-perfect cliques in which compatibility is allowed to deviate to a certain degree (e.g., one irregular cell).
• Provide a more fine-grained check of proposed cliques by counting columns that suffer from missing data.
• Allow experts to check and correct patterns directly.

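The first two proposals can be sketched together: compute a consensus per clique, re-assign every alignment site to all consensus patterns it fits, and optionally tolerate a small number of conflicting cells. The data (`SITES`, `CLIQUES`) and all function names below are hypothetical, and the majority-vote consensus is an assumption about how the consensus might be calculated.

```python
from collections import Counter

MISSING = "Ø"


def consensus(sites, missing=MISSING):
    """Column-wise consensus: the most frequent non-missing value per language."""
    pattern = []
    for column in zip(*sites):
        values = [v for v in column if v != missing]
        pattern.append(Counter(values).most_common(1)[0][0] if values else missing)
    return pattern


def conflicts(site, pattern, missing=MISSING):
    """Number of languages in which site and pattern have different non-missing values."""
    return sum(1 for s, p in zip(site, pattern) if missing not in (s, p) and s != p)


def fits(site, pattern, max_conflicts=0, missing=MISSING):
    """True if the site shares at least one value with the pattern and conflicts rarely enough."""
    shared = sum(1 for s, p in zip(site, pattern) if s == p and s != missing)
    return shared >= 1 and conflicts(site, pattern, missing) <= max_conflicts


def reassign(sites, cliques, max_conflicts=0):
    """Re-assign every site to all consensus patterns it fits and count the instances."""
    patterns = [consensus([sites[name] for name in clique]) for clique in cliques]
    assignment, counts = {}, [0] * len(patterns)
    for name, site in sites.items():
        assignment[name] = [i for i, p in enumerate(patterns) if fits(site, p, max_conflicts)]
        for i in assignment[name]:
            counts[i] += 1
    return patterns, assignment, counts


# Hypothetical sites over four languages and a clique cover from an earlier greedy step.
SITES = {
    "s1": ["p", "p", "f", "Ø"],
    "s2": ["p", "Ø", "f", "f"],
    "s3": ["Ø", "Ø", "f", "f"],
    "s4": ["b", "b", "Ø", "f"],
}
CLIQUES = [["s1", "s2"], ["s3", "s4"]]

if __name__ == "__main__":
    patterns, assignment, counts = reassign(SITES, CLIQUES)
    print(patterns, assignment, counts)
```

In this toy example the site s3 fits both consensus patterns, so the counts reveal that the first pattern is supported by three sites and the second by two. Setting `max_conflicts=1` would correspond to allowing one irregular cell.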
Outlook
• The proposed inference of correspondence patterns is a first attempt to account for systemic aspects of sound change in a rigorous manner.
• In contrast to many approaches proposed so far, it does not require family trees in any form; networks are enough. The inferred patterns can nevertheless be used to study tree-like aspects of evolution (Chacon and List 2015).
• The algorithm is a good start, but it needs to be improved in several ways.