Computer-Assisted Approaches to Linguistic Reconstruction

Slide 1

Slide 1 text

Computer-Assisted Approaches to Linguistic Reconstruction: A Case Study from the Burmish Languages
Johann-Mattis List¹ and Nathan W. Hill²
2017-07-21
¹ Max Planck Institute for the Science of Human History, ² SOAS, University of London

Slide 2

Slide 2 text

Outline
1. Introduction
2. Phonological Reconstruction
3. Computer-Assisted Phonological Reconstruction
4. Examples
5. Outlook

Slide 3

Slide 3 text

Introduction

Slide 6

Slide 6 text

Etymology 2.0? Historical linguistics after the quantitative turn:
• Quantitative methods in historical linguistics have received much attention of late,
• but only a few (if any) of the new methods have addressed long-standing problems of classical linguistics,

Slide 7

Slide 7 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Etymology 3.0? Towards a “qualitative turn” in computational historical linguistics:
• Instead of blaming computers for our misery (no funding, institutes being shut down, etc.), we should start seeing computers as a chance to address the important questions which we have not solved in 200 years of research...
• But we don’t need a framework in which computers do our work for us; instead, we need a framework in which we tell computers to do some of the work for us, in order to render our research more explicit, more efficient, and more rigorous.

Slide 11

Slide 11 text

Computer-Assisted Language Comparison
[Figure: illustration with the captions “very long title” and “P(A|B)=P(B|A)...”]

Slide 12

Slide 12 text

Computer-Assisted Language Comparison
CALC (MPI-SHH, Jena)
• computational formalization of the classical methods for historical language comparison
• establish a close collaboration between computational and classical historical linguistics by providing data in human- and machine-readable form

Slide 13

Slide 13 text

Slide 14

Slide 14 text

The Burmish Languages
Source: Hill and List (forthcoming)

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

The Burmish Etymological Database (BED)
Problem: current etymological accounts of Proto-Burmish have many shortcomings (no lexical reconstruction, insufficient phonological reconstruction, unclear data, non-transparent methodology)
Goal: compile an etymological database of Proto-Burmish
Procedure: BED as litmus test for CALC:
• make the Proto-Burmish Etymological Database project a first test for the CALC framework
• use existing computational methods to pre-analyze the data
• develop interfaces to allow for correction and inspection by the experts

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Where are you now with BED?
⊠ Develop computer-assisted workflows to create and curate data in human- and machine-readable form.
⊠ Develop computer-assisted workflows for partial cognate detection and alignment.
⊟ Develop methods for automatic phonological reconstruction and workflows for the correction of the results by experts (→ THIS TALK).
□ Develop methods for lexical reconstruction (first ideas, not shown in this talk).

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Phonological Reconstruction

Slide 27

Slide 27 text

Slide 28

Slide 28 text

What is phonological reconstruction?
• Phonological reconstruction is primarily understood as the reconstruction of the sound system of a language not reflected in written sources.
• More specifically, however, we see phonological reconstruction as the task of reconstructing major patterns of sound change which allow us to reconstruct tentative proto-forms from cognate sets, regardless of whether those words were really present in the Ursprache or what those forms meant.
• The task of lexical reconstruction follows phonological reconstruction in projecting full lexemes to the Ursprache, thereby also assessing their meaning and whether it is reasonable to reconstruct them at all.

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Classical Workflow (Comparative Method)
• assemble cognate sets and sound correspondences by comparing data from different languages
• infer sound change processes (“sound laws”) from the inferred sound correspondence patterns
• explain exceptions by:
  • refining inferred sound change processes (cf. “Verner’s law”)
  • borrowing (“substratum influence”)
  • analogy (leftovers)

Slide 34

Slide 34 text

Computer-Based Automatic Approaches

Slide 37

Slide 37 text

Computer-Based Automatic Approaches
Problems of computer-based approaches:
(a) they fail to model sound change as a systemic process (each column of an alignment is counted independently)
(b) they fail to make use of linguistic knowledge on the directionality of sound change processes and have to rely on phylogenies

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Computer-Assisted Phonological Reconstruction

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

General Workflow
Preliminary steps:
• partial cognate detection and partial phonetic alignment (List et al. 2016) with manual refinement
• preliminary identification of cross-semantic cognates based on partial colexifications (Hill and List forthcoming)
Phonological reconstruction:
• sound correspondence pattern identification (List et al. in prep.)

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Detailed Workflow: Preliminary Steps
[Figure: partial cognate detection in four steps (A to D), illustrated with the morphemes of words from Fúzhōu (ŋuoʔ⁵ kuoŋ⁴⁴), Měixiàn (ŋiat⁵ kuoŋ⁴⁴), Wēnzhōu (ȵy²¹ kuɔ³⁵ vai¹³), and Běijīng (yɛ⁵¹ liɑŋ¹): matrix of pairwise distances between the morphemes and their grouping into partial cognate sets]
Partial cognate detection: List, Lopez, and Bapteste (2016)

Slide 46

Slide 46 text

Detailed Workflow: Preliminary Steps
EDICTOR tool: List (2017)

Slide 47

Slide 47 text

Detailed Workflow: Preliminary Steps
Language  'mountain'  'dog'            'thunder'          'wolf'                  'bear (n.)'
Atsi      pum⁵¹       kʰui²¹           mau²¹ mjiŋ⁵¹       vam⁵¹ kʰui²¹ mo⁵⁵       vam⁵¹
          mountain    dog              sky + thunder      bear + dog + m-suff.    bear
Bola      pam⁵⁵       kʰui³⁵           mau³¹ mjaŋ⁵⁵       mjaŋ⁵⁵ kʰui³⁵           vɛ⁵ ⁵⁵
          mountain    dog              sky + thunder      thunder + dog           bear
Lashi     pɔm³¹       kʰui⁵⁵           mou³³ kɔm³³        wɔm³¹ kʰui⁵⁵            wɔm³¹
          mountain    dog              sky + thunderB     bear + dog              bear
Maru      pam³¹       lə¹ ³¹ kʰa³⁵     muk⁵⁵ kum³¹        mjaŋ³¹ kʰa³⁵            vɛ⁵ ³¹
          mountain    ? + dog          sky + thunderB     thunder + dog           bear
Achang    pum⁵⁵       xui³¹            mau³¹ ʐau³¹        pum⁵⁵ xui³¹             ɔm⁵⁵
          mountain    dog              sky + thunderC     mountain + dog          bear
Morpheme annotation (Hill and List forthc.)

Slide 48

Slide 48 text

Detailed Workflow: Sound Correspondence Pattern Inference
Clackson (2007: 37)

Slide 50

Slide 50 text

Slide 51

Slide 51 text

Detailed Workflow: Sound Correspondence Pattern Inference
Sound Correspondence Patterns and Phonological Reconstruction:
• the most traditional way to reconstruct in the comparative-method framework is to infer patterns of regular sound correspondences across a set of languages and then assign a proto-form to each distinct pattern
• correspondence patterns are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds)

Slide 52

Slide 52 text

Slide 54

Slide 54 text

Slide 55

Slide 55 text

Detailed Workflow: Sound Correspondence Pattern Inference
Graphs of Compatible Correspondence Sets:
• the main idea of the correspondence pattern inference algorithm is to derive a graph from the correspondence sets, in which each individual correspondence set (a site in an aligned cognate set) is a node and links are drawn between compatible correspondence sets
• two correspondence sets are compatible if they have identical non-missing values for at least one language and no conflicting values for any of the languages

Slide 56

Slide 56 text

Slide 58

Slide 58 text

Detailed Workflow: Sound Correspondence Pattern Inference
Cognate Set: L1 L2 L3 L4 L5 L6 L7 L8
“hand-1”: p p p f f p
“foot-1”: p p p p f f p p
⊠ compatible □ incompatible
Cognate Set: L1 L2 L3 L4 L5 L6 L7 L8
“hand-1”: p p p f f p
“leg-1”: p p f pf f f p p
□ compatible ⊠ incompatible
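The compatibility criterion and the resulting graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the names `MISSING`, `compatible`, and `compatibility_graph` are hypothetical, and because the columns of the two missing values in “hand-1” are not preserved in the slide text, their placement here (L4 and L7) is an assumption.

```python
from itertools import combinations

MISSING = None  # assumed marker for "no reflex attested in this language"


def compatible(site_a, site_b):
    """True if two sites never conflict and share at least one attested value."""
    shared = 0
    for a, b in zip(site_a, site_b):
        if a is MISSING or b is MISSING:
            continue  # a gap never causes a conflict
        if a != b:
            return False  # conflicting sounds in the same language
        shared += 1
    return shared >= 1


def compatibility_graph(sites):
    """Map each site name to the set of sites it is compatible with."""
    graph = {name: set() for name in sites}
    for a, b in combinations(sites, 2):
        if compatible(sites[a], sites[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy data following the hand/foot/leg example (gap positions are assumed).
sites = {
    "hand-1": ("p", "p", "p", MISSING, "f", "f", MISSING, "p"),
    "foot-1": ("p", "p", "p", "p", "f", "f", "p", "p"),
    "leg-1": ("p", "p", "f", "pf", "f", "f", "p", "p"),
}
graph = compatibility_graph(sites)
```

With these toy rows, “hand-1” and “foot-1” are linked, while “leg-1” stays isolated because it conflicts with both in L3.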

Slide 59

Slide 59 text

Detailed Workflow: Sound Correspondence Pattern Inference
[Figure: network of correspondence sets; the nodes are labelled with reflexes such as s, k, kʰ, x, ɣ, ʃ, tʃ, ts, t, n]

Slide 60

Slide 60 text

Detailed Workflow: Sound Correspondence Pattern Inference
[Figure: detail of the network for the sound x, contrasting a good correspondence set with a bad correspondence set]

Slide 61

Slide 61 text

Detailed Workflow: Sound Correspondence Pattern Inference
Only fully compatible clusters (i.e., only cliques in our network of correspondence sets) can represent true sound correspondence patterns (if sound change is regular).
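Whether a cluster of correspondence sets is fully compatible can be checked directly on the graph. The following sketch assumes the graph is stored as a dictionary mapping each node to the set of its neighbours; the function name `is_clique` and the toy graph are hypothetical.

```python
from itertools import combinations


def is_clique(nodes, graph):
    """True if every pair of the given nodes is linked in the graph."""
    return all(b in graph[a] for a, b in combinations(nodes, 2))


# Toy graph: A, B, and C are mutually compatible, D is linked to C only.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
fully_compatible = is_clique(["A", "B", "C"], toy_graph)
not_fully_compatible = is_clique(["B", "C", "D"], toy_graph)
```

In the toy graph, the cluster A, B, C is a clique, whereas B, C, D is not, since B and D are not linked.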

Slide 63

Slide 63 text

Slide 64

Slide 64 text

Slide 65

Slide 65 text

Detailed Workflow: Sound Correspondence Pattern Inference
Sound Correspondence Pattern Inference as a Clique Cover Problem:
• The clique cover problem (also called clique partitioning problem, see Bhasker 1991) is equivalent to the well-known graph coloring problem on the complement graph and has been shown to be NP-hard.
• The goal is to partition a graph into the smallest possible number of cliques, such that each node belongs to exactly one clique.
• We assume (but cannot formally prove) that the minimum clique cover of our graph of compatible correspondence sets corresponds to the optimal set of sound correspondence patterns in our data.
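Since the problem is NP-hard, practical solutions have to rely on approximations. The sketch below shows one simple greedy heuristic, which is an assumption made for illustration and not necessarily the procedure used for the results in this talk: nodes are visited in order of decreasing degree, and each node joins the first existing clique whose members are all its neighbours. The function name and the toy graph are hypothetical, and the result is a valid partition into cliques but not guaranteed to be minimal.

```python
def greedy_clique_cover(graph):
    """Partition the nodes into cliques with a simple greedy heuristic."""
    cliques = []
    # Visit well-connected nodes first; ties are broken by node name.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        for clique in cliques:
            # The node may join only if it is linked to every member.
            if all(member in graph[node] for member in clique):
                clique.append(node)
                break
        else:
            cliques.append([node])  # no fitting clique: open a new one
    return cliques


# Toy graph: a triangle a-b-c plus a chain c-d-e.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
cover = greedy_clique_cover(toy_graph)
```

For the toy graph, the heuristic yields two cliques, one with a, b, and c and one with d and e; each clique would then be read as one candidate correspondence pattern.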

Slide 66

Slide 66 text

Slide 69

Slide 69 text

Slide 70

Slide 70 text

Detailed Workflow: Automatic Reconstruction
We can do without trees!
• Phonological reconstruction in the comparative-method framework usually starts from correspondence patterns.
• Apart from very few exceptions, it does not require knowledge of any specific phylogeny for the language family under investigation (at least not for most consonants).
• What it does require, however, is knowledge of the major sound change transitions, which have strong directional preferences for consonants (much less so for vowels and tones).

Slide 71

Slide 71 text

Slide 72

Slide 72 text

Detailed Workflow: Automatic Reconstruction
[Figure: sound-change graph whose nodes are the sounds attested in the data: vowels (plain, long, nasalized, and creaky), nasals, fricatives, liquids, glides, stops, and affricates]

Slide 73

Slide 73 text

Detailed Workflow: Automatic Reconstruction
[Figure: excerpt of the sound-change graph with the nodes s, ts, tθ, ʃ, tʃ, ʂ]

Slide 74

Slide 74 text

Detailed Workflow: Automatic Reconstruction
[Figure: correspondence pattern with the reflexes s, ts, tθ, ʂ and the matching sub-graph; reconstructed proto-sound: *ts]

Slide 75

Slide 75 text

Detailed Workflow: Automatic Reconstruction
[Figure: correspondence pattern with the reflexes ʃ, s, ʂ, shown against the graph excerpt]

Slide 76

Slide 76 text

Detailed Workflow: Automatic Reconstruction
[Figure: correspondence pattern with the reflexes ʃ, s, ʂ and the matching sub-graph; reconstructed proto-sound: *ʃ]

Slide 77

Slide 77 text

Detailed Workflow: Automatic Reconstruction
[Figure: correspondence pattern with the reflexes ts, tʃ, s, ʂ, shown against the graph excerpt]

Slide 78

Slide 78 text

Detailed Workflow: Automatic Reconstruction
[Figure: correspondence pattern with the reflexes ts, tʃ, s, ʂ and the matching sub-graph; reconstructed proto-sounds: *tʃ/ts]

Slide 81

Slide 81 text

Slide 82

Slide 82 text

Slide 83

Slide 83 text

Slide 84

Slide 84 text

Detailed Workflow: Automatic Reconstruction
Automatic Reconstruction Strategy:
1. extract the sub-graph from the sound-change graph for each distinct sound in a given correspondence pattern,
2. search for a potential source in the sub-graph, i.e., a sound that has no ancestor,
3. if
  • there is a source, select it as proto-form,
  • there are multiple sources, select all as proto-form,
  • the graph is disconnected or no source can be found (loops in the graph), select the most frequently recurring form as a potential proto-form (“majority rules”),
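The strategy can be sketched as follows. This is a minimal illustration, not the prototype itself: the function names are hypothetical, the edges in `SOUND_CHANGES` are toy assumptions chosen only so that the three sibilant patterns from the preceding figures come out as *ts, *ʃ, and *tʃ/ts, and the reflex lists in the usage lines are likewise invented. "Disconnected" is interpreted here as "not weakly connected", which is also an assumption.

```python
from collections import Counter

# Toy directed sound-change graph: sound -> sounds it may change into.
# These edges are illustrative assumptions, not the graph from the talk.
SOUND_CHANGES = {
    "ts": {"s", "tθ"},
    "tʃ": {"ʃ", "ʂ"},
    "ʃ": {"s", "ʂ"},
    "s": {"ʂ"},
}


def induced_edges(sounds, changes):
    """Keep only transitions whose source and target both occur in `sounds`."""
    return {(a, b) for a in sounds for b in changes.get(a, ()) if b in sounds}


def is_connected(sounds, edges):
    """Check weak connectivity, i.e., ignoring the direction of the edges."""
    neighbours = {s: set() for s in sounds}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(sounds))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == sounds


def reconstruct(reflexes, changes=SOUND_CHANGES):
    """Propose proto-sound candidates for one non-empty correspondence pattern."""
    sounds = set(reflexes)
    edges = induced_edges(sounds, changes)
    targets = {b for _, b in edges}
    sources = sorted(sounds - targets)  # sounds without an ancestor
    if sources and is_connected(sounds, edges):
        return sources  # a single source, or several competing ones
    # Fallback ("majority rules"): the most frequent reflex.
    return [Counter(reflexes).most_common(1)[0][0]]


single = reconstruct(["s", "ts", "tθ", "ts", "ʂ", "s", "s", "ts"])
multiple = reconstruct(["ts", "tʃ", "ʂ", "tʃ", "ts", "s", "ʂ", "ts"])
majority = reconstruct(["tθ", "ʃ", "ʃ"])
```

In this toy setting, `single` contains only ts, `multiple` contains both ts and tʃ, and `majority` falls back to ʃ because tθ and ʃ are not linked in the toy graph.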

Slide 85

Slide 85 text

Slide 86

Slide 86 text

Slide 89

Slide 89 text

Detailed Workflow: Automatic Reconstruction
Advantages of the Approach:
(a) ⊠ systemic aspects of sound change are integrated into the correspondence pattern detection algorithm
(b) ⊠ linguistic knowledge (even language-specific knowledge) is exhaustively used to construct the sound-change networks

Slide 90

Slide 90 text

Slide 91

Slide 91 text

Examples

Slide 92

Slide 92 text

General Findings
Basic Statistics:
• 8 languages
• 240 concepts
• 855 partial cognate sets
• 728 cross-semantic partial cognate sets
• 218 valid cognate sets (with more than two reflexes)
• 104 initial consonant patterns (48 with more than one reflex, the rest highly irregular)
• well-reconstructed proto-sounds:
  stops and affricates: *k, *kʰ, *t, *tʰ, *tʃ, *tʃʰ, *ts, *tsʰ, *p, *pʰ
  fricatives: *s, *ʃ, *x
  liquids and j: *r, *l, *j
  nasals: *ŋ, *n, *m

Slide 93

Slide 93 text

Specific Findings: “black” and “dark”
Language        black          dark
Old Burmese     n a k -        ∅
Rangoon         n ɛ ʔ ⁴        ∅
Achang          l ɔ k ⁵⁵       ∅
Xiandao         n ɔ ʔ ⁵⁵       ∅
Atsi            n o ʔ ²¹       n o ʔ ²¹
Bola            n a ʔ ³¹       n a ʔ ³¹
Lashi           n ɔː ʔ ³¹      ∅
Maru            n ɔ ʔ ³¹       n ɔ ʔ ³¹
Proto-Burmish   n *a k ³¹      *n *a|*ṵ ʔ|*k ³¹
[i] The discrepancy between the reconstructions for these two forms, which were regularly recognized as cognate in all languages, is due to a shortcoming of the prototype, which treats each correspondence set independently rather than summarizing all possible reflexes for accepted cross-semantic cognate sets.

Slide 94

Slide 94 text

Specific Findings: “middle” and “outside”
Language        middle         outside (“out-middle”)
Old Burmese     ∅              ∅
Rangoon         ∅              ∅
Achang          k u - ŋ ⁵⁵     ∅
Xiandao         k o - ŋ ⁵⁵     ∅
Atsi            k u - ŋ ²¹     ∅
Bola            k a u ŋ ³¹     k a u ŋ ³¹
Lashi           k u - ŋ ³¹     ∅
Maru            k a u ŋ ³⁵     k a u ŋ ³⁵
Proto-Burmish   *k u - *ŋ ⁵⁵   ∅
[i] The algorithm refuses to reconstruct the morpheme “middle” in the word “outside”, as it occurs only twice. If the prototype, again, reconstructed only once per cross-semantic cognate set, the results would be the same.

Slide 95

Slide 95 text

Specific Findings: “tree” and “wood”
Language        tree           wood
Old Burmese     s a ts -       s a ts -
Rangoon         tθ i ʔ ⁴       tθ i ʔ ⁴
Achang          s a ŋ ³¹       ʂ ə k ⁵⁵
Xiandao         ʂ ɯ k ⁵⁵       ∅
Atsi            s i k ⁵⁵       s i k ⁵⁵
Bola            s a k ⁵⁵       s a k ⁵⁵
Lashi           s ə̰ k ⁵⁵       s ə̰ k ⁵⁵
Maru            s a̰ k ⁵⁵       s a̰ k ⁵⁵
Proto-Burmish   s a k ⁵⁵       *s ə̰ *k ⁵⁵
[i] Apart from the vowel, which is marked as irregular in the reconstruction for “wood”, this reconstruction is regular and also easy to compare with other Sino-Tibetan reflexes. The reconstruction for “tree”, on the other hand, is irregular, due to the wrong cognate assignment for Achang.

Slide 96

Slide 96 text

Outlook

Slide 97

Slide 97 text

We are only beginning to explore the potential of sound correspondence pattern analysis as a backbone for automatic linguistic reconstruction. Even now, it is straightforward to manually annotate all the different sound correspondence patterns which we could infer from the data. To test the full potential of the approach, we will have to drastically increase the number of lexical items in our data, but even in its current state, the approach is promising and can serve as a starting point for a classical phonological reconstruction analysis.

Slide 98

Slide 98 text

Thank you for your attention!