LexStat: Automatic detection of cognates in multilingual wordlists

Slide 1

Slide 1 text

Keys to the Past · Identification of Cognates · LexStat · Evaluation
LexStat: Automatic Detection of Cognates in Multilingual Wordlists
Johann-Mattis List
Institute for Romance Languages and Literature, Heinrich Heine University Düsseldorf
April 24, 2012

Slide 2

Slide 2 text

Structure of the Talk
1 Keys to the Past
2 Identification of Cognates
3 LexStat
4 Evaluation

Slide 3

Slide 3 text

Keys to the Past

Slide 4

Slide 4 text

Charles Lyell on Languages

Slide 5

Slide 5 text

Charles Lyell on Languages
The Geological Evidences of the Antiquity of Man, with Remarks on Theories of the Origin of Species by Variation. By Sir Charles Lyell. London: John Murray, Albemarle Street, 1863.

Slide 6

Slide 6 text

Charles Lyell on Languages
If we knew nothing of the existence of Latin, - if all historical documents previous to the fifteenth century had been lost, - if tradition even was silent as to the former existence of a Roman empire, a mere comparison of the Italian, Spanish, Portuguese, French, Wallachian, and Rhaetian dialects would enable us to say that at some time there must have been a language, from which these six modern dialects derive their origin in common.

Slide 7

Slide 7 text

Uniformitarianism and Abduction

Slide 8

Slide 8 text

Uniformitarianism and Abduction
Uniformitarianism

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Uniformitarianism and Abduction
Uniformitarianism
“Universality of Change” – Change is independent of time and space
“Graduality of Change” – Change is neither abrupt nor chaotic
“Uniformity of Change” – Change is not heterogeneous
Abduction

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Identification of Cognates
hjärta   h j - ä r t a -
herz     h - e - r z - -
heart    h - e a r t - -
cordis   c - - o r d i s

Slide 16

Slide 16 text

The Comparative Method
Basic Procedure

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

The Comparative Method
Basic Procedure
Compile an initial list of putative cognate sets.
Extract an initial list of putative sets of sound correspondences from the initial cognate list.
Refine the cognate list and the correspondence list by adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

The Comparative Method
Language-Specific Similarity Measure

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

The Comparative Method
Language-Specific Similarity Measure
Sequence similarity is determined on the basis of systematic sound correspondences as opposed to similarity based on surface resemblances of phonetic segments.
Lass (1997) calls this notion of similarity genotypic as opposed to a phenotypic notion of similarity.
The most crucial aspect of correspondence-based similarity is that it is language-specific: genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared.
Meaning     German     Dutch          English
“tooth”     [ʦaːn]     tand [tɑnt]    [tʊːθ]
“ten”       [ʦeːn]     tien [tiːn]    [tɛn]
“tongue”    [ʦʊŋə]     tong [tɔŋ]     [tʌŋ]

Slide 27

Slide 27 text

The Comparative Method
Language-Specific Similarity Measure
Meaning     German            Dutch          English
“tooth”     Zahn [ʦaːn]       tand [tɑnt]    tooth [tʊːθ]
“ten”       zehn [ʦeːn]       tien [tiːn]    ten [tɛn]
“tongue”    Zunge [ʦʊŋə]      tong [tɔŋ]     tongue [tʌŋ]

Slide 28

Slide 28 text

The Comparative Method
Language-Specific Similarity Measure
Meaning      Shanghai        Beijing          Guangzhou
“nine”       [ʨiɤ³⁵]         [ʨiou²¹⁴]        [kɐu³⁵]
“today”      [ʨiŋ⁵⁵ʦɔ²¹]     [ʨiɚ⁵⁵]          [kɐm⁵³jɐt²]
“rooster”    [koŋ⁵⁵ʨi²¹]     [kuŋ⁵⁵ʨi⁵⁵]      [kɐi⁵⁵koŋ⁵⁵]

Slide 29

Slide 29 text

Automatic Approaches

Slide 30

Slide 30 text

Automatic Approaches
Alignment Analyses

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Automatic Approaches
Alignment Analyses
In alignment analyses, sequences are arranged in a matrix in such a way that corresponding elements occur in the same column, while empty cells resulting from non-corresponding elements are filled with gap symbols.
t  ɔ   x  t  ə  r
d  ɔː  -  t  ə  r
Cognate identification is usually based on a similarity or distance score (e.g., edit-distance) calculated from the number of matches and mismatches in the alignment.
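As an illustration of such a distance score, here is a minimal Python sketch of the edit distance between two tokenized sequences, normalized by the length of the longer one. The function names and the choice of normalization are assumptions made for this example; the slides do not specify them.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # gap in b
                current[j - 1] + 1,          # gap in a
                previous[j - 1] + (x != y),  # match or mismatch
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# The two sequences from the slide, already split into segments.
seq_a = ["t", "ɔ", "x", "t", "ə", "r"]
seq_b = ["d", "ɔː", "t", "ə", "r"]
print(edit_distance(seq_a, seq_b), normalized_edit_distance(seq_a, seq_b))
```

Note that the long vowel ɔː is treated as one token that differs from ɔ, so the score depends on how the input is segmented.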

Slide 36

Slide 36 text

Automatic Approaches
Sound Classes

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Automatic Approaches
Sound Classes
Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986: 35).
[Figure: the sounds k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z]

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Automatic Approaches
Sound-Class-Based Alignment (SCA)

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Automatic Approaches
Sound-Class-Based Alignment (SCA)
Sound classes and alignment analyses can be easily combined by representing phonetic sequences internally as sound classes and comparing the sound classes with traditional alignment algorithms.
INPUT          tɔxtər                 dɔːtər
TOKENIZATION   t, ɔ, x, t, ə, r       d, ɔː, t, ə, r
CONVERSION     t ɔ x … → T O G …      d ɔː t … → T O T …
ALIGNMENT      T O G T E R            T O - T E R
CONVERSION     T O G … → t ɔ x …      T O - … → d ɔː - …
OUTPUT         t ɔ x t ə r            d ɔː - t ə r
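The conversion and alignment steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sound-class table covers just the segments of the example, the scoring values (match 1, mismatch -1, gap -1) are arbitrary, and a plain Needleman-Wunsch alignment stands in for whatever SCA does internally.

```python
# Toy sound-class table (assumption): only the segments of the slide's example.
SOUND_CLASSES = {"t": "T", "d": "T", "ɔ": "O", "ɔː": "O",
                 "x": "G", "ə": "E", "r": "R"}


def to_classes(tokens):
    """Convert phonetic segments to sound-class symbols."""
    return [SOUND_CLASSES[token] for token in tokens]


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns index pairs, with None marking a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def sca_align(tokens_a, tokens_b):
    """Align on sound classes, then map the result back to the original segments."""
    pairs = align(to_classes(tokens_a), to_classes(tokens_b))
    row_a = [tokens_a[i] if i is not None else "-" for i, _ in pairs]
    row_b = [tokens_b[j] if j is not None else "-" for _, j in pairs]
    return row_a, row_b


row_a, row_b = sca_align(["t", "ɔ", "x", "t", "ə", "r"], ["d", "ɔː", "t", "ə", "r"])
print(" ".join(row_a))
print(" ".join(row_b))
```

Because t and d share the class T and ɔ and ɔː share the class O, the only non-matching position is the x, which ends up opposite a gap.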

Slide 46

Slide 46 text

Slide 47

Slide 47 text

Traditional vs. Automatic Approaches

Slide 48

Slide 48 text

Traditional vs. Automatic Approaches
Similarity

Slide 49

Slide 49 text

Traditional vs. Automatic Approaches
Similarity
Almost all current automatic approaches are based on a language-independent similarity measure, while the comparative method applies a language-specific one.
Such approaches will therefore yield the same scores for phenotypically identical sequences, regardless of the language systems they belong to.

Slide 50

Slide 50 text

LexStat

Slide 51

Slide 51 text

Working Procedure

Slide 52

Slide 52 text

Slide 53

Slide 53 text

Slide 54

Slide 54 text

Slide 55

Slide 55 text

Working Procedure
Sequence Input: sequences are read from specifically formatted input files
Sequence Conversion: sequences are converted to sound classes and prosodic profiles
Scoring-Scheme Creation: using a permutation method, language-specific scoring schemes are determined
Distance Calculation: based on the language-specific scoring scheme, pairwise distances between sequences are calculated

Slide 56

Slide 56 text

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Implementation

Slide 59

Slide 59 text

Slide 60

Slide 60 text

Implementation
LexStat is implemented as part of the LingPy Python library (see http://lingulist.de/lingpy) for automatic tasks in historical linguistics.
The current release of LingPy (lingpy-1.0) provides methods for pairwise and multiple sequence alignment (SCA), automatic cognate detection (LexStat), and plotting routines (see the online documentation for details).

Slide 61

Slide 61 text

Slide 62

Slide 62 text

Input and Output
ID   Items    German   English   Swedish
1    hand     hant     hænd      hand
2    woman    fraʊ     wʊmən     kvina
3    know     kɛnən    nəʊ       çɛna
3    know     vɪsən    -         veːta
…    …        …        …         …

Slide 63

Slide 63 text

Input and Output
ID   Items    German   COG   English   COG   Swedish   COG
1    hand     hant     1     hænd      1     hand      1
2    woman    fraʊ     2     wʊmən     3     kvina     4
3    know     kɛnən    5     nəʊ       5     çɛna      5
3    know     vɪsən    6     -         0     veːta     6
…    …        …        …     …         …     …         …

Slide 64

Slide 64 text

Input and Output

Slide 65

Slide 65 text

Internal Representation of Sequences

Slide 66

Slide 66 text

Internal Representation of Sequences
Sound Classes and Prosodic Context

Slide 67

Slide 67 text

Slide 68

Slide 68 text

Slide 69

Slide 69 text

Internal Representation of Sequences
Sound Classes and Prosodic Context
All sequences are internally represented as sound classes, the default model being the one proposed in List (forthcoming).
All sequences are also represented by prosodic strings which indicate the prosodic environment (initial, ascending, maximum, descending, final) of each phonetic segment (List 2012).
The information regarding sound classes and prosodic context is combined, and each input sequence is further represented as a sequence of tuples, consisting of the sound class and the prosodic environment of the respective phonetic segment.

Slide 70

Slide 70 text

Scoring-Scheme Creation

Slide 71

Slide 71 text

Scoring-Scheme Creation
Attested Distribution

Slide 72

Slide 72 text

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Scoring-Scheme Creation
Attested Distribution
carry out global and pairwise alignment analyses of all sequence pairs occurring in the same semantic slot
store all corresponding segments that occur in sequences whose distance is beyond a certain threshold
Creation of the Expected Distribution
shuffle the wordlists repeatedly and carry out global and pairwise alignment analyses of all sequence pairs in the randomly shuffled wordlists
store all corresponding segments
average the results
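The permutation idea can be illustrated with a deliberately simplified Python sketch. It makes several assumptions that are not in the slides: only word-initial segments are compared (no alignment and no distance threshold), the number of shuffles and the random seed are arbitrary, and the final log-odds formula with a pseudo-count is a generic choice that does not reproduce the scores reported in the talk. The word pairs are taken from the German and English examples on the slides.

```python
import math
import random
from collections import Counter

# English and German words with the same meaning share a list position.
english = [["t", "ʊː", "θ"], ["t", "ɛ", "n"], ["t", "ʌ", "ŋ"],
           ["t", "aʊ", "n"], ["θ", "ɔː", "n"], ["d", "eɪ", "l"]]
german = [["ʦ", "aː", "n"], ["ʦ", "eː", "n"], ["ʦ", "ʊ", "ŋ", "ə"],
          ["ʦ", "au", "n"], ["d", "ɔ", "r", "n"], ["t", "aː", "l"]]


def initial_pairs(words_a, words_b):
    """Count pairs of word-initial segments for words in the same list position."""
    return Counter((a[0], b[0]) for a, b in zip(words_a, words_b))


def expected_pairs(words_a, words_b, runs=1000, seed=0):
    """Average the pair counts over repeatedly shuffled versions of words_b."""
    rng = random.Random(seed)
    shuffled = list(words_b)
    totals = Counter()
    for _ in range(runs):
        rng.shuffle(shuffled)
        totals.update(initial_pairs(words_a, shuffled))
    return {pair: count / runs for pair, count in totals.items()}


def log_odds(attested, expected, pseudo=0.1):
    """Generic log-odds of attested vs. expected counts (assumed formula)."""
    pairs = set(attested) | set(expected)
    return {pair: math.log2((attested.get(pair, 0) + pseudo) /
                            (expected.get(pair, 0.0) + pseudo))
            for pair in pairs}


attested = initial_pairs(english, german)
expected = expected_pairs(english, german)
scores = log_odds(attested, expected)
print(attested[("t", "ʦ")], round(expected[("t", "ʦ")], 2))
```

In this toy data, English t is paired with German ʦ more often in the real wordlist than in the shuffled ones, so the pair receives a positive score even though the two segments are phonetically different.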

Slide 75

Slide 75 text

Slide 76

Slide 76 text

Slide 77

Slide 77 text

Scoring-Scheme Creation
English    German      Att.    Exp.    Score
#[t,d]     #[t,d]      3.0     1.24    6.3
#[t,d]     #[ʦ]        3.0     0.38    6.0
#[t,d]     #[ʃ,s,z]    1.0     1.99    -1.5
#[θ,ð]     #[t,d]      7.0     0.72    6.3
#[θ,ð]     #[ʦ]        0.0     0.25    -1.5
#[θ,ð]     #[s,z]      0.0     1.33    0.5
[t,d]$     [t,d]$      21.0    8.86    6.3
[t,d]$     [ʦ]$        3.0     1.62    3.9
[t,d]$     [ʃ,s]$      6.0     5.30    1.5
[θ,ð]$     [t,d]$      4.0     1.14    4.8
[θ,ð]$     [ʦ]$        0.0     0.20    -1.5
[θ,ð]$     [ʃ,s]$      0.0     0.80    0.5

Slide 78

Slide 78 text

Slide 79

Slide 79 text

Scoring-Scheme Creation
           Initial          Final
English    town [taʊn]      hot [hɔt]
German     Zaun [ʦaun]      heiß [haɪs]
English    thorn [θɔːn]     mouth [maʊθ]
German     Dorn [dɔrn]      Mund [mʊnt]
English    dale [deɪl]      head [hɛd]
German     Tal [taːl]       Hut [huːt]

Slide 80

Slide 80 text

Sequence Clustering
                  Ger.    Eng.    Dan.    Swe.    Dut.    Nor.
Ger. [frau]       0.00    0.95    0.81    0.70    0.34    1.00
Eng. [wʊmən]      0.95    0.00    0.78    0.90    0.80    0.80
Dan. [kvenə]      0.81    0.78    0.00    0.17    0.96    0.13
Swe. [kvinːa]     0.70    0.90    0.17    0.00    0.86    0.10
Dut. [vrɑuʋ]      0.34    0.80    0.96    0.86    0.00    0.89
Nor. [kʋinə]      1.00    0.80    0.13    0.10    0.89    0.00
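A distance matrix like this one can be turned into cognate sets by a flat clustering procedure. The Python sketch below uses average-linkage merging that stops at a cutoff; both the linkage criterion and the threshold of 0.6 are assumptions for illustration, since the slide shows only the matrix.

```python
LABELS = ["Ger", "Eng", "Dan", "Swe", "Dut", "Nor"]
MATRIX = [
    [0.00, 0.95, 0.81, 0.70, 0.34, 1.00],
    [0.95, 0.00, 0.78, 0.90, 0.80, 0.80],
    [0.81, 0.78, 0.00, 0.17, 0.96, 0.13],
    [0.70, 0.90, 0.17, 0.00, 0.86, 0.10],
    [0.34, 0.80, 0.96, 0.86, 0.00, 0.89],
    [1.00, 0.80, 0.13, 0.10, 0.89, 0.00],
]


def flat_cluster(labels, matrix, threshold):
    """Merge the closest clusters (average linkage) until none is within the threshold."""
    clusters = [[i] for i in range(len(labels))]

    def average(c1, c2):
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        distance, a, b = min(
            (average(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[labels[i] for i in sorted(cluster)] for cluster in clusters]


clusters = flat_cluster(LABELS, MATRIX, threshold=0.6)
print(clusters)
```

With this cutoff the German and Dutch words form one set, the Danish, Swedish, and Norwegian words form another, and the English word stays on its own.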

Slide 81

Slide 81 text

Slide 82

Slide 82 text

Evaluation
v o l - d e m o r t
v - l a d i m i r -
v a l - d e m a r -

Slide 83

Slide 83 text

Gold Standard

Slide 84

Slide 84 text

Gold Standard
File    Family       Lng.    Itm.    Entr.    Source
GER     Germanic     7       110     814      Starostin (2008)
ROM     Romance      5       110     589      Starostin (2008)
SLV     Slavic       4       110     454      Starostin (2008)
PIE     Indo-Eur.    18      110     2057     Starostin (2008)
OUG     Uralic       21      110     2055     Starostin (2008)
BAI     Bai          9       110     1028     Wang (2006)
SIN     Sinitic      9       180     1614     Hóu (2004)
KSL     varia        8       200     1600     Kessler (2001)
JAP     Japonic      10      200     1986     Shirō (1973)

Slide 85

Slide 85 text

Evaluation Measures

Slide 86

Slide 86 text

Evaluation Measures
Set Comparison

Slide 87

Slide 87 text

Evaluation Measures
Set Comparison
Precision, Recall, and F-Score are calculated by comparing the cognate sets proposed by the method with the cognate sets in the gold standard (see Bergsma & Kondrak 2007).
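One possible reading of this set comparison is sketched below in Python: a proposed cognate set counts as correct only if exactly the same set occurs in the gold standard. This exact-match interpretation, the function name, and the example data (both the "gold" and the "proposed" sets are hypothetical, loosely based on the wordlist shown earlier) are assumptions; the formulation in Bergsma & Kondrak (2007) may differ in detail.

```python
def set_scores(proposed, gold):
    """Precision, recall, and F-score over exactly matching cognate sets."""
    proposed_sets = {frozenset(s) for s in proposed}
    gold_sets = {frozenset(s) for s in gold}
    shared = proposed_sets & gold_sets
    precision = len(shared) / len(proposed_sets)
    recall = len(shared) / len(gold_sets)
    f_score = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f_score


# Hypothetical gold standard: each word is a (language, form) pair.
gold = [
    {("German", "hant"), ("English", "hænd"), ("Swedish", "hand")},
    {("German", "fraʊ")}, {("English", "wʊmən")}, {("Swedish", "kvina")},
    {("German", "kɛnən"), ("English", "nəʊ"), ("Swedish", "çɛna")},
    {("German", "vɪsən"), ("Swedish", "veːta")},
]
# Hypothetical method output: the English word for "know" is wrongly split off.
proposed = [
    {("German", "hant"), ("English", "hænd"), ("Swedish", "hand")},
    {("German", "fraʊ")}, {("English", "wʊmən")}, {("Swedish", "kvina")},
    {("German", "kɛnən"), ("Swedish", "çɛna")}, {("English", "nəʊ")},
    {("German", "vɪsən"), ("Swedish", "veːta")},
]

precision, recall, f_score = set_scores(proposed, gold)
print(round(precision, 2), round(recall, 2), round(f_score, 2))
```

A single wrong split costs two proposed sets and one gold set here, which shows how strict an exact-match criterion is compared with counting pairwise decisions.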

Slide 88

Slide 88 text

Slide 89

Slide 89 text

Slide 90

Slide 90 text

Tests

Slide 91

Slide 91 text

Slide 92

Slide 92 text

Slide 93

Slide 93 text

Tests
Sound Classes – matching sound classes without alignment (based on Turchin et al. 2010)
Simple Alignment – normalized edit-distance (Levenshtein 1966)
SCA – language-independent distance scores derived from sound-class-based alignment analyses (List 2012)

Slide 94

Slide 94 text

Slide 95

Slide 95 text

General Results

Slide 96

Slide 96 text

General Results
Score              LexStat    SCA     Simple Alm.    Sound Cl.
Identical Pairs    0.85       0.82    0.76           0.74
Precision          0.59       0.51    0.39           0.39
Recall             0.68       0.57    0.47           0.55
F-Score            0.63       0.55    0.42           0.46

Slide 97

Slide 97 text

General Results
[Figure: scores per dataset (SLV, KSL, GER, BAI, SIN, PIE, ROM, JAP, OUG) on an axis from 0.6 to 1.0 for LexStat, SCA, NED, and Turchin]

Slide 98

Slide 98 text

Specific Results

Slide 99

Slide 99 text

Slide 100

Slide 100 text

Slide 101

Slide 101 text

Specific Results
Pairwise decisions were extracted from the KSL dataset and compared with the Gold Standard.
72 borrowings were explicitly marked along with their source by Kessler (2001).
83 chance resemblances were determined automatically by taking non-cognate word pairs with an NED score less than 0.6.

Slide 102

Slide 102 text

Slide 103

Slide 103 text

What’s next?
*deh3- ?

Slide 104

Slide 104 text

Special thanks to:
• The German Federal Ministry of Education and Research (BMBF) for funding our research project.
• Hans Geisler for his helpful, critical, and inspiring support.
• James Kilbury for all the time he spent on helping me to refine the manuscript.

Slide 105

Slide 105 text

THANK YOU FOR LISTENING!