Slide 1

Slide 1 text

Investigating the Impact of Sample Size on Cognate Detection
Johann-Mattis List
Research Unit Quantitative Language Comparison
Philipps-University Marburg
March 17, 2013
1 / 30

Slide 2

Slide 2 text

Sanscruta and Italian

Sono scritte le loro scienze tutte in una lingua, che dimandano Sanscruta, che vuol dire bene articolata. [...] et ha la lingua d’oggi molte cose comuni con quella, nella quale sono molti de’ nostri nomi, e particularmente de’ numeri il 6, 7, 8 e 9, Dio, serpe, et altri assai. (Sassetti 1855: 415)

Translation: Everything that is related to science is written in a language which they call “Sanscruta”, meaning as much as “well-articulated”. Our language has much in common with it, among others many of our words, especially the numbers 6, 7, 8, and 9, “God”, “snake”, and many more.
2 / 30

Slide 3

Slide 3 text

The Comparative Method 3 / 30

Slide 4

Slide 4 text

The Comparative Method: Working Procedure

- proof of relationship
- identification of cognates
- identification of sound correspondences
- reconstruction of proto-forms
- internal classification

4 / 30

Slide 6

Slide 6 text

The Comparative Method Cognate Detection Cognate Detection 5 / 30

Slide 7

Slide 7 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m
  German   Dorn   d  ɔɐ  n
  English  thorn  d  ɔː  n

Correspondence list:
  GER  ENG  Frequ.
  d    θ    3 x
  d    d    1 x
  n    n    1 x
  m    m    1 x
  ŋ    ŋ    1 x

5 / 30

Slide 9

Slide 9 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m
  German   Dorn   d  ɔɐ  n
  English  thorn  d  ɔː  n

Correspondence list:
  GER  ENG  Frequ.
  d    θ    2 x
  d    d    1 x
  n    n    1 x
  m    m    1 x
  ŋ    ŋ    1 x

5 / 30

Slide 10

Slide 10 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m
  German   Dorn   d  ɔɐ  n
  English  thorn  θ  ɔː  n

Correspondence list:
  GER  ENG  Frequ.
  d    θ    2 x
  d    d    1 x
  n    n    1 x
  m    m    1 x
  ŋ    ŋ    1 x

5 / 30

Slide 11

Slide 11 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m
  German   Dorn   d  ɔɐ  n
  English  thorn  θ  ɔː  n

Correspondence list:
  GER  ENG  Frequ.
  d    θ    3 x
  d    d    1 x  ?
  n    n    2 x
  m    m    1 x
  ŋ    ŋ    1 x

5 / 30

Slide 12

Slide 12 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m
  German   Dorn   d  ɔɐ  n
  English  thorn  θ  ɔː  n

Correspondence list:
  GER  ENG  Frequ.
  d    θ    3 x
  d    d    1 x
  n    n    2 x
  m    m    1 x
  ŋ    ŋ    1 x

5 / 30

Slide 13

Slide 13 text

The Comparative Method: Cognate Detection

Cognate list and alignment:
  German   dünn   d  ʏ   n
  English  thin   θ  ɪ   n
  German   Ding   d  ɪ   ŋ
  English  thing  θ  ɪ   ŋ
  German   Dorn   d  ɔɐ  n
  English  thorn  θ  ɔː  n
  German   dumm   d  ʊ   m
  English  dumb   d  ʌ   m

Correspondence list:
  GER  ENG  Frequ.
  d    θ    3 x
  n    n    2 x
  ŋ    ŋ    1 x

5 / 30

Slide 14

Slide 14 text

The Comparative Method: Summary

Important Aspects
- language-specific notion of word similarity
- regular sound correspondences
- iterative character

Unspecified Parameters
- number of languages
- semantic similarity of the words
- size of the word lists

6 / 30

Slide 15

Slide 15 text

The Comparative Method: Summary

The Problem of the Sample Size

            Albanian  English  French  German
  Albanian            0.07     0.10    0.10
  English   14                 0.23    0.56
  French    20        46               0.23
  German    20        111      46

Numbers (below the diagonal) and proportions (above the diagonal) of shared cognates in the Swadesh-200 list (Swadesh 1952), taken from Kessler (2001).
7 / 30
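The proportions in this table appear to be the cognate counts divided by the 200 items of the Swadesh list. A minimal sketch of that arithmetic follows; the dictionary layout and the name `LIST_SIZE` are illustrative choices, not part of the original slides.

```python
# Shared cognate counts as given on the slide (Kessler 2001).
LIST_SIZE = 200  # assumption: every proportion is relative to the full Swadesh-200 list

shared_cognates = {
    ("Albanian", "English"): 14,
    ("Albanian", "French"): 20,
    ("Albanian", "German"): 20,
    ("English", "French"): 46,
    ("English", "German"): 111,
    ("French", "German"): 46,
}

# Proportion of the word list that is cognate for each language pair.
proportions = {pair: count / LIST_SIZE for pair, count in shared_cognates.items()}

for (lang_a, lang_b), share in proportions.items():
    print(f"{lang_a}-{lang_b}: {share:.2f}")
```

Even the closely related pair English and German shares only 111 of 200 items, which is why the size of the word list matters for finding recurrent correspondences.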

Slide 16

Slide 16 text

Automatic Cognate Detection 8 / 30

Slide 17

Slide 17 text

Automatic Cognate Detection: Two Types of Similarity

“Phenotypic” Similarity (Lass 1997)
- based on surface resemblances of phonetic segments only
- depends on the words under comparison

“Genotypic” Similarity (ibid.)
- based on sound correspondences
- depends on the words and the languages under comparison

9 / 30

Slide 18

Slide 18 text

Automatic Cognate Detection Similarity Two Types of Similarity German Mund [mʊnt] English mouth [mauθ] 10 / 30

Slide 19

Slide 19 text

Automatic Cognate Detection: Two Types of Similarity

German Mund [mʊnt], English mouth [mauθ]

  German              GER  ENG   English
  Milch    [mɪlç]     m    m     [mɪlk]     milk
  rund     [rʊnt]     ʊ    au    [raund]    round
  anders   [andərs]   n    -     [ʌ(-)θər]  other
  südlich  [sytlɪç]   t    θ     [sʌθərn]   southern

10 / 30

Slide 20

Slide 20 text

Automatic Cognate Detection: Language-Independent Approaches

Normalized Edit Distance
- align two words and calculate their Hamming distance
- normalize by dividing by the length of the longer word
- assume cognacy for distances below a certain threshold

Turchin et al. (2010)
- convert two (or more) words to Dolgopolsky (1966) consonant classes
- assume cognacy if the first two classes match

11 / 30
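The normalized edit distance can be sketched in a few lines. The version below is a minimal illustration, assuming words are given as lists of phonetic segments and that small distances count as evidence for cognacy. The function names and the threshold value of 0.7 are hypothetical and are not taken from the talk or from LingPy.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer word."""
    longer = max(len(a), len(b))
    return edit_distance(a, b) / longer if longer else 0.0


def is_cognate_ned(a, b, threshold=0.7):
    """Hypothetical decision rule: cognate if the distance is below the threshold."""
    return normalized_edit_distance(a, b) < threshold


# German Mund and English mouth, segmented as in the deck's worked example.
mund = ["m", "ʊ", "n", "t"]
mouth = ["m", "au", "θ"]
print(normalized_edit_distance(mund, mouth))  # 0.75
print(is_cognate_ned(mund, mouth))            # False
```

With this segmentation the distance is 3/4 = 0.75, so the pair is rejected even though the words are cognate. That is the weakness of a purely surface-based measure.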

Slide 21

Slide 21 text

Automatic Cognate Detection Language-Independent Approaches Language-Independent Approaches German Mund [mʊnt] English mouth [mauθ] 12 / 30

Slide 22

Slide 22 text

Automatic Cognate Detection: Language-Independent Approaches

German Mund [mʊnt], English mouth [mauθ]

Turchin:
  mʊnt → M N T
  mauθ → M T
  Matches: x
  1 match => not cognate

NED:
  m  ʊ   n  t
  m  au  -  θ
  0  1   1  1
  3/4 = 0.75 => not cognate

12 / 30
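The consonant-class comparison can be sketched as follows. The mapping is only a small illustrative subset of Dolgopolsky-style classes, written out by hand for this example; it is an assumption and not the full inventory used by Turchin et al. (2010) or by LingPy. Vowels are simply skipped, which is a further simplification.

```python
# Illustrative subset of consonant classes (assumption, not the full model).
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T", "θ": "T", "ð": "T",
    "s": "S", "z": "S", "ʃ": "S", "ʒ": "S",
    "k": "K", "g": "K",
    "m": "M",
    "n": "N", "ŋ": "N",
    "r": "R", "l": "R",
}


def consonant_classes(segments):
    """Map segments to classes, dropping anything not in the table (e.g. vowels)."""
    return [SOUND_CLASSES[s] for s in segments if s in SOUND_CLASSES]


def is_cognate_turchin(a, b):
    """Cognate if the first two consonant classes of both words are identical."""
    first_a = consonant_classes(a)[:2]
    first_b = consonant_classes(b)[:2]
    return len(first_a) > 0 and first_a == first_b


mund = ["m", "ʊ", "n", "t"]
mouth = ["m", "au", "θ"]
print(consonant_classes(mund))          # ['M', 'N', 'T']
print(consonant_classes(mouth))         # ['M', 'T']
print(is_cognate_turchin(mund, mouth))  # False

duenn = ["d", "ʏ", "n"]
thin = ["θ", "ɪ", "n"]
print(is_cognate_turchin(duenn, thin))  # True
```

The pair dünn and thin passes because d and θ fall into the same class, while Mund and mouth fails on the second consonant, as in the worked example.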

Slide 23

Slide 23 text

Automatic Cognate Detection: Language-Specific Approaches

LexStat (List 2012a)
- represent words as tuples of sound classes and prosodic strings
- use the SCA approach (List 2012b) to guess initial correspondences
- use a Monte-Carlo permutation test to derive language-specific similarity scores
- use the language-specific scores to calculate the distance between words
- cluster words into cognate sets using a flat clustering algorithm

13 / 30
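The idea behind the permutation step can be illustrated with a deliberately reduced toy. The sketch below is not LexStat's scoring scheme: it looks only at word-initial segments, uses no alignment, sound classes, or prosodic context, and the smoothing constant of 0.5 is an arbitrary assumption. It compares how often a segment pair is attested in words with the same meaning against how often it turns up when one word list is shuffled at random.

```python
import math
import random
from collections import Counter


def initial_pairs(words_a, words_b):
    """Count pairs of word-initial segments across two parallel word lists."""
    return Counter((a[0], b[0]) for a, b in zip(words_a, words_b))


def permutation_scores(words_a, words_b, runs=1000, seed=0):
    """Log-odds of attested versus randomly expected initial-segment pairs."""
    rng = random.Random(seed)
    attested = initial_pairs(words_a, words_b)
    expected = Counter()
    shuffled = list(words_b)
    for _ in range(runs):
        rng.shuffle(shuffled)  # break the meaning-based pairing
        expected.update(initial_pairs(words_a, shuffled))
    return {
        pair: math.log2((count + 0.5) / (expected[pair] / runs + 0.5))
        for pair, count in attested.items()
    }


# Word pairs taken from the German/English examples in this deck.
german = [["d", "ʏ", "n"], ["d", "ɪ", "ŋ"], ["d", "ɔɐ", "n"], ["d", "ʊ", "m"],
          ["ʦ", "ʊ", "ŋ", "ə"], ["ʦ", "aː", "n"], ["h", "ai", "s"], ["f", "uː", "s"]]
english = [["θ", "ɪ", "n"], ["θ", "ɪ", "ŋ"], ["θ", "ɔː", "n"], ["d", "ʌ", "m"],
           ["t", "ʌ", "ŋ"], ["t", "ʊː", "θ"], ["h", "ɔ", "t"], ["f", "ʊ", "t"]]

scores = permutation_scores(german, english)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A pair scores above zero when it is attested more often than chance pairing would predict. With only eight word pairs the estimates are unstable, which is the sample-size problem the talk investigates.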

Slide 24

Slide 24 text

Automatic Cognate Detection Language-Specific Approaches LexStat 14 / 30

Slide 25

Slide 25 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Sound Classes
Sounds which frequently occur in correspondence relations in genetically related languages can be divided into classes (types). It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).

14 / 30

Slide 26

Slide 26 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Sound Classes
Sounds which frequently occur in correspondence relations in genetically related languages can be divided into classes (types). It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).

Sounds: k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z

14 / 30

Slide 29

Slide 29 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Sound Classes
Sounds which frequently occur in correspondence relations in genetically related languages can be divided into classes (types). It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).

Classes: K T P S

14 / 30

Slide 30

Slide 30 text

Automatic Cognate Detection Language-Specific Approaches LexStat 15 / 30

Slide 31

Slide 31 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

15 / 30

Slide 32

Slide 32 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

Example: j a b ə l k a

15 / 30

Slide 33

Slide 33 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

Example: j a b ə l k a
(Figure: sonority profile of the example, with strong and weak positions marked.)

15 / 30

Slide 34

Slide 34 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

Example: j a b ə l k a
(Figure: sonority profile of the example, with positions marked as ascending (↑), maximum, or descending (↓).)

15 / 30

Slide 35

Slide 35 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

Example: j a b ə l k a
(Figure: sonority profile of the example; axis label: sonority increases.)

15 / 30

Slide 36

Slide 36 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Prosodic Strings
- Sound change occurs more frequently in weak positions of sound sequences (Geisler 1992).
- Based on a sonority profile of sound sequences, one can distinguish sound positions according to their prosodic contexts.
- Prosodic context can be modeled as a prosodic string in which different contexts are coded by different symbols.

Example:
  j  a  b  ə  l  k  a
  #  v  C  v  c  C  >

15 / 30

Slide 37

Slide 37 text

Automatic Cognate Detection Language-Specific Approaches LexStat 16 / 30

Slide 38

Slide 38 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

External Representation
  IPA                 j  a  b  ə  l  k  a

Internal Representation
  Sound-Class String  J  A  P  E  L  K  A
  Prosodic String     #  V  C  V  c  C  >

16 / 30
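A prosodic string of this kind can be derived from a sonority profile with a few rules. The sketch below is an assumption-laden illustration, not LingPy's implementation: the sonority scale, the small segment inventory, and the rule set are hypothetical, and the symbol `$` for a word-final consonant is read off the correspondence slide later in the deck rather than defined here. It uses `#` for the first segment, `V` for a vowel, `C` for a consonant followed by a more sonorous segment (ascending), `c` for a consonant followed by a less sonorous one (descending), and `>` for a word-final vowel.

```python
# Assumed sonority scale: stops < fricatives < nasals < liquids < glides < vowels.
VOWEL = 6
SONORITY = {}
for level, segments in enumerate(
    ["pbtdkg", "fvszθðʃʒh", "mnŋ", "lr", "jw", "aeiouəɪʊɛɔ"], start=1
):
    for segment in segments:
        SONORITY[segment] = level


def prosodic_string(segments):
    """Return one prosodic symbol per segment (simplified, hypothetical rules)."""
    symbols = []
    last = len(segments) - 1
    for i, segment in enumerate(segments):
        is_vowel = SONORITY[segment] == VOWEL
        if i == 0:
            symbols.append("#")                       # word-initial position
        elif i == last:
            symbols.append(">" if is_vowel else "$")  # word-final position
        elif is_vowel:
            symbols.append("V")                       # sonority peak
        elif SONORITY[segments[i + 1]] > SONORITY[segment]:
            symbols.append("C")                       # ascending sonority
        else:
            symbols.append("c")                       # descending sonority
    return symbols


word = ["j", "a", "b", "ə", "l", "k", "a"]
print(" ".join(prosodic_string(word)))  # # V C V c C >
```

For the example word the sketch yields the same string as the slide. Pairing each sound class with its prosodic symbol lets the same sound be scored differently in strong and weak positions.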

Slide 39

Slide 39 text

Automatic Cognate Detection Language-Specific Approaches LexStat 17 / 30

Slide 40

Slide 40 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Cognate list and alignment:
  German   Zunge   ʦ  ʊ   ŋ  ə
  English  tongue  t  ʌ   ŋ  -
  German   Zahn    ʦ  aː  n  -
  English  tooth   t  ʊː  -  θ
  German   heiß    h  ai  s
  English  hot     h  ɔ   t
  German   Fuß     f  uː  s
  English  foot    f  ʊ   t

Correspondence list:
  GER  ENG  Frequ.
  ʦ    t    2 x
  s    t    2 x
  h    h    1 x
  f    f    1 x
  n    -    1 x
  …    …    …

17 / 30

Slide 42

Slide 42 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Cognate list and alignment (sound classes):
  German   Zunge   C  U  N  E
  English  tongue  T  A  N  -
  German   Zahn    C  A  N  -
  English  tooth   T  U  -  T
  German   heiß    H  A  S
  English  hot     H  O  T
  German   Fuß     B  U  S
  English  foot    B  U  T

Correspondence list (sound class / prosodic context):
  GER  ENG  Frequ.
  C/#  T/#  2 x
  S/$  T/$  2 x
  H/$  H/#  1 x
  B/$  B/#  1 x
  N/c  -    1 x
  …    …    …

17 / 30

Slide 43

Slide 43 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Dataset of Kessler (2001): “to dig” (30)

  Language  Word        Transcription  Turchin  NED  LexStat
  Albanian  gërmon      gərmo          1        1    1
  English   digs        dɪg            2        2    2
  French    creuse      krøze          1        3    3
  German    gräbt       graːb          1        1    4
  Hawaiian  ‘eli        ʔeli           5        5    5
  Navajo    hahashgééd  hahageːd       6        6    6
  Turkish   kazıyor     kaz            7        3    7

18 / 30

Slide 44

Slide 44 text

Automatic Cognate Detection: Language-Specific Approaches: LexStat

Dataset of Kessler (2001): “mouth” (104)

  Language  Word    Transcription  Turchin  NED  LexStat
  Albanian  gojë    goj            1        1    1
  English   mouth   mauθ           2        2    2
  French    bouche  buʃ            3        3    3
  German    Mund    mund           4        4    2
  Hawaiian  waha    waha           5        5    5
  Navajo    ’azéé’  zeːʔ           6        6    6
  Turkish   ağız    aɣz            7        7    7

19 / 30

Slide 45

Slide 45 text

Testing the Impact of Sample Size on Cognate Detection 20 / 30

Slide 46

Slide 46 text

Testing the Impact of Sample Size on Cognate Detection: Materials: Gold Standard

IDS Test Set
- 4 languages (German, English, Dutch, French)
- 550 items (glosses)
- translations taken from the IDS (Key & Comrie 2009)
- orthographic entries converted into IPA transcriptions
- cognate judgments follow traditional literature

21 / 30

Slide 47

Slide 47 text

Testing the Impact of Sample Size on Cognate Detection: Materials: Subsets of Varying Sample Size

Creating the Subsets
Starting from the basic dataset, subsets of the data were created by randomly deleting 5, 10, 15, etc. items from the original dataset, and taking 5 different samples for each distinct number of deletions. This process yielded 550 datasets, covering the whole range of possible sample sizes between 5 and 550 in steps of 5.

22 / 30
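The subsetting procedure can be sketched as below. This is an assumed reading of the description: 110 sample sizes (5, 10, ..., 550) with 5 random samples each, giving 550 datasets. The function name, the fixed seed, and the use of integer item IDs are illustrative and not taken from the original study.

```python
import random


def make_subsets(items, step=5, samples_per_size=5, seed=0):
    """Draw several random subsets for every sample size from step to len(items)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = []
    for size in range(step, len(items) + 1, step):
        for _ in range(samples_per_size):
            # Sampling `size` items is equivalent to deleting the remaining ones.
            subsets.append(sorted(rng.sample(items, size)))
    return subsets


items = list(range(1, 551))  # stand-ins for the 550 glosses
subsets = make_subsets(items)
print(len(subsets))          # 550
print(len(subsets[0]), len(subsets[-1]))  # 5 550
```

Under this reading the five "samples" at the full size of 550 are identical, since nothing is deleted from them.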

Slide 48

Slide 48 text

Testing the Impact of Sample Size on Cognate Detection: Methods: Automatic Cognate Detection

Methods for Cognate Detection
- Normalized Edit Distance (NED)
- Turchin et al. (2010, Turchin)
- SCA Distance (List 2012b)
- LexStat (List 2012a)

Implementation
All methods are implemented as part of LingPy-1.0 (see http://lingpy.org), a Python library for quantitative tasks in historical linguistics.

23 / 30

Slide 49

Slide 49 text

Testing the Impact of Sample Size on Cognate Detection: Methods: Evaluation Measures

B-Cubed Precision and Recall (Amigó et al. 2009)
Given a test (result of an analysis) and a reference (the gold standard), precision is the proportion of items in the test that also occur in the reference, and recall is the proportion of items in the reference that also occur in the test. Low precision is equivalent to high rates of false positives, low recall is equivalent to high rates of false negatives (missed cognates).

24 / 30
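B-cubed scores are computed per word and then averaged: for each word, the cluster it was assigned to in the test is compared with its cluster in the reference. The following is a minimal sketch of that standard definition, assuming cluster assignments are given as parallel lists of cluster IDs. The function name is illustrative, and the reference labels in the example are an assumption (Mund and mouth as the only cognate pair), since the deck does not print the gold standard for this item.

```python
from collections import defaultdict


def bcubed(test, reference):
    """Return (precision, recall) for two parallel lists of cluster IDs."""
    def member_sets(labels):
        groups = defaultdict(set)
        for index, label in enumerate(labels):
            groups[label].add(index)
        # For every word, the set of words that share its cluster.
        return [groups[label] for label in labels]

    in_test = member_sets(test)
    in_ref = member_sets(reference)
    n = len(test)
    precision = sum(len(in_test[i] & in_ref[i]) / len(in_test[i]) for i in range(n)) / n
    recall = sum(len(in_test[i] & in_ref[i]) / len(in_ref[i]) for i in range(n)) / n
    return precision, recall


# "mouth" in seven languages: Albanian, English, French, German, Hawaiian, Navajo, Turkish.
turchin = [1, 2, 3, 4, 5, 6, 7]    # cluster IDs from the slide
reference = [1, 2, 3, 2, 5, 6, 7]  # assumed gold standard: English and German cognate

precision, recall = bcubed(turchin, reference)
print(round(precision, 3), round(recall, 3))  # 1.0 0.857
```

Here the Turchin result proposes no false cognates, so precision is perfect, but it misses the Mund/mouth pair, which lowers recall to 6/7.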

Slide 50

Slide 50 text

Results 25 / 30

Slide 51

Slide 51 text

Results

B-Cubed Recall

  Items  Turchin  NED    SCA    LexStat
  50     86.10    85.55  92.44  90.88
  100    86.55    85.77  92.20  93.89
  200    86.88    86.61  92.68  95.02
  300    87.13    86.64  92.90  95.05
  400    87.14    86.81  92.89  94.94
  500    87.07    86.77  92.75  94.90

26 / 30

Slide 52

Slide 52 text

Results

(Figure: four panels, one each for Turchin, NED, SCA, and LexStat; horizontal axis from 100 to 500, vertical axis from 84 to 96.)

27 / 30

Slide 53

Slide 53 text

Discussion 28 / 30

Slide 54

Slide 54 text

Discussion

Are 200 words enough?
Although the representativeness of the data is limited, and the number of languages investigated is small, the test shows that sample size has a definite impact on the results of language-specific methods, and using 200 words is surely better than using 100 words.

29 / 30

Slide 56

Slide 56 text

  Sanscruta  sarpá-  s  a  r  p   a
  Italian    serpe   s  ɛ  r  p   ə
  Sanscruta  devá-   d  e  v  a
  Italian    Dio     d  i  -  o
  Sanscruta  saptá-  s  a  p  t   a
  Italian    sette   s  ɛ  -  tː  ə

30 / 30

Slide 57

Slide 57 text

Thank you for your attention! 30 / 30