Slide 1

Slide 1 text

Automatic Inference of Sound Correspondence Patterns Across Multiple Languages

Johann-Mattis List
Research Group “Computer-Assisted Language Comparison”
Department of Linguistic and Cultural Evolution
Max-Planck Institute for the Science of Human History, Jena, Germany
2018-03-23

Slide 2

Slide 2 text

Comparative Linguistics 2 / 46

Slide 3

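Slide 3 text

Comparative Linguistics

"All languages change, as long as they exist." (August Schleicher 1863)

[Figure: sound change across the stages Indo-European, Germanic, Old English and English, one segment per row: p f f f; ə a æ ɑː; t d d ð; eː eː e ə; r r r r. German and English are shown as Germanic languages; "walkman" and "iPod" appear as illustrative labels.]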

Slide 12

Slide 12 text

Comparative Linguistics Background Background 3 / 46

Slide 17

Slide 17 text

The Comparative Method

[Figure label: COMPARATIVE METHOD]

Slide 22

Slide 22 text

Computational Historical Linguistics

[Figure label: COMPUTATIONAL HISTORICAL LINGUISTICS]

Slide 27

Slide 27 text

Classical vs. Computational Language Comparison

COMPARATIVE METHOD: high flexibility, high accuracy; lacks efficiency, lacks consistency
COMPUTATIONAL HISTORICAL LINGUISTICS: high efficiency, high consistency; lacks accuracy, lacks flexibility

Slide 29

Slide 29 text

Classical vs. Computational Language Comparison

[Figure: the COMPARATIVE METHOD and COMPUTATIONAL HISTORICAL LINGUISTICS compared along four dimensions: accuracy, flexibility, consistency, efficiency. Labels in the figure: lacks efficiency, lacks consistency, lacks accuracy, lacks flexibility, high efficiency, high consistency, high flexibility, high accuracy.]

Slide 30

Slide 30 text

Computer-Assisted Language Comparison

[Figure: the COMPARATIVE METHOD and COMPUTATIONAL HISTORICAL LINGUISTICS with their strengths and weaknesses along the dimensions accuracy, flexibility, consistency and efficiency, shown together with the CALC logo.]

Slide 31

Slide 31 text

Comparative Linguistics CALC Computer-Assisted Language Comparison LC CA 7 / 46

Slide 32

Slide 32 text

Historical Language Comparison 8 / 46

Slide 36

Slide 36 text

Alphabets in Biology and Linguistics

Biology        Linguistics
• universal    • language-specific
• limited      • widely varying
• constant     • mutable

Slide 37

Slide 37 text

Inferring Correspondences

Sound   Bola   Maru   Rangoon
d       Ø      Ø      d
t       t      t      t
tʰ      tʰ     tʰ     tʰ
tθ      Ø      Ø      tθ
ts      ts     ts     Ø
tsʰ     tsʰ    tsʰ    Ø
tʃ      tʃ     tʃ     Ø
tʃʰ     tʃʰ    tʃʰ    Ø
s       s      s      s
sʰ      Ø      Ø      sʰ
ɕ       Ø      Ø      ɕ
ʃ       ʃ      ʃ      Ø

Slide 45

Slide 45 text

Inferring Homologs

Cognate list and alignment:

Bola   six   kʰ j a u ʔ ⁵⁵
Maru   six   kʰ j a u k ⁵⁵
Bola   lip   k a̰ u ʔ ⁵⁵
Maru   lip   k a̰ u k ⁵⁵
Bola   man   j a u ʔ ³¹
Maru   man   j a u k ³¹

Correspondences:

Bola    Maru    Freq.
a(a̰)    a(a̰)    3 x
u       u       3 x
ʔ       k       3 x
j       j       2 x
k(ʰ)    k(ʰ)    2 x
⁵⁵      ⁵⁵      2 x
³¹      ³¹      1 x
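The frequency counts for the Bola and Maru words can be reproduced with a few lines of Python. This is a minimal sketch and not the LingPy implementation: the names `ALIGNED_PAIRS` and `count_correspondences` are assumptions introduced here, and unlike the slide the sketch keeps a / a̰ and k / kʰ apart instead of merging them.

```python
from collections import Counter

# Aligned Bola/Maru word pairs from the slide; segments are space-separated.
ALIGNED_PAIRS = {
    "six": ("kʰ j a u ʔ ⁵⁵", "kʰ j a u k ⁵⁵"),
    "lip": ("k a̰ u ʔ ⁵⁵", "k a̰ u k ⁵⁵"),
    "man": ("j a u ʔ ³¹", "j a u k ³¹"),
}


def count_correspondences(pairs):
    """Count how often each (Bola, Maru) segment pair shares a column."""
    counts = Counter()
    for bola, maru in pairs.values():
        bola_segments, maru_segments = bola.split(), maru.split()
        if len(bola_segments) != len(maru_segments):
            raise ValueError("aligned words must have the same length")
        counts.update(zip(bola_segments, maru_segments))
    return counts


correspondences = count_correspondences(ALIGNED_PAIRS)
for (bola_segment, maru_segment), frequency in correspondences.most_common():
    print(f"{bola_segment}\t{maru_segment}\t{frequency} x")
```

Because each pair is already aligned, counting reduces to zipping the two segment lists column by column.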

Slide 47

Slide 47 text

Inferring Patterns

Correspondence pattern highlighted in the sound table (Bola : Maru : Rangoon): tʰ : tsʰ : sʰ

"salt"
Bola      tʰ   a  ³⁵
Maru      tsʰ  ɔ  ³⁵
Rangoon   sʰ   ɑ  ⁵⁵

Slide 49

Slide 49 text

Inferring Patterns

Correspondence pattern highlighted in the sound table (Bola : Maru : Rangoon): t : ts : tθ

"tooth"
Bola      t   u  i  ⁵⁵
Maru      ts  ɔ  i  ³¹
Rangoon   tθ  w  a  ⁵⁵

Slide 51

Slide 51 text

Inferring Patterns

Correspondence pattern highlighted in the sound table (Bola : Maru : Rangoon): tʰ : tʰ : tʰ

"sharp"
Bola      tʰ  a  ʔ  ⁵⁵
Maru      tʰ  ɔ  ʔ  ⁵⁵
Rangoon   tʰ  ɛ  ʔ  ⁴

Slide 53

Slide 53 text

Inferring Patterns

Correspondence pattern highlighted in the sound table (Bola : Maru : Rangoon): t : t : t

"wing"
Bola      t  a  u  ŋ  ⁵⁵
Maru      t  a  u  ŋ  ³¹
Rangoon   t  ɑ  u  ∼  ²²

Slide 54

Slide 54 text

Inferring Correspondence Patterns 13 / 46

Slide 55

Slide 55 text

Sound Correspondences

Gloss      Proto-Germanic        German         English        Dutch
‘dead’     *daudaz  daudaz       tot    toːt    dead    dɛd    dood   doːt
‘deed’     *dēdiz   deːdiz       Tat    taːt    deed    diːd   daad   daːt
‘thick’    *þekuz   θekuz        dick   dɪk     thick   θɪk    dik    dɪk
‘thorn’    *þurnuz  θurnuz       Dorn   dɔrn    thorn   θɔːn   doorn  doːrn
‘tongue’   *tungōn  tuŋgoːn      Zunge  tsʊŋə   tongue  tʌŋ    tong   tɔŋ
‘tooth’    *tanþs   tanθs        Zahn   tsaːn   tooth   tuːθ   tand   tɑnt

Slide 56

Slide 56 text

Sound Correspondence Patterns (Clackson 2007: 37)

Columns: PIE, Hittite, Sanskrit, Avestan, Greek, Latin, Gothic, Old Church Slavonic, Lithuanian, Old Irish, Armenian, Tocharian. Reflexes are listed in column order; a language with several reflexes contributes several entries.

*p    p p p f p p f b p p Ø h w Ø p
*b    b p b bβ b b p b b b p p
*bʰ   b p bʱ/bh bβ pʰ/ph f b b b b b b p
*t    t t t θ t t θ/þ d t t t tʼ j/y t tʃ/c
*d    d t d d ð d d t d d d t ts ʃ/ś
*dʰ   d t dʰ/dh h d ð tʰ/th f d b d d d d t t tʃ/c
...
*kʷ   kʷ/ku k c k c k p t kʷ/qu hʷ/hw g k tʃ/č k c kʼ tʃʼ/čʼ k ʃʲ/ś
*gʷ   kʷ/u g j g j g b d gʷ/gu u q g ʒ/ž z g b k k ś
*gʷʰ  kʷ/ku gʷ/gu gʱ/gh h g j pʰ/ph tʰ/th kʰ/kh f gʷ/gu u g b g ʒ/ž z g g g dʒ/ǰ k ʃʲ/ś

Slide 57

Slide 57 text

Sound Correspondence Patterns and Alignments

'dead'
Proto-Germanic   d  au  d  ( a z )
German           t  oː  t  ( - - )
English          d  ɛ   d  ( - - )
Dutch            d  oː  t  ( - - )

'deed'
Proto-Germanic   d  eː  d  ( i z )
German           t  aː  t  ( - - )
English          d  iː  d  ( - - )
Dutch            d  aː  t  ( - - )

'thick'
Proto-Germanic   θ  e  k  ( u z )
German           d  ɪ  k  ( - - )
English          θ  ɪ  k  ( - - )
Dutch            d  ɪ  k  ( - - )

'thorn'
Proto-Germanic   θ  u   r  n  ( u z )
German           d  ɔ   r  n  ( - - )
English          θ  ɔː  -  n  ( - - )
Dutch            d  oː  r  n  ( - - )

'tongue'
Proto-Germanic   t   u  ŋ  ( g oː )
German           ts  ʊ  ŋ  ( - ə )
English          t   ʌ  ŋ  ( - - )
Dutch            t   ɔ  ŋ  ( - - )

'tooth'
Proto-Germanic   t   a   n  θ  ( s )
German           ts  aː  n  -  ( - )
English          t   uː  -  θ  ( - )
Dutch            t   ɑ   n  t  ( - )

Slide 58

Slide 58 text

Sound Correspondence Patterns and Alignments (adapted from Anttila 1972)

Alignment site labels in the figure: A B C D E F

Language   'yoke'      'daughter'        'daughter-in-law'   'red'
Sanskrit   y u g a m   dh u h i (tar)    s n u ṣ (ā)         - r u dh (iras)
Greek      z u g o n   th u g a (ter-)   - n u - (os)        e r u th (rós)
Latin      i u g u m   Ø Ø Ø Ø (Ø)       - n u r (us)        - r u b (er)
Gothic     j u k - -   d au h - (tar)    Ø Ø Ø Ø (Ø)         Ø Ø Ø Ø (Ø)

Slide 62

Slide 62 text

Summary on Sound Correspondence Patterns

• correspondence patterns in linguistics are a way to encode mappings across several different alphabets
• they are usually inferred manually, by inspecting “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate sets with recurring sounds)
• the main problem of correspondence pattern identification is the handling of missing data, since not all cognate sets will necessarily contain reflexes from each of the languages under investigation

Slide 63

Slide 63 text

Summary on Sound Correspondence Patterns

'thorn'
Proto-Germanic   θ  u   r  n  ( u z )
German           d  ɔ   r  n  ( - - )
English          θ  ɔː  -  n  ( - - )
Dutch            d  oː  r  n  ( - - )

'thick'
Proto-Germanic   θ  e  k  ( u z )
German           d  ɪ  k  ( - - )
English          θ  ɪ  k  ( - - )
Dutch            d  ɪ  k  ( - - )

'thorp'
Proto-Germanic   θ  u  r  p  ( a )
German           d  ɔ  r  f  ( - )
English          Ø  Ø  Ø  Ø  ( Ø )
Dutch            d  ɔ  r  p  ( - )

Each column of an alignment is an alignment site. The first site of 'thorn' and 'thick' yields the sound correspondence pattern θ d θ d (Proto-Germanic, German, English, Dutch); in 'thorp' the English reflex is missing (Ø).
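Turning alignments into alignment sites can be sketched as follows. The function name `alignment_sites` and the dictionary layout are assumptions made for illustration; the bracketed word endings are left out, and the 'thorp' rows follow the reading that the German and Dutch reflexes are attested while the English one is missing.

```python
LANGUAGES = ["Proto-Germanic", "German", "English", "Dutch"]
MISSING = "Ø"

# Alignments from the slide, without the bracketed endings.
ALIGNMENTS = {
    "thorn": {
        "Proto-Germanic": "θ u r n",
        "German": "d ɔ r n",
        "English": "θ ɔː - n",
        "Dutch": "d oː r n",
    },
    "thick": {
        "Proto-Germanic": "θ e k",
        "German": "d ɪ k",
        "English": "θ ɪ k",
        "Dutch": "d ɪ k",
    },
    # No English reflex in this cognate set (assumed reading of the figure).
    "thorp": {
        "Proto-Germanic": "θ u r p",
        "German": "d ɔ r f",
        "Dutch": "d ɔ r p",
    },
}


def alignment_sites(alignment, languages=LANGUAGES):
    """Return one tuple per alignment column, ordered by `languages`.

    Languages without a reflex receive the missing-data symbol Ø.
    """
    rows = {language: word.split() for language, word in alignment.items()}
    lengths = {len(segments) for segments in rows.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned words must have the same length")
    length = lengths.pop()
    return [
        tuple(rows[lang][i] if lang in rows else MISSING for lang in languages)
        for i in range(length)
    ]


for gloss, alignment in ALIGNMENTS.items():
    print(gloss, alignment_sites(alignment)[0])
```

The first site of each of the three cognate sets is what later gets compared for compatibility.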

Slide 64

Slide 64 text

Compatibility of Alignment Sites

           A / E      A / F      E / F      A / C       C / E      C / F
Sanskrit   u <=> u    u <=> u    u <=> u    u <=> u     u <=> u    u <=> u
Greek      u <=> u    u <=> u    u <=> u    u <=> u     u <=> u    u <=> u
Latin      u <=> u    u <=> u    u <=> u    u ? Ø       Ø ? u      Ø ? u
Gothic     u ? Ø      u ? Ø      Ø ? Ø      u >=< au    au ? Ø     au ? Ø
Matches    3          3          3          2           2          2

<=> match, >=< conflict, ? comparison involving missing data (Ø)

Slide 65

Slide 65 text

Compatibility of Alignment Sites

Two alignment sites are assumed to be compatible if they (a) share at least one sound and (b) do not have any conflicting sounds. Missing data (Ø) counts neither as a shared sound nor as a conflict.
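The compatibility criterion can be written down directly. The following is a sketch under the assumption that a site is a tuple with one sound per language and Ø for missing data; `compare_sites` is a name chosen here and is not taken from LingPy.

```python
MISSING = "Ø"


def compare_sites(site_a, site_b, missing=MISSING):
    """Return (compatible, matches) for two alignment sites.

    Positions where either site has missing data are skipped. The sites
    are compatible if at least one sound matches and none conflicts.
    """
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same languages")
    matches = conflicts = 0
    for sound_a, sound_b in zip(site_a, site_b):
        if missing in (sound_a, sound_b):
            continue
        if sound_a == sound_b:
            matches += 1
        else:
            conflicts += 1
    return matches >= 1 and conflicts == 0, matches


# Sites A, C and E (Sanskrit, Greek, Latin, Gothic) from the talk.
SITE_A = ("u", "u", "u", "u")
SITE_C = ("u", "u", "Ø", "au")
SITE_E = ("u", "u", "u", "Ø")

print(compare_sites(SITE_A, SITE_E))
print(compare_sites(SITE_A, SITE_C))
print(compare_sites(SITE_C, SITE_E))
```

Sites A and C share two sounds but conflict in Gothic (u against au), so they are incompatible even though their match count is not zero.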

Slide 67

Slide 67 text

Compatibility of Alignment Sites

Cognate Set   L1  L2  L3  L4  L5  L6  L7  L8
“hand-1”      p   p   p   Ø   f   f   Ø   p
“foot-1”      p   p   p   p   f   f   p   p
⊠ compatible  □ incompatible

Cognate Set   L1  L2  L3  L4  L5  L6  L7  L8
“hand-1”      p   p   p   Ø   f   f   Ø   p
“leg-1”       p   p   f   pf  f   f   p   p
□ compatible  ⊠ incompatible

Slide 68

Slide 68 text

Alignment Site Networks

Constructing an alignment site network from a set of alignments:
• each site is represented by a node in the network
• edges are drawn between compatible sites
• edges can (in principle) also be weighted by the number of matching sounds (but weights are disregarded in our algorithm so far)
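The construction of the network can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the plugin's code: `build_site_network` and the adjacency-set representation are choices made here, edges are left unweighted as in the description, and the four sites are the u-sites A, C, E, F used elsewhere in the talk.

```python
from itertools import combinations

MISSING = "Ø"


def compatible(site_a, site_b):
    """At least one shared sound and no conflicting sounds (Ø is ignored)."""
    pairs = [(a, b) for a, b in zip(site_a, site_b) if MISSING not in (a, b)]
    return bool(pairs) and all(a == b for a, b in pairs)


def build_site_network(sites):
    """Map every site label to the set of labels it is compatible with."""
    network = {label: set() for label in sites}
    for (label_a, site_a), (label_b, site_b) in combinations(sites.items(), 2):
        if compatible(site_a, site_b):
            network[label_a].add(label_b)
            network[label_b].add(label_a)
    return network


# Sanskrit, Greek, Latin, Gothic
SITES = {
    "A": ("u", "u", "u", "u"),
    "C": ("u", "u", "Ø", "au"),
    "E": ("u", "u", "u", "Ø"),
    "F": ("u", "u", "u", "Ø"),
}

network = build_site_network(SITES)
for label in sorted(network):
    print(label, sorted(network[label]))
```

Only the pair A and C stays unconnected, because of the Gothic conflict between u and au.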

Slide 69

Slide 69 text

Alignment Site Networks

Site   Sanskrit   Greek   Latin   Gothic
A      u          u       u       u
C      u          u       Ø       au
E      u          u       u       Ø
F      u          u       u       Ø

[Figure: network with the four alignment sites A, C, E and F as nodes.]

Slide 74

Slide 74 text

Correspondence Pattern Inference as Clique Cover Problem

• The clique cover problem (also called clique partitioning problem, see Bhasker 1991) is the inverse of the famous graph coloring problem and has been shown to be NP-hard.
• The goal is to split a graph into the smallest number of cliques such that each node belongs to exactly one clique.
• We assume (but cannot formally prove) that the clique cover of our graph of compatible correspondence sets will come close to the best set of sound correspondence patterns in our data.
• Partitioning our alignment site network into cliques does not solve the problem of linguistic reconstruction, but it can be seen as its fundamental prerequisite.
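The relation between clique cover and graph coloring can be made concrete: a partition of a graph into cliques is a proper coloring of the complement graph. The sketch below only checks a given partition and does not search for the smallest one; all function names are assumptions, and the small network is the one formed by the compatible u-sites A, C, E, F.

```python
from itertools import combinations

# Adjacency sets of the alignment site network for sites A, C, E, F.
NETWORK = {
    "A": {"E", "F"},
    "C": {"E", "F"},
    "E": {"A", "C", "F"},
    "F": {"A", "C", "E"},
}


def is_clique_cover(network, partition):
    """Every node is in exactly one part and each part is a clique."""
    covered = sorted(node for part in partition for node in part)
    if covered != sorted(network):
        return False
    return all(
        b in network[a]
        for part in partition
        for a, b in combinations(sorted(part), 2)
    )


def complement(network):
    """Connect exactly those distinct nodes that are not connected."""
    nodes = set(network)
    return {node: nodes - network[node] - {node} for node in network}


def is_proper_coloring(graph, color_classes):
    """No two nodes of the same colour class are adjacent."""
    return all(
        b not in graph[a]
        for color_class in color_classes
        for a, b in combinations(sorted(color_class), 2)
    )


good = [{"A", "E", "F"}, {"C"}]
bad = [{"A", "C"}, {"E", "F"}]
print(is_clique_cover(NETWORK, good), is_proper_coloring(complement(NETWORK), good))
print(is_clique_cover(NETWORK, bad), is_proper_coloring(complement(NETWORK), bad))
```

Both checks agree on every partition, which is why heuristics for graph coloring can be run "in reverse" to obtain a clique cover.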

Slide 75

Slide 75 text

General Workflow

[Figure: workflow with the stages word list, cognates, alignments, site network, clique coverage and fuzzy sites; the steps are labelled (A) to (E).]

Slide 76

Slide 76 text

Implementation, Input, Output

• A full implementation is provided as a plugin for LingPy (http://lingpy.org).
• An approximate version is shipped with the EDICTOR (http://edictor.digling.org).
• The input follows the format employed by LingPy, with some additional required columns.
• The output consists of annotated word lists with assigned patterns for each word form, which can be read in and inspected with the help of the EDICTOR, or of “normal” text files.

Slide 77

Slide 77 text

Implementation, Input, Output

ID   DOCULECT   CONCEPT   FORM     TOKENS      STRUCTURE   COGID   ALIGNMENT
1    German     tongue    Zunge    ts ʊ ŋ ə    c v c       1       ts ʊ ŋ ( ə )
2    English    tongue    tongue   t ʌ ŋ       c v c       1       t ʌ ŋ ( - )
3    Dutch      tongue    tong     t ɔ ŋ       c v c       1       t ɔ ŋ ( - )
4    German     tooth     Zahn     ts aː n     c v c       2       ts aː n -
5    English    tooth     tooth    t uː θ      c v c       2       t uː - θ
6    Dutch      tooth     tand     t ɑ n t     c v c       2       t ɑ n t
7    German     thick     dick     d ɪ k       c v c       3       d ɪ k
...  ...        ...       ...      ...         ...         ...     ...
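A word list in this tabular layout can be read with the Python standard library alone. The sketch below deliberately avoids LingPy's own reader; `read_wordlist` and `alignments_by_cogid` are hypothetical helper names, and the tab-separated text is built in memory from the first six rows as a stand-in for a file on disk.

```python
import csv
import io
from collections import defaultdict

HEADER = ["ID", "DOCULECT", "CONCEPT", "FORM", "TOKENS", "STRUCTURE",
          "COGID", "ALIGNMENT"]
ROWS = [
    ["1", "German", "tongue", "Zunge", "ts ʊ ŋ ə", "c v c", "1", "ts ʊ ŋ ( ə )"],
    ["2", "English", "tongue", "tongue", "t ʌ ŋ", "c v c", "1", "t ʌ ŋ ( - )"],
    ["3", "Dutch", "tongue", "tong", "t ɔ ŋ", "c v c", "1", "t ɔ ŋ ( - )"],
    ["4", "German", "tooth", "Zahn", "ts aː n", "c v c", "2", "ts aː n -"],
    ["5", "English", "tooth", "tooth", "t uː θ", "c v c", "2", "t uː - θ"],
    ["6", "Dutch", "tooth", "tand", "t ɑ n t", "c v c", "2", "t ɑ n t"],
]
# Stand-in for a tab-separated word list file.
WORDLIST_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def read_wordlist(handle):
    """Read a tab-separated word list into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def alignments_by_cogid(rows):
    """Group the aligned segments of each word by cognate set and language."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["COGID"]][row["DOCULECT"]] = row["ALIGNMENT"].split()
    return dict(grouped)


rows = read_wordlist(io.StringIO(WORDLIST_TSV))
cognate_sets = alignments_by_cogid(rows)
for cogid, alignment in cognate_sets.items():
    print(cogid, alignment)
```

Grouping by COGID gives one alignment per cognate set, which is the unit from which alignment sites are taken.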

Slide 78

Slide 78 text

Correspondence Pattern Recognition

A three-step algorithm:

(A) sort the alignment sites with a customized variant of the Quicksort algorithm (Hoare 1962), which groups compatible alignment sites closely together (the sole algorithm used by EDICTOR, due to memory restrictions of JavaScript)
(B) apply an inverse version of the Welsh-Powell algorithm (Welsh and Powell 1967) for graph coloring:
    (a) sort all partitions according to size and alignment site density
    (b) pick the first partition and compare it against all other partitions; merge it with compatible partitions and put incompatible partitions into a queue
    (c) finish if no more partitions are in the queue
(C) compare all alignment sites again with the inferred correspondence patterns and assign each alignment site to all patterns with which it is compatible, to yield a fuzzy clustering
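Steps (B) and (C) can be illustrated with a simplified greedy procedure. This is a sketch written for this summary, not the published implementation: it skips the Quicksort pre-sorting of step (A), represents each partition by a consensus site, and all function names are assumptions.

```python
MISSING = "Ø"


def compatible(site_a, site_b):
    """At least one shared sound and no conflicting sounds (Ø is ignored)."""
    pairs = [(a, b) for a, b in zip(site_a, site_b) if MISSING not in (a, b)]
    return bool(pairs) and all(a == b for a, b in pairs)


def merge(site_a, site_b):
    """Fill the missing positions of one site from a compatible site."""
    return tuple(b if a == MISSING else a for a, b in zip(site_a, site_b))


def density(site):
    """Share of languages with a reflex in this site."""
    return sum(sound != MISSING for sound in site) / len(site)


def infer_patterns(sites):
    """Greedily partition sites into patterns (consensus, member indices)."""
    queue = [(site, [index]) for index, site in enumerate(sites)]
    patterns = []
    while queue:
        # (a) largest and densest partitions first
        queue.sort(key=lambda part: (len(part[1]), density(part[0])),
                   reverse=True)
        consensus, members = queue.pop(0)
        postponed = []
        # (b) merge compatible partitions, postpone incompatible ones
        for other, other_members in queue:
            if compatible(consensus, other):
                consensus = merge(consensus, other)
                members = members + other_members
            else:
                postponed.append((other, other_members))
        patterns.append((consensus, sorted(members)))
        queue = postponed  # (c) the loop ends once the queue is empty
    return patterns


def fuzzy_assignment(sites, patterns):
    """Step (C): list all patterns with which each site is compatible."""
    return [
        [number for number, (consensus, _) in enumerate(patterns)
         if compatible(site, consensus)]
        for site in sites
    ]


SITES = [
    ("p", "p", "p", "Ø", "f", "f", "Ø", "p"),   # "hand-1"
    ("p", "p", "p", "p", "f", "f", "p", "p"),   # "foot-1"
    ("p", "p", "f", "pf", "f", "f", "p", "p"),  # "leg-1"
]
patterns = infer_patterns(SITES)
print(patterns)
print(fuzzy_assignment(SITES, patterns))
```

With the three example sites, "hand-1" and "foot-1" end up in one pattern and "leg-1" forms a singleton pattern of its own.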

Slide 79

Slide 79 text

Correspondence Pattern Recognition

Alignment sites:

      L₁   L₂   L₃   L₄
S₁    k    Ø    Ø    k
S₂    k    g    Ø    k
S₃    Ø    g    g    k

Presence (1) or absence (0) of a reflex:

      L₁   L₂   L₃   L₄   Sum
S₁    1    0    0    1    2
S₂    1    1    0    1    3
S₃    0    1    1    1    3

Calculating alignment site density: (2 + 3 + 3) / (4 · 3) = 8 / 12 ≈ 0.67
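The density calculation amounts to dividing the number of filled cells by the number of all cells. A minimal sketch, assuming sites are tuples with Ø for missing data; `site_density` is a name chosen here.

```python
MISSING = "Ø"


def site_density(sites, missing=MISSING):
    """Filled cells divided by all cells (languages times sites)."""
    cells = [sound for site in sites for sound in site]
    return sum(sound != missing for sound in cells) / len(cells)


SITES = [
    ("k", "Ø", "Ø", "k"),  # S₁
    ("k", "g", "Ø", "k"),  # S₂
    ("Ø", "g", "g", "k"),  # S₃
]
print(round(site_density(SITES), 2))
```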

Slide 80

Slide 80 text

Test and Training Data

Dataset        Source                    Languages   Concepts   Cognates   Density
Austronesian   Greenhill et al. (2008)   20          210        2864       0.34
Bai            Wang (2006)               9           110        285        0.73
Chinese        Hóu (2004)                15          140        1189       0.60
IndoEuropean   Dunn (2012)               20          207        1777       0.60
Japanese       Hattori (1973)            10          200        460        0.70
ObUgrian       Zhivlov (2011)            21          110        242        0.88
Bahnaric       Sidwell (2015)            24          200        1055       0.76
Chinese        Běijīng Dàxué (1964)      18          180        1231       0.68
Huon           McElhanon (1967)          14          139        855        0.48
Romance        Saenko (2015)             43          110        465        0.90
Tujia          Starostin (2013)          5           109        179        0.63
Uralic         Syrjänen et al. (2013)    7           173        870        0.39

Test and training data (List et al. 2017)

Slide 81

Slide 81 text

Test and Training Data

Calculating the cognate density for a given wordlist:

$$D = 1 - \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{\mathrm{cognates}(w_{ij})} \qquad (1)$$
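The formula can be turned into code once its symbols are given a reading. The slide does not define them, so the following interpretation is an assumption: m is the number of concepts, n_i the number of words for concept i, and cognates(w_ij) the size of the cognate set to which the j-th word of concept i belongs. The function name and the input layout are likewise chosen for illustration.

```python
from collections import Counter


def cognate_density(wordlist):
    """Compute D for a mapping from concept to a list of cognate set IDs.

    Each list holds one cognate set ID per word. Concepts without words
    are skipped (assumption).
    """
    sizes = Counter(cogid for cogids in wordlist.values() for cogid in cogids)
    per_concept = [
        sum(1 / sizes[cogid] for cogid in cogids) / len(cogids)
        for cogids in wordlist.values()
        if cogids
    ]
    return 1 - sum(per_concept) / len(per_concept)


# Toy word list: three cognate words for "tongue"; for "tooth", two
# cognate words and one isolated word.
TOY_WORDLIST = {"tongue": [1, 1, 1], "tooth": [2, 2, 3]}
print(cognate_density(TOY_WORDLIST))
```

Under this reading a word list made up only of isolated words has a density of 0, and the value rises towards 1 as cognate sets grow larger.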

Slide 82

Slide 82 text

General Characteristics

            All Patterns              Consonants                Vowels
Dataset     St.    Pt.   Sg.   Fz.    St.    Pt.   Sg.   Fz.    St.    Pt.   Sg.   Fz.
Bahnaric    2659   865   385   4.59   1651   480   222   4.53   1008   382   167   4.52
Chinese     3205   584   191   5.78   1118   207   79    3.81   1308   298   108   7.28
Huon        1572   271   104   4.07   873    154   58    2.98   699    115   40    5.42
Romance     1656   874   587   3.67   940    496   345   3.51   716    379   250   3.85
Tujia       952    272   130   2.66   323    118   62    1.71   347    84    41    2.71
Uralic      1346   326   131   3.35   763    180   74    2.75   583    141   45    4.16

St.: alignment sites, Pt.: correspondence patterns, Sg.: singleton patterns, Fz.: fuzziness of alignment sites

Slide 83

Slide 83 text

General Characteristics

Short intermediate summary:
• the method seems to be successful in reducing patterns
• the number of singleton patterns is still surprising
• vowels seem to be “fuzzier” than consonants (not against our expectation)
• automatic cognates may have caused problems

Slide 84

Slide 84 text

General Characteristics

Dataset     Sites   Patterns   Singletons   Fuzziness   Gappy   Non-Gappy
Bahnaric    2006    516        201          4.85        0.39    0.47
Chinese     2906    475        139          5.88        0.29    0.34
Huon        1478    213        74           3.88        0.35    0.41
Romance     1174    476        270          4.70        0.57    0.68
Tujia       820     219        110          2.75        0.50    0.51
Uralic      1168    251        94           3.46        0.37    0.41

Non-gappy sites were extracted by taking those sites of the alignments in which the consensus segment was not a gap.

Slide 85

Slide 85 text

Specific Characteristics

Reasons for singleton patterns:
1. errors in data (wrong transcriptions, etc.)
2. errors in cognate judgments (lookalikes, wishful thinking, too optimistic)
3. errors in alignments (partial cognates, unalignable parts, etc.)
4. irregular sound change (assimilation, metathesis, etc.)
5. analogy (word families, paradigms, etc.)
6. missing data that increases ambiguity

Slide 86

Slide 86 text

Specific Characteristics

Seeding artificial borrowings or wrong cognates as a method for testing:
• following Dessimoz et al. (2008), who work in a biological framework, randomly select pairs of languages and have them interchange words
• building on Dessimoz et al. (2008), create neologisms with LingPy’s built-in word generator (based on Markov chains) to replace existing words in a given cognate set with a new (possible) word from the same language
• investigate the pattern regularity (PR) of the cognate sets in the data before and after the operations by
    (a) setting a user-defined threshold for the regularity of an alignment site, derived from the density of its pattern (smoothing of singletons: they are always irregular)
    (b) accepting a cognate set as regular if half of its alignment sites are regular
    (c) splitting irregular cognate sets up into independent cognate sets
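The pattern regularity check in (a) to (c) can be sketched as three small functions. How exactly the regularity of a site is derived from the density of its pattern is not spelled out on the slide; comparing the density directly against the threshold is an assumption, "half" is read as "at least half", and all names are hypothetical.

```python
def site_is_regular(pattern_size, pattern_density, threshold):
    """(a) A site is regular if its pattern is dense enough.

    Sites whose pattern is a singleton are always irregular.
    """
    if pattern_size < 2:
        return False
    return pattern_density >= threshold


def cognate_set_is_regular(regular_flags):
    """(b) Regular if at least half of the alignment sites are regular."""
    return 2 * sum(regular_flags) >= len(regular_flags)


def split_irregular(word_ids, first_new_cogid):
    """(c) Give every word of an irregular set its own cognate set ID."""
    return {
        word_id: first_new_cogid + offset
        for offset, word_id in enumerate(word_ids)
    }


# One cognate set with three sites: (pattern size, pattern density).
site_patterns = [(5, 0.8), (1, 0.9), (3, 0.1)]
flags = [site_is_regular(size, dens, threshold=0.25)
         for size, dens in site_patterns]
print(flags, cognate_set_is_regular(flags))
if not cognate_set_is_regular(flags):
    print(split_irregular([4, 5, 6], first_new_cogid=100))
```

In the example only one of three sites is regular, so the cognate set is rejected and its three words are assigned separate cognate set IDs.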

Slide 87

Slide 87 text

Specific Characteristics: Fake Borrowings

            Unmodified            Modified
Dataset     Orig. Ds.   PR Ds.    Orig. Ds.   PR Ds.    Diff.    Lg.   Ev.
Bahnaric    0.76        0.51      0.76        0.45      0.06     4     400
Chinese     0.68        0.55      0.68        0.50      0.05     3     270
Huon        0.48        0.19      0.48        0.22      -0.03    2     139
Romance     0.90        0.38      0.90        0.24      0.14     8     440
Tujia       0.63        0.59      0.63        0.55      0.04     1     54
Uralic      0.39        0.38      0.39        0.37      0.01     1     86

Orig. Ds.: original density, PR Ds.: density after applying PR check, Lg.: number of languages selected, Diff.: difference between original and modified density, Ev.: borrowing events

Slide 88

Slide 88 text

Specific Characteristics: Fake Neologisms

            Unmodified            Modified
Dataset     Orig. Ds.   PR Ds.    Orig. Ds.   PR Ds.    Diff.    Lg.   Ev.
Bahnaric    0.76        0.51      0.77        0.48      0.03     288   400
Chinese     0.68        0.55      0.69        0.47      0.09     162   270
Huon        0.48        0.19      0.50        0.17      0.01     98    139
Romance     0.90        0.38      0.91        0.32      0.07     924   440
Tujia       0.63        0.59      0.64        0.58      0.02     12    54
Uralic      0.39        0.38      0.41        0.40      -0.02    24    86

Orig. Ds.: original density, PR Ds.: density after applying PR check, Lg.: number of language pairs (donor-recipient), Diff.: difference between original and modified density, Ev.: borrowing events

Slide 89

Slide 89 text

Specific Characteristics: Summary

• the fake borrowings lead, as expected, to a decrease in cognate density
• the fake neologisms also lead, as expected, to a decrease in cognate density
• even pulling out those correspondence patterns which are singletons, or marking the cognate sets which have a low density, seems like a valuable enterprise, as it can help linguists to have another look at their data and check the findings manually

Slide 90

Slide 90 text

Examples: Burmish Graph with Clique Cover (with N. Hill)

[Figure: alignment site graph for the Burmish data; nodes are labelled with sounds such as tʃʰ, tsʰ, tʰ, pʰ, kʰ, p, t, k, s, ʃ, m, n, ŋ, l, j.]

Slide 91

Slide 91 text

Examples: Burmish Graph with Clique Cover (with N. Hill)

[Figure: second view of the Burmish graph, with node labels grouped by sound, e.g. tʰ, pʰ, ŋ, tsʰ, tʃʰ, ʃ, tʃ, x, t, ts, m, s, l, p, j, ɣ, k, kʰ, n.]

Slide 92

Slide 92 text

Examples: Burmish Graph with Clique Cover (with N. Hill)

[Figure: detail of the graph showing the nodes labelled tʃʰ, tsʰ and tʰ.]

Slide 93

Slide 93 text

Examples: Burmish Graph with Clique Cover (with N. Hill)

[Figure: detail of the graph showing the nodes labelled tʃʰ, tsʰ and tʰ.]

Clique   Cogn.   Concept      Achang   Atsi   Bola   Lashi   Maru   Old B.   Rang.   Xiand.
41       659     "goat"       Ø        Ø      Ø      Ø       tʃʰ    tsʰ      sʰ      Ø
41       672     "armpit"     Ø        Ø      tʃʰ    tʃʰ     tʃʰ    Ø        Ø       Ø
41       433     "rice"       tsʰ      tʃʰ    tʃʰ    tʃʰ     tʃʰ    Ø        sʰ      tsʰ

Clique   Cogn.   Concept      Achang   Atsi   Bola   Lashi   Maru   Old B.   Rang.   Xiand.
74       55      "ten"        Ø        Ø      tʰ     tsʰ     Ø      Ø        Ø       tsʰ
42       53      "ten"        tɕʰ      tsʰ    Ø      Ø       Ø      tsʰ      sʰ      Ø
42       421     "salt"       tɕʰ      tsʰ    tʰ     tsʰ     tsʰ    tsʰ      sʰ      cʰ
42       129     "twenty"     Ø        tsʰ    tʰ     tsʰ     tsʰ    tsʰ      sʰ      Ø
83       287     "hair"       Ø        tsʰ    tsʰ    tsʰ     tsʰ    tsʰ      sʰ      Ø

Clique   Cogn.   Concept      Achang   Atsi   Bola   Lashi   Maru   Old B.   Rang.   Xiand.
17       639     "above"      Ø        tʰ     tʰ     Ø       tʰ     tʰ       tʰ      Ø
17       472     "sing"       Ø        tʰ     tʰ     Ø       tʰ     Ø        Ø       Ø
17       61      "that"       tʰ       Ø      Ø      tʰ      tʰ     tʰ       tʰ      tʰ
17       323     "sharp"      tʰ       tʰ     tʰ     tʰ      tʰ     tʰ       tʰ      Ø
17       66      "there"      tʰ       Ø      tʰ     tʰ      tʰ     Ø        tʰ      tʰ
17       547     "firewood"   tʰ       tʰ     tʰ     tʰ      tʰ     tʰ       tʰ      tʰ
17       74      "thick"      Ø        tʰ     tʰ     tʰ      tʰ     Ø        tʰ      Ø

Slide 95

Slide 95 text

A Method for Correspondence Pattern Recognition Examples Examples: EDICTOR and Software Demo slide has been intentionally left blank 42 / 46

Slide 96

Slide 96 text

Outlook 43 / 46

Slide 100

Slide 100 text

Outlook

• the proposed inference of correspondence patterns is a first attempt to account for systemic aspects of sound change in a rigorous manner
• in contrast to many approaches proposed so far, it does not require family trees in any form; networks are sufficient, but the patterns inferred can be used to study tree-like aspects of evolution (Chacon and List 2015)
• the algorithm needs to be further tested
• we need a deeper discussion in the field about the importance of correspondence patterns for linguistic reconstruction

Slide 101

Slide 101 text

Acknowledgements

• Nathan W. Hill (essential discussions on the implications of the procedure and further applications, intensive manual inspection of the output of the method)
• Taraka Rama (testing the method for alignment-based phylogenetic tree reconstruction, comments on draft and code)
• Eric Bapteste, Philippe Lopez, and their Team AIRE (providing initial inspiration and follow-up discussions on the approach, thanks to a similar approach applied in biology)

Slide 102

Slide 102 text

Thank you for listening!