Automatic Identification of Historically Related Words

Talk held at the workshop "Strings and Structures -- Codes of Sense and Function", 20th-21st May, University of Cologne, Cologne.

Johann-Mattis List

May 20, 2015

Transcript

  1. Automatic Identification of Historically Related Words Johann-Mattis List DFG research

    fellow Centre des recherches linguistiques sur l’Asie Orientale Team Adaptation, Integration, Reticulation, Evolution EHESS and UPMC, Paris 2015/05/20 1 / 30
  6. Lexical Change Dimensions Dimensions of Lexical Change

    Indo-European: 'soh₂-wl̩- / sh₂uˈen- SUN
    Germanic: soːwel- / sunːoː- SUN
    Romance: soːl- SUN, soːlikul- SMALL SUN
    French: solej SUN
    Spanish: sol SUN
    German: zɔnə SUN
    Swedish: suːl SUN
    Edge labels: SEMANTIC SHIFT, MORPHOLOGICAL CHANGE
    3 / 30
  7. Lexical Change Dimensions Dimensions of Lexical Change SEMANTIC CHANGE MORPHOLOGICAL

    CHANGE STRATIC CHANGE Gévaudan (2007) 4 / 30
  8. Lexical Change Relations Relations between Historically Related Words English 'TOOTH'

    tooth Germanic 'TOOTH' *tanθ- German 'TOOTH' Zahn Direct Cognate Relation (Orthology) 5 / 30
  9. Lexical Change Relations Relations between Historically Related Words English 'BIRTH'

    birth Germanic 'BIRTH' *ga-burdi- German 'BIRTH' Geburt Indirect Cognate Relation (Paralogy) 5 / 30
  10. Lexical Change Relations Relations between Historically Related Words Germanic English

    'SILLY' silly Germanic 'HAPPY' *sæli- German 'BLESSED' selig Indirect Cognate Relation (Paralogy) 5 / 30
  11. Lexical Change Relations Relations between Historically Related Words Kopf 'HAPPY'

    *sæli- 'BLESSED' selig Germanic 'SHORT' *skurt Indo-Europ. 'CUT OFF' *(s)ker- Latin 'MUTILATED' curtus German 'SHORT' kurz English 'SHORT' short Indirect Etymological Relation (Xenology) 5 / 30
  12. Lexical Change Relations Relations between Historically Related Words

    Relations in Biology    Proposed Terminology for Linguistics
    homology                etymological relation
    -                       cognate relation
    orthology               direct cognate relation
    paralogy                indirect cognate relation
    xenology                indirect etymological relation
    5 / 30
  19. Lexical Change Sound Change Sound Change

    Meaning     Latin     Italian
    ‘FEATHER’   pluːma    pjuma
    ‘FLAT’      plaːnus   pjano
    ‘SQUARE’    plateːa   pjaʦːa
    Correspondence: l > j

    Meaning     Latin     Italian
    ‘TONGUE’    liŋgua    liŋgwa
    ‘MOON’      luːna     luna
    ‘SLOW’      lentus    lento
    Correspondence: l > l

    Rule: l > j / p _
    Not sounds change, phonemes (alphabets!) change (Bloomfield 1933)!
    Sound change depends on the context in which the sounds occur!
    6 / 30
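The conditioned rule l > j / p _ can be illustrated with a minimal Python sketch. It assumes words are plain IPA strings and models only this one rule; the function name is hypothetical, and other Latin-to-Italian changes (vowel length, endings) are deliberately left out.

```python
import re


def apply_l_to_j_after_p(word: str) -> str:
    """Apply l > j / p _ : an l turns into j only when it directly follows p."""
    # The lookbehind (?<=p) checks the left context without consuming it.
    return re.sub(r"(?<=p)l", "j", word)


latin_words = ["pluːma", "plaːnus", "plateːa", "liŋgua", "luːna", "lentus"]
for latin in latin_words:
    print(latin, "->", apply_l_to_j_after_p(latin))
```

Words with initial l are left untouched, which is the point of the slide: the same sound behaves differently depending on its context.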
  20. Lexical Change Sound Change Sound Change Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 3 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn d ɔː n 7 / 30
  22. Lexical Change Sound Change Sound Change Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 2 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn d ɔː n 7 / 30
  23. Lexical Change Sound Change Sound Change Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 2 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn θ ɔː n 7 / 30
  24. Lexical Change Sound Change Sound Change Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 3 x d d 1 x ? n n 2 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn θ ɔː n 7 / 30
  25. Lexical Change Sound Change Sound Change Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 3 x d d 1 x n n 2 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn θ ɔː n 7 / 30
  26. Lexical Change Sound Change Sound Change

    Cognate List and Alignment
    German dünn     d ʏ n
    English thin    θ ɪ n
    German Ding     d ɪ ŋ
    English thing   θ ɪ ŋ
    German Dorn     d ɔɐ n
    English thorn   θ ɔː n
    German dumm     d ʊ m
    English dumb    d ʌ m

    Correspondence List
    GER   ENG   Frequ.
    d     θ     3 x
    n     n     2 x
    ŋ     ŋ     1 x
    7 / 30
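How such a correspondence list can be counted from aligned word pairs is sketched below in Python. The three alignments are taken from the slide; the consonant set and the function name are assumptions made for this illustration, and vowels are skipped because the slide lists only consonant correspondences.

```python
from collections import Counter

# Aligned German/English cognate pairs from the slide (one token per column).
ALIGNMENTS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Hypothetical consonant inventory, just large enough for this example.
CONSONANTS = {"d", "θ", "n", "ŋ", "m"}


def count_correspondences(alignments):
    """Count how often each consonant pair shares an alignment column."""
    counts = Counter()
    for german, english in alignments:
        for g, e in zip(german, english):
            if g in CONSONANTS and e in CONSONANTS:
                counts[(g, e)] += 1
    return counts


for (g, e), freq in count_correspondences(ALIGNMENTS).most_common():
    print(g, e, f"{freq} x")
```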
  27. Lexical Change Sound Change Sound Change To identify cognate words,

    one needs a context-dependent mapping between two (or more) phoneme systems (alphabets)! Technically, one needs to infer both the scoring function and the optimal alignment between multiple words at the same time! 7 / 30
  28. Sequence Comparison Alignment Analyses Alignment Analyses: Alignment Modes

    Mode         Alignment
    global       T H E # C A T - F I S H # H U N T S
                 T H E # C A T # F I S H - E - - - S
    semiglobal   T H E # C A T - F I S H - - - H U N T S
                 T H E # C A T # F I S H E S # - - - - -
    local        T H E # C A T - F I S H   HUNTS
                 T H E # C A T # F I S H   ES
    diagonal     T H E # C A T - F I S H - # H U N T S
                 T H E # C A T # F I S H E - - - - - S
    secondary    T H E # C A T F I S H # H U N T - S
                 T H E # C A T - - - - # F I S H E S
    10 / 30
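The global mode can be sketched as a textbook Needleman-Wunsch alignment. This is a minimal Python illustration, not the implementation used in the talk: the scores (match +1, mismatch -1, gap -1) are assumed values, so the gaps it places may differ from the ones shown on the slide.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


top, bottom, total = global_align("THE#CATFISH#HUNTS", "THE#CAT#FISHES")
print(" ".join(top))
print(" ".join(bottom))
```

The other modes differ mainly in how the matrix is initialised and where the traceback starts and stops; they are not covered by this sketch.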
  32. Sequence Comparison Alignment Analyses Alignment Analyses: Alignment Modes

    Primary Alignment
    Haikou    z i - t - ³
    Beijing   ʐ ʅ ⁵¹ tʰ ou ¹
    Secondary Alignment
    Haikou    z i t ³ - - -
    Beijing   ʐ ʅ - ⁵¹ tʰ ou ¹
    10 / 30
  34. Sequence Comparison Multiple Alignment Analyses Multiple Alignment Analyses W O

    L D E M O R T W A L D E M A R - V O L O D Y M Y R - V - L A D I M I R - 11 / 30
  35. Sequence Comparison Multiple Alignment Analyses Multiple Alignment Analyses

    W O L - D E M O R T
    W A L - D E M A R -
    V O L O D Y M Y R -
    V - L A D I M I R -
    11 / 30
  38. Sequence Comparison Sequences in Biology and Linguistics Sequences in Biology and Linguistics

    Biology: • universal • limited • constant
    Linguistics: • language-specific • widely varying • mutable
    12 / 30
  39. Sequence Modeling in Historical Linguistics Paradigmatic Aspects Paradigmatic Aspects Sound

    Classes Sounds which frequently occur in correspondence relation in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35). 14 / 30
  40. Sequence Modeling in Historical Linguistics Paradigmatic Aspects Paradigmatic Aspects Sound

    Classes Sounds which frequently occur in correspondence relation in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35). k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z 1 14 / 30
  43. Sequence Modeling in Historical Linguistics Paradigmatic Aspects Paradigmatic Aspects Sound

    Classes Sounds which frequently occur in correspondence relation in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35). K T P S 1 14 / 30
  44. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. 15 / 30
  45. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. j a b ə l k a 15 / 30
  46. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. sonority increases j a b ə l k a 15 / 30
  47. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. j a b ə l k a ↑ ↑ ↓ ↑ ↑ ascending maximum ↓ descending 15 / 30
  48. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. j a b ə l k a ↑ ↑ ↓ ↑ o strong weak 15 / 30
  49. Sequence Modeling in Historical Linguistics Syntagmatic Aspects Syntagmatic Aspects Prosodic

    Strings Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as prosodic string in which contexts are encoded by using specific symbols. j a b ə l k a # v C v c C > 15 / 30
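A prosodic string of this kind can be derived from a sonority profile. The Python sketch below rests on several assumptions: the profile 6 7 1 7 5 1 7 is the one given for this word on the multitiered-representation slide, level 7 is treated as the vowel level, and reading C as a consonant before rising sonority and c as a consonant before falling sonority is an interpretation of the slide's symbols, not a documented definition.

```python
def prosodic_string(sonority, vowel_level=7):
    """Encode each position's prosodic context as one symbol.

    '#' = first segment, '>' = last segment, 'v' = sonority peak (vowel),
    'C' = consonant followed by higher sonority (ascending),
    'c' = consonant followed by lower or equal sonority (descending).
    """
    last = len(sonority) - 1
    symbols = []
    for i, level in enumerate(sonority):
        if i == 0:
            symbols.append("#")
        elif i == last:
            symbols.append(">")
        elif level >= vowel_level:
            symbols.append("v")
        elif sonority[i + 1] > level:
            symbols.append("C")
        else:
            symbols.append("c")
    return "".join(symbols)


# Sonority profile of j a b ə l k a, as listed later in the deck.
print(prosodic_string([6, 7, 1, 7, 5, 1, 7]))
```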
  50. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation Multitiered Sequence Representation

    External Representation
    IPA                          j a b ə l k a
    Internal Representation
    Dolgopolsky Sound Classes    J V P V L K V
    SCA Sound-Classes            J A P E L K A
    ASJP Sound-Classes           y a b I l k a
    Prosodic String              # V C V c C >
    Trigrams                     #,j,a j,a,b a,b,ə b,ə,l ə,l,k l,k,a k,a,$
    Sound-Class Trigrams         #,j,V J,a,P V,b,V P,ə,L V,l,K L,k,V K,a,$
    Onset-Vowel-Offset           C,j V,a C,b v,ə c,l C,k >,a
    Sonority Profile             6 7 1 7 5 1 7
    Prosodic String              # v C v c C >
    Relative Gap-Weight          2.0 1.5 1.5 1.3 1.1 1.5 0.7
    ...                          ... ... ... ... ... ... ...
    16 / 30
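The trigram tier can be generated with a simple sliding window. This is a minimal Python sketch, assuming the word is already tokenized and that `#` and `$` mark the word boundaries as on the slide; the function name is hypothetical.

```python
def context_trigrams(tokens, start="#", end="$"):
    """Return one (left, sound, right) trigram per token, padded at the edges."""
    padded = [start] + list(tokens) + [end]
    return [tuple(padded[i:i + 3]) for i in range(len(tokens))]


for trigram in context_trigrams(["j", "a", "b", "ə", "l", "k", "a"]):
    print(",".join(trigram))
```

Each sound keeps its immediate left and right neighbours, so a context-sensitive scoring function can look at the trigram instead of the bare segment.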
  51. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation Multitiered Sequence Representation

    Cognate List and Alignment
    German Zunge     ʦ ʊ ŋ ə
    English tongue   t ʌ ŋ -
    German Zahn      ʦ aː n -
    English tooth    t ʊː - θ
    German heiß      h ai s
    English hot      h ɔ t
    German Fuß       f uː s
    English foot     f ʊ t

    Correspondence List
    GER   ENG   Frequ.
    ʦ     t     2 x
    s     t     2 x
    h     h     1 x
    f     f     1 x
    n     -     1 x
    …     …     …
    17 / 30
  53. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation Multitiered Sequence Representation

    Cognate List and Alignment
    German Zunge     C U N E
    English tongue   T A N -
    German Zahn      C A N -
    English tooth    T U - T
    German heiß      H A S
    English hot      H O T
    German Fuß       B U S
    English foot     B U T

    Correspondence List
    GER   ENG   Frequ.
    C/#   T/#   2 x
    S/$   T/$   2 x
    H/$   H/#   1 x
    B/$   B/#   1 x
    N/c   -     1 x
    …     …     …
    17 / 30
  54. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation Multitiered Sequence

    Representation Multitiered sequence representations (sound classes, prosodic strings, etc.) are of great use in automatic sequence comparison, since they make otherwise incomparable alphabets comparable and allow phonetic contexts to be modeled in a simple, universal, and objective way. 17 / 30
  55. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment Automatic

    Phonetic Alignment

    Sound-Class-Based Phonetic Alignment (SCA, List 2012ac & 2014)
    • IPA as input format
    • pairwise and multiple alignment
    • global, local, semi-global, diagonal, and secondary alignment modes
    • three different sound-class models (Dolgopolsky, SCA, ASJP)
    • empirically and theoretically inferred scoring functions for the sound-class alphabets
    • secondary alignment for the alignment of data containing word or morpheme boundaries (see List 2012c & 2014 for specifics)
    • multitiered sequence representation (prosodic strings)
    • procedure for the detection of swaps (metathesis) in multiple alignments (List 2012a)
    19 / 30
  56. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment Automatic

    Phonetic Alignment

    INPUT SEQUENCES: jabl̩ko, jabəlka, jabləkə, japkɔ
    stage 1, SOUND-CLASS CONVERSION: jabl̩ko → JAPLKU, jabəlka → JAPELKA, jabləkə → JAPLEKE, japkɔ → JAPKU
    stage 2, LIBRARY CREATION: JAP-LKU / JAPELKA, JAPL-KU / JAPLEKE, JAPLKU / JAP-KU, JAPEL-KA / JAP-LEKE, ...
    stage 3, DISTANCE CALCULATION:
    JAPLKU    0.00 0.14 0.34 0.12
    JAPELKA   0.14 0.00 0.46 0.28
    JAPLEKE   0.34 0.46 0.00 0.44
    JAPKU     0.12 0.28 0.44 0.00
    stage 4, CLUSTER ANALYSIS: guide tree with the leaves JAPLKU, JAPELKA, JAPLEKE, JAPKU
    stage 5, PROGRESSIVE ALIGNMENT: J A P - L K U / J A P E L K A; JAPLEKE and JAPKU not yet added; MORE SEQUENCES? (yes / no)
    stage 6, ITERATIVE REFINEMENT: J A P - L - K U / J A P E L - K A / J A P - L E K E; JAPKU not yet added
    stage 7, SWAP CHECK: J A P - L - K U / J A P E L - K A / J A P - L E K E / J A P - - - K U
    stage 8, IPA CONVERSION: J A P … → j a b … (three times), J A P … → j a p …
    OUTPUT MSA:
    j a b - l̩ - k o
    j a b ə l - k a
    j a b - l ə k ə
    j a p - - - k ɔ
    20 / 30
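Stage 1 of this workflow can be sketched in Python as follows. The class mapping is read off the four conversions shown on the slide and covers only the sounds that occur in them, so it is an assumed fragment rather than the full SCA model; the tokenizer is a simplification that only attaches combining diacritics to the preceding base character.

```python
import unicodedata

# Fragment of a sound-class mapping, inferred from the slide's four examples.
SOUND_CLASSES = {
    "j": "J", "a": "A", "b": "P", "p": "P", "l": "L",
    "k": "K", "o": "U", "ɔ": "U", "ə": "E",
}


def tokenize(word):
    """Split an IPA string into segments, keeping combining marks attached."""
    tokens = []
    for char in word:
        if unicodedata.combining(char) and tokens:
            tokens[-1] += char  # e.g. the syllabicity mark in l + U+0329
        else:
            tokens.append(char)
    return tokens


def to_sound_classes(word):
    """Convert each segment to its class, looking only at the base character."""
    return "".join(SOUND_CLASSES[token[0]] for token in tokenize(word))


# "jabl\u0329ko" spells the first input word with syllabic l.
for word in ["jabl\u0329ko", "jabəlka", "jabləkə", "japkɔ"]:
    print(word, "→", to_sound_classes(word))
```

Because the alignment is computed on the class strings, the original IPA tokens have to be kept alongside them so that stage 8 can map the aligned classes back to sounds.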
  68. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection

    INPUT: Multilingual wordlist
    → semantically tagged
    → phonetically transcribed
    → tokenized into phonemes
    OUTPUT: Multilingual wordlist
    → identified cognate entries assigned to clusters
    → identified cognate entries multiply aligned
    21 / 30
  69. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Basic Procedure for Multilingual Cognate Detection WORDLIST DATA 22 / 30
  70. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Basic Procedure for Multilingual Cognate Detection WORDLIST DATA PAIRWISE DISTANCES BETWEEN WORDS PAIRWISE COMPARISON 22 / 30
  71. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Basic Procedure for Multilingual Cognate Detection WORDLIST DATA PAIRWISE DISTANCES BETWEEN WORDS COGNATE SETS COGNATE CLUSTERING PAIRWISE COMPARISON 22 / 30
  72. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Cognate Clustering Analysis ID Taxa Word Gloss GlossID IPA ... ... ... ... ... ... 21 German Frau woman 20 frau 22 Dutch vrouw woman 20 vrɑu 23 English woman woman 20 wʊmən 24 Danish kvinde woman 20 kvenə 25 Swedish kvinna woman 20 kviːna 26 Norwegian kvine woman 20 kʋinə ... ... ... ... ... ... 22 / 30
  73. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Cognate Clustering Swedish English Danish Norwegian Dutch German kvinna woman kvinde kvine vrouw Frau Swedish kvina 0.00 0.69 0.07 0.12 0.71 0.78 English wumin 0.69 0.00 0.66 0.57 0.68 0.87 Danish kveni 0.07 0.66 0.00 0.08 0.67 0.71 Norwegian kwini 0.12 0.57 0.08 0.00 0.75 0.74 Dutch frou 0.71 0.68 0.67 0.75 0.00 0.17 German frau 0.78 0.87 0.71 0.74 0.17 0.00 Analysis ID Taxa Word Gloss GlossID IPA ... ... ... ... ... ... 21 German Frau woman 20 frau 22 Dutch vrouw woman 20 vrɑu 23 English woman woman 20 wʊmən 24 Danish kvinde woman 20 kvenə 25 Swedish kvinna woman 20 kviːna 26 Norwegian kvine woman 20 kʋinə ... ... ... ... ... ... 22 / 30
  74. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Cognate Clustering

                        Swedish  English  Danish  Norwegian  Dutch  German
                        kvinna   woman    kvinde  kvine      vrouw  Frau
    Swedish    kvina    0.00     0.69     0.07    0.12       0.71   0.78
    English    wumin    0.69     0.00     0.66    0.57       0.68   0.87
    Danish     kveni    0.07     0.66     0.00    0.08       0.67   0.71
    Norwegian  kwini    0.12     0.57     0.08    0.00       0.75   0.74
    Dutch      frou     0.71     0.68     0.67    0.75       0.00   0.17
    German     frau     0.78     0.87     0.71    0.74       0.17   0.00

    German     Frau     frau
    Dutch      vrouw    vrou
    English    woman    wumin
    Danish     kvinde   kveni
    Swedish    kvinna   kvina
    Norwegian  kvine    kwini
    22 / 30
  76. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection Cognate Clustering

    German      Frau     frau
    Dutch       vrouw    vrou
    English     woman    wumin
    Danish      kvinde   kveni
    Swedish     kvinna   kvina
    Norwegian   kvine    kwini

    Analysis
    ID    Taxa        Word     Gloss   GlossID   IPA      CogID
    ...   ...         ...      ...     ...       ...      ...
    21    German      Frau     woman   20        frau     1
    22    Dutch       vrouw    woman   20        vrɑu     1
    23    English     woman    woman   20        wʊmən    2
    24    Danish      kvinde   woman   20        kvenə    3
    25    Swedish     kvinna   woman   20        kviːna   3
    26    Norwegian   kvine    woman   20        kʋinə    3
    ...   ...         ...      ...     ...       ...      ...

    22 / 30
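The step from a distance matrix to cognate IDs can be sketched as flat agglomerative clustering with a cutoff. The sketch below uses average linkage and the distances shown on the preceding slides. The cutoff of 0.5 and the function names are assumptions chosen for illustration; the slides do not state the threshold or the exact clustering variant used.

```python
from itertools import combinations

LANGS = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]
ROWS = [  # distance values as shown on the slides
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]
DIST = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(LANGS)
    for j, b in enumerate(LANGS)
}


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))


def flat_cluster(items, dist, threshold):
    """Merge the closest pair of clusters until none is closer than threshold."""
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > 1:
        best, c1, c2 = min(
            ((average_distance(a, b, dist), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda triple: triple[0],
        )
        if best >= threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# The threshold of 0.5 is an assumed value, not taken from the talk.
clusters = flat_cluster(LANGS, DIST, threshold=0.5)
for cog_id, members in enumerate(clusters, start=1):
    print(cog_id, sorted(members))
```

With this cutoff the partition agrees with the CogID column: German and Dutch form one set, the three Scandinavian words another, and English stays alone (the numeric labels themselves are arbitrary).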
  77. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection Automatic

    Cognate Detection LexStat Algorithm (List 2012b & 2014)
    [Workflow diagram. Stages: INPUT, TOKENIZATION, PREPROCESSING, CORRESPONDENCE DETECTION (LOOP: USING PHONETIC ALIGNMENT, ATTESTED DISTRIBUTION, EXPECTED DISTRIBUTION, LOG-ODDS), DISTANCE CALCULATION, COGNATE CLUSTERING, OUTPUT]
    22 / 30
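The log-odds step named in the workflow compares how often two sounds are matched in the attested alignments with how often they would be matched by chance. The following is a minimal sketch of that idea only. The counts are invented toy numbers, and the additive smoothing, the base-2 logarithm, and the function name are assumptions for illustration; they are not taken from the LexStat publications.

```python
import math
from collections import Counter


def log_odds_scores(attested, expected, smoothing=0.5):
    """Score each sound pair by log2(attested share / expected share).

    Positive scores mark pairs matched more often than chance predicts.
    The additive smoothing avoids division by zero (assumption).
    """
    pairs = set(attested) | set(expected)
    total_attested = sum(attested.values()) + smoothing * len(pairs)
    total_expected = sum(expected.values()) + smoothing * len(pairs)
    scores = {}
    for pair in pairs:
        p_attested = (attested.get(pair, 0) + smoothing) / total_attested
        p_expected = (expected.get(pair, 0) + smoothing) / total_expected
        scores[pair] = math.log2(p_attested / p_expected)
    return scores


# Hypothetical counts of aligned sound pairs for one language pair.
attested = Counter({("f", "v"): 8, ("a", "o"): 5, ("f", "k"): 1})
expected = Counter({("f", "v"): 2, ("a", "o"): 4, ("f", "k"): 8})

scores = log_odds_scores(attested, expected)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy example the pair f/v receives a positive score and f/k a negative one, so an alignment that matches f with v would be rewarded in a subsequent distance calculation.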
  78. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: SCA (List

    2012a, c & 2014) Gold Standard for Multiple Alignment Analyses
    750 multiple alignments (manually edited)
    50 089 words
    528 different languages and dialects
    8 language families
    encoded in IPA
    online at http://sequencecomparison.github.io
    24 / 30
  79. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: SCA (List

    2012a, c & 2014) Performance of the Sound-Class Based Phonetic Alignment Algorithm (Multiple Alignments)
    [Figure: column scores and pair scores (in %) of the modes Basic, Library, Iterate, and Lib-Iter for the sound-class models DOLGO, ASJP, and SCA]
    24 / 30
  80. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: SCA (List

    2012a, c & 2014) Performance of the Sound-Class Based Phonetic Alignment Algorithm (Multiple Alignments)
    [Figure: column scores and pair scores (in %) of the modes Basic, Library, Iterate, and Lib-Iter for the sound-class models DOLGO, ASJP, and SCA. Highlighted values: 92% and 99%]
    24 / 30
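To make the column score reported here more concrete, the following is a minimal sketch of one common definition: the share of columns in the reference alignment that the test alignment reproduces exactly. The slides do not spell out the definition used in the evaluation, so this variant, the function names, and the toy alignments are assumptions for illustration.

```python
def columns(alignment):
    """Turn an alignment into columns of segment positions.

    Each row is a list of segments with "-" for gaps. A segment is
    identified by its position in the unaligned word, so that identical
    symbols at different positions are not confused.
    """
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        positions = []
        for row, segment in enumerate(column):
            if segment == "-":
                positions.append(None)
            else:
                positions.append(counters[row])
                counters[row] += 1
        result.append(tuple(positions))
    return result


def column_score(reference, test):
    """Share of reference columns found identically in the test alignment."""
    reference_columns = set(columns(reference))
    test_columns = set(columns(test))
    return len(reference_columns & test_columns) / len(reference_columns)


# Toy alignments of the hypothetical words "abc" and "ac".
reference = [["a", "b", "c"],
             ["a", "-", "c"]]
test = [["a", "b", "c"],
        ["a", "c", "-"]]

print(column_score(reference, test))
```

In the toy example only the first column is shared, so one of three reference columns is reproduced. A pair score can be built the same way by comparing aligned segment pairs instead of whole columns, which is why it is usually the more lenient of the two measures.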
  81. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: SCA (List

    2012a, c & 2014)

    Taxon        Alignment
    Dashi        t͡ʂ   -    ɯ    -    ²¹   p    -    e    ²¹   -    -    -
    Eryuan       -    -    -    -    -    p    -    i    ³¹   ʂ    e    ⁴²
    Gongxing     d͡ʐ   -    i    -    ¹²   b    -    i    ²¹   -    -    -
    Heqing       -    -    -    -    -    p    -    i    ³¹   sʰ   e    ⁴⁴
    Jianchuan    -    -    -    -    -    p    -    i    ³¹   -    -    -
    Jianxing     ʦ    -    ɯ    -    ³¹   p    -    e    ²¹   -    -    -
    Lanping      -    -    -    -    -    p    -    ĩ    ⁴²   s    e    ⁴⁴
    Luobenzhuo   ʥ    -    ỹ    -    ⁴²   -    -    -    -    -    -    -
    Mazhelong    ɕ    -    e    n    ⁵⁵   p    -    e    ²¹   -    -    -
    Qiliqiao     -    -    -    -    -    p    -    i    ³¹   s    e    ⁴⁴
    Tuoluo       d    j    ɯ    -    ²¹   b    -    i    ³⁵   -    -    -
    Yunlong      -    -    -    -    -    b    j    ɯ    ²¹   s    ɛ    ⁵⁵
    Zhoucheng    ʦ    -    ɯ    -    ⁰    p    -    e    ²¹   -    -    -

    25 / 30
  82. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Gold Standard for Automatic Cognate Detection
    6 lexicostatistical datasets
    10 243 cognate sets
    95 different languages and dialects
    8 language families
    encoded in IPA
    online at http://sequencecomparison.github.io
    26 / 30
  83. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Performance of Different Cognate Detection Algorithms
    [Figure: scores (in %, axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on the datasets Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, and Sinitic (Chinese Dialects)]
    26 / 30
  84. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Performance of Different Cognate Detection Algorithms
    [Figure: scores (in %, axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on the datasets Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, and Sinitic (Chinese Dialects). Highlighted values: 75%, 93%, 92%, 81%, 89%, 81%]
    26 / 30
  85. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Performance of Different Cognate Detection Algorithms
    [Figure: scores (in %, axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on the datasets Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, and Sinitic (Chinese Dialects). Highlighted values: 75% and 93%]
    26 / 30
  86. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Dataset by Kessler (2001)

    "dig" (30)
    Language   Word         IPA        Turchin   Levenshtein   LexStat
    Albanian   gërmon       gərmo      1         1             1
    English    digs         dɪg        2         2             2
    French     creuse       krøze      1         3             3
    German     gräbt        graːb      1         1             4
    Hawaiian   ‘eli         ʔeli       5         5             5
    Navajo     hahashgééd   hahageːd   6         6             6
    Turkish    kazıyor      kaz        7         3             7

    27 / 30
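The Turchin column can be approximated with a very small sketch: two words are placed in the same set when the sound classes of their first two consonants match. The class mapping below is a reduced, Dolgopolsky-style table that covers only the sounds needed for this example; it and the function names are assumptions for illustration, not the table used in the original method, and the special handling of vowel-initial words is omitted.

```python
# Reduced consonant-class table (assumption, covers this example only).
CONSONANT_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S", "ʃ": "S",
    "k": "K", "g": "K", "ɡ": "K", "ɣ": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "j": "J",
    "h": "H", "ʔ": "H",
}


def consonant_key(ipa, size=2):
    """Classes of the first `size` consonants; vowels and length marks are skipped."""
    classes = [CONSONANT_CLASSES[s] for s in ipa if s in CONSONANT_CLASSES]
    return tuple(classes[:size])


def cluster_by_key(words):
    """Assign the same numeric ID to all words sharing a consonant key."""
    ids = {}
    result = {}
    for language, ipa in words.items():
        key = consonant_key(ipa)
        result[language] = ids.setdefault(key, len(ids) + 1)
    return result


# Transcriptions for the concept "dig" as given on the slide.
DIG = {
    "Albanian": "gərmo", "English": "dɪg", "French": "krøze",
    "German": "graːb", "Hawaiian": "ʔeli", "Navajo": "hahageːd",
    "Turkish": "kaz",
}

cognate_ids = cluster_by_key(DIG)
for language, cog_id in cognate_ids.items():
    print(language, consonant_key(DIG[language]), cog_id)
```

Albanian, French, and German all reduce to the key K-R and end up in one set, which mirrors the Turchin column on the slide, including the false match between the unrelated Albanian and French words and German "gräbt". The numeric labels differ from the slide, but the grouping is the same.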
  87. Automatic Sequence Comparison in Historical Linguistics Evaluation Evaluation: LexStat (List

    2012b & 2014) Dataset by Kessler (2001)

    "mouth" (104)
    Language   Word      IPA    Turchin   Levenshtein   LexStat
    Albanian   gojë      goj    1         1             1
    English    mouth     mauθ   2         2             2
    French     bouche    buʃ    3         3             3
    German     Mund      mund   4         4             2
    Hawaiian   waha      waha   5         5             5
    Navajo     ’azéé’    zeːʔ   6         6             6
    Turkish    ağız      aɣz    7         7             7

    27 / 30
  88. Automatic Sequence Comparison in Historical Linguistics Evaluation Concluding Remarks The

    techniques for automatic sequence comparison in historical linguistics have greatly advanced during the last decade, and they are at a stage where they can actively help linguists in studying dialectal variation or carrying out initial analyses of understudied languages.
    There is, however, still room for improvement. So far, we cannot properly handle the major processes of lexical change, such as semantic shift, morphological change, or borrowing.
    29 / 30