Automatic Identification of Historically Related Words

Talk held at the workshop "Strings and Structures -- Codes of Sense and Function", 20th-21st May, University of Cologne, Cologne.

Johann-Mattis List

May 20, 2015
Transcript

  1. Automatic Identification of Historically Related Words
    Johann-Mattis List
    DFG research fellow
    Centre des recherches linguistiques sur l’Asie Orientale
    Team Adaptation, Integration, Reticulation, Evolution
    EHESS and UPMC, Paris
    2015/05/20
    1 / 30

  2. Lexical Change
    2 / 30

  9. Lexical Change Dimensions
    Dimensions of Lexical Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    soːl-
    SUN
    soːlikul-
    SMALL SUN
    Romance
    solej
    SUN
    French
    sol
    SUN
    Spanish
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    [Diagram edge labels: SEMANTIC SHIFT (once), MORPHOLOGICAL CHANGE (four times)]
    3 / 30

  10. Lexical Change Dimensions
    Dimensions of Lexical Change
    arbre
    4 / 30

  11. Lexical Change Dimensions
    Dimensions of Lexical Change
    form
    "meaning"
    4 / 30

  16. Lexical Change Dimensions
    Dimensions of Lexical Change
    arbre
    MEANING
    FORM
    LANGUAGE
    MEANING
    FORM
    LANGUAGE
    4 / 30

  17. Lexical Change Dimensions
    Dimensions of Lexical Change
    SEMANTIC CHANGE
    MORPHOLOGICAL CHANGE
    STRATIC CHANGE
    Gévaudan (2007)
    4 / 30

  18. Lexical Change Relations
    Relations between Historically Related Words
    English 'TOOTH' tooth
    Germanic 'TOOTH' *tanθ-
    German 'TOOTH' Zahn
    Direct Cognate Relation (Orthology)
    5 / 30

  19. Lexical Change Relations
    Relations between Historically Related Words
    English 'BIRTH' birth
    Germanic 'BIRTH' *ga-burdi-
    German 'BIRTH' Geburt
    Indirect Cognate Relation (Paralogy)
    5 / 30

  20. Lexical Change Relations
    Relations between Historically Related Words
    English 'SILLY' silly
    Germanic 'HAPPY' *sæli-
    German 'BLESSED' selig
    Indirect Cognate Relation (Paralogy)
    5 / 30

  21. Lexical Change Relations
    Relations between Historically Related Words
    Germanic 'SHORT' *skurt
    Indo-Europ. 'CUT OFF' *(s)ker-
    Latin 'MUTILATED' curtus
    German 'SHORT' kurz
    English 'SHORT' short
    Indirect Etymological Relation (Xenology)
    5 / 30

  22. Lexical Change Relations
    Relations between Historically Related Words
    Relations in Biology    Proposed Terminology for Linguistics
    homology                etymological relation
    -                       cognate relation
    orthology               direct cognate relation
    paralogy                indirect cognate relation
    xenology                indirect etymological relation
    5 / 30

  29. Lexical Change Sound Change
    Sound Change
    Meaning      Latin      Italian
    ‘FEATHER’    pluːma     pjuma
    ‘FLAT’       plaːnus    pjano
    ‘SQUARE’     plateːa    pjaʦːa
    l > j
    Meaning      Latin      Italian
    ‘TONGUE’     liŋgua     liŋgwa
    ‘MOON’       luːna      luna
    ‘SLOW’       lentus     lento
    l > l
    l > j / p _
    Not sounds change, phonemes (alphabets!) change (Bloomfield 1933)!
    Sound change depends on the context in which the sounds occur!
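
The rule l > j / p _ above can be read as a rewrite over segmented words. The following is a minimal sketch, assuming the words are already split into IPA segments. `apply_rule` is a hypothetical helper, and only the single l-change from the slide is modeled, not a full Latin-to-Italian derivation.

```python
def apply_rule(segments, target, replacement, left_context):
    """Rewrite `target` as `replacement` only when preceded by `left_context`."""
    out = []
    for i, seg in enumerate(segments):
        if seg == target and i > 0 and segments[i - 1] == left_context:
            out.append(replacement)
        else:
            out.append(seg)
    return out


# Segmentations are assumptions made for this illustration.
latin = {
    "FEATHER": ["p", "l", "uː", "m", "a"],
    "MOON": ["l", "uː", "n", "a"],
}

for meaning, word in latin.items():
    print(meaning, " ".join(apply_rule(word, "l", "j", "p")))
```

The word for FEATHER changes its l because it follows p, while the word-initial l in MOON is left untouched, which is the contrast the two tables illustrate.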
    6 / 30

  35. Lexical Change Sound Change
    Sound Change
    Cognate List      Alignment
    German dünn       d ʏ n
    English thin      θ ɪ n
    German Ding       d ɪ ŋ
    English thing     θ ɪ ŋ
    German dumm       d ʊ m
    English dumb      d ʌ m
    German Dorn       d ɔɐ n
    English thorn     θ ɔː n
    Correspondence List
    GER   ENG   Frequ.
    d     θ     3 x
    d     d     1 x ?
    n     n     2 x
    m     m     1 x
    ŋ     ŋ     1 x
    7 / 30

  37. Lexical Change Sound Change
    Sound Change
    Cognate List      Alignment
    German dünn       d ʏ n
    English thin      θ ɪ n
    German Ding       d ɪ ŋ
    English thing     θ ɪ ŋ
    German Dorn       d ɔɐ n
    English thorn     θ ɔː n
    German dumm       d ʊ m
    English dumb      d ʌ m
    Correspondence List
    GER   ENG   Frequ.
    d     θ     3 x
    n     n     2 x
    ŋ     ŋ     1 x
    7 / 30
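
The correspondence list above can be derived mechanically once the alignments are given. The following is a minimal sketch, assuming the three aligned pairs that the final list counts (dumm/dumb is left out here, since its d : d and m : m rows no longer appear in that list). Restricting the count to consonants is an assumption made to mirror the slide, and `count_correspondences` is a hypothetical helper.

```python
from collections import Counter

# Aligned German/English segment pairs, copied from the slide.
alignments = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Assumption: vowels are skipped because the slide lists consonants only.
VOWELS = {"ʏ", "ɪ", "ɔɐ", "ɔː"}


def count_correspondences(pairs):
    """Count how often each German segment is aligned with each English one."""
    counts = Counter()
    for ger, eng in pairs:
        for g, e in zip(ger, eng):
            if g not in VOWELS:
                counts[(g, e)] += 1
    return counts


counts = count_correspondences(alignments)
for (g, e), n in counts.most_common():
    print(g, e, f"{n} x")
```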

  38. Lexical Change Sound Change
    Sound Change
    To identify cognate words, one needs a context-dependent mapping between two (or more) phoneme systems (alphabets)!
    Technically, one needs to infer both the scoring function and the optimal alignment between multiple words at the same time!
    7 / 30

  39. Sequence Comparison
    8 / 30

  40. Sequence Comparison Alignment Analyses
    Alignment Analyses
    9 / 30

  44. Sequence Comparison Alignment Analyses
    Alignment Analyses: Alignment Modes
    Mode Alignment
    global
    T H E # C A T - F I S H # H U N T S
    T H E # C A T # F I S H - E - - - S
    semiglobal
    T H E # C A T - F I S H - - - H U N T S
    T H E # C A T # F I S H E S # - - - - -
    local
    T H E # C A T - F I S H HUNTS
    T H E # C A T # F I S H ES
    diagonal
    T H E # C A T - F I S H - # H U N T S
    T H E # C A T # F I S H E - - - - - S
    secondary
    T H E # C A T F I S H # H U N T - S
    T H E # C A T - - - - # F I S H E S
    10 / 30
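
The global mode in the table above corresponds to the classic Needleman-Wunsch procedure. The following is a minimal sketch, assuming simple unit scores (match 1, mismatch -1, gap -1) rather than the scoring functions used in the talk. `global_align` is a hypothetical name, and the other modes (semi-global, local, diagonal, secondary) are not covered.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists."""
    n, m = len(a), len(b)
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


row_a, row_b, total = global_align(list("WOLDEMORT"), list("WALDEMAR"))
print(" ".join(row_a))
print(" ".join(row_b))
```

Working on token lists rather than raw strings keeps the sketch usable for segmented IPA input, where one sound may span several characters.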

  48. Sequence Comparison Alignment Analyses
    Alignment Analyses: Alignment Modes
    Primary Alignment
    Haikou z i - t - ³
    Beijing ʐ ʅ ⁵¹ tʰ ou ¹
    Secondary Alignment
    Haikou z i t ³ - - -
    Beijing ʐ ʅ - ⁵¹ tʰ ou ¹
    10 / 30

  50. Sequence Comparison Multiple Alignment Analyses
    Multiple Alignment Analyses
    W O L D E M O R T
    W A L D E M A R -
    V O L O D Y M Y R -
    V - L A D I M I R -
    11 / 30

  51. Sequence Comparison Multiple Alignment Analyses
    Multiple Alignment Analyses
    W O L - D E M O R T
    W A L - D E M A R -
    V O L O D Y M Y R -
    V - L A D I M I R -
    11 / 30

  55. Sequence Comparison Sequences in Biology and Linguistics
    Sequences in Biology and Linguistics
    Biology         Linguistics
    • universal     • language-specific
    • limited       • widely varying
    • constant      • mutable
    12 / 30

  56. Sequence Modeling
    in Historical Linguistics
    13 / 30

  59. Sequence Modeling in Historical Linguistics Paradigmatic Aspects
    Paradigmatic Aspects
    Sound Classes
    Sounds which frequently occur in correspondence relation in genetically related languages can be clustered into classes (types), assuming that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1966]: 35).
    k g    p b
    ʧ ʤ    f v
    t d    ʃ ʒ
    θ ð    s z
    14 / 30

  62. Sequence Modeling in Historical Linguistics Paradigmatic Aspects
    Paradigmatic Aspects
    Sound Classes
    Resulting classes: K, T, P, S
    14 / 30
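
Sound-class conversion amounts to a lookup from sounds to class symbols. The following is a minimal sketch. The mapping is an assumption read off the four-by-four grid of sounds (one quadrant per class) and covers only those sixteen consonants. `to_classes` is a hypothetical helper, not the API of any existing library.

```python
# Assumed assignment of the sixteen sounds on the slide to the classes K, P, T, S.
SOUND_CLASSES = {
    "K": ["k", "g", "ʧ", "ʤ"],
    "P": ["p", "b", "f", "v"],
    "T": ["t", "d", "θ", "ð"],
    "S": ["ʃ", "ʒ", "s", "z"],
}
SOUND_TO_CLASS = {
    sound: cls for cls, sounds in SOUND_CLASSES.items() for sound in sounds
}


def to_classes(segments, unknown="?"):
    """Replace each segment by its class symbol; unlisted sounds become `unknown`."""
    return [SOUND_TO_CLASS.get(seg, unknown) for seg in segments]


# German d and English θ fall into the same class, so they compare as equal.
print(to_classes(["d"]), to_classes(["θ"]))
```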

  63. Sequence Modeling in Historical Linguistics Syntagmatic Aspects
    Syntagmatic Aspects
    Prosodic Strings
    Sound change occurs more frequently in prosodically weak positions of sound sequences (Geisler 1992). Based on the sonority profile of a sound sequence, we can distinguish different positions inside a string with respect to their prosodic context. Prosodic context can be modeled as a prosodic string in which contexts are encoded by using specific symbols.
    Segments           j a b ə l k a
    Prosodic string    # v C v c C >
    [Figure: sonority profile of the sequence, with sonority increasing along the vertical axis; positions are marked as ascending (↑), maximum, or descending (↓), and as strong or weak]
    15 / 30

  71. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation
    Multitiered Sequence Representation
    External Representation
    IPA                          j      a      b      ə      l      k      a
    Internal Representation
    Dolgopolsky Sound Classes    J      V      P      V      L      K      V
    SCA Sound-Classes            J      A      P      E      L      K      A
    ASJP Sound-Classes           y      a      b      I      l      k      a
    Prosodic String              #      V      C      V      c      C      >
    Trigrams                     #,j,a  j,a,b  a,b,ə  b,ə,l  ə,l,k  l,k,a  k,a,$
    Sound-Class Trigrams         #,j,V  J,a,P  V,b,V  P,ə,L  V,l,K  L,k,V  K,a,$
    Onset-Vowel-Offset           C,j    V,a    C,b    v,ə    c,l    C,k    >,a
    Sonority Profile             6      7      1      7      5      1      7
    Prosodic String              #      v      C      v      c      C      >
    Relative Gap-Weight          2.0    1.5    1.5    1.3    1.1    1.5    0.7
    ...                          ...    ...    ...    ...    ...    ...    ...
    16 / 30
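
Two of the tiers above can be computed directly from the segments and their sonority values. The following is a minimal sketch: `trigrams` pads the word with `#` and `$`, and `prosodic_string` derives the context symbols from the sonority profile. The derivation rules are assumptions chosen to reproduce the rows on the slide (vowels have sonority 7; a consonant is `C` before a rise in sonority and `c` otherwise; `V` and `$` stand for a word-initial vowel and a word-final consonant). They simplify whatever procedure the original method uses.

```python
def trigrams(segments):
    """Return comma-joined trigrams over the word padded with # and $."""
    padded = ["#"] + list(segments) + ["$"]
    return [",".join(padded[i:i + 3]) for i in range(len(padded) - 2)]


def prosodic_string(profile, vowel=7):
    """Encode the prosodic context of each position from a sonority profile."""
    out = []
    last = len(profile) - 1
    for i, sonority in enumerate(profile):
        is_vowel = sonority == vowel
        if i == 0:
            out.append("V" if is_vowel else "#")   # word-initial position
        elif i == last:
            out.append(">" if is_vowel else "$")   # word-final position
        elif is_vowel:
            out.append("v")                        # sonority peak
        elif profile[i + 1] > sonority:
            out.append("C")                        # ascending sonority
        else:
            out.append("c")                        # descending sonority
    return "".join(out)


word = ["j", "a", "b", "ə", "l", "k", "a"]
profile = [6, 7, 1, 7, 5, 1, 7]  # sonority profile row from the slide

print(" ".join(trigrams(word)))
print(" ".join(prosodic_string(profile)))
```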

  73. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation
    Multitiered Sequence Representation
    Cognate List       Alignment
    German Zunge       ʦ ʊ ŋ ə
    English tongue     t ʌ ŋ -
    German Zahn        ʦ aː n -
    English tooth      t ʊː - θ
    German heiß        h ai s
    English hot        h ɔ t
    German Fuß         f uː s
    English foot       f ʊ t
    Correspondence List
    GER   ENG   Frequ.
    ʦ     t     2 x
    s     t     2 x
    h     h     1 x
    f     f     1 x
    n     -     1 x
    …     …     …
    17 / 30

  75. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation
    Multitiered Sequence Representation
    Cognate List       Alignment
    German Zunge       C U N E
    English tongue     T A N -
    German Zahn        C A N -
    English tooth      T U - T
    German heiß        H A S
    English hot        H O T
    German Fuß         B U S
    English foot       B U T
    Correspondence List
    GER   ENG   Frequ.
    C/#   T/#   2 x
    S/$   T/$   2 x
    H/$   H/#   1 x
    B/$   B/#   1 x
    N/c   -     1 x
    …     …     …
    17 / 30

  76. Sequence Modeling in Historical Linguistics Multitiered Sequence Representation
    Multitiered Sequence Representation
    Multitiered sequence representations (sound classes, prosodic strings, etc.) are of great use in automatic sequence comparison, since they
    • guarantee comparability of otherwise incomparable alphabets, and
    • allow phonetic contexts to be modeled in a simple, universal, and objective way.
    17 / 30

  77. Automatic
    Sequence Comparison
    in Historical Linguistics
    18 / 30

  78. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    Sound-Class-Based Phonetic Alignment (SCA, List 2012ac & 2014)
    • IPA as input format
    • pairwise and multiple alignment
    • global, local, semi-global, diagonal, and secondary alignment modes
    • three different sound-class models (Dolgopolsky, SCA, ASJP)
    • empirically and theoretically inferred scoring functions for the sound-class alphabets
    • secondary alignment for the alignment of data containing word or morpheme boundaries (see List 2012c & 2014 for specifics)
    • multitiered sequence representation (prosodic strings)
    • procedure for the detection of swaps (metathesis) in multiple alignments (List 2012a)
    19 / 30

  79. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    INPUT SEQUENCES
      jabl̩ko
      jabəlka
      jabləkə
      japkɔ
    stage 1: SOUND-CLASS CONVERSION
      jabl̩ko → JAPLKU
      jabəlka → JAPELKA
      jabləkə → JAPLEKE
      japkɔ → JAPKU
    stage 2: LIBRARY CREATION
      JAP-LKU     JAPL-KU     JAPLKU     JAPEL-KA     ...
      JAPELKA     JAPLEKE     JAP-KU     JAP-LEKE     ...
    stage 3: DISTANCE CALCULATION
                  JAPLKU   JAPELKA   JAPLEKE   JAPKU
      JAPLKU      0.00     0.14      0.34      0.12
      JAPELKA     0.14     0.00      0.46      0.28
      JAPLEKE     0.34     0.46      0.00      0.44
      JAPKU       0.12     0.28      0.44      0.00
    stage 4: CLUSTER ANALYSIS
      [Figure: guide tree over JAPLKU, JAPELKA, JAPLEKE, JAPKU]
    stage 5: PROGRESSIVE ALIGNMENT
      J A P - L K U
      J A P E L K A
      JAPLEKE
      JAPKU
      MORE SEQUENCES? (yes / no)
    stage 6: ITERATIVE REFINEMENT
      J A P - L - K U
      J A P E L - K A
      J A P - L E K E
      JAPKU
    stage 7: SWAP CHECK
      J A P - L - K U
      J A P E L - K A
      J A P - L E K E
      J A P - - - K U
    stage 8: IPA CONVERSION
      J A P … → j a b …
      J A P … → j a b …
      J A P … → j a b …
      J A P … → j a p …
    OUTPUT MSA
      j a b - l̩ - k o
      j a b ə l - k a
      j a b - l ə k ə
      j a p - - - k ɔ
    20 / 30

  86. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    _
    INPUT SEQUEN-
    CES
    _
    jabl̩ko
    jabəlka
    jabləkə
    japkɔ
    stage 1
    SOUND-CLASS
    CONVERSION
    jabl̩ko → JAPLKU
    jabəlka → JAPELKA
    jabləkə → JAPLEKE
    japkɔ → JAPKU
    stage 2
    LIBRARY CREATI-
    ON
    JAP-LKU
    JAPELKA
    JAPL-KU
    JAPLEKE
    JAPLKU
    JAP-KU
    JAPEL-KA
    JAP-LEKE
    ... ...
    stage 3
    DISTANCE CAL-
    CULATION
    JAPLKU 0.00 0.14 0.34 0.12
    JAPELKA 0.14 0.00 0.46 0.28
    JAPLEKE 0.34 0.46 0.00 0.44
    JAPKO 0.12 0.28 0.44 0.00
    stage 4
    CLUSTER ANALY-
    SIS .
    .
    .
    JAPLKU
    JAPELKA
    .
    JAPLEKE
    .
    .
    JAPKU
    stage 5
    PROGRESSIVE
    ALIGNMENT
    J A P - L K U
    J A P E L K A
    JAPLEKE
    JAPKU
    MORE
    SEQUENCES?
    stage 6
    ITERATIVE REFI-
    NEMENT
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    JAPKU
    stage 7 SWAP CHECK
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    J A P - - - K U
    stage 8 IPA CONVERSION
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a p …
    OUTPUT MSA
    j a b - l̩ - k o
    j a b ə l - k a
    j a b - l ə k ə
    j a p - - - k ɔ
    yes
    no
    20 / 30

    View Slide

  87. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    stage 5
    PROGRESSIVE
    ALIGNMENT
    J A P - L K U
    J A P E L K A
    JAPLEKE
    JAPKU
    MORE
    SEQUENCES?
    stage 6
    ITERATIVE REFI-
    NEMENT
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    JAPKU
    yes
    no
    20 / 30

    View Slide

  88. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    _
    INPUT SEQUEN-
    CES
    _
    jabl̩ko
    jabəlka
    jabləkə
    japkɔ
    stage 1
    SOUND-CLASS
    CONVERSION
    jabl̩ko → JAPLKU
    jabəlka → JAPELKA
    jabləkə → JAPLEKE
    japkɔ → JAPKU
    stage 2
    LIBRARY CREATI-
    ON
    JAP-LKU
    JAPELKA
    JAPL-KU
    JAPLEKE
    JAPLKU
    JAP-KU
    JAPEL-KA
    JAP-LEKE
    ... ...
    stage 3
    DISTANCE CAL-
    CULATION
    JAPLKU 0.00 0.14 0.34 0.12
    JAPELKA 0.14 0.00 0.46 0.28
    JAPLEKE 0.34 0.46 0.00 0.44
    JAPKO 0.12 0.28 0.44 0.00
    stage 4
    CLUSTER ANALY-
    SIS .
    .
    .
    JAPLKU
    JAPELKA
    .
    JAPLEKE
    .
    .
    JAPKU
    stage 5
    PROGRESSIVE
    ALIGNMENT
    J A P - L K U
    J A P E L K A
    JAPLEKE
    JAPKU
    MORE
    SEQUENCES?
    stage 6
    ITERATIVE REFI-
    NEMENT
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    JAPKU
    stage 7 SWAP CHECK
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    J A P - - - K U
    stage 8 IPA CONVERSION
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a p …
    OUTPUT MSA
    j a b - l̩ - k o
    j a b ə l - k a
    j a b - l ə k ə
    j a p - - - k ɔ
    yes
    no
    20 / 30

    View Slide

  89. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    _
    INPUT SEQUEN-
    CES
    _
    jabl̩ko
    jabəlka
    jabləkə
    japkɔ
    stage 1
    SOUND-CLASS
    CONVERSION
    jabl̩ko → JAPLKU
    jabəlka → JAPELKA
    jabləkə → JAPLEKE
    japkɔ → JAPKU
    stage 2
    LIBRARY CREATI-
    ON
    JAP-LKU
    JAPELKA
    JAPL-KU
    JAPLEKE
    JAPLKU
    JAP-KU
    JAPEL-KA
    JAP-LEKE
    ... ...
    stage 3
    DISTANCE CAL-
    CULATION
    JAPLKU 0.00 0.14 0.34 0.12
    JAPELKA 0.14 0.00 0.46 0.28
    JAPLEKE 0.34 0.46 0.00 0.44
    JAPKO 0.12 0.28 0.44 0.00
    stage 4
    CLUSTER ANALY-
    SIS .
    .
    .
    JAPLKU
    JAPELKA
    .
    JAPLEKE
    .
    .
    JAPKU
    stage 5
    PROGRESSIVE
    ALIGNMENT
    J A P - L K U
    J A P E L K A
    JAPLEKE
    JAPKU
    MORE
    SEQUENCES?
    stage 6
    ITERATIVE REFI-
    NEMENT
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    JAPKU
    stage 7 SWAP CHECK
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    J A P - - - K U
    stage 8 IPA CONVERSION
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a p …
    OUTPUT MSA
    j a b - l̩ - k o
    j a b ə l - k a
    j a b - l ə k ə
    j a p - - - k ɔ
    yes
    no
    20 / 30

    View Slide

  90. Automatic Sequence Comparison in Historical Linguistics Automatic Phonetic Alignment
    Automatic Phonetic Alignment
    JAPKU
    stage 7 SWAP CHECK
    J A P - L - K U
    J A P E L - K A
    J A P - L E K E
    J A P - - - K U
    stage 8 IPA CONVERSION
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a b …
    J A P … → j a p …
    OUTPUT MSA
    j a b - l̩ - k o
    j a b ə l - k a
    j a b - l ə k ə
    j a p - - - k ɔ
    20 / 30

    View Slide

  91. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    INPUT:
    Multilingual wordlist
    → semantically tagged
    → phonetically transcribed
    → tokenized into phonemes
    OUTPUT:
    Multilingual wordlist
    → identified cognate entries assigned to clusters
    → identified cognate entries multiply aligned
    21 / 30

  94. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    Basic Procedure for Multilingual Cognate Detection
    WORDLIST DATA
      → PAIRWISE COMPARISON →
    PAIRWISE DISTANCES BETWEEN WORDS
      → COGNATE CLUSTERING →
    COGNATE SETS
    22 / 30

  95. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    Cognate Clustering
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA
    ...  ...        ...     ...    ...      ...
    21   German     Frau    woman  20       frau
    22   Dutch      vrouw   woman  20       vrɑu
    23   English    woman   woman  20       wʊmən
    24   Danish     kvinde  woman  20       kvenə
    25   Swedish    kvinna  woman  20       kviːna
    26   Norwegian  kvine   woman  20       kʋinə
    ...  ...        ...     ...    ...      ...
    22 / 30

  96. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    Cognate Clustering
                      Swedish  English  Danish  Norwegian  Dutch  German
                      kvinna   woman    kvinde  kvine      vrouw  Frau
    Swedish    kvina  0.00     0.69     0.07    0.12       0.71   0.78
    English    wumin  0.69     0.00     0.66    0.57       0.68   0.87
    Danish     kveni  0.07     0.66     0.00    0.08       0.67   0.71
    Norwegian  kwini  0.12     0.57     0.08    0.00       0.75   0.74
    Dutch      frou   0.71     0.68     0.67    0.75       0.00   0.17
    German     frau   0.78     0.87     0.71    0.74       0.17   0.00
    22 / 30

  99. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    Cognate Clustering
    German Frau frau
    Dutch vrouw vrou
    English woman wumin
    Danish kvinde kveni
    Swedish kvinna kvina
    Norwegian kvine kwini
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA     CogID
    ...  ...        ...     ...    ...      ...     ...
    21   German     Frau    woman  20       frau    1
    22   Dutch      vrouw   woman  20       vrɑu    1
    23   English    woman   woman  20       wʊmən   2
    24   Danish     kvinde  woman  20       kvenə   3
    25   Swedish    kvinna  woman  20       kviːna  3
    26   Norwegian  kvine   woman  20       kʋinə   3
    ...  ...        ...     ...    ...      ...     ...
    22 / 30
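
    The step from pairwise distances to cognate IDs can be sketched as a flat agglomerative clustering. The distance values below are the ones shown for the "woman" example. The average-linkage criterion and the cut-off of 0.5 are assumptions made for this sketch, since the slides do not state which linkage or threshold was used, and `flat_cluster` is a hypothetical helper name.

```python
from itertools import combinations

LANGUAGES = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]
DISTANCES = [
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]


def flat_cluster(distances, threshold):
    """Merge the closest clusters until the best merge reaches the threshold."""
    clusters = [[i] for i in range(len(distances))]

    def average_distance(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(distances[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        if average_distance(a, b) >= threshold:
            break  # remaining clusters are too far apart to merge
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return clusters


clusters = flat_cluster(DISTANCES, threshold=0.5)  # threshold is an assumption
cognate_sets = sorted(sorted(LANGUAGES[i] for i in c) for c in clusters)
print(cognate_sets)
```

    With these assumptions the partition agrees with the CogID column above: German and Dutch form one set, English stands alone, and Danish, Swedish, and Norwegian form a third set.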

  100. Automatic Sequence Comparison in Historical Linguistics Automatic Cognate Detection
    Automatic Cognate Detection
    LexStat Algorithm (List 2012b & 2014)
    [Flowchart: INPUT → TOKENIZATION → PREPROCESSING →
    CORRESPONDENCE DETECTION USING PHONETIC ALIGNMENT
    (LOOP: ATTESTED DISTRIBUTION, EXPECTED DISTRIBUTION, LOG-ODDS) →
    DISTANCE CALCULATION → COGNATE CLUSTERING → OUTPUT]
    22 / 30
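
    The idea of comparing an attested with an expected distribution can be illustrated with a generic log-odds score. This is only a sketch under stated assumptions: the word pairs are made-up toy strings, the pseudocount is arbitrary, and the formula is a plain log-odds ratio rather than the exact scoring function of LexStat. The names `correspondence_counts` and `log_odds` are hypothetical.

```python
import math
from collections import Counter


def correspondence_counts(aligned_pairs):
    """Count how often two sound classes share an alignment column."""
    counts = Counter()
    for left, right in aligned_pairs:
        for x, y in zip(left, right):
            if x != "-" and y != "-":  # ignore gapped columns
                counts[(x, y)] += 1
    return counts


def log_odds(attested, expected, pseudocount=0.5):
    """Positive scores mark correspondences that occur more often than expected."""
    pairs = set(attested) | set(expected)
    return {
        pair: math.log2((attested[pair] + pseudocount)
                        / (expected[pair] + pseudocount))
        for pair in pairs
    }


# Toy data (invented): aligned pairs with the same meaning versus
# aligned pairs that were combined at random.
attested = correspondence_counts([("TAK", "SAK"), ("TUN", "SUN")])
expected = correspondence_counts([("TAK", "MUN"), ("TUN", "SAK")])
scores = log_odds(attested, expected)
print(scores[("T", "S")], scores[("T", "M")])
```

    In this toy setting the recurring correspondence T : S receives a positive score and the chance match T : M a negative one, which is the kind of language-specific signal that a distance calculation can then build on.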

  101. Automatic Sequence Comparison in Historical Linguistics Implementation
    Implementation
    LingPy
    http://lingpy.org http://sequencecomparison.github.io
    23 / 30

  102. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: SCA (List 2012a, c & 2014)
    Gold Standard for Multiple Alignment Analyses
    750 multiple alignments (manually edited)
    50 089 words
    528 different languages and dialects
    8 language families
    encoded in IPA
    online at http://sequencecomparison.github.io
    24 / 30

  104. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: SCA (List 2012a, c & 2014)
    [Figure: Performance of the Sound-Class Based Phonetic Alignment Algorithm
    (Multiple Alignments). Two panels, column score (axis 84 to 91) and pair
    score (axis 97 to 99), each for the modes Basic, Library, Iterate, and
    Lib-Iter and the sound-class models DOLGO, ASJP, and SCA. Highlighted
    values: 92% and 99%.]
    24 / 30
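
    Column and pair scores of this kind can be computed by comparing a test alignment with a gold-standard alignment. The sketch below uses one common definition (the share of reference columns, or of reference residue pairs, that the test alignment reproduces). The exact formulas behind the figures in the talk may differ in detail, and the test alignment is an invented variant of the example alignment from the workflow slides.

```python
from itertools import combinations


def indexed_columns(alignment):
    """Turn each column into a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, symbol in enumerate(col):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def column_score(reference, test):
    """Share of reference columns that reappear unchanged in the test."""
    ref_cols = indexed_columns(reference)
    test_cols = set(indexed_columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


def aligned_pairs(alignment):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in indexed_columns(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(col), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def pair_score(reference, test):
    """Share of reference residue pairs that the test also aligns."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = [row.split() for row in [
    "J A P - L - K U", "J A P E L - K A", "J A P - L E K E", "J A P - - - K U"]]
# Invented test alignment: the L of the third word is shifted one column left.
test = [row.split() for row in [
    "J A P - L - K U", "J A P E L - K A", "J A P L - E K E", "J A P - - - K U"]]
print(column_score(reference, test), pair_score(reference, test))
```

    A single misplaced segment already breaks two of eight columns but only two of 33 residue pairs, which illustrates why pair scores tend to be higher than column scores for the same alignment.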

  105. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: SCA (List 2012a, c & 2014)
    Taxon Alignment
    Dashi t͡ʂ - ɯ - ²¹ p - e ²¹ - - -
    Eryuan - - - - - p - i ³¹ ʂ e ⁴²
    Gongxing d͡ʐ - i - ¹² b - i ²¹ - - -
    Heqing - - - - - p - i ³¹ sʰ e ⁴⁴
    Jianchuan - - - - - p - i ³¹ - - -
    Jianxing ʦ - ɯ - ³¹ p - e ²¹ - - -
    Lanping - - - - - p - ĩ ⁴² s e ⁴⁴
    Luobenzhuo ʥ - ỹ - ⁴² - - - - - - -
    Mazhelong ɕ - e n ⁵⁵ p - e ²¹ - - -
    Qiliqiao - - - - - p - i ³¹ s e ⁴⁴
    Tuoluo d j ɯ - ²¹ b - i ³⁵ - - -
    Yunlong - - - - - b j ɯ ²¹ s ɛ ⁵⁵
    Zhoucheng ʦ - ɯ - ⁰ p - e ²¹ - - -
    25 / 30

  106. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: LexStat (List 2012b & 2014)
    Gold Standard for Automatic Cognate Detection
    6 lexicostatistical datasets
    10 243 cognate sets
    95 different languages and dialects
    8 language families
    encoded in IPA
    online at http://sequencecomparison.github.io
    26 / 30

  108. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: LexStat (List 2012b & 2014)
    [Figure: Performance of Different Cognate Detection Algorithms. Methods:
    Turchin, NED, SCA, LexStat. Datasets: Bai (Tibeto-Burman), Indo-European,
    Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese Dialects).
    Axis from 60 to 95. Highlighted values: 75%, 93%, 92%, 81%, 89%, 81%.]
    26 / 30

  110. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: LexStat (List 2012b & 2014)
    Dataset by Kessler (2001)
    “dig” (30)                        Turchin  Levenshtein  LexStat
    Albanian  gërmon      gərmo       1        1            1
    English   digs        dɪg         2        2            2
    French    creuse      krøze       1        3            3
    German    gräbt       graːb       1        1            4
    Hawaiian  ‘eli        ʔeli        5        5            5
    Navajo    hahashgééd  hahageːd    6        6            6
    Turkish   kazıyor     kaz         7        3            7
    27 / 30
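
    The grouping in the Turchin column can be reproduced with a small sketch of consonant-class matching, in which two words fall into the same cluster when their first two consonant classes agree. The class table below is a toy assumption in the style of Dolgopolsky's sound classes and covers only the consonants of these seven words. `CONSONANT_CLASSES` and both function names are hypothetical.

```python
# Toy consonant classes (assumption): velars -> K, liquids -> R,
# laryngeals -> H, and so on; vowels and length marks are not listed.
CONSONANT_CLASSES = {
    "g": "K", "k": "K", "r": "R", "l": "R", "m": "M", "d": "T",
    "z": "S", "b": "P", "ʔ": "H", "h": "H",
}


def first_two_classes(form):
    """Return the classes of the first two consonants of a transcription."""
    classes = [CONSONANT_CLASSES[s] for s in form if s in CONSONANT_CLASSES]
    return "".join(classes[:2])


def cluster_by_consonant_classes(words):
    """Give the same cluster ID to words with identical class prefixes."""
    ids, result = {}, {}
    for language, form in words.items():
        key = first_two_classes(form)
        ids.setdefault(key, len(ids) + 1)
        result[language] = ids[key]
    return result


words = {
    "Albanian": "gərmo", "English": "dɪg", "French": "krøze",
    "German": "graːb", "Hawaiian": "ʔeli", "Navajo": "hahageːd",
    "Turkish": "kaz",
}
turchin_like = cluster_by_consonant_classes(words)
print(turchin_like)
```

    Albanian, French, and German all start with the class sequence KR and end up in one cluster, the same partition as in the Turchin column (the numeric labels differ). The example also shows the weakness of the approach: the three forms are grouped although the other two methods keep at least French apart.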

  111. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Evaluation: LexStat (List 2012b & 2014)
    Dataset by Kessler (2001)
    “mouth” (104)                 Turchin  Levenshtein  LexStat
    Albanian  gojë     goj        1        1            1
    English   mouth    mauθ       2        2            2
    French    bouche   buʃ        3        3            3
    German    Mund     mund       4        4            2
    Hawaiian  waha     waha       5        5            5
    Navajo    ’azéé’   zeːʔ       6        6            6
    Turkish   ağız     aɣz        7        7            7
    27 / 30
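
    The Levenshtein column rests on edit distance. A minimal sketch of a normalized edit distance follows, assuming unit costs and normalization by the length of the longer word. The talk does not spell out its normalization, so this is one common variant, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


ned_mouth = normalized_edit_distance("mauθ", "mund")
print(ned_mouth)
```

    English mauθ and German mund share only the initial m on the surface, so three of four segments differ. A purely surface-based distance therefore separates the two words, whereas the LexStat column above places them in the same cognate set.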

  112. Concluding Remarks
    28 / 30

  113. Automatic Sequence Comparison in Historical Linguistics Evaluation
    Concluding Remarks
    The techniques for automatic sequence comparison in historical
    linguistics have greatly advanced during the last decade, and they
    are at a stage where they can actively help linguists in studying
    dialectal variation or carrying out initial analyses of understudied
    languages.
    There is, however, still room for improvement. So far, we cannot
    properly handle the major processes of lexical change, such as
    semantic shift, morphological processes, or borrowing.
    29 / 30

  114. Thanks for Your Attention!
    30 / 30
