Computer-Assisted Approaches to Linguistic Reconstruction

Talk, held at the "Workshop on the regularity of sound change" (2017/07/20-21, Cologne, University of Cologne)

Johann-Mattis List

July 21, 2017

Transcript

  1. Computer-Assisted Approaches to Linguistic Reconstruction
    A Case Study from the Burmish Languages
    Johann-Mattis List¹ and Nathan W. Hill²
    2017-07-21
    ¹ Max Planck Institute for the Science of Human History, ² SOAS, University of London

  2. Outline
    1. Introduction
    2. Phonological Reconstruction
    3. Computer-Assisted Phonological Reconstruction
    4. Examples
    5. Outlook

  3. Introduction

  7. Etymology 2.0?
    Historical linguistics after the quantitative turn:
    • Quantitative methods in historical linguistics have received much
    attention of late,
    • but only a few (if any) of the new methods have addressed
    long-standing problems of classical linguistics,
    • and as a result, many classical linguists are very sceptical of the new
    approaches.

  10. Etymology 3.0?
    Towards a “qualitative turn” in computational historical linguistics:
    • Instead of blaming computers for our misery (no funding, institutes
    are being shut down, etc.), we should start seeing computers as a
    chance to address the important questions which we have not solved
    in 200 years of research...
    • But we don’t need a framework in which computers do our work for us;
    instead, we need a framework in which we tell computers to do some
    work for us, in order to render our research more explicit, more
    efficient, and more rigorous.

  13. Computer-Assisted Language Comparison
    CALC (MPI-SHH, Jena)
    • computational formalization of the classical methods for historical
    language comparison
    • establish a close collaboration between computational and classical
    historical linguistics by providing data in human- and
    machine-readable form
    ASIA (SOAS, London)
    • reconstruction of Proto-Burmish

  14. The Burmish Languages
    Hill and List (forthcoming)

  18. The Burmish Etymological Database (BED)
    Problem: current etymological accounts of Proto-Burmish have
    many shortcomings (no lexical reconstruction, insufficient
    phonological reconstruction, unclear data, non-transparent
    methodology)
    Goal: compile an etymological database of Proto-Burmish
    Procedure: BED as litmus test for CALC:
    • make the Proto-Burmish Etymological Database
    project a first test for the CALC framework
    • use existing computational methods to pre-analyze
    the data
    • develop interfaces to allow for correction and
    inspection by the experts

  23. Where are you now with BED?
    ⊠ Develop computer-assisted workflows to create and curate data in
    human- and machine-readable form.
    ⊠ Develop computer-assisted workflow for partial cognate detection
    and alignments.
    ⊟ Develop methods for automatic phonological reconstruction and
    workflows for the correction of the results by experts (→ THIS
    TALK).
    □ Develop methods for lexical reconstruction (first ideas, not shown in
    this talk).
    □ Write a data-driven etymological dictionary (table of contents is
    half-written).

  24. Phonological Reconstruction

  28. What is phonological reconstruction?
    • Phonological reconstruction is primarily understood as the
    reconstruction of the sound system of a language not reflected in
    written sources.
    • More specifically, however, we see phonological reconstruction as the
    task of reconstructing major patterns of sound change which allow
    us to reconstruct tentative proto-forms from cognate sets, regardless
    of whether those words were really present in the Ursprache or what
    those forms meant.
    • The task of lexical reconstruction follows phonological
    reconstruction in projecting full lexemes to the Ursprache, thereby
    also assessing their meaning, and whether it is reasonable to
    reconstruct them at all.

  33. Classical Workflow (Comparative Method)
    • assemble cognate sets and sound correspondences by comparing
    data on different languages
    • infer sound change processes (“sound laws”) from the inferred sound
    correspondence patterns
    • explain exceptions by:
      • refining inferred sound change processes (cf. “Verner’s law”)
      • borrowing (“substratum influence”)
      • analogy (leftovers)

  38. Computer-Based Automatic Approaches
    Problems of computer-based approaches:
    (a) fail to model sound change as a systemic process (each column of an
    alignment is counted independently)
    (b) fail to make use of linguistic knowledge on the directionality of
    sound change processes and have to rely on phylogenies
    (c) fail to handle unattested sounds, as only sounds which are in the
    data can be reconstructed

  39. Computer-Assisted Phonological Reconstruction

  44. General Workflow
    Preliminary steps:
    • partial cognate detection and partial phonetic alignment (List et al.
    2016) with manual refinement
    • preliminary identification of cross-semantic cognates based on partial
    colexifications (Hill and List forthcoming)
    Phonological reconstruction:
    • sound correspondence pattern identification (List et al. in prep.)
    • automatic reconstruction using weighted directed networks (→ this
    talk)

  45. Detailed Workflow: Preliminary Steps
    [Figure: pairwise distance scores between word parts from Fúzhōu, Měixiàn, Wēnzhōu, and Běijīng, with the resulting groups labelled A to D]
    Partial cognate detection: List, Lopez, and Bapteste (2016)

  46. Detailed Workflow: Preliminary Steps
    EDICTOR tool: List (2017)

  47. Detailed Workflow: Preliminary Steps
    Language | 'mountain' | 'dog' | 'thunder' | 'wolf' | 'bear (n.)'
    Atsi | pum⁵¹ | kʰui²¹ | mau²¹ mjiŋ⁵¹ | vam⁵¹ kʰui²¹ mo⁵⁵ | vam⁵¹
      gloss | mountain | dog | sky + thunder | bear + dog + m-suff. | bear
    Bola | pam⁵⁵ | kʰui³⁵ | mau³¹ mjaŋ⁵⁵ | mjaŋ⁵⁵ kʰui³⁵ | vɛ⁵ ⁵⁵
      gloss | mountain | dog | sky + thunder | thunder + dog | bear
    Lashi | pɔm³¹ | kʰui⁵⁵ | mou³³ kɔm³³ | wɔm³¹ kʰui⁵⁵ | wɔm³¹
      gloss | mountain | dog | sky + thunderB | bear + dog | bear
    Maru | pam³¹ | lə¹ ³¹ kʰa³⁵ | muk⁵⁵ kum³¹ | mjaŋ³¹ kʰa³⁵ | vɛ⁵ ³¹
      gloss | mountain | ? + dog | sky + thunderB | thunder + dog | bear
    Achang | pum⁵⁵ | xui³¹ | mau³¹ ʐau³¹ | pum⁵⁵ xui³¹ | ɔm⁵⁵
      gloss | mountain | dog | sky + thunderC | mountain + dog | bear
    Morpheme Annotation (Hill and List forthc.)

  48. Detailed Workflow: Sound Correspondence Pattern Inference
    Clackson (2007: 37)

  52. Detailed Workflow: Sound Correspondence Pattern Inference
    Sound Correspondence Patterns and Phonological Reconstruction:
    • the most traditional way to reconstruct in the comparative-method
    framework is to infer patterns of regular sound correspondences
    across a set of languages and then assign proto-forms for each
    distinct pattern
    • correspondence patterns are usually inferred manually, by inspecting
    “correspondence sets” (Clackson 2007: 29f) of words (i.e., cognate
    sets with recurring sounds)
    • the main problem of correspondence pattern identification is the
    handling of missing data, since not all cognate sets will necessarily
    contain reflexes from each of the languages under investigation

  56. Detailed Workflow: Sound Correspondence Pattern Inference
    Graphs of Compatible Correspondence Sets:
    • the main idea for the correspondence pattern inference algorithm is
    to derive a graph from correspondence sets in which each individual
    correspondence set (a site in an aligned cognate set) is a node, and
    links between nodes are drawn between compatible correspondence
    sets
    • if two correspondence sets are compatible, this means that they have
    identical non-missing values for at least one language and no
    conflicting data for any of the languages
    • if two or more correspondence sets are compatible, we can impute
    missing values by combining them

  58. Detailed Workflow: Sound Correspondence Pattern Inference
    Cognate Set L1 L2 L3 L4 L5 L6 L7 L8
    “hand-1” p p p f f p
    “foot-1” p p p p f f p p
    ⊠ compatible
    □ incompatible
    Cognate Set L1 L2 L3 L4 L5 L6 L7 L8
    “hand-1” p p p f f p
    “leg-1” p p f pf f f p p
    □ compatible
    ⊠ incompatible
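
The compatibility test illustrated in the two tables above can be sketched in Python. This is a minimal sketch and not the authors' implementation: the use of `None` for a missing reflex, the function names `compatible` and `merge`, and the positions of the two gaps in the “hand-1” row (which the transcript does not preserve) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# One entry per language (L1..L8); None marks a language without a reflex.
CorrespondenceSet = Sequence[Optional[str]]


def compatible(a: CorrespondenceSet, b: CorrespondenceSet) -> bool:
    """True if a and b share at least one attested language and never conflict."""
    shared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue          # missing data can never cause a conflict
        if x != y:
            return False      # conflicting sounds in the same language
        shared += 1
    return shared > 0


def merge(a: CorrespondenceSet, b: CorrespondenceSet) -> List[Optional[str]]:
    """Impute missing values by combining two compatible sets."""
    if not compatible(a, b):
        raise ValueError("cannot merge incompatible correspondence sets")
    return [x if x is not None else y for x, y in zip(a, b)]


# Toy rows modelled on the slide; the gap positions in `hand` are assumed.
hand = ["p", "p", None, "p", "f", "f", None, "p"]
foot = ["p", "p", "p", "p", "f", "f", "p", "p"]
leg = ["p", "p", "f", "pf", "f", "f", "p", "p"]

hand_foot = compatible(hand, foot)   # no conflict in any shared language
hand_leg = compatible(hand, leg)     # L4 has "p" in one set and "pf" in the other
merged = merge(hand, foot)           # gaps in `hand` are filled from `foot`
```

Under these assumed gap positions, “hand-1” and “foot-1” agree wherever both are attested, while “hand-1” and “leg-1” conflict in L4.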

  59. Detailed Workflow: Sound Correspondence Pattern Inference
    [Figure: network of correspondence sets, with nodes labelled by sounds such as s, k, x, ɣ, ʃ, ts, t, n, and kʰ]

  60. Detailed Workflow: Sound Correspondence Pattern Inference
    [Figure: network of correspondence sets, highlighting one good correspondence set and one bad correspondence set]

  61. Detailed Workflow: Sound Correspondence Pattern Inference
    Only fully compatible clusters (i.e., only cliques in our network of
    correspondence sets) can represent true sound correspondence patterns
    (if sound change is regular).

  66. Detailed Workflow: Sound Correspondence Pattern Inference
    Sound Correspondence Pattern Inference as a Clique Cover Problem:
    • The clique cover problem (also called clique partitioning problem,
    see Bhasker 1991) is the inverse of the famous graph coloring
    problem and has been shown to be NP-hard.
    • The goal of the problem is to split a graph into the smallest number
    of cliques in which each node is represented by exactly one clique.
    • We assume (but we cannot formally prove it) that the clique cover
    of our graph of compatible correspondence sets will correspond to
    the optimal set of sound correspondence patterns in our data.
    • By applying an approximation algorithm to infer a near-optimal
    clique cover of our data of aligned cognate sets, we can infer the
    most frequently recurring correspondence patterns in our data.
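
A clique partition of the compatibility graph can be approximated with a simple greedy pass, sketched below. The slides do not say which approximation algorithm is used, so this heuristic is only a stand-in; the names `compatible` and `greedy_clique_partition`, the `None` marker for missing reflexes, and the toy data are assumptions for illustration.

```python
from typing import List, Optional, Sequence

Pattern = Sequence[Optional[str]]  # one sound (or None) per language


def compatible(a: Pattern, b: Pattern) -> bool:
    """At least one shared attested language, and no conflicting sounds."""
    shared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        if x != y:
            return False
        shared += 1
    return shared > 0


def greedy_clique_partition(sets: List[Pattern]) -> List[List[int]]:
    """Assign every correspondence set to exactly one clique (greedy heuristic)."""
    # Visit well-attested sets first so that sparse sets can join existing cliques.
    order = sorted(range(len(sets)),
                   key=lambda i: -sum(v is not None for v in sets[i]))
    cliques: List[List[int]] = []
    for i in order:
        for clique in cliques:
            # A set may join only if it is compatible with every member,
            # which keeps each cluster a true clique of the graph.
            if all(compatible(sets[i], sets[j]) for j in clique):
                clique.append(i)
                break
        else:
            cliques.append([i])  # no existing clique fits: open a new one
    return cliques


# Invented toy data for four languages (not taken from the Burmish data).
toy_sets = [
    ["p", "p", None, "p"],
    ["p", None, "p", "p"],
    ["f", "f", "f", None],
    [None, "f", "f", "f"],
    ["p", "p", "p", "p"],
]
cliques = greedy_clique_partition(toy_sets)
```

Because the pass is greedy, it yields a valid partition into cliques but not necessarily the smallest one; the number of cliques found is an upper bound on the optimum.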

  71. Detailed Workflow: Automatic Reconstruction
    We can do without trees!
    • Phonological reconstruction in the comparative-method framework
    usually starts from correspondence patterns.
    • Apart from very few exceptions, it does not require the knowledge of
    any specific phylogeny for the language family under investigation
    (at least not for most consonants).
    • What it requires, however, is to know the major sound change
    transitions, which have strong directional preferences for consonants
    (much less for vowels and tones).
    • We can use this knowledge as a proxy to select which of the sounds
    in a given correspondence pattern is the best candidate for the
    proto-sound.

  72. Detailed Workflow: Automatic Reconstruction
    [Figure: sound-change network over vowels and consonants]

  73. Detailed Workflow: Automatic Reconstruction
    [Figure sequence: sub-graphs of the sound-change network for the sounds s, ts, ʃ, tʃ, and ʂ]
    Examples shown in the figures:
    • a pattern with the reflexes s, ts, and ʂ is reconstructed as *ts
    • a pattern with the reflexes tʃ, ts, s, and ʂ is reconstructed as *tʃ/ts

  86. Detailed Workflow: Automatic Reconstruction
    Automatic Reconstruction Strategy:
    1. extract the sub-graph from the sound-change graph for each distinct
    sound in a given correspondence pattern,
    2. search for a potential source in the sub-graph, i.e., a sound that has
    no ancestor,
    3. if
      • there is exactly one source, select it as the proto-form,
      • there are multiple sources, select all of them as proto-forms,
      • the graph is disconnected or no source can be found (loops in the
        graph), select the most frequently recurring form as a potential
        proto-form (“majority rules”),
    4. label the “quality” of the respective proto-form, specifically marking
    correspondence patterns which occur only once in the data,
    5. have the expert clean up the mess.
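
Steps 1 to 3 of the strategy above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the sound-change graph is represented as a dictionary from a sound to the set of sounds it can turn into, the transitions in `toy_changes` are invented for illustration, and the names `weakly_connected` and `reconstruct` are hypothetical. Quality labelling (step 4) and expert review (step 5) are left out.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Set, Tuple


def weakly_connected(nodes: Set[str], edges: Dict[str, Set[str]]) -> bool:
    """Check connectivity while ignoring the direction of the edges."""
    neighbours: Dict[str, Set[str]] = {n: set() for n in nodes}
    for source, targets in edges.items():
        for target in targets:
            neighbours[source].add(target)
            neighbours[target].add(source)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes


def reconstruct(pattern: Sequence[Optional[str]],
                changes: Dict[str, Set[str]]) -> Tuple[List[str], str]:
    """Propose proto-sound(s) for one pattern with at least one attested reflex."""
    reflexes = [s for s in pattern if s is not None]
    sounds = set(reflexes)
    # Step 1: sub-graph restricted to the sounds attested in the pattern.
    sub = {s: {t for t in changes.get(s, set()) if t in sounds and t != s}
           for s in sounds}
    # Step 2: sources are sounds that no other attested sound can turn into.
    with_ancestor = {t for targets in sub.values() for t in targets}
    sources = sorted(sounds - with_ancestor)
    # Step 3: one source, several sources, or fall back to "majority rules".
    if sources and weakly_connected(sounds, sub):
        return sources, "source" if len(sources) == 1 else "multiple sources"
    return [Counter(reflexes).most_common(1)[0][0]], "majority"


# Invented transitions (source -> possible outcomes), for illustration only.
toy_changes = {"ts": {"s"}, "tʃ": {"ʃ", "ʂ"}, "s": {"ʂ"}, "ʃ": {"ʂ"},
               "x": {"ɣ"}, "ɣ": {"x"}}

one_source = reconstruct(["s", "ts", "ts", "ʂ", "s", "s", "ts"], toy_changes)
two_sources = reconstruct(["tʃ", "ts", "s", "ʂ", "ts", "ts"], toy_changes)
loop = reconstruct(["x", "ɣ", "x"], toy_changes)
disconnected = reconstruct(["s", "ʃ", "s", None], toy_changes)
```

The design choice worth noting is that no phylogeny enters the computation: the only inputs are the reflexes of one correspondence pattern and the directed transitions supplied by the linguist.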

  90. Detailed Workflow: Automatic Reconstruction
    Advantage of the Approach:
    (a) ⊠ systemic aspects of sound change are integrated into the
    correspondence pattern detection algorithm
    (b) ⊠ linguistic knowledge (even language-specific knowledge) is
    exhaustively used to construct the sound-change networks
    (c) □ unattested sounds need to be manually handled by assigning them
    to specific correspondence patterns

  91. Examples

  92. General Findings
    Basic Statistics:
    • 8 languages
    • 240 concepts
    • 855 partial cognate sets
    • 728 cross-semantic partial cognate sets
    • 218 valid cognate sets (with more than two reflexes)
    • 104 initial consonant patterns (48 with more than one reflex, the
    rest highly irregular)
    • well-reconstructed proto-sounds:
      stops and affricates: *k, *kʰ, *t, *tʰ, *tʃ, *tʃʰ, *ts, *tsʰ, *p, *pʰ
      fricatives: *s, *ʃ, *x
      liquids and j: *r, *l, *j
      nasals: *ŋ, *n, *m

  93. Specific Findings: “black” and “dark”
    Language black dark
    Old Burmese n a k - ∅
    Rangoon n ɛ ʔ ⁴ ∅
    Achang l ɔ k ⁵⁵ ∅
    Xiandao n ɔ ʔ ⁵⁵ ∅
    Atsi n o ʔ ²¹ n o ʔ ²¹
    Bola n a ʔ ³¹ n a ʔ ³¹
    Lashi n ɔː ʔ ³¹ ∅
    Maru n ɔ ʔ ³¹ n ɔ ʔ ³¹
    Proto-Burmish n *a k ³¹ *n *a|*ṵ ʔ|*k ³¹
    [i] The discrepancy in the reconstructions for these two forms, which were
    regularly recognized as cognate in all languages, is due to the insufficient
    reconstruction by the prototype, which takes each correspondence set
    independently rather than summarizing all possible reflexes for accepted
    cross-semantic cognate sets.

  94. Specific Findings: “middle” and “outside”
    Language middle outside (“out-middle”)
    Old Burmese ∅ ∅
    Rangoon ∅ ∅
    Achang k u - ŋ ⁵⁵ ∅
    Xiandao k o - ŋ ⁵⁵ ∅
    Atsi k u - ŋ ²¹ ∅
    Bola k a u ŋ ³¹ k a u ŋ ³¹
    Lashi k u - ŋ ³¹ ∅
    Maru k a u ŋ ³⁵ k a u ŋ ³⁵
    Proto-Burmish *k u - *ŋ ⁵⁵ ∅
    [i] The algorithm refuses to reconstruct the morpheme “middle” in the
    word “outside”, as it occurs only twice. If the prototype, again,
    reconstructed only once per cross-semantic cognate set, the results would
    be the same.

  95. Specific Findings: “tree” and “wood”
    Language tree wood
    Old Burmese s a ts - s a ts -
    Rangoon tθ i ʔ ⁴ tθ i ʔ ⁴
    Achang s a ŋ ³¹ ʂ ə k ⁵⁵
    Xiandao ʂ ɯ k ⁵⁵ ∅
    Atsi s i k ⁵⁵ s i k ⁵⁵
    Bola s a k ⁵⁵ s a k ⁵⁵
    Lashi s ə̰ k ⁵⁵ s ə̰ k ⁵⁵
    Maru s a̰ k ⁵⁵ s a̰ k ⁵⁵
    Proto-Burmish s a k ⁵⁵ *s ə̰ *k ⁵⁵
    [i] Apart from the vowel, which is marked as irregular in the reconstruction
    for “wood”, this reconstruction is regular and also easy to compare
    with other Sino-Tibetan reflexes. The reconstruction for “tree”, on the
    other hand, is irregular, due to the wrong cognate assignment for Achang.

  96. Outlook

  97. We are only beginning to explore the potential of sound correspondence
    pattern analysis as a backbone for automatic linguistic reconstruction.
    Even now, it is straightforward to manually annotate all
    different sound correspondence patterns which we could infer from
    the data.
    To test the full potential of the approach, we will have to drastically
    increase the number of lexical items in our data, but even in this
    state, the approach is promising, and can serve as a starting point
    for a classical phonological reconstruction analysis.

  98. Thank you for your attention!