Computer-Assisted Language Comparison ... and its Application to Chinese Dialectology

Presentation given at the séminaire du CRLAO (Centre des recherches linguistiques sur l'Asie Orientale), Paris.

Johann-Mattis List

February 04, 2015
Transcript

  1. Computer-Assisted Language Comparison...
    ... and its Application to Chinese Dialectology
    Johann-Mattis List
    DFG Research Fellow
    Centre des recherches linguistiques sur l’Asie Orientale
    Team AIRE (Adaptation, Integration, Reticulation, Evolution) at UPMC
    2015-02-04
    1 / 50

  2. Murder on the Orient Express…
    2 / 50

  5. - 1 victim (dead)
    - 12 stab wounds
    - all stab wounds differ from each other
    - 12 suspects
    - all have an alibi
    ?
    2 / 50

  6. Eh oui, Hastings, it seems to me that there is only one possible
    solution, no matter how impossible it may seem: There was not one
    murderer, but twelve…
    2 / 50

  7. Well, why not? My reasoning has always been based on the assumption
    that when you have excluded the impossible, whatever remains, however
    improbable, must be the truth.
    2 / 50

  8. Aspect            Criminology                                  Historical Linguistics
    Goal               find the murderer                            find the proto-language
    Proceeds by        reconstructing the circumstances of a crime  reconstructing the history of a language
    Method             inference based on circumstantial evidence   inference based on circumstantial evidence
    Mode of Reasoning  abduction                                    abduction
    2 / 50

  9. Criminology Historical Linguistics
    *ʔˁak
    2 / 50

  10. Mode of Reasoning in Historical Linguistics
    The starting point of historical linguistics is synchronic language
    data. Based on models of language and sound change, an inference of
    relations among different data points is carried out. An analysis of
    these relations yields historical scenarios that explain the current
    structure of the data.
    2 / 50

  12. Traditional Historical Linguistics
    3 / 50

  13. Traditional Historical Linguistics Characteristics
    Characteristics
    4 / 50

  22. Traditional Historical Linguistics Characteristics
    Research Object
    Language             Alignment
    German               ʦ   aː  n   -   -
    Proto-Germanic       t   a   n   θ   -
    English              t   ʊː  -   θ   -
    Proto-Indo-European  d   e   n   t   -
    Italian              d   ɛ   n   t   e
    Proto-Romance        d   e   n   t   e
    French               d   ɑ̃   -   -   -
    5 / 50

  23. Traditional Historical Linguistics Characteristics
    Research Object
    History
    - individual events (description)
    - individual processes (description, inference)
    - general processes (modeling, analysis)
    Language History
    - individual language states (description of sound system, grammar, lexicon)
    - individual instances of language development (description and inference of sound change patterns, grammaticalization, lexical change)
    - general language development (modeling and analysis of sound change processes, grammaticalization, lexical change)
    6 / 50

  24. Traditional Historical Linguistics Characteristics
    Research Object
    Internal Language History (ontogenesis)
    - etymology
    - historical grammar
    - historical phonology
    External Language History (phylogenesis)
    - linguistic reconstruction (modeling and inference)
    - proof of language relationship (inference)
    - genetic classification (analysis)
    General Tendencies in Language History
    - processes and mechanisms of sound change
    - grammaticalization
    - lexical change
    7 / 50

  25. Traditional Historical Linguistics Characteristics
    Origins
    Uniformitarianism
    - “universality of change” – change is independent of time and space
    - “graduality of change” – change is neither abrupt nor chaotic
    - “uniformity of change” – change is not heterogeneous, but uniform
    Founding Fathers
    - Franz Bopp (1791–1867): language comparison (Bopp 1816)
    - Rasmus Rask (1787–1832) and Jacob Grimm (1785–1863): sound law (Rask 1818, Grimm 1822)
    - August Schleicher (1821–1868): family tree and linguistic reconstruction (Schleicher 1853 & 1861)
    8 / 50

  26. Traditional Historical Linguistics Achievements
    Achievements
    9 / 50

  27. Traditional Historical Linguistics Achievements
    Methods and Theories
    - Comparative Method (Meillet 1925): basic procedure for proving language relationship and reconstructing unattested ancestral language states, etymologies, and genetic classifications.
    - Family Tree Model and Wave Theory (Schleicher 1853, Schmidt 1872): two partially incompatible models to describe historical language relations.
    - Regularity Hypothesis (Osthoff & Brugmann 1878): fundamental working hypothesis that states that certain sound change processes proceed regularly (universally, gradually, and in a uniform manner).
    10 / 50

  28. Traditional Historical Linguistics Achievements
    Comparative Method
    - proof of relationship
    - identification of cognates
    - identification of sound correspondences
    - reconstruction of proto-forms
    - internal classification
    11 / 50

  34. Traditional Historical Linguistics Achievements
    Comparative Method
    Cognate List and Alignment
    Language  Word   Alignment
    German    dünn   d  ʏ   n
    English   thin   θ  ɪ   n
    German    Ding   d  ɪ   ŋ
    English   thing  θ  ɪ   ŋ
    German    dumm   d  ʊ   m
    English   dumb   d  ʌ   m
    German    Dorn   d  ɔɐ  n
    English   thorn  θ  ɔː  n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x ?
    n    n    2 x
    m    m    1 x
    ŋ    ŋ    1 x
    11 / 50
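
The step from aligned cognates to a correspondence list can be sketched in a few lines of Python. This is only an illustration of the counting, not the procedure used in the talk: the names `ALIGNED_PAIRS`, `CONSONANTS` and `count_correspondences` are made up here, and restricting the count to consonants is an assumption based on the fact that the slide lists consonant correspondences only.

```python
from collections import Counter

# Aligned German/English cognate pairs from the slide, one segment per column.
ALIGNED_PAIRS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ʊ", "m"], ["d", "ʌ", "m"]),    # dumm / dumb
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Assumption: only consonant columns are counted, as on the slide.
CONSONANTS = {"d", "θ", "n", "m", "ŋ"}


def count_correspondences(pairs):
    """Count how often each German/English consonant pair shares a column."""
    counts = Counter()
    for german, english in pairs:
        for g, e in zip(german, english):
            if g in CONSONANTS and e in CONSONANTS:
                counts[(g, e)] += 1
    return counts


COUNTS = count_correspondences(ALIGNED_PAIRS)
for (g, e), n in COUNTS.most_common():
    print(f"{g}  {e}  {n} x")
```

The single d : d match (dumm / dumb) stands out against three instances of d : θ, which is the kind of irregularity the question mark on the slide points to.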

  36. Traditional Historical Linguistics Achievements
    Insights
    Internal Language History
    Thanks to historical linguistics, the history of a considerable (but still
    small) number of languages has been thoroughly investigated.
    External Language History
    Thanks to historical linguistics, a considerable number of the world's
    languages have been genetically classified (although many questions
    remain unresolved or controversial).
    General Language History
    Some work on general processes of language history has been done, yet
    many questions remain unresolved or controversial.
    12 / 50

  37. Traditional Historical Linguistics Problems
    Problems
    13 / 50

  39. Traditional Historical Linguistics Problems
    Transparency
    Part of the process of “becoming” a competent
    Indo-Europeanist has always been recognized as coming to
    grasp “intuitively” concepts and types of changes in language
    so as to be able to pick and choose between alternative
    explanations for the history and development of specific
    features of the reconstructed language and its offspring.
    Schwink (1994)
    14 / 50

  41. Traditional Historical Linguistics Problems
    Applicability
    – 7,106 languages (Lewis & Fennig 2013)
    – 147 language families (ibid.)
    – 25,244,065 pairs of languages which could be compared
    15 / 50
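
The last figure is the number of unordered pairs among 7,106 languages, n(n - 1)/2. A quick check in Python (the function name `language_pairs` is just an illustrative choice):

```python
import math

N_LANGUAGES = 7106  # number of languages cited on the slide (Lewis & Fennig 2013)


def language_pairs(n):
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return math.comb(n, 2)


print(language_pairs(N_LANGUAGES))  # 25244065
```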

  42. Traditional Historical Linguistics Problems
    Applicability
    The amount of digitally available data for the languages of the world
    is growing from day to day, while there are only a few historical
    linguists who are trained to carry out the comparison of these
    languages. It seems impossible to handle this task when relying only
    on the time-consuming manual procedures developed in traditional
    historical linguistics.
    15 / 50

  44. Traditional Historical Linguistics Problems
    Adequacy
    One time is never, two times is ever!
    (a mathematician friend on the treatment of probability in
    Indo-European linguistics)
    16 / 50

  46. Traditional Historical Linguistics Problems
    Comparability
    Frucht (“fruit”). Feminine noun, standard vocabulary (9th c.), MHG
    vruht, OHG fruht, OS fruht. Borrowed from Latin frūctus (masc.) with
    the same meaning (related to Latin fruī “enjoy”). The German word
    became feminine by analogy with the ti-abstracts such as Flucht² etc.
    Adjectives: fruchtig, fruchtbar; verb: (be-)fruchten. Likewise Modern
    Dutch vrucht, Modern English fruit, Modern French fruit, Modern
    Swedish frukt, Modern Norwegian frukt; frugal.
    (Kluge and Seebold 2002)
    17 / 50

  47. Traditional Historical Linguistics Problems
    Comparability
    [Figure: graph of the word forms below; legend of edge types: inherited from, borrowed from, derived from]
    PIE *bhreu̯Hg̑- “to use”
    PIE *bhruHg̑-ié- “to use” (present tense)
    PGM *ƀrūkan- “to use”
    OHG brūhhan “to use”
    G brauchen “to use”
    G Brauch “custom”
    OHG fruht “profit, fruit”
    G frugal “nourishing”
    Fr fruit “profit, fruit”
    Fr frugal “modest (food)”
    Lt fruor, fruī “I enjoy”
    Lt frūctus “profit”
    Lt frux “fruit, grain”
    Lt frugalis “bring profit”
    17 / 50
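
A formal counterpart to the prose dictionary entry could look like the following sketch, which stores etymological relations as typed edges. It is an assumption-laden illustration: the `Edge` type and the `ancestors` helper are hypothetical, and only two edges are encoded, both read off the dictionary entry for Frucht (borrowed from Latin frūctus, which belongs to fruī). Labelling the second relation "derived from" is an interpretation, and the remaining edges of the figure are not recoverable from the transcript.

```python
from collections import namedtuple

# A typed, directed edge: `source` stands in `relation` to `target`.
Edge = namedtuple("Edge", ["source", "relation", "target"])

# The three edge types named in the figure legend.
RELATIONS = {"inherited from", "borrowed from", "derived from"}

# Only the two relations stated in the dictionary entry are encoded here.
EDGES = [
    Edge("OHG fruht", "borrowed from", "Lt frūctus"),
    Edge("Lt frūctus", "derived from", "Lt fruor, fruī"),
]


def ancestors(word, edges):
    """Follow edges upward and return the chain of (relation, target) steps."""
    lookup = {edge.source: edge for edge in edges}
    chain = []
    while word in lookup:
        edge = lookup[word]
        chain.append((edge.relation, edge.target))
        word = edge.target
    return chain


for relation, target in ancestors("OHG fruht", EDGES):
    print(relation, target)
```

Because every edge carries an explicit relation type, the same data can be queried, validated and compared across dictionaries, which is harder to do with abbreviation-laden prose.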

  48. Traditional Historical Linguistics Problems
    Summary
    Aspect         Current State                                          Desired State
    transparency   intuition                                              formalization
    applicability  difficult and slow                                     fast enough to analyze our heritage before we lose it
    adequacy       lack of probabilistic reasoning                        inclusion of probabilistic arguments
    comparability  philological, irreproducible knowledge representation  formal, digital, and reproducible knowledge representation
    18 / 50

  49. Quantitative Historical Linguistics
    19 / 50

  50. Quantitative Historical Linguistics Characteristics
    Characteristics
    P(A|B) = P(B|A) P(A) / P(B)
    20 / 50

  51. Quantitative Historical Linguistics Characteristics
    Characteristics
    - “Indo-European and computational cladistics” (Ringe, Warnow and Taylor 2002)
    - “Language-tree divergence times support the Anatolian theory of Indo-European origin” (Gray and Atkinson 2003)
    - “Language classification by numbers” (McMahon and McMahon 2005)
    - “Curious Parallels and Curious Connections: Phylogenetic Thinking in Biology and Historical Linguistics” (Atkinson and Gray 2005)
    - “Automated classification of the world’s languages” (Brown et al. 2008)
    - “Computational Feature-Sensitive Reconstruction of Language Relationships: Developing the ALINE Distance for Comparative Historical Linguistic Reconstruction” (Downey et al. 2008)
    - “Networks uncover hidden lexical borrowing in Indo-European language evolution” (Nelson-Sathi et al. 2011)
    - “A pipeline for computational historical linguistics” (Steiner, Stadler, and Cysouw 2011)
    21 / 50

  53. Quantitative Historical Linguistics Characteristics
    Points of Interest and Goals
    phylogenetic reconstruction
    sequence comparison
    general questions of language development
    Goals
    If we cannot guarantee getting the same results from the same data
    considered by different linguists, we jeopardize the essential scientific
    criterion of repeatability. (McMahon & McMahon 2005)
    22 / 50

  54. Quantitative Historical Linguistics Characteristics
    Methods and Theories
    - phylogenetic reconstruction (cf., among others, Gray & Atkinson 2003, Ringe et al. 2002, Brown et al. 2008)
    - phonetic alignment (cf., among others, Kondrak 2000, Prokić et al. 2009, List 2012a)
    - cognate detection (cf. Steiner et al. 2011, List 2012b)
    - borrowing detection (cf. Nelson-Sathi et al. 2011, List et al. 2014a)
    23 / 50

  55. Quantitative Historical Linguistics Achievements
    Achievements
    24 / 50

  56. Quantitative Historical Linguistics Achievements
    New Perspectives
    - external language history receives more attention than before
    - “Indo-Euro-Centrism” is replaced by a more cross-linguistic paradigm
    - new questions regarding general language history
    - new proposals to model language history
    25 / 50

  57. Quantitative Historical Linguistics Achievements
    New Approaches
    - empirical data become the center of interest
    - probabilistic approaches replace “historical” approaches
    - databases replace philological knowledge representation
    - “informal” methods are formalized and automated
    26 / 50

  58. Quantitative Historical Linguistics Achievements
    Examples
    27 / 50

  62. Quantitative Historical Linguistics Achievements
    Examples: Phonetic Alignment
    Alignment Analyses
    Alignment analyses display sequence similarities by representing
    multiple sequences as rows of a matrix in which common segments are
    placed in the same column. Alignments are a formal way to deal with
    general tasks of sequence comparison. Although never explicitly labeled
    or displayed, alignments are virtually present in all analyses in historical
    linguistics dealing with the comparison of sound sequences (words,
    morphemes).
    t  ɔ   x  t  ə  r
    d  ɔː  -  t  ə  r
    27 / 50

  66. Quantitative Historical Linguistics Achievements
    Examples: Phonetic Alignment
    Sound Classes
    Sounds which frequently occur in
    correspondence relations in
    genetically related languages can
    be clustered into classes. It is
    thereby assumed that “phonetic
    correspondences inside a ‘type’ are
    more regular than those between
    different ‘types’” (Dolgopolsky
    1986[1964]: 35).
    k g p b
    ʧ ʤ f v
    t d ʃ ʒ
    θ ð s z
    28 / 50

  67. Quantitative Historical Linguistics Achievements
    Examples: Phonetic Alignment
    Sound Classes
    Sounds which frequently occur in
    correspondence relations in
    genetically related languages can
    be clustered into classes. It is
    thereby assumed that “phonetic
    correspondences inside a ‘type’ are
    more regular than those between
    different ‘types’” (Dolgopolsky
    1986[1964]: 35).
    K
    T
    P
    S
    28 / 50
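
Converting sounds to sound classes is a simple lookup. The sketch below assumes the sixteen sounds shown earlier fall into the four classes K, T, P and S as in Dolgopolsky's scheme (velars and affricates, dentals, labials, sibilants). The assignment of individual sounds is not spelled out in the transcript, so `CLASS_MEMBERS` is an assumption, and both names are illustrative.

```python
# Assumed assignment of the sixteen sounds to four classes (following the
# usual reading of Dolgopolsky's scheme; not spelled out in the transcript).
CLASS_MEMBERS = {
    "K": ["k", "g", "ʧ", "ʤ"],
    "T": ["t", "d", "θ", "ð"],
    "P": ["p", "b", "f", "v"],
    "S": ["ʃ", "ʒ", "s", "z"],
}

# Invert the table so that each sound points to its class label.
SOUND_TO_CLASS = {
    sound: label for label, sounds in CLASS_MEMBERS.items() for sound in sounds
}


def to_sound_classes(segments):
    """Replace each segment by its class label; unknown sounds become '?'."""
    return "".join(SOUND_TO_CLASS.get(segment, "?") for segment in segments)


print(to_sound_classes(["d", "θ", "t"]))  # TTT
```

Under this mapping German d and English θ fall into the same class, so a frequent correspondence such as d : θ counts as a match when the class strings are compared.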

  69. Quantitative Historical Linguistics Achievements
    Examples: Phonetic Alignment
    Sound-Class-Based Phonetic Alignment (SCA, List 2012 a)
    Sound classes and alignment analyses can be combined. Sound
    sequences are internally represented as sound classes. Alignments are
    carried out using standard algorithms developed in evolutionary biology.
    INPUT
      tɔxtər
      dɔːtər
    TOKENIZATION
      t, ɔ, x, t, ə, r
      d, ɔː, t, ə, r
    CONVERSION (sounds to sound classes)
      t ɔ x … → T O G …
      d ɔː t … → T O T …
    ALIGNMENT
      T O G T E R
      T O - T E R
    CONVERSION (sound classes back to sounds)
      T O G … → t ɔ x …
      T O - … → d ɔː - …
    OUTPUT
      t ɔ x t ə r
      d ɔː - t ə r
    29 / 50
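
The workflow above can be imitated with a minimal Python sketch. It is not the SCA implementation: the class table covers only the seven sounds of the example, the alignment is a plain Needleman-Wunsch global alignment with hypothetical scores (match 1, mismatch -1, gap -1), and all function names are made up for illustration.

```python
LENGTH = "\u02d0"  # IPA length mark

WORD_A = "tɔxtər"
WORD_B = "dɔ" + LENGTH + "tər"

# Toy class table for the sounds of the example only (classes as on the slide).
SOUND_CLASSES = {"t": "T", "d": "T", "ɔ": "O", "x": "G", "ə": "E", "r": "R"}


def tokenize(word):
    """Split a word into segments; a length mark joins the preceding sound."""
    segments = []
    for char in word:
        if char == LENGTH and segments:
            segments[-1] += char
        else:
            segments.append(char)
    return segments


def to_classes(segments):
    """Map each segment to its sound class, ignoring vowel length."""
    return [SOUND_CLASSES[segment.rstrip(LENGTH)] for segment in segments]


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment; returns index pairs, None marking a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def sca_align(word_a, word_b):
    """Tokenize, align the class strings, and map back to the sounds."""
    seg_a, seg_b = tokenize(word_a), tokenize(word_b)
    pairs = align(to_classes(seg_a), to_classes(seg_b))
    row_a = [seg_a[i] if i is not None else "-" for i, _ in pairs]
    row_b = [seg_b[j] if j is not None else "-" for _, j in pairs]
    return row_a, row_b


ROW_A, ROW_B = sca_align(WORD_A, WORD_B)
print(" ".join(ROW_A))
print(" ".join(ROW_B))
```

Aligning the class strings instead of the raw sounds is what lets t and d, or ɔ and ɔː, count as matches, so the gap ends up opposite x.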

  70. Quantitative Historical Linguistics Achievements
    Examples: Automatic Cognate Detection
    LexStat (List 2012, List 2014)
    LexStat is a method for automatic cognate detection in multilingual
    wordlists. It relies on sound-class-based sequence alignment (SCA)
    analyses as a proxy to infer language-specific sound similarities (similar
    to the notion of sound correspondences in historical linguistics). Using
    the automatically inferred sound similarities, LexStat partitions words
    into cognate sets.
    30 / 50

  72. Quantitative Historical Linguistics Achievements
    Examples: Automatic Cognate Detection
    Cognate Clustering
    Pairwise distances (simplified form of each word in parentheses)
                       Swedish  English  Danish  Norwegian  Dutch  German
                       kvinna   woman    kvinde  kvine      vrouw  Frau
    Swedish (kvina)    0.00     0.69     0.07    0.12       0.71   0.78
    English (wumin)    0.69     0.00     0.66    0.57       0.68   0.87
    Danish (kveni)     0.07     0.66     0.00    0.08       0.67   0.71
    Norwegian (kwini)  0.12     0.57     0.08    0.00       0.75   0.74
    Dutch (frou)       0.71     0.68     0.67    0.75       0.00   0.17
    German (frau)      0.78     0.87     0.71    0.74       0.17   0.00
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA
    ...  ...        ...     ...    ...      ...
    21   German     Frau    woman  20       frau
    22   Dutch      vrouw   woman  20       vrɑu
    23   English    woman   woman  20       wʊmən
    24   Danish     kvinde  woman  20       kvenə
    25   Swedish    kvinna  woman  20       kviːna
    26   Norwegian  kvine   woman  20       kʋinə
    ...  ...        ...     ...    ...      ...
    30 / 50

  75. Quantitative Historical Linguistics Achievements
    Examples: Automatic Cognate Detection
    Cognate Clustering
    German     Frau    frau
    Dutch      vrouw   vrou
    English    woman   wumin
    Danish     kvinde  kveni
    Swedish    kvinna  kvina
    Norwegian  kvine   kwini
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA     CogID
    ...  ...        ...     ...    ...      ...     ...
    21   German     Frau    woman  20       frau    1
    22   Dutch      vrouw   woman  20       vrɑu    1
    23   English    woman   woman  20       wʊmən   2
    24   Danish     kvinde  woman  20       kvenə   3
    25   Swedish    kvinna  woman  20       kviːna  3
    26   Norwegian  kvine   woman  20       kʋinə   3
    ...  ...        ...     ...    ...      ...     ...
    30 / 50
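
How a distance matrix turns into cognate sets can be illustrated with a small flat clustering routine. The sketch uses average-linkage agglomerative clustering that stops at a cut-off. Both the choice of linkage and the threshold of 0.5 are assumptions made for this example, not parameters reported in the talk. The distances are the ones shown for the words meaning "woman".

```python
LANGUAGES = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]

# Pairwise distances for the words meaning "woman", copied from the slides.
DISTANCES = [
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for this illustration


def flat_cluster(labels, dist, threshold):
    """Merge the two closest clusters until none is closer than threshold."""
    clusters = [[i] for i in range(len(labels))]

    def average(c1, c2):
        # Average-linkage distance between two clusters of row indices.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, x, y = min(
            (average(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d >= threshold:
            break
        clusters[x].extend(clusters[y])
        del clusters[y]
    return [sorted(labels[i] for i in cluster) for cluster in clusters]


COGNATE_SETS = flat_cluster(LANGUAGES, DISTANCES, THRESHOLD)
for cog_id, members in enumerate(COGNATE_SETS, start=1):
    print(cog_id, ", ".join(members))
```

With these settings the routine yields three groups that match the CogID column: the Scandinavian words, the Dutch and German words, and English on its own. The numbering of the groups may differ.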

  76. Quantitative Historical Linguistics Achievements
    Examples: Automatic Cognate Detection
    LexStat Algorithm (List 2014)
    [Workflow diagram. Stages named in the figure: INPUT, TOKENIZATION, PREPROCESSING, CORRESPONDENCE DETECTION USING PHONETIC ALIGNMENT (LOOP: ATTESTED DISTRIBUTION, EXPECTED DISTRIBUTION), LOG-ODDS CALCULATION, DISTANCE CALCULATION, COGNATE CLUSTERING, OUTPUT]
    30 / 50

  78. Quantitative Historical Linguistics Achievements
    Examples: Automatic Cognate Detection
    B-Cubed F-Scores on BDCD Benchmark (List 2014)
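    [Figure: chart of B-Cubed F-scores (axis from 60 to 95) for the methods Turchin, NED, SCA, and LexStat on six datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese Dialects). Values annotated in the figure: 75%, 93%, 92%, 81%, 89%, 81%]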
    31 / 50
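
B-Cubed scores compare a predicted partition with a gold standard word by word: precision asks how much of a word's predicted cluster is correct, recall how much of its true cognate set was found. The sketch below is a generic implementation of that definition, not the evaluation code behind the figure. The gold partition follows the CogID column of the "woman" example, while `PREDICTED` is a made-up result that wrongly lumps English with the Scandinavian words.

```python
def b_cubed(gold, predicted):
    """Return B-Cubed precision, recall and F-score for two partitions.

    Both arguments map each item to a cluster label.
    """
    items = list(gold)
    precision = recall = 0.0
    for item in items:
        same_pred = {o for o in items if predicted[o] == predicted[item]}
        same_gold = {o for o in items if gold[o] == gold[item]}
        overlap = len(same_pred & same_gold)
        precision += overlap / len(same_pred)
        recall += overlap / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Gold standard: cognate IDs as in the "woman" example.
GOLD = {"German": 1, "Dutch": 1, "English": 2,
        "Danish": 3, "Swedish": 3, "Norwegian": 3}

# Hypothetical prediction that merges English into the Scandinavian set.
PREDICTED = {"German": 1, "Dutch": 1, "English": 3,
             "Danish": 3, "Swedish": 3, "Norwegian": 3}

P, R, F = b_cubed(GOLD, PREDICTED)
print(f"precision={P:.2f} recall={R:.2f} F={F:.2f}")
```

Over-lumping costs precision but not recall here, which is why the F-score, the harmonic mean of the two, is the figure usually reported.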

  81. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    Hugo Schuchardt
    (1842-1927)
    “We connect the branches and twigs of the tree with countless
    horizontal lines and it ceases to be a tree.”
    (Schuchardt 1870 [1900]: 11)
    32 / 50

  82. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    Hugo Schuchardt
    (1842-1927)
    32 / 50

    View Slide

  83. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    Hugo Schuchardt
    (1842-1927)
    32 / 50

    View Slide

  84. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    33 / 50

  85. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    [Figure: network with the nodes Spanish, French, Italian, Danish,
    English, and German]
    33 / 50

  89. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    List et al. (2014b)
    [Figure: network of 40 Chinese dialect varieties: Lánzhōu, Fùzhōu,
    Xiāngtàn, Měixiàn, Hongkong, Wǔhàn, Běijīng, Kùnmíng, Hángzhōu,
    Xiàmén, Chéngdū, Sùzhōu, Shànghǎi, Táiběi, Zhèngzhōu, Shèxiàn,
    Nánjīng, Guìyáng, Wénzhōu, Nánníng, Tūnxī, Tiānjìn, Shāntóu, Xīníng,
    Qīngdǎo, Ürümqi, Píngyáo, Nánchàng, Tàiyuán, Chángshā, Hǎikǒu, Héfèi,
    Jiàn'ǒu, Yīnchuàn, Hohhot, Táoyuán, Xī'ān, Guǎngzhōu, Harbin, Jìnán.
    Legend: Inferred Links, scale 0 / 0 / 0]
    33 / 50

  91. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    List et al. (2014b)
    [Figure: network of the Chinese dialect varieties shown on the
    preceding slides. Legend: Inferred Links, scale 1 / 10 / 20]
    33 / 50

  92. Quantitative Historical Linguistics Achievements
    Examples: Phylogenetic Networks
    List et al. (2014b)
    [Figure: network of the Chinese dialect varieties shown on the
    preceding slides. Legend: Inferred Links, scale 1 / 4 / 8]
    33 / 50

  93. Quantitative Historical Linguistics Problems
    Problems
    34 / 50

  94. Quantitative Historical Linguistics Problems
    Transparency
    35 / 50

  95. Quantitative Historical Linguistics Problems
    Transparency
    Computational methods tend to be presented in a black-box fashion.
    This does not mean that the methods are actually black boxes over
    which scholars have no control, but rather that researchers avoid
    the effort of making their algorithms transparent by logging all
    relevant decisions that lead to a certain result. As a result, the
    majority of research published in quantitative historical
    linguistics is irreproducible and not falsifiable.
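    What "logging all relevant decisions" could look like in practice is
    sketched below. The sketch uses a plain normalized edit distance with an
    arbitrary threshold as a stand-in for a cognate detection method; the
    function names, the threshold of 0.4, and the toy words for "moon" are
    assumptions for illustration and do not reproduce any published method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # match or substitution
            ))
        previous = current
    return previous[-1]


def judge_pairs(words, threshold=0.4):
    """Compare all word pairs and record every decision in a readable log."""
    log = []
    names = sorted(words)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = words[first], words[second]
            distance = edit_distance(a, b) / max(len(a), len(b))
            log.append({
                "pair": (first, second),
                "distance": distance,
                "threshold": threshold,
                "judged_cognate": distance <= threshold,
            })
    return log


# Toy data: segmented words for "moon".
WORDS = {
    "German": ["m", "oː", "n", "t"],
    "English": ["m", "uː", "n"],
    "Danish": ["m", "ɔː", "n", "ə"],
    "Swedish": ["m", "oː", "n", "e"],
}

if __name__ == "__main__":
    for entry in judge_pairs(WORDS):
        print(entry)
```

    Because every pairwise decision is stored together with its distance and
    the threshold, a reader can see exactly why this naive stand-in accepts
    only the German and Swedish pair and rejects the others.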
    35 / 50

  96. Quantitative Historical Linguistics Problems
    Applicability
    36 / 50

  97. Quantitative Historical Linguistics Problems
    Applicability
    Method                         Multilingual?  No additional requirements?  Freely available?
    Mackay & Kondrak 2005          ✗              ✓                            ✗
    Bergsma & Kondrak 2007         ✓              ✓                            ✗
    Turchin et al. 2010            ✓              ✓                            ✓
    Berg-Kirkpatrick & Klein 2011  ✗              ✓                            ✗
    Hauer & Kondrak 2011           ✓              ✓                            ✗
    Steiner et al. 2011            ✓              ✓                            ✗
    List 2012 & 2014               ✓              ✓                            ✓
    Beinborn et al. 2013           ✗              ?                            ✗
    Bouchard-Côté et al. 2013      ✓              ✗                            ✗
    Rama 2013                      ✗              ✓                            ✗
    Ciobanu & Dinu 2014            ✗              ✓                            ✗
    …                              …              …                            …
    36 / 50

  101. Quantitative Historical Linguistics Problems
    Accuracy
    37 / 50

  103. Quantitative Historical Linguistics Problems
    Accuracy
    Data Problems (Geisler & List forthcoming)
    Comparing two independently produced lexicostatistical datasets:
    database          # languages  # concepts
    Dyen et al. 1997  95           200
    Tower of Babel    98           110
    intersection      46           103
    Results
    - up to 10 % difference in concept translations
    - many undetected borrowings in both datasets
    - up to 30 % differences in tree topologies for Bayesian analyses
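    The kind of comparison summarized above can be sketched as follows: take
    the languages and concepts that two datasets share and count how often
    the recorded translations differ. The function name and the tiny
    wordlists are invented for illustration; they are not the data of Dyen
    et al. or the Tower of Babel project.

```python
def compare_datasets(first, second):
    """Compare two wordlists given as {(language, concept): form}.

    Returns the shared languages, the shared concepts, and the proportion
    of shared cells in which the two datasets give different forms.
    """
    languages = {lang for lang, _ in first} & {lang for lang, _ in second}
    concepts = {con for _, con in first} & {con for _, con in second}
    shared = [
        (lang, con)
        for lang in sorted(languages)
        for con in sorted(concepts)
        if (lang, con) in first and (lang, con) in second
    ]
    differing = [cell for cell in shared if first[cell] != second[cell]]
    rate = len(differing) / len(shared) if shared else 0.0
    return languages, concepts, rate


# Hypothetical toy datasets.
DATASET_A = {
    ("German", "hand"): "Hand",
    ("German", "dirty"): "schmutzig",
    ("English", "hand"): "hand",
    ("English", "dirty"): "dirty",
    ("Dutch", "hand"): "hand",
}
DATASET_B = {
    ("German", "hand"): "Hand",
    ("German", "dirty"): "dreckig",
    ("German", "moon"): "Mond",
    ("English", "hand"): "hand",
    ("English", "dirty"): "dirty",
}

if __name__ == "__main__":
    langs, cons, rate = compare_datasets(DATASET_A, DATASET_B)
    print(sorted(langs), sorted(cons), f"{rate:.0%} differing translations")
```

    In the toy data one of four shared cells differs, which gives a
    difference of 25 %.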
    37 / 50

  104. Quantitative Historical Linguistics Problems
    Accuracy
    37 / 50

  105. Quantitative Historical Linguistics Problems
    Comparability
    38 / 50

  106. Quantitative Historical Linguistics Problems
    Comparability
    The data used to test new quantitative methods varies greatly, and
    there seems to be little interest in standardization. As a result,
    it is difficult, if not impossible, to compare the accuracy of
    different methods in quantitative historical linguistics, which
    drastically hampers the improvement of existing applications.
    38 / 50

  107. Quantitative Historical Linguistics Problems
    Comparability
    Data of an analysis of Turkic languages (Hruschka et al. 2015)
    38 / 50

  108. Quantitative Historical Linguistics Problems
    Summary
    Aspect         Current State                       Desired State
    transparency   blackbox methods prevail            ALL results should be presented
                                                       in human-readable format
    applicability  lack of free software applications  ALL software proposed should be
                   with transparent source code        free, and source code should be
                                                       well commented
    adequacy       adequacy of results is often        ALL methods need to be rigorously
                   questionable                        tested before publication
    comparability  lack of standardization,            data representation and method
                   inadequate biological formats       testing need to be standardized
                   prevail
    39 / 50

  109. Computer-Assisted
    Language Comparison
    40 / 50

  110. Computer-Assisted Language Comparison Bridging the Gap
    Bridging the Gap
    Computational approaches in historical linguistics largely
    disregard the actual needs of historical linguistics. Despite the
    frequent claims that the algorithms are intended to supplement
    traditional research, many of them are mere attempts to prove the
    power of modern machine learning approaches.
    On the other hand, traditional approaches often disregard the
    fascinating possibilities that the digital age offers, and fail to
    leave the “realm of intuition”.
    41 / 50

  113. Computer-Assisted Language Comparison Bridging the Gap
    Bridging the Gap
    $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
    FRANZ BOPP: VERY, VERY LONG TITLE
    Apart from “computational historical linguistics”, we need to
    establish a new discipline of “computer-assisted language
    comparison” (CALC).
    41 / 50

  114. Computer-Assisted Language Comparison Bridging the Gap
    Excursus: When do we need computers?
    Aspect     Computers Needed?                      Example
    data       Yes, if the 10 times rule of repeated  segmentation of IPA-encoded
               action in data preparation applies!    strings into meaningful sound
                                                      units
    modeling   Yes, since formalization serves as a   family tree model in its
               proof of concept, guarantees           current form
               comparability, and helps to avoid
               errors!
    inference  Yes, since a workflow of               phonetic alignment and
               computational analysis accompanied     cognate detection
               by manual correction is much faster!
    analysis   Yes, if the analysis involves          finding the tree that explains
               calculations or repeated actions!      a dataset best
    42 / 50

  115. Computer-Assisted Language Comparison Examples
    Examples: Data Preparation with help of the EDICTOR
    - data preparation with the help of the EDICTOR (List forthcoming,
      http://tsv.lingpy.org)
    - data modeling (phonetic alignments, cognate judgments) with the
      help of the EDICTOR
    - inference of probable semantic shift patterns with the help of
      CLICS (List et al. 2014, http://clics.lingpy.org)
    - analysis of lexical change patterns with the help of the MLN
      approach (List et al. 2014)
    43 / 50

  116. Computer-Assisted Language Comparison Challenges
    Challenges
    44 / 50

  117. Computer-Assisted Language Comparison Challenges
    Challenges
    German m oː n t -
    English m uː n - -
    Danish m ɔː n - ə
    Swedish m oː n - e
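    How gap symbols such as those in the alignment above can arise
    automatically is sketched below with a textbook pairwise alignment
    (Needleman-Wunsch) over sound segments. The uniform scores (match 1,
    mismatch -1, gap -1) are an assumption chosen for simplicity; dedicated
    phonetic alignment methods use more informed scoring, which this sketch
    does not attempt to reproduce.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; gaps are written as '-'."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


if __name__ == "__main__":
    german, english = align(["m", "oː", "n", "t"], ["m", "uː", "n"])
    print(" ".join(german))
    print(" ".join(english))
```

    For the German and English words this yields "m oː n t" and "m uː n -",
    which agrees with the first four columns of the alignment on the slide.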
    44 / 50

  118. Computer-Assisted Language Comparison Challenges
    Challenges
    German m oː n t -
    English m uː n - -
    Danish m ɔː n - ə
    Swedish m oː n - e
    Fúzhōu ŋ u o ʔ ⁵ - - - - - - - - - -
    Měixiàn ŋ i a t ⁵ - - - - - k u o ŋ ⁴⁴
    Guǎngzhōu j - y t ² l - œ ŋ ²² - - - - -
    Běijīng - y ɛ - ⁵¹ l i ɑ ŋ - - - - - -
    44 / 50

  119. Computer-Assisted Language Comparison Challenges
    Challenges
    German m oː n t -
    English m uː n - -
    Danish m ɔː n - ə
    Swedish m oː n - e
    Fúzhōu ŋ u o ʔ ⁵ - - - - - - - - - -
    Měixiàn ŋ i a t ⁵ - - - - - k u o ŋ ⁴⁴
    Guǎngzhōu j - y t ² l - œ ŋ ²² - - - - -
    Běijīng - y ɛ - ⁵¹ l i ɑ ŋ - - - - - -
    "MOON"
    "MOON"
    "SHINE" "LIGHT"
    44 / 50

  120. Computer-Assisted Language Comparison Challenges
    Challenges
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    44 / 50

  121. Computer-Assisted Language Comparison Challenges
    Challenges
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    INNOVATION
    INNOVATION
    INNOVATION
    BORROWING
    LOSS
    INNOVATION
    INNOVATION
    44 / 50

  122. Computer-Assisted Language Comparison Challenges
    Challenges
    SEMANTIC CHANGE
    MORPHOLOGICAL CHANGE
    STRATIC CHANGE
    Three Dimensions of Lexical Change (Gévaudan 2007)
    44 / 50

  123. ... and its Application to Chinese Dialect History
    45 / 50

  124. Application to Chinese Dialect History DFG Research Project
    DFG Research Project on Chinese Dialect History
    My research project “Vertical and lateral aspects of Chinese dialect
    history”, funded by the German Research Foundation (DFG), proposes
    an interdisciplinary, data-driven approach to Chinese dialectology
    and has two major goals:
    1. Based on the cooperation with sinologists (CRLAO) and
       biologists (UPMC), quantitative methods originally designed
       to study lateral gene transfer in evolutionary biology shall be
       used to explore vertical (inheritance-related) and lateral
       (contact-related) aspects of Chinese dialect history.
    2. In order to bridge the gap between traditional and quantitative
       approaches in historical linguistics, the research will be
       carried out within the new framework of “computer-assisted
       language comparison”.
    46 / 50

  125. Application to Chinese Dialect History Data Preparation
    Data Preparation
    - Hànyǔ Fāngyán Cíhuì 汉语方言词汇 (Běijīng Dàxué 1964, 904
      concepts, translated into 17 dialect varieties) serves as the
      fundamental data basis, hopefully expanded by further sources
      provided by colleagues or taken from additional literature
    - a digital version of the Cíhuì has already been prepared
    - about 80 percent of all Chinese glosses have already been
      translated into English
    - during the next two weeks, all data will be systematically
      extracted, cleaned (segmentation of phonetic entries, e.g.
      writing "thai51jaŋ35" as "tʰ a j ⁵¹ j a ŋ ³⁵"), and uploaded to the
      EDICTOR tool for further analysis (as part of a larger database
      on Sino-Tibetan languages)
    - additional sources will subsequently be added (web-based tools
      for quick identification of identical concepts in new sources
      have already been written)
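    The segmentation step mentioned in the list above can be sketched with a
    small conversion table that is applied by greedy longest match, with tone
    digits turned into superscripts. The table below is a hypothetical
    profile that covers only the single example from the slide; a real
    profile for the Cíhuì data would need many more entries, and the
    function name is likewise an assumption.

```python
# Translation table from ASCII digits to superscript digits (tone letters).
SUPERSCRIPT = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

# Hypothetical conversion table: source spelling -> segmented IPA.
PROFILE = {
    "th": "tʰ",
    "ai": "a j",
    "j": "j",
    "a": "a",
    "ŋ": "ŋ",
}


def segment(entry, profile=PROFILE):
    """Split a raw transcription into space-separated sound units."""
    tokens = []
    longest = max(len(key) for key in profile)
    i = 0
    while i < len(entry):
        if entry[i].isdigit():
            # A run of digits is one tone and becomes one superscript token.
            j = i
            while j < len(entry) and entry[j].isdigit():
                j += 1
            tokens.append(entry[i:j].translate(SUPERSCRIPT))
            i = j
            continue
        for size in range(longest, 0, -1):
            chunk = entry[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                i += len(chunk)
                break
        else:
            # Unknown characters are flagged instead of silently dropped.
            tokens.append("<" + entry[i] + ">")
            i += 1
    return " ".join(tokens)


if __name__ == "__main__":
    print(segment("thai51jaŋ35"))
```

    Flagging unknown characters rather than skipping them keeps the
    conversion inspectable, so gaps in the table show up during manual
    checking.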
    47 / 50

  126. Application to Chinese Dialect History Modeling
    Modeling
    - all lexical entries will be searched for cognate parts
    - cognate sets will be aligned by first carrying out automatic
      analyses and then correcting them manually
    - each lexical entry will further be aligned with its Middle
      Chinese and Old Chinese ancestral forms
    => In this way, I hope to make sure that a maximal amount of
    knowledge about Chinese dialect history is represented in a human-
    and machine-readable way and can later be accessed and
    analysed by both humans and computers.
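    One possible shape of such a human- and machine-readable representation
    is a tab-separated wordlist in which each morpheme of a word carries its
    own cognate identifier. The sketch below is an assumption made for
    illustration: the column names, the "+" morpheme boundary, and the
    cognate identifiers are hypothetical, and the four words for "moon" are
    the ones shown on the earlier alignment slide.

```python
import csv
import io

FIELDS = ["ID", "DOCULECT", "CONCEPT", "TOKENS", "COGIDS"]

# One row per word; COGIDS lists one (illustrative) cognate ID per morpheme.
ROWS = [
    {"ID": "1", "DOCULECT": "Fúzhōu", "CONCEPT": "moon",
     "TOKENS": "ŋ u o ʔ ⁵", "COGIDS": "1"},
    {"ID": "2", "DOCULECT": "Měixiàn", "CONCEPT": "moon",
     "TOKENS": "ŋ i a t ⁵ + k u o ŋ ⁴⁴", "COGIDS": "1 3"},
    {"ID": "3", "DOCULECT": "Guǎngzhōu", "CONCEPT": "moon",
     "TOKENS": "j y t ² + l œ ŋ ²²", "COGIDS": "1 2"},
    {"ID": "4", "DOCULECT": "Běijīng", "CONCEPT": "moon",
     "TOKENS": "y ɛ ⁵¹ + l i ɑ ŋ", "COGIDS": "1 2"},
]


def to_tsv(rows):
    """Serialize rows to tab-separated text that opens in any editor."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Read the tab-separated text back into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


def doculects_by_cognate_id(rows):
    """Group the varieties by the morpheme-level cognate IDs they attest."""
    groups = {}
    for row in rows:
        for cogid in row["COGIDS"].split():
            groups.setdefault(cogid, []).append(row["DOCULECT"])
    return groups


if __name__ == "__main__":
    text = to_tsv(ROWS)
    print(text)
    print(doculects_by_cognate_id(from_tsv(text)))
```

    Keeping the identifiers at the level of morphemes rather than whole
    words is one way to record partial cognacy explicitly.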
    48 / 50

  127. Application to Chinese Dialect History Inference and Analysis
    Inference and Analysis
    In Chinese dialectology, it is often not very hard to detect whether two
    or more words are etymologically related. What is hard, however, is to
    determine their exact relationship, that is, it is hard to resolve which
    dimensions of lexical change were involved during their history.
    In order to cope with the multiple dimensions of lexical change, we
    need new methods and models in historical linguistics, which explicitly
    deal with borrowing, partial cognacy, and semantic change. Following
    the lead of evolutionary biology, these methods could be combined
    under a unified framework of tree reconciliation (Page & Cotton 2002)
    in historical linguistics.
    49 / 50

  128. Application to Chinese Dialect History Inference and Analysis
    Inference and Analysis
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    49 / 50

  130. Application to Chinese Dialect History Inference and Analysis
    Inference and Analysis
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    49 / 50

  132. Application to Chinese Dialect History Inference and Analysis
    Inference and Analysis
    LOSS
    INNOVATION
    INNOVATION
    BORROWING
    49 / 50

  133. Application to Chinese Dialect History Inference and Analysis
    Inference and Analysis
    LOSS
    INNOVATION
    INNOVATION
    BORROWING
    Tree reconciliation is but one method that can be used to provide
    first heuristics to disentangle the complicated etymological
    relations between Chinese dialect words. Alternative methods will
    be discussed and tested with the collaborating biologists during
    my stay.
    49 / 50

  134. At the end of the 19th century, many dialectologists began to
    oppose the Neogrammarian claim that sound change proceeds without
    exceptions. Under the slogan “chaque mot a son histoire” (“every
    word has its own history”), they demanded that linguists should
    take all available data, especially dialect data, into account
    when proposing their theories on language variation and change.
    This triggered some debates in historical linguistics, but they
    were soon forgotten and the methods were never changed.
    50 / 50

  135. Not long ago, evolutionary biologists realized that the evolution
    of genes may follow very idiosyncratic patterns, so complex indeed
    that one could likewise claim that each gene has its own history.
    In order to account for this, they started to develop computational
    tools that allow them to study what they now sometimes call
    “mosaic history”, that is, the individual histories of genes that
    do not follow a major evolutionary path.
    50 / 50

  136. In the digital age we are confronted with an abundance of
    language data that no one person can read in a lifetime. It seems
    to be time that historical linguists start following the example of
    evolutionary biology in using computers as a helpful tool to
    investigate our data, and in not giving up easily when our theories
    are threatened by reality. If it is really true that every word has
    its own history, then let’s try to assemble all these histories in
    databases and see which new and interesting patterns we may find!
    50 / 50

  137. Thank you for your attention!
    50 / 50
