Handling Phonological and Etymological Relations in Computer-Based and Computer-Assisted Frameworks

Talk presented at the workshop "Towards a Global Language Phylogeny", 23-26 February, Max Planck Institute for the Science of Human History, Waiheke Island.

Johann-Mattis List

February 23, 2015

Transcript

  1. Handling Phonological and Etymological Relations in
    Computer-Based and Computer-Assisted
    Frameworks
    Theoretical Aspects
    Johann-Mattis List
    DFG Research Fellow
    Centre des Recherches Linguistiques sur l’Asie Orientale (EHESS)
    Team “Adaptation, Integration, Reticulation, Evolution” (UPMC)
    Paris
    2015-02-23

  7. LANGUAGE MODELING
    2.0
    State of the Art

  10. State of the Art New Models
    New Models
    [Figure: Gain-Loss Models (states 1, 0) ===> Multistate Models (states p, p͡f, f, v, h, ø)]
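
The contrast between the two model families can be made concrete with a minimal Python sketch. It assumes a toy dataset of four hypothetical languages (`L1` to `L4`); the cognate labels and sounds are invented for illustration and do not come from the talk. A gain-loss coding records whether each cognate set is present (1) or absent (0), while a multistate coding treats the sound observed at one aligned position as the character state.

```python
# Hypothetical toy data: the cognate set each language uses for one concept,
# and the sound each language shows at one aligned position.
cognate_set = {"L1": "A", "L2": "A", "L3": "B", "L4": "A"}
sound_at_site = {"L1": "p", "L2": "p͡f", "L3": "f", "L4": "h"}


def gain_loss_coding(assignment):
    """Return one binary presence/absence character per cognate set."""
    sets = sorted(set(assignment.values()))
    return {
        cs: {lang: int(assignment[lang] == cs) for lang in assignment}
        for cs in sets
    }


def multistate_coding(sounds):
    """Return the list of states and one multistate character over them."""
    states = sorted(set(sounds.values()))
    index = {state: i for i, state in enumerate(states)}
    return states, {lang: index[sound] for lang, sound in sounds.items()}


binary = gain_loss_coding(cognate_set)
states, multistate = multistate_coding(sound_at_site)
```

The binary coding needs one character per cognate set, whereas the multistate coding keeps a single character whose number of states grows with the number of distinct sounds.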

  11. State of the Art Examples
    Examples
    2014

  13. State of the Art Examples
    Examples
    Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages
    Ward C. Wheeler (a,*) and Peter M. Whiteley (b)
    (a) Division of Invertebrate Zoology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA; (b) Division of Anthropology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA
    Accepted 18 March 2014
    Abstract
    Language origins and diversification are vital for mapping human history. Traditionally, the reconstruction of language trees has been based on cognate forms among related languages, with ancestral protolanguages inferred by individual investigators. Disagreement among competing authorities is typically extensive, without empirical grounds for resolving alternative hypotheses. Here, we apply analytical methods derived from DNA sequence optimization algorithms to Uto-Aztecan languages, treating words as sequences of sounds. Our analysis yields novel relationships and suggests a resolution to current conflicts about the Proto-Uto-Aztecan homeland. The techniques used for Uto-Aztecan are applicable to written and unwritten languages, and should enable more empirically robust hypotheses of language relationships, language histories, and linguistic evolution.
    © The Willi Hennig Society 2014.
    Introduction
    How languages evolve has long been a central question for the human sciences. Linguistic elements may be transmitted horizontally (“borrowing”) among neighbouring languages, but most language transmission obviously occurs via lineal descent with modification. Linguistic and biological evolution are thus analogous in important respects; constructing trees of languages “genetically” related in families is well established (e.g. Greenhill et al., 2009). Recently, phylogenetic models have enhanced both methodology and hypothesis-testing for language ancestry (e.g. Forster and Renfrew, 2006). Approaches now engage archaeology, anthropology, genetics, and computational science, as well as historical linguistics itself.
    Notwithstanding advances, disputes remain vigorous in both methods and results, including for well-studied language families such as Indo-European (see, for example, Forster and Renfrew, 2006; Campbell and Poser, 2008). Often, reconstructions are untestable—hence the vigour of disputation. The approach adopted here, by contrast, involves an inspectable set of procedures applied directly to empirical linguistic data. We use analytical methods derived from DNA sequence optimization algorithms, treating words as sequences of sounds. We demonstrate this with Uto-Aztecan (UA) languages of North and Middle America.
    The basic approach articulated here is to remove the inferential overburden of hypothesized “proto-forms” (discussed below), and perform analysis solely using the observed sound content of words. In this way, the sequences of sounds that constitute all human languages form the empirical basis upon which language trees are built. To accomplish this, we have adapted techniques more usually applied to the analysis of DNA and protein sequence data, but are readily applied to sound sequences as well (as with other non-molecular sequence data; Schulmeister and Wheeler, 2004; Robillard et al., 2006). In moving from proto-forms to sound sequences, a transition occurs analogous to the advances forged in organismic systematic
    *Corresponding author: E-mail address: [email protected]
    Cladistics (2014) 1–13, 10.1111/cla.12078
    Data
      32 Uto-Aztecan languages
      Swadesh 100 concept lists
      manually extracted cognate sets
      IPA-encoding, 148 unique symbols
    Method
      simultaneous sequence optimization and phylogenetic inference (ML framework)
      varying scoring functions for the matching of sound symbols
    Output
      phylogenies, (?) transition frequencies (?)
    Software
      POY 5.0 (Wheeler 2013): Phylogenetic Analysis of DNA and other data using dynamic homology
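
To illustrate what "treating words as sequences of sounds" with "varying scoring functions" can mean in practice, here is a minimal sketch of global pairwise alignment (Needleman-Wunsch) over sound segments. It is not the tree-based optimization performed by POY; the function names, the gap penalty, and both scoring functions are assumptions made for this example.

```python
def align(seq_a, seq_b, score, gap=-1):
    """Globally align two lists of sound segments; return both rows and the score."""
    n, m = len(seq_a), len(seq_b)
    # table[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                table[i][j] == table[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], table[n][m]


def identity_score(x, y):
    """Simplest scoring function: reward identical symbols only."""
    return 1 if x == y else -1


VOWELS = set("aeiou")


def class_score(x, y):
    """Hypothetical alternative: mismatches within vowels or within consonants cost less."""
    if x == y:
        return 2
    return 0 if (x in VOWELS) == (y in VOWELS) else -2


row_a, row_b, best = align(list("herz"), list("heart"), identity_score)
```

Swapping `identity_score` for `class_score` changes which alignments count as optimal, which is why the choice of scoring function matters for any downstream inference.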

  14. State of the Art Examples
    Examples
    2015

  16. State of the Art Examples
    Examples
    2015
    Current Biology 25, 1–9, January 5, 2015 © 2015 The Authors http://dx.doi.org/10.1016/j.cub.2014.10.064
    Article
    Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution
    Daniel J. Hruschka (1), Simon Branford (2), Eric D. Smith (3,4), Jon Wilkins (3,5), Andrew Meade (2), Mark Pagel (2,3,*), and Tanmoy Bhattacharya (3,6,*)
    (1) School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, USA
    (2) School of Biological Sciences, University of Reading, Reading RG6 6BX, UK
    (3) The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
    (4) Krasnow Institute for Advanced Study, George Mason University, Mail Stop 2A1, 4400 University Drive, Fairfax, VA 22030, USA
    (5) Ronin Institute, 127 Haddon Place, Montclair, NJ 07043, USA
    (6) T-2, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
    Summary
    Background: Concerted evolution is normally used to describe parallel changes at different sites in a genome, but it is also observed in languages where a specific phoneme changes to the same other phoneme in many words in the lexicon—a phenomenon known as regular sound change. We develop a general statistical model that can detect concerted changes in aligned sequence data and apply it to study regular sound changes in the Turkic language family.
    Results: Linguistic evolution, unlike the genetic substitutional process, is dominated by events of concerted evolutionary change. Our model identified more than 70 historical events of regular sound change that occurred throughout the evolution of the Turkic language family, while simultaneously inferring a dated phylogenetic tree. Including regular sound changes yielded an approximately 4-fold improvement in the characterization of linguistic change over a simpler model of sporadic change, improved phylogenetic inference, and returned more reliable and plausible dates for events on the phylogenies. The historical timings of the concerted changes closely follow a Poisson process model, and the sound transition networks derived from our model mirror linguistic expectations.
    Conclusions: We demonstrate that a model with no prior knowledge of complex concerted or regular changes can nevertheless infer the historical timings and genealogical placements of events of concerted change from the signals left in contemporary data. Our model can be applied wherever discrete elements—such as genes, words, cultural trends, technologies, or morphological traits—can change in parallel within an organism or other evolving group.
    Introduction
    Concerted evolutionary change is widespread in genetic systems, being implicated in the genome-wide control of repetitive elements [1–3], the evolution of gene families [2], and homogenization of Y chromosome sequences [4, 5] and as a means by which asexual organisms might escape the debilitating consequences of Muller’s ratchet [3]. It might arise from several mechanisms, including homologous recombination, that allow certain favorable elements to spread or damaging elements to be neutralized.
    Linguists have long recognized concerted change that affects copies of the same sound (or phoneme) appearing in different words as a central feature of linguistic evolution [6]. A well-known example is the *p > f sound change in the Germanic languages wherein an older Indo-European p sound was replaced by an f sound, such as in *pater > father, or *pes, *pedis > foot (linguistic convention is to use the “>” symbol to indicate a transition from one sound to another, and here the * symbol denotes a reconstructed ancestral form). These multiple instances of one phoneme changing to the same other phoneme yield regular sound correspondences between pairs or groups of languages. Linguists have proposed several explanations for the regularity of changes grounded in a number of basic processes, including speech production, perception, and cognition [7–9].
    Can events of concerted change be detected statistically in sequence data, and do they improve the characterization of evolution and the inference of evolutionary histories? Although previous researchers working in a linguistic setting have used the concept of regular changes to build algorithms for automatically inferring cognacy, to our knowledge the model we report here is the first probabilistic description of concerted change. This places concerted evolution in a statistical setting that allows for formal hypothesis testing about the nature and rates of concerted changes. For example, the question of how many parallel changes are required to be recognized as an instance of concerted change is naturally dealt with in our model: the statistical signature of concerted or regular change is that the multiple parallel events are more probable if treated as a single coordinated change than as a collection of independent changes (Box 1).
    Usefully, the genetic and linguistic phenomena share fundamental properties relevant to their statistical characterization. Phonemes are the units of sound that make up words and distinguish one word from another, just as the four nucleotide bases (A, C, T, G) make up DNA gene sequences or the 20 amino acids make up protein sequences. The number of distinct sounds in a language varies greatly, but somewhere around 30–60 phonemes are commonly sufficient to describe the range of distinctive sounds in a language’s words [10]. Collections of words can therefore be thought of as providing phonemic “sequence information” that might be informative as to the history, rate, and patterns of concerted evolutionary change in language, and in a manner analogous to sequences of DNA.
    Data
      26 Turkic languages
      225 cognate sets from etymological dictionaries (Tower of Babel)
      manually compiled alignment analyses
      ASCII-encoding, 62 unique symbols
    Method
      Bayesian Markov chain Monte Carlo
      statistical model allowing for sporadic (irregular) and concerted (regular) changes along a phylogenetic tree that produces the alignments
    Output
      phylogenies, change rates
    Software
      Bayes Phylogenies
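
The "statistical signature" quoted on this slide (parallel events are more probable when treated as one coordinated change than as independent changes) can be illustrated with a deliberately simplified calculation. This is not the model of Hruschka et al. (2015): the two probabilities `p_sporadic` and `p_concerted`, and the all-or-nothing definition of a concerted event, are assumptions chosen only to show the shape of the comparison.

```python
import math


def log_prob_sporadic(n_changed, n_words, p_sporadic):
    """Each of n_words changes independently with probability p_sporadic."""
    unchanged = n_words - n_changed
    return n_changed * math.log(p_sporadic) + unchanged * math.log(1 - p_sporadic)


def log_prob_concerted(n_changed, n_words, p_concerted):
    """One event with probability p_concerted changes every word at once."""
    if n_changed == n_words:
        return math.log(p_concerted)
    if n_changed == 0:
        return math.log(1 - p_concerted)
    # A partial change cannot be produced by a single all-or-nothing event.
    return float("-inf")


def prefers_concerted(n_changed, n_words, p_sporadic=0.05, p_concerted=0.05):
    """True if the observation is more probable under the concerted toy model."""
    concerted = log_prob_concerted(n_changed, n_words, p_concerted)
    sporadic = log_prob_sporadic(n_changed, n_words, p_sporadic)
    return concerted > sporadic
```

With these assumed values, ten out of ten words showing the same change favours the concerted explanation, whereas three out of ten does not; a realistic model would also have to allow exceptions and place events on a tree.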

  17. State of the Art Problems
    Problems
    !

  18. State of the Art Problems
    Data Representation
    File “UT-root.fas”, lines 1-20 (Wheeler et al. 2014)

  19. State of the Art Problems
    Data Representation
    File “TurkicFull.txt”, lines 1-20 (Hruschka et al. 2015)

  21. State of the Art Problems
    Model Restrictions
    approach                           Wheeler et al. 2014   Hruschka et al. 2015
    data size       taxa               32                    26
                    cognate sets       100                   225
    data structure  phonetic strings   ✓                     ✓
                    segmentation       ✓                     ✓
                    cognate sets       ✓                     ✓
                    alignments         inferred              prescribed
    models          sound change       transition graph      transition graph
                    transition rates   prescribed            inferred
                    unobserved states  ✗                     ✗
                    context            ✗                     ✗

  22. State of the Art Problems
    Model Restrictions
    Indo-European: ˈs o h₂ w l̩ / s h₂ u ˈe n - "sun"
    Latin: s oː l - "sun"; s oː l i k u l u s "small sun"
    French: s oː l ɛ j "sun"
    Germanic: s u nː ɔ̃ː "sun"
    German: z ɔ n ə "sun"

  23. Modeling Phonological Relations
    hjärta  h j - ä r t a -
    herz    h - e - r z - -
    heart   h - e a r t - -
    cordis  c - - o r d i s
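
An alignment like the one on this slide is a matrix with one row per word and one column per correspondence. The sketch below stores the four rows as shown and reads the matrix column by column; the dictionary layout and the helper names are assumptions for illustration, not LingPy's data format.

```python
# Rows copied from the slide; "-" marks a gap.
alignment = {
    "hjärta": "h j - ä r t a -".split(),
    "herz": "h - e - r z - -".split(),
    "heart": "h - e a r t - -".split(),
    "cordis": "c - - o r d i s".split(),
}


def columns(msa):
    """Return the alignment as a list of columns (one tuple per position)."""
    rows = list(msa.values())
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return [tuple(row[i] for row in rows) for i in range(width)]


def unaligned(row):
    """Strip the gaps to get the original segment string back."""
    return "".join(segment for segment in row if segment != "-")


cols = columns(alignment)
```

Each column is a candidate sound correspondence: the first one pairs h in the three Germanic words with c in Latin, which is exactly the kind of pattern a sound-change model has to explain.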

  24. Modeling Phonological Relations Alignments
    Alignments

  29. Modeling Phonological Relations Problems
    Problems
    !

  32. Modeling Phonological Relations Problems
    Problems
    unalignable cognates
    t u ŋ a b iː t ə -
    context
    site dependencies

  33. Modeling Phonological Relations Solutions
    Solutions
    ?

  34. Modeling Phonological Relations Solutions
    LingPy and EDICTOR
    LingPy (List et al. 2014, http://lingpy.org)
    EDICTOR (List in prep., http://tsv.lingpy.org)
    $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
    FRANZ BOPP
    VERY, VERY LONG TITLE

  38. Modeling Phonological Relations Solutions
    Unalignable Parts
    implemented (LingPy and EDICTOR)

  41. Modeling Phonological Relations Solutions
    Phonetic Context
    implemented (LingPy and EDICTOR)

  44. Modeling Phonological Relations Solutions
    Site Dependence
    implementation pending

  45. Modeling Phonological Relations Solutions
    Summary
    Problem               Subproblem               LingPy  EDICTOR
    unalignable cognates  linear relations         ✓       ✓
                          non-linear relations     ✗       ✗
    context               prosodic context         ✓       ✓
                          user-defined context     (✓)     ✓
    site dependencies     neighboring columns      ✗       (✓)
                          non-neighboring columns  ✗       ✗

  46. Modeling Phonological Relations Solutions
    Summary
    PhonoBank?
    Yes! If we thoroughly collect
      aligned cognate sets, along with
      proto-forms, and
      marked context (datasets of Paul and Thiago)
    we will be able to automatically extract the sound changes and create our PhonoBank!
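
The extraction step described on this slide can be sketched as follows, assuming each cognate set already comes with an aligned proto-form and aligned reflexes. The data are hypothetical toy forms for two invented languages (`LangA`, `LangB`), loosely modelled on the *p > f pattern; the structure and function name are assumptions, not an existing PhonoBank format.

```python
from collections import Counter

# Hypothetical toy data: aligned proto-forms with aligned reflexes per language.
cognate_sets = [
    {
        "proto": ["p", "a", "t", "e", "r"],
        "reflexes": {
            "LangA": ["f", "a", "θ", "e", "r"],
            "LangB": ["p", "a", "t", "e", "r"],
        },
    },
    {
        "proto": ["p", "e", "d", "-"],
        "reflexes": {
            "LangA": ["f", "o", "t", "-"],
            "LangB": ["p", "e", "d", "e"],
        },
    },
]


def extract_changes(sets):
    """Count (proto sound, reflex sound) pairs per language, column by column."""
    counts = {}
    for entry in sets:
        proto = entry["proto"]
        for language, reflex in entry["reflexes"].items():
            if len(reflex) != len(proto):
                raise ValueError("proto-form and reflex must have equal aligned length")
            table = counts.setdefault(language, Counter())
            for source, target in zip(proto, reflex):
                if source != target:  # identical segments are not changes
                    table[(source, target)] += 1
    return counts


changes = extract_changes(cognate_sets)
```

Pairs that recur across many cognate sets, such as `("p", "f")` for `LangA` here, are the candidates for regular sound changes; marked context would be needed to condition them further.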

  47. Modeling Etymological Relations
    PIE *bhreu̯Hg̑- “to use”
    PIE *bhruHg̑-ié- “to use” (present tense)
    PGM *ƀrūkan- “to use”
    OHG brūhhan “to use”
    G brauchen “to use”
    G Brauch “custom”
    OHG fruht “profit, fruit”
    G frugal “nourishing”
    Fr fruit “profit, fruit”
    Fr frugal “modest (food)”
    Lt fruor, fruī “I enjoy”
    Lt frūctus “profit”
    Lt frux “fruit, grain”
    Lt frugalis “bring profit”
    Legend: inherited from, borrowed from, derived from
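
The three relation types in the legend suggest a directed graph with typed edges. The sketch below shows one possible data structure. The edges of the original figure are not recoverable from this transcript, so the three edges added here are illustrative assumptions, and the class and method names are hypothetical.

```python
from collections import defaultdict

RELATIONS = {"inherited from", "borrowed from", "derived from"}


class EtymologyGraph:
    """Directed graph whose edges point from a word to its source."""

    def __init__(self):
        self._edges = defaultdict(list)  # word -> [(relation, source word)]

    def add(self, word, relation, source):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._edges[word].append((relation, source))

    def ancestors(self, word):
        """Return every reachable source with the relation path leading to it."""
        seen, stack, result = set(), [(word, ())], []
        while stack:
            current, path = stack.pop()
            for relation, source in self._edges.get(current, []):
                if source not in seen:
                    seen.add(source)
                    result.append((source, path + (relation,)))
                    stack.append((source, path + (relation,)))
        return result


graph = EtymologyGraph()
# Illustrative edges only (assumed for this example, not read off the slide).
graph.add("OHG brūhhan", "inherited from", "PGM *ƀrūkan-")
graph.add("G brauchen", "inherited from", "OHG brūhhan")
graph.add("G Brauch", "derived from", "G brauchen")
history = graph.ancestors("G Brauch")
```

Keeping the relation type on each edge lets the same structure answer both tree-like questions (inheritance only) and network questions (including borrowing and derivation).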

  49. Modeling Etymological Relations Dimensions of Lexical Change
    Dimensions of Lexical Change
    SEMANTIC CHANGE
    MORPHOLOGICAL CHANGE
    STRATIC CHANGE
    Dimensions of Lexical Change
    (Gévaudan 2007)

  50. Modeling Etymological Relations Dimensions of Lexical Change
    Dimensions of Lexical Change
    *kuppa- / Kopf: semantic change
    Kopf / köpfen: morphological change
    world cup / Weltcup: stratic change

  51. Modeling Etymological Relations Inference
    Inference
    word Wort слово
    cuvînt palabra
    mot adottszó slovo verbum
    focal 词 parola λόγος
    शब्द ord
    λόγος Wort слово
    cuvînt palabra
    mot adottszó slovo verbum
    focal 词 parola
    शब्द ord
    word
    ord
    ord
    word

  52. Modeling Etymological Relations Inference
    Semantic Shift and Colexification Networks
    Key    Concept       Russian    German       ...
    1.1    world         mir, svet  Welt         ...
    1.21   earth, land   zemlja     Erde, Land   ...
    1.212  ground, soil  počva      Erde, Boden  ...
    1.420  tree          derevo     Baum         ...
    1.430  wood          derevo     Wald         ...
    ...    ...           ...        ...          ...
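
A colexification is a case where one word form in a language covers two concepts, as Russian derevo does for "tree" and "wood" in the table above. The sketch below finds such pairs in the five rows shown; the tuple layout and function name are assumptions for illustration and do not reflect how CLICS is implemented.

```python
from collections import defaultdict
from itertools import combinations

# Rows copied from the table above: key, concept, Russian forms, German forms.
wordlist = [
    ("1.1", "world", ["mir", "svet"], ["Welt"]),
    ("1.21", "earth, land", ["zemlja"], ["Erde", "Land"]),
    ("1.212", "ground, soil", ["počva"], ["Erde", "Boden"]),
    ("1.420", "tree", ["derevo"], ["Baum"]),
    ("1.430", "wood", ["derevo"], ["Wald"]),
]
LANGUAGES = ["Russian", "German"]


def colexifications(rows, languages):
    """Map each concept pair to the languages that use one form for both."""
    by_form = defaultdict(set)  # (language, form) -> concepts expressed
    for _key, concept, *forms_per_language in rows:
        for language, forms in zip(languages, forms_per_language):
            for form in forms:
                by_form[(language, form)].add(concept)
    links = defaultdict(set)
    for (language, _form), concepts in by_form.items():
        for pair in combinations(sorted(concepts), 2):
            links[pair].add(language)
    return dict(links)


links = colexifications(wordlist, LANGUAGES)
```

Counting how many languages support each concept pair gives the edge weights of a colexification network; with only two languages, every link here has weight one.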

  53. Modeling Etymological Relations Inference
    Semantic Shift and Colexification Networks
    post, pole
    staff, walking stick
    doorpost, jamb
    tree stump
    mast
    club
    firewood
    root
    tree trunk
    woods, forest
    banana tree
    tree
    wood
    CLICS: Database of Cross-Linguistic Colexifications
    (List et al. 2014, http://clics.lingpy.org)

  54. Modeling Etymological Relations Inference
    Semantic Shift and Concept Comparison
    ID    GLOSS   OCCS  Blust-2008-210  Dolgopolsky-1964-15   Dunn-2012-207
    4118  WATER   3     water           water                 water
    5440  TONGUE  3     tongue          tongue                tongue
    5450  I       3     I               first person marker   I
    5456  YOU     3     you             second person marker  you
    5490  WHO     3     who?            who/what              who
    CONCEPTICON: A resource for the linking of concept lists
    (List and Cysouw, forthcoming, http://concepticon.github.io)
    Concept ID
    arbitrarité

  55. Modeling Etymological Relations Inference
    Language Contact and Minimal Lateral Networks
    [Figure: network of 40 Chinese dialect varieties: Lánzhōu, Fùzhōu, Xiāngtàn, Měixiàn, Hongkong, Wǔhàn, Běijīng, Kùnmíng, Hángzhōu, Xiàmén, Chéngdū, Sùzhōu, Shànghǎi, Táiběi, Zhèngzhōu, Shèxiàn, Nánjīng, Guìyáng, Wénzhōu, Nánníng, Tūnxī, Tiānjìn, Shāntóu, Xīníng, Qīngdǎo, Ürümqi, Píngyáo, Nánchàng, Tàiyuán, Chángshā, Hǎikǒu, Héfèi, Jiàn'ǒu, Yīnchuàn, Hohhot, Táoyuán, Xī'ān, Guǎngzhōu, Harbin, Jìnán; scale labels: 0, 0, 0]
    Inferred Links
    Minimal Lateral Networks of Chinese dialects (List et al. 2014)

  56. Modeling Etymological Relations Inference
    Language Contact and Minimal Lateral Networks
    [Figure: network of the same 40 Chinese dialect varieties as on the previous slide; scale labels: 1, 4, 8]
    Inferred Links
    Minimal Lateral Networks of Chinese dialects (List et al. 2014)
    26 / 30

  57. Modeling Etymological Relations Inference
    Language Contact and Minimal Lateral Networks
    [Figure: network plot with nodes numbered 1 to 40 and dialect-group labels Guānhuà, Xiàng, Mǐn, Yuè, Jìn, Kèjiā, Gàn, Huī (one further group label is missing from the transcript); legend "Inferred Links" with scale values 1, 7, 15]
    Key to node numbers:
    1 Běijīng 北京
    2 Chángshā 长沙
    3 Chéngdū 成都
    4 Fùzhōu 福州
    5 Guǎngzhōu 广州
    6 Guìyáng 贵阳
    7 Harbin 哈尔滨
    8 Hǎikǒu 海口
    9 Hángzhōu 杭州
    10 Héfèi 合肥
    11 Hohhot 呼和浩特
    12 Jiàn'ōu 建瓯
    13 Jìnán 济南
    14 Kùnmíng 昆明
    15 Lánzhōu 兰州
    16 Měixiàn 梅县
    17 Nánchàng 南昌
    18 Nánjīng 南京
    19 Nánníng 南宁
    20 Píngyáo 平遥
    21 Qīngdǎo 青岛
    22 Shànghǎi 上海
    23 Shāntóu 汕头
    24 Shèxiàn 歙县
    25 Sùzhōu 苏州
    26 Táiběi 台北
    27 Tàiyuán 太原
    28 Táoyuán 桃园
    29 Tiānjìn 天津
    30 Tūnxī 屯溪
    31 Wénzhōu 温州
    32 Wǔhàn 武汉
    33 Ürümqi 乌鲁木齐
    34 Xiàmén 厦门
    35 Hongkong 香港
    36 Xiāngtàn 湘潭
    37 Xīníng 西宁
    38 Xī'ān 西安
    39 Yīnchuàn 银川
    40 Zhèngzhōu 郑州
    Minimal Lateral Networks of Chinese dialects (List et al. 2014)
    26 / 30

  58. Modeling Etymological Relations Inference
    Language Contact and Minimal Lateral Networks
    [Figure: network plot of the 40 dialect varieties with the words 太阳, 日头, 热头 and 阳婆 in the legend (one further legend entry is missing from the transcript); markers: Loss Event, Gain Event]
    Inferred evolution of "sun" (List et al. submitted)
    26 / 30

  59. Modeling Etymological Relations Inference
    Language Contact and Minimal Lateral Networks
    [Figure: subset of ten varieties (Shànghǎi, Hongkong, Táiběi, Nánjīng, Táoyuán, Běijīng, Měixiàn, Xiàmén, Fùzhōu, Guǎngzhōu) with the words 太阳 and 日头 in the legend; markers: Loss Event, Gain Event]
    Inferred evolution of "sun" (List et al. submitted)
    26 / 30

  62. Modeling Etymological Relations Models
    Models
    LOSS
    INNOVATION
    INNOVATION
    BORROWING
    27 / 30

  65. Modeling Etymological Relations Models
    New Models for Lexical Change
               "MOON"
    German     m  oː  n  t  -
    English    m  uː  n  -  -
    Danish     m  ɔː  n  -  ə
    Swedish    m  oː  n  -  e
               "MOON"           "SHINE"          "LIGHT"
    Fúzhōu     ŋ  u  o  ʔ  ⁵    -  -  -  -  -    -  -  -  -  -
    Měixiàn    ŋ  i  a  t  ⁵    -  -  -  -  -    k  u  o  ŋ  ⁴⁴
    Guǎngzhōu  j  -  y  t  ²    l  -  œ  ŋ  ²²   -  -  -  -  -
    Běijīng    -  y  ɛ  -  ⁵¹   l  i  ɑ  ŋ  -    -  -  -  -  -
    28 / 30

  66. Modeling Etymological Relations Models
    New Models for Lexical Change
    "MOON"           "SHINE"          "LIGHT"
    ŋ  u  o  ʔ  ⁵    -  -  -  -  -    -  -  -  -  -
    ŋ  i  a  t  ⁵    -  -  -  -  -    k  u  o  ŋ  ⁴⁴
    j  -  y  t  ²    l  -  œ  ŋ  ²²   -  -  -  -  -
    -  y  ɛ  -  ⁵¹   l  i  ɑ  ŋ  -    -  -  -  -  -
    28 / 30
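The aligned Chinese words on this slide can be read as three morpheme slots, which is what makes partially related words comparable. The following is a minimal Python sketch of that reading. It assumes that each of the labels "MOON", "SHINE" and "LIGHT" covers a block of five alignment columns (as the 15-column layout of the slide suggests) and that a block consisting only of gap symbols means the morpheme is absent. The names `morphemes` and `presence_matrix` are hypothetical.

```python
# Alignment rows copied from the slide, one token per alignment column.
ALIGNMENT = {
    "Fúzhōu":    "ŋ u o ʔ ⁵ - - - - - - - - - -".split(),
    "Měixiàn":   "ŋ i a t ⁵ - - - - - k u o ŋ ⁴⁴".split(),
    "Guǎngzhōu": "j - y t ² l - œ ŋ ²² - - - - -".split(),
    "Běijīng":   "- y ɛ - ⁵¹ l i ɑ ŋ - - - - - -".split(),
}
LABELS = ["MOON", "SHINE", "LIGHT"]
BLOCK_WIDTH = 5  # assumption: five alignment columns per morpheme slot


def morphemes(row):
    """Return the labels of all morpheme slots that contain at least one segment."""
    present = []
    for i, label in enumerate(LABELS):
        block = row[i * BLOCK_WIDTH:(i + 1) * BLOCK_WIDTH]
        if any(token != "-" for token in block):
            present.append(label)
    return present


def presence_matrix(alignment):
    """Encode each variety as a 0/1 vector over the morpheme slots."""
    return {
        variety: [int(label in morphemes(row)) for label in LABELS]
        for variety, row in alignment.items()
    }


for variety, row in ALIGNMENT.items():
    print(variety, "+".join(morphemes(row)))
```

Under these assumptions, a word such as the Měixiàn form is related to the Fúzhōu form in its first slot only, a distinction that a single binary cognate judgment would lose.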

  67. Modeling Etymological Relations Models
    New Models for Lexical Change
    28 / 30

  70. Modeling Etymological Relations Models
    New Models for Lexical Change
    transition priors for etymologically related words can be provided
      manually, automatically, or semi-automatically
    transition priors can be defined simultaneously for semantic change and
      morphological change (including complex paradigms)
    all we need in order to use the rich information inherent in complex
      etymological relations are software “beasts” that provide multistate
      models accepting n states, along with user-specified individual
      transition priors for each set of etymologically related words
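What a multistate model with n states and user-specified transition priors could look like can be sketched in a few lines. The following Python sketch is an illustration under stated assumptions, not the interface of BEAST or of any existing package. It assumes that the priors are given as non-negative weights for directed state transitions, that they are used directly as rates of a continuous-time Markov chain, and that unlisted transitions have rate zero. The state names, the weights and the function names are all hypothetical.

```python
import math


def rate_matrix(states, priors):
    """Build a rate matrix Q from user-specified transition weights.

    Off-diagonal entries are the given weights (unlisted pairs keep rate 0);
    each diagonal entry is set so that its row sums to zero.
    """
    index = {state: i for i, state in enumerate(states)}
    n = len(states)
    q = [[0.0] * n for _ in range(n)]
    for (source, target), weight in priors.items():
        if source == target or not math.isfinite(weight) or weight < 0:
            raise ValueError(f"invalid prior for {source!r} -> {target!r}")
        q[index[source]][index[target]] = float(weight)
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this sums the rates
    return q


def _matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q * t) by a Taylor series with scaling and squaring."""
    n = len(q)
    factor = t / 2 ** squarings
    small = [[q[i][j] * factor for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in _matmul(term, small)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = _matmul(result, result)
    return result


# Hypothetical states and weights for one set of etymologically related words.
STATES = ["MOON", "MOON+SHINE", "MOON+LIGHT"]
PRIORS = {
    ("MOON", "MOON+SHINE"): 1.0,   # compounding
    ("MOON", "MOON+LIGHT"): 1.0,   # compounding
    ("MOON+SHINE", "MOON"): 0.2,   # loss of the second morpheme
    ("MOON+LIGHT", "MOON"): 0.2,   # loss of the second morpheme
}
Q = rate_matrix(STATES, PRIORS)
P = transition_probabilities(Q, t=1.0)
for state, row in zip(STATES, P):
    print(state, [round(x, 3) for x in row])
```

Because the two compound states have no direct rate between them, a change from one to the other is only possible via the simple state, which is one way a linguistic prior can constrain the model.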
    28 / 30

  72. Modeling Etymological Relations Models
    New Data for Lexical Change
    Database of Lexical Change Patterns in Sino-Tibetan Languages
      project of CRLAO (Paris, L. Sagart and G. Jacques) in collaboration with SOAS (London, N. Hill)
      50+ doculects
      250+ concepts
      alignment analyses for alignable cognates
      detailed annotation of morphological relations (“linguistic priors”)
      cross-semantic search for cognate words
      to be launched before autumn 2015
    A Comparative Database of Tukanoan and Northwestern South American Languages
      Thiago will soon present you with the details!
    29 / 30

  73. THANK YOU FOR
    LISTENING!
    30 / 30
