Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context

Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context

Talk, held at the workshop "Historical Phonology and Phonological Theory" (2015-09-04, Leiden, Leiden University).

Johann-Mattis List

September 04, 2015
Tweet

More Decks by Johann-Mattis List

Other Decks in Science

Transcript

  1. Towards a Cross-Linguistic Database for Historical
    Phonology?
    A Proposal for a Machine-Readable Modeling of Phonetic Context
    Johann-Mattis List¹, Thiago Chacon²
    ¹Centre des recherches linguistiques sur l’Asie Orientale, Paris
    ²University of Brasilia, Brasilia
    2015-09-04
    1 / 20

    View Slide

  2. Prologue
    2 / 20

    View Slide

  3. CHALLENGES, DREAMS,
    AND REALITY
    3 / 20

    View Slide

  4. WHAT WE
    PROMISED IN
    THE ABSTRACT
    CHALLENGES, DREAMS,
    AND REALITY
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    3 / 20

    View Slide

  5. WHERE WE ARE
    RIGHT NOW...
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    CHALLENGES, DREAMS,
    AND REALITY
    3 / 20

    View Slide

  6. Challenges
    Challenges
    4 / 20

    View Slide

  7. Challenges Preliminary Thoughts
    Preliminary Thoughts
    Structural Challenges:
    - transparency: real data, no abstract patterns
    - comparability: unified human- and machine-readable format
    - flexibility: no restriction to only one theory or feature system
    Concrete Challenges
    - context: How to handle context of sound change patterns in a trans-
    parent, flexible, and comparable way?
    - language-specific aspects: How to allow for a modeling of family- or
    language-specific sound change conditions without sacrificing compara-
    bility?
    - quantifiability: How to quantify patterns inside and across languages?
    - computability: How to formalize the process of modeling in such a way
    that we can get as much help as possible from computational applicati-
    ons?
    5 / 20

    View Slide

  8. Handling Context with Computers
    Handling Context with Computers
    6 / 20

    View Slide

  9. Handling Context with Computers Alignments
    Alignments
    Main Idea:
    Start by aligning cognate word forms of ancestral (source)
    and descendant (target) languages.
    Advantages:
    - comparability across languages and language families.
    - concrete, transparent statements regarding relatedness
    - easy to produce with new interactive software tools
    - no restrictions to data size
    Challenges:
    - handling of supra-segmental aspects of pronunciation
    - handling of unalignable patterns (morphology)
    - handling of non-linear patterns (metathesis, mergers)
    7 / 20

    View Slide

  10. Handling Context with Computers Alignments
    Alignments
    German Schwert < Proto-Germanic *swerd-
    German schwellen < Proto-Germanic *swellan
    German schwimmen < Proto-Germanic *swemman
    German Schwester < Proto-Germanic *swestēr
    ... ...
    8 / 20

    View Slide

  11. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Main Idea:
    We add context to alignments by defining different types of context to
    each segment of the ancestral word form. Each alignment of ancestral
    and descendant form is then additionally represented by multiple context
    tiers.
    Three Technical Types of Context:
    - preceding: the element is preceded by one or more other elements
    - following: the element is followed by one or more other elements
    - abstract: the element’s context is specified otherwise
    Advantages:
    - multi-tiers can handle suprasegmental aspects (stress, tone)
    - multi-tiers are extremely flexible, but also comparable across languages
    - multi-tiers are human- and machine-readable
    - multi-tiers are easy to compute automatically
    - multi-tiers are an excellent heuristic for initial language comparison
    9 / 20

    View Slide

  12. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    æ
    k
    10 / 20

    View Slide

  13. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    æ
    k
    10 / 20

    View Slide

  14. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    #
    PRECEDING
    æ
    k
    10 / 20

    View Slide

  15. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    #
    PRECEDING
    FO
    LLO
    W
    IN
    G
    æ
    k
    æ
    10 / 20

    View Slide

  16. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    #
    PRECEDING
    FO
    LLO
    W
    IN
    G
    æ
    k
    æ
    A
    10 / 20

    View Slide

  17. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    #
    PRECEDING
    FO
    LLO
    W
    IN
    G
    æ
    k
    æ
    A
    V
    10 / 20

    View Slide

  18. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling

    ŋ
    w
    # $
    #
    PRECEDING
    FO
    LLO
    W
    IN
    G
    ABSTRACT
    tone
    palatal
    nasal
    æ
    k
    æ
    A
    V

    i
    N
    10 / 20

    View Slide

  19. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Tier Alignment
    SOURCE s w e r d
    CV / x_ # C C V C
    CV / _x C V C C $
    SOUND CLASSES / x_ # S W V R
    SOUND CLASSES / _x W V R T $
    PROSODIC STRENGTH 7 5 4 3 2
    WORD LENGTH 1 1 1 1 1
    FEATURES F G V L P
    ACCENT 1 1 1 1 1
    TARGET ʃ v eː r t
    10 / 20

    View Slide

  20. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Tier Proto-Germanic *p
    CV preceding # V V C C C V V C V V C C C C
    CV following C $ V V V $ $ C V V V V V V V
    CV Cluster preceding # CV CV VC VC VC CV CV VC CV CV #C #C #C #C
    CV Cluster following VC $ CV CV CV $ $ CC CV CV CV CV CV CV CV
    Sound Class preceding # V V R R R V V R V V S S S S
    Sound Class following R $ V V V $ $ J V V V V V V V
    Sound preceding # a e l l r e a r u i s s s s
    Sound following l $ a o a $ $ j a a a e u i a
    Prosodic Strength abstract 7 2 6 6 6 2 2 6 6 6 6 5 5 5 5
    Reflex in German p͡f f f f f f f f f f f p p p p
    Frequency in Database 2 2 2 1 1 2 1 1 1 1 2 3 1 1 6
    Proto-Germanic *p > p͡f / #_
    Proto-Germanic *p > f / [R,V]_
    Proto-Germanic *p > p / [S]_
    11 / 20

    View Slide

  21. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Tier Proto-Germanic *p
    CV preceding # V V C C C V V C V V C C C C
    CV following C $ V V V $ $ C V V V V V V V
    CV Cluster preceding # CV CV VC VC VC CV CV VC CV CV #C #C #C #C
    CV Cluster following VC $ CV CV CV $ $ CC CV CV CV CV CV CV CV
    Sound Class preceding # V V R R R V V R V V S S S S
    Sound Class following R $ V V V $ $ J V V V V V V V
    Sound preceding # a e l l r e a r u i s s s s
    Sound following l $ a o a $ $ j a a a e u i a
    Prosodic Strength abstract 7 2 6 6 6 2 2 6 6 6 6 5 5 5 5
    Reflex in German p͡f f f f f f f f f f f p p p p
    Frequency in Database 2 2 2 1 1 2 1 1 1 1 2 3 1 1 6
    Proto-Germanic *p > p͡f / #_
    Proto-Germanic *p > f / [R,V]_
    Proto-Germanic *p > p / [S]_
    11 / 20

    View Slide

  22. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Tier Proto-Germanic *p
    CV preceding # V V C C C V V C V V C C C C
    CV following C $ V V V $ $ C V V V V V V V
    CV Cluster preceding # CV CV VC VC VC CV CV VC CV CV #C #C #C #C
    CV Cluster following VC $ CV CV CV $ $ CC CV CV CV CV CV CV CV
    Sound Class preceding # V V R R R V V R V V S S S S
    Sound Class following R $ V V V $ $ J V V V V V V V
    Sound preceding # a e l l r e a r u i s s s s
    Sound following l $ a o a $ $ j a a a e u i a
    Prosodic Strength abstract 7 2 6 6 6 2 2 6 6 6 6 5 5 5 5
    Reflex in German p͡f f f f f f f f f f f p p p p
    Frequency in Database 2 2 2 1 1 2 1 1 1 1 2 3 1 1 6
    Proto-Germanic *p > p͡f / #_
    Proto-Germanic *p > f / [R,V]_
    Proto-Germanic *p > p / [S]_
    11 / 20

    View Slide

  23. Handling Context with Computers Multi-Tiered Sequence Modeling
    Multi-Tiered Sequence Modeling
    Tier Proto-Germanic *p
    CV preceding # V V C C C V V C V V C C C C
    CV following C $ V V V $ $ C V V V V V V V
    CV Cluster preceding # CV CV VC VC VC CV CV VC CV CV #C #C #C #C
    CV Cluster following VC $ CV CV CV $ $ CC CV CV CV CV CV CV CV
    Sound Class preceding # V V R R R V V R V V S S S S
    Sound Class following R $ V V V $ $ J V V V V V V V
    Sound preceding # a e l l r e a r u i s s s s
    Sound following l $ a o a $ $ j a a a e u i a
    Prosodic Strength abstract 7 2 6 6 6 2 2 6 6 6 6 5 5 5 5
    Reflex in German p͡f f f f f f f f f f f p p p p
    Frequency in Database 2 2 2 1 1 2 1 1 1 1 2 3 1 1 6
    Proto-Germanic *p > p͡f / #_
    Proto-Germanic *p > f / [R,V]_
    Proto-Germanic *p > p / [S]_
    11 / 20

    View Slide

  24. Handling Context with Computers Optimizing Multi-Tiered Alignments
    Optimizing Multi-Tiered Alignments
    Multi-tiers are only a representation!
    Mult-tiered alignments alone won’t give us the sound-change conditio-
    ning context. In order to infer the patterns relevant to condition sound
    change processes, a careful inspection of relevant contexts is needed.
    Multi-tier-Systems can be automatically optimized
    Automated procedures can provide a great help here, since they can be
    used to seek the optimal tier-system which is needed in order to explain a
    given datasets as having evolved from regular sound change processes.
    Multi-tiers also provide great help in identifying erroneous cognate sets,
    undetected borrowings, or wrong alignments in a given dataset.
    12 / 20

    View Slide

  25. Examples
    Demo
    13 / 20

    View Slide

  26. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    14 / 20

    View Slide

  27. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    14 / 20

    View Slide

  28. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    14 / 20

    View Slide

  29. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    14 / 20

    View Slide

  30. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    check consistency
    of alignments
    and determine
    exceptions
    14 / 20

    View Slide

  31. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    check consistency
    of alignments
    and determine
    exceptions
    inspect exceptions
    and provide information
    on additional tiers
    14 / 20

    View Slide

  32. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    check consistency
    of alignments
    and determine
    exceptions
    inspect exceptions
    and provide information
    on additional tiers
    infer sound change
    processes by applying
    multi-tier optimization
    techniques
    14 / 20

    View Slide

  33. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    check consistency
    of alignments
    and determine
    exceptions
    inspect exceptions
    and provide information
    on additional tiers
    check the inferred
    sound change
    processes and refine
    the patterns
    infer sound change
    processes by applying
    multi-tier optimization
    techniques
    14 / 20

    View Slide

  34. Examples Workflow
    Workflow
    P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    provide
    data
    align data
    automatically
    revise
    alignments
    manually
    check consistency
    of alignments
    and determine
    exceptions
    inspect exceptions
    and provide information
    on additional tiers
    check the inferred
    sound change
    processes and refine
    the patterns
    infer sound change
    processes by applying
    multi-tier optimization
    techniques
    adjust the inferred
    patterns to the tier
    system and add them
    to the database
    14 / 20

    View Slide

  35. Examples Demo
    Demo
    Data Basis
    - three Germanic languages (German, English, Dutch, with 428, 386, and
    288 words) aligned with Proto-Germanic protoforms
    - 1584 words in total, 481 cognate sets
    Data preparation
    - ancestral and descendant forms were taken from Orel (2003) and checked
    for errors using further literature (Kluge 2002, Pfeiffer 1992)
    - phonetic transcriptions were taken from independent sources
    - preliminary algorithms for the construction of tier systems were program-
    med (currently at http://github.com/lingpy/tiers/), and will later
    be included in LingPy (http://lingpy.org)
    - alignments were carried out automatically using LingPy and then manually
    checked using the EDICTOR tool (http://tsv.lingpy.org)
    - irregular forms were automatically identified with help of a multi-tier opti-
    mization procedure and excluded from the demo
    - demo online available at http://dighl.github.io/tiers/
    germanic-testset.html
    15 / 20

    View Slide

  36. Examples Further Steps
    Further Steps
    expand the analyses to further datasets (Tukano, Indo-European,
    Chinese)
    work out and test ways to model sound sequences beyond their
    IPA types (gestural mechanics, features systems, etc.)
    explore and adjust the multi-tier-systems based on discussions
    with computer scientists and experts on diachronic phonology (can
    we classify tier systems in a less technical way?)
    seek solutions for problems which multi-tiers cannot answer so far
    (non-linear sound change patterns, secondary changes)
    try to find methods and models to further structure the inferred
    sound change patterns and reconcile them with existing theories of
    language change
    find solutions to get the time dimension into the sound change
    modeling process (using trees or networks)
    seek collaborations with scholars willing to share their data and
    with scholars willing to share and test their theories
    16 / 20

    View Slide

  37. Examples
    Concluding Remarks
    17 / 20

    View Slide

  38. P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    It’s a very long
    way up
    to
    the top...
    18 / 20

    View Slide

  39. P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    ... but together we can
    m
    ake it!
    18 / 20

    View Slide

  40. P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    Concluding Remarks
    Collecting cross-linguistic data on regu-
    lar sound change patterns is very chal-
    lenging.The way towards a cross-linguistic
    database for historical phonology is long
    and stony. But we are making progress,
    and with combined efforts involving the
    collaboration of historical linguists and
    computer scientists within a computer-
    assisted framework of language collabo-
    ration, we may succeed.
    19 / 20

    View Slide

  41. P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    Concluding Remarks
    So far, computational approaches only
    use the data, not the theories, and lin-
    guistic theories are often only developed
    on small amounts of data. Putting the
    construction of the database in a frame-
    work of computer-assisted – as opposed
    to computer-based – language compari-
    son could reconcile the achievements of
    historical linguistics with the most recent
    developments in computer sciences.
    19 / 20

    View Slide

  42. P(A|B)=(P(B|A)P(A))/(P(B)
    FRANZ BO
    PP
    VERY,
    VERY
    LO
    NG
    TI TLE
    Thanks for Your Attention!
    20 / 20

    View Slide