Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context

Talk, held at the workshop "Historical Phonology and Phonological Theory" (2015-09-04, Leiden, Leiden University).

Johann-Mattis List

September 04, 2015
Transcript

  1. Towards a Cross-Linguistic Database for Historical Phonology? A Proposal for a Machine-Readable Modeling of Phonetic Context

    Johann-Mattis List¹, Thiago Chacon²
    ¹Centre des recherches linguistiques sur l’Asie Orientale, Paris
    ²University of Brasilia, Brasilia
    2015-09-04
  2. WHAT WE PROMISED IN THE ABSTRACT

    CHALLENGES, DREAMS, AND REALITY

    [Image; legible labels: P(A|B) = P(B|A)P(A)/P(B), FRANZ BOPP, VERY, VERY LONG TITLE]
  3. WHERE WE ARE RIGHT NOW...

    CHALLENGES, DREAMS, AND REALITY
  4. Challenges: Preliminary Thoughts

    Structural Challenges:

    - transparency: real data, no abstract patterns
    - comparability: unified human- and machine-readable format
    - flexibility: no restriction to only one theory or feature system

    Concrete Challenges:

    - context: How to handle the context of sound change patterns in a transparent, flexible, and comparable way?
    - language-specific aspects: How to allow for a modeling of family- or language-specific sound change conditions without sacrificing comparability?
    - quantifiability: How to quantify patterns inside and across languages?
    - computability: How to formalize the process of modeling in such a way that we can get as much help as possible from computational applications?
  5. Handling Context with Computers: Alignments

    Main Idea: Start by aligning cognate word forms of ancestral (source) and descendant (target) languages.

    Advantages:

    - comparability across languages and language families
    - concrete, transparent statements regarding relatedness
    - easy to produce with new interactive software tools
    - no restrictions on data size

    Challenges:

    - handling of supra-segmental aspects of pronunciation
    - handling of unalignable patterns (morphology)
    - handling of non-linear patterns (metathesis, mergers)
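The alignment step can be illustrated with a minimal sketch. It uses a plain Needleman-Wunsch alignment with a naive identity scorer, which is an assumption made here for brevity and not the procedure behind the tools mentioned in the talk; the function name `align` and the scoring values are hypothetical. The segmented forms are Proto-Germanic *swerd- and German Schwert as they appear later in the deck.

```python
def align(source, target, match=1, mismatch=-1, gap=-1):
    """Globally align two lists of segments (Needleman-Wunsch, identity scorer)."""
    n, m = len(source), len(target)

    def sub(i, j):
        # Score for placing source[i-1] and target[j-1] in the same column.
        return match if source[i - 1] == target[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell, preferring aligned columns.
    alm_source, alm_target = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            alm_source.append(source[i - 1])
            alm_target.append(target[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            alm_source.append(source[i - 1])
            alm_target.append("-")
            i -= 1
        else:
            alm_source.append("-")
            alm_target.append(target[j - 1])
            j -= 1
    return alm_source[::-1], alm_target[::-1]


source = ["s", "w", "e", "r", "d"]    # Proto-Germanic *swerd-
target = ["ʃ", "v", "eː", "r", "t"]   # German Schwert
alm_source, alm_target = align(source, target)
for s, t in zip(alm_source, alm_target):
    print(s, t, sep="\t")
```

An identity scorer only rewards identical symbols, so it is too crude for real cognate data; a scorer based on sound classes would be the more realistic choice.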
  6. Handling Context with Computers: Alignments

    - German Schwert < Proto-Germanic *swerd-
    - German schwellen < Proto-Germanic *swellan
    - German schwimmen < Proto-Germanic *swemman
    - German Schwester < Proto-Germanic *swestēr
    - ...
  7. Handling Context with Computers: Multi-Tiered Sequence Modeling

    Main Idea: We add context to alignments by defining different types of context for each segment of the ancestral word form. Each alignment of ancestral and descendant form is then additionally represented by multiple context tiers.

    Three Technical Types of Context:

    - preceding: the element is preceded by one or more other elements
    - following: the element is followed by one or more other elements
    - abstract: the element’s context is specified otherwise

    Advantages:

    - multi-tiers can handle suprasegmental aspects (stress, tone)
    - multi-tiers are extremely flexible, but also comparable across languages
    - multi-tiers are human- and machine-readable
    - multi-tiers are easy to compute automatically
    - multi-tiers are an excellent heuristic for initial language comparison
  8. Handling Context with Computers: Multi-Tiered Sequence Modeling

    [Figure: a sound sequence annotated with PRECEDING, FOLLOWING, and ABSTRACT context tiers; legible labels include "tone", "palatal", and "nasal"]
  9. Handling Context with Computers: Multi-Tiered Sequence Modeling

    | Tier | Alignment | | | | |
    | --- | --- | --- | --- | --- | --- |
    | SOURCE | s | w | e | r | d |
    | CV / x_ | # | C | C | V | C |
    | CV / _x | C | V | C | C | $ |
    | SOUND CLASSES / x_ | # | S | W | V | R |
    | SOUND CLASSES / _x | W | V | R | T | $ |
    | PROSODIC STRENGTH | 7 | 5 | 4 | 3 | 2 |
    | WORD LENGTH | 1 | 1 | 1 | 1 | 1 |
    | FEATURES | F | G | V | L | P |
    | ACCENT | 1 | 1 | 1 | 1 | 1 |
    | TARGET | ʃ | v | eː | r | t |
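How the preceding and following tiers on this slide could be derived from the source form is sketched below. The helper names and the small vowel inventory are assumptions for illustration only; the boundary symbols `#` and `$` and the tier labels follow the slide.

```python
# Hypothetical vowel inventory, just large enough for the example.
VOWELS = set("aeiou")


def cv_class(segment):
    """Map a segment to 'V' (vowel) or 'C' (consonant) by its first character."""
    return "V" if segment[0] in VOWELS else "C"


def preceding_tier(values):
    """Context to the left of each segment; '#' marks the word-initial boundary."""
    return ["#"] + values[:-1]


def following_tier(values):
    """Context to the right of each segment; '$' marks the word-final boundary."""
    return values[1:] + ["$"]


def multi_tier(source, target):
    """Bundle an aligned source/target pair with two CV context tiers."""
    cv = [cv_class(segment) for segment in source]
    return {
        "SOURCE": source,
        "CV / x_": preceding_tier(cv),
        "CV / _x": following_tier(cv),
        "TARGET": target,
    }


tiers = multi_tier(list("swerd"), ["ʃ", "v", "eː", "r", "t"])
for name, row in tiers.items():
    print(f"{name:10}", " ".join(row))
```

Further tiers (sound classes, prosodic strength, accent) would be added to the same dictionary, one list per tier, each with as many entries as the source form has segments.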
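  10. Handling Context with Computers: Multi-Tiered Sequence Modeling

    | Tier | Proto-Germanic *p | | | | | | | | | | | | | | |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | CV preceding | # | V | V | C | C | C | V | V | C | V | V | C | C | C | C |
    | CV following | C | $ | V | V | V | $ | $ | C | V | V | V | V | V | V | V |
    | CV Cluster preceding | # | CV | CV | VC | VC | VC | CV | CV | VC | CV | CV | #C | #C | #C | #C |
    | CV Cluster following | VC | $ | CV | CV | CV | $ | $ | CC | CV | CV | CV | CV | CV | CV | CV |
    | Sound Class preceding | # | V | V | R | R | R | V | V | R | V | V | S | S | S | S |
    | Sound Class following | R | $ | V | V | V | $ | $ | J | V | V | V | V | V | V | V |
    | Sound preceding | # | a | e | l | l | r | e | a | r | u | i | s | s | s | s |
    | Sound following | l | $ | a | o | a | $ | $ | j | a | a | a | e | u | i | a |
    | Prosodic Strength abstract | 7 | 2 | 6 | 6 | 6 | 2 | 2 | 6 | 6 | 6 | 6 | 5 | 5 | 5 | 5 |
    | Reflex in German | p͡f | f | f | f | f | f | f | f | f | f | f | p | p | p | p |
    | Frequency in Database | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 6 |

    - Proto-Germanic *p > p͡f / #_
    - Proto-Germanic *p > f / [R,V]_
    - Proto-Germanic *p > p / [S]_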
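The three rules on this slide can be read off by grouping the reflexes by the value of a single tier. The sketch below does this for the "Sound Class preceding" tier, using the values given on the slide; the function name is hypothetical, and summing the "Frequency in Database" row per reflex is an assumption about how that row is meant to be used.

```python
from collections import defaultdict

# Values copied from the slide (one entry per column of the table).
SOUND_CLASS_PRECEDING = "# V V R R R V V R V V S S S S".split()
REFLEX_IN_GERMAN = "p͡f f f f f f f f f f f p p p p".split()
FREQUENCY = [2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 1, 6]


def reflexes_by_context(contexts, reflexes, frequencies):
    """Sum the frequency of every reflex observed in each context value."""
    table = defaultdict(lambda: defaultdict(int))
    for context, reflex, frequency in zip(contexts, reflexes, frequencies):
        table[context][reflex] += frequency
    return {context: dict(counts) for context, counts in table.items()}


rules = reflexes_by_context(SOUND_CLASS_PRECEDING, REFLEX_IN_GERMAN, FREQUENCY)
for context, counts in rules.items():
    for reflex, frequency in counts.items():
        print(f"*p > {reflex} / {context}_  (frequency {frequency})")
```

Each context value maps to exactly one reflex here, which is what allows the contexts R and V to be merged into the single rule *p > f / [R,V]_.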
  14. Handling Context with Computers: Optimizing Multi-Tiered Alignments

    Multi-tiers are only a representation!

    Multi-tiered alignments alone won’t give us the context that conditions a sound change. In order to infer the patterns that condition sound change processes, a careful inspection of the relevant contexts is needed.

    Multi-tier systems can be automatically optimized

    Automated procedures can provide great help here, since they can be used to seek the optimal tier system needed to explain a given dataset as having evolved from regular sound change processes. Multi-tiers also provide great help in identifying erroneous cognate sets, undetected borrowings, or wrong alignments in a given dataset.
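What seeking an optimal tier system might look like is sketched below as a brute-force search. This is an illustrative assumption, not the optimization procedure referred to in the talk: a tier system counts as regular if every combination of tier values maps to exactly one reflex, and among regular systems the one with the fewest tiers and then the fewest distinct contexts is preferred. The tier values are taken from the Proto-Germanic *p table earlier in the deck; the function names are hypothetical.

```python
from itertools import combinations

REFLEXES = "p͡f f f f f f f f f f f p p p p".split()
TIERS = {
    "CV preceding": "# V V C C C V V C V V C C C C".split(),
    "CV following": "C $ V V V $ $ C V V V V V V V".split(),
    "Sound Class preceding": "# V V R R R V V R V V S S S S".split(),
    "Sound preceding": "# a e l l r e a r u i s s s s".split(),
}


def contexts(tier_names):
    """Combined context of every column under the given tiers."""
    return [
        tuple(TIERS[name][i] for name in tier_names)
        for i in range(len(REFLEXES))
    ]


def is_regular(tier_names):
    """True if each combined context is associated with a single reflex."""
    seen = {}
    for context, reflex in zip(contexts(tier_names), REFLEXES):
        if seen.setdefault(context, reflex) != reflex:
            return False
    return True


def best_tier_system():
    """Smallest regular tier system; ties broken by number of distinct contexts."""
    for size in range(1, len(TIERS) + 1):
        candidates = [c for c in combinations(TIERS, size) if is_regular(c)]
        if candidates:
            return min(candidates, key=lambda c: len(set(contexts(c))))
    return None


print(best_tier_system())
```

On this small table the CV tiers fail because the same CV context occurs with different reflexes, while the sound-class tier separates all three outcomes with fewer distinct contexts than the tier of individual sounds.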
  20. Examples: Workflow

    - provide data
    - align data automatically
    - revise alignments manually
    - check consistency of alignments and determine exceptions
    - inspect exceptions and provide information on additional tiers
    - infer sound change processes by applying multi-tier optimization techniques
    - check the inferred sound change processes and refine the patterns
    - adjust the inferred patterns to the tier system and add them to the database
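The step "check consistency of alignments and determine exceptions" could be approximated as follows. The sketch flags every observation whose reflex differs from the majority reflex for the same source sound and context. The majority criterion, the function name, the word identifiers, and the toy observations (including the deliberately irregular entry `w3`) are all invented for illustration and do not come from the dataset described in the talk.

```python
from collections import Counter, defaultdict


def find_exceptions(observations):
    """Return the ids of observations that deviate from the majority reflex.

    Each observation is a tuple (word_id, source_sound, context, reflex).
    """
    counts = defaultdict(Counter)
    for _, source, context, reflex in observations:
        counts[(source, context)][reflex] += 1

    exceptions = []
    for word_id, source, context, reflex in observations:
        majority, _ = counts[(source, context)].most_common(1)[0]
        if reflex != majority:
            exceptions.append(word_id)
    return exceptions


# Hypothetical toy observations; "w3" is an invented irregular case.
observations = [
    ("w1", "p", "#", "p͡f"),
    ("w2", "p", "#", "p͡f"),
    ("w3", "p", "#", "p"),
    ("w4", "p", "S", "p"),
    ("w5", "p", "V", "f"),
]
print(find_exceptions(observations))
```

Flagged items would then be inspected manually: they may point to a missing tier, a borrowing, a wrong cognate judgment, or a wrong alignment.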
  21. Examples: Demo

    Data Basis

    - three Germanic languages (German, English, Dutch, with 428, 386, and 288 words) aligned with Proto-Germanic protoforms
    - 1584 words in total, 481 cognate sets

    Data preparation

    - ancestral and descendant forms were taken from Orel (2003) and checked for errors using further literature (Kluge 2002, Pfeiffer 1992)
    - phonetic transcriptions were taken from independent sources
    - preliminary algorithms for the construction of tier systems were programmed (currently at http://github.com/lingpy/tiers/), and will later be included in LingPy (http://lingpy.org)
    - alignments were carried out automatically using LingPy and then manually checked using the EDICTOR tool (http://tsv.lingpy.org)
    - irregular forms were automatically identified with the help of a multi-tier optimization procedure and excluded from the demo
    - demo available online at http://dighl.github.io/tiers/germanic-testset.html
  22. Examples: Further Steps

    - expand the analyses to further datasets (Tukano, Indo-European, Chinese)
    - work out and test ways to model sound sequences beyond their IPA types (gestural mechanics, feature systems, etc.)
    - explore and adjust the multi-tier systems based on discussions with computer scientists and experts on diachronic phonology (can we classify tier systems in a less technical way?)
    - seek solutions for problems which multi-tiers cannot answer so far (non-linear sound change patterns, secondary changes)
    - try to find methods and models to further structure the inferred sound change patterns and reconcile them with existing theories of language change
    - find solutions to get the time dimension into the sound change modeling process (using trees or networks)
    - seek collaborations with scholars willing to share their data and with scholars willing to share and test their theories
  23. It’s a very long way up to the top...
  24. ... but together we can make it!
  25. Concluding Remarks

    Collecting cross-linguistic data on regular sound change patterns is very challenging. The way towards a cross-linguistic database for historical phonology is long and stony. But we are making progress, and with combined efforts involving the collaboration of historical linguists and computer scientists within a computer-assisted framework of language collaboration, we may succeed.
  26. Concluding Remarks

    So far, computational approaches only use the data, not the theories, and linguistic theories are often developed on only small amounts of data. Putting the construction of the database in a framework of computer-assisted (as opposed to computer-based) language comparison could reconcile the achievements of historical linguistics with the most recent developments in computer science.