Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context

Johann-Mattis List¹, Thiago Chacon²
¹Centre des recherches linguistiques sur l’Asie Orientale, Paris
²University of Brasilia, Brasilia
2015-09-04

Prologue

CHALLENGES, DREAMS, AND REALITY

WHAT WE PROMISED IN THE ABSTRACT
[Illustration with the labels "P(A|B) = P(B|A) P(A) / P(B)" and "Franz Bopp: Very, Very Long Title"]

WHERE WE ARE RIGHT NOW...

Challenges

Challenges: Preliminary Thoughts

Structural Challenges:
- transparency: real data, no abstract patterns
- comparability: unified human- and machine-readable format
- flexibility: no restriction to only one theory or feature system

Concrete Challenges:
- context: How to handle the context of sound change patterns in a transparent, flexible, and comparable way?
- language-specific aspects: How to model family- or language-specific sound change conditions without sacrificing comparability?
- quantifiability: How to quantify patterns within and across languages?
- computability: How to formalize the modeling process so that we get as much help as possible from computational applications?

Handling Context with Computers

Handling Context with Computers: Alignments

Main Idea: Start by aligning cognate word forms of ancestral (source) and descendant (target) languages.

Advantages:
- comparability across languages and language families
- concrete, transparent statements regarding relatedness
- easy to produce with new interactive software tools
- no restrictions on data size

Challenges:
- handling of suprasegmental aspects of pronunciation
- handling of unalignable patterns (morphology)
- handling of non-linear patterns (metathesis, mergers)

Handling Context with Computers: Alignments

- German Schwert < Proto-Germanic *swerd-
- German schwellen < Proto-Germanic *swellan
- German schwimmen < Proto-Germanic *swemman
- German Schwester < Proto-Germanic *swestēr
- ...

Handling Context with Computers: Multi-Tiered Sequence Modeling

Main Idea: We add context to alignments by defining different types of context for each segment of the ancestral word form. Each alignment of an ancestral and a descendant form is then additionally represented by multiple context tiers.

Three Technical Types of Context:
- preceding: the element is preceded by one or more other elements
- following: the element is followed by one or more other elements
- abstract: the element’s context is specified otherwise

Advantages:
- multi-tiers can handle suprasegmental aspects (stress, tone)
- multi-tiers are extremely flexible, but also comparable across languages
- multi-tiers are human- and machine-readable
- multi-tiers are easy to compute automatically
- multi-tiers are an excellent heuristic for initial language comparison

[Figure, built up in steps: a segment sequence (上 ŋ w # $ æ k) annotated with PRECEDING, FOLLOWING, and ABSTRACT context tiers; the abstract tiers are labeled tone, palatal, and nasal.]

Tier                  Alignment
SOURCE                s   w   e   r   d
CV / x_               #   C   C   V   C
CV / _x               C   V   C   C   $
SOUND CLASSES / x_    #   S   W   V   R
SOUND CLASSES / _x    W   V   R   T   $
PROSODIC STRENGTH     7   5   4   3   2
WORD LENGTH           1   1   1   1   1
FEATURES              F   G   V   L   P
ACCENT                1   1   1   1   1
TARGET                ʃ   v   eː  r   t
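The preceding and following tiers in the alignment above can be derived mechanically from the source form. The following is a minimal sketch of that step, not the authors' implementation: the names `CV_CLASSES`, `SOUND_CLASSES`, `preceding_tier`, `following_tier`, and `build_tiers` are hypothetical, and the two class mappings are assumptions that cover only the five segments of *swerd-.

```python
# Sketch: derive "preceding" (x_) and "following" (_x) context tiers.
# The class mappings are illustrative assumptions covering only the
# segments of the example word; they are not a full sound class model.
CV_CLASSES = {"s": "C", "w": "C", "e": "V", "r": "C", "d": "C"}
SOUND_CLASSES = {"s": "S", "w": "W", "e": "V", "r": "R", "d": "T"}


def preceding_tier(values, boundary="#"):
    """Value of the element before each position; '#' marks the word start."""
    return [boundary] + values[:-1]


def following_tier(values, boundary="$"):
    """Value of the element after each position; '$' marks the word end."""
    return values[1:] + [boundary]


def build_tiers(source, target):
    """Return a dict of tier name -> list of values, aligned with `source`."""
    if len(source) != len(target):
        raise ValueError("source and target must be aligned to equal length")
    cv = [CV_CLASSES[seg] for seg in source]
    sc = [SOUND_CLASSES[seg] for seg in source]
    return {
        "SOURCE": source,
        "CV / x_": preceding_tier(cv),
        "CV / _x": following_tier(cv),
        "SOUND CLASSES / x_": preceding_tier(sc),
        "SOUND CLASSES / _x": following_tier(sc),
        "TARGET": target,
    }


tiers = build_tiers("s w e r d".split(), "ʃ v eː r t".split())
for name, values in tiers.items():
    print(f"{name:<20}" + " ".join(values))
```

Abstract tiers such as prosodic strength or accent cannot be shifted into place like this; in this sketch they would have to be supplied as additional lists of the same length.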

Tier                   Context    Proto-Germanic *p
CV                     preceding  #   V   V   C   C   C   V   V   C   V   V   C   C   C   C
CV                     following  C   $   V   V   V   $   $   C   V   V   V   V   V   V   V
CV Cluster             preceding  #   CV  CV  VC  VC  VC  CV  CV  VC  CV  CV  #C  #C  #C  #C
CV Cluster             following  VC  $   CV  CV  CV  $   $   CC  CV  CV  CV  CV  CV  CV  CV
Sound Class            preceding  #   V   V   R   R   R   V   V   R   V   V   S   S   S   S
Sound Class            following  R   $   V   V   V   $   $   J   V   V   V   V   V   V   V
Sound                  preceding  #   a   e   l   l   r   e   a   r   u   i   s   s   s   s
Sound                  following  l   $   a   o   a   $   $   j   a   a   a   e   u   i   a
Prosodic Strength      abstract   7   2   6   6   6   2   2   6   6   6   6   5   5   5   5
Reflex in German                  p͡f  f   f   f   f   f   f   f   f   f   f   p   p   p   p
Frequency in Database             2   2   2   1   1   2   1   1   1   1   2   3   1   1   6

- Proto-Germanic *p > p͡f / #_
- Proto-Germanic *p > f / [R,V]_
- Proto-Germanic *p > p / [S]_
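One way to read patterns off the Proto-Germanic *p table is to ask, for each tier, whether every value of that tier co-occurs with exactly one reflex. The sketch below does this with the tier rows copied from the table. It is an illustrative check under that assumption, not the procedure used for the demo, and the helper names `reflexes_by_value` and `predicts_reflex` are hypothetical.

```python
from collections import defaultdict

# Reflexes of Proto-Germanic *p in German, one entry per table column.
REFLEXES = ["p͡f"] + ["f"] * 10 + ["p"] * 4

# Tier rows copied from the table, one value per column.
TIERS = {
    "CV preceding": "# V V C C C V V C V V C C C C",
    "CV following": "C $ V V V $ $ C V V V V V V V",
    "CV Cluster preceding": "# CV CV VC VC VC CV CV VC CV CV #C #C #C #C",
    "CV Cluster following": "VC $ CV CV CV $ $ CC CV CV CV CV CV CV CV",
    "Sound Class preceding": "# V V R R R V V R V V S S S S",
    "Sound Class following": "R $ V V V $ $ J V V V V V V V",
    "Sound preceding": "# a e l l r e a r u i s s s s",
    "Sound following": "l $ a o a $ $ j a a a e u i a",
    "Prosodic Strength abstract": "7 2 6 6 6 2 2 6 6 6 6 5 5 5 5",
}
TIERS = {name: row.split() for name, row in TIERS.items()}


def reflexes_by_value(tier_values, reflexes):
    """Map each tier value to the set of reflexes it co-occurs with."""
    mapping = defaultdict(set)
    for value, reflex in zip(tier_values, reflexes):
        mapping[value].add(reflex)
    return dict(mapping)


def predicts_reflex(tier_values, reflexes):
    """True if every tier value co-occurs with exactly one reflex."""
    found = reflexes_by_value(tier_values, reflexes)
    return all(len(outcomes) == 1 for outcomes in found.values())


regular_tiers = [
    name for name, values in TIERS.items() if predicts_reflex(values, REFLEXES)
]
print(regular_tiers)
print(reflexes_by_value(TIERS["Sound Class preceding"], REFLEXES))
```

On this table more than one tier passes the check, including the preceding sound class tier that the three patterns are stated in, so choosing among the passing tiers needs a further criterion such as the number of distinct tier values.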

Handling Context with Computers: Optimizing Multi-Tiered Alignments

Multi-tiers are only a representation!
Multi-tiered alignments alone won’t give us the context that conditions a sound change. To infer the patterns that condition sound change processes, the relevant contexts need to be inspected carefully.

Multi-tier systems can be optimized automatically
Automated procedures can be a great help here, since they can be used to search for the optimal tier system needed to explain a given dataset as having evolved through regular sound change processes. Multi-tiers also help greatly in identifying erroneous cognate sets, undetected borrowings, or wrong alignments in a given dataset.
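As a rough illustration of what scoring candidate tiers could look like, the sketch below ranks three tiers from the Proto-Germanic *p table in these slides by how many attested forms they explain. It assumes that the most frequent reflex per tier value is the regular outcome and that the "Frequency in Database" row counts forms per column. Both are assumptions made for this example, the function `score_tier` is hypothetical, and the slides do not specify the actual optimization procedure.

```python
from collections import Counter, defaultdict

# Data copied from the Proto-Germanic *p table (one entry per column).
REFLEXES = ["p͡f"] + ["f"] * 10 + ["p"] * 4
FREQUENCIES = [2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 1, 6]
TIERS = {
    "CV preceding": "# V V C C C V V C V V C C C C".split(),
    "Sound Class preceding": "# V V R R R V V R V V S S S S".split(),
    "Sound Class following": "R $ V V V $ $ J V V V V V V V".split(),
}


def score_tier(tier_values, reflexes, frequencies):
    """Return (explained, exceptions) for one tier.

    For every tier value, the most frequent reflex counts as the regular
    outcome. `explained` sums the frequencies of columns showing that
    outcome; `exceptions` lists the 0-based indices of all other columns.
    """
    counts = defaultdict(Counter)
    for value, reflex, freq in zip(tier_values, reflexes, frequencies):
        counts[value][reflex] += freq
    regular = {value: c.most_common(1)[0][0] for value, c in counts.items()}

    explained = 0
    exceptions = []
    columns = zip(tier_values, reflexes, frequencies)
    for index, (value, reflex, freq) in enumerate(columns):
        if reflex == regular[value]:
            explained += freq
        else:
            exceptions.append(index)
    return explained, exceptions


scores = {
    name: score_tier(values, REFLEXES, FREQUENCIES)
    for name, values in TIERS.items()
}
best = max(scores, key=lambda name: scores[name][0])
for name, (explained, exceptions) in scores.items():
    print(f"{name:<24} explained={explained:>2} exceptions={exceptions}")
print("best tier:", best)
```

Columns that remain exceptions under the best-scoring tier are the candidates for the manual checks mentioned above: erroneous cognate sets, undetected borrowings, or wrong alignments.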

Examples

Examples: Workflow

- provide data
- align data automatically
- revise alignments manually
- check consistency of alignments and determine exceptions
- inspect exceptions and provide information on additional tiers
- infer sound change processes by applying multi-tier optimization techniques
- check the inferred sound change processes and refine the patterns
- adjust the inferred patterns to the tier system and add them to the database

Examples: Demo

Data Basis:
- three Germanic languages (German, English, Dutch, with 428, 386, and 288 words) aligned with Proto-Germanic protoforms
- 1584 words in total, 481 cognate sets

Data Preparation:
- ancestral and descendant forms were taken from Orel (2003) and checked for errors against further literature (Kluge 2002, Pfeiffer 1992)
- phonetic transcriptions were taken from independent sources
- preliminary algorithms for the construction of tier systems were programmed (currently at http://github.com/lingpy/tiers/) and will later be included in LingPy (http://lingpy.org)
- alignments were carried out automatically using LingPy and then manually checked using the EDICTOR tool (http://tsv.lingpy.org)
- irregular forms were automatically identified with the help of a multi-tier optimization procedure and excluded from the demo
- the demo is available online at http://dighl.github.io/tiers/germanic-testset.html

Examples: Further Steps

- expand the analyses to further datasets (Tukano, Indo-European, Chinese)
- work out and test ways to model sound sequences beyond their IPA types (gestural mechanics, feature systems, etc.)
- explore and adjust the multi-tier systems based on discussions with computer scientists and experts on diachronic phonology (can we classify tier systems in a less technical way?)
- seek solutions for problems which multi-tiers cannot answer so far (non-linear sound change patterns, secondary changes)
- try to find methods and models to further structure the inferred sound change patterns and reconcile them with existing theories of language change
- find solutions to get the time dimension into the sound change modeling process (using trees or networks)
- seek collaborations with scholars willing to share their data and with scholars willing to share and test their theories

Concluding Remarks

It’s a very long way up to the top...

... but together we can make it!

Concluding Remarks

Collecting cross-linguistic data on regular sound change patterns is very challenging. The way towards a cross-linguistic database for historical phonology is long and stony. But we are making progress, and with the combined efforts of historical linguists and computer scientists, working together within a computer-assisted framework of language collaboration, we may succeed.

So far, computational approaches use only the data, not the theories, while linguistic theories are often developed on only small amounts of data. Placing the construction of the database in a framework of computer-assisted (as opposed to computer-based) language comparison could reconcile the achievements of historical linguistics with the most recent developments in computer science.

Thanks for Your Attention!
