Slide 1

Slide 1 text

Towards a Cross-Linguistic Database for Historical Phonology? A Proposal for a Machine-Readable Modeling of Phonetic Context
Johann-Mattis List¹, Thiago Chacon²
¹Centre des recherches linguistiques sur l’Asie Orientale, Paris
²University of Brasilia, Brasilia
2015-09-04

Slide 2

Slide 2 text

Prologue

Slide 3

Slide 3 text

CHALLENGES, DREAMS, AND REALITY

Slide 4

Slide 4 text

CHALLENGES, DREAMS, AND REALITY
WHAT WE PROMISED IN THE ABSTRACT
[Figure: image containing the text "P(A|B) = P(B|A) P(A) / P(B)", "FRANZ BOPP", and "VERY, VERY LONG TITLE"]

Slide 5

Slide 5 text

CHALLENGES, DREAMS, AND REALITY
WHERE WE ARE RIGHT NOW...
[Figure: image containing the text "P(A|B) = P(B|A) P(A) / P(B)", "FRANZ BOPP", and "VERY, VERY LONG TITLE"]

Slide 6

Slide 6 text

Challenges

Slide 7

Slide 7 text

Challenges: Preliminary Thoughts

Structural Challenges:
- transparency: real data, no abstract patterns
- comparability: unified human- and machine-readable format
- flexibility: no restriction to only one theory or feature system

Concrete Challenges:
- context: How to handle the context of sound change patterns in a transparent, flexible, and comparable way?
- language-specific aspects: How to allow for a modeling of family- or language-specific sound change conditions without sacrificing comparability?
- quantifiability: How to quantify patterns inside and across languages?
- computability: How to formalize the process of modeling in such a way that we can get as much help as possible from computational applications?

Slide 8

Slide 8 text

Handling Context with Computers

Slide 9

Slide 9 text

Handling Context with Computers: Alignments

Main Idea: Start by aligning cognate word forms of ancestral (source) and descendant (target) languages.

Advantages:
- comparability across languages and language families
- concrete, transparent statements regarding relatedness
- easy to produce with new interactive software tools
- no restrictions on data size

Challenges:
- handling of suprasegmental aspects of pronunciation
- handling of unalignable patterns (morphology)
- handling of non-linear patterns (metathesis, mergers)

Slide 10

Slide 10 text

Handling Context with Computers: Alignments

German Schwert < Proto-Germanic *swerd-
German schwellen < Proto-Germanic *swellan
German schwimmen < Proto-Germanic *swemman
German Schwester < Proto-Germanic *swestēr
...
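The alignment step can be illustrated with a minimal sketch, assuming a plain Needleman-Wunsch procedure and a toy scoring scheme that only rewards pairing vowels with vowels and consonants with consonants. The function names, the vowel inventory, and the scores are illustrative assumptions and not the LingPy routines used for the talk. The segmented forms are Proto-Germanic *swerd- and the transcription ʃ v eː r t given for German Schwert in the tier table.

```python
from typing import Callable, List, Tuple

# Hypothetical vowel inventory, used only by the toy scoring function below.
VOWELS = set("aeiouəɛɐ")


def same_class(a: str, b: str) -> int:
    """Toy score: +1 if both segments are vowels or both are consonants, else -1."""
    return 1 if (a[0] in VOWELS) == (b[0] in VOWELS) else -1


def align(
    source: List[str],
    target: List[str],
    score: Callable[[str, str], int] = same_class,
    gap: int = -1,
) -> Tuple[List[str], List[str]]:
    """Globally align two segment lists (Needleman-Wunsch); '-' marks a gap."""
    n, m = len(source), len(target)
    # dp[i][j] holds the best score for aligning source[:i] with target[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(source[i - 1], target[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(source[i - 1], target[j - 1]):
            out_s.append(source[i - 1])
            out_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_s.append(source[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(target[j - 1])
            j -= 1
    return out_s[::-1], out_t[::-1]


src, tgt = align(["s", "w", "e", "r", "d"], ["ʃ", "v", "eː", "r", "t"])
print(" ".join(src))
print(" ".join(tgt))
```

A realistic scorer would compare sound classes or features rather than a bare consonant/vowel distinction; the toy scheme is only meant to show where such a scorer plugs in.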

Slide 11

Slide 11 text

Handling Context with Computers: Multi-Tiered Sequence Modeling

Main Idea: We add context to alignments by defining different types of context for each segment of the ancestral word form. Each alignment of ancestral and descendant form is then additionally represented by multiple context tiers.

Three Technical Types of Context:
- preceding: the element is preceded by one or more other elements
- following: the element is followed by one or more other elements
- abstract: the element’s context is specified otherwise

Advantages:
- multi-tiers can handle suprasegmental aspects (stress, tone)
- multi-tiers are extremely flexible, but also comparable across languages
- multi-tiers are human- and machine-readable
- multi-tiers are easy to compute automatically
- multi-tiers are an excellent heuristic for initial language comparison
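The three context types can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation: the function names, the tier labels, and the small vowel inventory are assumptions made for the example. Preceding and following tiers are derived by shifting the segment sequence, with # and $ marking the word boundaries, while abstract tiers are simply supplied from outside.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical vowel inventory for the consonant/vowel abstraction.
VOWELS = set("aeiou")


def cv(segment: str) -> str:
    """Abstract a segment to C (consonant) or V (vowel)."""
    return "V" if segment[0] in VOWELS else "C"


def identity(segment: str) -> str:
    """No abstraction: the context is the sound itself."""
    return segment


def preceding(segments: List[str], abstract: Callable[[str], str] = identity) -> List[str]:
    """For each segment, the (abstracted) element before it; '#' marks the word start."""
    return ["#"] + [abstract(s) for s in segments[:-1]]


def following(segments: List[str], abstract: Callable[[str], str] = identity) -> List[str]:
    """For each segment, the (abstracted) element after it; '$' marks the word end."""
    return [abstract(s) for s in segments[1:]] + ["$"]


def multi_tiers(segments: List[str], abstract_tiers: Optional[Dict[str, list]] = None) -> Dict[str, list]:
    """Build a tier system: one list of context values per tier, aligned to the segments."""
    tiers: Dict[str, list] = {
        "SOURCE": list(segments),
        "SOUND / x_": preceding(segments),
        "SOUND / _x": following(segments),
        "CV / x_": preceding(segments, cv),
        "CV / _x": following(segments, cv),
    }
    # Abstract tiers (stress, tone, ...) cannot be derived from neighbours and are passed in.
    tiers.update(abstract_tiers or {})
    return tiers


tiers = multi_tiers(["s", "w", "e", "r", "d"], {"ACCENT": [1, 1, 1, 1, 1]})
for name, values in tiers.items():
    print(f"{name:12}", *values)
```

Because every tier is a list of the same length as the source form, further abstractions (sound classes, cluster types) can be added as extra rows without changing the alignment itself.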

Slide 18

Slide 18 text

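Handling Context with Computers: Multi-Tiered Sequence Modeling

[Figure: a sample segment sequence (labels in the image: 上, ŋ, w, #, $, æ, k) annotated with three kinds of context tiers: PRECEDING, FOLLOWING, and ABSTRACT (tone, palatal, nasal); tier values shown in the image: æ, k, æ, A, V, 上, i, N]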

Slide 19

Slide 19 text

Handling Context with Computers: Multi-Tiered Sequence Modeling

Tier                    Alignment
SOURCE                  s   w   e   r   d
CV / x_                 #   C   C   V   C
CV / _x                 C   V   C   C   $
SOUND CLASSES / x_      #   S   W   V   R
SOUND CLASSES / _x      W   V   R   T   $
PROSODIC STRENGTH       7   5   4   3   2
WORD LENGTH             1   1   1   1   1
FEATURES                F   G   V   L   P
ACCENT                  1   1   1   1   1
TARGET                  ʃ   v   eː  r   t

Slide 20

Slide 20 text

Handling Context with Computers: Multi-Tiered Sequence Modeling

Tier                         Proto-Germanic *p
CV preceding                 #   V   V   C   C   C   V   V   C   V   V   C   C   C   C
CV following                 C   $   V   V   V   $   $   C   V   V   V   V   V   V   V
CV Cluster preceding         #   CV  CV  VC  VC  VC  CV  CV  VC  CV  CV  #C  #C  #C  #C
CV Cluster following         VC  $   CV  CV  CV  $   $   CC  CV  CV  CV  CV  CV  CV  CV
Sound Class preceding        #   V   V   R   R   R   V   V   R   V   V   S   S   S   S
Sound Class following        R   $   V   V   V   $   $   J   V   V   V   V   V   V   V
Sound preceding              #   a   e   l   l   r   e   a   r   u   i   s   s   s   s
Sound following              l   $   a   o   a   $   $   j   a   a   a   e   u   i   a
Prosodic Strength abstract   7   2   6   6   6   2   2   6   6   6   6   5   5   5   5
Reflex in German             p͡f  f   f   f   f   f   f   f   f   f   f   p   p   p   p
Frequency in Database        2   2   2   1   1   2   1   1   1   1   2   3   1   1   6

Proto-Germanic *p > p͡f / #_
Proto-Germanic *p > f / [R,V]_
Proto-Germanic *p > p / [S]_
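The three rules under the table can be read off mechanically from two of its rows. The following is a minimal sketch, assuming the "Sound Class preceding", "Reflex in German", and "Frequency in Database" rows as input; the function name and the output layout are illustrative assumptions, not part of the tools mentioned in the talk.

```python
from collections import defaultdict

# Rows copied from the Proto-Germanic *p table above (15 attested contexts).
context = "# V V R R R V V R V V S S S S".split()  # Sound Class preceding
reflex = "p͡f f f f f f f f f f f p p p p".split()  # Reflex in German
freq = [2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 1, 6]  # Frequency in Database


def infer_rules(context, reflex, freq):
    """Group context values by reflex; fail if one context value yields two reflexes."""
    seen = defaultdict(lambda: defaultdict(int))
    for c, r, n in zip(context, reflex, freq):
        seen[c][r] += n
    rules = defaultdict(list)
    for c, reflexes in seen.items():
        if len(reflexes) != 1:
            raise ValueError(f"context {c!r} is not predictive: {dict(reflexes)}")
        (r,) = reflexes
        rules[r].append(c)
    # Number of database entries backing each context value.
    support = {c: sum(counts.values()) for c, counts in seen.items()}
    return {r: sorted(cs) for r, cs in rules.items()}, support


rules, support = infer_rules(context, reflex, freq)
for r, cs in rules.items():
    print(f"*p > {r} / [{','.join(cs)}]_  (contexts: {cs})")
print(support)
```

On this tier the reflex is a function of the context value, which is what allows the contexts R and V to be merged into one rule.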

Slide 24

Slide 24 text

Handling Context with Computers: Optimizing Multi-Tiered Alignments

Multi-tiers are only a representation!
Multi-tiered alignments alone won’t give us the context that conditions a sound change. In order to infer the patterns relevant for conditioning sound change processes, a careful inspection of the relevant contexts is needed.

Multi-tier systems can be automatically optimized
Automated procedures can provide great help here, since they can be used to seek the optimal tier system needed to explain a given dataset as having evolved through regular sound change processes. Multi-tiers also provide great help in identifying erroneous cognate sets, undetected borrowings, or wrong alignments in a given dataset.
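One way to picture such an optimization is a brute-force search for the smallest combination of tiers under which every context maps to exactly one reflex. The sketch below is an assumption about how this could be done, not the procedure used by the authors; the function names are hypothetical, and the data are three rows plus the reflex row from the Proto-Germanic *p table.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def is_regular(tiers: Dict[str, List[str]], names: Sequence[str], reflex: List[str]) -> bool:
    """True if the combined context on the chosen tiers never yields two different reflexes."""
    seen = {}
    for i, r in enumerate(reflex):
        key = tuple(tiers[name][i] for name in names)
        if seen.setdefault(key, r) != r:
            return False
    return True


def smallest_tier_systems(tiers: Dict[str, List[str]], reflex: List[str]) -> List[Tuple[str, ...]]:
    """Return all tier combinations of minimal size that make the data look regular."""
    for size in range(1, len(tiers) + 1):
        hits = [names for names in combinations(tiers, size) if is_regular(tiers, names, reflex)]
        if hits:
            return hits
    return []  # no combination explains the data: candidates for errors or borrowings


tiers = {
    "CV preceding": "# V V C C C V V C V V C C C C".split(),
    "Sound Class preceding": "# V V R R R V V R V V S S S S".split(),
    "Sound following": "l $ a o a $ $ j a a a e u i a".split(),
}
reflex = "p͡f f f f f f f f f f f p p p p".split()

best = smallest_tier_systems(tiers, reflex)
print(best)
```

The search is exponential in the number of tiers, so it only suits small tier systems; it also says nothing about which of several equally small solutions is linguistically plausible, which is where manual inspection comes back in.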

Slide 25

Slide 25 text

Examples Demo

Slide 34

Slide 34 text

Examples: Workflow

- provide data
- align data automatically
- revise alignments manually
- check consistency of alignments and determine exceptions
- inspect exceptions and provide information on additional tiers
- infer sound change processes by applying multi-tier optimization techniques
- check the inferred sound change processes and refine the patterns
- adjust the inferred patterns to the tier system and add them to the database
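The step "check consistency of alignments and determine exceptions" can be sketched as a majority vote per context: the most frequent reflex in a context is treated as regular, and every deviating form is handed back for manual inspection. The function name, the word identifiers, and the observations below are invented purely for illustration (including the single irregular item) and do not come from the database.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Observation = Tuple[str, str, str]  # (word id, context value, reflex)


def find_exceptions(observations: List[Observation]) -> Tuple[Dict[str, str], List[Observation]]:
    """Return the majority reflex per context and all observations deviating from it."""
    by_context: Dict[str, Counter] = defaultdict(Counter)
    for _, context, reflex in observations:
        by_context[context][reflex] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in by_context.items()}
    exceptions = [obs for obs in observations if obs[2] != majority[obs[1]]]
    return majority, exceptions


# Invented toy observations: context is the preceding sound class, reflex the outcome of *p.
observations = [
    ("w1", "S", "p"),
    ("w2", "S", "p"),
    ("w3", "S", "p"),
    ("w4", "S", "f"),  # invented irregular item, flagged for inspection
    ("w5", "V", "f"),
    ("w6", "V", "f"),
]

majority, exceptions = find_exceptions(observations)
print(majority)
print(exceptions)
```

A flagged form is not necessarily an error: it may point to a missing tier, a borrowing, or a wrong cognate judgment, which is why the workflow continues with manual inspection.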

Slide 35

Slide 35 text

Examples: Demo

Data Basis
- three Germanic languages (German, English, Dutch, with 428, 386, and 288 words) aligned with Proto-Germanic protoforms
- 1584 words in total, 481 cognate sets

Data Preparation
- ancestral and descendant forms were taken from Orel (2003) and checked for errors using further literature (Kluge 2002, Pfeiffer 1992)
- phonetic transcriptions were taken from independent sources
- preliminary algorithms for the construction of tier systems were programmed (currently at http://github.com/lingpy/tiers/) and will later be included in LingPy (http://lingpy.org)
- alignments were carried out automatically using LingPy and then manually checked using the EDICTOR tool (http://tsv.lingpy.org)
- irregular forms were automatically identified with the help of a multi-tier optimization procedure and excluded from the demo
- the demo is available online at http://dighl.github.io/tiers/germanic-testset.html

Slide 36

Slide 36 text

Examples: Further Steps

- expand the analyses to further datasets (Tukano, Indo-European, Chinese)
- work out and test ways to model sound sequences beyond their IPA types (gestural mechanics, feature systems, etc.)
- explore and adjust the multi-tier systems based on discussions with computer scientists and experts on diachronic phonology (can we classify tier systems in a less technical way?)
- seek solutions for problems which multi-tiers cannot answer so far (non-linear sound change patterns, secondary changes)
- try to find methods and models to further structure the inferred sound change patterns and reconcile them with existing theories of language change
- find ways to bring the time dimension into the sound change modeling process (using trees or networks)
- seek collaborations with scholars willing to share their data and with scholars willing to share and test their theories

Slide 37

Slide 37 text

Concluding Remarks

Slide 38

Slide 38 text

It’s a very long way up to the top...

Slide 39

Slide 39 text

... but together we can make it!

Slide 40

Slide 40 text

Concluding Remarks
Collecting cross-linguistic data on regular sound change patterns is very challenging. The way towards a cross-linguistic database for historical phonology is long and stony. But we are making progress, and with combined efforts involving the collaboration of historical linguists and computer scientists within a computer-assisted framework of language collaboration, we may succeed.

Slide 41

Slide 41 text

Concluding Remarks
So far, computational approaches only use the data, not the theories, and linguistic theories are often developed on only small amounts of data. Putting the construction of the database in a framework of computer-assisted – as opposed to computer-based – language comparison could reconcile the achievements of historical linguistics with the most recent developments in computer science.

Slide 42

Slide 42 text

Thanks for Your Attention!