Better models with Saussure: Simulating lexical evolution with semantic shifts
Talk by Gereon Kaiping and Johann-Mattis List, presented at the conference "Phylogenetic methods in historical linguistics" (2017/03/27-30, Eberhard-Karls-Universität, Tübingen)
Outline
1 Context and Motivation
2 Our Model
3 What we can and can’t do
4 Closing Remarks
Gereon Kaiping, Mattis List: Better Models With Saussure

Context
- Phylogenetic reconstruction has been enjoying great popularity of late.
- Language trees are not only used for genetic subgrouping of language families, but also to address general linguistic questions (typological universals, ...) and general anthropological/historical questions (Urheimat, ...).
- Phylogenetic reconstruction was the driving force for the recent quantitative turn in historical linguistics, and has been accepted by most scholars in the field.

Context: The Reconstruction Dilemma
- Historical linguistics deals with past states and events, investigating research objects and processes which are not directly observable → even falsification is tricky.
- Scholars tend to argue in terms of the likelihood of scenarios, but we cannot compare our inferences against inferences in controlled experiments.

Context
- Despite their popularity, methods for phylogenetic reconstruction are rarely tested, neither against gold standards (which do not really exist; the closest we have are phylogenies in databases like Glottolog [3]) nor against the results of simulation studies.
- In the rare cases where phylogenetic methods have been tested with the help of simulation studies [5, 2, 7, 6, 1], the tests were based on very simplistic models of lexical change that assume independent gain/loss of words or replacement of items.

Lexicostatistical Word List Data
Concept    ID  German    ID  English  ID  Italian  ID  French
HAND        1  Hand       1  hand      2  mano      2  main
BLOOD       3  Blut       3  blood     4  sangue    4  sang
HEAD        5  Kopf       6  head      7  testa     7  tête
TOOTH       8  Zahn       8  tooth     8  dente     8  dent
TO SLEEP    9  schlafen   9  sleep    10  dormire  10  dormir
TO SAY     11  sagen     11  say      12  dire     12  dire
…           …  …          …  …         …  …         …  …

Gain-Loss Coding
Concept    Proto-Form
HAND       PGM *xanda-
HAND       LAT mānus
BLOOD      PGM *blođa-
BLOOD      LAT sanguis
HEAD       PGM *kuppa-
HEAD       PGM *xawbda-
HEAD       LAT tēsta
TOOTH      PIE *h₃dont-
TO SLEEP   PGM *slēpan-
TO SLEEP   LAT dormīre
TO SAY     PGM *sagjan-
TO SAY     LAT dīcere
…          …
[Table columns German, English, Italian, French: presence/absence marks per cognate set, shown graphically on the slide]

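The step from the cognate-coded word list to a gain-loss (presence/absence) matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the cognate IDs are copied from the word list slide, while the names `COGNATE_IDS` and `gain_loss_matrix` are made up for this example.

```python
# Cognate IDs per concept, in the column order of the word list slide.
LANGUAGES = ["German", "English", "Italian", "French"]
COGNATE_IDS = {
    "HAND": [1, 1, 2, 2],
    "BLOOD": [3, 3, 4, 4],
    "HEAD": [5, 6, 7, 7],
    "TOOTH": [8, 8, 8, 8],
    "TO SLEEP": [9, 9, 10, 10],
    "TO SAY": [11, 11, 12, 12],
}


def gain_loss_matrix(cognate_ids):
    """Return {cognate ID: [1 if a language has a reflex, else 0]}."""
    matrix = {}
    for ids in cognate_ids.values():
        for column, cognate_id in enumerate(ids):
            # One binary row per cognate set; the concept label is dropped.
            row = matrix.setdefault(cognate_id, [0] * len(ids))
            row[column] = 1
    return matrix


matrix = gain_loss_matrix(COGNATE_IDS)
print("ID", *LANGUAGES)
for cognate_id in sorted(matrix):
    print(cognate_id, *matrix[cognate_id])
```

Each row is treated as an independent binary character, which is exactly the simplification that the following slides argue against: the link between a cognate set and its meaning is lost.
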
Motivation
- Gain/loss and replacement approaches are not satisfying linguistically.
- Phylogenetic reconstruction is important but insufficiently tested.
- Gold standard (controlled) datasets are not available.
- We barely understand the processes underlying lexical change.
→ By working on more realistic simulations, we can learn a lot about the processes of lexical change and also help to evaluate the accuracy of phylogenetic approaches.

Saussure’s model of the linguistic sign: Dynamics
[Figure: the French word forms arbre, bois, forêt]

Excursus: Semantic Network
[Figure: colexification network around "wood"; nodes: post, pole; staff, walking stick; doorpost, jamb; tree stump; mast; club; firewood; root; tree trunk; woods, forest; banana tree; tree; wood]
Source: http://clics.lingpy.org/browse.php?gloss=wood

Excursus: Semantic Network
CLICS [4]
- database of synchronic lexical associations (“colexifications”), currently 221 language varieties and 1280 concepts
- uses network approaches to partition the data into semantic fields
- web application at http://clics.lingpy.org allows for quick browsing of the semantic networks

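A colexification network of this kind can be derived from lexicons by linking two concepts whenever one word form expresses both. The sketch below is an illustration under assumptions, not the CLICS pipeline: the lexicons are invented toy data, and weighting an edge by the number of languages that show the colexification is one plausible choice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicons: language -> {word form: concepts it expresses}.
LEXICONS = {
    "lang_a": {"w1": {"wood", "woods, forest"}, "w2": {"tree"}},
    "lang_b": {"w1": {"wood", "tree"}, "w2": {"woods, forest"}},
    "lang_c": {"w1": {"wood", "tree", "woods, forest"}},
}


def colexification_weights(lexicons):
    """Count, for each concept pair, the languages that colexify it."""
    weights = Counter()
    for lexicon in lexicons.values():
        pairs_in_language = set()
        for concepts in lexicon.values():
            # Every pair of concepts sharing a word form is one colexification.
            pairs_in_language.update(combinations(sorted(concepts), 2))
        # A language contributes at most 1 to each edge weight.
        weights.update(pairs_in_language)
    return weights


network = colexification_weights(LEXICONS)
for (concept_a, concept_b), weight in sorted(network.items()):
    print(f"{concept_a} -- {concept_b}: {weight}")
```

The resulting weighted edge list is the kind of semantic network the simulation takes as input.
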
Design Goals
A more realistic model of lexical evolution
- is based on a bipartite graph structure of word forms and word meanings,
- builds on a dynamic representation of reference potentials instead of Saussure’s inseparable dichotomy of the linguistic sign,
- feeds on (ideally weighted and directed) networks of semantic associations, to account for the fact that semantic shift and lexical replacement follow certain preference laws.

Questions of Implementation
- How should the model drive the change of edge weights in the bipartite graph?
- How to choose the underlying semantic network?
- How do we see whether the model has any chance of realism?
- How can we select realistic parameters for the model?

Framework
[Figure: labels A, B, C, D, E; concepts "intention, purpose", "woods, forest", "tree", "wood", "post, pole"; word forms [1], [2], [3]; edge weights 6, 1, 1, 5, 2]

Evolution
Inspiration: Discrimination Game and Guessing Game
- Choose two random concepts $c_i$ ($P \propto \deg^2$).
- Score each word $w$ for each $c_i$: $w_i = \mathrm{wt}(w; c_i) + 0.1 \sum_{c \text{ neighbor of } c_i} \mathrm{wt}(w; c)$.
- Increase $\mathrm{wt}(w, c_i)$ where $w_{\neg i} = 0$ and $w_i$ max; or create a new word meaning $c_i$ with weight 1.
- Decrease $\mathrm{wt}(w, c_i)$ where $0 < w_{\neg i} < w_i$ max; or a random connection ($\propto \mathrm{wt}$).
- Decrease the weight of a random connection ($\propto \mathrm{wt}$).
[Figure: one example step on the bipartite graph of concepts ("intention, purpose", "woods, forest", "tree", "wood", "post, pole") and word forms [1], [2], [3]; edge weights change from 6, 1, 1, 5, 2 to 7, 0, 1, 5, 2, 1]

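The concept choice and the scoring rule from this slide can be sketched in a few lines. This is a hedged illustration and not the simuling implementation: the semantic network and the word-concept weights are invented, the helper names are hypothetical, drawing two distinct concepts is an assumption, and the weight updates themselves are left out.

```python
import random

# Hypothetical semantic network: concept -> neighbouring concepts.
NEIGHBOURS = {
    "woods, forest": {"tree", "wood"},
    "tree": {"woods, forest", "wood"},
    "wood": {"woods, forest", "tree", "post, pole"},
    "post, pole": {"wood"},
}

# Hypothetical bipartite graph: word form -> {concept: edge weight}.
WEIGHTS = {
    "[1]": {"woods, forest": 6, "tree": 1},
    "[2]": {"tree": 5},
    "[3]": {"wood": 2, "post, pole": 1},
}


def pick_two_concepts(neighbours, rng):
    """Draw two distinct concepts with probability proportional to degree squared."""
    concepts = sorted(neighbours)
    first = rng.choices(
        concepts, weights=[len(neighbours[c]) ** 2 for c in concepts]
    )[0]
    rest = [c for c in concepts if c != first]
    second = rng.choices(rest, weights=[len(neighbours[c]) ** 2 for c in rest])[0]
    return first, second


def score(word, concept, weights, neighbours, spill=0.1):
    """wt(w; c_i) plus 0.1 times the word's weights on the neighbours of c_i."""
    own = weights[word].get(concept, 0)
    nearby = sum(weights[word].get(c, 0) for c in neighbours[concept])
    return own + spill * nearby


rng = random.Random(0)
concept_1, concept_2 = pick_two_concepts(NEIGHBOURS, rng)
for word in sorted(WEIGHTS):
    print(word, score(word, concept_1, WEIGHTS, NEIGHBOURS),
          score(word, concept_2, WEIGHTS, NEIGHBOURS))
```

The neighbour term is what lets a word for "tree" compete, weakly, for the meaning "wood": a word scores above zero for a concept it is not yet attached to if it is attached to an adjacent one.
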
Long-Term Behaviour
Proposition: Behaviour (vocabulary size, polysemy, synonymy) should stabilize over long time scales at reasonable values.
Test: Run the simulation along a branch with 2 000 000 time steps, using CLICS (see above) as the semantic network.

[Figure: long-term behaviour over time steps t (logarithmic axis, 10^0 to 10^6), two panels: vocabulary size (axis range 800 to 1300); average polysemy and synonymity (axis range 0 to 5)]

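The three quantities tracked here can be read directly off the bipartite graph. The sketch below assumes the following definitions, which the slides do not spell out: vocabulary size is the number of word forms with at least one positive edge, polysemy is the number of concepts per word form, and synonymity is the number of word forms per concept. The weights are invented toy data.

```python
# Hypothetical bipartite graph: word form -> {concept: edge weight}.
WEIGHTS = {
    "[1]": {"woods, forest": 6, "tree": 1},
    "[2]": {"tree": 5},
    "[3]": {"wood": 2, "post, pole": 1, "tree": 0},  # zero weight: no edge
}


def active_edges(weights):
    """Return {word: set of concepts with a positive weight}, skipping empty words."""
    edges = {}
    for word, concepts in weights.items():
        linked = {c for c, weight in concepts.items() if weight > 0}
        if linked:
            edges[word] = linked
    return edges


def vocabulary_size(weights):
    return len(active_edges(weights))


def average_polysemy(weights):
    """Mean number of concepts per word form."""
    edges = active_edges(weights)
    return sum(len(concepts) for concepts in edges.values()) / len(edges)


def average_synonymity(weights):
    """Mean number of word forms per concept that has at least one word."""
    words_per_concept = {}
    for word, concepts in active_edges(weights).items():
        for concept in concepts:
            words_per_concept.setdefault(concept, set()).add(word)
    return sum(len(w) for w in words_per_concept.values()) / len(words_per_concept)


print(vocabulary_size(WEIGHTS), average_polysemy(WEIGHTS), average_synonymity(WEIGHTS))
```

Recording these values at logarithmically spaced time steps would give curves comparable to the two panels of the figure.
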
Calibration
Proposition: The Years:Replacement-Steps scaling parameter has an optimum.
Test: Run the simulation along a known dated tree (Chinese dialects from the Cíhuì) and compare with cross-semantically cognate-coded data, using the pairwise shared cognate proportion in real and simulated data.

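One way to make this comparison concrete is sketched below. The slides do not define the shared cognate proportion or the distance between real and simulated data, so two assumptions are made here: the proportion is taken as the Jaccard index of the two languages' cognate class sets, and the mismatch is the mean squared difference over all language pairs. The data and language names are invented.

```python
from itertools import combinations


def shared_cognate_proportion(classes_a, classes_b):
    """Jaccard index of two sets of cognate class IDs (an assumed definition)."""
    union = classes_a | classes_b
    return len(classes_a & classes_b) / len(union) if union else 0.0


def pairwise_proportions(cognates):
    """Map every language pair to its shared cognate proportion."""
    return {
        (a, b): shared_cognate_proportion(cognates[a], cognates[b])
        for a, b in combinations(sorted(cognates), 2)
    }


def mean_squared_difference(real, simulated):
    """Average squared gap between real and simulated pairwise proportions."""
    return sum((real[pair] - simulated[pair]) ** 2 for pair in real) / len(real)


# Hypothetical data: language -> cognate class IDs attested in it.
REAL = {"A": {1, 2, 3}, "B": {1, 2, 4}, "C": {1, 5, 6}}
SIMULATED = {"A": {1, 2, 3}, "B": {1, 2, 3}, "C": {1, 2, 6}}

real_pairs = pairwise_proportions(REAL)
simulated_pairs = pairwise_proportions(SIMULATED)
print(mean_squared_difference(real_pairs, simulated_pairs))
```

Because only within-dataset proportions are compared, the cognate class IDs of the simulation do not need to correspond to those of the real data. Repeating the comparison for a range of scaling values and keeping the one with the smallest mismatch is one way to locate the optimum.
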
Excursus: Cíhuì
- Collection of Chinese dialect data created in the late 1950s and published in 1964 (Běijīng University 1964)
- Contains lexical data, as the short title suggests (cíhuì means “lexical inventory” or Wortschatz in German)
- Based on a questionnaire consisting of 905 concepts (daily life and basic vocabulary)
- Offers data for 18 dialect varieties, including varieties from each of the seven largest dialect groups of Chinese (Mǐn, Cantonese, Mandarin, Hakka, Wú, Xiāng, and Gàn)
- Data was prepared and digitized during List’s research project (2015-2016); partial cognate coding was extracted automatically, based on annotations in the original source

Calibration: Result
Values around 1.5 for the Years:Replacement-Steps scaling parameter look best, and even reasonable, given the ad-hoc nature of the tree.

Semantic Shift
Proposition: The model shows reasonable amounts of semantic shift.
Test: Visually compare the distribution of meaning classes per cognate class for simulated and real Cíhuì data.

Note (Semantic Shift): Simulation results are not filtered to exclude synonyms.

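The distribution compared in this test can be tabulated as below. This is a minimal sketch with invented entries; the tuple layout `(language, concept, cognate_class)` and the function name are assumptions for illustration, not the format of the Cíhuì data or of the simulation output.

```python
from collections import Counter, defaultdict

# Hypothetical cognate-coded entries: (language, concept, cognate class).
ENTRIES = [
    ("A", "tree", 1),
    ("B", "tree", 1),
    ("B", "wood", 1),
    ("C", "woods, forest", 1),
    ("A", "wood", 2),
    ("C", "wood", 2),
    ("C", "tree", 3),
]


def meanings_per_cognate_class(entries):
    """Return {number of distinct meanings: number of cognate classes}."""
    meanings = defaultdict(set)
    for _language, concept, cognate_class in entries:
        meanings[cognate_class].add(concept)
    return Counter(len(concepts) for concepts in meanings.values())


distribution = meanings_per_cognate_class(ENTRIES)
for n_meanings in sorted(distribution):
    print(n_meanings, distribution[n_meanings])
```

A cognate class that appears under more than one meaning is evidence of semantic shift, so a histogram of these counts for real and for simulated data gives the visual comparison described in the test.
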
Missing Bits
- More comparable data for better calibration
- A better model of the semantics, and calibration of that (frequent pathways, etc.)
- Support for language contact (borrowings)
- Partial cognate support/Compositionality/Derivations
  - Needs severe change in vocabulary representation, and some serious quantitative data
  - Might help with language contact
- Population level modeling, for rate variation/punctuated evolution to emerge

Conclusions
- We have a realistic model of semantic shift.
- With more dated, cross-semantic-cognate-coded trees we can calibrate it more confidently.
- We (and you, it’s Open Source!) can already use it to run and compare different tree building methods.
Code: http://github.com/Anaphory/simuling

Sources and Further Reading
[1] Quentin Atkinson et al. “From Words to Dates: Water into Wine, Mathemagic or Phylogenetic Inference?” In: Transactions of the Philological Society 103.2 (Aug. 1, 2005), pp. 193–219. issn: 1467-968X. doi: 10.1111/j.1467-968X.2005.00151.x. url: http://onlinelibrary.wiley.com/doi/10.1111/j.1467-968X.2005.00151.x/abstract (visited on 03/23/2017).
[2] François Barbançon et al. “An Experimental Study Comparing Linguistic Phylogenetic Reconstruction Methods”. In: Diachronica 30.2 (2013), pp. 143–170. doi: 10.1075/dia.30.2.01bar. url: http://www.ingentaconnect.com/content/jbp/dia/2013/00000030/00000002/art00001 (visited on 10/15/2016).
[3] Harald Hammarström et al. Glottolog. Version 2.5. Leipzig, 2015. url: http://glottolog.org.
[4] J.-M. List et al., eds. CLICS: Database of Cross-Linguistic Colexifications. Marburg: Forschungszentrum Deutscher Sprachatlas, 2014. Archived at: http://www.webcitation.org/6ccEMrZYM. url: http://clics.lingpy.org.
[5] Andrew D. M. Smith. “Models of Language Evolution and Change”. In: Wiley Interdisciplinary Reviews-Cognitive Science 5.3 (May 1, 2014). WOS:000334511800004, pp. 281–293. issn: 1939-5078. doi: 10.1002/wcs.1285. url: http://onlinelibrary.wiley.com/doi/10.1002/wcs.1285/abstract.
[6] S.A. Starostin. “Computer-Based Simulation of the Glottochronological Process”. In: [Works on Linguistics]. Moscow, 2007, pp. 854–862.
[7] Tandy Warnow et al. “A Stochastic Model of Language Evolution That Incorporates Homoplasy and Borrowing”. url: http://statistics.berkeley.edu/sites/default/files/tech-reports/673.pdf (visited on 03/22/2017).