Handling Phonological and Etymological Relations in Computer-Based and Computer-Assisted Frameworks

Slide 1

Slide 1 text

Handling Phonological and Etymological Relations in Computer-Based and Computer-Assisted Frameworks: Theoretical Aspects
Johann-Mattis List
DFG Research Fellow
Centre des Recherches Linguistiques sur l’Asie Orientale (EHESS)
Team “Adaptation, Integration, Reticulation, Evolution” (UPMC)
Paris, 2015-02-23

Slide 12

Slide 12 text

State of the Art: Examples

Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages

Ward C. Wheeler (a,*) and Peter M. Whiteley (b)
(a) Division of Invertebrate Zoology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA
(b) Division of Anthropology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA
Accepted 18 March 2014

Abstract
Language origins and diversification are vital for mapping human history. Traditionally, the reconstruction of language trees has been based on cognate forms among related languages, with ancestral protolanguages inferred by individual investigators. Disagreement among competing authorities is typically extensive, without empirical grounds for resolving alternative hypotheses. Here, we apply analytical methods derived from DNA sequence optimization algorithms to Uto-Aztecan languages, treating words as sequences of sounds. Our analysis yields novel relationships and suggests a resolution to current conflicts about the Proto-Uto-Aztecan homeland. The techniques used for Uto-Aztecan are applicable to written and unwritten languages, and should enable more empirically robust hypotheses of language relationships, language histories, and linguistic evolution.
© The Willi Hennig Society 2014.

Introduction
How languages evolve has long been a central question for the human sciences. Linguistic elements may be transmitted horizontally (“borrowing”) among neighbouring languages, but most language transmission obviously occurs via lineal descent with modification. Linguistic and biological evolution are thus analogous in important respects; constructing trees of languages “genetically” related in families is well established (e.g. Greenhill et al., 2009). Recently, phylogenetic models have enhanced both methodology and hypothesis-testing for language ancestry (e.g. Forster and Renfrew, 2006). Approaches now engage archaeology, anthropology, genetics, and computational science, as well as historical linguistics itself.

Notwithstanding advances, disputes remain vigorous in both methods and results, including for well-studied language families such as Indo-European (see, for example, Forster and Renfrew, 2006; Campbell and Poser, 2008). Often, reconstructions are untestable, hence the vigour of disputation. The approach adopted here, by contrast, involves an inspectable set of procedures applied directly to empirical linguistic data. We use analytical methods derived from DNA sequence optimization algorithms, treating words as sequences of sounds. We demonstrate this with Uto-Aztecan (UA) languages of North and Middle America.

The basic approach articulated here is to remove the inferential overburden of hypothesized “proto-forms” (discussed below), and perform analysis solely using the observed sound content of words. In this way, the sequences of sounds that constitute all human languages form the empirical basis upon which language trees are built. To accomplish this, we have adapted techniques more usually applied to the analysis of DNA and protein sequence data, but are readily applied to sound sequences as well (as with other non-molecular sequence data; Schulmeister and Wheeler, 2004; Robillard et al., 2006). In moving from proto-forms to sound sequences, a transition occurs analogous to the advances forged in organismic systematic

*Corresponding author: E-mail address: [email protected]
Cladistics (2014) 1–13, 10.1111/cla.12078

Slide 13

Slide 13 text

State of the Art: Examples

Wheeler, W. C. and Whiteley, P. M. (2014): Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics (2014) 1–13, 10.1111/cla.12078.

Data: 32 Uto-Aztecan languages; Swadesh 100 concept lists; manually extracted cognate sets; IPA encoding, 148 unique symbols
Method: simultaneous sequence optimization and phylogenetic inference (ML framework); varying scoring functions for the matching of sound symbols
Output: phylogenies, (?) transition frequencies (?)
Software: POY 5.0 (Wheeler 2013): Phylogenetic Analysis of DNA and other data using dynamic homology
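To make the notion of a "scoring function for the matching of sound symbols" concrete, here is a minimal sketch of a pairwise global alignment (Needleman-Wunsch) over two words represented as lists of sound segments. It is not the procedure implemented in POY, which optimizes sequences and trees simultaneously; the function names `simple_score` and `align`, the match/mismatch values, and the gap penalty are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

GAP = "-"  # symbol used for a gap in the aligned output


def simple_score(a: str, b: str) -> int:
    """Hypothetical scoring function: +1 for identical sounds, -1 otherwise."""
    return 1 if a == b else -1


def align(
    seq_a: Sequence[str],
    seq_b: Sequence[str],
    score: Callable[[str, str], int] = simple_score,
    gap: int = -1,
) -> Tuple[int, List[str], List[str]]:
    """Globally align two sound sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], out_a[::-1], out_b[::-1]


if __name__ == "__main__":
    total, row_a, row_b = align(list("pater"), ["f", "a", "th", "e", "r"])
    print(total)
    print(" ".join(row_a))
    print(" ".join(row_b))
```

Swapping `simple_score` for a function that rewards phonetically similar pairs would be one way to explore the "varying scoring functions" mentioned on the slide; the uniform scheme here is only a placeholder.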

Slide 15

Slide 15 text

State of the Art: Examples

Current Biology 25, 1–9, January 5, 2015. © 2015 The Authors. http://dx.doi.org/10.1016/j.cub.2014.10.064

Article: Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution

Daniel J. Hruschka (1), Simon Branford (2), Eric D. Smith (3,4), Jon Wilkins (3,5), Andrew Meade (2), Mark Pagel (2,3,*) and Tanmoy Bhattacharya (3,6,*)
(1) School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, USA
(2) School of Biological Sciences, University of Reading, Reading RG6 6BX, UK
(3) The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
(4) Krasnow Institute for Advanced Study, George Mason University, Mail Stop 2A1, 4400 University Drive, Fairfax, VA 22030, USA
(5) Ronin Institute, 127 Haddon Place, Montclair, NJ 07043, USA
(6) T-2, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Summary
Background: Concerted evolution is normally used to describe parallel changes at different sites in a genome, but it is also observed in languages where a specific phoneme changes to the same other phoneme in many words in the lexicon, a phenomenon known as regular sound change. We develop a general statistical model that can detect concerted changes in aligned sequence data and apply it to study regular sound changes in the Turkic language family.
Results: Linguistic evolution, unlike the genetic substitutional process, is dominated by events of concerted evolutionary change. Our model identified more than 70 historical events of regular sound change that occurred throughout the evolution of the Turkic language family, while simultaneously inferring a dated phylogenetic tree. Including regular sound changes yielded an approximately 4-fold improvement in the characterization of linguistic change over a simpler model of sporadic change, improved phylogenetic inference, and returned more reliable and plausible dates for events on the phylogenies. The historical timings of the concerted changes closely follow a Poisson process model, and the sound transition networks derived from our model mirror linguistic expectations.
Conclusions: We demonstrate that a model with no prior knowledge of complex concerted or regular changes can nevertheless infer the historical timings and genealogical placements of events of concerted change from the signals left in contemporary data. Our model can be applied wherever discrete elements (such as genes, words, cultural trends, technologies, or morphological traits) can change in parallel within an organism or other evolving group.

Introduction
Concerted evolutionary change is widespread in genetic systems, being implicated in the genome-wide control of repetitive elements [1–3], the evolution of gene families [2], and homogenization of Y chromosome sequences [4, 5] and as a means by which asexual organisms might escape the debilitating consequences of Muller’s ratchet [3]. It might arise from several mechanisms, including homologous recombination, that allow certain favorable elements to spread or damaging elements to be neutralized.

Linguists have long recognized concerted change that affects copies of the same sound (or phoneme) appearing in different words as a central feature of linguistic evolution [6]. A well-known example is the *p > f sound change in the Germanic languages wherein an older Indo-European p sound was replaced by an f sound, such as in *pater > father, or *pes, *pedis > foot (linguistic convention is to use the “>” symbol to indicate a transition from one sound to another, and here the * symbol denotes a reconstructed ancestral form). These multiple instances of one phoneme changing to the same other phoneme yield regular sound correspondences between pairs or groups of languages. Linguists have proposed several explanations for the regularity of changes grounded in a number of basic processes, including speech production, perception, and cognition [7–9].

Can events of concerted change be detected statistically in sequence data, and do they improve the characterization of evolution and the inference of evolutionary histories? Although previous researchers working in a linguistic setting have used the concept of regular changes to build algorithms for automatically inferring cognacy, to our knowledge the model we report here is the first probabilistic description of concerted change. This places concerted evolution in a statistical setting that allows for formal hypothesis testing about the nature and rates of concerted changes. For example, the question of how many parallel changes are required to be recognized as an instance of concerted change is naturally dealt with in our model: the statistical signature of concerted or regular change is that the multiple parallel events are more probable if treated as a single coordinated change than as a collection of independent changes (Box 1).

Usefully, the genetic and linguistic phenomena share fundamental properties relevant to their statistical characterization. Phonemes are the units of sound that make up words and distinguish one word from another, just as the four nucleotide bases (A, C, T, G) make up DNA gene sequences or the 20 amino acids make up protein sequences. The number of distinct sounds in a language varies greatly, but somewhere around 30–60 phonemes are commonly sufficient to describe the range of distinctive sounds in a language’s words [10]. Collections of words can therefore be thought of as providing phonemic “sequence information” that might be informative as to the history, rate, and patterns of concerted evolutionary change in language, and in a manner analogous to sequences of DNA.

Slide 16

Slide 16 text

State of the Art: Examples

Hruschka, D. J., Branford, S., Smith, E. D., Wilkins, J., Meade, A., Pagel, M., and Bhattacharya, T. (2015): Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution. Current Biology 25, 1–9. http://dx.doi.org/10.1016/j.cub.2014.10.064

Data: 26 Turkic languages; 225 cognate sets from etymological dictionaries (Tower of Babel); manually compiled alignment analyses; ASCII encoding, 62 unique symbols
Method: Bayesian Markov chain Monte Carlo statistical model allowing for sporadic (irregular) and concerted (regular) changes along a phylogenetic tree that produces the alignments
Output: phylogenies, change rates
Software: Bayes Phylogenies
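The quoted "statistical signature" of concerted change (parallel events being more probable as one coordinated change than as independent changes) can be illustrated with a deliberately simplified toy calculation. This is not the authors' Bayesian model: it assumes a single branch on which the same sound changes in all `n_words` words, a hypothetical per-word probability `p_change` for a sporadic change, and a hypothetical probability `q_event` for one concerted event. All names and numbers are assumptions for illustration.

```python
import math


def log_lik_independent(n_words: int, p_change: float) -> float:
    """Log-probability that the sound changes separately in each of n_words words."""
    return n_words * math.log(p_change)


def log_lik_concerted(q_event: float) -> float:
    """Log-probability of one coordinated event that changes all words at once."""
    return math.log(q_event)


def log_ratio(n_words: int, p_change: float, q_event: float) -> float:
    """Positive values favour the concerted explanation, negative the independent one."""
    return log_lik_concerted(q_event) - log_lik_independent(n_words, p_change)


def min_words_for_concerted(p_change: float, q_event: float) -> int:
    """Smallest number of parallel changes for which q_event > p_change ** n."""
    return math.floor(math.log(q_event) / math.log(p_change)) + 1


if __name__ == "__main__":
    # Toy values: a sporadic change is likelier per word (0.1) than a
    # concerted event (0.05), yet the concerted reading wins once the
    # same change is observed in several words.
    for n in (1, 2, 5):
        print(n, round(log_ratio(n, p_change=0.1, q_event=0.05), 3))
    print(min_words_for_concerted(p_change=0.1, q_event=0.05))
```

Under these toy assumptions the independent explanation shrinks geometrically with the number of affected words while the concerted one stays constant, which is one way to read the slide's question of how many parallel changes are needed before a change counts as regular.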
