Slide 1

Slide 1 text

Handling Phonological and Etymological Relations in Computer-Based and Computer-Assisted Frameworks: Theoretical Aspects
Johann-Mattis List
DFG Research Fellow
Centre des Recherches Linguistiques sur l’Asie Orientale (EHESS)
Team “Adaptation, Integration, Reticulation, Evolution” (UPMC)
Paris, 2015-02-23
1 / 30

Slide 2

Slide 2 text

LANGUAGE MODELING 2.0 2 / 30

Slide 7

Slide 7 text

LANGUAGE MODELING 2.0 State of the Art 2 / 30

Slide 8

Slide 8 text

State of the Art New Models New Models 1 0 Gain-Loss Models 3 / 30

Slide 9

Slide 9 text

State of the Art New Models New Models 1 0 p p͡f f v h ø Gain-Loss Models Multistate Models 3 / 30

Slide 10

Slide 10 text

State of the Art: New Models
Gain-Loss Models (states: 1, 0) ===> Multistate Models (states: p, p͡f, f, v, h, ø)
3 / 30
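The contrast between the two model families can be sketched as two ways of coding the same observations. This is a minimal illustration only: the sound states are the ones listed on the slide, while the language names and their reflexes are invented for the example.

```python
# Sound states shown on the slide for the multistate model.
STATES = ["p", "p͡f", "f", "v", "h", "ø"]

# Hypothetical reflexes of one sound in four invented languages (assumption).
reflexes = {"LangA": "p", "LangB": "p͡f", "LangC": "f", "LangD": "ø"}


def multistate(reflexes):
    """Code each language with the index of its sound: one character, many states."""
    return {lang: STATES.index(sound) for lang, sound in reflexes.items()}


def gain_loss(reflexes):
    """Code each language with one 1/0 (present/absent) character per sound."""
    return {
        lang: [1 if sound == state else 0 for state in STATES]
        for lang, sound in reflexes.items()
    }


print(multistate(reflexes))
print(gain_loss(reflexes)["LangC"])
```

The binary coding needs six characters to say what the multistate coding says in one, and it cannot express that a change such as p to f is a single transition rather than one loss plus one gain.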

Slide 11

Slide 11 text

State of the Art Examples Examples 2014 4 / 30

Slide 12

Slide 12 text

State of the Art: Examples
Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages
Ward C. Wheeler (a,*) and Peter M. Whiteley (b)
(a) Division of Invertebrate Zoology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA; (b) Division of Anthropology, American Museum of Natural History, Central Park West @ 79th Street, New York, NY, 10024-5192, USA
Accepted 18 March 2014
Abstract
Language origins and diversification are vital for mapping human history. Traditionally, the reconstruction of language trees has been based on cognate forms among related languages, with ancestral protolanguages inferred by individual investigators. Disagreement among competing authorities is typically extensive, without empirical grounds for resolving alternative hypotheses. Here, we apply analytical methods derived from DNA sequence optimization algorithms to Uto-Aztecan languages, treating words as sequences of sounds. Our analysis yields novel relationships and suggests a resolution to current conflicts about the Proto-Uto-Aztecan homeland. The techniques used for Uto-Aztecan are applicable to written and unwritten languages, and should enable more empirically robust hypotheses of language relationships, language histories, and linguistic evolution. © The Willi Hennig Society 2014.
Introduction
How languages evolve has long been a central question for the human sciences. Linguistic elements may be transmitted horizontally (“borrowing”) among neighbouring languages, but most language transmission obviously occurs via lineal descent with modification. Linguistic and biological evolution are thus analogous in important respects; constructing trees of languages “genetically” related in families is well established (e.g. Greenhill et al., 2009). Recently, phylogenetic models have enhanced both methodology and hypothesis-testing for language ancestry (e.g. Forster and Renfrew, 2006). Approaches now engage archaeology, anthropology, genetics, and computational science, as well as historical linguistics itself. Notwithstanding advances, disputes remain vigorous in both methods and results, including for well-studied language families such as Indo-European (see, for example, Forster and Renfrew, 2006; Campbell and Poser, 2008). Often, reconstructions are untestable, hence the vigour of disputation.
The approach adopted here, by contrast, involves an inspectable set of procedures applied directly to empirical linguistic data. We use analytical methods derived from DNA sequence optimization algorithms, treating words as sequences of sounds. We demonstrate this with Uto-Aztecan (UA) languages of North and Middle America. The basic approach articulated here is to remove the inferential overburden of hypothesized “proto-forms” (discussed below), and perform analysis solely using the observed sound content of words. In this way, the sequences of sounds that constitute all human languages form the empirical basis upon which language trees are built. To accomplish this, we have adapted techniques more usually applied to the analysis of DNA and protein sequence data, but are readily applied to sound sequences as well (as with other non-molecular sequence data; Schulmeister and Wheeler, 2004; Robillard et al., 2006). In moving from proto-forms to sound sequences, a transition occurs analogous to the advances forged in organismic systematic
*Corresponding author.
Cladistics (2014) 1–13, 10.1111/cla.12078, © The Willi Hennig Society 2014
4 / 30

Slide 13

Slide 13 text

State of the Art: Examples
Wheeler and Whiteley (2014), “Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages”, Cladistics (2014) 1–13, 10.1111/cla.12078
Data: 32 Uto-Aztecan languages; Swadesh 100-concept lists; manually extracted cognate sets; IPA encoding, 148 unique symbols
Method: simultaneous sequence optimization and phylogenetic inference (ML framework); varying scoring functions for the matching of sound symbols
Output: phylogenies, (?) transition frequencies (?)
Software: POY 5.0 (Wheeler 2013): Phylogenetic Analysis of DNA and other data using dynamic homology
4 / 30
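To make "treating words as sequences of sounds" and "scoring functions for the matching of sound symbols" concrete, here is a minimal sketch of pairwise global alignment (Needleman-Wunsch) over sound segments. It is not POY's algorithm, which optimizes sequences and trees simultaneously; the scoring function, the gap penalty and the two example words are assumptions chosen for illustration.

```python
def simple_score(a, b):
    """Assumed toy scoring function: +1 for identical sounds, -1 otherwise."""
    return 1 if a == b else -1


def align(seq_a, seq_b, score=simple_score, gap=-1):
    """Globally align two lists of sound segments; return both rows and the score."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], dp[n][m]


aligned_a, aligned_b, total = align(list("herz"), list("heart"))
print(" ".join(aligned_a))
print(" ".join(aligned_b))
print(total)
```

Swapping `simple_score` for a function that rewards phonetically similar sounds is the kind of variation the "varying scoring functions" bullet refers to.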

Slide 14

Slide 14 text

State of the Art Examples Examples 2015 5 / 30

Slide 15

Slide 15 text

State of the Art: Examples
2015
Current Biology 25, 1–9, January 5, 2015 © 2015 The Authors http://dx.doi.org/10.1016/j.cub.2014.10.064
Article
Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution
Daniel J. Hruschka,1 Simon Branford,2 Eric D. Smith,3,4 Jon Wilkins,3,5 Andrew Meade,2 Mark Pagel,2,3,* and Tanmoy Bhattacharya3,6,*
1 School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, USA
2 School of Biological Sciences, University of Reading, Reading RG6 6BX, UK
3 The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
4 Krasnow Institute for Advanced Study, George Mason University, Mail Stop 2A1, 4400 University Drive, Fairfax, VA 22030, USA
5 Ronin Institute, 127 Haddon Place, Montclair, NJ 07043, USA
6 T-2, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
Summary
Background: Concerted evolution is normally used to describe parallel changes at different sites in a genome, but it is also observed in languages where a specific phoneme changes to the same other phoneme in many words in the lexicon, a phenomenon known as regular sound change. We develop a general statistical model that can detect concerted changes in aligned sequence data and apply it to study regular sound changes in the Turkic language family.
Results: Linguistic evolution, unlike the genetic substitutional process, is dominated by events of concerted evolutionary change. Our model identified more than 70 historical events of regular sound change that occurred throughout the evolution of the Turkic language family, while simultaneously inferring a dated phylogenetic tree. Including regular sound changes yielded an approximately 4-fold improvement in the characterization of linguistic change over a simpler model of sporadic change, improved phylogenetic inference, and returned more reliable and plausible dates for events on the phylogenies. The historical timings of the concerted changes closely follow a Poisson process model, and the sound transition networks derived from our model mirror linguistic expectations.
Conclusions: We demonstrate that a model with no prior knowledge of complex concerted or regular changes can nevertheless infer the historical timings and genealogical placements of events of concerted change from the signals left in contemporary data. Our model can be applied wherever discrete elements (such as genes, words, cultural trends, technologies, or morphological traits) can change in parallel within an organism or other evolving group.
Introduction
Concerted evolutionary change is widespread in genetic systems, being implicated in the genome-wide control of repetitive elements [1–3], the evolution of gene families [2], and homogenization of Y chromosome sequences [4, 5] and as a means by which asexual organisms might escape the debilitating consequences of Muller’s ratchet [3]. It might arise from several mechanisms, including homologous recombination, that allow certain favorable elements to spread or damaging elements to be neutralized.
Linguists have long recognized concerted change that affects copies of the same sound (or phoneme) appearing in different words as a central feature of linguistic evolution [6]. A well-known example is the *p > f sound change in the Germanic languages wherein an older Indo-European p sound was replaced by an f sound, such as in *pater > father, or *pes, *pedis > foot (linguistic convention is to use the “>” symbol to indicate a transition from one sound to another, and here the * symbol denotes a reconstructed ancestral form). These multiple instances of one phoneme changing to the same other phoneme yield regular sound correspondences between pairs or groups of languages. Linguists have proposed several explanations for the regularity of changes grounded in a number of basic processes, including speech production, perception, and cognition [7–9].
Can events of concerted change be detected statistically in sequence data, and do they improve the characterization of evolution and the inference of evolutionary histories? Although previous researchers working in a linguistic setting have used the concept of regular changes to build algorithms for automatically inferring cognacy, to our knowledge the model we report here is the first probabilistic description of concerted change. This places concerted evolution in a statistical setting that allows for formal hypothesis testing about the nature and rates of concerted changes. For example, the question of how many parallel changes are required to be recognized as an instance of concerted change is naturally dealt with in our model: the statistical signature of concerted or regular change is that the multiple parallel events are more probable if treated as a single coordinated change than as a collection of independent changes (Box 1).
Usefully, the genetic and linguistic phenomena share fundamental properties relevant to their statistical characterization. Phonemes are the units of sound that make up words and distinguish one word from another, just as the four nucleotide bases (A, C, T, G) make up DNA gene sequences or the 20 amino acids make up protein sequences. The number of distinct sounds in a language varies greatly, but somewhere around 30–60 phonemes are commonly sufficient to describe the range of distinctive sounds in a language’s words [10]. Collections of words can therefore be thought of as providing phonemic “sequence information” that might be informative as to the history, rate, and patterns of concerted evolutionary change in language, and in a manner analogous to sequences of DNA.
5 / 30

Slide 16

Slide 16 text

State of the Art: Examples
2015
Hruschka et al. (2015), “Detecting Regular Sound Changes in Linguistics as Events of Concerted Evolution”, Current Biology 25, 1–9, http://dx.doi.org/10.1016/j.cub.2014.10.064
Data: 26 Turkic languages; 225 cognate sets from etymological dictionaries (Tower of Babel); manually compiled alignment analyses; ASCII encoding, 62 unique symbols
Method: Bayesian Markov chain Monte Carlo statistical model allowing for sporadic (irregular) and concerted (regular) changes along a phylogenetic tree that produces the alignments
Output: phylogenies, change rates
Software: Bayes Phylogenies
5 / 30
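The distinction between sporadic and concerted change can be illustrated with a deliberately crude comparison of log-probabilities. This is not the model of Hruschka et al.; it only sketches the intuition that many parallel changes become more probable as one coordinated event than as independent events. The probabilities `q` and `r` are arbitrary assumed values.

```python
import math


def log_prob_independent(k, q):
    """k parallel changes, each treated as an independent event of probability q."""
    return k * math.log(q)


def log_prob_concerted(r):
    """A single coordinated event of probability r that changes all k words at once."""
    return math.log(r)


def favours_concerted(k, q, r):
    """True if the single-event explanation is more probable than k separate ones."""
    return log_prob_concerted(r) > log_prob_independent(k, q)


# Assumed toy values: sporadic change q = 0.1 per word, concerted event r = 0.01.
for k in (1, 2, 3, 5):
    print(k, favours_concerted(k, q=0.1, r=0.01))
```

With these values a single changed word is better explained as sporadic, while three or more parallel changes favour the concerted explanation. The threshold depends entirely on the assumed `q` and `r`.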

Slide 17

Slide 17 text

State of the Art Problems Problems ! 6 / 30

Slide 18

Slide 18 text

State of the Art Problems Data Representation File ”UT-root.fas”, lines 1-20 (Wheeler et al. 2014) 7 / 30

Slide 19

Slide 19 text

State of the Art Problems Data Representation File ”TurkicFull.txt”, lines 1-20 (Hruschka et al. 2015) 7 / 30

Slide 20

Slide 20 text

State of the Art Problems Model Restrictions 8 / 30

Slide 21

Slide 21 text

State of the Art: Problems: Model Restrictions
approach                 Wheeler et al. 2014    Hruschka et al. 2015
data size
  taxa                   32                     26
  cognate sets           100                    225
data structure
  phonetic strings       ✓                      ✓
  segmentation           ✓                      ✓
  cognate sets           ✓                      ✓
  alignments             inferred               prescribed
models
  sound change           transition graph       transition graph
  transition rates       prescribed             inferred
  unobserved states      ✗                      ✗
  context                ✗                      ✗
8 / 30

Slide 22

Slide 22 text

State of the Art: Problems: Model Restrictions
[Figure: descent of words for "sun". Indo-European "sun": ˈs o h₂ w l̩ / s h₂ u ˈe n -. Latin: s oː l - "sun", s oː l i k u l u s "small sun". French: s oː l ɛ j "sun". Germanic: s u nː ɔ̃ː "sun". German: z ɔ n ə "sun".]
9 / 30

Slide 23

Slide 23 text

Modeling Phonological Relations
hjärta    h  j  -  ä  r  t  a  -
herz      h  -  e  -  r  z  -  -
heart     h  -  e  a  r  t  -  -
cordis    c  -  -  o  r  d  i  s
10 / 30
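The alignment on this slide can be held in a small data structure in which every row has the same length, so that each column lines up the sounds treated as corresponding. The sketch below uses the four rows from the slide; the helper names are assumptions made for the example, not LingPy functions.

```python
# Rows copied from the slide; "-" marks a gap.
alignment = {
    "hjärta": ["h", "j", "-", "ä", "r", "t", "a", "-"],
    "herz":   ["h", "-", "e", "-", "r", "z", "-", "-"],
    "heart":  ["h", "-", "e", "a", "r", "t", "-", "-"],
    "cordis": ["c", "-", "-", "o", "r", "d", "i", "s"],
}


def columns(alignment):
    """Return the alignment column by column (one tuple of sounds per site)."""
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return list(zip(*alignment.values()))


def strip_gaps(row):
    """Recover the unaligned word from an alignment row."""
    return "".join(sound for sound in row if sound != "-")


for site in columns(alignment):
    print(" ".join(site))
```

Reading the output column by column gives the sound correspondences directly, for example h : h : h : c in the first column and r : r : r : r in the fifth.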

Slide 24

Slide 24 text

Modeling Phonological Relations Alignments Alignments 11 / 30

Slide 29

Slide 29 text

Modeling Phonological Relations Problems Problems ! 12 / 30

Slide 30

Slide 30 text

Modeling Phonological Relations Problems Problems unalignable cognates 13 / 30

Slide 31

Slide 31 text

Modeling Phonological Relations Problems Problems unalignable cognates t u ŋ a b iː t ə - context 13 / 30

Slide 32

Slide 32 text

Modeling Phonological Relations Problems Problems unalignable cognates t u ŋ a b iː t ə - context site dependencies 13 / 30

Slide 33

Slide 33 text

Modeling Phonological Relations Solutions Solutions ? 14 / 30

Slide 34

Slide 34 text

Modeling Phonological Relations: Solutions: LingPy and EDICTOR
LingPy (List et al. 2014, http://lingpy.org)
EDICTOR (List in prep., http://tsv.lingpy.org)
15 / 30

Slide 35

Slide 35 text

Modeling Phonological Relations Solutions Unalignable Parts 16 / 30

Slide 38

Slide 38 text

Modeling Phonological Relations Solutions Unalignable Parts implemented (LingPy and EDICTOR) 16 / 30

Slide 39

Slide 39 text

Modeling Phonological Relations Solutions Phonetic Context 17 / 30

Slide 41

Slide 41 text

Modeling Phonological Relations Solutions Phonetic Context implemented (LingPy and EDICTOR) 17 / 30

Slide 42

Slide 42 text

Modeling Phonological Relations Solutions Site Dependence 18 / 30

Slide 44

Slide 44 text

Modeling Phonological Relations Solutions Site Dependence implementation pending 18 / 30

Slide 45

Slide 45 text

Modeling Phonological Relations: Solutions: Summary
Problem                 Subproblem                 LingPy   EDICTOR
unalignable cognates    linear relations           ✓        ✓
                        non-linear relations       ✗        ✗
context                 prosodic context           ✓        ✓
                        user-defined context       (✓)      ✓
site dependencies       neighboring columns        ✗        (✓)
                        non-neighboring columns    ✗        ✗
19 / 30

Slide 46

Slide 46 text

Modeling Phonological Relations: Solutions: Summary
PhonoBank? Yes! If we thoroughly collect aligned cognate sets, along with proto-forms and marked context (datasets of Paul and Thiago), we will be able to automatically extract the sound changes and create our PhonoBank!
19 / 30
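Extracting sound changes from aligned proto-forms and reflexes can be sketched as follows: once a proto-form and a reflex are aligned, every column in which the two differ is a candidate change, and counting these over many cognate sets shows which changes recur. The word pairs below are simplified, hypothetical strings invented for the example, and context is ignored.

```python
from collections import Counter


def sound_changes(proto, reflex):
    """List (source, target) pairs for every aligned site where the sounds differ."""
    if len(proto) != len(reflex):
        raise ValueError("proto-form and reflex must be aligned to equal length")
    return [(p, r) for p, r in zip(proto, reflex) if p != r]


# Hypothetical aligned (proto-form, reflex) pairs; not taken from any dataset.
aligned_pairs = [
    (["p", "a", "t", "e", "r"], ["f", "a", "d", "e", "r"]),
    (["p", "e", "d"], ["f", "o", "t"]),
]

counts = Counter(
    change
    for proto, reflex in aligned_pairs
    for change in sound_changes(proto, reflex)
)

for (source, target), n in counts.most_common():
    print(f"{source} > {target}: {n}")
```

A real PhonoBank entry would also record the marked context for each change; that could be added by extending each pair to a triple once a context annotation is available.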

Slide 47

Slide 47 text

Modeling Etymological Relations
PIE *bhreu̯Hg̑- “to use”
PIE *bhruHg̑-ié- “to use” (present tense)
PGM *ƀrūkan- “to use”
OHG brūhhan “to use”
G brauchen “to use”
G Brauch “custom”
OHG fruht “profit, fruit”
G frugal “nourishing”
Fr fruit “profit, fruit”
Fr frugal “modest (food)”
Lt fruor, fruī “I enjoy”
Lt frūctus “profit”
Lt frux “fruit, grain”
Lt frugalis “bring profit”
Relation types: inherited from, borrowed from, derived from
20 / 30
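The three relation types on this slide suggest a graph with typed edges. The sketch below stores a few such edges and walks them backwards to recover a word's history. The words are taken from the slide, but the extracted text does not say which words the arrows connect, so every edge here is an assumed reading that should be checked against the original figure.

```python
# (source, target, relation): target is inherited / borrowed / derived from source.
# All edges are assumptions made for illustration.
EDGES = [
    ("PGM *ƀrūkan-", "OHG brūhhan", "inherited"),
    ("OHG brūhhan", "G brauchen", "inherited"),
    ("G brauchen", "G Brauch", "derived"),
    ("Lt frūctus", "Fr fruit", "inherited"),
    ("Lt frūctus", "OHG fruht", "borrowed"),
    ("Lt frugalis", "Fr frugal", "borrowed"),
    ("Fr frugal", "G frugal", "borrowed"),
]

# Each word has at most one direct source in this toy graph.
PARENT = {target: (source, relation) for source, target, relation in EDGES}


def history(word):
    """Follow the edges back from a word; return (relation, source) steps in order."""
    steps = []
    while word in PARENT:
        source, relation = PARENT[word]
        steps.append((relation, source))
        word = source
    return steps


for relation, source in history("G frugal"):
    print(f"{relation} from {source}")
```

Keeping the relation type on the edge rather than on the word is what allows inheritance, borrowing and derivation to be mixed within a single word history.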

Slide 48

Slide 48 text

Modeling Etymological Relations Dimensions of Lexical Change Dimensions of Lexical Change 21 / 30

Slide 49

Slide 49 text

Modeling Etymological Relations: Dimensions of Lexical Change
SEMANTIC CHANGE, MORPHOLOGICAL CHANGE, STRATIC CHANGE
Dimensions of Lexical Change (Gévaudan 2007)
21 / 30

Slide 50

Slide 50 text

Modeling Etymological Relations: Dimensions of Lexical Change
*kuppa- > Kopf: semantic change
Kopf > köpfen: morphological change
world cup > Weltcup: stratic change
22 / 30

Slide 51

Slide 51 text

Modeling Etymological Relations: Inference
[Figure: words for "word" in different languages: word, Wort, слово, cuvînt, palabra, mot, adottszó, slovo, verbum, focal, 词, parola, λόγος, शब्द, ord]
23 / 30

Slide 52

Slide 52 text

Modeling Etymological Relations: Inference: Semantic Shift and Colexification Networks
Key      Concept         Russian      German         ...
1.1      world           mir, svet    Welt           ...
1.21     earth, land     zemlja       Erde, Land     ...
1.212    ground, soil    počva        Erde, Boden    ...
1.420    tree            derevo       Baum           ...
1.430    wood            derevo       Wald           ...
...      ...             ...          ...            ...
24 / 30
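A colexification network can be derived from a table like this one by linking two concepts whenever a language uses the same form for both. The sketch below uses only the Russian and German entries shown on the slide; the function name and data layout are assumptions, not the CLICS implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Entries copied from the slide: language -> concept -> list of forms.
wordlist = {
    "Russian": {
        "world": ["mir", "svet"],
        "earth, land": ["zemlja"],
        "ground, soil": ["počva"],
        "tree": ["derevo"],
        "wood": ["derevo"],
    },
    "German": {
        "world": ["Welt"],
        "earth, land": ["Erde", "Land"],
        "ground, soil": ["Erde", "Boden"],
        "tree": ["Baum"],
        "wood": ["Wald"],
    },
}


def colexifications(wordlist):
    """Count, for each pair of concepts, how many languages express both with one form."""
    edges = Counter()
    for entries in wordlist.values():
        concepts_by_form = defaultdict(set)
        for concept, forms in entries.items():
            for form in forms:
                concepts_by_form[form].add(concept)
        for concepts in concepts_by_form.values():
            for a, b in combinations(sorted(concepts), 2):
                edges[frozenset((a, b))] += 1
    return edges


for pair, weight in colexifications(wordlist).items():
    print(sorted(pair), weight)
```

Here Russian derevo links "tree" and "wood", and German Erde links "earth, land" and "ground, soil". With many languages, the edge weights show which concept pairs are colexified repeatedly.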

Slide 53

Slide 53 text

Modeling Etymological Relations: Inference: Semantic Shift and Colexification Networks
[Figure: colexification network with the nodes: post, pole; staff, walking stick; doorpost, jamb; tree stump; mast; club; firewood; root; tree trunk; woods, forest; banana tree; tree; wood]
CLICS: Database of Cross-Linguistic Colexifications (List et al. 2014, http://clics.lingpy.org)
24 / 30

Slide 54

Slide 54 text

Modeling Etymological Relations: Inference: Semantic Shift and Concept Comparison
ID      GLOSS     OCCS    Blust-2008-210    Dolgopolsky-1964-15     Dunn-2012-207
4118    WATER     3       water             water                   water
5440    TONGUE    3       tongue            tongue                  tongue
5450    I         3       I                 first person marker     I
5456    YOU       3       you               second person marker    you
5490    WHO       3       who?              who/what                who
CONCEPTICON: A resource for the linking of concept lists (List and Cysouw, forthcoming, http://concepticon.github.io)
Annotations on the slide: Concept ID, arbitrarité
25 / 30
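The linking idea behind this table can be sketched as a lookup from a list-specific label to a shared concept ID. The rows are the five shown on the slide; the dictionary layout and function name are assumptions for illustration and do not reflect how Concepticon stores its data.

```python
# Rows copied from the slide, keyed by concept ID.
CONCEPTS = {
    4118: {"gloss": "WATER", "Blust-2008-210": "water",
           "Dolgopolsky-1964-15": "water", "Dunn-2012-207": "water"},
    5440: {"gloss": "TONGUE", "Blust-2008-210": "tongue",
           "Dolgopolsky-1964-15": "tongue", "Dunn-2012-207": "tongue"},
    5450: {"gloss": "I", "Blust-2008-210": "I",
           "Dolgopolsky-1964-15": "first person marker", "Dunn-2012-207": "I"},
    5456: {"gloss": "YOU", "Blust-2008-210": "you",
           "Dolgopolsky-1964-15": "second person marker", "Dunn-2012-207": "you"},
    5490: {"gloss": "WHO", "Blust-2008-210": "who?",
           "Dolgopolsky-1964-15": "who/what", "Dunn-2012-207": "who"},
}


def concept_id(list_name, label):
    """Return the concept ID that a label from a given concept list is linked to."""
    for cid, row in CONCEPTS.items():
        if row.get(list_name) == label:
            return cid
    return None


print(concept_id("Dolgopolsky-1964-15", "first person marker"))
print(concept_id("Blust-2008-210", "who?"))
```

Because differently worded labels such as "I" and "first person marker" resolve to the same ID, data collected with different concept lists become comparable.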

Slide 55

Slide 55 text

Modeling Etymological Relations: Inference: Language Contact and Minimal Lateral Networks
[Figure: reference tree with 40 Chinese dialect varieties as leaves; legend "Inferred Links": 0, 0, 0]
Minimal Lateral Networks of Chinese dialects (List et al. 2014)
26 / 30

Slide 56

Slide 56 text

Modeling Etymological Relations: Inference: Language Contact and Minimal Lateral Networks
[Figure: reference tree with 40 Chinese dialect varieties as leaves; legend "Inferred Links": 1, 4, 8]
Minimal Lateral Networks of Chinese dialects (List et al. 2014)
26 / 30

Slide 57

Slide 57 text

Modeling Etymological Relations: Inference: Language Contact and Minimal Lateral Networks
[Figure: network of 40 numbered Chinese dialect varieties; legend "Inferred Links": 1, 7, 15]
Dialect groups: Guānhuà, Xiàng, Mǐn, Yuè, Wú, Jìn, Kèjiā, Gàn, Huī
1 Běijīng 北京; 2 Chángshā 长沙; 3 Chéngdū 成都; 4 Fùzhōu 福州; 5 Guǎngzhōu 广州; 6 Guìyáng 贵阳; 7 Harbin 哈尔滨; 8 Hǎikǒu 海口; 9 Hángzhōu 杭州; 10 Héfèi 合肥
11 Hohhot 呼和浩特; 12 Jiàn'ōu 建瓯; 13 Jìnán 济南; 14 Kùnmíng 昆明; 15 Lánzhōu 兰州; 16 Měixiàn 梅县; 17 Nánchàng 南昌; 18 Nánjīng 南京; 19 Nánníng 南宁; 20 Píngyáo 平遥
21 Qīngdǎo 青岛; 22 Shànghǎi 上海; 23 Shāntóu 汕头; 24 Shèxiàn 歙县; 25 Sùzhōu 苏州; 26 Táiběi 台北; 27 Tàiyuán 太原; 28 Táoyuán 桃园; 29 Tiānjìn 天津; 30 Tūnxī 屯溪
31 Wénzhōu 温州; 32 Wǔhàn 武汉; 33 Ürümqi 乌鲁木齐; 34 Xiàmén 厦门; 35 Hongkong 香港; 36 Xiāngtàn 湘潭; 37 Xīníng 西宁; 38 Xī'ān 西安; 39 Yīnchuàn 银川; 40 Zhèngzhōu 郑州
Minimal Lateral Networks of Chinese dialects (List et al. 2014)
26 / 30

Slide 58

Slide 58 text

Modeling Etymological Relations: Inference: Language Contact and Minimal Lateral Networks
[Figure: reference tree with 40 Chinese dialect varieties as leaves; legend: 太阳, 日头, 热头, 阳婆, 日, Loss Event, Gain Event]
Inferred evolution of "sun" (List et al. submitted)
26 / 30

Slide 59

Slide 59 text

Modeling Etymological Relations: Inference: Language Contact and Minimal Lateral Networks
[Figure: reference tree with ten Chinese dialect varieties as leaves (Shànghǎi, Hongkong, Táiběi, Nánjīng, Táoyuán, Běijīng, Měixiàn, Xiàmén, Fùzhōu, Guǎngzhōu); legend: 太阳, 日头, Loss Event, Gain Event]
Inferred evolution of "sun" (List et al. submitted)
26 / 30

Slide 62

Slide 62 text

Modeling Etymological Relations: Models
[Figure labels: LOSS, INNOVATION, INNOVATION, BORROWING]
27 / 30

Slide 63

Slide 63 text

Modeling Etymological Relations: Models: New Models for Lexical Change
German     m  oː  n  t  -
English    m  uː  n  -  -
Danish     m  ɔː  n  -  ə
Swedish    m  oː  n  -  e
28 / 30

Slide 64

Slide 64 text

Modeling Etymological Relations: Models: New Models for Lexical Change
German     m  oː  n  t  -
English    m  uː  n  -  -
Danish     m  ɔː  n  -  ə
Swedish    m  oː  n  -  e

Fúzhōu       ŋ  u  o  ʔ  ⁵     -  -  -  -  -     -  -  -  -  -
Měixiàn      ŋ  i  a  t  ⁵     -  -  -  -  -     k  u  o  ŋ  ⁴⁴
Guǎngzhōu    j  -  y  t  ²     l  -  œ  ŋ  ²²    -  -  -  -  -
Běijīng      -  y  ɛ  -  ⁵¹    l  i  ɑ  ŋ  -     -  -  -  -  -
28 / 30

Slide 65

Slide 65 text

Modeling Etymological Relations: Models: New Models for Lexical Change
           "MOON"
German     m  oː  n  t  -
English    m  uː  n  -  -
Danish     m  ɔː  n  -  ə
Swedish    m  oː  n  -  e

             "MOON"            "SHINE"           "LIGHT"
Fúzhōu       ŋ  u  o  ʔ  ⁵     -  -  -  -  -     -  -  -  -  -
Měixiàn      ŋ  i  a  t  ⁵     -  -  -  -  -     k  u  o  ŋ  ⁴⁴
Guǎngzhōu    j  -  y  t  ²     l  -  œ  ŋ  ²²    -  -  -  -  -
Běijīng      -  y  ɛ  -  ⁵¹    l  i  ɑ  ŋ  -     -  -  -  -  -
28 / 30

Slide 66

Slide 66 text

Modeling Etymological Relations: Models: New Models for Lexical Change
"MOON"            "SHINE"           "LIGHT"
ŋ  u  o  ʔ  ⁵     -  -  -  -  -     -  -  -  -  -
ŋ  i  a  t  ⁵     -  -  -  -  -     k  u  o  ŋ  ⁴⁴
j  -  y  t  ²     l  -  œ  ŋ  ²²    -  -  -  -  -
-  y  ɛ  -  ⁵¹    l  i  ɑ  ŋ  -     -  -  -  -  -
28 / 30

Slide 67

Slide 67 text

Modeling Etymological Relations Models New Models for Lexical Change 28 / 30

Slide 70

Slide 70 text

Modeling Etymological Relations: Models: New Models for Lexical Change
- transition priors for etymologically related words can be provided manually, automatically, or semi-automatically
- transition priors can be simultaneously defined for semantic change and morphological change (including complex paradigms)
- all we need in order to use the rich information inherent in complex etymological relations are software “beasts” that provide multistate models accepting n states along with user-specified individual transition priors for each set of etymologically related words
28 / 30
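What "user-specified individual transition priors" for one set of etymologically related words might look like can be sketched with a small weight table that is normalised into transition probabilities. The state names, the weights and the function are hypothetical; no existing phylogenetic software is assumed to accept this format.

```python
def transition_matrix(states, priors):
    """Normalise raw prior weights into per-state transition probabilities.

    `priors` maps (source, target) pairs to non-negative weights; missing pairs
    count as 0. States without outgoing weight keep all-zero rows (no exit).
    """
    matrix = {}
    for source in states:
        row = {t: float(priors.get((source, t), 0.0)) for t in states if t != source}
        total = sum(row.values())
        matrix[source] = {t: (w / total if total else 0.0) for t, w in row.items()}
    return matrix


# Hypothetical word forms A, B, C of one etymologically related set, with
# assumed weights: A most likely develops into B, and B can only turn into C.
states = ["A", "B", "C"]
priors = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 1}

for source, row in transition_matrix(states, priors).items():
    print(source, row)
```

Each set of related words gets its own `states` and `priors`, which is the point of the last bullet: the model has to accept an arbitrary number of states and a separate prior table per set.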

Slide 71

Slide 71 text

Modeling Etymological Relations: Models: New Data for Lexical Change
Database of Lexical Change Patterns in Sino-Tibetan Languages
- project of CRLAO (Paris, L. Sagart and G. Jacques) in collaboration with SOAS (London, N. Hill)
- 50+ doculects
- 250+ concepts
- alignment analyses for alignable cognates
- detailed annotation of morphological relations (“linguistic priors”)
- cross-semantic search for cognate words
- to be launched before autumn 2015
29 / 30

Slide 72

Slide 72 text

Modeling Etymological Relations: Models: New Data for Lexical Change
Database of Lexical Change Patterns in Sino-Tibetan Languages
- project of CRLAO (Paris, L. Sagart and G. Jacques) in collaboration with SOAS (London, N. Hill)
- 50+ doculects
- 250+ concepts
- alignment analyses for alignable cognates
- detailed annotation of morphological relations (“linguistic priors”)
- cross-semantic search for cognate words
- to be launched before autumn 2015
A Comparative Database of Tukanoan and Northwestern South American Languages
- Thiago will soon present you with the details!
29 / 30

Slide 73

Slide 73 text

THANK YOU FOR LISTENING! 30 / 30