Computer-Assisted Language Comparison. Ideas, Tools, Applications

Computer-Assisted Language Comparison: Ideas, Tools, Applications
Johann-Mattis List, DFG research fellow
Centre des recherches linguistiques sur l'Asie Orientale
Team Adaptation, Integration, Reticulation, Evolution
EHESS and UPMC, Paris
2016-01-20

Background

The Comparative Method

In linguistics, the comparative method is a technique for studying the development of languages by performing a feature-by-feature comparison of two or more languages with common descent from a shared ancestor, as opposed to the method of internal reconstruction, which analyses the internal development of a single language over time.
(Wikipedia s.v. "Comparative Method")

The method of comparing languages to determine whether and how they have developed from a common ancestor. The items compared are lexical and grammatical units, and the aim is to discover correspondences relating sounds in two or more different languages, which are so numerous and so regular, across sets of units with similar meanings, that no other explanation is reasonable.
(Oxford Dictionary of Linguistics, Matthews 1997)

The comparative method is both the earliest and the most important of the methods of reconstruction. Most of the major insights into the prehistory of languages have been gained by the applications of this method, and most reconstructions have been based on it.
(Fox 1995)

The Comparative Method is the central tool in historical linguistics for historical reconstruction and also classifying languages. A classification done with the Comparative Method is called a genetic classification. The result is that languages are arranged in language family trees. This means that languages are classified according to their genealogical relationships and are interpreted as being in relation of child- or sisterhood to other languages. Such a way of classifying entities is called phylogenetic classification in biology; a classification by genealogical relationships.
(Fleischhauer 2009)

The method of comparatistics today is generally known under the not very well-chosen term "comparative-historical method". It constitutes a huge complex of abstract and concrete procedures for the investigation of the history of related languages which genetically go back to some uniform tradition of the past.
(Klimov 1990, my translation)

→ comparative linguistics, reconstruction
(Routledge Dictionary of Language and Linguistics, Bussmann 1996)

Scholar | Proof of Relationship | Study of Language History | External Reconstruction | Linguistic Reconstruction | Language Classification
Anttila (1972) ✓ ✓
Bußmann (2002) ✓
Fleischhauer (2009) ✓
Fox (1995) ✓
Glück (2000) ✓
Harrison (2003) ✓
Hoenigswald (1960) ✓
Jarceva (1990) ✓
Klimov (1990) ✓ ✓
Lehmann (1969) ✓
Makaev (1977) ✓
Matthews (1997) ✓
Rankin (2003) ✓

Working Definition for the Comparative Method

The comparative method is a set of techniques that are commonly used by historical linguists in order to reconstruct the history of languages and language families.

Workflows

Workflow by Ross and Durie (1996)

1. Determine on the strength of diagnostic evidence that a set of languages are genetically related, that is, that they constitute a ‘family’.
2. Collect putative cognate sets for the family (both morphological paradigms and lexical items).
3. Work out the sound correspondences from the cognate sets, putting ‘irregular’ cognate sets on one side.
4. Reconstruct the protolanguage of the family as follows:
   a. Reconstruct the protophonology from the sound correspondences worked out in (3), using conventional wisdom regarding the directions of sound changes.
   b. Reconstruct protomorphemes (both morphological paradigms and lexical items) from the cognate sets collected in (2), using the protophonology reconstructed in (4a).
5. Establish innovations (phonological, lexical, semantic, morphological, morphosyntactic) shared by groups of languages within the family relative to the reconstructed protolanguage.
6. Tabulate the innovations established in (5) to arrive at an internal classification of the family, a ‘family tree’.
7. Construct an etymological dictionary, tracing borrowings, semantic change, and so forth, for the lexicon of the family (or of one language of the family).

Figure: Tentative visualization of the workflow by Ross and Durie (1996: 6f). Stages shown: proof of language relationship, cognate set identification, sound correspondence identification, phonological and morphological reconstruction, identification of innovations, reconstruction of phylogenies, publication of an etymological dictionary.

Figure: Simplified version of Ross and Durie's workflow (List 2014: 58). Stages shown: proof of relationship, identification of cognates, identification of sound correspondences, reconstruction of proto-forms, internal classification. Four arrows labeled "revise" lead back to earlier stages.

Problems

Application

Figure: the workflow by Ross and Durie (proof of language relationship, cognate set identification, sound correspondence identification, phonological and morphological reconstruction, identification of innovations, reconstruction of phylogenies, publication of an etymological dictionary), annotated with "TIME CONSUMING..." and "TEDIOUS...".

Representation

Frucht, ferner fruchten, befruchten, Befruchtung, fruchtbar, fruchtig
Frucht f. ‘der Fortpflanzung der eigenen Art dienendes Produkt einer Pflanze’, auch ‘ungeborenes Lebewesen’, übertragen ‘Ertrag’, ahd. fruht (9. Jh.), mhd. vruht, asächs. fruht, mnd. mnl. nl. vrucht beruhen auf einer frühen Entlehnung von gleichbed. lat. frūctus, abgeleitet vom Verb lat. fruī (frūctus sum) ‘genießen, Nutzen ziehen’ (verwandt mit brauchen, s. d.). Das Deminutiv Früchtchen hat die spezielle Bedeutung [...]
German "Frucht" in Pfeifer (1993, also at http://dwds.de)

English translation: Frucht, further fruchten, befruchten, Befruchtung, fruchtbar, fruchtig. Frucht (feminine) ‘product of a plant that serves the reproduction of its own kind’, also ‘unborn living being’, figuratively ‘yield’. Old High German fruht (9th century), Middle High German vruht, Old Saxon fruht, Middle Low German, Middle Dutch, Dutch vrucht go back to an early borrowing of Latin frūctus (same meaning), derived from the Latin verb fruī (frūctus sum) ‘to enjoy, to draw benefit from’ (related to brauchen, see there). The diminutive Früchtchen has the special meaning [...]

Figure (adapted from an illustration by Hans Geisler, University Düsseldorf): the same etymology drawn as a graph with the edge types "inherited from", "borrowed from", and "derived from". Nodes: PIE *bhreu̯Hg̑- "to use"; PIE *bhruHg̑-ié- "to use" (present tense); PGM *ƀrūkan- "to use"; OHG brūhhan "to use"; G brauchen "to use"; G Brauch "custom"; OHG fruht "profit, fruit"; G frugal "modest (food)"; Fr fruit "profit, fruit"; Fr frugal "modest (food)"; Lt fruor, fruī "I enjoy"; Lt frūctus "profit"; Lt frux "fruit, grain"; Lt frugalis "bring profit".

Figure: Entry for PIE *kʷetware in Tower of Babel (http://starling.rinet.ru)

Insufficiencies of Data Representation

- data in "textual form" (impossible to search it efficiently)
- no standardized phonetic representations
- no standardized glosses for meanings
- no standardized names or abbreviations for language and dialect names
- no standardized representation of sound correspondences
- no standardized assignment of cognate sets and borrowings
- ...

Replication

Gloss | Blust | Pawley | Distance
"day" | *qaco | *qaco | 0
"to spit" | *qanusi | *qanusi | 0
"person" | *taumataq | *tamwata | 3
"to vomit" | *mumutaq | *mumuta | 1
"name" | *ŋajan | *qajan | 1
"snake" | *mwata | *mwata | 0
"man" | *mwa ruqane | *taumwaqane | 5
"four" | *pani | *pat | 2
"one" | *sakai | *tasa | 3
... | ... | ... | ...

Disagreement between experts on Proto-Oceanic (PO) reconstructions (Bouchard-Côté et al. 2013)
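The "Distance" column can be illustrated with a plain character-based edit distance. The sketch below is an assumption about how such a number could be computed, not a statement of the procedure used in the cited study; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted per character."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


# Reconstructions by Blust and Pawley, taken from the table (asterisks dropped).
pairs = [("taumataq", "tamwata"), ("mumutaq", "mumuta"),
         ("pani", "pat"), ("sakai", "tasa")]
distances = {pair: edit_distance(*pair) for pair in pairs}
```

Counting per character treats digraphs such as "mw" as two units; a phonetically informed comparison would first segment the forms into sounds.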

Reproducibility Problems in Historical Linguistics

Scholars disagree on many points in historical linguistics, be it the number of laryngeals, the position of Baltic and Slavic, or whether a given word was borrowed or not. We know well that no two etymological dictionaries for the same language or language family are completely identical. Unfortunately, we lack a rigorous check of the degree to which experts actually agree or disagree in their judgments. We also lack evaluation methods which would help us to show how well a given hypothesis (a reconstruction, a family tree, or an etymology) corresponds with our linguistic data.

Towards a Computer-Assisted Framework for Language Comparison

The classical linguist (figure: a book by Franz Bopp with a "very, very long title")

PRO:
- intuition
- background knowledge
- can juggle with multiple types of evidence

CONTRA:
- has to sleep and rest
- does not like to count and do boring work
- can overlook facts when doing boring work

The computer (figure: a screen showing Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B))

CONTRA:
- no intuition
- no background knowledge
- can't juggle with multiple types of evidence

PRO:
- doesn't need to sleep
- is very good at counting and boring work
- doesn't make errors in boring work

Combining both: COMPUTER-ASSISTED LANGUAGE COMPARISON

Standards

Standards: Concept Labeling

Concept List | # Items | Concept Label | Concept ID
Allen (2007) | 500 | animal oil; 动物油(脂肪) | GREASE (CONCEPTICON-ID: 323)
Gregersen (1976) | 217 | fat-grease*fat-grease | GREASE (CONCEPTICON-ID: 323)
Heggarty (2005) | 150 | fat (grease); grasa | GREASE (CONCEPTICON-ID: 323)
Swadesh (1955) | 100 | fat (grease) | GREASE (CONCEPTICON-ID: 323)
Alpher and Nash (1999) | 151 | fat, grease | GREASE (CONCEPTICON-ID: 323)
Hale (1961) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
O'Grady and Klokeid (1969) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
Blust (2008) | 210 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Matisoff (1978) | 200 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Samarin (1969) | 218 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Dunn et al. (2012) | 207 | fat | GREASE (CONCEPTICON-ID: 323)
Swadesh (1950) | 215 | fat | GREASE (CONCEPTICON-ID: 323)
Zgraggen (1980) | 380 | fat | GREASE (CONCEPTICON-ID: 323)
Jachontov (1991) | 100 | fat n. | GREASE (CONCEPTICON-ID: 323)
Wiktionary (2003) | 207 | fat (noun) | GREASE (CONCEPTICON-ID: 323)
Starostin (1991) | 110 | fat n.; жир | GREASE (CONCEPTICON-ID: 323)
Teil-Dautrey et al. (2008) | 430 | fat, oil | GREASE (CONCEPTICON-ID: 323)
Swadesh (1952) | 200 | fat (organic substance) | GREASE (CONCEPTICON-ID: 323)
Shiro (1973) | 200 | grease (fat) | GREASE (CONCEPTICON-ID: 323)
Samarin (1969) | 100 | grease; graisse; Fett; grasa | GREASE (CONCEPTICON-ID: 323)
Wang (2006) | 200 | pig oil; 猪油 | GREASE (CONCEPTICON-ID: 323)
Haspelmath and Tadmor (2009) | 1460 | the grease or fat | GREASE (CONCEPTICON-ID: 323)

Concept labels for "GREASE" in 22 different concept lists (see List et al. 2015, online at http://concepticon.clld.org)

The Concepticon (List, Cysouw, and Forkel submitted) is available in the form of an online application at http://concepticon.clld.org and an online repository at http://github.com/clld/concepticon-data. The data currently comprise 128 concept lists in which more than 10,000 concept labels are linked to about 2,000 concept sets. Basic semantic relations (broader, narrower, etc.) are defined between similar concept sets. Concept sets are enriched by linking them to additional metadata.
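The linking of list-specific labels to shared concept sets can be sketched as a small lookup structure. This is a minimal illustration, assuming a flat list of (concept list, label, concept set ID) triples; the names `LINKS`, `CONCEPT_SETS`, and `labels_by_concept_set` are hypothetical and do not reflect the actual layout of the Concepticon repository.

```python
from collections import defaultdict

# Three of the labels that are linked to the concept set GREASE.
LINKS = [
    ("Swadesh (1955)", "fat (grease)", 323),
    ("Blust (2008)", "fat/grease", 323),
    ("Wang (2006)", "pig oil; 猪油", 323),
]
CONCEPT_SETS = {323: "GREASE"}


def labels_by_concept_set(links):
    """Group the list-specific labels under the concept set they link to."""
    grouped = defaultdict(list)
    for concept_list, label, concept_id in links:
        grouped[concept_id].append((concept_list, label))
    return dict(grouped)


grouped = labels_by_concept_set(LINKS)
for concept_id, members in grouped.items():
    print(CONCEPT_SETS[concept_id], concept_id, members)
```

Once every label carries a concept set ID, wordlists compiled with different questionnaires can be merged on that ID instead of on free-text glosses.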

Standards: Lexical Representation

Dialect | Entry | IPA | Segments | Morphemes
Beijing | 大油 | ta⁵¹ iou³⁵ | t a ⁵¹ i o u ³⁵ | t a ⁵¹ + i o u ³⁵
Changsha | 油 | tɕy³³ iəu¹³ | tɕ y ³³ i ə u ¹³ | tɕ y ³³ + i ə u ¹³
Chengdu | 猪油 | tsu⁴⁴iəu³¹ | ts u ⁴⁴ i ə u ³¹ | ts u ⁴⁴ + i ə u ³¹
Fuzhou | 猪油 | ty⁴⁴iu⁵² | t y ⁴⁴ i u ⁵² | t y ⁴⁴ + i u ⁵²
Guangzhou | 猪膏 | tʃy⁵⁵kou⁵³ | tʃ y ⁵⁵ k ou ⁵³ | tʃ y ⁵⁵ + k ou ⁵³
Meixian | 油 | jiu¹² | j i u ¹² | j i u ¹²
Nanchang | 油 | iu⁵⁵ | i u ⁵⁵ | i u ⁵⁵
Taibei | ti44 iu13豬油 | ti⁴⁴ iu¹³ | t i ⁴⁴ i u ¹³ | t i ⁴⁴ + i u ¹³
Wenzhou | 猪油 | tsei⁴⁴ ɦiau³¹ | ts e i ⁴⁴ ɦ i a u ³¹ | ts e i ⁴⁴ + ɦ i a u ³¹
Xiamen | 油 | iu²⁴ | i u ²⁴ | i u ²⁴

Lexical entries for "GREASE" ("pork fat") in 10 Chinese dialect varieties (data taken from Wang and Hamed 2006)
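The "Morphemes" column uses spaces between segments and "+" between morphemes. A minimal sketch of how such a string could be read into nested lists, and how a fused boundary marker could be detected, is given below; the function names are hypothetical and the conventions are assumed from the table, not taken from a published specification.

```python
def parse_morphemes(segmented, boundary="+"):
    """Split a space-separated segment string into a list of morphemes."""
    morphemes = [[]]
    for token in segmented.split():
        if token == boundary:
            morphemes.append([])   # start the next morpheme
        else:
            morphemes[-1].append(token)
    return morphemes


def fused_boundaries(segmented, boundary="+"):
    """Return tokens in which the boundary marker is glued to a segment."""
    return [t for t in segmented.split() if boundary in t and t != boundary]


beijing = parse_morphemes("t a ⁵¹ + i o u ³⁵")
# A malformed input: the boundary marker is attached to the tone.
problems = fused_boundaries("ts e i +⁴⁴ ɦ i a u ³¹")
```

Checks of this kind are cheap to run over a whole wordlist and catch formatting slips before they distort alignments.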

Standards: Representation of Cognate Judgments

Language | Lexical Entry | Cognacy | Alignment
Central Amis | simar | 2 | s i m a r
Thao | lhimash | 2 | lh i m a sh
Hanunóo | tabáʔ | 23 | t a b á ʔ
Nias | tawõ | 23 | t a w õ -
Mailu | mona | 1 | m o n a -
Maloh | -iñak | 1 | - i ñ a k
Tetum | mina | 1 | m i n a -
Banggi | laːna | 24 | l aː n a -
Berawan (Long Terawan) | ləməʔ | 24 | l ə m ə ʔ
Iban | lemak | 24 | l e m a k

Cognate judgments for "grease/fat" across 10 Austronesian languages (data taken from Greenhill et al. 2008, online at http://language.psy.auckland.ac.nz/austronesian/)
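Once cognate sets are aligned, candidate sound correspondences can be read off the alignment columns. The following sketch counts segment pairs for every language pair inside each cognate set; the data structure and the function name `correspondences` are assumptions made for illustration, using two of the cognate sets from the table.

```python
from collections import Counter
from itertools import combinations

# Cognate set ID -> language -> aligned segments ("-" marks a gap).
ALIGNMENTS = {
    2: {"Central Amis": "s i m a r", "Thao": "lh i m a sh"},
    1: {"Mailu": "m o n a -", "Maloh": "- i ñ a k", "Tetum": "m i n a -"},
}


def correspondences(alignments):
    """Count aligned segment pairs per language pair across cognate sets."""
    counts = Counter()
    for rows in alignments.values():
        for (lang_a, seq_a), (lang_b, seq_b) in combinations(sorted(rows.items()), 2):
            segs_a, segs_b = seq_a.split(), seq_b.split()
            if len(segs_a) != len(segs_b):
                raise ValueError(f"unequal alignment length: {lang_a}, {lang_b}")
            for pair in zip(segs_a, segs_b):
                counts[(lang_a, lang_b) + pair] += 1
    return counts


counts = correspondences(ALIGNMENTS)
```

With a single concept the counts are all 1; correspondences only become informative when they recur across many cognate sets, which is why the standardized representation matters.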

Standards: Jena Wordlist Standard

The Jena Wordlist Standard is being developed by the NESCent-style working group "GlottoBank: Towards a Global Language Phylogeny" under the direction of Russell Gray.

Define standards for:
- Wordlists
- Cognate Sets
- Alignments

Provide tools for:
- Data Validation
- Data Exchange
- Data Enrichment

Integrate existing standards:
- Glottolog (http://glottolog.clld.org)
- Phoible (http://phoible.clld.org)
- Concepticon (http://concepticon.clld.org)

Provide tools for editing and analysis:
- LingPy (http://lingpy.org)
- TSV EDICTOR (http://tsv.lingpy.org)

Use the standard to build new databases:
- LexiBank: Cross-Linguistic Database of Lexical Cognate Sets
- PhonoBank: Cross-Linguistic Database of Regular Sound Change Patterns
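What data validation for a standardized wordlist might look like can be sketched with a tab-separated file and two simple checks. The column names (ID, DOCULECT, CONCEPT, IPA, TOKENS, COGID), the cognate IDs, and the function `validate` are assumptions chosen for illustration; they are not the specification of the Jena Wordlist Standard. The third row omits the tone in TOKENS on purpose.

```python
import csv
import io

REQUIRED_COLUMNS = ("ID", "DOCULECT", "CONCEPT", "IPA", "TOKENS", "COGID")

# Hypothetical wordlist excerpt; cognate IDs are invented for the example.
SAMPLE = (
    "ID\tDOCULECT\tCONCEPT\tIPA\tTOKENS\tCOGID\n"
    "1\tNanchang\tGREASE\tiu⁵⁵\ti u ⁵⁵\t1\n"
    "2\tXiamen\tGREASE\tiu²⁴\ti u ²⁴\t1\n"
    "3\tMeixian\tGREASE\tjiu¹²\tj i u\t1\n"
)


def validate(tsv_text):
    """Return a list of human-readable problems found in the wordlist."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for row in reader:
        if not row["COGID"].isdigit():
            problems.append(f"row {row['ID']}: COGID is not an integer")
        # The segments, joined back together, should spell out the IPA form.
        if "".join(row["TOKENS"].split()) != "".join(row["IPA"].split()):
            problems.append(f"row {row['ID']}: TOKENS do not spell out IPA")
    return problems


report = validate(SAMPLE)
```

A shared, machine-checkable format is what allows the same file to be passed between editing tools and analysis libraries without manual repair.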

Workflows

Figure: A possible computer-assisted, iterative workflow with automatic and manual components. The data pass through the following steps and states: raw data, semantic tagging, wordlist data, segmentation, tokens and morphemes, cognate detection, cognate sets, alignment analysis, sound correspondences, linguistic reconstruction, proto-forms, phylogenetic reconstruction, phylogenies. Sample items shown at each state: HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk]. The computer provides automatic analyses; the linguist revises automatic analyses.

Workflows: Tools

LingPy and EDICTOR: Two tools for computer-assisted language comparison.

LingPy (http://lingpy.org): Software Library for Automatic Tasks in Historical Linguistics
- phonetic segmentation
- phonetic alignment
- cognate detection
- ancestral state reconstruction
- borrowing detection
- phylogenetic reconstruction

TSV EDICTOR (http://tsv.lingpy.org): Online Tool for Computer-Assisted Language Comparison
- server- and client-based
- data validation
- phonetic segmentation
- cognate set editor
- alignment editor
- correspondence evaluation

Test Cases: Tukano (with T. Chacon)

Recent research in historical linguistics shows initial attempts to model language history no longer as a process of word gain and word loss, but as a process of sound changes across sets of cognate words (Wheeler and Whiteley 2015, Bouchard-Côté et al. 2013, Hruschka et al. 2015, Jäger and List 2015).

Classical linguists often base genetic classification on shared innovations in sound change, which make it possible to identify subgroups. The problem with shared innovations is the inherent circularity of the concept: valid innovations need to respect known tendencies of sound change, but highly frequent sound change patterns can often equally well be interpreted as parallel evolution.

Computational approaches ignore salient features of sound change: context-dependency, system-dependency, and directionality. They also ignore that the sound systems of ancestral languages do not necessarily resemble the alphabets of the contemporary languages.

Chacon and List (submitted) address these problems by
- assembling known sound changes, extracted from distinct phonetic contexts in the consonantal inventory of 21 Tukano languages, along with their ancestral forms in Proto-Tukano,
- using a weighted, directed parsimony framework to model transitions for multiple states of characters corresponding to one proto-sound in a distinct context,
- including states which are not attested in contemporary languages as "latent states", and
- using a genetic algorithm to infer the set of trees which minimizes the parsimony score.
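The core of a weighted, directed parsimony framework is Sankoff's dynamic programming with an asymmetric cost matrix. The sketch below scores one fixed tree for one character; the sound change p > h, its costs, and the tree are hypothetical and are not taken from the Tukano data, and the tree search by genetic algorithm is not shown.

```python
import math


def sankoff(tree, tip_states, states, cost):
    """Return {state: minimal cost of the subtree if its root has that state}.

    tree: a tip name (str) or a tuple of subtrees.
    cost[(parent_state, child_state)]: penalty for that change; it may be
    asymmetric, which is what makes the parsimony model directed.
    """
    if isinstance(tree, str):
        return {s: (0 if s == tip_states[tree] else math.inf) for s in states}
    child_tables = [sankoff(child, tip_states, states, cost) for child in tree]
    return {
        parent: sum(
            min(cost[(parent, child)] + table[child] for child in states)
            for table in child_tables
        )
        for parent in states
    }


STATES = ["p", "h"]
TREE = (("A", "B"), "C")
TIPS = {"A": "h", "B": "h", "C": "p"}

# Hypothetical weights: lenition p > h is cheap, the reverse is expensive.
DIRECTED = {("p", "p"): 0, ("h", "h"): 0, ("p", "h"): 1, ("h", "p"): 3}
UNIFORM = {("p", "p"): 0, ("h", "h"): 0, ("p", "h"): 1, ("h", "p"): 1}

directed_scores = sankoff(TREE, TIPS, STATES, DIRECTED)
uniform_scores = sankoff(TREE, TIPS, STATES, UNIFORM)
```

With uniform costs both root states tie, whereas the directed costs favor "p" at the root; in this toy setting the directionality of change is what makes the root state identifiable.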


The results show
- that directional models largely outperform classical Sankoff parsimony,
- that the directions in the proposed sound changes consistently identify the root of the languages by splitting Tukano into an Eastern and a Western branch,
- that the consensus classification for the best-scoring trees convincingly reconciles previous proposals in the literature.

Test Cases: Chinese

Figure: a gene tree with the events duplication, speciation, and lateral transfer, illustrating the resulting relations between genes: orthologs, paralogs, and xenologs.

Historical Relations: Terminology

Relation | Biology | Linguistics (current) | Linguistics (proposed)
common descent | homology | cognacy | cognate relation (cognacy)
common descent, direct | orthology | ? | direct cognate relation
common descent, indirect | paralogy | oblique cognacy | indirect cognate relation (oblique cognacy)
involving lateral transfer | xenology | ? | indirect etymological relation
all historical relations | | | etymological relation

The linguistic terminology regarding historical relations between words lags behind the terminology used in evolutionary biology.

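Figure: two examples of indirect cognacy.
A: Italian dare and French donner. Indo-European *deh₃- > Latin dare > Italian dare; Indo-European *deh₃-no- > Latin dōnum, dōnāre > French donner.
B: Italian sole, French soleil, Swedish sol, German Sonne. Indo-European *sóh₂-wl̩- > Latin solis > Italian sole; Latin soliculus > French soleil; Indo-European *sóh₂-wl̩- > Germanic *sōwel- > Swedish sol; Indo-European *sh₂én- > Germanic *sunnō- > German Sonne.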

Lexical change reveals complex patterns of which classical historical linguists are aware, but which they completely ignore in their terminology regarding historical relations between words.

Alignment of words for "moon" in four Germanic languages ("MOON"):

German  | m oː n t -
English | m uː n - -
Danish  | m ɔː n - ə
Swedish | m oː n - e

Alignment of words for "moon" in four Chinese dialects, with the morphemes "MOON", "SHINE", and "LIGHT":

Dialect   | "MOON"     | "SHINE"    | "LIGHT"
Fúzhōu    | ŋ u o ʔ ⁵  | - - - - -  | - - - - -
Měixiàn   | ŋ i a t ⁵  | - - - - -  | k u o ŋ ⁴⁴
Guǎngzhōu | j - y t ²  | l - œ ŋ ²² | - - - - -
Běijīng   | - y ɛ - ⁵¹ | l i ɑ ŋ -  | - - - - -

Figure: a tree of the four dialects Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng, annotated with the events INNOVATION, BORROWING, and LOSS.

The classical gain-loss models for lexical change employed in computational historical linguistics are largely unrealistic when it comes to the modeling of complex historical relations, especially relations of indirect cognacy.


Instead of using gain-loss models, we should try to find ways to model lexical change within multi-state approaches which also include the directionality of change.

List (submitted) illustrates how complex historical relations between words in Chinese dialects can be modeled by
- employing a directed weighted parsimony framework,
- modeling partial cognacy resulting from compounding as character-state transitions,
- computing weights between multiple character states with the help of a modified Hamming distance applied to the alignment of words which are segmented into morphemes, with insertions being penalized more heavily than deletions.
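A directed variant of the Hamming distance over morpheme alignments can be sketched as follows. The concrete weights (substitution 2, insertion 2, deletion 1) and the function name `transition_cost` are assumptions chosen for illustration; the only property taken from the description above is that insertions cost more than deletions.

```python
GAP = "-"


def transition_cost(source, target, substitution=2, insertion=2, deletion=1):
    """Cost of changing the aligned morpheme sequence `source` into `target`."""
    if len(source) != len(target):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for old, new in zip(source, target):
        if old == new:
            continue                # morpheme retained (or gap in both)
        if old == GAP:
            total += insertion      # a morpheme was added
        elif new == GAP:
            total += deletion       # a morpheme was lost
        else:
            total += substitution   # one morpheme replaced another
    return total


moon = ["月", GAP, GAP]
moon_light_suffix = ["月", "光", "佛"]
moon_shine = ["月", "亮", GAP]

gain = transition_cost(moon, moon_light_suffix)
loss = transition_cost(moon_light_suffix, moon)
```

Because the cost depends on the direction of the transition, the resulting penalty matrix is asymmetric, which is the property a directed parsimony analysis exploits.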

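Figure: transition penalties between the character states 月, 月光, 月光佛, and 月亮. The slide shows three penalty matrices, given here in the order in which they appear, together with the panel labels "Transition Penalty (SANKOFF)" and "Transition Penalty (DWST)".

First matrix:
       | 月 | 月光 | 月光佛 | 月亮
月     | 0 | 1 | 2 | 1
月光   | 2 | 0 | 1 | 2
月光佛 | 4 | 2 | 0 | 4
月亮   | 2 | 2 | 3 | 0

Second matrix:
       | 月 | 月光 | 月光佛 | 月亮
月     | 0 | 1 | 2 | 1
月光   | 1 | 0 | 1 | 2
月光佛 | 2 | 1 | 0 | 3
月亮   | 1 | 2 | 3 | 0

Third matrix:
       | 月 | 月光 | 月光佛 | 月亮
月     | 0 | 1 | 1 | 1
月光   | 1 | 0 | 1 | 1
月光佛 | 1 | 1 | 0 | 1
月亮   | 1 | 1 | 1 | 0

Worked examples for the alignments of 月光- and 月亮- with 月光佛: per-position penalties 0 0 1 (total 1) and 0 2 2 (total 4) in one panel; 0 0 1 (total 1) and 0 2 1 (total 3) in the other.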

The model is tested by determining how well the approach accounts for ancestral state reconstruction (semantic reconstruction) on a dataset of 24 Chinese dialects for which reference phylogenies were provided and ancestral states are known via Ancient Chinese texts.

The results show that
- binary state models perform worst (around 55% correct reconstructions),
- Fitch parsimony applied to multi-state representations performs slightly better than binary models (around 60% correct reconstructions),
- Sankoff parsimony performs much better, with scores around 75%, but with high dependency upon the reference phylogeny,
- directed Sankoff parsimony outperforms all approaches, reaching 82%.

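Figure: reference phylogeny of the Chinese dialects with their character states. Tips and group labels in the order shown (each group label follows its members): Fúzhōu, Táiběi, Xiàmén, Zhāngpíng (Mǐn); Guǎngzhōu, Měixiàn, Liánchéng (Hakka); Wēnzhōu, Níngbō, Sūzhōu, Shànghǎi, Shànghǎi_B (Wú); Nánchāng, Ānyì (Gàn); Chángshā, Shuāngfēng (Xiāng); Yàngshān, Wǔhàn, Níngxià, Chéngdū, Běijīng, Tàiyuán, Yúcì (Guānhuà). Character states: 月 'MOON', 月娘 'MOON-MOTHER', 月光 'MOON-LIGHT', 月光佛 'MOON-LIGHT-SUFFIX', 月亮 'MOON-SHINE', 月明 'MOON-BRIGHT'.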

Summary

The test cases mentioned above do not stop with the computational applications, but are instead intended to serve as a starting point from which classical linguists can evaluate and improve on the findings.

In the case of sound change processes, interactive applications help linguists to identify classical "shared innovations", but instead of determining them manually, linguists can inspect the consequences of their hypotheses regarding subgrouping.

In the case of complex relations between words, linguists can investigate the plausibility of phylogenetic models and compounding processes, thereby using parallel evolution as a proxy for the identification of lateral relations between the languages and gaining more insights into potential regularities of compounding in the history of Chinese.

Challenges

Modeling of Morphological Change
- morphological change is not systematic (as opposed to sound change)
- morphological differences in cognate sets distort the alignments

Modeling of Semantic Change
- semantic shift is not systematic but has general tendencies
- we need to incorporate known tendencies in our analyses

Modeling of Irregular Sound Change
- irregular or sporadic sound change is problematic for reconstruction
- we need to find ways to incorporate our uncertainty in our alignments

Concluding Remarks

The current practice of data representation in historical linguistics makes it difficult not only to compare and test hypotheses proposed by classical linguists against those proposed by computational approaches, but also to reconcile the insights we can gain from the two approaches. If we want to gain new insights into the past of our languages, we need to find ways to integrate both the knowledge which experts have been accumulating over centuries and the new computational tools which help to organize, analyze and integrate this knowledge.

Thanks for Your Attention!
