Slide 1

Slide 1 text

The Future of the Comparative Method: Towards a Computer-Assisted Framework of Linguistic Reconstruction
Johann-Mattis List
DFG research fellow
Centre des recherches linguistiques sur l’Asie Orientale
Team Adaptation, Integration, Reticulation, Evolution
EHESS and UPMC, Paris
2015-06-22
1 / 20

Slide 2

Slide 2 text

Background Background 2 / 20

Slide 3

Slide 3 text

Background Definitions Definitions 3 / 20

Slide 4

Slide 4 text

Background Definitions Definitions

In linguistics, the comparative method is a technique for studying the development of languages by performing a feature-by-feature comparison of two or more languages with common descent from a shared ancestor, as opposed to the method of internal reconstruction, which analyses the internal development of a single language over time.
Wikipedia s.v. "Comparative Method"

The method of comparing languages to determine whether and how they have developed from a common ancestor. The items compared are lexical and grammatical units, and the aim is to discover correspondences relating sounds in two or more different languages, which are so numerous and so regular, across sets of units with similar meanings, that no other explanation is reasonable.
Oxford Dictionary of Linguistics (Matthews 1997)

The comparative method is both the earliest and the most important of the methods of reconstruction. Most of the major insights into the prehistory of languages have been gained by the applications of this method, and most reconstructions have been based on it.
Fox (1995)

The Comparative Method is the central tool in historical linguistics for historical reconstruction and also classifying languages. A classification done with the Comparative Method is called a genetic classification. The result is that languages are arranged in language family trees. This means that languages are classified according to their genealogical relationships and are interpreted as being in relation of child- or sisterhood to other languages. Such a way of classifying entities is called phylogenetic classification in biology; a classification by genealogical relationships.
Fleischhauer (2009)

The method of comparatistics today is generally known under the not very well-chosen term "comparative-historical method". It constitutes a huge complex of abstract and concrete procedures for the investigation of the history of related languages which genetically go back to some uniform tradition of the past.
Klimov (1990), my translation

→ comparative linguistics, reconstruction
Routledge Dictionary of Language and Linguistics (Bussmann 1996)

3 / 20

Slide 5

Slide 5 text

Background Definitions Definitions
Scholar / Proof of Relationship / Study of Language History / External Reconstruction / Linguistic Reconstruction / Language Classification
Anttila (1972) ✓ ✓
Bußmann (2002) ✓
Fleischhauer (2009) ✓
Fox (1995) ✓
Glück (2000) ✓
Harrison (2003) ✓
Hoenigswald (1960) ✓
Jarceva (1990) ✓
Klimov (1990) ✓ ✓
Lehmann (1969) ✓
Makaev (1977) ✓
Matthews (1997) ✓
Rankin (2003) ✓
3 / 20

Slide 6

Slide 6 text

Background Definitions Definitions
Working Definition for the Comparative Method
The comparative method is a set of techniques commonly used by historical linguists to reconstruct the history of languages and language families.
3 / 20

Slide 7

Slide 7 text

Background Workflows Workflows 4 / 20

Slide 8

Slide 8 text

Background Workflows Workflows
Workflow by Ross and Durie (1996)
1. Determine on the strength of diagnostic evidence that a set of languages are genetically related, that is, that they constitute a ‘family’;
2. Collect putative cognate sets for the family (both morphological paradigms and lexical items).
3. Work out the sound correspondences from the cognate sets, putting ‘irregular’ cognate sets on one side;
4. Reconstruct the protolanguage of the family as follows:
   a. Reconstruct the protophonology from the sound correspondences worked out in (3), using conventional wisdom regarding the directions of sound changes.
   b. Reconstruct protomorphemes (both morphological paradigms and lexical items) from the cognate sets collected in (2), using the protophonology reconstructed in (4a).
5. Establish innovations (phonological, lexical, semantic, morphological, morphosyntactic) shared by groups of languages within the family relative to the reconstructed protolanguage.
6. Tabulate the innovations established in (5) to arrive at an internal classification of the family, a ‘family tree’.
7. Construct an etymological dictionary, tracing borrowings, semantic change, and so forth, for the lexicon of the family (or of one language of the family).
4 / 20

Slide 9

Slide 9 text

Background Workflows Workflows
[Figure: flowchart with the boxes PROOF OF LANGUAGE RELATIONSHIP, COGNATE SET IDENTIFICATION, SOUND CORRESPONDENCE IDENTIFICATION, PHONOLOGICAL AND MORPHOLOGICAL RECONSTRUCTION, IDENTIFICATION OF INNOVATIONS, RECONSTRUCTION OF PHYLOGENIES, PUBLISH ETYMOLOGICAL DICTIONARY]
Tentative Visualization of the Workflow by Ross and Durie (1996: 6f)
4 / 20

Slide 10

Slide 10 text

Background Workflows Workflows
[Figure: flowchart with the steps proof of relationship, identification of cognates, identification of sound correspondences, reconstruction of proto-forms, internal classification, connected by four arrows labeled "revise"]
Simplified Version of Ross and Durie’s Workflow (List 2014: 58)
4 / 20

Slide 11

Slide 11 text

Problems Problems 5 / 20

Slide 12

Slide 12 text

Problems Application Application 6 / 20

Slide 15

Slide 15 text

Problems Application Application
[Figure: flowchart with the boxes PROOF OF LANGUAGE RELATIONSHIP, COGNATE SET IDENTIFICATION, SOUND CORRESPONDENCE IDENTIFICATION, PHONOLOGICAL AND MORPHOLOGICAL RECONSTRUCTION, IDENTIFICATION OF INNOVATIONS, RECONSTRUCTION OF PHYLOGENIES, PUBLISH ETYMOLOGICAL DICTIONARY]
TIME CONSUMING...
TEDIOUS...
6 / 20

Slide 16

Slide 16 text

Problems Representation Representation 7 / 20

Slide 17

Slide 17 text

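Problems Representation Representation
Frucht, ferner fruchten, befruchten, Befruchtung, fruchtbar, fruchtig
Frucht f. ‘der Fortpflanzung der eigenen Art dienendes Produkt einer Pflanze’, auch ‘ungeborenes Lebewesen’, übertragen ‘Ertrag’, ahd. fruht (9. Jh.), mhd. vruht, asächs. fruht, mnd. mnl. nl. vrucht beruhen auf einer frühen Entlehnung von gleichbed. lat. frūctus, abgeleitet vom Verb lat. fruī (frūctus sum) ‘genießen, Nutzen ziehen’ (verwandt mit brauchen, s. d.). Das Deminutiv Früchtchen hat die spezielle Bedeutung [...]
[English: Frucht, further fruchten, befruchten, Befruchtung, fruchtbar, fruchtig. Frucht f. ‘product of a plant that serves the reproduction of its own kind’, also ‘unborn living being’, figuratively ‘yield’; OHG fruht (9th c.), MHG vruht, Old Saxon fruht, Middle Low German, Middle Dutch, Dutch vrucht rest on an early borrowing of Lat. frūctus with the same meaning, derived from the verb Lat. fruī (frūctus sum) ‘to enjoy, to derive benefit’ (related to brauchen, q.v.). The diminutive Früchtchen has the special meaning [...]]
German "Frucht" in Pfeifer (1993, also at http://dwds.de)
7 / 20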

Slide 19

Slide 19 text

Problems Representation Representation
[Figure: the entry for German "Frucht" in Pfeifer (1993, also at http://dwds.de), shown next to a graph of etymological relations with the edge types "inherited from", "borrowed from", and "derived from"]
Nodes of the graph:
PIE *bhreu̯Hg̑- “to use”
PIE *bhruHg̑-ié- “to use” (present tense)
PGM *ƀrūkan- “to use”
OHG brūhhan “to use”
G brauchen “to use”
G Brauch “custom”
OHG fruht “profit, fruit”
G frugal “modest (food)”
Fr fruit “profit, fruit”
Fr frugal “modest (food)”
Lt fruor, fruī “I enjoy”
Lt frūctus “profit”
Lt frux “fruit, grain”
Lt frugalis “bring profit”
Adapted from an Illustration by Hans Geisler (University Düsseldorf)
7 / 20

Slide 20

Slide 20 text

Problems Representation Representation Entry for PIE *kʷetware in Tower of Babel (http://starling.rinet.ru) 7 / 20

Slide 21

Slide 21 text

Problems Representation Representation
Insufficiencies of Data Representation
- data in “textual form” (impossible to search efficiently)
- no standardized phonetic representations
- no standardized glosses for meanings
- no standardized names or abbreviations for language and dialect names
- no standardized representation of sound correspondences
- no standardized assignment of cognate sets and borrowings
- ...
8 / 20
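As a contrast to the textual form criticized above, the following is a minimal sketch of how part of the "Frucht" etymology could be stored as searchable data. The class names, field names, and the `sources` helper are hypothetical and chosen only for illustration; the two relations encoded are those stated in the dictionary entry (OHG fruht borrowed from Lat. frūctus, which is derived from Lat. fruī).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    language: str  # abbreviation as used on the slide, e.g. "OHG", "Lt"
    form: str
    gloss: str


@dataclass(frozen=True)
class Relation:
    target: Word   # the word being explained
    source: Word   # the word it goes back to
    kind: str      # "inherited", "borrowed", or "derived"


# Illustrative data, limited to relations stated in the dictionary entry.
fruht = Word("OHG", "fruht", "profit, fruit")
fructus = Word("Lt", "frūctus", "profit")
frui = Word("Lt", "fruor, fruī", "I enjoy")

RELATIONS = [
    Relation(fruht, fructus, "borrowed"),
    Relation(fructus, frui, "derived"),
]


def sources(word: Word, kind: Optional[str] = None) -> List[Word]:
    """Return the words that `word` goes back to, optionally filtered by relation type."""
    return [
        r.source
        for r in RELATIONS
        if r.target == word and (kind is None or r.kind == kind)
    ]


print(sources(fruht, "borrowed"))
```

Once the relations are explicit, questions such as "which words are borrowed?" become simple queries rather than a matter of reading prose entries.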

Slide 22

Slide 22 text

Problems Replication Replication 9 / 20

Slide 23

Slide 23 text

Problems Replication Replication
Gloss | Blust | Pawley | Distance
“day” | *qaco | *qaco | 0
“to spit” | *qanusi | *qanusi | 0
“person” | *taumataq | *tamwata | 3
“to vomit” | *mumutaq | *mumuta | 1
“name” | *ŋajan | *qajan | 1
“snake” | *mwata | *mwata | 0
“man” | *mwa ruqane | *taumwaqane | 5
“four” | *pani | *pat | 2
“one” | *sakai | *tasa | 3
... | ... | ... | ...
Disagreement between experts on Proto-Oceanic (PO) reconstructions (Bouchard-Côté et al. 2014)
9 / 20
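The distances in the table are consistent with a plain edit distance between the two reconstructions. The sketch below is a standard Levenshtein implementation over characters; whether the original study counted characters or phonological segments is an assumption here, and the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Reconstructions by Blust and Pawley, taken from the table above.
pairs = [("pani", "pat"), ("mumutaq", "mumuta"), ("sakai", "tasa")]
for blust, pawley in pairs:
    print(blust, pawley, edit_distance(blust, pawley))
```

A measure of this kind gives a first, rough quantification of how far two experts' reconstructions diverge.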

Slide 24

Slide 24 text

Problems Replication Replication
Reproducibility Problems in Historical Linguistics
- Scholars disagree on many points in historical linguistics, be it the number of laryngeals, the position of Baltic and Slavic, or whether a given word was borrowed or not.
- We know well that no two etymological dictionaries for the same language or language family are completely identical.
- Unfortunately, we lack a rigorous check of the degree to which experts actually agree or disagree in their judgments.
- We also lack methods for evaluation which would help us to show to what degree a given hypothesis (a reconstruction, a family tree, or an etymology) corresponds with our linguistic data.
9 / 20

Slide 25

Slide 25 text

Towards a Computer-Assisted Comparative Method Towards a Computer-Assisted Comparative Method 10 / 20

Slide 26

Slide 26 text

Towards a Computer-Assisted Comparative Method
[Figure labels: P(A|B) = P(B|A) P(A) / P(B); FRANZ BOPP, VERY, VERY LONG TITLE]
11 / 20

Slide 29

Slide 29 text

Towards a Computer-Assisted Comparative Method
[Figure labels: P(A|B) = P(B|A) P(A) / P(B); FRANZ BOPP, VERY, VERY LONG TITLE]
Human expert
PRO:
- intuition
- background knowledge
- can juggle with multiple types of evidence
CONTRA:
- has to sleep and rest
- does not like to count and do boring work
- can overlook facts when doing boring work
Computer
CONTRA:
- no intuition
- no background knowledge
- can't juggle with multiple types of evidence
PRO:
- doesn't need to sleep
- is very good at counting and boring work
- doesn't make errors in boring work
COMPUTER-ASSISTED LANGUAGE COMPARISON
11 / 20

Slide 30

Slide 30 text

Towards a Computer-Assisted Comparative Method Standards Standards 12 / 20

Slide 31

Slide 31 text

Towards a Computer-Assisted Comparative Method Standards Standards: Concept Labeling 12 / 20

Slide 32

Slide 32 text

Towards a Computer-Assisted Comparative Method Standards Standards: Concept Labeling
Concept List | # Items | Concept Label | Concept ID
Allen (2007) | 500 | animal oil; 动物油(脂肪) | GREASE (CONCEPTICON-ID: 323)
Gregersen (1976) | 217 | fat-grease*fat-grease | GREASE (CONCEPTICON-ID: 323)
Heggarty (2005) | 150 | fat (grease); grasa | GREASE (CONCEPTICON-ID: 323)
Swadesh (1955) | 100 | fat (grease) | GREASE (CONCEPTICON-ID: 323)
Alpher and Nash (1999) | 151 | fat, grease | GREASE (CONCEPTICON-ID: 323)
Hale (1961) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
O'Grady and Klokeid (1969) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
Blust (2008) | 210 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Matisoff (1978) | 200 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Samarin (1969) | 218 | fat/grease | GREASE (CONCEPTICON-ID: 323)
Dunn et al. (2012) | 207 | fat | GREASE (CONCEPTICON-ID: 323)
Swadesh (1950) | 215 | fat | GREASE (CONCEPTICON-ID: 323)
Zgraggen (1980) | 380 | fat | GREASE (CONCEPTICON-ID: 323)
Jachontov (1991) | 100 | fat n. | GREASE (CONCEPTICON-ID: 323)
Wiktionary (2003) | 207 | fat (noun) | GREASE (CONCEPTICON-ID: 323)
Starostin (1991) | 110 | fat n.; жир | GREASE (CONCEPTICON-ID: 323)
Teil-Dautrey et al. (2008) | 430 | fat, oil | GREASE (CONCEPTICON-ID: 323)
Swadesh (1952) | 200 | fat (organic substance) | GREASE (CONCEPTICON-ID: 323)
Shiro (1973) | 200 | grease (fat) | GREASE (CONCEPTICON-ID: 323)
Samarin (1969) | 100 | grease; graisse; Fett; grasa | GREASE (CONCEPTICON-ID: 323)
Wang (2006) | 200 | pig oil; 猪油 | GREASE (CONCEPTICON-ID: 323)
Haspelmath and Tadmor (2009) | 1460 | the grease or fat | GREASE (CONCEPTICON-ID: 323)
Concept labels for “GREASE” in 22 different concept lists (see List et al. 2015, online at http://concepticon.clld.org)
12 / 20
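The idea behind concept labeling can be sketched as a lookup that maps the heterogeneous labels of individual concept lists to one shared identifier. The snippet below is only an illustration: the dictionary holds a handful of labels from the table, the ID 323 is the one shown on the slides, and the function names and the normalization rule are assumptions, not the Concepticon's actual linking procedure.

```python
from typing import Optional, Tuple

# A few source labels from the table, all linked to the same concept.
LABEL_TO_CONCEPT = {
    "fat (grease)": ("GREASE", 323),
    "fat, grease": ("GREASE", 323),
    "fat/grease": ("GREASE", 323),
    "fat n.": ("GREASE", 323),
    "grease (fat)": ("GREASE", 323),
    "pig oil; 猪油": ("GREASE", 323),
}


def normalize(label: str) -> str:
    """Hypothetical normalization: trim whitespace and ignore case."""
    return label.strip().lower()


def link(label: str) -> Optional[Tuple[str, int]]:
    """Return (concept gloss, concept ID) for a source label, or None if unlinked."""
    return LABEL_TO_CONCEPT.get(normalize(label))


for label in ["Fat/Grease", "fat n.", "lard"]:
    print(label, "->", link(label))
```

Labels that are not yet linked come back as `None`, which makes gaps in the mapping visible instead of silently merging or dropping concepts.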

Slide 35

Slide 35 text

Towards a Computer-Assisted Comparative Method Standards Standards: Lexical Representation 13 / 20

Slide 36

Slide 36 text

Towards a Computer-Assisted Comparative Method Standards Standards: Lexical Representation
Dialect | Entry | IPA | Segments | Morphemes
Beijing | 大油 | ta⁵¹ iou³⁵ | t a ⁵¹ i o u ³⁵ | t a ⁵¹ + i o u ³⁵
Changsha | 油 | tɕy³³ iəu¹³ | tɕ y ³³ i ə u ¹³ | tɕ y ³³ + i ə u ¹³
Chengdu | 猪油 | tsu⁴⁴iəu³¹ | ts u ⁴⁴ i ə u ³¹ | ts u ⁴⁴ + i ə u ³¹
Fuzhou | 猪油 | ty⁴⁴iu⁵² | t y ⁴⁴ i u ⁵² | t y ⁴⁴ + i u ⁵²
Guangzhou | 猪膏 | tʃy⁵⁵kou⁵³ | tʃ y ⁵⁵ k ou ⁵³ | tʃ y ⁵⁵ + k ou ⁵³
Meixian | 油 | jiu¹² | j i u ¹² | j i u ¹²
Nanchang | 油 | iu⁵⁵ | i u ⁵⁵ | i u ⁵⁵
Taibei | 豬油 | ti⁴⁴ iu¹³ | t i ⁴⁴ i u ¹³ | t i ⁴⁴ + i u ¹³
Wenzhou | 猪油 | tsei⁴⁴ ɦiau³¹ | ts e i ⁴⁴ ɦ i a u ³¹ | ts e i ⁴⁴ + ɦ i a u ³¹
Xiamen | 油 | iu²⁴ | i u ²⁴ | i u ²⁴
Lexical entries for “GREASE” (“pork fat”) in 10 Chinese dialect varieties (data taken from Wang and Hamed 2006)
13 / 20
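The step from the IPA column to the Segments and Morphemes columns can be approximated with a small tokenizer. This is a simplified sketch, not the procedure used for the slides: it assumes that every syllable ends in a superscript tone, treats each syllable as one morpheme, knows only the three affricates listed, and does not group diphthongs (so it would split the Guangzhou "ou" into two segments).

```python
from typing import List

TONES = set("⁰¹²³⁴⁵⁶⁷⁸⁹")          # superscript tone digits
AFFRICATES = ("tɕ", "tʃ", "ts")     # two-character segments kept together


def segment(ipa: str) -> List[List[str]]:
    """Split an IPA string into morphemes, each a list of sound segments.
    A run of tone digits closes the current morpheme."""
    text = ipa.replace(" ", "")
    morphemes, current, i = [], [], 0
    while i < len(text):
        if text[i] in TONES:
            j = i
            while j < len(text) and text[j] in TONES:
                j += 1
            current.append(text[i:j])
            morphemes.append(current)
            current = []
            i = j
        elif text.startswith(AFFRICATES, i):
            current.append(text[i:i + 2])
            i += 2
        else:
            current.append(text[i])
            i += 1
    if current:  # trailing material without a tone
        morphemes.append(current)
    return morphemes


def show(morphemes: List[List[str]]) -> str:
    """Render morphemes in the notation of the table, e.g. 't a ⁵¹ + i o u ³⁵'."""
    return " + ".join(" ".join(m) for m in morphemes)


print(show(segment("tsu⁴⁴iəu³¹")))
```

For the Chengdu entry this reproduces the Morphemes column of the table; entries with diphthongs treated as single segments would need a richer segment inventory.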

Slide 40

Slide 40 text

Towards a Computer-Assisted Comparative Method Standards Standards: Representation of Cognate Judgments 14 / 20

Slide 41

Slide 41 text

Towards a Computer-Assisted Comparative Method Standards Standards: Representation of Cognate Judgments
Language | Lexical Entry | Cognacy | Alignment
Central Amis | simar | 2 | s i m a r
Thao | lhimash | 2 | lh i m a sh
Hanunóo | tabáʔ | 23 | t a b á ʔ
Nias | tawõ | 23 | t a w õ -
Mailu | mona | 1 | m o n a -
Maloh | -iñak | 1 | - i ñ a k
Tetum | mina | 1 | m i n a -
Banggi | laːna | 24 | l aː n a -
Berawan (Long Terawan) | ləməʔ | 24 | l ə m ə ʔ
Iban | lemak | 24 | l e m a k
Cognate judgments for “grease/fat” across 10 Austronesian languages (data taken from Greenhill et al. 2008, online at http://language.psy.auckland.ac.nz/austronesian/)
14 / 20
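Once cognate judgments and alignments are stored in explicit columns, they can be grouped and checked mechanically. The sketch below uses the ten rows of the table; the tuple layout and the function names are illustrative assumptions rather than a prescribed format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (language, lexical entry, cognate ID, alignment), taken from the table above.
ROWS = [
    ("Central Amis", "simar", 2, "s i m a r"),
    ("Thao", "lhimash", 2, "lh i m a sh"),
    ("Hanunóo", "tabáʔ", 23, "t a b á ʔ"),
    ("Nias", "tawõ", 23, "t a w õ -"),
    ("Mailu", "mona", 1, "m o n a -"),
    ("Maloh", "-iñak", 1, "- i ñ a k"),
    ("Tetum", "mina", 1, "m i n a -"),
    ("Banggi", "laːna", 24, "l aː n a -"),
    ("Berawan (Long Terawan)", "ləməʔ", 24, "l ə m ə ʔ"),
    ("Iban", "lemak", 24, "l e m a k"),
]

Member = Tuple[str, List[str]]


def cognate_sets(rows) -> Dict[int, List[Member]]:
    """Group (language, aligned segments) pairs by cognate ID."""
    sets = defaultdict(list)
    for language, _entry, cogid, alignment in rows:
        sets[cogid].append((language, alignment.split()))
    return dict(sets)


def is_aligned(members: List[Member]) -> bool:
    """All sequences in one cognate set must have the same number of columns."""
    return len({len(segments) for _, segments in members}) == 1


sets = cognate_sets(ROWS)
for cogid in sorted(sets):
    print(cogid, [lang for lang, _ in sets[cogid]], is_aligned(sets[cogid]))
```

A check like `is_aligned` is a simple form of data validation: it catches alignments in which a gap symbol was forgotten.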

Slide 43

Slide 43 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard 15 / 20

Slide 44

Slide 44 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard
JENA WORDLIST STANDARD
The Jena Wordlist Standard is being developed by the NESCent-style working group “GlottoBank: Towards a Global Language Phylogeny” under the direction of Russell Gray.
15 / 20

Slide 45

Slide 45 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard
JENA WORDLIST STANDARD
DEFINE STANDARDS FOR
- Wordlists
- Cognate Sets
- Alignments
PROVIDE TOOLS FOR
- Data Validation
- Data Exchange
- Data Enrichment
15 / 20

Slide 46

Slide 46 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard
JENA WORDLIST STANDARD
INTEGRATE EXISTING STANDARDS
- Glottolog http://glottolog.clld.org
- Phoible [ˈfɔi.bł] http://phoible.clld.org
- CONCEPTICON http://concepticon.clld.org
15 / 20

Slide 47

Slide 47 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard
JENA WORDLIST STANDARD
PROVIDE TOOLS FOR EDITING AND ANALYSIS
- LingPy http://lingpy.org
- EDICTOR http://tsv.lingpy.org
15 / 20

Slide 48

Slide 48 text

Towards a Computer-Assisted Comparative Method Standards Jena Wordlist Standard
JENA WORDLIST STANDARD
USE THE STANDARD TO BUILD NEW DATABASES
- LexiBank: Cross-Linguistic Database of Lexical Cognate Sets
- PhonoBank: Cross-Linguistic Database of Regular Sound Change Patterns
15 / 20

Slide 49

Slide 49 text

Towards a Computer-Assisted Comparative Method Workflows Workflows 16 / 20

Slide 50

Slide 50 text

Towards a Computer-Assisted Comparative Method Workflows Workflows
[Figure: workflow diagram. Figure labels: P(A|B) = P(B|A) P(A) / P(B), PROVIDES AUTOMATIC ANALYSES; FRANZ BOPP, VERY, VERY LONG TITLE, REVISES AUTOMATIC ANALYSES]
Steps: Semantic Tagging, Segmentation, Cognate Detection, Alignment Analysis, Linguistic Reconstruction, Phylogenetic Reconstruction
Data stages: RAW DATA, WORDLIST DATA, TOKENS AND MORPHEMES, COGNATE SETS, SOUND CORRESPONDENCES, PROTO-FORMS, PHYLOGENIES
A possible computer-assisted, iterative workflow with automatic and manual components.
16 / 20
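The division of labor in this workflow (the computer proposes, the expert revises) can be illustrated with a deliberately naive toy example for the cognate detection step. The heuristic below (same first sound means same cognate set) is a placeholder chosen for brevity, not a real detection method, and all function names are hypothetical.

```python
from typing import Dict


def auto_cognates(words: Dict[str, str]) -> Dict[str, int]:
    """Toy automatic step: words sharing their first sound get the same cognate ID."""
    ids: Dict[str, int] = {}
    proposal = {}
    for language, form in words.items():
        proposal[language] = ids.setdefault(form[0], len(ids) + 1)
    return proposal


def expert_revision(proposal: Dict[str, int], corrections: Dict[str, int]) -> Dict[str, int]:
    """Manual step: the expert overrides individual automatic judgments."""
    revised = dict(proposal)
    revised.update(corrections)
    return revised


# Words for "hand" in four languages (broad transcriptions).
words = {"English": "hænd", "German": "hant", "Dutch": "ɦɑnt", "French": "mɛ̃"}

proposal = auto_cognates(words)                 # Dutch is wrongly separated
final = expert_revision(proposal, {"Dutch": 1})  # expert merges it with English and German
print(proposal)
print(final)
```

The automatic pass does the repetitive counting and grouping, while the correction step records where expert knowledge overrides it, which also keeps the expert's decisions explicit and reviewable.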

Slide 51

Slide 51 text

Towards a Computer-Assisted Comparative Method Workflows Workflows: Tools 17 / 20

Slide 52

Slide 52 text

Towards a Computer-Assisted Comparative Method Workflows Workflows: Tools LingPy http://lingpy.org TSV EDICTOR http://tsv.lingpy.org 17 / 20

Slide 53

Slide 53 text

Towards a Computer-Assisted Comparative Method Workflows Workflows: Tools
LingPy and EDICTOR: Two tools for computer-assisted language comparison.
LingPy (http://lingpy.org): Software Library for Automatic Tasks in Historical Linguistics
- phonetic segmentation
- phonetic alignment
- cognate detection
- ancestral state reconstruction
- borrowing detection
- phylogenetic reconstruction
17 / 20

Slide 54

Slide 54 text

Towards a Computer-Assisted Comparative Method Workflows Workflows: Tools
LingPy and EDICTOR: Two tools for computer-assisted language comparison.
EDICTOR (http://tsv.lingpy.org): Online Tool for Computer-Assisted Language Comparison
- server- and client-based
- data validation
- phonetic segmentation
- cognate set editor
- alignment editor
- correspondence evaluation
17 / 20

Slide 55

Slide 55 text

Towards a Computer-Assisted Comparative Method Workflows Workflows: Test Cases
Reconstruction of Tukano Languages (with T. Chacon)
- 15 Tukano languages
- 140 concepts
- cognate sets are aligned with proposed reconstructions
Reconstruction of Burmish Languages (with N. Hill)
- 8 Burmish languages
- about 500 concepts
- cognate sets were determined automatically and are currently being reviewed by the expert
Lexical Homology Database of Sino-Tibetan Languages (with L. Sagart and G. Jacques)
- more than 50 Sino-Tibetan languages
- about 240 concepts
- data is currently being assembled
18 / 20

Slide 57

Slide 57 text

Towards a Computer-Assisted Comparative Method Challenges Challenges
[Figure labels: P(A|B) = P(B|A) P(A) / P(B); FRANZ BOPP, VERY, VERY LONG TITLE]
Modeling of Morphological Change
- morphological change is not systematic (as opposed to sound change)
- morphological differences in cognate sets distort the alignments
Modeling of Semantic Change
- semantic shift is not systematic but has general tendencies
- we need to incorporate known tendencies in our analyses
Modeling of Irregular Sound Change
- irregular or sporadic sound change is problematic for reconstruction
- we need to find ways to incorporate our uncertainty in our alignments
19 / 20

Slide 58

Slide 58 text

Concluding Remarks
Many hypotheses have been proposed regarding the deeper phylogeny of the Austronesian and many other language families.
Unfortunately, the current practice of data presentation makes it difficult to compare and test these hypotheses.
If we want to gain new insights into the past of our languages, we need to find ways to integrate both the knowledge which experts have been accumulating over centuries and the new computational tools which help to organize, analyze and integrate this knowledge.
20 / 20

Slide 59

Slide 59 text

Thanks for Your Attention! 20 / 20