Increasing the Comparability of Linguistic Data Johann-Mattis List Department of Linguistic and Cultural Evolution Max Planck Institute for the Science of Human History Jena 2017/02/24 1 / 37
Prologue Moral of the Story Restricting our perspective by modeling and formalizing the phenomena we are dealing with may actually open our eyes to details we had disregarded before. 4 / 37
Problems General Data Problem in Linguistics Linguists face very complex problems in their research. But they tend to overemphasize the complexity of their problems. As a result, they refuse to handle even the things which could be easily handled. Instead of “Yes, we can!”, linguists tend to say “Can we really?” 6 / 37
Problems Representation Representation Frucht, ferner fruchten, befruchten, Befruchtung, fruchtbar, fruchtig Frucht f. ‘der Fortpflanzung der eigenen Art dienendes Produkt einer Pflanze’, auch ‘ungeborenes Lebewesen’, übertragen ‘Ertrag’, ahd. fruht (9. Jh.), mhd. vruht, asächs. fruht, mnd. mnl. nl. vrucht beruhen auf einer frühen Entlehnung von gleichbed. lat. frūctus, abgeleitet vom Verb lat. fruī (frūctus sum) ‘genießen, Nutzen ziehen’ (verwandt mit brauchen, s. d.). Das Deminutiv Früchtchen hat die spezielle Bedeutung [...] German "Frucht" in Pfeifer (1993, also at http://dwds.de) 8 / 37
Problems Representation Representation
[Figure: etymological network for German "Frucht" with three edge types: inherited from, borrowed from, derived from. Nodes:]
- PIE *bhreu̯Hg̑- “to use”
- PIE *bhruHg̑-ié- “to use” (present tense)
- PGM *ƀrūkan- “to use”
- OHG brūhhan “to use”
- G brauchen “to use”
- G Brauch “custom”
- OHG fruht “profit, fruit”
- G frugal “modest (food)”
- Fr fruit “profit, fruit”
- Fr frugal “modest (food)”
- Lt fruor, fruī “I enjoy”
- Lt frūctus “profit”
- Lt frux “fruit, grain”
- Lt frugalis “bring profit”
Adapted from an illustration by Hans Geisler (University Düsseldorf). German "Frucht" in Pfeifer (1993, also at http://dwds.de) 8 / 37
Problems Representation Representation
Insufficiencies of Data Representation
- data in “textual form” (impossible to search efficiently)
- no standardized phonetic representations
- no standardized glosses for meanings
- no standardized names or abbreviations for languages and dialects
- no standardized representation of sound correspondences
- no standardized assignment of cognate sets and borrowings
- ...
9 / 37
Problems Replication Replication Reproducibility Problems in Historical Linguistics Scholars disagree on many points in historical linguistics, be it the number of laryngeals, the position of Baltic and Slavic, or whether a given word was borrowed or not. We know well that no two etymological dictionaries for the same language or language family are completely identical. Unfortunately, we lack a rigorous check of the degree to which experts actually agree or disagree in their judgments. We also lack evaluation methods that would help us show the degree to which a given hypothesis (a reconstruction, a family tree, or an etymology) corresponds with our linguistic data. 10 / 37
Increasing Comparability Formats Cross-Linguistic Data Formats
Key Aspects of CLDF
- use CSV (comma-separated values) as a basic format for tabular data
- use JSON (a key-value data format) for meta-data
- define how standard columns of the data (languages/doculects, concepts, transcriptions, grammatical features, etc.) should be treated
- provide an API that checks the consistency of datasets
- provide sample datasets that illustrate the data format
- provide applications which handle CLDF (for example, in automatic analyses)
13 / 37
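The first two points can be sketched with Python's standard library alone. The column names, the metadata keys, and the rule that the CSV header must equal the declared column list are assumptions made for illustration; they are not taken from the CLDF specification.

```python
import csv
import io
import json

# Hypothetical tabular data: one word form per row.
CSV_TEXT = """ID,DOCULECT,CONCEPT,TRANSCRIPTION
10,German,Harry,haralt
11,English,Harry,hæri
12,Russian,Harry,gali
"""

# Hypothetical metadata: key-value pairs that describe the table.
JSON_TEXT = """{
  "title": "Toy word list",
  "columns": ["ID", "DOCULECT", "CONCEPT", "TRANSCRIPTION"]
}"""


def load_dataset(csv_text, json_text):
    """Parse the table and its metadata and check that they agree."""
    metadata = json.loads(json_text)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Assumed rule: the header must match the declared columns exactly.
    if reader.fieldnames != metadata["columns"]:
        raise ValueError("CSV header does not match the metadata")
    return metadata, rows


metadata, rows = load_dataset(CSV_TEXT, JSON_TEXT)
```

Keeping the table and its description in two separate plain-text files means either one can be read without special software, which is the main appeal of the CSV plus JSON combination.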
Increasing Comparability Standards Word Lists and Lexical Data
Word Lists in CLDF
- define each row as a word
- indicate the language in which this word is spoken in one column
- indicate the meaning in another column
- provide information on the form in additional columns
15 / 37
Increasing Comparability Standards Standards for Word Lists and Lexical Data
ID    DOCULECT   CONCEPT     TRANSCRIPTION   ...
1     German     Woldemort   valdəmar        ...
2     English    Woldemort   wɔldəmɔrt       ...
3     Chinese    Woldemort   fu⁵¹ti⁵¹mɔ³⁵    ...
4     Russian    Woldemort   vladimir        ...
...   ...        ...         ...             ...
10    German     Harry       haralt          ...
11    English    Harry       hæri            ...
12    Russian    Harry       gali            ...
...   ...        ...         ...             ...
16 / 37
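As a minimal sketch of why the one-word-per-row layout is convenient, the rows of this slide can be regrouped by concept in a few lines of Python. The function name `forms_by_concept` is hypothetical, and the rows the slide elides with "..." are simply left out.

```python
from collections import defaultdict

# Rows copied from the slide; each row is one word form.
WORDLIST = [
    {"ID": 1, "DOCULECT": "German", "CONCEPT": "Woldemort", "TRANSCRIPTION": "valdəmar"},
    {"ID": 2, "DOCULECT": "English", "CONCEPT": "Woldemort", "TRANSCRIPTION": "wɔldəmɔrt"},
    {"ID": 3, "DOCULECT": "Chinese", "CONCEPT": "Woldemort", "TRANSCRIPTION": "fu⁵¹ti⁵¹mɔ³⁵"},
    {"ID": 4, "DOCULECT": "Russian", "CONCEPT": "Woldemort", "TRANSCRIPTION": "vladimir"},
    {"ID": 10, "DOCULECT": "German", "CONCEPT": "Harry", "TRANSCRIPTION": "haralt"},
    {"ID": 11, "DOCULECT": "English", "CONCEPT": "Harry", "TRANSCRIPTION": "hæri"},
    {"ID": 12, "DOCULECT": "Russian", "CONCEPT": "Harry", "TRANSCRIPTION": "gali"},
]


def forms_by_concept(rows):
    """Map each concept to a {doculect: transcription} dictionary."""
    table = defaultdict(dict)
    for row in rows:
        table[row["CONCEPT"]][row["DOCULECT"]] = row["TRANSCRIPTION"]
    return dict(table)


comparison = forms_by_concept(WORDLIST)
```

Because every row carries its own doculect and concept, the same list can be regrouped by language, by concept, or by any additional column without changing the data itself.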
Increasing Comparability Applications Applications The CLDF Python API (Forkel et al. 2016) simplifies the handling and testing of datasets in CLDF format. With the help of the API and its extensions, scholars can test whether their data conforms to the format. With additional software, most of which is already available, one can also easily compute statistics from the data. Last but not least, tools which handle CLDF data can be used for automatic analysis (LingPy, List and Forkel 2016, http://lingpy.org) or for manual curation (EDICTOR, List 2017, http://edictor.digling.org). 17 / 37
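The sketch below does not use the CLDF Python API. It only illustrates, with the standard library, the kind of conformance test and summary statistics meant here. The required columns and the two rules checked (no empty cells, no duplicate IDs) are assumptions, and the sample rows contain deliberate errors.

```python
from collections import Counter

REQUIRED = ("ID", "DOCULECT", "CONCEPT", "TRANSCRIPTION")  # assumed columns


def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED:
            value = row.get(column)
            if value is None or not str(value).strip():
                problems.append(f"row {number}: missing {column}")
        row_id = row.get("ID")
        if row_id in seen_ids:
            problems.append(f"row {number}: duplicate ID {row_id}")
        seen_ids.add(row_id)
    return problems


def words_per_doculect(rows):
    """Count how many rows each doculect contributes."""
    return Counter(row["DOCULECT"] for row in rows if row.get("DOCULECT"))


# Sample rows with two deliberate errors (duplicate ID, empty form).
SAMPLE = [
    {"ID": "1", "DOCULECT": "German", "CONCEPT": "Harry", "TRANSCRIPTION": "haralt"},
    {"ID": "1", "DOCULECT": "English", "CONCEPT": "Harry", "TRANSCRIPTION": "hæri"},
    {"ID": "3", "DOCULECT": "Russian", "CONCEPT": "Harry", "TRANSCRIPTION": ""},
]

problems = check_rows(SAMPLE)
coverage = words_per_doculect(SAMPLE)
```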
Examples Concepticon Concepticon
Concepticon (List et al. 2016)
- link concept labels in published concept lists (questionnaires) to concept sets
- link concept sets to meta-data
- define relations between concept sets
- never link one concept in a given list to more than one concept set (guarantees consistency)
- provide an API to check the consistency of the data and to query the data
- provide a web-interface to browse through the data
21 / 37
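Examples Concepticon Concepticon
[Figure: concept labels from concept lists and concept sets. Label groups as they appear: STONE, EGG, FOOT; THE STONE, THE EGG, THE LEG; STONE (FRUIT), EGG (CHICKEN), FOOT/LEG; STONE, EGG, LEG, FOOT]
http://concepticon.clld.org 22 / 37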
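The consistency rule that a concept in a given list is never linked to more than one concept set can be checked mechanically. The following is a minimal sketch, not the Concepticon API: the list names (`list_a` and so on), the links themselves, and the function `ambiguous_links` are hypothetical, and the last link is a deliberate violation.

```python
from collections import defaultdict

# Hypothetical links: (concept list, concept label, concept set).
LINKS = [
    ("list_a", "STONE", "STONE"),
    ("list_b", "THE STONE", "STONE"),
    ("list_b", "THE EGG", "EGG"),
    ("list_c", "FOOT/LEG", "FOOT"),
    ("list_c", "FOOT/LEG", "LEG"),  # deliberate violation of the rule
]


def ambiguous_links(links):
    """Return every (list, label) pair that points to more than one concept set."""
    targets = defaultdict(set)
    for concept_list, label, concept_set in links:
        targets[(concept_list, label)].add(concept_set)
    return {key: sorted(sets) for key, sets in targets.items() if len(sets) > 1}


violations = ambiguous_links(LINKS)
```

Many labels may point to the same concept set (here both "STONE" and "THE STONE"), but each label in a list has at most one target, so the mapping from labels to sets stays a function.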
Examples CLICS CLICS Semantic change plays a crucial role in language change. Although most linguists assume that it proceeds according to certain general patterns, we currently lack the empirical basis to pursue the question in depth. Normally, semantic change proceeds by cumulation and reduction. 24 / 37
Examples CLICS CLICS
[Figure: cumulation and reduction in the history of German Kopf]
PHASE            FORM                                MEANING
MONOSEMY PHASE   Proto-Germanic *kuppa- (k u pː a)   “vessel”
  CUMULATION
POLYSEMY PHASE   Pre-German *kop (k ɔ p)             “head”, “vessel”
  REDUCTION
MONOSEMY PHASE   German Kopf (k ɔ p͡f)                “head”
24 / 37
Examples CLICS CLICS
[Screenshot of the CLICS web interface, colored by family] Concept "money" is part of a cluster with the central concept "fishscale" with a total of 10 nodes: silver, leather, fishscale, bark, coin, fur, snail, "skin, hide", money, shell.
49 links for "silver" and "money" (first 10 shown):
      Language                Family          Form
1.    Ignaciano               Arawakan        ne
2.    Aymara, Central         Aymaran         ḳulʸḳi
3.    Tsafiki                 Barbacoan       kaˈla
4.    Seselwa Creole French   Creole          larzan
5.    Miao, White             Hmong-Mien      nyiaj
6.    Breton                  Indo-European   arhant
7.    French                  Indo-European   argent
8.    Gaelic, Irish           Indo-European   airgead
9.    Welsh                   Indo-European   arian
10.   Cofán                   Isolate         koriΦĩʔdi
25 / 37
Examples CLICS CLICS
[Screenshot of the CLICS web interface, colored by geolocation] Concept "wheel" is part of a cluster with the central concept "leg" with a total of 11 nodes: "sphere, ball", round, footprint, foot, calf of leg, circle, thigh, wheel, leg, hip, buttocks.
6 links for "foot" and "wheel":
     Language    Family         Form
1.   Cofán       Isolate        c̷ɨʔtʰe
2.   Puinave     Isolate        sim
3.   Yaminahua   Panoan         taɨ
4.   Wayampi     Tupi           pɨ
5.   Pumé        Unclassified   taɔ
6.   Ninam       Yanomam        mãhuk
25 / 37
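The links shown in these networks connect two concepts whenever a language uses one and the same form for both. A minimal sketch of that counting step, assuming a word list of (language, concept, form) triples, could look as follows. This is not the CLICS implementation. The function name is hypothetical, the French, Welsh, Breton, Puinave and Wayampi forms are those listed on the slides, and the English forms are added only as a contrast case without a shared form.

```python
from collections import defaultdict
from itertools import combinations

# (language, concept, form) triples.
WORDS = [
    ("French", "silver", "argent"), ("French", "money", "argent"),
    ("Welsh", "silver", "arian"), ("Welsh", "money", "arian"),
    ("Breton", "silver", "arhant"), ("Breton", "money", "arhant"),
    ("Puinave", "foot", "sim"), ("Puinave", "wheel", "sim"),
    ("Wayampi", "foot", "pɨ"), ("Wayampi", "wheel", "pɨ"),
    ("English", "silver", "silver"), ("English", "money", "money"),
]


def shared_form_links(words):
    """Map each concept pair to the languages that use one form for both."""
    concepts = defaultdict(set)
    for language, concept, form in words:
        concepts[(language, form)].add(concept)
    links = defaultdict(set)
    for (language, _form), linked in concepts.items():
        for pair in combinations(sorted(linked), 2):
            links[pair].add(language)
    return {pair: sorted(languages) for pair, languages in links.items()}


links = shared_form_links(WORDS)
```

The number of languages per concept pair is what a line weight in the network view would be based on; counting language families instead of languages would only require replacing the language name with its family in the inner loop.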
Examples Cross-Linguistic Phonetic Alphabet Cross-Linguistic Phonetic Alphabet The use of the standards recommended by the IPA varies widely in linguistics. Experts on language families often have their own traditions, humans necessarily commit errors when transcribing data, technical confusions arise from the usage of lookalike symbols which do not share the same code point, and scholars interpret the IPA differently. Furthermore, the IPA does not offer recommendations for all aspects of transcription: morphological annotation, for example, is not included and varies greatly among scholars. 32 / 37
Examples Cross-Linguistic Phonetic Alphabet Cross-Linguistic Phonetic Alphabet
CLPA (List and Forkel, in prep.)
- define standards for phonetic representation
- provide meta-data for standardized sounds (feature matrices, etc.)
- provide an API for querying the data and checking the consistency of transcriptions with regard to CLPA
- provide solutions for scholars to convert their data to CLPA
- develop standards for phonotactic and morphological annotation
33 / 37
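A transcription check of this kind can be sketched as follows. This is not the CLPA API: the segment inventory is a toy assumption, the lookalike table contains only two well-known cases (the ASCII letter g versus the IPA script g, and the ASCII colon versus the IPA length mark), and the function name is hypothetical. Transcriptions are assumed to be space-segmented.

```python
import unicodedata

# Assumed toy inventory of accepted segments (not the CLPA inventory).
ACCEPTED = {"k", "ɔ", "p", "u", "a", "p\u0361f", "p\u02d0", "\u0261"}

# Lookalikes: visually similar characters with different code points.
LOOKALIKES = {
    "g": "\u0261",  # U+0067 LATIN SMALL LETTER G -> U+0261 LATIN SMALL LETTER SCRIPT G
    ":": "\u02d0",  # U+003A COLON -> U+02D0 MODIFIER LETTER TRIANGULAR COLON
}


def check_transcription(segments):
    """Classify each space-separated segment as ok, lookalike, or unknown."""
    report = []
    for segment in segments.split():
        normalized = unicodedata.normalize("NFC", segment)
        repaired = "".join(LOOKALIKES.get(char, char) for char in normalized)
        if repaired != normalized and repaired in ACCEPTED:
            report.append((segment, "lookalike", repaired))
        elif normalized in ACCEPTED:
            report.append((segment, "ok", normalized))
        else:
            report.append((segment, "unknown", None))
    return report


# The colon in the third segment is the ASCII colon, not the length mark.
report = check_transcription("k u p: a")
```

Reporting a suggested replacement instead of silently rewriting the input keeps the decision with the scholar, which matters when a lookalike was in fact intended.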