Slide 1

Slide 1 text

Annotation and Analysis of Cross-Linguistic Lexical Data in Historical Linguistics: Toward the Establishment of Standards and Best Practices. Johann-Mattis List, Research Group “Computer-Assisted Language Comparison”, Department of Linguistic and Cultural Evolution, Max-Planck Institute for the Science of Human History, Jena, Germany. 2017/07/07 1 / 36

Slide 2

Slide 2 text

Background Background data 2 / 36

Slide 7

Slide 7 text

Background Data Data in Linguistics Linguistics is a data-driven discipline: grammars, dictionaries, texts, etc. 3 / 36

Slide 12

Slide 12 text

Background Data Data in Comparative Linguistics Comparative linguistics is also a data-driven discipline: historical grammars, etymological dictionaries, annotated texts, etc. 4 / 36

Slide 16

Slide 16 text

Background Annotation Annotation in Linguistics Annotation of data is crucial in linguistics: inter-linear glossed text; “glossing” a word's meaning in the literature; etc. 5 / 36

Slide 20

Slide 20 text

Background Annotation Annotation in Comparative Linguistics Annotation of data is also crucial in comparative linguistics: indicating that words are cognate; using phonetic transcriptions to indicate how words sound; etc. 6 / 36

Slide 23

Slide 23 text

Background Analysis Analysis in Linguistics Analysis of data is crucial in linguistics: syntactic analysis in any syntactic framework (ideally with fewer than three examples and many asterisks); etc. 7 / 36

Slide 28

Slide 28 text

Background Analysis Analysis in Comparative Linguistics Analysis of data is also crucial in comparative linguistics: alignment analysis (show where words are cognate, i.e., in which segments they coincide); correspondence pattern analysis (show the most frequently corresponding sounds for a number of related languages); etymological analysis (show how the original meaning of a word evolved, etc.); etc. 8 / 36
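
As a rough illustration of correspondence pattern analysis, the following Python sketch counts which segments face each other in the columns of already aligned cognate words. The function name `count_correspondences` and the toy alignments (heavily simplified transcriptions of English and German cognates) are assumptions made for this example; they are not taken from the talk, and real analyses would start from properly aligned data.

```python
from collections import Counter


def count_correspondences(alignments):
    """Count segment pairs that share a column in pairwise alignments.

    Each alignment is a pair of equally long lists of segments.
    """
    counts = Counter()
    for word_a, word_b in alignments:
        if len(word_a) != len(word_b):
            raise ValueError("aligned words must have the same length")
        for seg_a, seg_b in zip(word_a, word_b):
            counts[(seg_a, seg_b)] += 1
    return counts


# Toy data: simplified transcriptions, for illustration only.
alignments = [
    (["h", "a", "n", "d"], ["h", "a", "n", "t"]),  # English "hand", German "Hand"
    (["d", "iː", "p"], ["t", "iː", "f"]),          # English "deep", German "tief"
]

correspondences = count_correspondences(alignments)
for (seg_a, seg_b), freq in correspondences.most_common():
    print(seg_a, seg_b, freq)
```

Keeping the alignment explicit (one list entry per column) is what makes the counts reproducible: a reader can see exactly which segments were compared.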

Slide 29

Slide 29 text

Problems Problems ! 9 / 36

Slide 30

Slide 30 text

Problems Availability Availability Linguists all over the world publish papers without sharing their data and analyses. SO SAD for science! #MakeLinguisticsGreatAgain #FakePubs 10 / 36

Slide 34

Slide 34 text

Problems Availability Availability Sharing is not yet considered economical in linguistic research: journals still accept too many papers that do not share the full material and code used for their analyses; if authors share, they often make it very difficult to use the data for testing, for example by converting data that was originally in tabular form into PDF, which may at times not even be searchable; even if authors share, they often forget to share their explicit annotations and analyses, leaving readers alone with the raw data. 11 / 36

Slide 35

Slide 35 text

Problems Transparency Transparency Taken from a blog by Bengtson (2017) at http://euskararenjatorria.net/?p=26071 12 / 36

Slide 39

Slide 39 text

Problems Transparency Transparency Annotations and analyses often lack transparency and cannot be understood in the raw form in which they are provided: words are declared cognate, but nobody really knows WHERE the words are supposed to be cognate; correspondences are labeled as “regular” without sufficient proof for this in either the texts or the data; sources are omitted, and readers are left alone with a bunch of word lists whose exact origin nobody knows. 13 / 36

Slide 40

Slide 40 text

Problems Comparability Comparability [Figure: a series of Chinese characters compared across the reconstruction systems of Karlgren (1954), Li (1971), Wáng (1980), Zhèngzhāng (2003), Pān (2000), Starostin (1989), Baxter and Sagart (2014), and Schuessler (2007); legend of reconstructed values: ?, a, e, i, o, u, æ, ɑ, ɔ, ə, ɯ, ʊ] Different reconstructions for Old Chinese, taken from List et al. (2017) 14 / 36

Slide 44

Slide 44 text

Problems Comparability Comparability It is extremely hard to compare different opinions in comparative linguistics: data sources are insufficiently reported; computational studies are based on different subsets of the data; analyses exclude or focus on specific parts of a given dataset. 15 / 36

Slide 46

Slide 46 text

Problems Summary Summary Although linguistics, and especially comparative linguistics, is an extremely data-driven discipline, we often still behave as if it were all about prose and discourse. In times when digital applications make the lives of scientists a lot easier, linguists need to find ways to transfer their analyses into the new age. This can only be done if we drastically increase the availability, transparency, and comparability of data in linguistics. 16 / 36

Slide 47

Slide 47 text

Cross-Linguistic Data Formats Cross-Linguistic Data Formats 17 / 36

Slide 51

Slide 51 text

Cross-Linguistic Data Formats General Ideas General Ideas The Cross-Linguistic Data Formats initiative (Forkel et al. 2016, http://cldf.clld.org) aims at increasing the comparability of cross-linguistic data and analyses. It comes along with: standardization efforts (linguistic meta-databases like Glottolog and Concepticon); software APIs which help to test whether data conforms to standards; and working examples for best practice. 18 / 36

Slide 52

Slide 52 text

Cross-Linguistic Data Formats General Ideas General Ideas As of now, a couple of software tools (LingPy, Beastling, EDICTOR) support CLDF to some degree. In the future, we hope that the number of users will increase, and that the community helps to develop the formats further. 19 / 36

Slide 57

Slide 57 text

Cross-Linguistic Data Formats Technical Aspects Technical Aspects See http://github.com/glottobank/cldf for details, discussions, and working examples. The format for the machine-readable specification is CSV with metadata in JSON, following the W3C’s Model for Tabular Data and Metadata on the Web (http://www.w3.org/TR/tabular-data-model/). The CLDF ontology builds and expands upon the General Ontology for Linguistic Description (GOLD). A pycldf API in Python is currently in preparation and will help users to evaluate whether their data conforms to CLDF. 20 / 36
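
To make the "CSV with metadata in JSON" idea more concrete, here is a minimal sketch using only the Python standard library. It is not the pycldf API. The metadata keys (`@context`, `url`, `tableSchema`, `columns`, `primaryKey`) come from the W3C tabular data vocabulary; the file name `forms.csv`, the helper `read_table`, the selection of columns, and the two rows are assumptions made for illustration.

```python
import csv
import io
import json

# Toy table; the identifiers are illustrative and should be verified
# against Glottolog and Concepticon before any real use.
csv_text = """ID,Language_ID,Parameter_ID,Value
1,stan1295,1277,Hand
2,stan1293,1277,hand
"""

# Simplified metadata description in the style of the W3C model.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "forms.csv",  # hypothetical file name
    "tableSchema": {
        "columns": [
            {"name": "ID", "datatype": "string"},
            {"name": "Language_ID", "datatype": "string"},
            {"name": "Parameter_ID", "datatype": "string"},
            {"name": "Value", "datatype": "string"},
        ],
        "primaryKey": "ID",
    },
}


def read_table(text, description):
    """Read CSV text and fail loudly if the header deviates from the metadata."""
    expected = [col["name"] for col in description["tableSchema"]["columns"]]
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != expected:
        raise ValueError(
            f"header {reader.fieldnames} does not match metadata {expected}"
        )
    return list(reader)


rows = read_table(csv_text, metadata)
print(json.dumps(metadata, indent=2))
print(rows[0]["Value"])
```

The point of the separate JSON description is that the expectations about the table are stated explicitly and can be checked by software, rather than being left to the reader.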

Slide 62

Slide 62 text

Cross-Linguistic Data Formats Technical Aspects The Zen of CLDF (following Robert Forkel) A lot of the guidelines put forward for Python code in The Zen of Python (https://www.python.org/dev/peps/pep-0020/) can also be used to characterize CLDF. In particular: Explicit is better than implicit. Readability counts. Errors should never pass silently (unless explicitly silenced). Simple is better than complex. In the face of ambiguity, refuse the temptation to guess. 21 / 36

Slide 63

Slide 63 text

Cross-Linguistic Data Formats Standards Standards "The language, the meaning, and the form of the linguistic sign are the triple pillars upon which CLDF rests. Together, they will restore comparability of linguistic data in the world." — Tommen Baratheon (King of the Iron Throne, † in 6x10) 22 / 36

Slide 66

Slide 66 text

Cross-Linguistic Data Formats Standards Standards use Glottolog (Hammarström et al. 2017) to refer to language varieties; use Concepticon (List et al. 2016) to refer to concepts; develop a cross-linguistic phonetic notation system to evaluate whether phonetic transcriptions for word forms conform to cross-linguistic standards. 23 / 36

Slide 67

Slide 67 text

Cross-Linguistic Data Formats Standards Standards: Concepticon Concepticon (List et al. 2016): link concept labels in published concept lists (questionnaires) to concept sets; link concept sets to meta-data; define relations between concept sets; never link one concept in a given list to more than one concept set (guarantees consistency); provide an API to check the consistency of the data and to query the data; provide a web-interface to browse through the data. 24 / 36
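
The rule that one concept in a given list is never linked to more than one concept set can be checked mechanically. The sketch below is not the Concepticon API; the function `find_ambiguous_links` and the sample links are hypothetical and only illustrate what such a consistency check might look like.

```python
from collections import defaultdict


def find_ambiguous_links(links):
    """Return concept labels that are linked to more than one concept set."""
    sets_by_label = defaultdict(set)
    for label, concept_set in links:
        sets_by_label[label].add(concept_set)
    return {
        label: sorted(concept_sets)
        for label, concept_sets in sets_by_label.items()
        if len(concept_sets) > 1
    }


# Hypothetical links from one concept list; the last entry is
# deliberately inconsistent.
links = [
    ("hand", "HAND"),
    ("arm", "ARM"),
    ("arm", "ARM OR HAND"),
]

problems = find_ambiguous_links(links)
print(problems)
```

An empty result means the list satisfies the one-label-one-set constraint; anything else points to links that need a decision by the annotator.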

Slide 68

Slide 68 text

Cross-Linguistic Data Formats Standards http://concepticon.clld.org 25 / 36

Slide 69

Slide 69 text

Cross-Linguistic Data Formats Standards Standards: Cross-Linguistic Phonetic Notations There are too many IPAs! 26 / 36

Slide 71

Slide 71 text

Cross-Linguistic Data Formats Standards Standards: Cross-Linguistic Phonetic Notations Cross-Linguistic Phonetic Notations (in prep.): normalize ambiguities of IPA; establish a pool of phonetic segments which are linked to other datasets (Phoible, Moran et al. 2014; PBase, Mielke 2008; etc.); provide or link to feature systems or additional kinds of metadata; provide generic transformation tables for frequently used phonetic transcription systems; provide an API to check the consistency of a given transcription. 27 / 36
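
A transformation table of the kind mentioned above could, in its simplest form, be a character mapping applied after Unicode normalization. The following sketch is an assumption about how this might look, not the CLPN implementation: the table `NORMALIZE` contains three well-known look-alike cases, and the function `normalize` is hypothetical.

```python
import unicodedata

# Hypothetical transformation table for common IPA look-alikes.
NORMALIZE = {
    "g": "\u0261",       # ASCII g -> IPA script g (U+0261)
    ":": "\u02d0",       # colon -> IPA length mark (U+02D0)
    "\u02a6": "ts",      # ts ligature (U+02A6) -> plain letter sequence
}


def normalize(transcription):
    """Decompose the string and replace known ambiguous characters."""
    decomposed = unicodedata.normalize("NFD", transcription)
    return "".join(NORMALIZE.get(char, char) for char in decomposed)


normalized = normalize("ga:t")
print(normalized)
print([f"U+{ord(char):04X}" for char in normalized])
```

Printing the code points is useful here because several of the affected characters are visually indistinguishable from their replacements.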

Slide 72

Slide 72 text

Cross-Linguistic Data Formats Standards GLD to CLPN https://github.com/LinguList/Hmong-Mien_Language_family/ 28 / 36

Slide 73

Slide 73 text

Word Lists in CLDF WORDLIST DATA Word Lists in CLDF 29 / 36

Slide 79

Slide 79 text

Word Lists in CLDF Basic Aspects Basic Aspects ID (unique identifier for each word); Language and Language_ID (Glottolog); Concept and Parameter_ID (Concepticon); Value, Form, and Segments (CLPN); Source (key in a BibTeX file); Comment (free text). 30 / 36
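
The fields listed above lend themselves to simple automatic checks. The sketch below treats every field except Comment as required and tests that IDs are unique; both the choice of required fields and the function `validate_wordlist` are assumptions for illustration, and the rows (including the source key `Example2017`) are made up.

```python
REQUIRED = [
    "ID", "Language_ID", "Parameter_ID", "Value", "Form", "Segments", "Source",
]


def validate_wordlist(rows):
    """Return a list of human-readable problems found in the rows."""
    errors = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        # Report missing or empty required fields instead of guessing values.
        for column in REQUIRED:
            if not row.get(column):
                errors.append(f"row {number}: missing {column}")
        word_id = row.get("ID")
        if word_id and word_id in seen_ids:
            errors.append(f"row {number}: duplicate ID {word_id}")
        seen_ids.add(word_id)
    return errors


# Made-up rows; the second one has an empty Source and reuses ID 1.
rows = [
    {"ID": "1", "Language_ID": "stan1295", "Parameter_ID": "1277",
     "Value": "Hand", "Form": "hant", "Segments": "h a n t",
     "Source": "Example2017", "Comment": ""},
    {"ID": "1", "Language_ID": "stan1293", "Parameter_ID": "1277",
     "Value": "hand", "Form": "hand", "Segments": "h a n d",
     "Source": "", "Comment": "toy example"},
]

errors = validate_wordlist(rows)
for error in errors:
    print(error)
```

Collecting all problems in a list, rather than stopping at the first one, gives the annotator a complete picture of what has to be fixed.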

Slide 80

Slide 80 text

Word Lists in CLDF Etymological Annotation Etymological Annotation [Figure: diagram of etymological annotation. Node labels: Concept, Cognate Set, Alignment, Morphemes, Slice, Form, Value, Segments, ID, Language. Group labels: MEANING, RELATION, FORM, LANGUAGE. Legend entries: A, B, requires, important, language-internal, language-external.] 31 / 36

Slide 87

Slide 87 text

Word Lists in CLDF Examples Examples: The Etymological Dictionary Editor: web-based tool for the annotation of etymological data; available online at http://edictor.digling.org; description available in List (2017); works client-side and offline; allows for morpheme annotation, alignment analysis, cognate annotation, and phonological analysis; serves as a litmus test for the expressive power of wordlists in CLDF. 32 / 36

Slide 88

Slide 88 text

Word Lists in CLDF Examples http://edictor.digling.org 33 / 36

Slide 89

Slide 89 text

Outlook Outlook 34 / 36

Slide 90

Slide 90 text

35 / 36

Slide 93

Slide 93 text

Thank you for your attention! 36 / 36