Toward the Establishment of Standards and Best Practices
Johann-Mattis List
Research Group “Computer-Assisted Language Comparison”
Department of Linguistic and Cultural Evolution
Max Planck Institute for the Science of Human History
Jena, Germany
2017/07/07
also crucial in comparative linguistics:
- alignment analysis (show where words are cognate, i.e., in which segments they coincide)
- correspondence pattern analysis (show the most frequently corresponding sounds for a number of related languages)
- etymological analysis (show how the original meaning of a word evolved, etc.)
- etc.
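To make the first two notions concrete, here is a minimal toy sketch in Python: an alignment lines up assumed cognate words segment by segment, and reading the columns of that alignment yields candidate sound correspondences. Segmentations and gap placement are simplified for illustration and not taken from any published dataset.

    # Toy alignment of three words assumed to be cognate; "-" marks a gap
    # (no corresponding segment). Segmentations are simplified for the example.
    alignment = {
        "German 'Zahn'":   ["ts", "aː", "n", "-"],
        "English 'tooth'": ["t",  "uː", "-", "θ"],
        "Dutch 'tand'":    ["t",  "ɑ",  "n", "t"],
    }

    # A correspondence pattern analysis reads the alignment column-wise:
    # every column is a candidate sound correspondence across the languages.
    for i, column in enumerate(zip(*alignment.values()), start=1):
        print("column", i, ":", " ~ ".join(column))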
in linguistic research:
- journals still accept too many papers that do not share the full material and code used for the analyses
- when authors do share, they often make it very difficult to reuse the data for testing, e.g., by converting data that was originally in tabular form into PDF, which at times is not even searchable
- even when authors share, they often forget to include their explicit annotations and analyses, leaving readers alone with the raw data
cannot be understood in the raw form in which they are provided:
- words are declared as cognate, but nobody really knows WHERE the words are supposed to be cognate
- correspondences are labeled as “regular” without sufficient proof being given, either in the texts or in the data
- sources are omitted, and readers are left with a bunch of word lists of which nobody knows exactly where they were taken from
[Figure: different reconstructions for Old Chinese (Zhèngzhāng 2003; Pān 2000; Starostin 1989; Baxter and Sagart 2014; Schuessler 2007), taken from List et al. (2017)]
opinions in comparative linguistics:
- data sources are insufficiently reported
- computational studies are based on different subsets of the data
- analyses exclude or focus on specific parts of a given dataset
an extremely data-driven discipline, we often still behave as if it were all about prose and discourse. In times when digital applications make the lives of scientists a lot easier, linguists need to find ways to transfer their analyses into the new age. This can only be done if we drastically increase the availability, transparency, and comparability of data in linguistics.
The Cross-Linguistic Data Formats (CLDF) initiative (Forkel et al. 2016, http://cldf.clld.org) aims at increasing the comparability of cross-linguistic data and analyses. It comes along with:
- standardization efforts (linguistic meta-databases like Glottolog and Concepticon),
- software APIs which help to test whether data conforms to the standards, and
- working examples of best practice.
a couple of software tools (LingPy, Beastling, EDICTOR) support CLDF to some degree. In the future, we hope that the number of users will increase and that the community will help to develop the formats further.
details, discussions, and working examples.
- The format for machine-readable specifications is CSV, with metadata in JSON, following the W3C’s Model for Tabular Data and Metadata on the Web (http://www.w3.org/TR/tabular-data-model/).
- The CLDF ontology builds and expands upon the General Ontology for Linguistic Description (GOLD).
- A pycldf API in Python is currently in preparation and will help users to evaluate whether their data conforms to CLDF.
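Since pycldf was still in preparation at the time of the talk, the following standard-library sketch only illustrates the kind of conformance check such an API would automate; the file names (Wordlist-metadata.json, forms.csv) and the single-table metadata layout are assumptions made for the example.

    import csv
    import json

    # Assumed file names for the sketch: a JSON metadata file following the
    # W3C model for tabular data, describing a single CSV table.
    METADATA = "Wordlist-metadata.json"
    TABLE = "forms.csv"

    with open(METADATA, encoding="utf-8") as f:
        metadata = json.load(f)

    # Column names declared in the table schema of the metadata file.
    declared = [col["name"] for col in metadata["tableSchema"]["columns"]]

    # Column names actually present in the CSV header.
    with open(TABLE, encoding="utf-8", newline="") as f:
        header = next(csv.reader(f))

    missing = set(declared) - set(header)
    print("data conforms" if not missing else "missing columns: %s" % sorted(missing))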
(Robert Forkel) A lot of the guidelines put forward for Python code in The Zen of Python (https://www.python.org/dev/peps/pep-0020/) can also be used to characterize CLDF, in particular:
- Explicit is better than implicit.
- Readability counts.
- Errors should never pass silently (unless explicitly silenced).
- Simple is better than complex.
- In the face of ambiguity, refuse the temptation to guess.
“… the form of the linguistic sign are the triple pillars upon which CLDF rests. Together, they will restore comparability of linguistic data in the world.”
— Tommen Baratheon (King of the Iron Throne, † in 6x10)
- use Glottolog (Hammarström et al. 2017) to refer to language varieties
- use Concepticon (List et al. 2016) to refer to concepts
- develop a cross-linguistic phonetic notation system to evaluate whether phonetic transcriptions for word forms conform to cross-linguistic standards
Concepticon (List et al. 2016):
- link concept labels in published concept lists (questionnaires) to concept sets
- link concept sets to meta-data
- define relations between concept sets
- never link one concept in a given list to more than one concept set (guarantees consistency; see the sketch after this list)
- provide an API to check the consistency of the data and to query the data
- provide a web interface to browse through the data
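The one-link-per-concept rule can be checked in a few lines. The following is a generic sketch over an invented list of (label, concept set) links, not the actual Concepticon API.

    from collections import defaultdict

    # Invented links of a concept list: (label in the list, linked concept set).
    # Each label may be linked to at most ONE concept set.
    links = [
        ("hand", "HAND"),
        ("arm or hand", "ARM"),
        ("arm or hand", "HAND"),  # violates the rule and should be flagged
    ]

    linked_sets = defaultdict(set)
    for label, concept_set in links:
        linked_sets[label].add(concept_set)

    for label, sets in sorted(linked_sets.items()):
        if len(sets) > 1:
            print("inconsistent link: %r -> %s" % (label, sorted(sets)))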
Cross-Linguistic Phonetic Notations (in prep.):
- normalize ambiguities of the IPA
- establish a pool of phonetic segments which are linked to other datasets (Phoible, Moran et al. 2014; PBase, Mielke 2008; etc.)
- provide or link to feature systems or additional kinds of metadata
- provide generic transformation tables for frequently used phonetic transcription systems
- provide an API to check the consistency of a given transcription (see the sketch after this list)
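As a rough illustration of the first and the last point (a toy sketch with invented tables, not the system in preparation): normalize a few well-known IPA look-alikes and then check every segment of a transcription against a pool of accepted segments.

    # Toy normalization table for a few frequent look-alikes (e.g. Latin "g"
    # vs. IPA "ɡ", colon vs. length mark "ː"); the real system would cover far
    # more cases and link the segment pool to datasets such as Phoible.
    NORMALIZE = {"g": "ɡ", ":": "ː"}

    # Invented, tiny pool of accepted segments for the sketch.
    SEGMENTS = {"ɡ", "a", "aː", "uː", "n", "t", "ts", "θ"}

    def check(transcription):
        """Normalize a space-segmented transcription and flag unknown segments."""
        segments = ["".join(NORMALIZE.get(ch, ch) for ch in token)
                    for token in transcription.split()]
        unknown = [s for s in segments if s not in SEGMENTS]
        return segments, unknown

    print(check("g a: n"))     # (['ɡ', 'aː', 'n'], [])
    print(check("ts a: n x"))  # 'x' is reported as unknown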
- ID (identifier for each word)
- Language and Language_ID (Glottolog)
- Concept and Parameter_ID (Concepticon)
- Value, Form, and Segments (CLPN)
- Source (key of a BibTeX file)
- Comment (free text)
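A minimal sketch of what a single row with these columns might look like; all values, including the Glottolog and Concepticon identifiers and the BibTeX key, are given purely for illustration.

    import csv
    import io

    # One invented example row using the columns listed above.
    row = {
        "ID": "1",
        "Language": "German",
        "Language_ID": "stan1295",   # Glottolog code, here for illustration
        "Concept": "hand",
        "Parameter_ID": "1277",      # assumed Concepticon concept set ID
        "Value": "Hand",
        "Form": "hant",
        "Segments": "h a n t",
        "Source": "Meier2005",       # key of an invented BibTeX entry
        "Comment": "",
    }

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    print(buffer.getvalue())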
[Figure: schematic data model linking FORM (ID, Value, Form, Segments, Morphemes, Slice, Alignment, Set), MEANING, RELATION (language-internal, language-external), and LANGUAGE (Language)]
EDICTOR: a web-based tool for the annotation of etymological data
- available online at http://edictor.digling.org
- description available in List (2017)
- works client-side and offline
- allows for morpheme annotation, alignment analysis, cognate annotation, and phonological analysis
- serves as a litmus test for the expressive power of wordlists in CLDF
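To give an impression of the kind of annotation EDICTOR targets, here is a toy sketch that reuses the 'tooth' example from above: a wordlist extended with cognate-set and alignment columns, grouped and printed block-wise, roughly the view an alignment editor offers. Column names and values are illustrative only, not a specification of EDICTOR's input format.

    from itertools import groupby

    # Invented annotated wordlist rows:
    # (ID, Language, Concept, Segments, CognateSet, Alignment)
    rows = [
        ("1", "German",  "tooth", "ts aː n", "tooth-1", "ts aː n -"),
        ("2", "English", "tooth", "t uː θ",  "tooth-1", "t uː - θ"),
        ("3", "Dutch",   "tooth", "t ɑ n t", "tooth-1", "t ɑ n t"),
    ]

    # Group by cognate set and print each alignment block.
    for cogid, block in groupby(sorted(rows, key=lambda r: r[4]), key=lambda r: r[4]):
        print("Cognate set", cogid)
        for _, language, _, _, _, alignment in block:
            print("  %-8s %s" % (language, alignment))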