Towards a Database of Morpheme-Annotated Wordlists
Talk held at the 24th ICHL conference at the ARC Centre of Excellence for the Dynamics of Language (Australian National University, Canberra, 2019/07/04).
contributions by A. Tjuka, Y.-F. Lai, and J.-M. List
Max Planck Institute for the Science of Human History
Department of Linguistic and Cultural Evolution
CALC Project
July 4th, 2019
basic feature of human language (Zeige 2015).
Language consists of re-combinable elements.
This entails an unlimited number of expressions from a limited number of elements.
Different words may, therefore, share some of their morphemes.
with the same words (François 2008):
Figure: Segment of 141 colexifications of skin ∼ bark in the CLICS2 database (List et al. 2018a, graphic by A. Tjuka).
to families of related words, and different meanings might be expressed with the same morphemes:
Figure: A bipartite partial colexification network (Hill and List 2017) generated with the EDICTOR tool (List 2017), showing 4 fully or partially colexified concepts in English.
There are already many databases on linguistic data, e.g.
borrowing (WOLD)
cognates (ABVD)
semantic shifts (DatSemShift)
language structures (WALS)
...
yet no cross-linguistic databases on morphological data.
Wordlists
Wordlists are lists of concept-form pairs.
They have been used for language comparison since at least the 18th century (e.g. Leibniz 1768).
They were popularized by Morris Swadesh in the 1950s (Campbell 2013).
As a standard method in fieldwork, they are available for a very large number of languages (List et al. 2019).
Sometimes they are the only data available.

Concept  Form
all      mumu
ashes    kuy-ham
bark     naka
belly    ȼek
big      mɨha
black    yɨk
blood    nɨˀpin
bone     pak
burn     neˀm-

Excerpt from a wordlist with Zoque data (Swadesh 1954).
Morpheme-Segmented Wordlists

ID   DOCULECT  CONCEPT     FORM         TOKENS              SEGM-TOKENS             MORPHEMES
339  German    spider      Spinne       ʃ p ɪ n ə           ʃ p ɪ n + ə             SPIN _e-suff
341  German    spider web  Spinnwebe    ʃ p ɪ n v eː b ə    ʃ p ɪ n + v eː b + ə    SPIN WEAVE _e-suff
342  German    spider web  Spinnennetz  ʃ p ɪ n ə n n ɛ ts  ʃ p ɪ n + ə n + n ɛ ts  SPIN _en-fuge NET
753  German    spin        spinnen      ʃ p ɪ n ə n         ʃ p ɪ n + ə n           SPIN _inf

Data based on the Intercontinental Dictionary Series (Key and Comrie 2016).
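The relation between the SEGM-TOKENS and MORPHEMES columns can be made explicit with a small sketch. It assumes that morpheme borders are written as `+` between space-separated sound segments and that MORPHEMES holds one space-separated gloss per morpheme; the helper names `split_morphemes` and `align` are hypothetical and not part of any existing tool.

```python
# Rows copied from the German excerpt: (ID, CONCEPT, FORM, SEGM-TOKENS, MORPHEMES).
ROWS = [
    (339, "spider", "Spinne", "ʃ p ɪ n + ə", "SPIN _e-suff"),
    (341, "spider web", "Spinnwebe", "ʃ p ɪ n + v eː b + ə", "SPIN WEAVE _e-suff"),
    (342, "spider web", "Spinnennetz", "ʃ p ɪ n + ə n + n ɛ ts", "SPIN _en-fuge NET"),
    (753, "spin", "spinnen", "ʃ p ɪ n + ə n", "SPIN _inf"),
]


def split_morphemes(segm_tokens):
    """Split a SEGM-TOKENS string at '+' into one list of segments per morpheme."""
    return [part.split() for part in segm_tokens.split("+")]


def align(segm_tokens, morphemes):
    """Pair each gloss with the segments of its morpheme.

    Raises ValueError if the number of glosses and morphemes differ,
    which would indicate an inconsistent annotation.
    """
    parts = split_morphemes(segm_tokens)
    glosses = morphemes.split()
    if len(parts) != len(glosses):
        raise ValueError(
            f"{len(parts)} morphemes but {len(glosses)} glosses in {segm_tokens!r}"
        )
    return list(zip(glosses, parts))


# Map each row ID to its list of (gloss, segments) pairs.
aligned = {row_id: align(segm, glosses) for row_id, _, _, segm, glosses in ROWS}
```

A check of this kind could serve as a basic consistency test when new annotations are contributed, since every morpheme border should correspond to exactly one additional gloss.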
Automated Morpheme Segmentation
Morphemes (List 2019) are recurring combinations of form and meaning and abstractions of relations within the lexicon. They reflect language history and are often bound to phonotactic restrictions, and they are sometimes marked orthographically (space, dash, different character).
Many approaches search only for recurring letter strings.
The quality of an approach depends on the language and the amount of data.
There is no standard for testing new methods.
A database of morpheme-segmented wordlists could be used as a gold standard for testing purposes.
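How a gold standard could be used for testing can be sketched as follows. The example compares the morpheme borders proposed by some method with the borders in a gold annotation and reports precision, recall, and F-score. This is one common way of scoring segmentations, not a measure prescribed by the talk; the function names are hypothetical, and scoring an empty border set as 1.0 is an assumption made here for convenience.

```python
def boundaries(segmented):
    """Return the positions of '+' borders, counted in sound segments.

    'ʃ p ɪ n + ə' has one border after the fourth segment, so the result is {4}.
    """
    positions = set()
    count = 0
    for token in segmented.split():
        if token == "+":
            positions.add(count)
        else:
            count += 1
    return positions


def boundary_scores(gold, predicted):
    """Return (precision, recall, F-score) of predicted borders against gold borders."""
    gold_set = boundaries(gold)
    predicted_set = boundaries(predicted)
    hits = len(gold_set & predicted_set)
    # Assumption: an empty set of borders counts as perfectly precise or recalled.
    precision = hits / len(predicted_set) if predicted_set else 1.0
    recall = hits / len(gold_set) if gold_set else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Gold annotation for German 'Spinnwebe' and a hypothetical prediction
# that misses the border before the final suffix.
gold = "ʃ p ɪ n + v eː b + ə"
predicted = "ʃ p ɪ n + v eː b ə"
precision, recall, f_score = boundary_scores(gold, predicted)
```

Here the prediction finds one of the two gold borders and proposes no wrong ones, so precision is 1.0 and recall is 0.5.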
Automated Cognate Detection II
Automated cognate detection is a standard method in computational historical linguistics (List et al. 2017).
It is used to analyze large amounts of language data.
One popular algorithm for it is LexStat.
Like the traditional comparative method, it is based on detecting regular sound correspondences.
But partial cognacy can seriously hamper results.
→ Solution: A model of historical relations between words.
First step: A database of morphologically annotated wordlists.
Studies on Cross-Linguistic Promiscuity
Semantic promiscuity: morphological productivity of concepts (List 2018).

fallen     TO FALL
Fall       CASE (EVENT)
Fall       CASE (JURISTIC)
fällen     TO FELL A TREE
Fall       FALL
fällen     TO KILL IN BATTLE
abfallen   TO DROP OFF
Abfall     GARBAGE
abfallen   TO SLOPE
abfallen   TO TURN AWAY FROM AN IDEOLOGY
befallen   TO INFEST
Beifall    APPLAUSE
Falle      TRAP
einfallen  TO COLLAPSE
einfallen  TO INVADE
einfallen  TO COME TO MIND
Einfall    IDEA
fällig     DUE
gefällig   COMPLIANT

‘To fall’, a concept with high promiscuity (Schweikhard 2018).
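With morpheme glosses available, a rough promiscuity count becomes a simple tally. The sketch below counts, for each gloss, the number of distinct concepts whose words contain it, using the morpheme-annotated German rows from the Intercontinental Dictionary Series excerpt. This is only one possible operationalization and is not claimed to be the measure used in List (2018); the function name is hypothetical.

```python
from collections import defaultdict

# (CONCEPT, MORPHEMES) pairs from the morpheme-annotated German excerpt.
ANNOTATED = [
    ("spider", "SPIN _e-suff"),
    ("spider web", "SPIN WEAVE _e-suff"),
    ("spider web", "SPIN _en-fuge NET"),
    ("spin", "SPIN _inf"),
]


def concepts_per_morpheme(annotated):
    """Map each morpheme gloss to the number of distinct concepts it occurs in."""
    concepts = defaultdict(set)
    for concept, glosses in annotated:
        for gloss in glosses.split():
            concepts[gloss].add(concept)
    return {gloss: len(found) for gloss, found in concepts.items()}


promiscuity = concepts_per_morpheme(ANNOTATED)
```

In this small sample, SPIN recurs in three concepts (spider, spider web, spin), while WEAVE and NET each occur in one. Grammatical markers such as the glosses starting with an underscore would probably need to be filtered out in a real study.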
Partial Colexifications
According to Urban (2011), the synchronic structure of word families can point to diachronic semantic changes, e.g.:
16 languages fully colexify ‘mouth’ and ‘lip’.
31 languages partially colexify ‘mouth’ and ‘lip’.
These 31 languages use a simplex for ‘mouth’ and a complex word or phrase for ‘lip’, like Kanuri cî ‘mouth’, kâ cî-bè ‘lip’, literally ‘stick mouth-of’.
Therefore ‘lip’ can be semantically derived from ‘mouth’.
This is confirmed by Sanskrit lapana- ‘mouth’ → Sindhī lavan-a ‘lip’.
This could be tested on a database of morphologically annotated wordlists.
CLDF Initiative
The CLDF initiative (Forkel et al. 2018) tries to make linguistic data FAIR (Wilkinson et al. 2016):
Findable
Accessible
Interoperable
Reusable
This involves four steps (Forkel et al. 2018): aggregating datasets, specifying the sources, including cross-links between datasets and reference catalogues, cleaning the data and presenting it in a TSV table format containing one row for each word form and one column for each type of annotation, and segmenting the forms by phonemes and morphemes.
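The table layout described above, one row per word form and one column per annotation type, can be illustrated with Python's standard `csv` module. This is a minimal sketch and not the official CLDF tooling; the column names and the two rows are copied from the Sino-Tibetan excerpt shown later in the talk, and the helper names are hypothetical.

```python
import csv
import io

COLUMNS = ["ID", "Language_ID", "Parameter_ID", "Value", "Form", "Segments", "Source"]
ROWS = [
    ["Tibetan_Old_Tibetan-1741-1", "Tibetan_Old_Tibetan", "1741",
     "steŋ", "steŋ", "s t e ŋ", "Huang1992"],
    ["rGyalrong_Japhug-1741-1", "rGyalrong_Japhug", "1741",
     "ɯ-taʁ", "ɯ-taʁ", "ɯ + t a ʁ", "Jacques2015b"],
]


def write_tsv(columns, rows):
    """Serialize a header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def read_tsv(text):
    """Parse tab-separated text into one dictionary per word form."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


table = read_tsv(write_tsv(COLUMNS, ROWS))
# The Segments column carries both levels of segmentation:
# spaces separate sounds, '+' separates morphemes.
japhug_morphemes = [part.split() for part in table[1]["Segments"].split("+")]
```

Keeping phoneme and morpheme segmentation in a single column means that a reader of the table needs no additional annotation layer to recover morpheme borders.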
Standards
Our database will follow these standards:
Only synchronically transparent morpheme borders are annotated.
Morpheme borders are marked consistently with a plus sign.
Cognate morphemes are annotated via glosses in order to distinguish homophones and link allomorphs.
Example annotations will be provided as guidelines for contributors.
A Morpheme-Segmented Wordlist

ID                          Language_ID          Parameter_ID  Value       Form        Segments           Source
Tibetan_Old_Tibetan-1741-1  Tibetan_Old_Tibetan  1741          steŋ        steŋ        s t e ŋ            Huang1992
rGyalrong_Japhug-1741-1     rGyalrong_Japhug     1741          ɯ-taʁ       ɯ-taʁ       ɯ + t a ʁ          Jacques2015b
Tibetan_Old_Tibetan-98-1    Tibetan_Old_Tibetan  98            tʰams.tɕad  tʰams.tɕad  tʰ a m s + tɕ a d  Huang1992
Kiranti_Khaling-98-1        Kiranti_Khaling      98            kʰøle       kʰøle       kʰ ø l e           Jacques2017FN
Kiranti_Limbu-98-1          Kiranti_Limbu        98            kak         kak         k a k              Jacques2017FN
rGyalrong_Japhug-98-1       rGyalrong_Japhug     98            %tʰamtɕɤt   %tʰamtɕɤt   tʰ a m tɕ ɤ t      Jacques2015b
Tangut-98-1                 Tangut               98            zji¹        zji¹        z j i ¹            Li1997
Tibetan_Old_Tibetan-1292-1  Tibetan_Old_Tibetan  1292          ŋan         ŋan         ŋ a n              Huang1992
rGyalrong_Japhug-1292-1     rGyalrong_Japhug     1292          %ŋɤn        %ŋɤn        ŋ ɤ n              Jacques2015b
Tibetan_Old_Tibetan-1422-1  Tibetan_Old_Tibetan  1422          gson+po     gson+po     g s o n + p o      Huang1992

Excerpt from the Sino-Tibetan Database of Lexical Cognates (Sagart et al. 2019).
Case Studies on Partial Colexification

Language       Concept  Source Form
Abui           tree     bata
Abui           skin     kul
Abui           bark     bata kul
Nung-Fengshan  tree     fai
Nung-Fengshan  skin     naŋ
Nung-Fengshan  bark     naŋ fai

Concepts from the CLICS2 database (List et al. 2018a, table by A. Tjuka).
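The pattern in this table can be found mechanically. The sketch below reports a partial colexification whenever every part of a shorter form recurs among the parts of a longer form in the same language. It assumes, purely for illustration, that spaces in the source forms mark the relevant borders; a morpheme-annotated wordlist would use the annotated borders and glosses instead. The function name is hypothetical.

```python
from collections import defaultdict

# (Language, Concept, Source Form) rows copied from the table.
ROWS = [
    ("Abui", "tree", "bata"),
    ("Abui", "skin", "kul"),
    ("Abui", "bark", "bata kul"),
    ("Nung-Fengshan", "tree", "fai"),
    ("Nung-Fengshan", "skin", "naŋ"),
    ("Nung-Fengshan", "bark", "naŋ fai"),
]


def partial_colexifications(rows):
    """Return (language, contained concept, containing concept) triples.

    A triple is reported when all parts of one form occur among the parts
    of a longer form in the same language. Parts are split at spaces only.
    """
    by_language = defaultdict(list)
    for language, concept, form in rows:
        by_language[language].append((concept, form.split()))

    found = []
    for language, entries in by_language.items():
        for concept_a, parts_a in entries:
            for concept_b, parts_b in entries:
                if (
                    concept_a != concept_b
                    and len(parts_a) < len(parts_b)
                    and all(part in parts_b for part in parts_a)
                ):
                    found.append((language, concept_a, concept_b))
    return found


links = partial_colexifications(ROWS)
```

For both languages, the sketch links ‘tree’ and ‘skin’ to ‘bark’, even though the order of the two elements differs between Abui and Nung-Fengshan.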
free access and at least 40 languages on first release, including Sino-Tibetan, Dogon (ongoing collaboration by A. Hantgan and J.-M. List), Tukanoan (ongoing collaboration by T. Chacon, T. Tresoldi, and J.-M. List), and Germanic languages (ongoing work by N. E. Schweikhard), based on pre-compiled wordlists, pre-segmented wordlists, and expert judgments.
in automated cognate detection, testing morpheme detection methods, and developing standards of linguistic annotation and allows for quantitative studies on word family structure, partial colexification, and semantic promiscuity in word formation.
members:
Dr. Johann-Mattis List (Group leader)
Dr. Yunfan Lai (Post-Doc)
Dr. Tiago Tresoldi (Post-Doc)
Mei-Shin Wu (Doctorate student)
Nathanael E. Schweikhard (Doctorate student)
Associated: Annika Tjuka