Towards a Database of Morpheme-Annotated Wordlists

N. E. Schweikhard, with contributions by A. Tjuka, Y.-F. Lai, and J.-M. List
Max Planck Institute for the Science of Human History, Department of Linguistic and Cultural Evolution, CALC Project
July 4th, 2019 1 / 29

Table of Contents
1 Compositionality of Words: Word Formation, Colexifications, Word Families
2 Towards a Database of Morpheme-Segmented Wordlists: General Idea, Use Cases, Aggregation Strategy, Current State
3 Outlook
2 / 29

Compositionality of Words 3 / 29

Compositionality of Words Word Formation Word Formation Compositionality is a
basic feature of human language (Zeige 2015). Language consists of re-combinable elements. This entails an unlimited number of expressions built from a limited number of elements. Different words may, therefore, share some of their morphemes. 4 / 29

Compositionality of Words Colexifications Colexifications Different meanings might be expressed
with the same words (François 2008): Segment of 141 colexifications of skin ∼ bark in the CLICS2 database (List et al. 2018a, graphic by A. Tjuka). 5 / 29

Compositionality of Words Word Families Word Families Word formation leads
to families of related words, and different meanings might be expressed with the same morphemes: A bipartite partial colexification network (Hill and List 2017) generated with the EDICTOR tool (List 2017), showing 4 fully or partially colexified concepts in English. 6 / 29

Towards a Database of Morpheme-Segmented Wordlists 7 / 29

Towards a Database of Morpheme-Segmented Wordlists General Idea General Idea
There are already many databases on linguistic data, e.g. on borrowing (WOLD), cognates (ABVD), semantic shifts (DatSemShift), and language structures (WALS), yet there are no cross-linguistic databases on morphological data. 8 / 29

Towards a Database of Morpheme-Segmented Wordlists General Idea General Idea:
Morpheme-Segmented Lexical Database WOLD, a database of borrowings in wordlists of basic vocabulary (Haspelmath and Tadmor 2008). 9 / 29

Wordlists Wordlists are lists of concept-form pairs. They have been used for language comparison since at least the 18th century (e.g. Leibniz 1768) and were popularized by Morris Swadesh in the 1950s (Campbell 2013). As a standard method in fieldwork, they are available for a very large number of languages (List et al. 2019). Sometimes they are the only data available.
Concept | Form
all | mumu
ashes | kuy-ham
bark | naka
belly | ȼek
big | mɨha
black | yɨk
blood | nɨˀpin
bone | pak
burn | neˀm-
Excerpt from a wordlist with Zoque data (Swadesh 1954). 10 / 29

Morpheme-Segmented Wordlists
ID | DOCULECT | CONCEPT | FORM | TOKENS | SEGM-TOKENS | MORPHEMES
339 | German | spider | Spinne | ʃ p ɪ n ə | ʃ p ɪ n + ə | SPIN _e-suff
341 | German | spider web | Spinnwebe | ʃ p ɪ n v eː b ə | ʃ p ɪ n + v eː b + ə | SPIN WEAVE _e-suff
342 | German | spider web | Spinnennetz | ʃ p ɪ n ə n n ɛ ts | ʃ p ɪ n + ə n + n ɛ ts | SPIN _en-fuge NET
753 | German | spin | spinnen | ʃ p ɪ n ə n | ʃ p ɪ n + ə n | SPIN _inf
Data based on the Intercontinental Dictionary Series (Key and Comrie 2016). 11 / 29
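To illustrate how the SEGM-TOKENS and MORPHEMES columns relate, the following minimal sketch splits a segmented form at each plus sign and pairs every resulting morpheme with its gloss. The function names `split_morphemes` and `align_glosses` are hypothetical helpers introduced only for this example; they are not part of any existing tool.

```python
def split_morphemes(segm_tokens, border="+"):
    """Split a space-separated token string into morphemes at each border."""
    morphemes, current = [], []
    for token in segm_tokens.split():
        if token == border:
            morphemes.append(current)
            current = []
        else:
            current.append(token)
    morphemes.append(current)
    return morphemes


def align_glosses(segm_tokens, glosses):
    """Pair each morpheme with its gloss; both columns must have equal length."""
    parts = split_morphemes(segm_tokens)
    labels = glosses.split()
    if len(parts) != len(labels):
        raise ValueError(f"{len(parts)} morphemes but {len(labels)} glosses")
    return [(label, " ".join(part)) for label, part in zip(labels, parts)]


# Rows copied from the German excerpt above: (FORM, SEGM-TOKENS, MORPHEMES).
ROWS = [
    ("Spinne", "ʃ p ɪ n + ə", "SPIN _e-suff"),
    ("Spinnwebe", "ʃ p ɪ n + v eː b + ə", "SPIN WEAVE _e-suff"),
    ("Spinnennetz", "ʃ p ɪ n + ə n + n ɛ ts", "SPIN _en-fuge NET"),
    ("spinnen", "ʃ p ɪ n + ə n", "SPIN _inf"),
]

for form, segments, glosses in ROWS:
    print(form, align_glosses(segments, glosses))
```

Because the gloss SPIN is attached to the same segment sequence in all four rows, the shared morpheme can be found by comparing glosses rather than letter strings.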

Towards a Database of Morpheme-Segmented Wordlists Use Cases Use Cases:
Automated Morpheme Segmentation Morphemes (List 2019) are recurring combinations of form and meaning and abstractions of relations within the lexicon. They reflect language history, are often bound to phonotactic restrictions, and are sometimes marked orthographically (space, dash, different character). Many approaches search only for recurring letter strings. The quality of an approach depends on the language and the amount of data. There is no standard for testing new methods. A database of morpheme-segmented wordlists could be used as a gold standard for testing purposes. 12 / 29
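One way a gold standard could be used is sketched below: the morpheme borders predicted by some method are compared with the gold annotation, and boundary precision, recall, and F-score are reported. This particular evaluation measure is an assumption made for illustration; the slides do not prescribe one. The helper names `boundaries` and `boundary_scores` are hypothetical.

```python
def boundaries(segmented, border="+"):
    """Return the positions (counted in sound segments) where a border occurs."""
    positions, index = set(), 0
    for token in segmented.split():
        if token == border:
            positions.add(index)
        else:
            index += 1
    return positions


def boundary_scores(gold, predicted):
    """Compute (precision, recall, F-score) of predicted borders against gold."""
    gold_set, pred_set = boundaries(gold), boundaries(predicted)
    hits = len(gold_set & pred_set)
    # With no borders to find or predict, treat the score as perfect.
    precision = hits / len(pred_set) if pred_set else 1.0
    recall = hits / len(gold_set) if gold_set else 1.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Gold annotation for "Spinnwebe" versus an invented prediction missing one border.
gold = "ʃ p ɪ n + v eː b + ə"
predicted = "ʃ p ɪ n + v eː b ə"
print(boundary_scores(gold, predicted))
```

Averaging such scores over all rows of a morpheme-segmented wordlist would give a single figure for comparing methods on the same data.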

Automated Cognate Detection I
Variety | Form | Character | Cognacy
Fúzhōu | ŋuoʔ⁵ | 月 | 1
Měixiàn | ŋiat⁵ kuoŋ⁴⁴ | 月光 | 1 2
Wēnzhōu | ȵy²¹ kuɔ³⁵ vai¹³ | 月光佛 | 1 2 3
Běijīng | yɛ⁵¹ liɑŋ¹ | 月亮 | 1 4
‘Moon’ in different Chinese varieties (List et al. 2016). 13 / 29
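The Cognacy column above assigns one cognate ID per morpheme, so two words can be compared by the IDs they share. The sketch below classifies word pairs as fully, partially, or not cognate. Treating "identical ID sets" as full cognacy is a simplifying assumption, and the function name `cognacy` is hypothetical.

```python
from itertools import combinations

# Cognate IDs per morpheme, copied from the 'moon' table above.
MOON = {
    "Fúzhōu": [1],
    "Měixiàn": [1, 2],
    "Wēnzhōu": [1, 2, 3],
    "Běijīng": [1, 4],
}


def cognacy(ids_a, ids_b):
    """Classify two words by the cognate IDs of their morphemes."""
    set_a, set_b = set(ids_a), set(ids_b)
    if set_a == set_b:
        return "full"
    if set_a & set_b:
        return "partial"
    return "none"


for a, b in combinations(MOON, 2):
    print(a, b, cognacy(MOON[a], MOON[b]))
```

All six pairs come out as partially cognate here, which is exactly the situation in which a binary cognate judgment per word has to force a choice.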

Automated Cognate Detection II Automated cognate detection is a standard method in historical computational linguistics (List et al. 2017). It is used to analyze large amounts of language data. One popular algorithm for it is LexStat. It is based on detecting regular sound correspondences, like the traditional comparative method. But partial cognacy can seriously hamper results. → Solution: A model of historical relations between words. First step: A database of morphologically annotated wordlists. 14 / 29

Studies on Cross-Linguistic Promiscuity Semantic promiscuity: morphological productivity of concepts (List 2018).
fallen | TO FALL
Fall | CASE (EVENT)
Fall | CASE (JURISTIC)
fällen | TO FELL A TREE
Fall | FALL
fällen | TO KILL IN BATTLE
abfallen | TO DROP OFF
Abfall | GARBAGE
abfallen | TO SLOPE
abfallen | TO TURN AWAY FROM AN IDEOLOGY
befallen | TO INFEST
Beifall | APPLAUSE
Falle | TRAP
einfallen | TO COLLAPSE
einfallen | TO INVADE
einfallen | TO COME TO MIND
Einfall | IDEA
fällig | DUE
gefällig | COMPLIANT
‘To fall’, a concept with high promiscuity (Schweikhard 2018). 15 / 29
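With morpheme glosses available, a rough proxy for promiscuity can be computed by counting how many distinct concepts are expressed by words containing a given gloss. The sketch below does this for the German spider data from the earlier excerpt. Both the proxy itself and the convention that glosses starting with an underscore mark grammatical material are assumptions for this example, not definitions taken from List (2018).

```python
from collections import defaultdict

# (CONCEPT, MORPHEMES) pairs copied from the German excerpt of the wordlist.
ROWS = [
    ("spider", "SPIN _e-suff"),
    ("spider web", "SPIN WEAVE _e-suff"),
    ("spider web", "SPIN _en-fuge NET"),
    ("spin", "SPIN _inf"),
]


def concepts_per_gloss(rows):
    """Map each lexical gloss to the sorted list of concepts it occurs in."""
    index = defaultdict(set)
    for concept, glosses in rows:
        for gloss in glosses.split():
            if not gloss.startswith("_"):  # assumed marker of grammatical glosses
                index[gloss].add(concept)
    return {gloss: sorted(concepts) for gloss, concepts in index.items()}


for gloss, concepts in concepts_per_gloss(ROWS).items():
    print(gloss, len(concepts), concepts)
```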

Partial Colexifications According to Urban (2011), the synchronic structure of word families can point to diachronic semantic changes, e.g.: 16 languages fully colexify ‘mouth’ and ‘lip’, while 31 languages partially colexify them. These 31 languages use a simplex for ‘mouth’ and a complex word or phrase for ‘lip’, like Kanuri cî ‘mouth’, kâ cî-bè ‘lip’, literally ‘stick mouth-of’. This suggests that ‘lip’ can be semantically derived from ‘mouth’, which is confirmed by Sanskrit lapana- ‘mouth’ → Sindhī lavan-a ‘lip’. Such hypotheses could be tested on a database of morphologically annotated wordlists. 16 / 29

Towards a Database of Morpheme-Segmented Wordlists Aggregation Strategy Aggregation Strategy:
CLDF Initiative The CLDF initiative (Forkel et al. 2018) tries to make linguistic data FAIR (Wilkinson et al. 2016): Findable, Accessible, Interoperable, Reusable. This involves four steps (Forkel et al. 2018): aggregating datasets, specifying the sources, including cross-links between datasets and reference catalogues, cleaning the data and presenting it in a TSV table format containing one row for each word form and one column for each type of annotation, and segmenting the forms by phonemes and morphemes. 17 / 29
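The table layout described above (one row per word form, one column per annotation type) can be read with standard tools. The following minimal sketch parses a small tab-separated sample with Python's `csv` module. The column names are borrowed from the German example in these slides and are an assumption; an actual CLDF dataset defines its own columns and metadata.

```python
import csv
import io

# Two rows in the assumed layout, built with explicit tab separators.
SAMPLE = "\n".join(
    "\t".join(cells)
    for cells in [
        ("ID", "DOCULECT", "CONCEPT", "FORM", "SEGM-TOKENS", "MORPHEMES"),
        ("339", "German", "spider", "Spinne", "ʃ p ɪ n + ə", "SPIN _e-suff"),
        ("753", "German", "spin", "spinnen", "ʃ p ɪ n + ə n", "SPIN _inf"),
    ]
)


def read_wordlist(handle):
    """Read a tab-separated wordlist into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_wordlist(io.StringIO(SAMPLE))
for row in rows:
    print(row["ID"], row["FORM"], row["MORPHEMES"].split())
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.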

CLICS Detecting colexification patterns with the help of the CLICS2 database (List et al. 2018a, graphic from List et al. 2018). 18 / 29

Concepticon The concept barking in the Concepticon database (List et al. 2019). 19 / 29

Standards Our database will follow these standards: Only synchronically transparent morpheme borders are annotated. Morpheme borders are marked consistently with a plus-sign. Cognate morphemes are annotated via glosses in order to distinguish homophones and link allomorphs. Example annotations will be provided as guidelines for contributors. 20 / 29
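Some of these standards can be checked mechanically before a dataset is accepted. The sketch below is one possible consistency check, assuming the plus sign is a separate token and every morpheme carries exactly one gloss; the function name `check_row` and the wording of the messages are hypothetical.

```python
def check_row(segments, glosses, border="+"):
    """Return a list of problems found in one annotated row (empty if none)."""
    tokens = segments.split()
    if not tokens:
        return ["empty segments"]
    problems = []
    if tokens[0] == border or tokens[-1] == border:
        problems.append("border at word edge")
    if any(a == border and b == border for a, b in zip(tokens, tokens[1:])):
        problems.append("empty morpheme between two borders")
    n_morphemes = tokens.count(border) + 1
    n_glosses = len(glosses.split())
    if n_morphemes != n_glosses:
        problems.append(f"{n_morphemes} morphemes but {n_glosses} glosses")
    return problems


print(check_row("ʃ p ɪ n + ə", "SPIN _e-suff"))   # well-formed row
print(check_row("ʃ p ɪ n + ə +", "SPIN _e-suff"))  # trailing border
```

Whether a border is synchronically transparent cannot be checked this way; that judgment stays with the annotator and the example annotations.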

Towards a Database of Morpheme-Segmented Wordlists Current State Current State:
Collecting Morpheme-Segmented Wordlists
Dataset | Varieties | Concepts | Lexemes | % mapped to Concepticon
allenbai | 9 | 499 | 4546 | 100
bodtkhobwa | 8 | 652 | 3958 | 89
halenepal | 13 | 1032 | 11041 | 69
marrisonnaga | 40 | 724 | 27441 | 94
yanglalo | 8 | 1001 | 9784 | 89
sohartmannchin | 8 | 280 | 2171 | 100
naganorgyalrongic | 10 | 1256 | 10685 | 80
castrosui | 16 | 592 | 9639 | 85
beidasinitic | 18 | 905 | 18069 | 79
suntb | 51 | 996 | 50434 | 92
chenhmongmien | 25 | 883 | 21617 | 89
Morpheme-segmented datasets from the Lexibank repository, prepared by members of the CALC group and the DLCE (Jena). 21 / 29

A Morpheme-Segmented Wordlist
ID | Language_ID | Parameter_ID | Value | Form | Segments | Source
Tibetan_Old_Tibetan-1741-1 | Tibetan_Old_Tibetan | 1741 | steŋ | steŋ | s t e ŋ | Huang1992
rGyalrong_Japhug-1741-1 | rGyalrong_Japhug | 1741 | ɯ-taʁ | ɯ-taʁ | ɯ + t a ʁ | Jacques2015b
Tibetan_Old_Tibetan-98-1 | Tibetan_Old_Tibetan | 98 | tʰams.tɕad | tʰams.tɕad | tʰ a m s + tɕ a d | Huang1992
Kiranti_Khaling-98-1 | Kiranti_Khaling | 98 | kʰøle | kʰøle | kʰ ø l e | Jacques2017FN
Kiranti_Limbu-98-1 | Kiranti_Limbu | 98 | kak | kak | k a k | Jacques2017FN
rGyalrong_Japhug-98-1 | rGyalrong_Japhug | 98 | %tʰamtɕɤt | %tʰamtɕɤt | tʰ a m tɕ ɤ t | Jacques2015b
Tangut-98-1 | Tangut | 98 | zji¹ | zji¹ | z j i ¹ | Li1997
Tibetan_Old_Tibetan-1292-1 | Tibetan_Old_Tibetan | 1292 | ŋan | ŋan | ŋ a n | Huang1992
rGyalrong_Japhug-1292-1 | rGyalrong_Japhug | 1292 | %ŋɤn | %ŋɤn | ŋ ɤ n | Jacques2015b
Tibetan_Old_Tibetan-1422-1 | Tibetan_Old_Tibetan | 1422 | gson+po | gson+po | g s o n + p o | Huang1992
Excerpt from the Sino-Tibetan Database of Lexical Cognates (Sagart et al. 2019). 22 / 29

Case Studies on Partial Colexification
Language | Concept | Source Form
Abui | tree | bata
Abui | skin | kul
Abui | bark | bata kul
Nung-Fengshan | tree | fai
Nung-Fengshan | skin | naŋ
Nung-Fengshan | bark | naŋ fai
Concepts from the CLICS2 database (List et al. 2018a, table by A. Tjuka). 23 / 29
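The distinction between full and partial colexification in the table above can be expressed as a small comparison over the parts of each form. In the sketch below, each space-separated word of a source form is treated as one morpheme, which is an assumption made for this example, and the function name `colexification` is hypothetical.

```python
# Source forms from the table above, split at spaces into their parts.
LEXICON = {
    "Abui": {"tree": ["bata"], "skin": ["kul"], "bark": ["bata", "kul"]},
    "Nung-Fengshan": {"tree": ["fai"], "skin": ["naŋ"], "bark": ["naŋ", "fai"]},
}


def colexification(lexicon, language, concept_a, concept_b):
    """Classify how two concepts are colexified in one language."""
    parts_a = lexicon[language][concept_a]
    parts_b = lexicon[language][concept_b]
    if parts_a == parts_b:
        return "full"
    if set(parts_a) & set(parts_b):
        return "partial"
    return "none"


for language in LEXICON:
    for pair in [("bark", "tree"), ("bark", "skin"), ("tree", "skin")]:
        print(language, pair, colexification(LEXICON, language, *pair))
```

In both languages ‘bark’ is partially colexified with ‘tree’ and with ‘skin’, while ‘tree’ and ‘skin’ share nothing.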

Outlook 24 / 29

Outlook Outlook: Annotation of Word Formation Processes I
ID | LANGUAGE | CONCEPT | FORM | MORPHEMES | COGNATES | ROOTS
1 | Old High German | eternity | ēwo | ēw o | 1 2 | 1 2
2 | Ancient Greek | life | aiōn | ai ōn | 1 2 | 1 2
3 | Old Avestan | life | āiiū | āiiū | 3 | 1
4 | Old Avestan | long-living | darəgāiiū | darəg a āiiū | 4 5 3 | 3 4 1
5 | Vedic | life | áyu | áyu | 3 | 1
6 | Vedic | long-living | dīrghā́yu | dīrgh á ā́yu | 4 5 3 | 3 4 1
7 | Vedic | young | yúvan | yúv an | 6 7 | 1 5
8 | Latin | (deity name) | iūnō | iū n ō | 6 8 2 | 1 5 2
9 | Indo-European | life | *h₂ai̯-u-on- | h₂ai̯u on | 3 2 | 1 2
10 | Indo-European | life | *h₂oi̯-u- | h₂oi̯u | 1 | 1
11 | Indo-European | long-living | *dl̩h₁gʰ-ó-h₂oi̯-u- | dl̩h₁gʰ ó h₂oi̯u | 4 5 1 | 3 4 1
12 | Indo-European | young | *h₂i̯-u-h₃on- | h₂i̯u h₃on | 6 7 | 1 5
13 | Indo-European | the young one | *h₂i̯-u-h₃n-on- | h₂i̯u h₃n on | 6 8 2 | 1 5 2
Data from Wodtko et al. (2008) and Mallory and Adams (2006). 25 / 29

Outlook Outlook: Annotation of Word Formation Processes II
[Figure showing the forms aiōn, āiiū, ā́yu, ēwo, iūnō, yúvan, dīrghā́yu, darəgāiiū, *h₂ai̯-u-on-, *h₂oi̯-u-, *h₂i̯-u-h₃on-, *h₂i̯-u-h₃n-on-, and *dl̩h₁gʰ-ó-h₂oi̯-u-, labelled by language: OHG, Greek, Old Avestan, Vedic, Latin.]
Source | Source-ID | Target | Target-ID | Change
*h₂ai̯-u-on- | 1 | aiōn | 2 | sound change
*h₂oi̯-u- | 3 | *h₂ai̯-u-on- | 1 | e-grade, on-suffix
*h₂oi̯-u- | 3 | *dl̩h₁gʰ-ó-h₂oi̯-u- | 4 | compound with *dl̩h₁gʰ-ó-
*dl̩h₁gʰ-ó-h₂oi̯- | 7 | dīrghā́yu | 8 | sound change
... | ... | ... | ... | ...
26 / 29
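A source-target table of this kind can be read as a directed graph whose edges are labelled with the type of change. The sketch below stores the four rows listed on the slide and walks from an attested form back through its recorded sources. It assumes that every target has at most one source, and the function name `trace_sources` is hypothetical; the elided rows of the table are not filled in.

```python
# (source, target, change) triples copied from the rows shown on the slide.
EDGES = [
    ("*h₂ai̯-u-on-", "aiōn", "sound change"),
    ("*h₂oi̯-u-", "*h₂ai̯-u-on-", "e-grade, on-suffix"),
    ("*h₂oi̯-u-", "*dl̩h₁gʰ-ó-h₂oi̯-u-", "compound with *dl̩h₁gʰ-ó-"),
    ("*dl̩h₁gʰ-ó-h₂oi̯-", "dīrghā́yu", "sound change"),
]


def trace_sources(target, edges):
    """Follow source links upwards from a form and return the chain of sources."""
    parent = {tgt: src for src, tgt, _ in edges}  # assumes one source per target
    chain, seen = [], {target}
    while target in parent:
        target = parent[target]
        if target in seen:  # guard against accidental cycles in the data
            break
        seen.add(target)
        chain.append(target)
    return chain


print(trace_sources("aiōn", EDGES))
```

Forms are matched by exact string here, so the chain for dīrghā́yu stops after one step: its listed source is spelled differently from the compound in the third row.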

Outlook Outlook: Goals Our goals entail an online database with
free access and at least 40 languages on first release, including Sino-Tibetan, Dogon (ongoing collaboration by A. Hantgan and J.-M. List), Tukanoan (ongoing collaboration by T. Chacon, T. Tresoldi, and J.-M. List), and Germanic languages (ongoing work by N. E. Schweikhard), based on pre-compiled wordlists, pre-segmented wordlists, and expert judgments. 27 / 29

Outlook Outlook: Application Possibilities Aggregating morpheme border annotations supports us
in automated cognate detection, testing morpheme detection methods, and developing standards of linguistic annotation, and it allows for quantitative studies on word family structure, partial colexification, and semantic promiscuity in word formation. 28 / 29

Outlook Thank you for your attention! Contact: [email protected] http://calc.digling.org/ CALC
members: Dr. Johann-Mattis List (Group leader) Dr. Yunfan Lai (Post-Doc) Dr. Tiago Tresoldi (Post-Doc) Mei-Shin Wu (Doctorate student) Nathanael E. Schweikhard (Doctorate student) Associated: Annika Tjuka 29 / 29
