Towards a Database of Morpheme-Annotated Wordlists

Talk held at the 24th ICHL conference at the ARC Centre of Excellence for the Dynamics of Language (Australian National University, Canberra, 2019/07/04).

Schweikhard

July 04, 2019

Transcript

  1. Towards a Database of Morpheme-Annotated Wordlists
    N. E. Schweikhard
    with contributions by A. Tjuka, Y.-F. Lai, and J.-M. List
    Max Planck Institute for the Science of Human History
    Department of Linguistic and Cultural Evolution
    CALC Project
    July 4th, 2019

  2. Table of Contents
    1 Compositionality of Words
    Word Formation
    Colexifications
    Word Families
    2 Towards a Database of Morpheme-Segmented Wordlists
    General Idea
    Use Cases
    Aggregation Strategy
    Current State
    3 Outlook

  3. Compositionality of Words

  4. Compositionality of Words Word Formation
    Word Formation
    Compositionality is a basic feature of human language (Zeige 2015).
    Language consists of re-combinable elements.
    This entails an unlimited number of expressions from a limited
    number of elements.
    Different words may, therefore, share some of their morphemes.

  5. Compositionality of Words Colexifications
    Colexifications
    Different meanings might be expressed with the same words
    (François 2008):
    Segment of 141 colexifications of skin ∼ bark in the CLICS2 database
    (List et al. 2018a, graphic by A. Tjuka).

  6. Compositionality of Words Word Families
    Word Families
    Word formation leads to families of related words, and different meanings
    might be expressed with the same morphemes:
    A bipartite partial colexification network (Hill and List 2017) generated with the EDICTOR tool
    (List 2017), showing 4 fully or partially colexified concepts in English.

  7. Towards a Database of
    Morpheme-Segmented Wordlists

  8. Towards a Database of Morpheme-Segmented Wordlists General Idea
    General Idea
    There are already many databases on linguistic data, e.g.
    borrowing (WOLD)
    cognates (ABVD)
    semantic shifts (DatSemShift)
    language structures (WALS)
    ...
    yet no cross-linguistic databases on morphological data.

  9. Towards a Database of Morpheme-Segmented Wordlists General Idea
    General Idea: Morpheme-Segmented Lexical Database
    WOLD, a database of borrowings in wordlists of basic vocabulary (Haspelmath and Tadmor
    2008).

  10. Towards a Database of Morpheme-Segmented Wordlists General Idea
    General Idea: Wordlists
    Wordlists are lists of concept-form-pairs.
    They have been used for language comparison
    since at least the 18th century
    (e.g. Leibniz 1768).
    They were popularized by Morris Swadesh in
    the 1950s (Campbell 2013).
    As a standard method in fieldwork, they are
    available for a very large number of languages
    (List et al. 2019).
    Sometimes they are the only data available.
    Concept | Form
    all | mumu
    ashes | kuy-ham
    bark | naka
    belly | ȼek
    big | mɨha
    black | yɨk
    blood | nɨˀpin
    bone | pak
    burn | neˀm-
    Excerpt from a wordlist with
    Zoque data (Swadesh 1954).

  11. Towards a Database of Morpheme-Segmented Wordlists General Idea
    General Idea: Morpheme-Segmented Wordlists
    ID | DOCULECT | CONCEPT | FORM | TOKENS | SEGM-TOKENS | MORPHEMES
    339 | German | spider | Spinne | ʃ p ɪ n ə | ʃ p ɪ n + ə | SPIN _e-suff
    341 | German | spider web | Spinnwebe | ʃ p ɪ n v eː b ə | ʃ p ɪ n + v eː b + ə | SPIN WEAVE _e-suff
    342 | German | spider web | Spinnennetz | ʃ p ɪ n ə n n ɛ ts | ʃ p ɪ n + ə n + n ɛ ts | SPIN _en-fuge NET
    753 | German | spin | spinnen | ʃ p ɪ n ə n | ʃ p ɪ n + ə n | SPIN _inf
    Data based on the Intercontinental Dictionary Series (Key and Comrie 2016).
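
As a rough illustration of how such rows could be processed, the following Python sketch splits the segmented tokens at the plus-sign and groups forms by the glosses they share. The tuple layout and the function names are assumptions made for this example; only the row content comes from the slide.

```python
from collections import defaultdict

# Rows copied from the slide: (ID, concept, form, segmented tokens, morpheme glosses).
ROWS = [
    (339, "spider", "Spinne", "ʃ p ɪ n + ə", "SPIN _e-suff"),
    (341, "spider web", "Spinnwebe", "ʃ p ɪ n + v eː b + ə", "SPIN WEAVE _e-suff"),
    (342, "spider web", "Spinnennetz", "ʃ p ɪ n + ə n + n ɛ ts", "SPIN _en-fuge NET"),
    (753, "spin", "spinnen", "ʃ p ɪ n + ə n", "SPIN _inf"),
]


def split_morphemes(segm_tokens):
    """Split a '+'-delimited token string into one token list per morpheme."""
    return [part.split() for part in segm_tokens.split("+")]


def word_families(rows):
    """Map each gloss to the forms whose annotation contains it."""
    families = defaultdict(list)
    for _id, _concept, form, segm_tokens, glosses in rows:
        morphemes = split_morphemes(segm_tokens)
        gloss_list = glosses.split()
        # Each morpheme needs exactly one gloss for the alignment to be meaningful.
        if len(morphemes) != len(gloss_list):
            raise ValueError(f"{form}: {len(morphemes)} morphemes, {len(gloss_list)} glosses")
        for gloss in gloss_list:
            families[gloss].append(form)
    return dict(families)


families = word_families(ROWS)
print(families["SPIN"])
```

All four forms end up in the SPIN family, which is the kind of word-family grouping the glosses are meant to make explicit.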

  12. Towards a Database of Morpheme-Segmented Wordlists Use Cases
    Use Cases: Automated Morpheme Segmentation
    Morphemes (List 2019)
    are recurring combinations of form and meaning
    and abstraction of relations within the lexicon
    which reflect language history
    and are often bound to phonotactic restrictions
    while being sometimes marked orthographically
    (space, dash, different character).
    Many approaches search only for recurring letter strings.
    The quality of an approach depends on language and amount of data.
    There is no standard for testing new methods.
    A database of morpheme-segmented wordlists could be used as a gold
    standard for testing purposes.
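
One way such a gold standard might be used is to score predicted morpheme borders against annotated ones. The sketch below computes boundary precision, recall, and F1, which is one common evaluation choice and not necessarily the one the project will adopt. The function names and the "predicted" segmentations are hypothetical; the gold strings follow the German example from this talk.

```python
def boundaries(segmented):
    """Return the set of boundary positions in a '+'-segmented token string.

    A position is the number of sound tokens that precede the boundary.
    """
    positions, count = set(), 0
    for token in segmented.split():
        if token == "+":
            positions.add(count)
        else:
            count += 1
    return positions


def boundary_scores(gold, predicted):
    """Compute boundary precision, recall, and F1 over paired segmentations."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_set, p_set = boundaries(g), boundaries(p)
        tp += len(g_set & p_set)
        fp += len(p_set - g_set)
        fn += len(g_set - p_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["ʃ p ɪ n + ə", "ʃ p ɪ n + v eː b + ə"]
predicted = ["ʃ p ɪ n + ə", "ʃ p ɪ n v eː b + ə"]  # hypothetical system output
precision, recall, f1 = boundary_scores(gold, predicted)
print(precision, recall, f1)
```

In this toy case the hypothetical system finds two of the three gold borders and proposes no wrong ones, so precision is perfect while recall is not.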

  13. Towards a Database of Morpheme-Segmented Wordlists Use Cases
    Use Cases: Automated Cognate Detection I
    Variety | Form | Character | Cognacy
    Fúzhōu | ŋuoʔ⁵ | 月 | 1
    Měixiàn | ŋiat⁵ kuoŋ⁴⁴ | 月光 | 1 2
    Wēnzhōu | ȵy²¹ kuɔ³⁵ vai¹³ | 月光佛 | 1 2 3
    Běijīng | yɛ⁵¹ liɑŋ¹ | 月亮 | 1 4
    Moon in different Chinese varieties (List et al. 2016).
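
To make the notion of partial cognacy concrete, the following sketch compares the per-morpheme cognacy IDs from the table pairwise. The three-way classification (identical ID lists count as "full", any overlap as "partial") is an assumption chosen for illustration, not a definition taken from the talk.

```python
from itertools import combinations

# Cognacy IDs per morpheme, copied from the table above.
MOON = {
    "Fúzhōu": [1],
    "Měixiàn": [1, 2],
    "Wēnzhōu": [1, 2, 3],
    "Běijīng": [1, 4],
}


def relation(ids_a, ids_b):
    """Classify two words as 'full', 'partial', or 'none' by shared morpheme cognate IDs."""
    if ids_a == ids_b:
        return "full"
    if set(ids_a) & set(ids_b):
        return "partial"
    return "none"


pairs = {
    (a, b): relation(MOON[a], MOON[b])
    for a, b in combinations(MOON, 2)
}
for pair, label in pairs.items():
    print(pair, label)
```

Every pair shares morpheme 1 but no two words are identical in their make-up, so a binary cognate judgment on whole words would have to either overstate or ignore the relationship.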

  14. Towards a Database of Morpheme-Segmented Wordlists Use Cases
    Use Cases: Automated Cognate Detection II
    Automated cognate detection is a standard method in historical
    computational linguistics (List et al. 2017).
    It is used to analyze large amounts of language data.
    One popular algorithm for it is LexStat.
    It is based on detecting regular sound correspondences, like the
    traditional comparative method.
    But partial cognacy can seriously hamper results.
    → Solution: A model of historical relations between words.
    First step: A database of morphologically annotated wordlists.

  15. Towards a Database of Morpheme-Segmented Wordlists Use Cases
    Use Cases: Studies on Cross-Linguistic Promiscuity
    Semantic promiscuity: morphological productivity of concepts (List 2018).
    fallen
    TO FALL
    Fall
    CASE (EVENT)
    Fall
    CASE (JURISTIC)
    fällen
    TO FELL A TREE
    Fall
    FALL
    fällen
    TO KILL IN BATTLE
    abfallen
    TO DROP OFF
    Abfall
    GARBAGE
    abfallen
    TO SLOPE
    abfallen
    TO TURN AWAY
    FROM AN IDEOLOGY
    befallen
    TO INFEST
    Beifall
    APPLAUSE
    Falle
    TRAP
    einfallen
    TO COLLAPSE
    einfallen
    TO INVADE
    einfallen
    TO COME TO MIND
    Einfall
    IDEA
    fällig
    DUE
    gefällig
    COMPLIANT
    ’To fall’, a concept with high promiscuity (Schweikhard 2018).

  16. Towards a Database of Morpheme-Segmented Wordlists Use Cases
    Use Cases: Partial Colexifications
    According to Urban (2011), the synchronic structure of word families can
    point to diachronic semantic changes, e.g.:
    16 languages fully colexify ‘mouth’ and ‘lip’.
    31 languages partially colexify ‘mouth’ and ‘lip’.
    These 31 languages use a simplex for mouth, and a complex word or
    phrase for lip, like Kanuri cî ‘mouth’, kâ cî-bè ‘lip’, literally ‘stick
    mouth-of’.
    Therefore ‘lip’ can be semantically derived from ‘mouth’.
    This is confirmed by Sanskrit lapana- ‘mouth’ → Sindhī lavan-a ‘lip’.
    This could be tested on a database of morphologically annotated wordlists.

  17. Towards a Database of Morpheme-Segmented Wordlists Aggregation Strategy
    Aggregation Strategy: CLDF Initiative
    The CLDF initiative (Forkel et al. 2018) aims to make linguistic data FAIR (Wilkinson et
    al. 2016):
    Findable
    Accessible
    Interoperable
    Reusable
    This involves four steps (Forkel et al. 2018):
    aggregating datasets, specifying the sources,
    including cross-links between datasets and reference catalogues,
    cleaning the data and presenting it in a TSV table format containing
    one row for each word form and
    one column for each type of annotation, and
    segmenting the forms by phonemes and morphemes.
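
The table layout described above (one row per word form, one column per annotation type) can be read with standard tools. The sketch below uses only Python's standard library and is not the CLDF tooling itself; the column names, including `SEGM_TOKENS`, are assumptions, and the two rows follow the German example from this talk.

```python
import csv
import io

# Column names are assumed for illustration; the column inventory of the
# planned database is not specified in the talk.
TSV = (
    "ID\tDOCULECT\tCONCEPT\tFORM\tSEGM_TOKENS\n"
    "339\tGerman\tspider\tSpinne\tʃ p ɪ n + ə\n"
    "753\tGerman\tspin\tspinnen\tʃ p ɪ n + ə n\n"
)


def read_wordlist(handle):
    """Read a tab-separated wordlist into a list of dicts, one per word form."""
    return list(csv.DictReader(handle, delimiter="\t"))


def phonemes_and_morphemes(tokens):
    """Count sound tokens and '+'-delimited morphemes in one segmented cell."""
    symbols = tokens.split()
    n_borders = symbols.count("+")
    return len(symbols) - n_borders, n_borders + 1


rows = read_wordlist(io.StringIO(TSV))
for row in rows:
    print(row["FORM"], phonemes_and_morphemes(row["SEGM_TOKENS"]))
```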

  18. Towards a Database of Morpheme-Segmented Wordlists Aggregation Strategy
    Aggregation Strategy: CLICS
    Detecting colexification patterns with the help of the CLICS2 database
    (List et al. 2018a, graphic from List et al. 2018).

  19. Towards a Database of Morpheme-Segmented Wordlists Aggregation Strategy
    Aggregation Strategy: Concepticon
    The concept barking in the Concepticon database (List et al. 2019).

  20. Towards a Database of Morpheme-Segmented Wordlists Aggregation Strategy
    Aggregation Strategy: Standards
    Our database will follow these standards:
    Only synchronically transparent morpheme borders are annotated.
    Morpheme borders are marked consistently with a plus-sign.
    Cognate morphemes are annotated via glosses in order to
    distinguish homophones and
    link allomorphs.
    Example annotations will be provided as guidelines for contributors.
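
Standards like these lend themselves to automatic checks on contributed data. The following sketch tests a single annotated row against two of them: borders marked only with a plus-sign, and one gloss per morpheme. The set of "alternative markers" and the wording of the messages are hypothetical.

```python
# Assumed examples of border marks that contributors might use instead of '+'.
ALTERNATIVE_MARKERS = {"-", "=", "_"}


def check_annotation(tokens, glosses):
    """Return a list of problems found in one annotated row (empty if none)."""
    problems = []
    symbols = tokens.split()
    if ALTERNATIVE_MARKERS & set(symbols):
        problems.append("border marked with something other than '+'")
    morphemes = [part.split() for part in tokens.split("+")]
    if any(not morpheme for morpheme in morphemes):
        problems.append("empty morpheme")
    if len(morphemes) != len(glosses.split()):
        problems.append("number of glosses differs from number of morphemes")
    return problems


print(check_annotation("ʃ p ɪ n + ə", "SPIN _e-suff"))
print(check_annotation("ʃ p ɪ n - ə", "SPIN _e-suff"))
```

Whether a border is synchronically transparent cannot be checked mechanically in this way; that part of the standard still depends on expert judgment.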

  21. Towards a Database of Morpheme-Segmented Wordlists Current State
    Current State: Collecting Morpheme-Segmented Wordlists
    Dataset | Varieties | Concepts | Lexemes | % mapped to Concepticon
    allenbai | 9 | 499 | 4546 | 100
    bodtkhobwa | 8 | 652 | 3958 | 89
    halenepal | 13 | 1032 | 11041 | 69
    marrisonnaga | 40 | 724 | 27441 | 94
    yanglalo | 8 | 1001 | 9784 | 89
    sohartmannchin | 8 | 280 | 2171 | 100
    naganorgyalrongic | 10 | 1256 | 10685 | 80
    castrosui | 16 | 592 | 9639 | 85
    beidasinitic | 18 | 905 | 18069 | 79
    suntb | 51 | 996 | 50434 | 92
    chenhmongmien | 25 | 883 | 21617 | 89
    Morpheme-segmented datasets from the Lexibank repository, prepared by members of the CALC
    group and the DLCE (Jena).

  22. Towards a Database of Morpheme-Segmented Wordlists Current State
    Current State: A Morpheme-Segmented Wordlist
    ID | Language_ID | Parameter_ID | Value | Form | Segments | Source
    Tibetan_Old_Tibetan-1741-1 | Tibetan_Old_Tibetan | 1741 | steŋ | steŋ | s t e ŋ | Huang1992
    rGyalrong_Japhug-1741-1 | rGyalrong_Japhug | 1741 | ɯ-taʁ | ɯ-taʁ | ɯ + t a ʁ | Jacques2015b
    Tibetan_Old_Tibetan-98-1 | Tibetan_Old_Tibetan | 98 | tʰams.tɕad | tʰams.tɕad | tʰ a m s + tɕ a d | Huang1992
    Kiranti_Khaling-98-1 | Kiranti_Khaling | 98 | kʰøle | kʰøle | kʰ ø l e | Jacques2017FN
    Kiranti_Limbu-98-1 | Kiranti_Limbu | 98 | kak | kak | k a k | Jacques2017FN
    rGyalrong_Japhug-98-1 | rGyalrong_Japhug | 98 | %tʰamtɕɤt | %tʰamtɕɤt | tʰ a m tɕ ɤ t | Jacques2015b
    Tangut-98-1 | Tangut | 98 | zji¹ | zji¹ | z j i ¹ | Li1997
    Tibetan_Old_Tibetan-1292-1 | Tibetan_Old_Tibetan | 1292 | ŋan | ŋan | ŋ a n | Huang1992
    rGyalrong_Japhug-1292-1 | rGyalrong_Japhug | 1292 | %ŋɤn | %ŋɤn | ŋ ɤ n | Jacques2015b
    Tibetan_Old_Tibetan-1422-1 | Tibetan_Old_Tibetan | 1422 | gson+po | gson+po | g s o n + p o | Huang1992
    Excerpt from the Sino-Tibetan Database of Lexical Cognates (Sagart et al. 2019).

  23. Towards a Database of Morpheme-Segmented Wordlists Current State
    Current State: Case Studies on Partial Colexification
    Language | Concept | Source Form
    Abui | tree | bata
    Abui | skin | kul
    Abui | bark | bata kul
    Nung-Fengshan | tree | fai
    Nung-Fengshan | skin | naŋ
    Nung-Fengshan | bark | naŋ fai
    Concepts from the CLICS2 database (List et al. 2018a, table by A. Tjuka).
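
A simple way to spot candidates for partial colexification in data like this is to look for forms that recur as one part of a longer form in the same language. The sketch below treats whitespace as the border between parts, which is an assumption that happens to fit these source forms; in a morpheme-annotated wordlist the plus-sign would take that role.

```python
# (language, concept) -> source form, copied from the table above.
FORMS = {
    ("Abui", "tree"): "bata",
    ("Abui", "skin"): "kul",
    ("Abui", "bark"): "bata kul",
    ("Nung-Fengshan", "tree"): "fai",
    ("Nung-Fengshan", "skin"): "naŋ",
    ("Nung-Fengshan", "bark"): "naŋ fai",
}


def partial_colexifications(forms):
    """Yield (language, simplex concept, complex concept) triples.

    A triple is reported when the form for the first concept occurs as one
    part of a multi-part form for the second concept in the same language.
    """
    for (lang_a, concept_a), form_a in forms.items():
        for (lang_b, concept_b), form_b in forms.items():
            if lang_a != lang_b or concept_a == concept_b:
                continue
            parts_b = form_b.split()
            if len(parts_b) > 1 and form_a in parts_b:
                yield lang_a, concept_a, concept_b


found = sorted(partial_colexifications(FORMS))
for triple in found:
    print(triple)
```

In both languages the word for 'bark' contains the words for 'tree' and 'skin', even though the order of the parts differs.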

  24. Outlook

  25. Outlook
    Outlook: Annotation of Word Formation Processes I
    ID | LANGUAGE | CONCEPT | FORM | MORPHEMES | COGNATES | ROOTS
    1 | Old High German | eternity | ēwo | ēw o | 1 2 | 1 2
    2 | Ancient Greek | life | aiōn | ai ōn | 1 2 | 1 2
    3 | Old Avestan | life | āiiū | āiiū | 3 | 1
    4 | Old Avestan | long-living | darəgāiiū | darəg a āiiū | 4 5 3 | 3 4 1
    5 | Vedic | life | áyu | áyu | 3 | 1
    6 | Vedic | long-living | dīrghā́yu | dīrgh á ā́yu | 4 5 3 | 3 4 1
    7 | Vedic | young | yúvan | yúv an | 6 7 | 1 5
    8 | Latin | (deity name) | iūnō | iū n ō | 6 8 2 | 1 5 2
    9 | Indo-European | life | *h₂ai̯-u-on- | h₂ai̯u on | 3 2 | 1 2
    10 | Indo-European | life | *h₂oi̯-u- | h₂oi̯u | 1 | 1
    11 | Indo-European | long-living | *dl̩h₁gʰ-ó-h₂oi̯-u- | dl̩h₁gʰ ó h₂oi̯u | 4 5 1 | 3 4 1
    12 | Indo-European | young | *h₂i̯-u-h₃on- | h₂i̯u h₃on | 6 7 | 1 5
    13 | Indo-European | the young one | *h₂i̯-u-h₃n-on- | h₂i̯u h₃n on | 6 8 2 | 1 5 2
    Data from Wodtko et al. (2008) and Mallory and Adams (2006).

  26. Outlook
    Outlook: Annotation of Word Formation Processes II
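    [Figure: graph relating the reconstructed forms *h₂oi̯-u-, *h₂ai̯-u-on-, *h₂i̯-u-h₃on-, *h₂i̯-u-h₃n-on-, and *dl̩h₁gʰ-ó-h₂oi̯-u- to the attested forms ēwo (OHG), aiōn (Greek), āiiū and darəgāiiū (Old Avestan), ā́yu, yúvan, and dīrghā́yu (Vedic), and iūnō (Latin).]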
    Source | Source-ID | Target | Target-ID | Change
    *h₂ai̯-u-on- | 1 | aiōn | 2 | sound change
    *h₂oi̯-u- | 3 | *h₂ai̯-u-on- | 1 | e-grade, on-suffix
    *h₂oi̯-u- | 3 | *dl̩h₁gʰ-ó-h₂oi̯-u- | 4 | compound with *dl̩h₁gʰ-ó-
    *dl̩h₁gʰ-ó-h₂oi̯- | 7 | dīrghā́yu | 8 | sound change
    ... | ... | ... | ... | ...
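
A source-target table of this kind can be read as a directed graph, so the history of a form can be traced by following edges backwards. The sketch below does this for the four rows shown; it assumes that each target has a single source, which would need to be relaxed for compounds with several listed sources. The function name and data layout are hypothetical.

```python
# (source_id, target_id, change) for the rows shown on the slide.
EDGES = [
    (1, 2, "sound change"),
    (3, 1, "e-grade, on-suffix"),
    (3, 4, "compound with *dl̩h₁gʰ-ó-"),
    (7, 8, "sound change"),
]


def derivation(target_id, edges):
    """Follow edges backwards from a form to its earliest listed source.

    Returns the steps in chronological order as (source, target, change).
    """
    parent = {target: (source, change) for source, target, change in edges}
    steps, seen = [], set()
    current = target_id
    # The 'seen' set guards against accidental cycles in the annotation.
    while current in parent and current not in seen:
        seen.add(current)
        source, change = parent[current]
        steps.append((source, current, change))
        current = source
    return list(reversed(steps))


path = derivation(2, EDGES)  # ID 2 is Greek aiōn
for step in path:
    print(step)
```

For aiōn the path runs from form 3 through form 1, combining a word formation step with a later sound change.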

  27. Outlook
    Outlook: Goals
    Our goals entail
    an online database
    with free access
    and at least 40 languages on first release,
    including Sino-Tibetan, Dogon (ongoing collaboration by A. Hantgan
    and J.-M. List), Tukanoan (ongoing collaboration by T. Chacon, T.
    Tresoldi, and J.-M. List), and Germanic languages (ongoing work by
    N. E. Schweikhard),
    based on
    pre-compiled wordlists,
    pre-segmented wordlists, and
    expert judgments.

  28. Outlook
    Outlook: Application Possibilities
    Aggregating morpheme border annotations supports us in
    automated cognate detection,
    testing morpheme detection methods, and
    developing standards of linguistic annotation
    and allows for quantitative studies on
    word family structure,
    partial colexification, and
    semantic promiscuity in word formation.

  29. Outlook
    Thank you for your attention!
    Contact: [email protected]
    http://calc.digling.org/
    CALC members:
    Dr. Johann-Mattis List (Group leader)
    Dr. Yunfan Lai (Post-Doc)
    Dr. Tiago Tresoldi (Post-Doc)
    Mei-Shin Wu (Doctoral student)
    Nathanael E. Schweikhard (Doctoral student)
    Associated:
    Annika Tjuka