
Computer-assisted data curation and analysis for historical and typological language comparison

Tiago Tresoldi
December 15, 2018

Slides for the talk at the "Words, Genes, Bones, and Tools" symposium in Tübingen, 2018.

Transcript

  1. Computer-assisted data curation
    and analysis for historical and
    typological language comparison
    Tiago Tresoldi & Johann-Mattis List
    Max-Planck-Institut für Menschheitsgeschichte (MPI-SHH), Jena


  2. In the beginning was the tree
    Darwin (1837) Schleicher (1863)



  4. Historical Linguistics
    ● Shorter time-depth (~6,000 years)
    ● Biased towards some families, especially IE
    ● Still emphasis on the tree model and
    reconstruction of proto-forms
    ● Sometimes more an art than a science
    – No set of formal guidelines leading to
    reproducible results, more a set of principles
    – No standard for data
    – Good art is better than bad science!


  5. Wordlists and cognates
    Language  ALL     SEA        WATER   WHEN
    Hittite   dapiya  aruna      watar   kuwapi
    English   all     sea        water   when
    German    alle    see, meer  wasser  wann
    French    tout    mer        eau     quand
    Italian   tutto   mare       acqua   quando
    Greek     pant    thalasa    nero    pote


  6. Wordlists and cognates
    Language  ALL          SEA                  WATER        WHEN
    Hittite   dapiya (A1)  aruna (B1)           watar (C1)   kuwapi (D1)
    English   all (A2)     sea (B2)             water (C1)   when (D1)
    German    alle (A2)    see (B2), meer (B3)  wasser (C1)  wann (D1)
    French    tout (A3)    mer (B3)             eau (C2)     quand (D1)
    Italian   tutto (A3)   mare (B3)            acqua (C2)   quando (D1)
    Greek     pant (A4)    thalasa (B4)         nero (C3)    pote (D1)


  7. NEXUS files
    Hittite 100010001001
    English 010001001001
    German 010001101001
    French 001000100101
    Italian 001000100101
    Greek 000100010011
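
The binary rows above can be derived mechanically from the cognate labels on the previous slide: one column per cognate set (A1 to D1, in sorted order), with 1 where a language has a word in that set. The following is a minimal Python sketch of that encoding, not the authors' actual pipeline; the names `COGNATES`, `binary_matrix`, and `to_nexus` are hypothetical, and the NEXUS header shown is one common layout for binary character data.

```python
# Cognate-set labels per language, transcribed from the annotated wordlist slide.
COGNATES = {
    "Hittite": ["A1", "B1", "C1", "D1"],
    "English": ["A2", "B2", "C1", "D1"],
    "German":  ["A2", "B2", "B3", "C1", "D1"],  # two words for SEA: see, meer
    "French":  ["A3", "B3", "C2", "D1"],
    "Italian": ["A3", "B3", "C2", "D1"],
    "Greek":   ["A4", "B4", "C3", "D1"],
}


def binary_matrix(cognates):
    """Return (sorted cognate-set labels, {language: presence/absence string})."""
    labels = sorted({label for sets in cognates.values() for label in sets})
    rows = {
        lang: "".join("1" if label in sets else "0" for label in labels)
        for lang, sets in cognates.items()
    }
    return labels, rows


def to_nexus(rows, nchar):
    """Wrap the binary rows in a simple NEXUS DATA block (assumed layout)."""
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(rows)} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    lines += [f"    {lang:<8} {row}" for lang, row in rows.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


labels, rows = binary_matrix(COGNATES)
print(to_nexus(rows, len(labels)))
```

Each row printed inside the MATRIX block should match the corresponding row on the slide; German has two 1s in the SEA columns because it is coded for both B2 and B3.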


  8. Cognate judgment
    ● Correspondence, not similarity
    – Latin “god” deus, Greek “god” θεός [theos]
    – English two, Armenian “two” երկուս [erkus]
    ● As semantically strict as possible
    – Albanian “sister” motër
    ● Borrowings, resemblances
    ● Concept lists, basic vocabulary
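
To see why surface similarity is a poor proxy for cognacy, a small Python sketch can compare the two word pairs from this slide with a normalized edit distance. The reading assumed here is the standard textbook one: Latin deus and Greek theos look alike but are not cognate, while English two and Armenian erkus are cognate despite sharing no letters. Comparing romanized spellings rather than phonetic transcriptions is a simplification, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word's length (0 = identical)."""
    return levenshtein(a, b) / max(len(a), len(b))


# Romanized forms from the slide (an assumption: spelling, not phonetics).
for a, b in [("deus", "theos"), ("two", "erkus")]:
    print(a, b, normalized_distance(a, b))
```

The look-alike pair scores as closer than the true cognate pair, which is the point of judging by regular sound correspondences instead of by resemblance.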


  9. Current situation
    ● Exponential increase in digital linguistic data
    does not mean an increase in data for historical
    linguistics
    ● Preparing datasets for large-scale historical and
    typological language comparison is difficult
    ● Scholars tend not to adhere to standards when
    (and if!) preparing datasets
    ● Field divided into traditional and computational
    approaches


  10. Here to help
    Source: https://xkcd.com/1831/


  11. Why should we change things?
    ● Historical linguistics not only for the sake of
    reconstructing proto-forms
    ● The discipline should inform and be informed by
    other areas of language research
    ● It should also inform and be informed by other
    fields of the science of human history
    ● We need computer-assisted, not
    computer-performed, research


  12. Data
    ● What do (modeled) biologists do, if they want to
    make an analysis involving data?
    ● What do (modeled) linguists do, if they want to
    make an analysis involving data?
    ● Main alternatives for cross-linguistic comparison:
    parse some online data, a few limited sources
    (Wiktionary), ASJP


  13. Data Example


  14. Glottolog
    https://glottolog.org


  15. CLTS / BIPA
    https://clts.clld.org


  16. Concepticon - I
    https://concepticon.clld.org/


  17. Concepticon - II
    ● Linking concept lists
    ● Concepticon’s concept set “MOTHER” (#1216) is
    linked to 112 different concepts, including
    – “母亲” in Allen (2007)
    – “mother, older female relative” in
    Bengtson (1994)
    – “his mother” in Davies (1985)
    – “мать” in a Russian translation of Swadesh
    (1964)


  18. Concepticon - III
    ● Fix errors:
    – Swadesh “dull” is sometimes translated into
    Chinese as 钝 (dùn, “blunt”), sometimes as 笨
    (bèn, “stupid”)
    – In Concepticon, data collected as 钝 is
    mapped to concept BLUNT, data collected as 笨
    to concept STUPID
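
The linking idea on the Concepticon slides can be sketched as a lookup from the gloss used in a source list to a shared concept set. The Python below is only an illustration: the plain dictionary and the function `link_glosses` are hypothetical stand-ins, not Concepticon's data format or API, and only the glosses and concept labels named on the slides are used.

```python
from collections import defaultdict

# Elicitation gloss -> concept set label, as named on the slides.
# A plain dict is an illustrative stand-in, not Concepticon's own format.
GLOSS_TO_CONCEPT = {
    "母亲": "MOTHER",
    "mother, older female relative": "MOTHER",
    "his mother": "MOTHER",
    "мать": "MOTHER",
    "钝": "BLUNT",
    "笨": "STUPID",
}


def link_glosses(glosses, mapping):
    """Group glosses by concept set; collect anything that cannot be linked."""
    linked = defaultdict(list)
    unlinked = []
    for gloss in glosses:
        concept = mapping.get(gloss.strip())
        if concept is None:
            unlinked.append(gloss)  # left for manual review
        else:
            linked[concept].append(gloss)
    return dict(linked), unlinked


linked, unlinked = link_glosses(
    ["мать", "钝", "笨", "his mother", "dull"], GLOSS_TO_CONCEPT
)
print(linked)
print(unlinked)
```

The bare English gloss "dull" is deliberately left unlinked in this sketch: it is ambiguous between BLUNT and STUPID, so it is flagged for a human decision instead of being guessed.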


  19. CLDF, Glottobank/Lexibank
    ● CLDF, Forkel et al. (2018)
    ● Lexibank


  20. LingPy, Edictor


  21. LingPy, Edictor


  22. LingPy, Edictor


  23. Example - CLICS²


  24. Example - CLICS²


  25. Example - CLICS²


  26. Example - CLICS²


  27. What does it allow?
    ● Computer-assisted language comparison
    – Data must be human- and machine-readable
    – Interfaces must be lightweight
    – Software should produce transparent results
    ● Empirical, cross-linguistic priors
    ● Levels of confidence
    ● History of languages as part of human histories
    – We should like reticulation!


  28. More CALCies
    Dr. Johann-Mattis List (PI) Dr. Yunfan Lai
    Mei-Shin Wu Nathanael E. Schweikhard
    External Associates: Dr. Nathan W. Hill (SOAS, London), Dr. Tim Bodt (Universität Bern)


  29. Thank you!
    ● Special thanks to: Robert Forkel, Christoph
    Rzymski, Simon Greenhill, Harald Hammarström,
    Cormac Anderson, Mary Walworth, Thiago
    Chacon, Russell Gray
    ● DFG Center for Advanced Studies “Words, Bones,
    Genes, Tools” (Universität Tübingen)
    ● ERC #206320
    ● http://calc.digling.org/

