Computer-assisted data curation and analysis for historical and typological language comparison

Slides for the talk at the "Words, Genes, Bones, and Tools" symposium in Tübingen, 2018.

Tiago Tresoldi

December 15, 2018

Transcript

  1. Computer-assisted data curation and analysis for historical and typological language comparison
    Tiago Tresoldi & Johann-Mattis List
    Max-Planck-Institut für Menschheitsgeschichte (MPI-SHH), Jena
  2. In the beginning was the tree: Darwin (1837), Schleicher (1863)

  3. None
  4. Historical Linguistics
    • Shorter time-depth (~6,000 years)
    • Biased towards some families, especially IE
    • Still emphasis on the tree model and reconstruction of proto-forms
    • Sometimes more an art than a science
      – No set of formal guidelines leading to reproducible results, more principles
      – No standard for data
      – Good art is better than bad science!
  5. Wordlists and cognates

              ALL      SEA         WATER    WHEN
    Hittite   dapiya   aruna       watar    kuwapi
    English   all      sea         water    when
    German    alle     see, meer   wasser   wann
    French    tout     mer         eau      quand
    Italian   tutto    mare        acqua    quando
    Greek     pant     thalasa     nero     pote
  6. Wordlists and cognates

              ALL           SEA                   WATER         WHEN
    Hittite   dapiya (A1)   aruna (B1)            watar (C1)    kuwapi (D1)
    English   all (A2)      sea (B2)              water (C1)    when (D1)
    German    alle (A2)     see (B2), meer (B3)   wasser (C1)   wann (D1)
    French    tout (A3)     mer (B3)              eau (C2)      quand (D1)
    Italian   tutto (A3)    mare (B3)             acqua (C2)    quando (D1)
    Greek     pant (A4)     thalasa (B4)          nero (C3)     pote (D1)
  7. NEXUS files

    Hittite   100010001001
    English   010001001001
    German    010001101001
    French    001000100101
    Italian   001000100101
    Greek     000100010011
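
The binary strings on slide 7 can be derived from the cognate-set labels on slide 6: each cognate set (A1 to D1) becomes one column, and a language gets a 1 if it has a word in that set. The following Python sketch illustrates this encoding. The function name `binary_matrix` and the dictionary layout are assumptions made for this example, not part of any tool named in the talk, and the sketch prints only the data rows, without the header and block structure of a complete NEXUS file.

```python
# Cognate-set labels per language and concept, copied from the slide 6 table.
# The dictionary layout is an assumption chosen for this illustration.
COGNATE_SETS = {
    "Hittite": {"ALL": ["A1"], "SEA": ["B1"], "WATER": ["C1"], "WHEN": ["D1"]},
    "English": {"ALL": ["A2"], "SEA": ["B2"], "WATER": ["C1"], "WHEN": ["D1"]},
    "German": {"ALL": ["A2"], "SEA": ["B2", "B3"], "WATER": ["C1"], "WHEN": ["D1"]},
    "French": {"ALL": ["A3"], "SEA": ["B3"], "WATER": ["C2"], "WHEN": ["D1"]},
    "Italian": {"ALL": ["A3"], "SEA": ["B3"], "WATER": ["C2"], "WHEN": ["D1"]},
    "Greek": {"ALL": ["A4"], "SEA": ["B4"], "WATER": ["C3"], "WHEN": ["D1"]},
}


def binary_matrix(cognate_sets):
    """Return (column labels, {language: presence/absence string})."""
    # One column per cognate set, in sorted order (A1, A2, ..., D1).
    labels = sorted(
        {
            label
            for concepts in cognate_sets.values()
            for sets in concepts.values()
            for label in sets
        }
    )
    rows = {}
    for language, concepts in cognate_sets.items():
        # All cognate sets in which this language has at least one word.
        present = {label for sets in concepts.values() for label in sets}
        rows[language] = "".join("1" if label in present else "0" for label in labels)
    return labels, rows


if __name__ == "__main__":
    labels, rows = binary_matrix(COGNATE_SETS)
    for language, row in rows.items():
        print(f"{language:<10}{row}")
```

German is the only row with two 1s among the SEA columns, because both see (B2) and meer (B3) are listed for that concept.
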
  8. Cognate judgment
    • Correspondence, not similarity
      – Latin “god” deus, Greek “god” θεός [theos]
      – English two, Armenian “two” երկուս [erkus]
    • As semantically strict as possible
      – Albanian “sister” motër
    • Borrowings, resemblances
    • Concept lists, basic vocabulary
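
To illustrate why slide 8 stresses correspondence over similarity, the sketch below computes a plain normalized edit distance for the two word pairs on the slide. This is only an illustration of surface similarity, not a method proposed in the talk. It works on the romanized spellings shown on the slide rather than on phonetic transcriptions, which is an assumption of this example, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# (word 1, word 2, are the words cognate?) following the examples on slide 8.
PAIRS = [
    ("deus", "theos", False),  # similar on the surface, but not cognate
    ("two", "erkus", True),    # dissimilar on the surface, but cognate
]

if __name__ == "__main__":
    for word_a, word_b, cognate in PAIRS:
        distance = normalized_distance(word_a, word_b)
        print(f"{word_a} / {word_b}: distance {distance:.2f}, cognate: {cognate}")
```

Under these assumptions the non-cognate pair deus / theos comes out as closer than the cognate pair two / erkus, so a similarity threshold alone would misclassify both.
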
  9. Current situation
    • Exponential increase in digital linguistic data does not mean an increase in data for historical linguistics
    • Preparing datasets for large-scale historical and typological language comparison is difficult
    • Scholars tend not to adhere to standards when (and if!) preparing datasets
    • Field divided into traditional and computational approaches
  10. Here to help (Source: https://xkcd.com/1831/)

  11. Why should we change things?
    • Historical linguistics is not only for the sake of reconstructing proto-forms
    • The discipline should inform and be informed by other areas of language research
    • It should also inform and be informed by other fields of the science of human history
    • We need computer-assisted, not computer-performed research
  12. Data
    • What do (modeled) biologists do, if they want to make an analysis involving data?
    • What do (modeled) linguists do, if they want to make an analysis involving data?
    • Main alternatives for cross-linguistic comparison: parsing some online data, a few limited sources (Wiktionary), ASJP
  13. Data Example

  14. Glottolog https://glottolog.org

  15. CLTS / BIPA https://clts.clld.org

  16. Concepticon - I https://concepticon.clld.org/

  17. Concepticon - II
    • Linking concept lists
    • Concepticon’s concept set “MOTHER” (#1216) is linked to 112 different concepts, including
      – “母亲” in Allen (2007)
      – “mother, older female relative” in Bengtson (1994)
      – “his mother” in Davies (1985)
      – “мать” in a Russian translation of Swadesh (1964)
  18. Concepticon - III
    • Fix errors:
      – Swadesh “dull” sometimes translated in Chinese as 钝 (dùn, “blunt”), sometimes as 笨 (bèn, “stupid”)
      – In Concepticon, data collected as 钝 is mapped to concept BLUNT, data collected as 笨 to concept STUPID
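
The linking described on slides 17 and 18 can be pictured as a lookup from (concept list, gloss) pairs to concept sets. The sketch below uses only the examples from those two slides. It is not Concepticon's actual data format or API: the tuple layout, the function name, and the two placeholder list labels for the Chinese translations (whose sources the slides do not name) are assumptions made for this illustration.

```python
from collections import defaultdict

# (concept list, gloss in that list, concept set) as given on slides 17 and 18.
# The two "hypothetical list" labels are placeholders, not real source names.
LINKS = [
    ("Allen (2007)", "母亲", "MOTHER"),
    ("Bengtson (1994)", "mother, older female relative", "MOTHER"),
    ("Davies (1985)", "his mother", "MOTHER"),
    ("Swadesh (1964), Russian translation", "мать", "MOTHER"),
    ("hypothetical list A, Chinese translation of 'dull'", "钝", "BLUNT"),
    ("hypothetical list B, Chinese translation of 'dull'", "笨", "STUPID"),
]

# Direct lookup: which concept set does a gloss in a given list belong to?
CONCEPT_SET_OF = {(source, gloss): concept_set for source, gloss, concept_set in LINKS}


def glosses_by_concept_set(links):
    """Group the source glosses under the concept set they are linked to."""
    grouped = defaultdict(list)
    for source, gloss, concept_set in links:
        grouped[concept_set].append((source, gloss))
    return dict(grouped)


if __name__ == "__main__":
    for concept_set, members in glosses_by_concept_set(LINKS).items():
        print(concept_set)
        for source, gloss in members:
            print(f"  {gloss!r} in {source}")
```

Keeping the two translations of “dull” under separate concept sets means that data collected with 钝 and data collected with 笨 are never compared as if they answered the same question.
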
  19. CLDF, Glottobank/Lexibank
    • CLDF, Forkel et al. (2018)
    • Lexibank

  20. LingPy, Edictor

  21. LingPy, Edictor

  22. LingPy, Edictor

  23. Example - CLICS²

  24. Example - CLICS²

  25. Example - CLICS²

  26. Example - CLICS²

  27. What does it allow?
    • Computer-assisted language comparison
      – Data must be human- and machine-readable
      – Interfaces must be lightweight
      – Software should produce transparent results
    • Empirical, cross-linguistic priors
    • Levels of confidence
    • History of languages as part of human histories
      – We should like reticulation!
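
As one possible reading of "human- and machine-readable" on slide 27, the sketch below writes the WATER column of the slide 6 table as a tab-separated wordlist and reads it back with Python's standard `csv` module. The column names (LANGUAGE, CONCEPT, FORM, COGID) are assumptions chosen for this example and are not claimed to match the exact layout of any format named in the talk.

```python
import csv
import io
from collections import defaultdict

# WATER column of the slide 6 table; the column names are illustrative assumptions.
HEADER = ["LANGUAGE", "CONCEPT", "FORM", "COGID"]
ROWS = [
    ["Hittite", "WATER", "watar", "C1"],
    ["English", "WATER", "water", "C1"],
    ["German", "WATER", "wasser", "C1"],
    ["French", "WATER", "eau", "C2"],
    ["Italian", "WATER", "acqua", "C2"],
    ["Greek", "WATER", "nero", "C3"],
]


def to_tsv(header, rows):
    """Serialize the wordlist as tab-separated text that opens in any editor."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def forms_by_cognate_set(tsv_text):
    """Parse the tab-separated text and group the word forms by cognate set."""
    grouped = defaultdict(list)
    for record in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        grouped[record["COGID"]].append(record["FORM"])
    return dict(grouped)


if __name__ == "__main__":
    text = to_tsv(HEADER, ROWS)
    print(text)
    print(forms_by_cognate_set(text))
```

The same file can be inspected and corrected by hand and then processed by a script without any conversion step.
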
  28. More CALCies
    Dr. Johann-Mattis List (PI), Dr. Yunfan Lai, Mei-Shin Wu, Nathanael E. Schweikhard
    External Associates: Dr. Nathan W. Hill (SOAS, London), Dr. Tim Bodt (Universität Bern)
  29. Thank you!
    • Special thanks to: Robert Forkel, Christoph Rzymski, Simon Greenhill, Harald Hammarström, Cormac Anderson, Mary Walworth, Thiago Chacon, Russell Gray
    • DFG Center for Advanced Studies “Words, Bones, Genes, Tools” (Universität Tübingen)
    • ERC #206320
    • http://calc.digling.org/