Computer-assisted data curation and analysis for historical and typological language comparison


Slides for the talk at the "Words, Genes, Bones, and Tools" symposium in Tübingen, 2018.


Tiago Tresoldi

December 15, 2018

Transcript

  1. 1.

    Computer-assisted data curation and analysis for historical and typological language comparison
    Tiago Tresoldi & Johann-Mattis List
    Max-Planck-Institut für Menschheitsgeschichte (MPI-SHH), Jena
  2. 3.
  3. 4.

    Historical Linguistics
    • Shorter time-depth (~6,000 years)
    • Biased towards some families, especially IE
    • Still emphasis on the tree model and reconstruction of proto-forms
    • Sometimes more an art than a science
      – No set of formal guidelines leading to reproducible results, more principles
      – No standard for data
      – Good art is better than bad science!
  4. 5.

    Wordlists and cognates

               ALL      SEA         WATER    WHEN
    Hittite    dapiya   aruna       watar    kuwapi
    English    all      sea         water    when
    German     alle     see, meer   wasser   wann
    French     tout     mer         eau      quand
    Italian    tutto    mare        acqua    quando
    Greek      pant     thalasa     nero     pote
  5. 6.

    Wordlists and cognates

               ALL           SEA                    WATER         WHEN
    Hittite    dapiya (A1)   aruna (B1)             watar (C1)    kuwapi (D1)
    English    all (A2)      sea (B2)               water (C1)    when (D1)
    German     alle (A2)     see (B2), meer (B3)    wasser (C1)   wann (D1)
    French     tout (A3)     mer (B3)               eau (C2)      quand (D1)
    Italian    tutto (A3)    mare (B3)              acqua (C2)    quando (D1)
    Greek      pant (A4)     thalasa (B4)           nero (C3)     pote (D1)
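The annotated wordlist above can be read as a table of (language, concept, form, cognate ID) rows. The following is a minimal sketch of that reading, assuming a plain in-memory list; the names `WORDLIST`, `cognate_sets`, and `shared_cognates` are illustrative and do not come from the talk or from any particular library.

```python
from collections import defaultdict

# Rows copied from the slide: (language, concept, form, cognate ID).
WORDLIST = [
    ("Hittite", "ALL", "dapiya", "A1"),
    ("Hittite", "SEA", "aruna", "B1"),
    ("Hittite", "WATER", "watar", "C1"),
    ("Hittite", "WHEN", "kuwapi", "D1"),
    ("English", "ALL", "all", "A2"),
    ("English", "SEA", "sea", "B2"),
    ("English", "WATER", "water", "C1"),
    ("English", "WHEN", "when", "D1"),
    ("German", "ALL", "alle", "A2"),
    ("German", "SEA", "see", "B2"),
    ("German", "SEA", "meer", "B3"),
    ("German", "WATER", "wasser", "C1"),
    ("German", "WHEN", "wann", "D1"),
    ("French", "ALL", "tout", "A3"),
    ("French", "SEA", "mer", "B3"),
    ("French", "WATER", "eau", "C2"),
    ("French", "WHEN", "quand", "D1"),
    ("Italian", "ALL", "tutto", "A3"),
    ("Italian", "SEA", "mare", "B3"),
    ("Italian", "WATER", "acqua", "C2"),
    ("Italian", "WHEN", "quando", "D1"),
    ("Greek", "ALL", "pant", "A4"),
    ("Greek", "SEA", "thalasa", "B4"),
    ("Greek", "WATER", "nero", "C3"),
    ("Greek", "WHEN", "pote", "D1"),
]


def cognate_sets(rows):
    """Group (language, form) pairs by cognate ID, keeping row order."""
    sets = defaultdict(list)
    for language, _concept, form, cog_id in rows:
        sets[cog_id].append((language, form))
    return dict(sets)


def shared_cognates(rows, lang_a, lang_b):
    """Return the sorted cognate IDs attested in both languages."""
    ids_a = {cog for lang, _, _, cog in rows if lang == lang_a}
    ids_b = {cog for lang, _, _, cog in rows if lang == lang_b}
    return sorted(ids_a & ids_b)
```

Calling `shared_cognates(WORDLIST, "English", "German")` lists the cognate sets the two languages share in this small sample. One row per form, rather than one cell per language and concept, lets synonyms such as German see and meer carry separate cognate IDs.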
  6. 8.

    Cognate judgment
    • Correspondence, not similarity
      – Latin “god” deus, Greek “god” θεός [theos]
      – English two, Armenian “two” երկուս [erkus]
    • As semantically strict as possible
      – Albanian “sister” motër
    • Borrowings, resemblances
    • Concept lists, basic vocabulary
  7. 9.

    Current situation
    • Exponential increase in digital linguistic data does not mean increase in data for historical linguistics
    • Preparing datasets for large-scale historical and typological language comparison is difficult
    • Scholars don’t tend to adhere to standards when (and if!) preparing datasets
    • Field divided into traditional and computational approaches
  8. 11.

    Why should we change things?
    • Historical linguistics not only for the sake of reconstructing proto-forms
    • The discipline should inform and be informed by other areas of language research
    • It should also inform and be informed by other fields of the science of human history
    • We need computer-assisted, not computer-performed research
  9. 12.

    Data
    • What do (modeled) biologists do, if they want to make an analysis involving data?
    • What do (modeled) linguists do, if they want to make an analysis involving data?
    • Main alternatives for cross-linguistic comparison: parse some online data, a few limited sources (Wiktionary), ASJP
  10. 17.

    Concepticon - II
    • Linking concept lists
    • Concepticon’s concept set “MOTHER” (#1216) is linked to 112 different concepts, including
      – “母亲” in Allen (2007)
      – “mother, older female relative” in Bengtson (1994)
      – “his mother” in Davies (1985)
      – “мать” in a Russian translation of Swadesh (1964)
  11. 18.

    Concepticon - III
    • Fix errors:
      – Swadesh “dull” sometimes translated into Chinese as 钝 (dùn, “blunt”), sometimes as 笨 (bèn, “stupid”)
      – In Concepticon, data collected as 钝 is mapped to concept BLUNT, data collected as 笨 to concept STUPID
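The linking described on the two Concepticon slides can be pictured as a lookup from a source-specific gloss to a shared concept set. The sketch below is a hand-made illustration under that assumption: it does not use the Concepticon data or its tooling, the dictionary and function names are hypothetical, and the forms in the usage example are placeholders rather than real data.

```python
from collections import defaultdict

# Hand-copied excerpt of the links named on the slides (hypothetical layout):
# (concept list, gloss in that list) -> Concepticon concept set.
GLOSS_LINKS = {
    ("Allen (2007)", "母亲"): "MOTHER",
    ("Bengtson (1994)", "mother, older female relative"): "MOTHER",
    ("Davies (1985)", "his mother"): "MOTHER",
    ("Swadesh (1964), Russian translation", "мать"): "MOTHER",
}

# Chinese renderings of Swadesh "dull" and the concept set each one targets.
ELICITATION_LINKS = {
    "钝": "BLUNT",
    "笨": "STUPID",
}


def link_gloss(concept_list, gloss):
    """Return the concept set linked to a gloss, or None if unlinked."""
    return GLOSS_LINKS.get((concept_list, gloss))


def regroup_by_concept(records):
    """Regroup (variety, elicitation gloss, form) records by concept set.

    Records whose elicitation gloss has no known link are collected
    under None so that they can be reviewed by hand.
    """
    grouped = defaultdict(list)
    for variety, gloss, form in records:
        grouped[ELICITATION_LINKS.get(gloss)].append((variety, form))
    return dict(grouped)
```

For example, `regroup_by_concept([("variety-1", "钝", "form-a"), ("variety-2", "笨", "form-b")])` separates two records that were both nominally collected for "dull" into BLUNT and STUPID. Keeping unlinked glosses under `None` instead of dropping them fits the computer-assisted workflow, where a human makes the final call.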
  12. 27.

    What does it allow?
    • Computer-assisted language comparison
      – Data must be human- and machine-readable
      – Interfaces must be lightweight
      – Software should produce transparent results
    • Empirical, cross-linguistic priors
    • Levels of confidence
    • History of languages as part of human histories
      – We should like reticulation!
  13. 28.

    More CALCies
    Dr. Johann-Mattis List (PI)
    Dr. Yunfan Lai
    Mei-Shin Wu
    Nathanael E. Schweikhard
    External Associates: Dr. Nathan W. Hill (SOAS, London), Dr. Tim Bodt (Universität Bern)
  14. 29.

    Thank you!
    • Special thanks to: Robert Forkel, Christoph Rzymski, Simon Greenhill, Harald Hammarström, Cormac Anderson, Mary Walworth, Thiago Chacon, Russell Gray
    • DFG Center for Advanced Studies “Words, Bones, Genes, Tools” (Universität Tübingen)
    • ERC #206320
    • http://calc.digling.org/
  15. 30.