Computer-assisted data curation and analysis for historical and typological language comparison

Tiago Tresoldi & Johann-Mattis List
Max-Planck-Institut für Menschheitsgeschichte (MPI-SHH), Jena

In the beginning was the tree
Darwin (1837), Schleicher (1863)

Historical Linguistics
• Shorter time-depth (~6,000 years)
• Biased towards some families, especially Indo-European (IE)
• Emphasis still on the tree model and the reconstruction of proto-forms
• Sometimes more an art than a science
  – No set of formal guidelines leading to reproducible results, rather a set of principles
  – No standard for data
  – Good art is better than bad science!

Wordlists and cognates
          ALL      SEA         WATER    WHEN
Hittite   dapiya   aruna       watar    kuwapi
English   all      sea         water    when
German    alle     see, meer   wasser   wann
French    tout     mer         eau      quand
Italian   tutto    mare        acqua    quando
Greek     pant     thalasa     nero     pote

Wordlists and cognates
          ALL           SEA                   WATER         WHEN
Hittite   dapiya (A1)   aruna (B1)            watar (C1)    kuwapi (D1)
English   all (A2)      sea (B2)              water (C1)    when (D1)
German    alle (A2)     see (B2), meer (B3)   wasser (C1)   wann (D1)
French    tout (A3)     mer (B3)              eau (C2)      quand (D1)
Italian   tutto (A3)    mare (B3)             acqua (C2)    quando (D1)
Greek     pant (A4)     thalasa (B4)          nero (C3)     pote (D1)

NEXUS files
Hittite   100010001001
English   010001001001
German    010001101001
French    001000100101
Italian   001000100101
Greek     000100010011
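The step from the annotated wordlist to the binary rows can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the names `COGNATES` and `binary_matrix` are made up here, and the column order (cognate-set labels sorted alphabetically, A1 to D1) is an assumption that happens to reproduce the rows on the slide.

```python
# Cognate-set labels per language, copied from the annotated wordlist slide.
COGNATES = {
    "Hittite": ["A1", "B1", "C1", "D1"],
    "English": ["A2", "B2", "C1", "D1"],
    "German": ["A2", "B2", "B3", "C1", "D1"],  # two words for SEA
    "French": ["A3", "B3", "C2", "D1"],
    "Italian": ["A3", "B3", "C2", "D1"],
    "Greek": ["A4", "B4", "C3", "D1"],
}


def binary_matrix(cognates):
    """Return one presence/absence string per language.

    Each column stands for one cognate set; a "1" means the language
    has at least one word in that set.
    """
    # Assumed column order: all labels found in the data, sorted.
    columns = sorted({label for labels in cognates.values() for label in labels})
    return {
        language: "".join("1" if column in labels else "0" for column in columns)
        for language, labels in cognates.items()
    }


matrix = binary_matrix(COGNATES)
for language, row in matrix.items():
    print(f"{language:<8} {row}")
```

Only the character matrix is produced; the surrounding NEXUS header and block syntax are left out of this sketch.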

Cognate judgment
• Correspondence, not similarity
  – Latin “god” deus, Greek “god” θεός [theos]
  – English two, Armenian “two” երկուս [erkus]
• As semantically strict as possible
  – Albanian “sister” motër
• Borrowings, resemblances
• Concept lists, basic vocabulary
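Why similarity is a poor guide can be shown with a small Python sketch that measures surface similarity for the two pairs above. The assumptions are stated up front: romanized spellings stand in for the actual sounds, plain Levenshtein distance stands in for "similarity", and the cognacy labels follow the standard textbook reading of these examples (deus/theos are lookalikes, two/erkus are related). The function names are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def normalized(a, b):
    """Distance scaled to the range 0..1 by the longer word's length."""
    return levenshtein(a, b) / max(len(a), len(b))


# (word 1, word 2, cognate according to the textbook reading)
PAIRS = [
    ("deus", "theos", False),
    ("two", "erkus", True),
]

for left, right, cognate in PAIRS:
    print(f"{left:<5} {right:<6} distance={normalized(left, right):.2f} cognate={cognate}")
```

Under these assumptions the unrelated pair comes out as the more similar one, which is the slide's point: cognacy rests on regular sound correspondences, not on how alike two words look.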

Current situation
• Exponential increase in digital linguistic data does not mean an increase in data for historical linguistics
• Preparing datasets for large-scale historical and typological language comparison is difficult
• Scholars tend not to adhere to standards when (and if!) preparing datasets
• Field divided into traditional and computational approaches

Here to help
Source: https://xkcd.com/1831/

Why should we change things?
• Historical linguistics is not only for the sake of reconstructing proto-forms
• The discipline should inform and be informed by other areas of language research
• It should also inform and be informed by other fields of the science of human history
• We need computer-assisted, not computer-performed research

Data
• What do (modeled) biologists do, if they want to make an analysis involving data?
• What do (modeled) linguists do, if they want to make an analysis involving data?
• Main alternatives for cross-linguistic comparison: parse some online data, a few limited sources (Wiktionary), ASJP

Data Example

Glottolog https://glottolog.org

CLTS / BIPA https://clts.clld.org

Concepticon - I https://concepticon.clld.org/

Concepticon - II
• Linking concept lists
• Concepticon’s concept set “MOTHER” (#1216) is linked to 112 different concepts, including
  – “母亲” in Allen (2007)
  – “mother, older female relative” in Bengtson (1994)
  – “his mother” in Davies (1985)
  – “мать” in a Russian translation of Swadesh (1964)

Concepticon - III
• Fix errors:
  – Swadesh “dull” sometimes translated into Chinese as 钝 (dùn, “blunt”), sometimes as 笨 (bèn, “stupid”)
  – In Concepticon, data collected as 钝 is mapped to concept BLUNT, data collected as 笨 to concept STUPID
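The linking idea from the Concepticon slides can be sketched with a plain lookup table in Python. This is not the Concepticon API or its data format: `GLOSS_TO_CONCEPT`, `link` and `group_by_concept` are hypothetical names, and the table holds only the gloss and concept-set pairs mentioned on the slides.

```python
# Hypothetical gloss -> concept-set table; only the pairings come from the slides.
GLOSS_TO_CONCEPT = {
    "母亲": "MOTHER",
    "mother, older female relative": "MOTHER",
    "his mother": "MOTHER",
    "мать": "MOTHER",
    "钝": "BLUNT",   # one Chinese rendering of Swadesh "dull"
    "笨": "STUPID",  # the other rendering, a different concept
}


def link(gloss):
    """Return the concept set for a gloss, or None if it is not mapped."""
    return GLOSS_TO_CONCEPT.get(gloss.strip())


def group_by_concept(glosses):
    """Group glosses from different lists under their shared concept set."""
    groups = {}
    for gloss in glosses:
        concept = link(gloss)
        if concept is not None:
            groups.setdefault(concept, []).append(gloss)
    return groups


groups = group_by_concept(["母亲", "his mother", "мать", "钝", "笨", "unmapped gloss"])
for concept, members in groups.items():
    print(concept, members)
```

Because the two Chinese renderings of “dull” end up under different concept sets, words elicited with 钝 and words elicited with 笨 are no longer compared as if they expressed the same meaning.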

CLDF, Glottobank/Lexibank
• CLDF, Forkel et al. (2018)
• Lexibank

LingPy, Edictor

Example - CLICS²

What does it allow?
• Computer-assisted language comparison
  – Data must be human- and machine-readable
  – Interfaces must be lightweight
  – Software should produce transparent results
• Empirical, cross-linguistic priors
• Levels of confidence
• History of languages as part of human histories
  – We should like reticulation!

More CALCies
• Dr. Johann-Mattis List (PI)
• Dr. Yunfan Lai
• Mei-Shin Wu
• Nathanael E. Schweikhard
• External Associates: Dr. Nathan W. Hill (SOAS, London), Dr. Tim Bodt (Universität Bern)

Thank you!
• Special thanks to: Robert Forkel, Christoph Rzymski, Simon Greenhill, Harald Hammarström, Cormac Anderson, Mary Walworth, Thiago Chacon, Russell Gray
• DFG Center for Advanced Studies “Words, Bones, Genes, Tools” (Universität Tübingen)
• ERC #206320
• http://calc.digling.org/
