Computer-Assisted Language Comparison. State of the art and future prospects

Computer-Assisted Language Comparison
State of the art and future prospects
T. Tresoldi, N. E. Schweikhard, M.-S. Wu, Y.-F. Lai, J.-M. List
Max Planck Institute for the Science of Human History, Department of Linguistic and Cultural Evolution
CALC Project
Jul 13, 2018

Table of Contents
1 Language Comparison
  The Comparative Method
  Computational Approaches
2 State of the Art
  The Project
  Data Formats
  EDICTOR
  Alignments
3 Future prospects
  Germanic Word Formation
  SEAGaL
  Language and geography

The Comparative Method
Techniques for language comparison, the “comparative method”, haven’t changed substantially since the 19th century:
1 conduct intensive language comparison
2 identify regular recurring similarities
3 reconstruct the development of languages and their families
Issues:
  Usually done manually, by small groups, over a long time
  Crucial tasks, such as cognate identification, are partly based on non-formalized knowledge and intuition

Computational Approaches
Quantitative methods in use at least since d’Urville (1830s)
Computational methods: Swadesh, Greenberg, S. Starostin
  lexicostatistics and data normalization vs. glottochronology and mass comparison
  research on supra-families that lack scholarly consensus, such as Nostratic
Modern methods: Ringe, G. Starostin, “New Zealand School”
  Linguistics returning from a purely synchronic focus
  Bayesian inference in phylogenetics: alleged uninterpretability, models from biology
  Competing research by non-linguists and in non-academic settings (NLP)

Meme time
Title text: “We TOLD you it was hard.” “Yeah, but now that I’VE tried, we KNOW it’s hard.” (XKCD #1831)

Traditional vs. Computational Language Comparison
[Figure: diagram contrasting the comparative method with computational historical linguistics; labels: lacks, efficiency, consistency, accuracy, flexibility]

The Current Scenario
Historical language comparison and interdisciplinarity:
  language change not studied by and for itself
  window to human history and a bridge to other disciplines
Quantitative (and particularly Bayesian) turn:
  classical methods reached their limits in some cases
  open access, collaboration (no more “lone wolves”)
New languages, new questions:
  large and diverse language families
  language families like Sino-Tibetan present “almost unsurmountable obstacles” (Antoine Meillet, 1925)

Guidelines of our project
ERC-funded project: Computer-Assisted Language Comparison
[Figure: workflow diagram; labels: Linguistic data, Manual alignment, Manual sound correspondence, Manual cognate judgment, ..., CLDF, EDICTOR, LingPy, Concepticon, Cross-Linguistic Colexifications]
Software, data, and tools should complement the traditional approach:
  interdisciplinary approach: adapt rather than transfer
  allow experts to access and understand the results
  computational methods cannot replace experts (assist, not replace)

LingPy
Programming library for historical linguistics, state of the art:
  multiple phonetic alignments (List, 2014): 98% (pair scores)
  automatic cognate detection (List et al., 2017): 89% (B-Cubed scores)
  phylogenetic reconstruction (Rama et al., 2018): 0.08 (generalized quartet distance)
  correspondence pattern identification (List, under review): NP-hard (no human attempts)
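The B-Cubed scores quoted for cognate detection compare a proposed partition of words into cognate sets against a gold standard. The following is a minimal pure-Python sketch of the metric, not LingPy's own implementation; the function name `bcubed` and the toy cluster labels are illustrative assumptions.

```python
from collections import defaultdict


def bcubed(gold, pred):
    """Return (precision, recall, F-score) for two clusterings.

    gold and pred are equal-length lists of cluster labels, one per word.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    def members(labels):
        # For every item, the set of items sharing its cluster label.
        groups = defaultdict(set)
        for idx, label in enumerate(labels):
            groups[label].add(idx)
        return [groups[label] for label in labels]

    gold_sets, pred_sets = members(gold), members(pred)
    n = len(gold)
    precision = sum(len(g & p) / len(p) for g, p in zip(gold_sets, pred_sets)) / n
    recall = sum(len(g & p) / len(g) for g, p in zip(gold_sets, pred_sets)) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: hypothetical cognate judgments for five words.
gold = [1, 1, 1, 2, 2]
pred = [1, 1, 2, 2, 2]
precision, recall, f_score = bcubed(gold, pred)
```

Precision penalizes words wrongly lumped together, recall penalizes cognates that were split apart, and the F-score is their harmonic mean.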

CLDF: Cross-Linguistic Data Formats
Data must be machine- and human-readable, with unified formats for data storage and exchange.
Data curation is facilitated by:
  spreadsheet formats
  validation software
  benchmark data
  reference catalogs
  online publications
Example:
Doculect         Glottocode  Concept  Concepticon ID  Form   Tokens    Source
Anuta            anut1237    EIGHT    1705            varu   v a r u   POLLEX
East Futunan     east2447    EIGHT    1705            valu   v a l u   POLLEX
Hawaiian         hawa1245    EIGHT    1705            walu   w a l u   ID: 71458
Kapingamarangi   kapi1249    EIGHT    1705            walu   w a l u   POLLEX
Mele Fila        mele1250    EIGHT    1705            ebaru  B a r u   ID: 52375
Nukuria          nuku1259    EIGHT    1705            varu   v a r u   Davletshin (2015)
...              ...         ...      ...             ...    ...       ...
Rapanui          rapa1244    EIGHT    1705            va’u   v a P u   POLLEX
Rennell Bellona  renn1242    EIGHT    1705            bangu  b a Ng u  POLLEX
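Because the rows in the table follow a fixed column layout, simple consistency checks can be automated. The following is a minimal sketch using only the Python standard library. It is not the CLDF validation software itself: the function name `check_rows`, the tab-separated layout, and the column name `Concepticon_ID` are assumptions made for illustration, and the Glottocode pattern (four lowercase alphanumerics followed by four digits) is inferred from the codes shown in the table.

```python
import csv
import io
import re

# Three rows copied from the table; tab-separated for this sketch (assumption).
SAMPLE = (
    "Doculect\tGlottocode\tConcept\tConcepticon_ID\tForm\tTokens\tSource\n"
    "Anuta\tanut1237\tEIGHT\t1705\tvaru\tv a r u\tPOLLEX\n"
    "East Futunan\teast2447\tEIGHT\t1705\tvalu\tv a l u\tPOLLEX\n"
    "Kapingamarangi\tkapi1249\tEIGHT\t1705\twalu\tw a l u\tPOLLEX\n"
)

# Assumed Glottocode shape: four alphanumerics plus four digits.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def check_rows(text):
    """Return a list of (line_number, message) for rows that fail the checks."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        if not GLOTTOCODE.match(row["Glottocode"]):
            problems.append((line_no, "malformed Glottocode"))
        if not row["Concepticon_ID"].isdigit():
            problems.append((line_no, "Concepticon ID is not numeric"))
        if not row["Tokens"].split():
            problems.append((line_no, "no segmented tokens"))
    return problems


problems = check_rows(SAMPLE)
```

An empty `problems` list means every row passed; each failing row is reported with its line number so that a curator can fix it in the spreadsheet.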

The Etymological DICTionary EditOR (EDICTOR) http://edictor.digling.org/

Alignments Automatic alignment of Germanic cognates of “knee”
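Phonetic alignment places corresponding segments of cognate words in the same column. The following is a minimal sketch of pairwise global alignment (Needleman-Wunsch) with naive match, mismatch, and gap scores; LingPy's alignment methods are considerably more refined. The function name `align`, the scoring values, and the broad transcriptions of German Knie and English knee are illustrative assumptions, not data from the slides.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two lists of segments; return two gapped lists."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two segments
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


german = ["k", "n", "iː"]  # Knie, broad transcription (assumption)
english = ["n", "iː"]      # knee, broad transcription (assumption)
aligned_german, aligned_english = align(german, english)
```

With these inputs the initial k of the German word is aligned with a gap in the English word, which reflects the loss of the initial stop in English.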

Sound Correspondences
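Once words are aligned, candidate sound correspondences can be read off the alignment columns. The following is a minimal sketch using five of the segmented EIGHT forms from the CLDF example earlier in the deck. Treating the four-segment forms as already aligned position by position is a simplifying assumption, the function name `column_patterns` is hypothetical, and a real analysis would need many cognate sets to establish regularity.

```python
from collections import Counter

# Tokens for EIGHT as given in the CLDF example table.
aligned = {
    "Anuta": ["v", "a", "r", "u"],
    "East Futunan": ["v", "a", "l", "u"],
    "Hawaiian": ["w", "a", "l", "u"],
    "Kapingamarangi": ["w", "a", "l", "u"],
    "Nukuria": ["v", "a", "r", "u"],
}


def column_patterns(alignment):
    """Return one tuple of segments per alignment column, in doculect order."""
    doculects = list(alignment)
    length = len(alignment[doculects[0]])
    if any(len(alignment[d]) != length for d in doculects):
        raise ValueError("all aligned sequences must have the same length")
    return [tuple(alignment[d][i] for d in doculects) for i in range(length)]


patterns = column_patterns(aligned)
# Columns in which the doculects disagree are candidate correspondences.
variable = [p for p in patterns if len(set(p)) > 1]
# Counting identical patterns becomes informative once many cognate sets are pooled.
pattern_counts = Counter(patterns)
```

Here the first column (v, v, w, w, v) and the third column (r, l, l, l, r) vary across doculects, while the vowels are identical everywhere.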

Concepticon and Cross-Linguistic Colexiﬁcations http://clics.clld.org/
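A colexification is a case in which one word form in a language expresses two or more concepts. The following is a minimal sketch of how colexified concept pairs could be collected from a word list. The tiny word list is a hand-made illustration and not CLICS data, and the function name `colexifications` is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# (language, concept, form): illustrative entries, not taken from CLICS.
WORDLIST = [
    ("Russian", "HAND", "ruka"),
    ("Russian", "ARM", "ruka"),
    ("Russian", "FOOT", "noga"),
    ("Russian", "LEG", "noga"),
    ("English", "HAND", "hand"),
    ("English", "ARM", "arm"),
    ("English", "FOOT", "foot"),
    ("English", "LEG", "leg"),
]


def colexifications(wordlist):
    """Map each colexified concept pair to the set of languages showing it."""
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)
    pairs = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        # Every pair of concepts sharing a form in one language is a colexification.
        for pair in combinations(sorted(concepts), 2):
            pairs[pair].add(language)
    return dict(pairs)


links = colexifications(WORDLIST)
```

Concept pairs that are colexified in many unrelated languages can then be treated as edges of a network, weighted by the number of languages.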

Future prospects
Our methods are already helping people by automating tedious tasks of comparative linguistics (over 50 publications citing LingPy!). Where can we go from here?
  Educate and train
  Stochastic methods, decision trees, neural networks
  New questions to explore
    likelihood of random resemblance
    morphology in cognate identification
    partial colexifications
    can intuition be weighted?
    suprasegmental relationships and segments in their setting
    less-studied languages, including sign languages
  Litmus test: Sino-Tibetan languages

Germanic Word Formation - I
Research Questions
  What are the interrelations between form, meaning, and frequency?
  Are they system-dependent, culture-specific, or universal?
  How can computer-assisted methods help answer these questions?
Perspectives
  concept-based (onomasiological) vs. form-based (semasiological)
  cross-linguistic vs. language-specific
  quantitative vs. qualitative → computer-assisted

Germanic Word Formation - II
Morphological Complexity in Basic Vocabulary
  question: evolutionary dynamics of lexical change
  concept-based
  language(-family)-specific
Paradigmatic Alternations in Nominal Derivations
  question: causes of paradigmatic alternations
  form-based
  language(-family)-specific
Productivity and Promiscuity in Compounding
  question: language-specific and universal aspects of compoundhood
  concept- and form-based
  cross-linguistic (worldwide) and language-specific

Germanic Word Formation - III
[Figure: diagram relating the three studies (Morphological Complexity in Basic Vocabulary, Productivity and Promiscuity of Compound Heads, Paradigmatic Alternations in Nominal Derivations) to the concept-based and form-based perspectives; further labels: Concepticon, CLICS, EDICTOR, Partial Colexifications, Automatic Morpheme Detection]

SEAGaL: South East Asia Gene and Language - I
What is the degree of correlation between linguistic and genetic diversity?
[Figure: A group, B group, C group; A language, B language, C language]
WHY?
  Geography?
  Population size?
  Bilingualism?

SEAGaL: South East Asia Gene and Language - II
Sino-Tibetan language family.
[Figure: workflow diagram; labels: Ethnic group; Genomic: Bioinformatics tools; Linguistic: Lexibank, Concepticon, LingPy; Refine, reassess and reconstruct]

Language and geography: the case of Rgyalrongic - I
Rgyalrongic languages (Sino-Tibetan) exhibit a series of orientational prefixes
Traditional approaches
  Fully related to actual topography
  According to river, mountain, sun, etc.
  Inconsistent among scholars: wild guesses
  Not covering all uses
Are orientational prefixes related to actual geography?

Rgyalrongic languages

Language and geography: the case of Rgyalrongic - II
Our current task is to test how closely orientational prefixes are related to real-world topography
  Select 15 familiar places (villages or towns)
  Collect the prefixes used between every pair of places
  Draw a map based on the prefixes
    This map represents the collective memory of the speakers
  Compare the inferred map with the actual map
We may get to understand
  The original meaning of the orientational prefixes
  Evolutionary pathways of the orientational prefixes
  How Rgyalrongic (even Sino-Tibetan) ancestors understood and interpreted geography

Language and geography: the case of Rgyalrongic - III
Figure: The actual map
Figure: The inferred map

Thank you... and let’s collaborate!
[Figure: CALC diagram; labels: DATA, TOOLS, SOFTWARE, INTERFACES]
http://calc.digling.org/
