Computer-Assisted Language Comparison
State of the art and future prospects
T. Tresoldi, N. E. Schweikhard, M.-S. Wu, Y.-F. Lai, J.-M. List
Max Planck Institute for the Science of Human History
Department of Linguistic and Cultural Evolution
CALC Project
Jul 13, 2018

Table of Contents
1 Language Comparison
    The Comparative Method
    Computational Approaches
2 State of the Art
    The Project
    Data Formats
    EDICTOR
    Alignments
3 Future prospects
    Germanic Word Formation
    SEAGaL
    Language and geography

The Comparative Method
Techniques for language comparison, the “comparative method”, haven’t changed substantially since the 19th century:
1 conduct intensive language comparison
2 identify regular recurring similarities
3 reconstruct the development of languages and their families
Issues:
- Usually done manually, by small groups, over a long time
- Crucial tasks, such as cognate identification, are partly based on non-formalized knowledge and intuition

Computational Approaches
- Quantitative methods in use at least since d’Urville (1830s)
- Computational methods: Swadesh, Greenberg, S. Starostin
    - lexicostatistics and data normalization vs. glottochronology and mass comparison
    - research on non-consensual supra-families, such as Nostratic
- Modern methods: Ringe, G. Starostin, “New Zealand School”
    - Linguistics coming back from synchronicity
    - Bayesian inference in phylogenetics: alleged uninterpretability, models from biology
    - Competing research by non-linguists and in non-academic settings (NLP)

Traditional vs. Computational Language Comparison
[Figure: diagram contrasting the comparative method with computational historical linguistics; labels: lacks, efficiency, consistency, accuracy, flexibility]

The Current Scenario
- Historical language comparison and interdisciplinarity:
    - language change not studied by and for itself
    - window to human history and a bridge to other disciplines
- Quantitative (and particularly Bayesian) turn:
    - classical methods reached their limits in some cases
    - open access, collaboration (no more “lone wolves”)
- New languages, new questions:
    - large and diverse language families
    - language families like Sino-Tibetan present “almost unsurmountable obstacles” (Antoine Meillet, 1925)

Guidelines of our project
ERC-funded project: Computer-Assisted Language Comparison
[Diagram: linguistic data; manual alignment, manual sound correspondence, manual cognate judgment, ...; CLDF, EDICTOR, LingPy, Concepticon, Cross-Linguistic Colexifications]
Software, data, and tools should complement the traditional approach:
- interdisciplinary approach: adapt rather than transfer
- allow experts to access and understand the results
- computational methods cannot replace experts (assist, not replace)

CLDF: Cross-Linguistic Data Formats
Data must be machine- and human-readable, with unified formats for data storage and exchange.

Doculect         Glottocode  Concept  Concepticon ID  Form   Tokens    Source
Anuta            anut1237    EIGHT    1705            varu   v a r u   POLLEX
East Futunan     east2447    EIGHT    1705            valu   v a l u   POLLEX
Hawaiian         hawa1245    EIGHT    1705            walu   w a l u   ID: 71458
Kapingamarangi   kapi1249    EIGHT    1705            walu   w a l u   POLLEX
Mele Fila        mele1250    EIGHT    1705            ebaru  B a r u   ID: 52375
Nukuria          nuku1259    EIGHT    1705            varu   v a r u   Davletshin (2015)
...              ...         ...      ...             ...    ...       ...
Rapanui          rapa1244    EIGHT    1705            va’u   v a P u   POLLEX
Rennell Bellona  renn1242    EIGHT    1705            bangu  b a Ng u  POLLEX

Data curation is facilitated by:
- spreadsheet formats
- validation software
- benchmark data
- reference catalogs
- online publications
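
To illustrate why a unified, segmented format is useful, here is a minimal sketch in Python that reads a few rows of the table above and lines up the segments position by position. It is not the CLDF or LingPy API: the column names follow the slide rather than any official schema, the function names `read_wordlist` and `correspondence_columns` are made up for this example, and the position-by-position comparison only works because these particular forms all have four segments. Real data would need a proper alignment step.

```python
import csv
import io

# A few rows copied from the table above. The comma-separated layout and the
# column names are assumptions for this sketch, not an official CLDF schema.
WORDLIST_CSV = """Doculect,Glottocode,Concept,Concepticon_ID,Form,Tokens
Anuta,anut1237,EIGHT,1705,varu,v a r u
East Futunan,east2447,EIGHT,1705,valu,v a l u
Hawaiian,hawa1245,EIGHT,1705,walu,w a l u
Kapingamarangi,kapi1249,EIGHT,1705,walu,w a l u
Nukuria,nuku1259,EIGHT,1705,varu,v a r u
"""


def read_wordlist(text):
    """Parse the CSV text into dicts and split Tokens into a list of segments."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["Tokens"] = row["Tokens"].split()
    return rows


def correspondence_columns(rows):
    """Return one tuple of segments per position, across all doculects.

    Simplification: assumes equal-length forms, so no alignment is computed.
    """
    lengths = {len(row["Tokens"]) for row in rows}
    if len(lengths) != 1:
        raise ValueError("forms differ in length; a real alignment step is needed")
    return [tuple(column) for column in zip(*(row["Tokens"] for row in rows))]


rows = read_wordlist(WORDLIST_CSV)
columns = correspondence_columns(rows)

# Print each position with the segment found in each doculect.
doculects = [row["Doculect"] for row in rows]
for position, column in enumerate(columns, start=1):
    print(position, dict(zip(doculects, column)))
```

Because every form is stored as space-separated segments, the first and third positions immediately show the v/w and r/l variation across doculects, which is the kind of pattern an expert would then inspect.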

Future prospects
Our methods are already helping people by automating tedious tasks of comparative linguistics (over 50 publications citing LingPy!). Where can we go from here?
- Educate and train
- Stochastic methods, decision trees, neural networks
- New questions to explore
    - likelihood of random resemblance
    - morphology in cognate identification
    - partial colexifications
    - can intuition be weighted?
    - suprasegmental relationships and segments in their setting
    - less-studied languages, including sign languages
- Litmus test: Sino-Tibetan languages

Germanic Word Formation - I
Research Questions
- What are the interrelations between form, meaning, and frequency?
- Are they system-dependent, culture-specific, or universal?
- How can computer-assisted methods help answer these questions?
Perspectives
- concept-based (onomasiological) vs. form-based (semasiological)
- cross-linguistic vs. language-specific
- quantitative vs. qualitative → computer-assisted

Germanic Word Formation - II
Morphological Complexity in Basic Vocabulary
- question: evolutionary dynamics of lexical change
- concept-based
- language(-family)-specific
Paradigmatic Alternations in Nominal Derivations
- question: causes of paradigmatic alternations
- form-based
- language(-family)-specific
Productivity and Promiscuity in Compounding
- question: language-specific and universal aspects of compoundhood
- concept- and form-based
- cross-linguistic (worldwide) and language-specific

SEAGaL: South East Asia Gene and Language - I
What is the degree of correlation between linguistic and genetic diversity?
[Diagram: groups A, B, and C paired with languages A, B, and C]
WHY?
- Geography?
- Population size?
- Bilingualism?
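
The slides do not say how the correlation between linguistic and genetic diversity would be measured. As one possible illustration, the sketch below uses a Mantel-style permutation test in plain Python: it correlates the pairwise distances of two symmetric distance matrices and estimates how often a random relabelling of the groups gives a correlation at least as high. The choice of test, the function names, and all distance values are assumptions made up for this example; they are not project data or results.

```python
import itertools
import random
import statistics


def upper_triangle(matrix):
    """Flatten the distances above the diagonal of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i, j in itertools.combinations(range(n), 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mantel(a, b, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(a)
    flat_a = upper_triangle(a)
    observed = pearson(flat_a, upper_triangle(b))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel the groups of b by permuting rows and columns together.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical distances between four groups (invented values).
linguistic = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.7, 0.8, 0.3, 0.0],
]
genetic = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.6],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.6, 0.2, 0.0],
]

r, p = mantel(linguistic, genetic)
print(f"r = {r:.3f}, p = {p:.3f}")
```

With only four groups there are just 24 possible relabellings, so the p-value here is coarse; the point of the sketch is the shape of the comparison, not the numbers.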

SEAGaL: South East Asia Gene and Language - II
Sino-Tibetan language family.
Ethnic group
- Genomic: bioinformatics tools
- Linguistic: Lexibank, Concepticon, LingPy
Refine, reassess and reconstruct

Language and geography: the case of Rgyalrongic - I
Rgyalrongic languages (Sino-Tibetan) exhibit a series of orientational prefixes
Traditional approaches
- Fully related to actual topography
- According to river, mountain, sun, etc.
- Inconsistent among scholars: wild guesses
- Not covering all uses
Are orientational prefixes related to actual geography?

Language and geography: the case of Rgyalrongic - II
Our current job is to test how closely orientational prefixes are related to real-world topography
- Selection of 15 familiar places (villages or towns)
- Collection of the prefixes used between every two places
- Draw a map based on the prefixes
- This map represents the collective memory of the speakers
- Compare the inferred map with the actual map
We may get to understand
- The original meaning of the orientational prefixes
- Evolutionary pathways of the orientational prefixes
- How Rgyalrongic (even Sino-Tibetan) ancestors understood and interpreted geography
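
The steps above can be sketched in a much simplified form. The slides do not specify how the map is drawn or compared, so the Python example below makes strong assumptions: it reduces the map to a single axis (elevation), uses the placeholder glosses "UP" and "DOWN" instead of actual Rgyalrongic prefixes, and works with four invented places (P1 to P4) rather than 15. It derives an ordering of the places from the pairwise prefixes and measures how often a prefix agrees with the invented elevations.

```python
from collections import defaultdict

# Hypothetical elicitation data: the prefix used when going from the first
# place to the second. "UP"/"DOWN" are placeholder glosses, not real prefixes.
PREFIXES = {
    ("P1", "P2"): "DOWN",  # deliberately disagrees with the elevations below
    ("P1", "P3"): "UP",
    ("P2", "P3"): "UP",
    ("P1", "P4"): "DOWN",
    ("P2", "P4"): "DOWN",
    ("P3", "P4"): "DOWN",
}

# Invented elevations in metres, standing in for the actual map.
ELEVATION = {"P1": 2200, "P2": 2500, "P3": 2900, "P4": 2000}


def inferred_ranking(prefixes):
    """Order places from lowest to highest using only the prefixes.

    Each "UP" raises the destination and lowers the origin by one point;
    each "DOWN" does the opposite. Ties are broken alphabetically.
    """
    score = defaultdict(int)
    for (origin, destination), prefix in prefixes.items():
        step = 1 if prefix == "UP" else -1
        score[destination] += step
        score[origin] -= step
    return sorted(score, key=lambda place: (score[place], place))


def agreement(prefixes, elevation):
    """Share of place pairs whose prefix matches the actual elevation change."""
    matches = 0
    for (origin, destination), prefix in prefixes.items():
        goes_up = elevation[destination] > elevation[origin]
        if (prefix == "UP") == goes_up:
            matches += 1
    return matches / len(prefixes)


inferred = inferred_ranking(PREFIXES)
actual = sorted(ELEVATION, key=ELEVATION.get)
rate = agreement(PREFIXES, ELEVATION)

print("inferred order:", inferred)
print("actual order:  ", actual)
print(f"pairwise agreement: {rate:.2f}")
```

In this toy data the inferred order swaps P1 and P2 relative to the elevations, which is the kind of mismatch the comparison of the inferred map with the actual map is meant to surface. A real analysis would need two dimensions and more than one orientational contrast.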