Computer-assisted approaches in historical and typological language comparison

Johann-Mattis List

Introduction
• While historical linguistics and linguistic typology have long been based on data collection and analysis, the classical approaches, which pre-date the era of computers, have recently been challenged, as reflected in the quantitative turn in diversity linguistics, which started around the turn of the millennium.
• The challenge has created a gap between the classical (or traditional) and the computational (quantitative) approaches: classical linguists are often skeptical of the new approaches, while the defenders of the new approaches complain about a lack of acceptance among classical linguists.

Classical Approaches
historical linguistics
- new insights into the history of the world’s largest language families
- initial classification of the world’s languages into 300-400 language families
- accumulation of comparative data for many of the world’s languages
linguistic typology
- insights into language universals and cross-linguistic tendencies
- large surveys on grammatical categories in prose
- accumulation of structural data for many of the world’s languages

Computer-Based Approaches
computational historical linguistics
- new insights into the history of the world’s largest language families
- initial classification of the world’s languages into 300-400 language families
- accumulation of comparative data for many of the world’s languages
computational typology
- imputation of features in structural data
- accumulation of structural data for many of the world’s languages

Computer-Assisted Approaches
- Efficiency: Classical approaches cannot cope with the large amount of data that has been assembled so far.
- Accuracy: Computational approaches cannot compete with the fine-grained analyses that trained experts can deliver in historical and typological language comparison.
- Consistency: Classical approaches are hard to learn and hard to apply, and scholars may differ drastically in their respective interpretations of the classical canon of methods.
- Flexibility: Computational approaches have serious difficulty dealing with new problems that have not been addressed by means of computational methods before.

Computer-Assisted Approaches
- Efficiency: Computational methods are very efficient and can easily cope with large amounts of data.
- Accuracy: Classical approaches are not error-free, but they show a high accuracy, as witnessed by the success story of historical language comparison.
- Consistency: Computational approaches are straightforward in their application, clear-cut with respect to the tasks they fulfill, and minimally biased with respect to the data they are applied to.
- Flexibility: Classical approaches can take many different kinds of evidence into account, thus reflecting the old idea of consilience in the historical sciences, where one draws on different kinds of evidence to arrive at a unified picture.
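
The division of labor suggested by these complementary strengths can be illustrated with a minimal sketch: a program proposes cognate sets quickly and consistently, and an expert then corrects the proposal. The sketch below is not the method used in the CALC project. It swaps in plain normalized edit distance with single-linkage clustering, and the function names, the threshold of 0.5, and the orthographic toy forms are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (or segment sequences)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer form."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


def propose_cognates(words, threshold=0.5):
    """Automatic step: single-linkage clustering of the words for one concept.

    `words` maps a language name to a word form (hypothetical toy input).
    Returns a dict from language name to an integer cluster id.
    """
    cluster = {lang: idx for idx, lang in enumerate(sorted(words))}
    for a, b in combinations(sorted(words), 2):
        if normalized_distance(words[a], words[b]) <= threshold:
            old, new = cluster[b], cluster[a]
            for lang in cluster:
                if cluster[lang] == old:
                    cluster[lang] = new
    return cluster


def apply_expert_judgments(cluster, judgments):
    """Manual step: expert decisions override the automatic proposal."""
    corrected = dict(cluster)
    corrected.update(judgments)
    return corrected


# Toy data for the concept "tooth" in plain orthography (illustration only).
words = {"English": "tooth", "German": "zahn", "Dutch": "tand", "Swedish": "tand"}
proposal = propose_cognates(words)

# The surface forms differ too much for the automatic step, so the expert
# assigns English and German to the same set as Dutch and Swedish.
final = apply_expert_judgments(proposal, {"English": 0, "German": 0})
```

In this toy case the automatic step groups only the identical Dutch and Swedish forms, while the expert judgment, which draws on knowledge of regular sound correspondences, joins all four words into one cognate set.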

Computer-Assisted Approaches: CALC Project
- http://calc.digling.org
- ERC Starting Grant (2017-2022)
- 2 post-docs, 2 doctoral students, 1 PI
- Areal focus (through collaborations):
  - South-American languages (Tukano, Pano, Gran Chaco)
  - African languages (Dogon)
  - Sino-Tibetan languages (Sinitic, Burmish)
  - Sign languages (European origin)
  - Indo-European (Germanic)
- Content focus (through expertise):
  - lexical data
  - annotation, visualization, interfaces, standardization
  - algorithms for language comparison

Computer-Assisted Approaches: CALC Project
Current work in the CALC project: Papers from 2018 until now
- Reflex prediction (with T. A. Bodt, Western Kho-Bwa, Bodt and List 2019)
- Borrowing detection (with A. Hantgan, Dogon, Bangime, Hantgan and List forthcoming)
- Annotation of rhyme judgments (with N. W. Hill and C. Forster, List et al. 2019)
- Fast cognate detection algorithms (with T. Rama, Rama and List 2019)
- Sino-Tibetan subgrouping (with L. Sagart, G. Jacques, R. Ryder, V. Thouzeau, S. Greenhill, and Y. Lai, Sagart et al. 2019)
- Inference of sound correspondence patterns (List 2019)
- Cross-linguistic transcription systems (with C. Anderson, R. Forkel, T. Tresoldi, Anderson et al. 2019)
- Cross-linguistic data formats (with R. Forkel, C. Rzymski, S. Greenhill, Forkel et al. 2019)
- CLICS² (with S. Greenhill, T. Tresoldi, R. Forkel, List et al. 2018)

Computer-Assisted Approaches: CALC Project
Ongoing work in the CALC project: Databases and Standards
- Cross-linguistic data formats (with R. Forkel, C. Rzymski, S. Greenhill, et al., https://cldf.clld.org, since 2018)
- Cross-linguistic transcription systems (with C. Anderson, R. Forkel, T. Tresoldi, et al., https://clts.clld.org, since 2018)
- Database of Cross-Linguistic Colexifications (with R. Forkel, T. Tresoldi, C. Rzymski, et al., https://clics.clld.org, since 2014)
- Concepticon (with N. Schweikhard, R. Forkel, S. Greenhill, T. Tresoldi, et al., https://concepticon.clld.org, since 2016)
- EvoBib: Bibliographic database and quote collection (http://calc.digling.org/evobib, since 2014)
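
To make the notion of colexification behind the colexification database concrete: two concepts are colexified in a language when that language expresses both with the same word form. The following is a minimal sketch of how such cases could be counted across languages. It is not the actual CLICS workflow; the function name, the (language, concept, form) triple layout, and the simplified transliterations in the toy data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def count_colexifications(wordlist):
    """Count, per concept pair, the languages that use one form for both.

    `wordlist` is an iterable of (language, concept, form) triples.
    Returns a dict from a sorted concept pair to the number of languages.
    """
    # Group the concepts expressed by each form within each language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[language, form].add(concept)

    # Every pair of concepts sharing a form counts once per language.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


# Toy wordlist in simplified orthography (illustration only).
wordlist = [
    ("Russian", "ARM", "ruka"),
    ("Russian", "HAND", "ruka"),
    ("Russian", "TREE", "derevo"),
    ("Russian", "WOOD", "derevo"),
    ("Danish", "ARM", "arm"),
    ("Danish", "HAND", "hånd"),
    ("Danish", "TREE", "træ"),
    ("Danish", "WOOD", "træ"),
    ("English", "TREE", "tree"),
    ("English", "WOOD", "wood"),
]
counts = count_colexifications(wordlist)
```

With this toy input, the pair ARM/HAND is counted for one language and TREE/WOOD for two. A realistic workflow would additionally need concepts linked to a shared reference list and forms in a comparable transcription, which is where resources such as Concepticon and the cross-linguistic transcription systems come in.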

Computer-Assisted Approaches: CALC Project
Ongoing work in the CALC project: Software and Interfaces
- LingPy: Python library for quantitative tasks in historical linguistics (with R. Forkel, et al., http://lingpy.org, since 2014)
- PoePy: Python library for handling annotated rhymes (https://github.com/lingpy/poepy/, since 2019)
- SinoPy: Python library for quantitative tasks in Chinese historical linguistics (https://github.com/lingpy/sinopy/, since 2018)
- LingRex: Linguistic reconstruction with LingPy (https://github.com/lingpy/lingrex/, since 2018)
- EDICTOR: Web-based tool for the curation of etymological data (http://edictor.digling.org, since 2017)

Computer-Assisted Approaches: CALC Project
Ongoing (unpublished) work in the CALC project:
- Subgrouping of sign languages (with J. Power and G. Grimm, Power et al. under review)
- Large-scale inference of borrowings
- Workflows for computer-assisted language comparison (with M.-S. Wu)
- Subgrouping of Tukano (with T. Tresoldi and T. Chacon)
- Subgrouping of Pano (with R. Zariquiey et al.)
- Standards for inter-linear-glossed texts (with N. Sims)
- Reconstruction of Proto-Burmish (with N. W. Hill)
- Open problems in computational diversity linguistics (series of blogposts, addressing 10 problems to be discussed throughout 2019)
- Morpheme-segmented database (with N. Schweikhard and potentially also A. Tjuka)

Topics in Computer-Assisted Language Comparison: Modeling, Inference, and Analysis
• Modeling is the first step of all scientific endeavor, since we cannot understand any phenomenon without models.
• Inference helps us to use the models to accumulate more data.
• Analysis is the step in which we use the inferences to test the original models.

Topics in Computer-Assisted Language Comparison: Particular and General Language Comparison (based on ideas presented by M. Haspelmath)
• Particular language comparison does not claim to solve anything about all languages of the world, or about the language faculty in an abstract sense. Instead, it addresses a particular language or language family, its history, or its typological characteristics.
• General language comparison should aim at cross-linguistic trends and patterns, but also at methods that can be applied to all human languages, regardless of where they come from.

Topics in Computer-Assisted Language Comparison: Open Problems
Many problems in the field of computational and computer-assisted diversity linguistics remain open to this day. In order to solve them, it is useful to write them down once in a while and to try to formalize them. In this way, one can get a better understanding of a problem, share it with colleagues, and determine why it actually counts as difficult.

Topics in Computer-Assisted Language Comparison: Data and Code Management and Data and Code Sharing
With great data comes great responsibility! Unfortunately, it is still the exception rather than the rule for scholars to share data and code in a FAIR (Wilkinson et al. 2016) form: findable, accessible, interoperable, and reusable. At times, scholars do not even share their data at all. It should be obvious that this does not reflect good scientific practice, especially since most scholars profit from the generosity of colleagues who share their data. Ideally, diversity linguists would sit together in the future and establish guidelines and recommendations for authors, reviewers, and publishers, to help those who work with data and code to share them correctly.

Topics in Computer-Assisted Language Comparison: Interfaces, Standards, and Annotation
Interfaces, standards, and annotation play an increasingly important role in data curation. The largest and most successful projects launched during the last decades (Reflex, STEDT, Tower of Babel, WOLD, WALS, GLD) all make use of specifically designed interfaces that help scholars to curate and analyze their data. For the future, we should think of ways to unify the exchange of data between these interfaces. We should also work towards general and free interfaces that all scholars can use independently of their platform, and towards community-wide standards of annotation which guarantee that the data remain analyzable even if one does not have access to certain interfaces.
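
One way to keep annotated data analyzable without any particular interface is to exchange it as plain tabular text with an explicit, fixed set of columns. The sketch below shows a round trip through such a file using only the Python standard library. The column names are loosely inspired by tabular wordlist formats, and the `Cognate_ID` column as well as both function names are hypothetical choices for illustration, not a specification of any existing standard.

```python
import csv
import io

# Assumed column set for this sketch (not taken from any official standard).
COLUMNS = ["ID", "Language_ID", "Parameter_ID", "Form", "Cognate_ID"]


def dump_wordlist(rows):
    """Serialize a list of dicts to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_wordlist(text):
    """Parse tab-separated text and check that all expected columns exist."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return [dict(row) for row in reader]


# Toy annotated wordlist (illustration only); all values are kept as strings.
rows = [
    {"ID": "1", "Language_ID": "dutch", "Parameter_ID": "tooth",
     "Form": "tand", "Cognate_ID": "1"},
    {"ID": "2", "Language_ID": "swedish", "Parameter_ID": "tooth",
     "Form": "tand", "Cognate_ID": "1"},
]
text = dump_wordlist(rows)
restored = load_wordlist(text)
```

Because the annotation (here the cognate judgment) travels in an ordinary column, any tool that reads tab-separated text can recover it, and a missing column is reported instead of being silently ignored.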

Outlook
• The machines won’t replace human linguists (if they do, they will replace everybody, and nobody will know it from inside the matrix).
• Computational approaches, combined with the power of expertise and intuition, can make a huge difference in our field.
• The degree to which historical linguistics and linguistic typology are already making active use of computational and computer-assisted approaches shows the importance of these fields in comparison to general linguistics. It may also qualify computer-assisted approaches to historical and typological language comparison as role models for data-driven research in the humanities.

Thank you for listening!
Thanks to our collaborators:
- CALC group: T. Tresoldi, Y. Lai, N. E. Schweikhard, M.-S. Wu
- Associates (DLCE): R. Forkel, S. J. Greenhill, R. D. Gray, C. Rzymski, and many others
- Collaborators (external): A. Hantgan, J. Power, R. Zariquiey, N. W. Hill, L. Sagart, G. Jacques, T. Rama, G. Jäger, and many others
