Representing concepts for the purpose of cross-linguistic language comparison

Representing concepts for the purpose of cross-linguistic language comparison Johann-Mattis
List Research Group “Computer-Assisted Language Comparison” Department of Linguistic and Cultural Evolution Max Planck Institute for the Science of Human History Jena, Germany 2020/09/23 very long title P(A|B)=P(B|A)... 1 / 32

Background 2 / 32 Comparative Linguistics

Background 2 / 32 "All languages change, as long as
they exist." (August Schleicher 1863) walkman Indo-European Germanic Old English English p f f f ə a æ ɑː t d d ð eː eː e ə r r r r Germanic German English iPod Comparative Linguistics

Background 2 / 32 iPod Indo-European Germanic Old English English
p f f f ə a æ ɑː t d d ð eː eː e ə r r r r Germanic German English walkman "All languages change, as long as they exist." (August Schleicher 1863) Comparative Linguistics

Background 2 / 32 walkman Indo-European Germanic Old English English
p f f f ə a æ ɑː t d d ð eː eː e ə r r r r Germanic German English iPod "All languages change, as long as they exist." (August Schleicher 1863) Comparative Linguistics

Background Background Background on Language Comparison 3 / 32 Icelandic
Old Indian Old Greek Latin Sanskrit Jacob Grimm Rasmus Rask Undersøgelse om det gamle Nordiske Sprogs Oprindelse 1818 Deutsche Grammatik (Ausgabe II) 1822

Background Background Background on Language Comparison 3 / 32 Icelandic
Old Indian Old Greek Latin Sanskrit Indo-European Method for Language Comparison • intensive language comparison • identify regularly recurring similaritities → prove language relationship → reconstruct development of language families Jacob Grimm Rasmus Rask Undersøgelse om det gamle Nordiske Sprogs Oprindelse 1818 Deutsche Grammatik (Ausgabe II) 1822

Background Comparative Method The Comparative Method 4 / 32

Background Computational Linguistics Computational Historical Linguistics 5 / 32 problems
of computational approaches → lack of ﬂexibility → lack of accuracy → often rely on manually annotated data → produce results in a black-box fashion Breton d - ã n t - Danish d̥ʰ - a n - - Dutch t - ɑ n t - English t - uː - θ - French d - ã - - - German t͜s - aː n - - Greek ð - o̞ n d i Italian d - ɛ n t e Portuguese d - ẽ - t ɨ Spanish d j e n t e /-French | | /-Greek_Mod | | ----| /---| /-Portuguese | | | | | | \---| /-Italian | | | /---| | | | | \-Spanish \---| \---| | | /-Breton | \---| | \-Dutch | | /-English \---| | /-Danish \---| \-German phonetic alignment (List 2012, 2014) phylogenetic reconstruction

6 / 32 The CALC Project

6 / 32 Language families like Sino-Tibetan present "almost unsurmountable
obstacles". (Antoine Meillet 1925) insights → language change → human prehistory → triggers of diversity of life and culture → classical methods reach their limit → computational methods cannot replace experts' experience and intuition obstacles increasing amounts of data historical language comparison large and diverse language families challenges The CALC Project

The CALC Project Starting Point Classical and Computer-Based Language Comparison
7 / 32 LC CA lacks efficiency consistency efficiency accuracy COMPA- RATIVE METHOD COMPUTA- TIONAL HISTORICAL LINGUISTICS flexibility

The CALC Project CALC Computer-Assisted Language Comparison 8 / 32
LC CA lacks efficiency consistency efficiency accuracy COMPA- RATIVE METHOD COMPUTA- TIONAL HISTORICAL LINGUISTICS flexibility

very long title P(A|B)=P(B|A)... Funding: ERC Starting Grant (2017-2022) Host Institution: MPI-SHH (Jena) Team: 2 Post-Docs, 4 Docs (2 financed by project, 2 financed externally), PI Goal: establish a framework for CALC and show how to apply it to the Sino-Tibetan language family. https://digling.org/calc/

10 / 32 Basics

Basics of CALC Software LingPy 11 / 32 SOFTWARE >>>
from lingpy import * >>> wl = Wordlist('tst') >>> wl.coverage() >>> wl.align() Python Library ✓ over 85 publications based on the software ✓ multiple phonetic alignments (List 2014 ) ✓ automatic cognate detection (List et al. 2017) ✓ correspondence pattern identification (List 2019) State of the Art High Accuracy *h₂ - multiple phonetic alignments (List 2014): - automatic cognate detection (List et al. 2017): - phylogenetic reconstruction (Rama et al. 2018): - correspondence pattern identification (List 2019): 98% (pair scores) 89% (B-Cubed scores) 0.08 (Gen. Quart. Dist.) NP-hard (no human attempts) Ling Py.org

Basics of CALC Interfaces EDICTOR 12 / 32 INTERFACES ID
DOCULECT CONCEPT SEGMENTS N U O ? wOld yuE_5_1liaN_1 moon moon moon moon Běijīng Guǎngzhōu Měixiàn Fúzhōu 1 2 3 4 Conversion and Segmentation Highlighting of Unrecognized Phonetic Symbols yuE_5_1liaN_1 yɛ⁵¹liɑŋ¹ y ɛ ⁵¹ l i ɑ ŋ ¹ annotate data analyze data edit alignments Etymological DICTionary ediTor http://edictor.digling.org List (2017) E D T

Basics of CALC Data Data 13 / 32 DATA CLDF
>>> from pycldf import * >>> ds = Dataset('path') >>> ds.validate() >>> ds.statistics() Validation Software ID CONCEPT IPA COGNACY 1 hand hant 1 2 hand hænd 1 3 ruka ruka 2 4 rẽnka rẽnka 2 ... ... ... ... Spreadsheet Formats Online publication with CLLD pypi.org/project/pycldf/ Glottolog arbitrarité Concepticon CLTS Languages Concepts Speech sounds CLTS siː əl tiː əs cldf.clld.org w3.org/2013/csvw/ Reference catalogs Cross-Linguistic Data Formats Initiative (Forkel, List et al. 2018)

Basics of CALC Data Data 13 / 32 DATA

14 / 32 Examples

14 / 32 Semantic Colexiﬁcation Networks Comparison Data Linking Concepts
Text Integrating Concepts Examples

Examples Linking Concepts Linking Concepts: Starting Point In the past
centuries, scholars have been producing a large amount of concept lists. A concept list is in its simples form a list of concepts (e.g., I, you, he/she, dog, cat) which scholars find interesting for some linguistic, anthropological, or cognitive study. Starting with the work by Morris Swadesh, who proposed basic vocabulary as a concept important for historical linguistics, the compilation of concept lists has increased even more. For a very long time, scholars would just ignore the abundance of different concept lists produced in different fields and never try to systematically compare them. 15 / 32

Examples Linking Concepts Linking Concepts: Data and Analysis In 2016,
we published the first version of the Concepticon project (List et al. 2016, https://concepticon.clld.org), the first attempt to link the numerous concept lists which have been compiled so far. We link concept lists by defining Concept Sets, that is, abstract concepts which are given a unique ID and a gloss (to ease elicitation) along with a definition and (potentially) additional metadata. All items of a given concept list are linked to the Concepticon Concept Sets where possible. By now, Concepticon has 3755 Concept Sets and links to 310 different concept lists. 16 / 32

Examples Linking Concepts Linking Concepts: Data and Analysis We have
regularly maintained and updated the Concepticon since 2016. By now, we have a team of about 8-10 regular contributors. All concept lists that are added to the project are rigorously checked in a code-based review procedure along with computational checks for internal consistency. New lists can be automatically linked to the Concepticon and later manually refined (this works in up to 10 different languages). Concepticon is the basic reference catalog for concepts and elicitation glosses as underlying the Cross-Linguistic Data Formats initiative (Forkel et al. 2018, https://cldf.clld.org). 17 / 32

Examples Linking Concepts Linking Concepts: Results Concepticon is increasingly used
by scholars who want to establish their own questionnaires or surveys for lexical data of the languages of the world. Concepticon is the core component that allowed for the relaunch of the CLICS database (see Semantic Networks, next example). The data is growing at a steady paste and the procedures for error-checking and evaluation are constantly being refined. Our code-based data curation approach has shown to be very eﬀicient for projects with a long-term goal. Individual issues of defining concepts in the way in which we do this in Concepticon have been disseminated in form of discussions in Blog posts (e.g., List 2018). 18 / 32

Examples Linking Concepts Linking Concepts: Plans Version 2.4 is supposed
to bring another larger extension of the Concepticon project by even more concept lists. We work on an integration of Concepticon with the NoRaRe database (last example in this talk). We pursue initial experiments that enhance our automated mapping algorithm (also considering the use of machine learning technologies), which is needed to provide access to Concepticon data for those projects that work with a lot of data (e.g., NLP projects). 19 / 32

Examples Semantic Networks Semantic Networks: Starting Point 20 / 32
forest tree wood stem branch root French fɔʀɛ bwɑ aʀbrə bwɑ tʀɔ bʀɑʃ ʀasin Russian lʲes dʲerɪva dʲerɪva stvɔl vʲetvʲ kɔrɪnʲ Croatian ʃuma staːblɔ dr ɔ staːblɔ graːna kɔriɛn Yukaghir aːnmonilʲe saːl saːl tʃilge tʃilge waruluː Yaqui dʒuja dʒuja kuta naːwa budʒa naːwa , v 1 1 2 1 1 1 Colexiﬁcation Collective term for polysemy and homophony

forest tree wood stem branch root French fɔʀɛ bwɑ aʀbrə bwɑ tʀɔ bʀɑʃ ʀasin Russian lʲes dʲerɪva dʲerɪva stvɔl vʲetvʲ kɔrɪnʲ Croatian ʃuma staːblɔ dr ɔ staːblɔ graːna kɔriɛn Yukaghir aːnmonilʲe saːl saːl tʃilge tʃilge waruluː Yaqui dʒuja dʒuja kuta naːwa budʒa naːwa , v 1

forest tree wood stem branch root French fɔʀɛ bwɑ aʀbrə bwɑ tʀɔ bʀɑʃ ʀasin Russian lʲes dʲerɪva dʲerɪva stvɔl vʲetvʲ kɔrɪnʲ Croatian ʃuma staːblɔ dr ɔ staːblɔ graːna kɔriɛn Yukaghir aːnmonilʲe saːl saːl tʃilge tʃilge waruluː Yaqui dʒuja dʒuja kuta naːwa budʒa naːwa , v 1 1 2

forest tree wood stem branch root French fɔʀɛ bwɑ aʀbrə bwɑ tʀɔ bʀɑʃ ʀasin Russian lʲes dʲerɪva dʲerɪva stvɔl vʲetvʲ kɔrɪnʲ Croatian ʃuma staːblɔ dr ɔ staːblɔ graːna kɔriɛn Yukaghir aːnmonilʲe saːl saːl tʃilge tʃilge waruluː Yaqui dʒuja dʒuja kuta naːwa budʒa naːwa , v 1 1 2 1 1 1

Examples Semantic Networks Semantic Networks: Data and Analysis 21 /
32 INTERFACES SOFTWARE DATA Database of Cross-Linguistic Colexiﬁcations CLICS https://clics.clld.org

32 INTERFACES SOFTWARE DATA Database of Cross-Linguistic Colexiﬁcations CLICS https://clics.clld.org Interactive web application for browsing the data

32 INTERFACES SOFTWARE DATA Database of Cross-Linguistic Colexiﬁcations CLICS https://clics.clld.org Interactive web application for browsing the data Test-based data lifting and curation CLDF

Examples Semantic Networks Semantic Networks: Results 22 / 32 List,
J.-M., A. Terhalle, and M. Urban (2013): Using network approaches to enhance the analysis of cross-linguistic polysemies. In: Proceedings of the 10th International Conference on Computational Semantics -- Short Papers. Association for Computational Linguistics 347-353.

Examples Semantic Networks Semantic Networks: Results 22 / 32 Mayer,
T., J.-M. List, A. Terhalle, and M. Urban (2014): An interactive visualization of cross-linguistic colexiﬁcation patterns. In: Visualization as added value in the development, use and evaluation of Linguistic Resources. Workshop organized as part of the International Conference on Language Resources and Evaluation. 1-8.

Examples Semantic Networks Semantic Networks: Results 22 / 32 List,
J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (2018): CLICS². An improved database of cross-linguistic colexiﬁcations assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2. 277-306.

Examples Semantic Networks Semantic Networks: Results 22 / 32 Rzymski,
C., T. Tresoldi, S. Greenhill, M. Wu, N. Schweikhard, M. Koptjevskaja-Tamm, V. Gast, T. Bodt, A. Hantgan, G. Kaiping, S. Chang, Y. Lai, N. Morozova, H. Arjava, N. Hübler, E. Koile, S. Pepper, M. Proos, B. Epps, I. Blanco, C. Hundt, S. Monakhov, K. Pianykh, S. Ramesh, R. Gray, R. Forkel, and J.-M. List (2020): The Database of Cross-Linguistic Colexiﬁcations, reproducible analysis of cross- linguistic polysemies. Scientiﬁc Data 7.13. 1-12.

Examples Semantic Networks Semantic Networks: Results 22 / 32 CLICS¹
(2014) CLICS² (2018) CLICS³ (2020)

Examples Semantic Networks Semantic Networks: Results 22 / 32 Jackson,
J., J. Watts, T. Henry, J.-M. List, P. Mucha, R. Forkel, S. Greenhill, and K. Lindquist (2019): Emotion semantics show both cultural variation and universal structure. Science 366.6472. 1517-1522.

Examples Semantic Networks Semantic Networks: Results 22 / 32

Examples Semantic Networks Semantic Networks: Plans Expanding colexification analyses to
include partial colexifications and directed networks. Creating partial colexification data for testing and training. Conducting targeted colexification studies. 23 / 32

Examples Integrating Concepts Integrating Concepts: Starting Point 24 / 32
Tjuka et al. (under review): 10.31234/osf.io/tgw3z

Examples Integrating Concepts Integrating Concepts: Starting Point There is a
wealth of data about concepts produced by historical linguists, corpus linguistics, computational linguistics, and psycholinguists. These data are rarely properly integrated. But if they were integrated with resources like the Concepticon, this would be fantastic, since it would offer us a large amount of new possibilities for our research. 24 / 32

Examples Integrating Concepts Integrating Concepts: Data and Analysis 25 /
32 Tjuka et al. (under review): 10.31234/osf.io/tgw3z

Examples Integrating Concepts Integrating Concepts: Data and Analysis We apply
our workflow for test-driven data curation to publicly available datasets which provide norms, ratings, or relations for concepts and words. We distinguish manually, semi-automatically, and automatically mapped resources (based on structure and size). We normalize the original data by tagging the columns and making them comparable across the different source datasets. 25 / 32

Examples Integrating Concepts Integrating Concepts: Results First version submitted and
released (Tjuka, Forkel, and List, under review, https://digling.org/norare/). 71 datasets from which 415 word and concept properties could be derived. Data curation workflow could be successfully evaluated (building also on our experience with Concepticon). Data applicability is largely enhanced thanks to the pynorare software API that allows for a quick comparison, but the data can also be easily analyzed with the help of R. 26 / 32

Examples Integrating Concepts Integrating Concepts: Plans Annika Tjuka (first author
of the NoRaRe database) started to carry out different tests of the norms, ratings, and relations in NoRaRe and will pursue doing this. Expanding the database by adding specifically corpus data (e.g., for parallel bible corpus studies) and data from NLP studies (word embeddings). Enhancing the concept mapping algorithms (experiments with Christoph Rzymski). Integrating NoRaRe with the Concepticon web presentation (with Robert Forkel). 27 / 32

28 / 32 Outlook *deh3 - ?

Outlook Ongoing Projects Ongoing Projects Expanding CLICS as part of
our lexibank initiative to lift and retro-standardize lexical data for the purpose of cross-linguistic comparison (see, among others, Forkel and List 2020: CLDFBench). Discussing further integration with psychological approaches by pushing language analysis (Jackson et al. under review). Enhanced approaches to the annotation of colexifications in lexical datasets (with Roberto Zariquiey, based on work presented in Schweikhard and List 2020). 29 / 32

Outlook Planned Projects Planned Projects An extended study on the
semantics of body parts from the perspective of linguistic diversity (work with Annika Tjuka and Damián Blasi). Semantics underlying terms for body and mind (work led by MacCormack and Jackson in collaboration with Watts, and Henry). Creating enhanced, manually annotated datasets for the study of partial colexifications (work with Nathanael Schweikhard). 30 / 32

Outlook Possibilities Possibilities Based on work by Urban (2011), we
can design an approach to detect partial colexifications in a cross-linguistic collection of lexical datasets. Unlike Urban’s claim, these networks reflect both metonymic and metaphorical relations among concepts across multiple languages. Pilot studies show promising results with respect to network structures. 31 / 32

Outlook Possibilities Possibilities 31 / 32 List (in preparation)

32 / 32 Thanks to all who do research with
our group and shared ideas, code, and data with us in the past: Cormac Anderson, Timotheus Bodt, Doug Cooper, Simon J. Greenhill, Russell D. Gray, Robert Forkel, Yunfan Lai, Nathan W. Hill, Jessica K. Ivani, Yunfan Lai, Christoph Rzymski, Nathanael E. Schweikhard, Tiago Tresoldi, and Mei-Shin Wu. Many thanks to the European Research Council for supporting the project "Computer-Assisted Language Comparison" as part of the H2020 Funding Schema in the form of an ERC Starting Grant (2017-2022). Thank You for Listening LC CA COMPUTA- TIONAL HISTORICAL LINGUISTICS COMPA- RATIVE METHOD

Representing concepts for the purpose of cross-...

Representing concepts for the purpose of cross-linguistic language comparison

More Decks by Johann-Mattis List

Other Decks in Science

Featured

Transcript