Representing concepts for the purpose of cross-linguistic language comparison

Keynote held at CARLA 2020, the second international workshop on "Concepts in Action: Representation, Learning, and Application" (2020-09-23, virtual conference, Bolzano, University of Osnabrück).

Johann-Mattis List

September 23, 2020

Transcript

  1. Representing concepts for the purpose of cross-linguistic language comparison 1 / 32

    Johann-Mattis List, Research Group “Computer-Assisted Language Comparison”, Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany. 2020/09/23
  2. Background: Comparative Linguistics 2 / 32

    "All languages change, as long as they exist." (August Schleicher 1863)

    Illustrations on the slide: walkman, iPod; Germanic, German, English.

    Indo-European   Germanic   Old English   English
    p               f          f             f
    ə               a          æ             ɑː
    t               d          d             ð
    eː              eː         e             ə
    r               r          r             r
  7. Background: Background on Language Comparison 3 / 32

    Rasmus Rask: Undersøgelse om det gamle Nordiske Sprogs Oprindelse (1818)
    Jacob Grimm: Deutsche Grammatik (Ausgabe II) (1822)

    Labels on the slide: Icelandic, Old Indian, Old Greek, Latin, Sanskrit, Indo-European

    Method for Language Comparison
    • intensive language comparison
    • identify regularly recurring similarities
    → prove language relationship
    → reconstruct development of language families
  10. Background: The Comparative Method 4 / 32

  13. Background: Computational Historical Linguistics 5 / 32

    Problems of computational approaches
    → lack of flexibility
    → lack of accuracy
    → often rely on manually annotated data
    → produce results in a black-box fashion

    Phonetic alignment (List 2012, 2014):

    Breton      d     -  ã   n  t  -
    Danish      d̥ʰ    -  a   n  -  -
    Dutch       t     -  ɑ   n  t  -
    English     t     -  uː  -  θ  -
    French      d     -  ã   -  -  -
    German      t͜s    -  aː  n  -  -
    Greek       ð     -  o̞   n  d  i
    Italian     d     -  ɛ   n  t  e
    Portuguese  d     -  ẽ   -  t  ɨ
    Spanish     d     j  e   n  t  e

    Phylogenetic reconstruction: [tree figure with the leaves French, Greek_Mod, Portuguese, Italian, Spanish, Breton, Dutch, English, Danish, German]
  16. 6 / 32 The CALC Project

  17. The CALC Project 6 / 32

    Language families like Sino-Tibetan present "almost unsurmountable obstacles". (Antoine Meillet 1925)

    Historical language comparison

    Insights
    → language change
    → human prehistory
    → triggers of diversity of life and culture

    Challenges
    • increasing amounts of data
    • large and diverse language families

    Obstacles
    → classical methods reach their limit
    → computational methods cannot replace experts' experience and intuition
  20. The CALC Project: Starting Point: Classical and Computer-Based Language Comparison 7 / 32

    [Diagram contrasting the comparative method with computational historical linguistics. Labels: lacks, efficiency, consistency, efficiency, accuracy, flexibility]
  22. The CALC Project: Computer-Assisted Language Comparison (CALC) 8 / 32

    [Diagram contrasting the comparative method with computational historical linguistics. Labels: lacks, efficiency, consistency, efficiency, accuracy, flexibility]
  24. The CALC Project: Computer-Assisted Language Comparison (CALC) 9 / 32

    Funding: ERC Starting Grant (2017-2022)
    Host Institution: MPI-SHH (Jena)
    Team: 2 Post-Docs, 4 Docs (2 financed by project, 2 financed externally), PI
    Goal: establish a framework for CALC and show how to apply it to the Sino-Tibetan language family.
    https://digling.org/calc/
  25. 10 / 32 Basics

  31. Basics of CALC: Software: LingPy 11 / 32

    Python library (LingPy.org):

    >>> from lingpy import *
    >>> wl = Wordlist('tst')
    >>> wl.coverage()
    >>> wl.align()

    State of the art
    ✓ over 85 publications based on the software
    ✓ multiple phonetic alignments (List 2014)
    ✓ automatic cognate detection (List et al. 2017)
    ✓ correspondence pattern identification (List 2019)

    High accuracy
    - multiple phonetic alignments (List 2014): 98% (pair scores)
    - automatic cognate detection (List et al. 2017): 89% (B-Cubed scores)
    - phylogenetic reconstruction (Rama et al. 2018): 0.08 (Gen. Quart. Dist.)
    - correspondence pattern identification (List 2019): NP-hard (no human attempts)
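The B-Cubed scores quoted for automatic cognate detection compare a predicted partition of words into cognate sets with a gold standard. The following is a minimal sketch in plain Python rather than LingPy's own implementation; the function name `b_cubed` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def b_cubed(gold, pred):
    """Return B-Cubed precision, recall, and F-score for two clusterings.

    `gold` and `pred` map each word ID to a cluster label.
    """
    if set(gold) != set(pred):
        raise ValueError("gold and pred must cover the same items")

    def members(labels):
        # Map every item to the full set of items sharing its label.
        clusters = defaultdict(set)
        for item, label in labels.items():
            clusters[label].add(item)
        return {item: clusters[label] for item, label in labels.items()}

    gold_sets, pred_sets = members(gold), members(pred)
    precision = recall = 0.0
    for item in gold:
        overlap = len(gold_sets[item] & pred_sets[item])
        precision += overlap / len(pred_sets[item])
        recall += overlap / len(gold_sets[item])
    precision /= len(gold)
    recall /= len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (hypothetical): four words, gold cognate sets {1, 2} and {3, 4}.
gold = {1: "a", 2: "a", 3: "b", 4: "b"}
pred = {1: "x", 2: "x", 3: "x", 4: "y"}
p, r, f = b_cubed(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because every item overlaps at least with itself, precision and recall are never zero, so the harmonic mean is always defined.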
  32. Basics of CALC: Interfaces: EDICTOR 12 / 32

    EDICTOR (Etymological DICTionary ediTor), http://edictor.digling.org, List (2017)
    • annotate data
    • analyze data
    • edit alignments

    Example wordlist with the columns ID, DOCULECT, CONCEPT, SEGMENTS: four entries (IDs 1 to 4) for the concept "moon" in Běijīng, Guǎngzhōu, Měixiàn, and Fúzhōu.

    Conversion and segmentation: yuE_5_1liaN_1 → yɛ⁵¹liɑŋ¹ → y ɛ ⁵¹ l i ɑ ŋ ¹
    Highlighting of unrecognized phonetic symbols
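The conversion and segmentation step can be illustrated with a greedy longest-match lookup against a conversion table. This is a sketch under stated assumptions, not EDICTOR's actual code: the `PROFILE` mapping is hypothetical and was chosen only so that it reproduces the example on the slide, and the `«»` marking stands in for the highlighting of unrecognized symbols.

```python
# Hypothetical conversion table: source grapheme -> IPA segment.
# The mapping is an assumption chosen to reproduce the slide's example.
PROFILE = {
    "y": "y",
    "uE": "ɛ",
    "l": "l",
    "i": "i",
    "a": "ɑ",
    "N": "ŋ",
    "_5_1": "⁵¹",
    "_1": "¹",
}


def segment(form, profile):
    """Convert `form` to a list of segments by greedy longest match.

    Unrecognized characters are kept and wrapped in «» so that they can
    be highlighted for manual inspection.
    """
    longest = max(len(key) for key in profile)
    segments, i = [], 0
    while i < len(form):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            segments.append("«" + form[i] + "»")
            i += 1
    return segments


segments = segment("yuE_5_1liaN_1", PROFILE)
print(" ".join(segments))
```

Trying longer keys first matters here: without it, `_5_1` could never be matched as a single tone symbol.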
  33. Basics of CALC: Data 13 / 32

    Cross-Linguistic Data Formats Initiative (CLDF; Forkel, List et al. 2018), cldf.clld.org, w3.org/2013/csvw/

    Validation software (pypi.org/project/pycldf/):

    >>> from pycldf import *
    >>> ds = Dataset('path')
    >>> ds.validate()
    >>> ds.statistics()

    Spreadsheet formats:

    ID   CONCEPT  IPA    COGNACY
    1    hand     hant   1
    2    hand     hænd   1
    3    ruka     ruka   2
    4    rẽnka    rẽnka  2
    ...  ...      ...    ...

    Reference catalogs: Glottolog (languages), Concepticon (concepts), CLTS (speech sounds)
    Online publication with CLLD
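To make the spreadsheet format concrete, the sketch below reads the four rows from the slide and groups the forms by their cognate set. It uses only the standard library and is not the pycldf API; writing the table as comma-separated text and the helper name `cognate_sets` are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# The rows from the slide, written as comma-separated text.
TABLE = """ID,CONCEPT,IPA,COGNACY
1,hand,hant,1
2,hand,hænd,1
3,ruka,ruka,2
4,rẽnka,rẽnka,2
"""


def cognate_sets(text):
    """Group the IPA forms of a wordlist table by their COGNACY value."""
    sets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        sets[row["COGNACY"]].append(row["IPA"])
    return dict(sets)


groups = cognate_sets(TABLE)
for cognacy, forms in groups.items():
    print(cognacy, forms)
```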
  36. 14 / 32 Examples

  37. 14 / 32 Examples

    Linking Concepts; Semantic Networks; Integrating Concepts (further diagram labels: Colexification, Comparison Data, Text)
  38. Examples: Linking Concepts: Starting Point 15 / 32

    In the past centuries, scholars have produced a large number of concept lists. A concept list is, in its simplest form, a list of concepts (e.g., I, you, he/she, dog, cat) which scholars find interesting for some linguistic, anthropological, or cognitive study. Starting with the work of Morris Swadesh, who proposed basic vocabulary as a concept important for historical linguistics, the compilation of concept lists has increased even more. For a very long time, scholars simply ignored the abundance of concept lists produced in different fields and never tried to compare them systematically.
  39. Examples: Linking Concepts: Data and Analysis 16 / 32

    In 2016, we published the first version of the Concepticon project (List et al. 2016, https://concepticon.clld.org), the first attempt to link the numerous concept lists which have been compiled so far. We link concept lists by defining Concept Sets, that is, abstract concepts which are given a unique ID and a gloss (to ease elicitation) along with a definition and (potentially) additional metadata. All items of a given concept list are linked to the Concepticon Concept Sets where possible. By now, Concepticon has 3755 Concept Sets and links to 310 different concept lists.
  40. Examples: Linking Concepts: Data and Analysis 17 / 32

    We have regularly maintained and updated the Concepticon since 2016. By now, we have a team of about 8-10 regular contributors. All concept lists that are added to the project are rigorously checked in a code-based review procedure along with computational checks for internal consistency. New lists can be linked to the Concepticon automatically and refined manually afterwards (this works in up to 10 different languages). Concepticon is the basic reference catalog for concepts and elicitation glosses underlying the Cross-Linguistic Data Formats initiative (Forkel et al. 2018, https://cldf.clld.org).
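The idea of linking a new list automatically and refining it by hand can be sketched as a lookup of normalized glosses, where anything that finds no match is left for manual review. This is a simplified illustration and not the Concepticon mapping algorithm: the mini-catalog `CONCEPT_SETS`, its placeholder IDs, and the normalization rules are all assumptions.

```python
import re

# Hypothetical mini-catalog: concept set gloss -> ID. The IDs are
# placeholders for illustration, not real Concepticon identifiers.
CONCEPT_SETS = {"HAND": 1, "TREE": 2, "MOON": 3, "DOG": 4}


def normalize(gloss):
    """Reduce an elicitation gloss to a comparable key."""
    gloss = re.sub(r"\(.*?\)", "", gloss)  # drop bracketed comments
    gloss = re.sub(r"^(the|a|an|to)\s+", "", gloss.strip().lower())
    return gloss.strip().upper()


def map_concepts(glosses, concept_sets):
    """Return (gloss, concept set ID or None) pairs for manual refinement."""
    return [(gloss, concept_sets.get(normalize(gloss))) for gloss in glosses]


mapping = map_concepts(
    ["the hand", "Tree (plant)", "moon", "he/she"], CONCEPT_SETS
)
for gloss, concept_id in mapping:
    print(gloss, "->", concept_id if concept_id is not None else "unmapped")
```

Returning `None` instead of guessing keeps the automatic step conservative, which fits a workflow in which every mapping is reviewed afterwards.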
  41. Examples: Linking Concepts: Results 18 / 32

    Concepticon is increasingly used by scholars who want to establish their own questionnaires or surveys for lexical data of the languages of the world. Concepticon is the core component that allowed for the relaunch of the CLICS database (see Semantic Networks, next example). The data is growing at a steady pace and the procedures for error-checking and evaluation are constantly being refined. Our code-based data curation approach has proven to be very efficient for projects with a long-term goal. Individual issues in defining concepts the way Concepticon does have been discussed in blog posts (e.g., List 2018).
  42. Examples: Linking Concepts: Plans 19 / 32

    Version 2.4 is expected to extend the Concepticon project substantially with further concept lists. We are working on an integration of Concepticon with the NoRaRe database (last example in this talk). We are running initial experiments to enhance our automated mapping algorithm (also considering the use of machine learning technologies), which is needed to provide access to Concepticon data for projects that work with a lot of data (e.g., NLP projects).
  43. Examples: Semantic Networks: Starting Point 20 / 32

    Colexification: collective term for polysemy and homophony

               forest        tree      wood      stem     branch   root
    French     fɔʀɛ, bwɑ     aʀbrə     bwɑ       tʀɔ      bʀɑʃ     ʀasin
    Russian    lʲes          dʲerɪva   dʲerɪva   stvɔl    vʲetvʲ   kɔrɪnʲ
    Croatian   ʃuma          staːblɔ   drvɔ      staːblɔ  graːna   kɔriɛn
    Yukaghir   aːnmonilʲe    saːl      saːl      tʃilge   tʃilge   waruluː
    Yaqui      dʒuja         dʒuja     kuta      naːwa    budʒa    naːwa

    [Figure: colexification network over the six concepts, with edge weights 1, 1, 2, 1, 1, 1]
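How such a table turns into a weighted network can be sketched in a few lines: two concepts are linked whenever one language uses the same form for both, and the edge weight counts the languages that do so. The sketch uses the rows for Russian, Yukaghir, and Yaqui from the slide; the function name `colexification_network` is an assumption, and this is not the CLICS code base.

```python
from collections import defaultdict
from itertools import combinations

CONCEPTS = ["forest", "tree", "wood", "stem", "branch", "root"]

# One form per concept and language, in the order of CONCEPTS.
FORMS = {
    "Russian": ["lʲes", "dʲerɪva", "dʲerɪva", "stvɔl", "vʲetvʲ", "kɔrɪnʲ"],
    "Yukaghir": ["aːnmonilʲe", "saːl", "saːl", "tʃilge", "tʃilge", "waruluː"],
    "Yaqui": ["dʒuja", "dʒuja", "kuta", "naːwa", "budʒa", "naːwa"],
}


def colexification_network(concepts, forms):
    """Return {(concept_a, concept_b): number of languages colexifying them}."""
    edges = defaultdict(int)
    for words in forms.values():
        # Collect the concepts that share a form within one language.
        concepts_by_form = defaultdict(list)
        for concept, word in zip(concepts, words):
            concepts_by_form[word].append(concept)
        for colexified in concepts_by_form.values():
            for pair in combinations(sorted(colexified), 2):
                edges[pair] += 1
    return dict(edges)


network = colexification_network(CONCEPTS, FORMS)
for (concept_a, concept_b), weight in sorted(network.items()):
    print(concept_a, "-", concept_b, weight)
```

With one form per concept and language, each concept pair is counted at most once per language, so the weights are language counts.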
  47. Examples: Semantic Networks: Data and Analysis 21 / 32

    Database of Cross-Linguistic Colexifications (CLICS), https://clics.clld.org
    • Interactive web application for browsing the data
    • Test-based data lifting and curation (CLDF)
    Diagram labels: INTERFACES, SOFTWARE, DATA
  50. Examples: Semantic Networks: Results 22 / 32

    List, J.-M., A. Terhalle, and M. Urban (2013): Using network approaches to enhance the analysis of cross-linguistic polysemies. In: Proceedings of the 10th International Conference on Computational Semantics -- Short Papers. Association for Computational Linguistics 347-353.
  51. Examples: Semantic Networks: Results 22 / 32

    Mayer, T., J.-M. List, A. Terhalle, and M. Urban (2014): An interactive visualization of cross-linguistic colexification patterns. In: Visualization as added value in the development, use and evaluation of Linguistic Resources. Workshop organized as part of the International Conference on Language Resources and Evaluation. 1-8.
  52. Examples: Semantic Networks: Results 22 / 32

    List, J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (2018): CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2. 277-306.
  53. Examples: Semantic Networks: Results 22 / 32

    Rzymski, C., T. Tresoldi, S. Greenhill, M. Wu, N. Schweikhard, M. Koptjevskaja-Tamm, V. Gast, T. Bodt, A. Hantgan, G. Kaiping, S. Chang, Y. Lai, N. Morozova, H. Arjava, N. Hübler, E. Koile, S. Pepper, M. Proos, B. Epps, I. Blanco, C. Hundt, S. Monakhov, K. Pianykh, S. Ramesh, R. Gray, R. Forkel, and J.-M. List (2020): The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data 7.13. 1-12.
  54. Examples: Semantic Networks: Results 22 / 32

    CLICS¹ (2014), CLICS² (2018), CLICS³ (2020)
  55. Examples: Semantic Networks: Results 22 / 32

    Jackson, J., J. Watts, T. Henry, J.-M. List, P. Mucha, R. Forkel, S. Greenhill, and K. Lindquist (2019): Emotion semantics show both cultural variation and universal structure. Science 366.6472. 1517-1522.
  60. Examples: Semantic Networks: Plans 23 / 32

    Expanding colexification analyses to include partial colexifications and directed networks. Creating partial colexification data for testing and training. Conducting targeted colexification studies.
  61. Examples: Integrating Concepts: Starting Point 24 / 32

    Tjuka et al. (under review): 10.31234/osf.io/tgw3z
  62. Examples: Integrating Concepts: Starting Point 24 / 32

    There is a wealth of data about concepts produced by historical linguists, corpus linguists, computational linguists, and psycholinguists. These data are rarely properly integrated. If they were integrated with resources like the Concepticon, they would open up a large number of new possibilities for our research.
  63. Examples: Integrating Concepts: Data and Analysis 25 / 32

    Tjuka et al. (under review): 10.31234/osf.io/tgw3z
  65. Examples: Integrating Concepts: Data and Analysis 25 / 32

    We apply our workflow for test-driven data curation to publicly available datasets which provide norms, ratings, or relations for concepts and words. We distinguish manually, semi-automatically, and automatically mapped resources (based on structure and size). We normalize the original data by tagging the columns and making them comparable across the different source datasets.
  66. Examples: Integrating Concepts: Results 26 / 32

    First version submitted and released (Tjuka, Forkel, and List, under review, https://digling.org/norare/). 71 datasets from which 415 word and concept properties could be derived. The data curation workflow could be successfully evaluated (building also on our experience with Concepticon). Data applicability is greatly enhanced thanks to the pynorare software API, which allows for a quick comparison, but the data can also be easily analyzed with the help of R.
  67. Examples: Integrating Concepts: Plans 27 / 32

    Annika Tjuka (first author of the NoRaRe database) has started to carry out different tests of the norms, ratings, and relations in NoRaRe and will continue this work. Expanding the database by adding specifically corpus data (e.g., for parallel Bible corpus studies) and data from NLP studies (word embeddings). Enhancing the concept mapping algorithms (experiments with Christoph Rzymski). Integrating NoRaRe with the Concepticon web presentation (with Robert Forkel).
  68. 28 / 32 Outlook

  69. Outlook: Ongoing Projects 29 / 32

    Expanding CLICS as part of our lexibank initiative to lift and retro-standardize lexical data for the purpose of cross-linguistic comparison (see, among others, Forkel and List 2020: CLDFBench). Discussing further integration with psychological approaches by pushing language analysis (Jackson et al. under review). Enhanced approaches to the annotation of colexifications in lexical datasets (with Roberto Zariquiey, based on work presented in Schweikhard and List 2020).
  70. Outlook: Planned Projects 30 / 32

    An extended study on the semantics of body parts from the perspective of linguistic diversity (work with Annika Tjuka and Damián Blasi). Semantics underlying terms for body and mind (work led by MacCormack and Jackson in collaboration with Watts and Henry). Creating enhanced, manually annotated datasets for the study of partial colexifications (work with Nathanael Schweikhard).
  71. Outlook: Possibilities 31 / 32

    Based on work by Urban (2011), we can design an approach to detect partial colexifications in a cross-linguistic collection of lexical datasets. Contrary to Urban’s claim, these networks reflect both metonymic and metaphorical relations among concepts across multiple languages. Pilot studies show promising results with respect to network structures.
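The slides do not spell out how partial colexifications are detected. One crude way to picture the idea, and the source of the directed networks mentioned in the plans, is to link a concept to another whenever its word form is a proper part of the other's form. The sketch below makes exactly that simplifying assumption, with plain substring matching standing in for a proper morphological analysis; the toy German data, its simplified transcriptions, and the `min_length` threshold are illustrative and do not come from the talk.

```python
from itertools import permutations

# Toy German data; the simplified transcriptions are illustrative only.
WORDS = {"HAND": "hant", "SHOE": "ʃuː", "GLOVE": "hantʃuː", "TREE": "baum"}


def partial_colexifications(words, min_length=3):
    """Return directed (part, whole) concept pairs.

    A pair is reported when the form of `part` is a proper substring of
    the form of `whole` and is at least `min_length` characters long.
    """
    edges = []
    for (part, part_form), (whole, whole_form) in permutations(words.items(), 2):
        if (
            len(part_form) >= min_length
            and part_form != whole_form
            and part_form in whole_form
        ):
            edges.append((part, whole))
    return sorted(edges)


edges = partial_colexifications(WORDS)
for part, whole in edges:
    print(part, "->", whole)
```

The length threshold is a guard against chance matches of very short forms; any real study would need a more careful criterion.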
  72. Outlook: Possibilities 31 / 32 List (in preparation)

  74. Thank You for Listening 32 / 32

    Thanks to all who do research with our group and have shared ideas, code, and data with us in the past: Cormac Anderson, Timotheus Bodt, Doug Cooper, Simon J. Greenhill, Russell D. Gray, Robert Forkel, Nathan W. Hill, Jessica K. Ivani, Yunfan Lai, Christoph Rzymski, Nathanael E. Schweikhard, Tiago Tresoldi, and Mei-Shin Wu. Many thanks to the European Research Council for supporting the project "Computer-Assisted Language Comparison" as part of the H2020 funding scheme in the form of an ERC Starting Grant (2017-2022).