CLICS2: An improved database and a computer-assisted framework for cross-linguistic colexifications

Tiago Tresoldi
Computer-Assisted Language Comparison (CALC)
Max Planck Institute for the Science of Human History (MPI-SHH / JENA)
Santa Fe, 2018-08-27

2 Colexifications?
• A colexification, building on Haspelmath (2003) and François (2008), occurs when two different concepts are expressed by the same word within the same language
• The term is deliberately agnostic: it covers both homophony and polysemy
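
To make the definition concrete, here is a minimal Python sketch that finds colexifications in a toy wordlist. The data layout, the simplified transliterations, and the function name `find_colexifications` are assumptions made for illustration; this is not the CLICS implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist of (language, concept, form) triples.
# Forms are simplified transliterations, used for illustration only.
WORDLIST = [
    ("Russian", "ARM", "ruka"),
    ("Russian", "HAND", "ruka"),
    ("Russian", "TREE", "derevo"),
    ("Russian", "WOOD", "derevo"),
    ("English", "ARM", "arm"),
    ("English", "HAND", "hand"),
    ("English", "TREE", "tree"),
    ("English", "WOOD", "wood"),
]


def find_colexifications(wordlist):
    """Return {(concept_a, concept_b): set of languages colexifying the pair}."""
    # Group concepts by (language, form): concepts that share a form
    # within one language are colexified in that language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    colexifications = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            colexifications[pair].add(language)
    return dict(colexifications)


colex = find_colexifications(WORDLIST)
for pair, languages in sorted(colex.items()):
    print(pair, sorted(languages))
```

Note that the sketch does not try to tell homophony from polysemy: identical forms are all it looks at, which mirrors the agnostic character of the term.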

3 Early Accounts: People and Ideas
• Haspelmath (2003). The geometry of grammatical meaning.
• François (2008). Semantic maps and the typology of colexifications.
• Cysouw (2010). Drawing networks from recurrent polysemies.
• Steiner, Stadler, and Cysouw (2011). A pipeline for computational historical linguistics.
• Urban (2011). Asymmetries in overt marking and directionality in semantic change.

4 Techniques and Ideas
• Cysouw (2010) uses polysemy data to draw networks
• Mayer, List, Terhalle, and Urban (2014) develop an interactive way to visualize cross-linguistic colexification
• List, Mayer, Terhalle, and Urban (2014) publish the database and the web application under the name CLICS (Database of Cross-Linguistic Colexifications)
• In contrast to earlier single-database attempts, multiple datasets are merged
• The community detection procedure is improved by using Infomap (Rosvall and Bergstrom, 2008), an algorithm based on random walks in complex networks

5 CLICS1: Data
• IDS (Key and Comrie, 2007)
  • 233 languages, of which 178 were included in CLICS
• WOLD (Haspelmath and Tadmor, 2009)
  • 41 languages, of which 33 were included in CLICS
• Logos Dictionary (Logos Group)
  • 60 languages, of which 4 were included in CLICS
• Språkbanken project (University of Gothenburg)
  • 8 SEA languages, of which 6 were included in CLICS

6 CLICS1: Methods
• Problem A: The data cannot be displayed in full, so complexity needs to be reduced
  • Show communities instead of showing all the data
  • Offer a subgraph view that cuts out the nearest neighbors of one concept, to compensate for the data loss in the community view
• Problem B: The data is noisy and needs to be corrected
  • Filter by language families and weight the concept links by frequency of occurrence, following the suggestion of Dellert (2014)
  • This removes most of the links that arise from homophony, leaving mainly polysemy
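
The filtering and the subgraph view can be sketched in a few lines of Python. This is one possible reading of the procedure, not the CLICS1 code: the language and family labels are placeholders, the attestations are invented, and the threshold `min_families=2` is an assumed value.

```python
# Invented toy input: each concept link with the languages attesting it.
LINKS = {
    ("ARM", "HAND"): ["lang1", "lang2", "lang3", "lang4"],
    ("TREE", "WOOD"): ["lang1", "lang3"],
    ("ARM", "POOR"): ["lang5"],  # attested once: plausibly chance homophony
}

# Placeholder language-to-family assignments.
FAMILY = {
    "lang1": "family_a",
    "lang2": "family_a",
    "lang3": "family_b",
    "lang4": "family_c",
    "lang5": "family_a",
}


def weight_by_family(links, family_of):
    """Weight each link by the number of distinct families attesting it."""
    return {
        pair: len({family_of[language] for language in languages})
        for pair, languages in links.items()
    }


def filter_links(weights, min_families=2):
    """Keep only links attested in at least `min_families` families."""
    return {pair: w for pair, w in weights.items() if w >= min_families}


def subgraph(weights, concept):
    """Subgraph view: the links that directly involve one concept."""
    return {pair: w for pair, w in weights.items() if concept in pair}


weights = weight_by_family(LINKS, FAMILY)
filtered = filter_links(weights)
arm_view = subgraph(weights, "ARM")
print(weights)
print(filtered)
print(arm_view)
```

Counting families rather than languages keeps closely related languages from inflating a link, and a link found in a single family is the most likely to be accidental.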

7 CLICS1: Interface
• Backend in PHP, frontend in JavaScript
• Transparency and reproducibility
  • The underlying network with the inferred communities can be downloaded from the website
  • The entire code for analysis and visualization is available on GitHub
  • The complete set of wordlists is available from Zenodo

8 CLICS1: Demo
• Check it at http://clics.lingpy.org

9 CLICS2: Problems of CLICS1
• Difficult to curate (error correction, data extension)
• Difficult to expand
• Difficult to collaborate
• Difficult to communicate
  • Not all users understand that the data was aggregated rather than collected by the authors, and that corrections are to be considered new, derived datasets
• Difficult to catch up
  • Best practices of data curation were learned while developing CLICS, but they were difficult to integrate into the workflow

10 CLICS2: Ideas
• Use the best wordlist data currently available
• Separate data from display
  • CLICS2 does not host data, but uses it
• Curate data following the recommendations developed for the Cross-Linguistic Data Formats (CLDF) initiative (Forkel et al., 2017)
• Curate the code and the data with the help of a transparent API
• Release the data regularly, in release cycles of about one year
  • This follows the practice of Glottolog and other CLLD projects

11 Excursus: CLDF
• Aims at increasing the comparability of cross-linguistic data and analyses
• Supports methods for standardization via reference catalogs like Glottolog (Hammarström et al., 2018) and Concepticon (List et al., 2017)
• Provides software APIs which help to test whether data conforms to standards
• Offers working examples for best practices
• Supported by different software frameworks (LingPy, BEASTling, EDICTOR)
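
As a rough illustration of what testing for conformity can mean, the sketch below checks a CLDF-style form table against its language and parameter tables. It uses only the Python standard library and is not the CLDF tooling itself. The column names follow the usual CLDF wordlist layout (`ID`, `Language_ID`, `Parameter_ID`, `Form`), while the toy rows and the helper names `read_table` and `validate` are assumptions.

```python
import csv
import io

# Toy tables in a CLDF-like CSV layout (illustrative content only).
LANGUAGES_CSV = """ID,Name
lang1,Language One
lang2,Language Two
"""

PARAMETERS_CSV = """ID,Name
HAND,hand
ARM,arm
"""

FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
1,lang1,HAND,ruka
2,lang1,ARM,ruka
3,lang9,HAND,xyz
4,lang2,FOOT,
"""


def read_table(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(forms, languages, parameters):
    """Return a list of human-readable problems found in the form table."""
    language_ids = {row["ID"] for row in languages}
    parameter_ids = {row["ID"] for row in parameters}
    errors = []
    for row in forms:
        # Every form must point to a known language and a known concept.
        if row["Language_ID"] not in language_ids:
            errors.append(f"{row['ID']}: unknown Language_ID '{row['Language_ID']}'")
        if row["Parameter_ID"] not in parameter_ids:
            errors.append(f"{row['ID']}: unknown Parameter_ID '{row['Parameter_ID']}'")
        if not row["Form"]:
            errors.append(f"{row['ID']}: empty Form")
    return errors


errors = validate(
    read_table(FORMS_CSV), read_table(LANGUAGES_CSV), read_table(PARAMETERS_CSV)
)
for error in errors:
    print(error)
```

Checks of this kind are what make datasets from different sources safe to aggregate: a broken reference is caught before it reaches an analysis.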

12 Excursus: CLDF DEMO
• A known dataset: https://github.com/lexibank/asjp

13 Excursus: Reference Catalogs
• Linking to Glottolog is advantageous, as it brings together various types of additional information about languages, all of which can then be used effortlessly
• The Concepticon project (List et al., 2016, List et al., 2018) is less well known, but offers the same advantages when dealing with wordlists built by means of “elicitation glosses”

14 Excursus: Concepticon
• Link concept labels (“elicitation glosses”) in published concept lists (questionnaires) to concept sets
• Link concept sets to metadata
• Define relations between concept sets
• Never link one concept in a given list to more than one concept set (guarantees consistency)
• Provide an API to check the consistency of the data and to query the data
• Provide a web interface to browse through the data
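
The consistency rule (one concept in a list, at most one concept set) is easy to state as a check. The following Python sketch assumes a simple row layout of (list, gloss, concept set); the rows and the function name `find_inconsistent_glosses` are hypothetical, and the code is not the Concepticon API.

```python
from collections import defaultdict

# Hypothetical mappings: (concept list, elicitation gloss, concept set).
MAPPINGS = [
    ("List-A", "hand", "HAND"),
    ("List-A", "arm", "ARM"),
    ("List-B", "hand/arm", "ARM OR HAND"),
    ("List-B", "tree", "TREE"),
    ("List-B", "tree", "WOOD"),  # violates the rule: second link for one gloss
]


def find_inconsistent_glosses(mappings):
    """Return glosses linked to more than one concept set within one list."""
    linked = defaultdict(set)
    for list_id, gloss, concept_set in mappings:
        linked[(list_id, gloss)].add(concept_set)
    return {key: sets for key, sets in linked.items() if len(sets) > 1}


violations = find_inconsistent_glosses(MAPPINGS)
for (list_id, gloss), concept_sets in violations.items():
    print(f"{list_id}: '{gloss}' is linked to {sorted(concept_sets)}")
```

A gloss that genuinely covers two meanings, such as "hand/arm", is handled by linking it to a single broader concept set rather than to two narrower ones.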

15 Excursus: Concepticon

16 Excursus: Concepticon DEMO
• Check it at https://concepticon.clld.org/

17 Excursus: Data in CLDF
• Since our datasets are all available in CLDF format, we can easily aggregate them for CLICS2
• Because the datasets overlap only partially in the concepts they cover, we offer code examples for computing mutual coverage statistics, which allow users to select the subsets of the data best suited to a given analysis
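
A minimal sketch of such a statistic, assuming that mutual coverage between two languages means the number of concepts for which both have a form. The toy data, the function names, and the greedy selection heuristic are illustrative assumptions rather than the code shipped with CLICS2.

```python
from itertools import combinations

# Toy data: the concepts for which each language has at least one form.
CONCEPTS_BY_LANGUAGE = {
    "lang1": {"ARM", "HAND", "TREE", "WOOD"},
    "lang2": {"ARM", "HAND", "TREE"},
    "lang3": {"HAND", "WOOD"},
}


def mutual_coverage(concepts_by_language):
    """Number of shared concepts for every pair of languages."""
    coverage = {}
    for a, b in combinations(sorted(concepts_by_language), 2):
        coverage[(a, b)] = len(concepts_by_language[a] & concepts_by_language[b])
    return coverage


def select_subset(concepts_by_language, min_shared):
    """Greedily drop languages until every pair shares `min_shared` concepts."""
    selected = dict(concepts_by_language)
    while len(selected) > 1:
        coverage = mutual_coverage(selected)
        if min(coverage.values()) >= min_shared:
            break
        # Drop the language that shares the fewest concepts with the rest
        # (ties are broken alphabetically).
        totals = {language: 0 for language in selected}
        for (a, b), shared in coverage.items():
            totals[a] += shared
            totals[b] += shared
        worst = min(sorted(totals), key=totals.get)
        del selected[worst]
    return sorted(selected)


coverage = mutual_coverage(CONCEPTS_BY_LANGUAGE)
subset = select_subset(CONCEPTS_BY_LANGUAGE, min_shared=2)
print(coverage)
print(subset)
```

Raising `min_shared` trades the number of languages for comparability: fewer languages remain, but every remaining pair can be compared on more concepts.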

18 CLICS2: Coverage

19 CLICS2 Coverage

20 CLICS2: Software API
• By using the Python API, users are able to use their own data and run their own network analyses
• Since all the data for CLICS2 is independently shared and curated, users can also rerun the analysis on the same data with different parameters
• We offer examples of how the CLICS2 data can be computed with the help of the API
• By shifting to the CLDF framework, scholars can also create their own CLICS-like websites, since the source code for the creation of interactive networks is transparently shipped with the data
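
To illustrate how a parameter changes the outcome of a network analysis, the sketch below groups concepts into connected components after discarding weak links. It deliberately does not use the CLICS2 Python API, and connected components are a much cruder grouping than the Infomap communities mentioned earlier. The edge weights and the `min_weight` parameter are invented for the example.

```python
# Invented toy network: (concept_a, concept_b, weight) edges.
EDGES = [
    ("ARM", "HAND", 3),
    ("HAND", "FINGER", 2),
    ("TREE", "WOOD", 2),
    ("WOOD", "FOREST", 1),
    ("ARM", "POOR", 1),
]


def components(edges, min_weight=1):
    """Connected components of the network, ignoring edges below `min_weight`."""
    adjacency = {}
    for a, b, weight in edges:
        if weight >= min_weight:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


loose = components(EDGES, min_weight=1)
strict = components(EDGES, min_weight=2)
print(loose)
print(strict)
```

With the stricter threshold the weakly attested links to POOR and FOREST drop out, and the two remaining groups are easier to interpret.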

21 CLICS2: Features
• Drastic increase in data
• Drastic increase in transparency
• Drastic increase in replicability
• Regular floating releases which feature new data
• Strict and clear-cut collaboration guidelines
• Rigid policy towards open data
  • Since we profit heavily from all our colleagues who publish their data!

22 CLICS2: Enhanced Browsing
• Thanks to the CLLD framework, the data is now easier to browse and clearly linked to the original datasets
• Users can build a standalone app to browse CLICS2 data with the CLICS1 look-and-feel, and even deploy their own version
• We are currently experimenting with new visualizations
  • Showcase: visualization methods developed for the inspection of galaxies (contributed by Thomas Mayer): https://bit.ly/2PaZmxx or http://127.0.0.1:8081/#/galaxy/clics

23 Features: Examples

24 Features: Examples

25 Features: Spaceship

26 Schedule
• CLICS2 data is currently being released, see https://zenodo.org/communities/clics
• CLICS2 is deployed online at http://clics.clld.org and published by List, Greenhill, Anderson, Mayer, Tresoldi, and Forkel (2018)
• The spaceship visualization will be deployed online later this year

27 Outlook
• With CLICS2 we provide a new framework for the collection and curation of data for the purpose of studying cross-linguistic colexification patterns
• Future updates are planned, and we expect to extend the data by at least five more large datasets
• CLICS2 is not perfect, and it does not come with any warranty; however, we hope that the improvements in data transparency will make it much easier for scholars to work with the new cross-linguistic colexification database than with its predecessor
• CLICS2 is our showcase product for getting people on board with the CLDF initiative

Thank you!
