Upgrade to Pro — share decks privately, control downloads, hide ads and more …

CLICS2 : An improved database and a computer-assisted framework for cross-linguistic colexifications

CLICS2 : An improved database and a computer-assisted framework for cross-linguistic colexifications

Presentation held during the ""Lexical semantics, language change, and universal translators" workgroup at the Santa Fe Institute.

Tiago Tresoldi

August 27, 2018
Tweet

More Decks by Tiago Tresoldi

Other Decks in Research

Transcript

  1. CLICS2
    :
    An improved database and a
    computer-assisted framework for
    cross-linguistic colexifications
    Tiago Tresoldi
    Computer-Assisted Language Comparison (CALC)
    Max Planck Institute for the
    Science of Human History (MPI-SHH / JENA)
    Santa Fe, 2018-08-27

    View Slide

  2. 2
    Colexifications?

    A colexification, building on Haspelmath (2003)
    and François (2008), occurs when, in a same
    language, two different concepts are expressed by
    the same word.

    Agnostic term for covering homophony and
    polysemy

    View Slide

  3. 3
    Early Accounts: People and Ideas

    Haspelmath (2003). The geometry of grammatical
    meaning.

    François (2008). Semantic maps and the typology
    of colexifications.

    Cysouw (2010). Drawing networks from recurrent
    polysemies.

    Steiner, Stadler, and Cysouw (2011). A pipeline for
    computational historical linguistics.

    Urban (2011). Asymmetries in overt marking and
    directionality in semantic change.

    View Slide

  4. 4
    Techniques and Ideas

    Cysouw (2010) uses polysemy data to draw networks

    Mayer, List, Terhalle, and Urban (2014) develop an
    interactive way to visualize cross-linguistic colexification

    List, Mayer, Terhalle, and Urban (2014) publish the
    database and the web-application under the name CLICS
    (Database of Cross-Linguistic Colexifications)

    In contrast to earlier single-database attempts, multiple
    datasets are merged

    The community detection procedure is improved by using
    Infomap (Rosvall and Bergstrom, 2008), an algorithm
    based on random walks in complex networks

    View Slide

  5. 5
    CLICS1: Data

    IDS (Key and Comrie, 2007)

    233 languages of which 178 were included in CLICS

    WOLD (Haspelmath and Tadmor, 2009)

    41 languages of which 33 were included in CLICS

    Logos Dictionary (Logos Group)

    60 languages, of which 4 were included in CLICS

    Språkbanken project (University of Gothenburg)

    8 SEA languages, of which 6 were included in CLICS

    View Slide

  6. 6
    CLICS1: Methods

    Problem A: Data cannot be displayed fully, complexity
    needs to be reduced

    Show communities instead of showing all the data

    Subgraph-view that cuts out the nearest neighbors of
    one concept to compensate for data loss in the
    community view

    Problem B: Data is noisy and needs to be corrected

    Filter by language families and weight the concept links
    by frequency of occurrence, following the suggestion of
    Dellert (2014)

    This cuts most of the links from homophony, filtering for
    polysemy

    View Slide

  7. 7
    CLICS1: Interface

    Backend in PHP, frontend in Javascript

    Transparency and reproducibility

    The underlying network with the inferred communities
    can be downloaded from the website

    The entire code for analysis and visualization is available
    on GitHub

    The complete set of wordlists is available from Zenodo

    View Slide

  8. 8
    CLICS1: Demo

    Check it at http://clics.lingpy.org

    View Slide

  9. 9
    CLICS2: Problems of CLICS1

    Difficult to curate (error correction, data extension)

    Difficult to expand

    Difficult do collaborate

    Difficult to communicate

    Not all users understand that data was aggregated and
    not collected by the authors, and that corrections are to
    be considered new, derived datasets

    Difficult to catch up

    Best practices of data curation were learned while
    developing CLICS, but those were difficult to integrate in
    the workflow

    View Slide

  10. 10
    CLICS2: Ideas

    Use the state-of-the-art of available wordlist data

    Separate data from display

    CLICS2 does not host data, but uses it

    Curate data following the recommendations developed for
    the Cross-Linguistic Data Formats (CLDF) initiative (Forkel
    et al., 2017)

    Curate the code and the data with the help of a
    transparent API

    Regularly release the data in release circles of about once
    a year

    Practice of Glottolog and other CLLD projects

    View Slide

  11. 11
    Excursus: CLDF

    Aims at increasing the comparability of cross-linguistic
    data and analyses

    Supports methods for standardization via reference
    catalogs like Glottolog (Hammarström et al., 2018) and
    Concepticon (List et al., 2017)

    Provides software APIs which help to test whether data
    conforms to standards

    Offers working examples for best practices

    Supported by different software frameworks (LingPy,
    BEASTling, EDICTOR)

    View Slide

  12. 12
    Excursus: CLDF DEMO

    A known dataset: https://github.com/lexibank/asjp

    View Slide

  13. 13
    Excursus: Reference Catalogs

    Linking to Glottolog is advantageous, as it harvests various
    types of additional information regarding languages, all of
    which can be used effortlessly

    The Concepticon project (List et al., 2016, List et al., 2018)
    is less known, but offers the same advantages when
    dealing with wordlists built by means of “elicitation
    glosses”

    View Slide

  14. 14
    Excursus: Concepticon

    Link concept labels (“elicitation glosses”) in published
    concept lists (questionnaires) to concept sets

    Link concept sets to meta-data

    Define relations between concept sets

    Never link one concept in a given list to more than one
    concept set (guarantees consistency)

    Provide an API to check the consistency of the data and to
    query the data

    Provide a web-interface to browse through the data

    View Slide

  15. 15
    Excursus: Concepticon

    View Slide

  16. 16
    Excursus: Concepticon DEMO

    Check it at https://concepticon.clld.org/

    View Slide

  17. 17
    Excursus: Data in CLDF

    Since our datasets are all available in CLDF format, we can
    easily aggregate them for our new version of CLICS2

    Given problems with concept overlap in the datasets, we
    offer code examples that can be used to compute mutual
    coverage statistics allowing users to select subsets of the
    data optimal for a given analysis

    View Slide

  18. 18
    CLICS2: Coverage

    View Slide

  19. 19
    CLICS2 Coverage

    View Slide

  20. 20
    CLICS2: Software API

    By using the Python API, users are able to use their own
    data and run their own network analyses

    Since all the data for CLICS2 is independently shared and
    curated, users can also use the data with different
    parameters

    We offer examples on how the CLICS2 data can be
    computed with the help of the API

    By shifting to the CLDF framework, scholars can also create
    their own CLICS-like websites, since the source code for
    creation of interactive networks is transparently shipped
    with the data

    View Slide

  21. 21
    CLICS2: Features

    Drastic increase in data

    Drastic increase in transparency

    Drastic increase in replicability

    Regular floating releases which feature new data

    Strict and clear-cut collaboration guidelines

    Rigid policy towards open data

    Since we heavily profit from all our collegues who publish
    their data!

    View Slide

  22. 22
    CLICS2: Enhanced Browsing

    Thanks to the CLLD framework, the data is now easier to
    browse and clearly linked to the original datasets

    Thanks to a standalone app that can be created, users can
    browse CLICS2 data with the CLICS1 look-and-feel and even
    deploy their own version

    We are currently experimenting with new visualizations

    Showcase: visualization methods developed for the
    inspection of galaxies (contributed by Thomas Mayer)
    https://bit.ly/2PaZmxx or
    http://127.0.0.1:8081/#/galaxy/clics

    View Slide

  23. 23
    Features: Examples

    View Slide

  24. 24
    Features: Examples

    View Slide

  25. 25
    Features: Spaceship

    View Slide

  26. 26
    Schedule

    CLICS2 data is currently being released, see
    https://zenodo.org/communities/clics

    CLICS2 is deployed online at http://clics.clld.org and
    published by List, Greenhill, Anderson, Mayer, Tresoldi, and
    Forkel (2018)

    The spaceship visualization will be deployed online later
    this year

    View Slide

  27. 27
    Outlook

    With CLICS2 we provide a new framework for the collection
    and curation of data for the purpose of studying cross-
    linguistic colexification patterns

    Future updates are planned, and we assume that we will be
    able to increase data further by at least five more larger
    datasets

    CLICS2 is not perfect, and it does not come with any
    warranty; however, we hope that the improvements in
    terms of data transparency will make it much easier for
    scholars to work with the new cross-linguistic colexification
    database than its predecessor

    CLICS2 is our showcase product to have people jumping on
    board of the CLDF initiative

    View Slide

  28. Thank you!

    View Slide

  29. 29

    View Slide