Integrating data in cross-linguistic studies

Talk held at the meeting "Untangling the linguistic past of the Americas: Collaborative efforts and interdisciplinary approaches in an open science framework" (2020-09-23/25, Lima [virtual conference], PUCP).

Johann-Mattis List

September 23, 2020

Transcript

  1. Integrating Data in
    Cross-Linguistic Studies
    Beyond Data Sharing
    Johann-Mattis List (DLCE)

  2. Data Sharing and Data
    Integration
    2

  3. 3
    Data Sharing is Gaining Importance
    - Many studies in historical linguistics and
    linguistic typology present results based on
    data that are not shared.
    - Data sharing is becoming increasingly important
    and is now actively encouraged by some journals
    (Linguistic Typology, Linguistics, LDNC).
    - Reviewers increasingly ask for data and code
    during the review stage and make
    replicability checks part of the review
    procedure.

  7. 7
    But Data Sharing is Not the Whole Story!
    - Sharing data does not mean that the data can
    be easily processed and reused.
    - But data processing and data reuse
    (interoperability and reusability) are essential
    to
    - find errors in existing datasets,
    - enhance existing datasets, and
    - create new datasets from existing ones.
    - The discussion should not be about sharing
    data alone, but about integrating data.

  11. 11
    How Data can be Successfully Integrated
    - To be integrated, data need to be
    - internally consistent, and
    - externally comparable.
    - Internal consistency requires that data are
    not only human- but also machine-readable.
    - External comparability requires that data
    conform to general standards.
    - The Cross-Linguistic Data Formats (CLDF)
    initiative (https://cldf.clld.org) has presented
    recommendations for the representation of
    integrated cross-linguistic data collections
    (a minimal reading sketch follows below).
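
    To make "machine-readable" concrete: a CLDF dataset is essentially a set of
    plain CSV tables plus a JSON metadata file, so it can be processed with a few
    lines of code. The following is a minimal sketch, assuming a CLDF wordlist
    with the default table and column names (the file path is hypothetical); in
    practice one would rather use the pycldf package, which reads the metadata
    and validates the data.

    import csv

    # A CLDF wordlist stores its forms in a plain CSV table (typically
    # "forms.csv") with default columns such as ID, Language_ID, Parameter_ID,
    # Form, and Segments. The path below is hypothetical.
    with open("mydataset/cldf/forms.csv", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            print(row["Language_ID"], row["Parameter_ID"], row["Form"])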

  16. Cross-Linguistic Data
    Formats (CLDF) and
    CLDFBench
    16

  17. CLDF Standard Formats
    17

  18. 18
    CLDF Reference Catalogs
    - Glottolog (https://glottolog.org): reference
    catalog for language varieties
    - Concepticon (https://concepticon.clld.org):
    reference catalog for concepts, linking elicitation
    glosses used in linguistic literature to
    Concepticon Concept Sets
    - Cross-Linguistic Transcription Systems
    (https://clts.clld.org): identifies more than 8000
    possible speech sounds and links transcription
    systems and transcription datasets to them

  20. 20
    CLDF Reference Catalogs
    - Glottolog: offers geolocations, rudimentary
    classifications, and references
    - Concepticon: offers definitions, and now also
    extensive norms, ratings, and relations
    (https://digling.org/norare, frequency, similarity,
    hierarchies, etc.)
    - Cross-Linguistic Transcription Systems: offers
    feature sets for each speech sound, and
    similarity metrics based on these features
    Linking your data to Glottolog, mapping your concepts
    to Concepticon, and converting your transcriptions to
    CLTS will drastically ENRICH them! (A minimal mapping
    sketch follows below.)
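
    To make the "mapping your concepts to Concepticon" step concrete, here is a
    minimal, purely illustrative sketch: the gloss-to-Concepticon mapping is
    written out by hand (only the APPLE example, concept set 1320, is taken from
    a later slide; the other details and the helper function are hypothetical).
    In practice, tools built around the Concepticon data, such as pyconcepticon,
    can propose such mappings automatically.

    # Hypothetical, hand-made mapping from elicitation glosses to Concepticon
    # concept sets (ID and gloss); only APPLE (1320) comes from the slides.
    GLOSS_TO_CONCEPTICON = {
        "apple (noun)": ("1320", "APPLE"),
        # further glosses would be added here
    }

    def map_gloss(gloss):
        """Return the Concepticon concept set for an elicitation gloss, if any."""
        return GLOSS_TO_CONCEPTICON.get(gloss.strip().lower())

    print(map_gloss("Apple (noun)"))  # ('1320', 'APPLE')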

  21. Lifting Data with CLDFBench
    21

  22. 22
    Lifting Data with CLDFBench
    - Python package
    (https://github.com/cldf/cldfbench)
    - Consistent, code-based,
    test-driven conversion of
    data to CLDF
    - integration with
    Orthography Profiles for
    conversion to CLTS (see the
    segmentation sketch below)
    - straightforward conversion
    of CLDF datasets into a
    CLLD database
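
    The Orthography Profiles mentioned above are essentially tables mapping
    graphemes of a source orthography to target (CLTS-conformant) segments. The
    following sketch illustrates the underlying idea with a tiny hypothetical
    profile and a greedy longest-match tokenizer written from scratch; real
    conversions would use cldfbench together with the segments package and a
    full profile.

    # Tiny illustrative orthography profile mapping German graphemes to
    # CLTS-style sound segments (values chosen to reproduce the example
    # [tʰ a n tsʰ ə n] from a later slide; the profile itself is hypothetical).
    PROFILE = {
        "t": "tʰ",
        "a": "a",
        "n": "n",
        "z": "tsʰ",
        "e": "ə",
    }

    def segment(form, profile):
        """Greedy longest-match segmentation of a written form into segments."""
        segments, i = [], 0
        while i < len(form):
            # Try the longest grapheme first, then shorter ones.
            for length in range(max(map(len, profile)), 0, -1):
                chunk = form[i:i + length]
                if chunk in profile:
                    segments.append(profile[chunk])
                    i += length
                    break
            else:
                segments.append("?")  # unknown grapheme: mark for manual check
                i += 1
        return segments

    print(" ".join(segment("tanzen", PROFILE)))  # tʰ a n tsʰ ə n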

  23. CLDF Success Stories: CLICS
    23
    Figure: increasing coverage in the CLICS database over the past four years,
    from CLICS¹ (List et al. 2014, 300 languages) via CLICS² (List et al. 2018,
    1200 languages) to CLICS³ (Rzymski et al. to appear, 2474 languages).

  24. CLDF Success Stories: Emotion Semantics
    24
    Tackling questions on cognition with cross-linguistic data:
    - lexical data lifted for more than 2000 language varieties
    - automated methods for the creation of colexification networks
    (see the sketch below)
    - colexification networks can help to answer various questions on
    human cognition
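
    As a rough sketch of how colexification networks are built from such data:
    whenever one and the same form in a language expresses two different
    concepts, an edge between those concepts is added or strengthened. The toy
    word list below and the use of the networkx library are illustrative; the
    actual CLICS pipeline works on CLDF datasets with dedicated software.

    from collections import defaultdict
    from itertools import combinations
    import networkx as nx  # any graph library (or plain dictionaries) would do

    # Hypothetical toy word list: (language, concept, form).
    WORDLIST = [
        ("lang_a", "TREE", "te"),
        ("lang_a", "WOOD", "te"),     # TREE and WOOD are colexified in lang_a
        ("lang_b", "TREE", "arbol"),
        ("lang_b", "WOOD", "arbol"),  # ... and in lang_b as well
        ("lang_b", "FIRE", "fuego"),
    ]

    # Group concepts that are expressed by the same form in the same language.
    colexified = defaultdict(set)
    for language, concept, form in WORDLIST:
        colexified[language, form].add(concept)

    # Every colexification adds or strengthens an edge between two concepts.
    G = nx.Graph()
    for concepts in colexified.values():
        for c1, c2 in combinations(sorted(concepts), 2):
            weight = G[c1][c2]["weight"] + 1 if G.has_edge(c1, c2) else 1
            G.add_edge(c1, c2, weight=weight)

    print(list(G.edges(data=True)))  # [('TREE', 'WOOD', {'weight': 2})]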

  25. CLDF Success Stories: Emotion Semantics
    25
    Figure panels: Universal, Austronesian, Indo-European
    Major Results:
    - Individual language families differ in their emotion semantics.
    - Emotions are globally differentiated by hedonic valence and
    physiological activation.

  26. 26
    Summary on CLDF
    - With CLDF and CLDFBench, integrating your
    lexical and structural data becomes much easier.
    - Based on our experience helping colleagues
    convert their data to CLDF, we can say:
    converting your dataset to CLDF will help you to
    improve it!
    - If you convert the data accompanying your
    publications to CLDF, you help not only yourself,
    but also others to build on your work.
    - Our CLDF team at the DLCE of the MPI-SHH
    in Jena/Leipzig is always ready to help with
    questions.

  31. Enhanced,
    Machine-Readable
    Annotation of
    Etymologies
    31

  32. 32
    Why Cognacy is Not Enough
    Cognacy is not necessarily
    - binary in the light of partial cognates,
    - “horizontal” in the light of
    language-internal word families,
    - implying regular sound change and
    alignability, in the light of non-concatenative
    derivation, and
    - flat, but rather hierarchical, depending on
    the purpose of the analysis.

  33. 33
    What Can We Do About it?
    - Develop enhanced frameworks for
    annotation of etymologies.
    - Develop interfaces which support the
    annotation.
    - Develop software for preprocessing, which
    can be used to automatically annotate data in
    a computer-assisted setting.

  34. 34
    Where Are We Now?
    - Annotation Frameworks:
    - Hill and List (2017)
    - Schweikhard and List (2020)
    - Interfaces:
    - List (2017): EDICTOR tool
    (https://digling.org/edictor/)
    - Software:
    - List et al. (2020): LingPy (http://lingpy.org)
    - Phonetic alignment (List 2014)
    - Partial cognate detection (List et al. 2016)
    - Cognate detection (List et al. 2017)
    - Sound correspondence patterns (List 2019)
    - Workflows (Wu et al. 2020)

  35. Sequence Comparison with LingPy
    35
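
    As a small taste of the sequence comparison shown on this slide, the
    following sketch uses LingPy's pairwise alignment; the two input forms are
    illustrative, and the exact calls and output formatting may differ slightly
    across LingPy versions, so treat this as an assumption-laden sketch rather
    than the canonical API.

    from lingpy import Pairwise  # LingPy (http://lingpy.org)

    # Two illustrative word forms; real analyses would start from a wordlist.
    pw = Pairwise("tanzen", "dansen")
    pw.align()   # compute the pairwise (SCA-based) alignment
    print(pw)    # show the aligned segments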

  36. Etymological Annotation with EDICTOR
    36

  37. 37
    Improving Etymological Annotation
    1. Representation of Forms
    - Word forms should be represented as lists of
    sound segments.
    - not [tʰantsʰən] but [tʰ a n t͡sʰ ə n]
    - Word forms should conform to CLTS:
    - not [tʰ a n t͡sʰ ə n] but [tʰ a n tsʰ ə n]
    - Word forms should be further segmented
    into morphemes.
    - not [tʰ a n tsʰ ə n] but [tʰ a n tsʰ + ə n]

  38. 38
    Improving Etymological Annotation
    2. Representation of Meanings
    - Base meanings should be linked to
    Concepticon (if available).
    - “apple (noun)” -> 1320 APPLE
    - Individual meanings of morphemes per
    word should be glossed, distinguishing lexical
    (autosemantic) from grammatical
    (synsemantic) meanings.
    - [a pf ə l + b au m] -> APPLE TREE
    - [tʰ a n tsʰ + ə n] -> DANCE _infinitive

  39. 39
    Improving Etymological Annotation
    3. Representation of Cognacy
    - Fully cognate words receive the same ID
    (independent of their meaning slot), called
    COGID in LingPy/EDICTOR.
    - Partial cognate words receive an ID for each
    of their morphemes (see the sketch below).
    - Alignments should only be provided for
    partial cognates.
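
    A minimal sketch of what per-morpheme cognate IDs look like in practice,
    reusing the APPLE TREE example from an earlier slide together with a second,
    illustrative Dutch form; the IDs, segmentations, and the data structure are
    hypothetical, but they mirror the COGIDS annotation used in LingPy/EDICTOR.

    # Hypothetical wordlist entries: each word is split into morphemes, and
    # every morpheme receives its own (partial) cognate ID.
    WORDS = [
        # (doculect, concept, segmented form, per-morpheme cognate IDs)
        ("German", "APPLE TREE", "a pf ə l + b au m", (1, 2)),
        ("Dutch",  "APPLE TREE", "ɑ p ə l + b oː m",  (1, 2)),  # both morphemes cognate
        ("German", "APPLE",      "a pf ə l",          (1,)),    # shares only the first morpheme
    ]

    # Collect partial cognate sets: morphemes sharing an ID belong together.
    partial_cognates = {}
    for doculect, concept, segments, cogids in WORDS:
        morphemes = [m.strip() for m in segments.split("+")]
        for morpheme, cogid in zip(morphemes, cogids):
            partial_cognates.setdefault(cogid, []).append((doculect, morpheme))

    for cogid, members in sorted(partial_cognates.items()):
        print(cogid, members)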

  40. 40
    Improving Etymological Annotation
    Enhanced Data Representation in EDICTOR (Pano data
    by Zariquiey et al. in preparation).

  41. Integration Beyond the
    Lexicon
    41

  42. 42
    Integrating Data Beyond the Lexicon
    CLDF Datasets
    - https://github.com/cldf-datasets/
    Interlinear Glossed Text
    - https://github.com/cldf/pyigt (see the sketch below)
    Parallel texts
    - ???
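
    Interlinear glossed text can be integrated in the same spirit: morphemes and
    glosses are stored as aligned lists rather than as formatted strings. The
    example phrase below is hypothetical and uses no special tooling; the pyigt
    package linked above provides the actual CLDF-based machinery.

    # Hypothetical interlinear glossed example: words, morphemes, and glosses
    # are kept as aligned, machine-readable lists.
    phrase = ["apfel-baum", "steht"]
    glosses = ["apple-tree", "stand.3SG"]

    # Pair each morpheme (split on "-") with its gloss.
    for word, gloss in zip(phrase, glosses):
        for morpheme, morpheme_gloss in zip(word.split("-"), gloss.split("-")):
            print(morpheme, morpheme_gloss, sep="\t")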

  43. ¡Gracias a todos! (Thanks to everyone!)
    43
    Thanks to:
    Cormac Anderson, Timotheus A. Bodt, Thiago Chacon, Ilya Chechuro, Doug
    Cooper, Robert Forkel, Volker Gast, Russell Gray, Simon J. Greenhill, Guido
    Grimm, Teague Henry, Nathan W. Hill, Joshua Jackson, Guillaume Jacques, Maria
    Koptjevskaja Tamm, Yunfan Lai, Kristen Lindquist, Kristina Pianykh, Justin Power,
    Robin Ryder, Christoph Rzymski, Laurent Sagart, Nathanael E. Schweikhard,
    Valentin Thouzeau, Annika Tjuka, Tiago Tresoldi, Mary E. Walworth, Joseph
    Watts, Mei-Shin Wu, Roberto Zariquiey, and many more...
