Integrating data in cross-linguistic studies

Talk held at the meeting "Untangling the linguistic past of the Americas: Collaborative efforts and interdisciplinary approaches in an open science framework" (2020-09-23/25, Lima [virtual conference], PUCP).

Johann-Mattis List

September 23, 2020

Transcript

  1. Integrating Data in Cross-Linguistic Studies: Beyond Data Sharing

    Johann-Mattis List (DLCE)
  2. Data Sharing and Data Integration

  3. Data Sharing is Gaining Importance

    - Many studies in historical linguistics and linguistic typology present results based on data which are not shared.
    - Data sharing is becoming increasingly important and has been reinforced by some journals (Linguistic Typology, Linguistics, LDNC).
    - Reviewers increasingly ask for data and code during the review stage and make replicability checks a part of the review procedure.
  7. But Data Sharing is Not the Whole Story!

    - Sharing data does not mean that the data can be easily processed and reused.
    - But data processing and data reuse (interoperability and reusability) are essential to
      - find errors in existing datasets,
      - enhance existing datasets, and
      - create new datasets from existing ones.
    - The discussion should not be about sharing data alone, but about integrating data.
  11. How Data can be Successfully Integrated

    - To be integrated, data need to be
      - internally consistent, and
      - externally comparable.
    - Internal consistency requires that data are not only human-readable but also machine-readable.
    - External comparability requires that data conform to general standards.
    - The Cross-Linguistic Data Formats (CLDF) initiative (https://cldf.clld.org) has presented recommendations for the representation of integrated cross-linguistic data collections.
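The two requirements on this slide can be illustrated with a small consistency check. The following is a minimal sketch using only the Python standard library, not the CLDF validation tooling itself. The column names and the sample rows are assumptions made for this illustration; the only external convention relied on is that a Glottocode consists of four lowercase alphanumeric characters followed by four digits.

```python
import csv
import io
import re

# A Glottocode has four alphanumeric characters followed by four digits.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")

# Hypothetical column names for this sketch, loosely modeled on wordlist tables.
REQUIRED = ("ID", "Language_ID", "Glottocode", "Concept", "Form")


def check_wordlist(text):
    """Return a list of (line_number, message) problems found in a CSV wordlist."""
    problems = []
    seen_ids = set()
    reader = csv.DictReader(io.StringIO(text))
    for number, row in enumerate(reader, start=2):  # the header is line 1
        # Internal consistency: every required cell is filled, IDs are unique.
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append((number, f"missing value for {column}"))
        if row.get("ID") in seen_ids:
            problems.append((number, f"duplicate ID {row['ID']}"))
        seen_ids.add(row.get("ID"))
        # External comparability: the language is identified by a well-formed code.
        code = row.get("Glottocode") or ""
        if code and not GLOTTOCODE.match(code):
            problems.append((number, f"malformed Glottocode {code!r}"))
    return problems


# Illustrative sample rows written for this sketch, not taken from a real dataset.
SAMPLE = """ID,Language_ID,Glottocode,Concept,Form
1,German,stan1295,apple,apfel
2,German,stan1295,dance,
2,German,German,tree,baum
"""

for line, message in check_wordlist(SAMPLE):
    print(line, message)
```

A check like this only tests the form of the data. Whether a code points to the intended language variety still has to be verified against the reference catalog.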
  16. Cross-Linguistic Data Formats (CLDF) and CLDFBench

  17. CLDF Standard Formats

  18. CLDF Reference Catalogs

    - Glottolog (https://glottolog.org): reference catalog for language varieties
    - Concepticon (https://concepticon.clld.org): reference catalog for concepts, linking elicitation glosses used in linguistic literature to Concepticon Concept Sets
    - Cross-Linguistic Transcription Systems (https://clts.clld.org): identifies more than 8000 possible speech sounds and links transcription systems and transcription datasets to them
  19. CLDF Reference Catalogs

    - Glottolog: offers geolocations, rudimentary classifications, and references
    - Concepticon: offers definitions, and now also extensive norms, ratings, and relations (https://digling.org/norare; frequency, similarity, hierarchies, etc.)
    - Cross-Linguistic Transcription Systems: offers feature sets for each speech sound, and similarity metrics based on the features
  20. CLDF Reference Catalogs

    Linking your data to Glottolog, mapping your concepts to Concepticon, and converting your transcriptions to CLTS will drastically enrich them!
  21. Lifting Data with CLDFBench

  22. Lifting Data with CLDFBench

    - Python package (https://github.com/cldf/cldfbench)
    - Consistent, code-based, test-driven conversion of data to CLDF
    - Integration with Orthography Profiles for conversion to CLTS
    - Straightforward conversion of CLDF datasets into a CLLD database
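The idea behind an orthography profile can be sketched in a few lines: a table maps each grapheme of the source transcription to a target segment, and forms are segmented by greedy longest match. This is an illustration under stated assumptions, not the API of CLDFBench or of any existing package. The profile below is hypothetical and covers only the example form used later in this talk.

```python
TIE = "\u0361"  # combining double inverted breve (tie bar)

# Hypothetical mini-profile: grapheme in the source data -> target segment.
PROFILE = {
    "tʰ": "tʰ",
    "t" + TIE + "sʰ": "tsʰ",  # affricate with tie bar -> affricate without it
    "a": "a",
    "n": "n",
    "ə": "ə",
}


def segment(form, profile):
    """Greedy longest-match segmentation of a form using a profile."""
    graphemes = sorted(profile, key=len, reverse=True)
    segments = []
    i = 0
    while i < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # Mark unknown characters instead of guessing a sound value.
            segments.append("<?" + form[i] + ">")
            i += 1
    return segments


form = "tʰan" + "t" + TIE + "sʰ" + "ən"
print(" ".join(segment(form, PROFILE)))  # tʰ a n tsʰ ə n
```

Marking unknown characters explicitly keeps the conversion testable: a dataset counts as converted only once no marked segments remain.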
  23. CLDF Success Stories: CLICS

    Increasing coverage in the CLICS database over the past four years:
    - CLICS¹ (List et al. 2014, 300 languages)
    - CLICS² (List et al. 2018, 1200 languages)
    - CLICS³ (Rzymski et al. to appear, 2474 languages)
  24. CLDF Success Stories: Emotion Semantics

    Tackling questions on cognition with cross-linguistic data:
    - lexical data lifted for more than 2000 language varieties
    - automated methods for the creation of colexification networks
    - colexification networks can help to answer various questions on human cognition
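How a colexification network can be derived automatically from a wordlist may be sketched as follows. Two concepts are linked whenever one language expresses both with the same form, and each link records the languages that attest it. The toy wordlist is invented for this illustration and does not represent real languages. The actual CLICS workflow is more elaborate than this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy wordlist: (language, concept, form). Not real data.
WORDLIST = [
    ("lang_a", "TREE", "ki"),
    ("lang_a", "WOOD", "ki"),
    ("lang_a", "FIRE", "pa"),
    ("lang_b", "TREE", "mu"),
    ("lang_b", "WOOD", "mu"),
    ("lang_b", "FIRE", "mu"),
]


def colexification_network(wordlist):
    """Map each concept pair to the set of languages that colexify it."""
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[language, form].add(concept)
    edges = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        # Every pair of concepts sharing a form within one language is an edge.
        for pair in combinations(sorted(concepts), 2):
            edges[pair].add(language)
    return dict(edges)


network = colexification_network(WORDLIST)
for (concept_a, concept_b), languages in sorted(network.items()):
    print(concept_a, concept_b, len(languages))
```

The number of languages per edge can serve as an edge weight. Counting language families instead of languages would be a natural refinement, but requires the link to a classification such as the one in Glottolog.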
  25. CLDF Success Stories: Emotion Semantics

    [Figure panels: Universal, Austronesian, Indo-European]

    Major results:
    - Individual language families differ in their emotion semantics.
    - Emotions are globally differentiated by hedonic valence and physiological activation.
  26. Summary on CLDF

    - With CLDF and CLDFBench, integrating your lexical and structural data becomes much easier.
    - Based on our experience with helping colleagues to convert their data to CLDF, we can say: converting your dataset to CLDF will help you to improve it!
    - If you convert the data accompanying your publications to CLDF, you help not only yourself, but also others to build on your work.
    - Our CLDF team at the DLCE of the MPI-SHH in Jena/Leipzig is always ready to help with questions.
  31. Enhanced, Machine-Readable Annotation of Etymologies

  32. Why Cognacy is Not Enough

    Cognacy is not necessarily
    - binary, in the light of partial cognates,
    - “horizontal”, in the light of language-internal word families,
    - implying regular sound change and alignability, in the light of non-concatenative derivation,
    - flat, but rather hierarchical, depending on the purpose of the analysis.
  33. What Can We Do About it?

    - Develop enhanced frameworks for annotation of etymologies.
    - Develop interfaces which support the annotation.
    - Develop software for preprocessing, which can be used to automatically annotate data in a computer-assisted setting.
  34. Where Are We Now?

    - Annotation Frameworks:
      - Hill and List (2017)
      - Schweikhard and List (2020)
    - Interfaces:
      - List (2017): EDICTOR tool (https://digling.org/edictor/)
    - Software:
      - List et al. (2020): LingPy (http://lingpy.org)
        - Phonetic alignment (List 2014)
        - Partial cognate detection (List et al. 2016)
        - Cognate detection (List et al. 2017)
        - Sound correspondence patterns (List 2019)
        - Workflows (Wu et al. 2020)
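To give an impression of what aligning two word forms involves, here is a minimal sketch of global pairwise alignment (Needleman-Wunsch) over lists of sound segments. It is not LingPy's algorithm: the scoring is a uniform match/mismatch/gap scheme chosen for this illustration, whereas phonetic alignment methods score segment pairs by their similarity. The second word form is a made-up comparison form.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; return the two gapped sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two segments
                score[i - 1][j] + gap,                   # gap in seq_b
                score[i][j - 1] + gap,                   # gap in seq_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


# The second form is hypothetical and only serves as a comparison partner.
first, second = align("tʰ a n tsʰ ə n".split(), "d a n s".split())
print("\t".join(first))
print("\t".join(second))
```

Because the input is a list of segments rather than a string of characters, complex sounds such as tsʰ are treated as one unit, which is the reason segmented word forms matter for sequence comparison.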
  35. Sequence Comparison with LingPy

  36. Etymological Annotation with EDICTOR

  37. Improving Etymological Annotation: 1. Representation of Forms

    - Word forms should be represented as lists of sound segments.
      - not [tʰantsʰən] but [tʰ a n t͡sʰ ə n]
    - Word forms should conform to CLTS:
      - not [tʰ a n t͡sʰ ə n] but [tʰ a n tsʰ ə n]
    - Word forms should be further segmented into morphemes.
      - not [tʰ a n tsʰ ə n] but [tʰ a n tsʰ + ə n]
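The representation recommended on this slide is easy to process by machine. As a minimal sketch, assuming that segments are separated by spaces and morphemes by a plus sign as in the examples above, a form can be parsed into a list of morphemes, each being a list of segments. The function name is chosen for this illustration.

```python
def parse_form(segmented):
    """Split a space-separated segment string into morphemes at '+' markers."""
    morphemes = [[]]
    for token in segmented.split():
        if token == "+":
            morphemes.append([])  # a plus sign starts a new morpheme
        else:
            morphemes[-1].append(token)
    return morphemes


word = parse_form("tʰ a n tsʰ + ə n")
print(word)       # [['tʰ', 'a', 'n', 'tsʰ'], ['ə', 'n']]
print(len(word))  # 2
```

An unsegmented string such as [tʰantsʰən] offers no reliable way to recover this structure, since it is not clear where one sound ends and the next begins.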
  38. Improving Etymological Annotation: 2. Representation of Meanings

    - Base meanings should be linked to Concepticon (if available).
      - “apple (noun)” -> 1320 APPLE
    - Individual meanings of morphemes per word should be glossed, distinguishing lexical (autosemantic) from grammatical (synsemantic) meanings.
      - [a pf ə l + b au m] -> APPLE TREE
      - [tʰ a n tsʰ + ə n] -> DANCE _infinitive
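Both kinds of meaning annotation can be sketched with plain data structures. The mapping below contains only the Concepticon link given on this slide. The function pairs each morpheme with its gloss and assumes, based on the example "_infinitive", that a leading underscore marks a grammatical meaning. The function and variable names are chosen for this illustration.

```python
# Elicitation gloss -> (Concepticon ID, Concepticon gloss), as given on the slide.
CONCEPTICON = {"apple (noun)": ("1320", "APPLE")}


def annotate(segmented, glosses):
    """Pair each morpheme of a segmented form with its gloss and gloss type."""
    morphemes = [part.split() for part in segmented.split(" + ")]
    labels = glosses.split()
    if len(morphemes) != len(labels):
        raise ValueError("number of morphemes and glosses differs")
    return [
        # Assumption: a leading underscore marks a grammatical meaning.
        (morpheme, label, "grammatical" if label.startswith("_") else "lexical")
        for morpheme, label in zip(morphemes, labels)
    ]


print(CONCEPTICON["apple (noun)"])
for entry in annotate("tʰ a n tsʰ + ə n", "DANCE _infinitive"):
    print(entry)
```

Requiring one gloss per morpheme turns a frequent annotation error, a mismatch between segmentation and glossing, into something a program can detect.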
  39. Improving Etymological Annotation: 3. Representation of Cognacy

    - Fully cognate words receive the same ID (independent of their meaning slot), called COGID in LingPy/EDICTOR.
    - Partial cognate words receive an ID for each of their morphemes.
    - Alignments should only be provided for partial cognates.
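The difference between the two kinds of IDs can be sketched as follows. Each word carries one ID for full cognacy and a list of IDs with one entry per morpheme. The field name "cogids" for the per-morpheme IDs and all values below are assumptions made for this illustration, not annotations from a real dataset.

```python
# Hypothetical annotations: one full-cognate ID and one ID per morpheme.
WORDS = {
    "word_1": {"cogid": 1, "cogids": [10, 11]},
    "word_2": {"cogid": 1, "cogids": [10, 11]},
    "word_3": {"cogid": 2, "cogids": [10, 12]},
    "word_4": {"cogid": 3, "cogids": [13]},
}


def relation(word_a, word_b):
    """Classify two annotated words as 'full', 'partial', or 'none'."""
    if word_a["cogid"] == word_b["cogid"]:
        return "full"
    if set(word_a["cogids"]) & set(word_b["cogids"]):
        return "partial"  # at least one morpheme is shared
    return "none"


print(relation(WORDS["word_1"], WORDS["word_2"]))  # full
print(relation(WORDS["word_1"], WORDS["word_3"]))  # partial
print(relation(WORDS["word_1"], WORDS["word_4"]))  # none
```

A binary cognate judgment would have to force the pair word_1 and word_3 into either "cognate" or "not cognate", which is the loss of information the per-morpheme IDs avoid.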
  40. Improving Etymological Annotation

    Enhanced data representation in EDICTOR (Pano data by Zariquiey et al., in preparation).
  41. Integration Beyond the Lexicon

  42. Integrating Data Beyond the Lexicon

    - CLDF datasets: https://github.com/cldf-datasets/
    - Interlinear-glossed text: https://github.com/cldf/pyigt
    - Parallel texts: ???
  43. Thank you all!

    Thanks to: Cormac Anderson, Timotheus A. Bodt, Thiago Chacon, Ilya Chechuro, Doug Cooper, Robert Forkel, Volker Gast, Russell Gray, Simon J. Greenhill, Guido Grimm, Teague Henry, Nathan W. Hill, Joshua Jackson, Guillaume Jacques, Maria Koptjevskaja Tamm, Yunfan Lai, Kristen Lindquist, Kristina Pianykh, Justin Power, Robin Ryder, Christoph Rzymski, Laurent Sagart, Nathanael E. Schweikhard, Valentin Thouzeau, Annika Tjuka, Tiago Tresoldi, Mary E. Walworth, Joseph Watts, Mei-Shin Wu, Roberto Zariquiey, and many more...