Integrating data in cross-linguistic studies

Integrating Data in Cross-Linguistic Studies: Beyond Data Sharing
Johann-Mattis List (DLCE)

Data Sharing and Data Integration

Data Sharing is Gaining Importance
- Many studies in historical linguistics and linguistic typology present results based on data that are not shared.
- Data sharing is becoming increasingly important and has been reinforced by some journals (Linguistic Typology, Linguistics, LDNC).
- Reviewers increasingly ask for data and code during the review stage and make replicability checks part of the review procedure.

But Data Sharing is Not the Whole Story!
- Sharing data does not mean that the data can be easily processed and reused.
- But data processing and data reuse (interoperability and reusability) are essential to
  - find errors in existing datasets,
  - enhance existing datasets, and
  - create new datasets from existing ones.
- The discussion should not be about sharing data alone, but about integrating data.

How Data can be Successfully Integrated
- To be integrated, data need to be
  - internally consistent, and
  - externally comparable.
- Internal consistency requires that data are not only human-readable but also machine-readable.
- External comparability requires that data conform to general standards.
- The Cross-Linguistic Data Formats (CLDF) initiative (https://cldf.clld.org) has presented recommendations for the representation of integrated cross-linguistic data collections.
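To make the idea of machine-checkable internal consistency concrete, here is a minimal sketch using only the Python standard library. It assumes two small tables with CLDF-style column names (ID, Language_ID, Parameter_ID, Form); the rows and the helper names `read_table` and `check_forms` are invented for illustration and are not part of any CLDF tool.

```python
import csv
import io

# Toy tables with CLDF-style column names; all rows are invented.
LANGUAGES_CSV = """ID,Name
lang1,Variety A
lang2,Variety B
"""

FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,lang1,apple,apfel
f2,lang2,apple,
f3,lang3,apple,appel
f1,lang1,tree,baum
"""


def read_table(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def check_forms(forms, languages):
    """Return a list of human-readable problems found in the form table."""
    known_languages = {row["ID"] for row in languages}
    seen_ids = set()
    problems = []
    for row in forms:
        # Every form needs a unique identifier.
        if row["ID"] in seen_ids:
            problems.append(f"duplicate form ID: {row['ID']}")
        seen_ids.add(row["ID"])
        # Every form must point to a language that exists.
        if row["Language_ID"] not in known_languages:
            problems.append(f"{row['ID']}: unknown language {row['Language_ID']}")
        # Every form must have a non-empty value.
        if not row["Form"].strip():
            problems.append(f"{row['ID']}: empty form")
    return problems


problems = check_forms(read_table(FORMS_CSV), read_table(LANGUAGES_CSV))
for problem in problems:
    print(problem)
```

Checks of this kind are cheap to run on every change, which is why a machine-readable representation makes errors much easier to find than a spreadsheet meant only for human readers.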

Cross-Linguistic Data Formats (CLDF) and CLDFBench

CLDF Standard Formats

CLDF Reference Catalogs
- Glottolog (https://glottolog.org): reference catalog for language varieties
- Concepticon (https://concepticon.clld.org): reference catalog for concepts, linking elicitation glosses used in the linguistic literature to Concepticon Concept Sets
- Cross-Linguistic Transcription Systems (CLTS, https://clts.clld.org): identifies more than 8000 possible speech sounds and links transcription systems and transcription datasets to them

CLDF Reference Catalogs
- Glottolog: offers geolocations, rudimentary classifications, and references
- Concepticon: offers definitions, and now also extensive norms, ratings, and relations (frequency, similarity, hierarchies, etc.; see https://digling.org/norare)
- Cross-Linguistic Transcription Systems: offers feature sets for each speech sound, and similarity metrics based on the features
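As a rough illustration of what a feature-based similarity metric can look like, the sketch below compares sounds by the overlap of their feature sets (Jaccard similarity). This is an assumption made for illustration, not the metric implemented in CLTS, and the three hand-written feature sets are a toy stand-in for the catalog.

```python
# Toy feature sets written by hand; a real catalog would supply these.
FEATURES = {
    "p": {"voiceless", "bilabial", "stop", "consonant"},
    "b": {"voiced", "bilabial", "stop", "consonant"},
    "a": {"unrounded", "open", "front", "vowel"},
}


def feature_similarity(sound_a, sound_b, features=FEATURES):
    """Jaccard similarity: shared features divided by all features."""
    set_a, set_b = features[sound_a], features[sound_b]
    return len(set_a & set_b) / len(set_a | set_b)


print(feature_similarity("p", "b"))  # 3 shared of 5 total -> 0.6
print(feature_similarity("p", "a"))  # no shared features -> 0.0
```

Once transcriptions are linked to a common set of sounds, a comparison like this works across datasets that originally used different transcription systems.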

Linking your data to Glottolog, mapping your concepts to Concepticon, and converting your transcriptions to CLTS will drastically enrich them!

Lifting Data with CLDFBench

- Python package (https://github.com/cldf/cldfbench)
- Consistent, code-based, test-driven conversion of data to CLDF
- Integration with Orthography Profiles for conversion to CLTS
- Straightforward conversion of CLDF datasets into a CLLD database
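The role of an orthography profile can be sketched as follows: a table maps each grapheme found in the source to a target segment, and a word is segmented by greedy longest-match lookup. The code below is a simplified stand-in written with the standard library only; it is not the implementation used by CLDFBench, and the profile contents, the `segment` helper, and the `<?>` marker for unknown characters are assumptions for illustration.

```python
# Hypothetical profile: source grapheme -> target segment.
# "t\u0361s\u02b0" is t + combining tie bar + s + aspiration mark.
PROFILE = {
    "t\u02b0": "t\u02b0",
    "t\u0361s\u02b0": "ts\u02b0",
    "a": "a",
    "n": "n",
    "\u0259": "\u0259",
}


def segment(word, profile):
    """Segment `word` by always taking the longest grapheme in the profile."""
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No grapheme matched: flag the character instead of guessing.
            segments.append("<?>")
            i += 1
    return segments


word = "t\u02b0ant\u0361s\u02b0\u0259n"
print(" ".join(segment(word, PROFILE)))
```

Flagging unknown characters rather than silently passing them through is what makes the conversion testable: a profile is complete for a dataset when no form produces the marker.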

CLDF Success Stories: CLICS
- CLICS¹ (List et al. 2014, 300 languages)
- CLICS² (List et al. 2018, 1200 languages)
- CLICS³ (Rzymski et al. to appear, 2474 languages)
Increasing coverage in the CLICS database over the past four years.

CLDF Success Stories: Emotion Semantics
Tackling questions on cognition with cross-linguistic data:
- lexical data lifted for more than 2000 language varieties
- automated methods for the creation of colexification networks
- colexification networks can help to answer various questions on human cognition
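The basic idea behind building a colexification network from lexical data can be sketched in a few lines: two concepts are linked whenever a language expresses both with the same form, and the edge weight counts the languages in which this happens. The sketch below is a simplified assumption of that procedure, not the CLICS pipeline, and the language names and forms are invented.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows of (language, concept, form); all values are invented.
ROWS = [
    ("lang1", "HAND", "taka"),
    ("lang1", "ARM", "taka"),
    ("lang2", "HAND", "mipo"),
    ("lang2", "ARM", "sulu"),
    ("lang3", "HAND", "ne"),
    ("lang3", "ARM", "ne"),
    ("lang3", "TREE", "ki"),
    ("lang3", "WOOD", "ki"),
]


def colexification_network(rows):
    """Map each concept pair to the number of languages colexifying it."""
    # Collect the concepts expressed by each form within each language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)
    # Link every pair of concepts that share a form in a language.
    languages_by_edge = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_edge[pair].add(language)
    return {edge: len(langs) for edge, langs in languages_by_edge.items()}


network = colexification_network(ROWS)
for (concept_a, concept_b), weight in sorted(network.items()):
    print(concept_a, concept_b, weight)
```

The sketch only works because forms and concepts are comparable across languages, which is exactly what linking to the reference catalogs provides.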

CLDF Success Stories: Emotion Semantics
[Figure panels: Universal, Austronesian, Indo-European]
Major Results:
- Individual language families differ in their emotion semantics.
- Emotions are globally differentiated by hedonic valence and physiological activation.

Summary on CLDF
- With CLDF and CLDFBench, integrating your lexical and structural data becomes much easier.
- Based on our experience of helping colleagues convert their data to CLDF, we can say: converting your dataset to CLDF will help you to improve it!
- If you convert the data accompanying your publications to CLDF, you help not only yourself, but also others to build on your work.
- Our CLDF team at the DLCE of the MPI-SHH in Jena/Leipzig is always ready to help with questions.

Enhanced, Machine-Readable Annotation of Etymologies

Why Cognacy is Not Enough
Cognacy is not necessarily
- binary, in the light of partial cognates,
- “horizontal”, in the light of language-internal word families,
- implying regular sound change and alignability, in the light of non-concatenative derivation,
- flat, but rather hierarchical, depending on the purpose of the analysis.

What Can We Do About It?
- Develop enhanced frameworks for the annotation of etymologies.
- Develop interfaces which support the annotation.
- Develop software for preprocessing, which can be used to automatically annotate data in a computer-assisted setting.

Where Are We Now?
- Annotation Frameworks:
  - Hill and List (2017)
  - Schweikhard and List (2020)
- Interfaces:
  - List (2017): EDICTOR tool (https://digling.org/edictor/)
- Software:
  - List et al. (2020): LingPy (http://lingpy.org)
  - Phonetic alignment (List 2014)
  - Partial cognate detection (List et al. 2016)
  - Cognate detection (List et al. 2017)
  - Sound correspondence patterns (List 2019)
  - Workflows (Wu et al. 2020)

Sequence Comparison with LingPy
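To give a feel for what sequence comparison on segmented word forms involves, the sketch below aligns two lists of sound segments with a plain Needleman-Wunsch algorithm and uniform scores. This is a textbook stand-in, not the algorithm implemented in LingPy, which uses more informed scoring; the function name `align`, the scoring values, and the second word form are assumptions made for illustration.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; return both rows and the score."""
    n, m = len(seq_a), len(seq_b)

    def substitution(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], score[n][m]


# The second form is invented to show a gap in the alignment.
row_a, row_b, total = align("a pf ə l".split(), "a pf l".split())
print(" ".join(row_a))
print(" ".join(row_b))
print(total)
```

Because the input is a list of segments rather than a raw string, an affricate such as pf is treated as one unit, which is the reason for representing word forms as segmented sequences.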

Etymological Annotation with EDICTOR

Improving Etymological Annotation
1. Representation of Forms
- Word forms should be represented as lists of sound segments.
  - not [tʰantsʰən] but [tʰ a n t͡sʰ ə n]
- Word forms should conform to CLTS:
  - not [tʰ a n t͡sʰ ə n] but [tʰ a n tsʰ ə n]
- Word forms should be further segmented into morphemes.
  - not [tʰ a n tsʰ ə n] but [tʰ a n tsʰ + ə n]

Improving Etymological Annotation
2. Representation of Meanings
- Base meanings should be linked to Concepticon (if available).
  - “apple (noun)” -> 1320 APPLE
- Individual meanings of morphemes per word should be glossed, distinguishing lexical (autosemantic) from grammatical (synsemantic) meanings.
  - [a pf ə l + b au m] -> APPLE TREE
  - [tʰ a n tsʰ + ə n] -> DANCE _infinitive
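A machine-readable version of this glossing scheme can be checked automatically. The sketch below pairs each morpheme of a segmented form with its gloss and marks a gloss as grammatical when it starts with an underscore. That convention is read off the example above and treated as an assumption here; the helper name `gloss_morphemes` is hypothetical.

```python
def gloss_morphemes(form, glosses):
    """Pair the morphemes of a segmented form with their glosses."""
    # Morphemes are separated by "+", segments by spaces.
    morphemes = [part.split() for part in form.split("+")]
    labels = glosses.split()
    if any(not morpheme for morpheme in morphemes):
        raise ValueError(f"empty morpheme in form: {form!r}")
    if len(morphemes) != len(labels):
        raise ValueError(
            f"{len(morphemes)} morphemes but {len(labels)} glosses in {form!r}"
        )
    return [
        {
            "segments": morpheme,
            "gloss": label,
            # Assumed convention: a leading underscore marks grammatical meaning.
            "grammatical": label.startswith("_"),
        }
        for morpheme, label in zip(morphemes, labels)
    ]


for entry in gloss_morphemes("tʰ a n tsʰ + ə n", "DANCE _infinitive"):
    print(entry)
```

A mismatch between the number of morphemes and the number of glosses raises an error, so incomplete annotations surface early instead of propagating into later analyses.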

Improving Etymological Annotation
3. Representation of Cognacy
- Fully cognate words receive the same ID (independent of their meaning slot), called COGID in LingPy/EDICTOR.
- Partial cognate words receive an ID for each of their morphemes.
- Alignments should only be provided for partial cognates.
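With one ID per morpheme, the relation between two words can be derived mechanically. The sketch below treats two words as fully cognate when their morpheme ID sequences are identical, as partially cognate when they share at least one ID, and as unrelated otherwise. This three-way reading, the function name `cognate_relation`, and the toy IDs are simplifying assumptions for illustration.

```python
def cognate_relation(ids_a, ids_b):
    """Classify two words by their per-morpheme cognate IDs."""
    if list(ids_a) == list(ids_b):
        return "full"
    if set(ids_a) & set(ids_b):
        return "partial"
    return "none"


# Toy annotation: word label -> one cognate ID per morpheme (invented).
WORDS = {
    "word1": [1, 2],
    "word2": [1, 2],
    "word3": [1, 3],
    "word4": [4],
}

print(cognate_relation(WORDS["word1"], WORDS["word2"]))  # full
print(cognate_relation(WORDS["word1"], WORDS["word3"]))  # partial
print(cognate_relation(WORDS["word1"], WORDS["word4"]))  # none
```

Annotating at the morpheme level keeps the information that a binary cognate judgment would discard: word1 and word3 share their first morpheme even though they are not cognate as whole words.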

Improving Etymological Annotation
Enhanced Data Representation in EDICTOR (Pano data by Zariquiey et al., in preparation).

Integration Beyond the Lexicon

Integrating Data Beyond the Lexicon
- CLDF Datasets: https://github.com/cldf-datasets/
- Interlinear-Glossed Text: https://github.com/cldf/pyigt
- Parallel texts: ???

Thank you all!
Thanks to: Cormac Anderson, Timotheus A. Bodt, Thiago Chacon, Ilya Chechuro, Doug Cooper, Robert Forkel, Volker Gast, Russell Gray, Simon J. Greenhill, Guido Grimm, Teague Henry, Nathan W. Hill, Joshua Jackson, Guillaume Jacques, Maria Koptjevskaja Tamm, Yunfan Lai, Kristen Lindquist, Kristina Pianykh, Justin Power, Robin Ryder, Christoph Rzymski, Laurent Sagart, Nathanael E. Schweikhard, Valentin Thouzeau, Annika Tjuka, Tiago Tresoldi, Mary E. Walworth, Joseph Watts, Mei-Shin Wu, Roberto Zariquiey, and many more...
