Linguistic Catalogs and DLCE tools

Presentation at the RRDM-SHH Workshop

Tiago Tresoldi

January 30, 2020

Transcript

  1. 2.

    Why catalogs?

    - FAIR data are normalized data
    - Primary keys, foreign keys
    - Crash-course on the relational model
    - DLCE catalogs & tools
    - Exercises
  2. 4.
  3. 5.

    Four pets:

    - A young Yorkshire named “Brutus”
    - An old Pitbull named “Fluff”
    - A Manx cat named “Simba”
    - An ultra-centenarian tortoise named “Darwin”

    Each one is a “tuple” or “record”.
  4. 6.

    Organizing in a table should be trivial

    Species   Breed      Name    Age
    Dog       Yorkshire  Brutus  1
    Dog       Pitbull    Fluff   10
    Cat       Manx       Simba   3
    Tortoise             Darwin  120
  5. 7.

    An ID (“primary key”) makes sense

    ID  Species   Breed      Name    Age
    1   Dog       Yorkshire  Brutus  1
    2   Dog       Pitbull    Fluff   10
    3   Cat       Manx       Simba   3
    4   Tortoise             Darwin  120
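The pets table with its ID column can be sketched in SQLite through Python's standard `sqlite3` module. This is only an illustration: the table name `pets`, the column names, and the extra cat “Tom” used to provoke a duplicate-key error are assumptions, not part of the slides.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pets (
        id      INTEGER PRIMARY KEY,  -- unique identifier of each record
        species TEXT NOT NULL,
        breed   TEXT,                 -- the tortoise has no breed on the slide
        name    TEXT NOT NULL,
        age     INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO pets (id, species, breed, name, age) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Dog", "Yorkshire", "Brutus", 1),
        (2, "Dog", "Pitbull", "Fluff", 10),
        (3, "Cat", "Manx", "Simba", 3),
        (4, "Tortoise", None, "Darwin", 120),
    ],
)

# The primary key addresses exactly one record.
simba = conn.execute("SELECT name, age FROM pets WHERE id = 3").fetchone()

# Reusing an existing ID is rejected ("Tom" is a hypothetical extra pet).
try:
    conn.execute("INSERT INTO pets (id, species, name) VALUES (1, 'Cat', 'Tom')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The point of the sketch is that the database itself, not the person entering the data, guarantees that an ID never refers to two records.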
  6. 8.

    We can use the “Species” as a foreign key

    ID  Species   Name           Emoji
    1   Cat       F. catus
    2   Dog       C. familiaris
    3   Tortoise  C. nigra
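A foreign key can be sketched the same way, again with Python's standard `sqlite3` module. The names `species_id`, `pets`, and `species`, and the pet “Nessie” with a non-existent species ID, are assumptions made for the example. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.execute("CREATE TABLE species (id INTEGER PRIMARY KEY, species TEXT, name TEXT)")
conn.execute(
    """
    CREATE TABLE pets (
        id         INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species (id),  -- the foreign key
        breed      TEXT,
        name       TEXT,
        age        INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO species VALUES (?, ?, ?)",
    [(1, "Cat", "F. catus"), (2, "Dog", "C. familiaris"), (3, "Tortoise", "C. nigra")],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2, "Yorkshire", "Brutus", 1),
        (2, 2, "Pitbull", "Fluff", 10),
        (3, 1, "Manx", "Simba", 3),
        (4, 3, None, "Darwin", 120),
    ],
)

# A pet pointing at a species that is not in the catalog is rejected
# ("Nessie" and species ID 99 are hypothetical).
try:
    conn.execute("INSERT INTO pets VALUES (5, 99, NULL, 'Nessie', 7)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

pet_count = conn.execute("SELECT COUNT(*) FROM pets").fetchone()[0]
```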
  7. 9.

    Imagine there is an error in the data

    ID  Species   Breed      Name    Age
    1   Dog       Yorkshire  Brutus  1
    2   Dog       Pitbull    Fluff   10
    3   Cat       Manx       Simba   33
    4   Tortoise             Darwin  120
  8. 10.

    We could extend one of the tables…

    ID  Species   Name           Emoji  Median max age
    1   Cat       F. catus              14
    2   Dog       C. familiaris         10
    3   Tortoise  C. nigra              100
  9. 11.

    … and get a new table (a “JOIN” operation) with animals over the median max age

    ID  Species   Breed  Name    Age
    3   Cat       Manx   Simba   33
    4   Tortoise         Darwin  120
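The JOIN that produces this table can be written out as a short sketch with Python's standard `sqlite3` module. The table and column names (`pets`, `species`, `species_id`, `median_max_age`) are assumptions; the values are those shown on the slides, including the erroneous age of 33 for Simba.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE species (
        id             INTEGER PRIMARY KEY,
        species        TEXT,
        name           TEXT,
        median_max_age INTEGER
    );
    CREATE TABLE pets (
        id         INTEGER PRIMARY KEY,
        species_id INTEGER REFERENCES species (id),
        breed      TEXT,
        name       TEXT,
        age        INTEGER
    );
    INSERT INTO species VALUES
        (1, 'Cat', 'F. catus', 14),
        (2, 'Dog', 'C. familiaris', 10),
        (3, 'Tortoise', 'C. nigra', 100);
    INSERT INTO pets VALUES
        (1, 2, 'Yorkshire', 'Brutus', 1),
        (2, 2, 'Pitbull', 'Fluff', 10),
        (3, 1, 'Manx', 'Simba', 33),   -- the erroneous age from the slides
        (4, 3, NULL, 'Darwin', 120);
    """
)

# Join each pet to its species and keep those older than the species' median max age.
rows = conn.execute(
    """
    SELECT pets.id, species.species, pets.breed, pets.name, pets.age
    FROM pets
    JOIN species ON pets.species_id = species.id
    WHERE pets.age > species.median_max_age
    ORDER BY pets.id
    """
).fetchall()

for row in rows:
    print(row)
```

Only Simba and Darwin are returned, so the query surfaces the suspicious record without anyone having to read the whole table.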
  10. 12.

    If we had an additional table with people, we could link pets to their owners

    This allows new JOIN operations, depending on the information in the “PEOPLE” table, e.g.

    - What is the most common pet species among teenagers who like apples?
    - What is the mean age of cats owned by German speakers?
  11. 13.

    In linguistics, with all databases and catalogs, we could do complex queries such as:

    Display a sorted list of the ten most frequent vowels (CLTS) in kinship terms (Lexibank/Concepticon) among non-Indo-European African languages (Glottolog) that don’t express pronominal subjects by clitics (WALS/Grambank) and with a speaking population between 1,000 and 10,000 (third-party database)
  12. 14.
  13. 16.
  14. 17.

    - Glottolog provides a catalogue of the world’s “languoids” (language families, languages, and dialects)
    - It assigns a unique and stable identifier (the Glottocode) to (in principle) all such languoids
    - Languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research
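Because Glottocodes act as primary keys, it can help to check that a value at least has the right shape before using it to link data. The sketch below assumes the usual Glottocode form of four lowercase alphanumeric characters followed by four digits; the function name is hypothetical, and the check does not look anything up in Glottolog itself.

```python
import re

# Assumed shape of a Glottocode: four lowercase alphanumerics plus four digits.
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def looks_like_glottocode(code):
    """Return True if `code` has the shape of a Glottocode (no catalog lookup)."""
    return GLOTTOCODE_PATTERN.fullmatch(code) is not None


# "stan1293" is used only as an example of a well-formed code.
checks = {code: looks_like_glottocode(code) for code in ["stan1293", "stan129", "English"]}
print(checks)
```

A shape check only catches typos and free-text language names; whether a well-formed code actually exists still has to be verified against the catalog.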
  15. 18.

    - “Doculects” are not necessarily languages
    - Think about reconstructions
    - The genealogical classification is intended for navigation
    - It is by definition very conservative
    - Other catalogues: Ethnologue, ISO-639
    - FAIR data and Academic principles
  16. 20.
  17. 22.
  18. 23.

    A resource for linking concept lists

    - Concepticon links concept labels from different concept lists to concept sets.
    - Each concept set is given a unique identifier, a unique label, and a human-readable definition.
    - No point in discussing if it is a proper “ontology”: we use it as a normalized catalogue for linking otherwise “airtight” datasets
    - It is being linked to these “proper” ontologies!
  19. 24.
  20. 25.
  21. 28.
  22. 29.

    “voiceless post-alveolar sibilant affricate consonant”

    - IPA: tʃ (two Unicode characters, U+0074 U+0283)
    - ʧ (single Unicode character, U+02A7)
    - With bar on top t͡ʃ (U+0074 U+0361 U+0283), or below t͜ʃ (U+0074 U+035C U+0283)
    - Non-IPA: APA č, NAPA tᶴ, X-SAMPA ts\, ASJP C
    - Dozens of orthographies
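The four IPA encodings listed here look nearly identical on screen but are different strings to a computer. A small sketch with Python's standard `unicodedata` module makes that visible; the dictionary labels are only descriptive names chosen for the example.

```python
import unicodedata

# The four encodings from the slide, written as escapes to avoid ambiguity.
variants = {
    "two characters": "\u0074\u0283",
    "single ligature": "\u02A7",
    "tie bar above": "\u0074\u0361\u0283",
    "tie bar below": "\u0074\u035C\u0283",
}

for label, text in variants.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    names = ", ".join(unicodedata.name(ch) for ch in text)
    print(f"{label}: {text} | {codepoints} | {names}")

# Unicode normalization (NFC) does not merge them: all four stay distinct.
distinct_after_nfc = {unicodedata.normalize("NFC", text) for text in variants.values()}
print(len(distinct_after_nfc))
```

Since normalization alone does not unify these variants, an explicit mapping to a reference transcription is needed before sounds can be compared across datasets.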
  23. 30.

    Orthographic Profiles

    - Deal with non-normal data (e.g., homoglyphs) and segments
    - Latin a (U+0061), Cyrillic а (U+0430)
    - Combining and pre-composed characters: é and é
    - Labial click ʘ and Sun ☉
    - Segmentation looks trivial but is fundamental
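The kinds of “non-normal” data mentioned here can be demonstrated with Python's standard `unicodedata` module. The variable names are illustrative, and the code points for the precomposed and combining forms of é (U+00E9 versus U+0065 U+0301) and for the click and sun symbols (U+0298, U+2609) are supplied for the example, as the slide does not list them.

```python
import unicodedata

# Homoglyphs: same appearance, different characters.
latin_a, cyrillic_a = "\u0061", "\u0430"
homoglyphs_differ = latin_a != cyrillic_a

# Precomposed versus combining sequence for "é".
precomposed = "\u00E9"       # one code point
combining = "\u0065\u0301"   # "e" followed by COMBINING ACUTE ACCENT
equal_raw = precomposed == combining
equal_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# Normalization fixes the second problem but not the first:
# the Cyrillic letter stays Cyrillic under NFC.
still_cyrillic = unicodedata.normalize("NFC", cyrillic_a) == cyrillic_a

# Bilabial click letter versus the astronomical sun symbol.
click, sun = "\u0298", "\u2609"
print(unicodedata.name(click), "|", unicodedata.name(sun))
```

Normalization handles the combining/precomposed case, while homoglyphs and look-alike symbols need an explicit replacement table, which is one of the jobs an orthographic profile does.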
  24. 31.
  25. 32.
  26. 33.

    Value  Segments
    b      b
    baba   b ɐ b ɐ
    C"eC   tʃʼ e tʃ
    baz    b ɐ ?
    jed    ? e d
    ziz    ? ? ?
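The Value/Segments pairs above are consistent with a greedy longest-match segmentation against a profile. The sketch below is a toy version of that idea, not the API of any DLCE tool: the function name `segment` and the profile mapping are assumptions inferred from the table, with `?` marking graphemes missing from the profile.

```python
def segment(value, profile, missing="?"):
    """Greedy longest-match segmentation of `value` using a grapheme profile."""
    # Try longer graphemes first so that 'C"' wins over 'C'.
    graphemes = sorted(profile, key=len, reverse=True)
    segments = []
    i = 0
    while i < len(value):
        for grapheme in graphemes:
            if value.startswith(grapheme, i):
                segments.append(profile[grapheme])
                i += len(grapheme)
                break
        else:
            # No grapheme in the profile matches at this position.
            segments.append(missing)
            i += 1
    return " ".join(segments)


# Hypothetical profile inferred from the table above.
profile = {"b": "b", "a": "ɐ", 'C"': "tʃʼ", "C": "tʃ", "e": "e", "d": "d"}

results = {value: segment(value, profile) for value in ["b", "baba", 'C"eC', "baz", "jed", "ziz"]}
for value, segments in results.items():
    print(value, "->", segments)
```

Unmatched characters are reported rather than silently dropped, which makes gaps in the profile easy to spot and fix.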
  27. 34.

    Exercises!

    - Use Glottolog to identify the languages/dialects you work with, or your native one(s).
    - Use Concepticon to find the right concept set for the following glosses:
      “pineapple”, “fly (verb)”, “Brazilian Coral Snake”, “musical keyboard”
  28. 36.

    Software and data installed?

    Go to your terminal, activate the environment, and type

    $ cldfbench catinfo

    - “command not found” (or equivalent) means that the software installation is incomplete
    - no entries means you still need to install the catalogs (sorry, our fault!):

    $ cldfbench catconfig
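The same check can be scripted, for instance at the start of a notebook. This is a minimal sketch using only Python's standard library: the helper name `cldfbench_available` is hypothetical, and the output of `cldfbench catinfo` is printed unchanged because its exact format is not assumed here.

```python
import shutil
import subprocess


def cldfbench_available():
    """Return True if a `cldfbench` executable is found on the PATH."""
    return shutil.which("cldfbench") is not None


if cldfbench_available():
    # Same command as on the slide; the output is shown as-is.
    result = subprocess.run(["cldfbench", "catinfo"], capture_output=True, text=True)
    print(result.stdout or result.stderr)
else:
    print("cldfbench not found: the software installation is incomplete")
```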