Four pets:
A young Yorkshire named “Brutus”
An old Pitbull named “Fluff”
A Manx cat named “Simba”
An ultra-centenarian tortoise named “Darwin”
Each one is a “tuple” or “record”

An ID (“primary key”) makes sense

ID  Species   Breed      Name    Age
1   Dog       Yorkshire  Brutus  1
2   Dog       Pitbull    Fluff   10
3   Cat       Manx       Simba   3
4   Tortoise             Darwin  120

Imagine there is an error in the data

ID  Species   Breed      Name    Age
1   Dog       Yorkshire  Brutus  1
2   Dog       Pitbull    Fluff   10
3   Cat       Manx       Simba   33
4   Tortoise             Darwin  120

… and get a new table (a “JOIN” operation) with animals over the median max age

ID  Species   Breed      Name    Age
3   Cat       Manx       Simba   33
4   Tortoise             Darwin  120

If we had an additional table with people, we could link pets to their owners.
This allows new JOIN operations, depending on the information in the “PEOPLE” table, e.g.
What is the most common pet species among teenagers who like apples?
What is the mean age of cats owned by German speakers?
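
A minimal sketch of such a JOIN, using Python's built-in sqlite3 module. The pets table follows the slides; the people table, its columns, the owner names, and the owner_id links are all invented for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical PEOPLE table: names and languages are made up.
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        language TEXT
    );
    -- PETS table from the slides, plus an assumed owner_id column.
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        species TEXT,
        breed TEXT,
        name TEXT,
        age INTEGER,
        owner_id INTEGER REFERENCES people(id)
    );
    INSERT INTO people VALUES (1, 'Anna', 'German'), (2, 'Ben', 'English');
    INSERT INTO pets VALUES
        (1, 'Dog', 'Yorkshire', 'Brutus', 1, 2),
        (2, 'Dog', 'Pitbull', 'Fluff', 10, 1),
        (3, 'Cat', 'Manx', 'Simba', 3, 1),
        (4, 'Tortoise', NULL, 'Darwin', 120, 2);
""")

# "What is the mean age of cats owned by German speakers?"
query = """
    SELECT AVG(pets.age)
    FROM pets
    JOIN people ON pets.owner_id = people.id
    WHERE pets.species = 'Cat' AND people.language = 'German'
"""
mean_cat_age = conn.execute(query).fetchone()[0]
print(mean_cat_age)
```

The primary key of one table (people.id) reappears in the other (pets.owner_id); that shared column is what the JOIN follows.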

In linguistics, with all databases and catalogs, we could do complex queries such as:
Display a sorted list of the ten most frequent vowels (CLTS) in kinship terms (Lexibank/Concepticon) among non-Indo-European African languages (Glottolog) that don’t express pronominal subjects by clitics (WALS/Grambank) and with a speaking population between 1,000 and 10,000 (third-party database)

Glottolog provides a catalogue of the world’s “languoids” (language families, languages, and dialects).
It assigns a unique and stable identifier (the Glottocode) to (in principle) all such languoids.
Languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research.
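
One way to picture the Glottolog tree is as nodes that each carry a Glottocode and point to a parent. The sketch below is not the Glottolog software: the Languoid class is hypothetical, intermediate nodes are omitted, and the Glottocodes are quoted for illustration and should be checked against the catalogue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Languoid:
    """Hypothetical node in a genealogical tree of languoids."""
    glottocode: str
    name: str
    level: str  # "family", "language", or "dialect"
    parent: Optional["Languoid"] = None

    def lineage(self) -> List[str]:
        """Return ancestor names from the top-level family downwards."""
        path = []
        node = self.parent
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Illustrative fragment only; the real tree has more intermediate nodes.
indo_european = Languoid("indo1319", "Indo-European", "family")
germanic = Languoid("germ1287", "Germanic", "family", parent=indo_european)
german = Languoid("stan1295", "German", "language", parent=germanic)

print(german.glottocode, german.lineage())
```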

“Doculects” are not necessarily languages
  Think about reconstructions
The genealogical classification is intended for navigation
  It is by definition very conservative
Other catalogues: Ethnologue, ISO-639
FAIR data and academic principles

A resource for linking concept lists
Concepticon links concept labels from different concept lists to concept sets.
Each concept set is given a unique identifier, a unique label, and a human-readable definition.
No point in discussing if it is a proper “ontology”: we use it as a normalized catalogue for linking otherwise “airtight” datasets
It is being linked to these “proper” ontologies!
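
The linking idea can be sketched with two plain dictionaries. Everything here is hypothetical: the list names, the glosses, the definitions, and the identifiers (CS1, CS2) are placeholders and not real Concepticon data.

```python
# Hypothetical concept sets: identifier -> (label, definition).
concept_sets = {
    "CS1": ("FLY (MOVE THROUGH AIR)", "To move through the air."),
    "CS2": ("FLY (INSECT)", "A small flying insect."),
}

# Hypothetical concept lists: each maps its own gloss to a concept set identifier.
concept_lists = {
    "ListA": {"fly (verb)": "CS1", "fly (noun)": "CS2"},
    "ListB": {"to fly": "CS1"},
}


def shared_concept_sets(list_a, list_b):
    """Return identifiers of concept sets that both lists link to."""
    ids_a = set(concept_lists[list_a].values())
    ids_b = set(concept_lists[list_b].values())
    return ids_a & ids_b


# Different glosses ("fly (verb)", "to fly") end up comparable via the shared identifier.
for identifier in shared_concept_sets("ListA", "ListB"):
    label, definition = concept_sets[identifier]
    print(identifier, label, definition)
```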

Orthographic Profiles
Deal with non-normalized data (e.g., homoglyphs) and segments
  Latin a (U+0061), Cyrillic а (U+0430)
  Combining and pre-composed characters: é (U+0065 U+0301) and é (U+00E9)
  Labial click ʘ (U+0298) and Sun ☉ (U+2609)
Segmentation looks trivial but is fundamental
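
The three problems above can be reproduced with Python's standard unicodedata module. The naive_segments function is a hypothetical, deliberately simplified stand-in: it only attaches combining marks to the preceding character, which is far less than what a real orthography profile does.

```python
import unicodedata

# Homoglyphs: identical on screen, different code points.
latin_a = "\u0061"
cyrillic_a = "\u0430"
print(latin_a == cyrillic_a)
print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))

# Combining vs. pre-composed: equal only after normalization.
combining_e = "e\u0301"    # e + combining acute accent
precomposed_e = "\u00e9"   # single code point
print(combining_e == precomposed_e)
print(unicodedata.normalize("NFC", combining_e) == precomposed_e)


def naive_segments(text):
    """Split text into base characters, each with its combining marks.

    Simplified sketch: decomposes to NFD, then glues every combining
    mark onto the segment that precedes it.
    """
    segments = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


print(naive_segments("caf\u00e9"))
```

Counting characters with len() on the decomposed string would report five items for a four-segment word, which is why segmentation has to be done explicitly.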

Exercises!
Use Glottolog to identify the languages/dialects you work with, or your native one(s).
Use Concepticon and find the right concept set for the following glosses:
  “pineapple”
  “fly (verb)”
  “Brazilian Coral Snake”
  “musical keyboard”

Software and data installed?
Go to your terminal, activate the environment, and type
  $ cldfbench catinfo
command not found (or equivalent) means that software installation is incomplete
no entries means you still need to install the catalogs (sorry, our fault!):
  $ cldfbench catconfig