Linguistic Catalogs and DLCE tools

Tiago Tresoldi, Jena, 30/01/2020

Why catalogs?
- FAIR data are normalized data
- Primary keys, foreign keys
- Crash course on the relational model
- DLCE catalogs & tools
- Exercises

Relational model

Four pets:
- A young Yorkshire named “Brutus”
- An old Pitbull named “Fluff”
- A Manx cat named “Simba”
- An ultra-centenarian tortoise named “Darwin”

Each one is a “tuple” or “record”

Organizing in a table should be trivial

Species   Breed      Name    Age
Dog       Yorkshire  Brutus  1
Dog       Pitbull    Fluff   10
Cat       Manx       Simba   3
Tortoise  -          Darwin  120

An ID (“primary key”) makes sense

ID  Species   Breed      Name    Age
1   Dog       Yorkshire  Brutus  1
2   Dog       Pitbull    Fluff   10
3   Cat       Manx       Simba   3
4   Tortoise  -          Darwin  120

We can use the “Species” as a foreign key

ID  Species   Name           Emoji
1   Cat       F. catus
2   Dog       C. familiaris
3   Tortoise  C. nigra
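The two tables above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (pets, species, breed, and so on) are assumptions made for the example, and the Emoji column is left out. The point is that the “Species” value in the pets table must exist in the species table, otherwise the database rejects the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Lookup table: one row per species, with its own primary key.
conn.execute("""
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        species TEXT NOT NULL UNIQUE,
        name TEXT NOT NULL
    )
""")
# Main table: "species" is a foreign key pointing at the lookup table.
conn.execute("""
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        species TEXT NOT NULL REFERENCES species(species),
        breed TEXT,
        name TEXT NOT NULL,
        age INTEGER
    )
""")
conn.executemany("INSERT INTO species VALUES (?, ?, ?)", [
    (1, "Cat", "F. catus"),
    (2, "Dog", "C. familiaris"),
    (3, "Tortoise", "C. nigra"),
])
conn.executemany("INSERT INTO pets VALUES (?, ?, ?, ?, ?)", [
    (1, "Dog", "Yorkshire", "Brutus", 1),
    (2, "Dog", "Pitbull", "Fluff", 10),
    (3, "Cat", "Manx", "Simba", 3),
    (4, "Tortoise", None, "Darwin", 120),
])

def insert_pet(species, name):
    """Try to add a pet; return False if the species is not in the catalog."""
    try:
        conn.execute("INSERT INTO pets (species, name) VALUES (?, ?)", (species, name))
        return True
    except sqlite3.IntegrityError:
        return False

print(insert_pet("Cat", "Tom"))        # known species
print(insert_pet("Unicorn", "Ghost"))  # unknown species, rejected by the foreign key
```

A misspelled or unknown species cannot enter the data silently, which is the kind of normalization the catalogs provide for linguistic data.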

Imagine there is an error in the data

ID  Species   Breed      Name    Age
1   Dog       Yorkshire  Brutus  1
2   Dog       Pitbull    Fluff   10
3   Cat       Manx       Simba   33
4   Tortoise  -          Darwin  120

We could extend one of the tables…

ID  Species   Name           Emoji  Median max age
1   Cat       F. catus              14
2   Dog       C. familiaris         10
3   Tortoise  C. nigra              100

… and get a new table (a “JOIN” operation) with animals over the median max age

ID  Species   Breed  Name    Age
3   Cat       Manx   Simba   33
4   Tortoise  -      Darwin  120
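A minimal sketch of this JOIN, again with Python's sqlite3 module. The table and column names (including median_max_age) are assumptions chosen for the example; the data are the ones from the slides, including the erroneous age of 33 for Simba.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        species TEXT UNIQUE,
        name TEXT,
        median_max_age INTEGER
    );
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        species TEXT REFERENCES species(species),
        breed TEXT,
        name TEXT,
        age INTEGER
    );
    INSERT INTO species VALUES
        (1, 'Cat', 'F. catus', 14),
        (2, 'Dog', 'C. familiaris', 10),
        (3, 'Tortoise', 'C. nigra', 100);
    INSERT INTO pets VALUES
        (1, 'Dog', 'Yorkshire', 'Brutus', 1),
        (2, 'Dog', 'Pitbull', 'Fluff', 10),
        (3, 'Cat', 'Manx', 'Simba', 33),
        (4, 'Tortoise', NULL, 'Darwin', 120);
""")

# Combine both tables on the shared "species" value and keep only the
# pets that are older than the median maximum age of their species.
over_age = conn.execute("""
    SELECT pets.id, pets.species, pets.breed, pets.name, pets.age
    FROM pets
    JOIN species ON pets.species = species.species
    WHERE pets.age > species.median_max_age
    ORDER BY pets.id
""").fetchall()

for row in over_age:
    print(row)
```

Fluff (age 10) is not listed because the comparison is strict: 10 is not greater than the median max age of 10 for dogs.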

If we had an additional table with people, we could link pets to their owners

This allows new JOIN operations, depending on the information in the “PEOPLE” table, e.g.
- What is the most common pet species among teenagers who like apples?
- What is the mean age of cats owned by German speakers?
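The second question could be answered along the following lines. Everything here is hypothetical: the people, their languages, the extra pets, and the owner_id column are invented only to show how a link between the two tables makes the query possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical PEOPLE table; names and languages are invented.
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        language TEXT
    );
    -- Each pet points to its owner through the owner_id foreign key.
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        species TEXT,
        name TEXT,
        age INTEGER,
        owner_id INTEGER REFERENCES people(id)
    );
    INSERT INTO people VALUES
        (1, 'Anna', 'German'),
        (2, 'Bruno', 'Portuguese'),
        (3, 'Clara', 'German');
    INSERT INTO pets VALUES
        (1, 'Cat', 'Simba', 3, 1),
        (2, 'Cat', 'Mia', 7, 3),
        (3, 'Cat', 'Tom', 12, 2),
        (4, 'Dog', 'Brutus', 1, 1);
""")

# Mean age of cats owned by German speakers: join pets to owners,
# filter on both tables, then aggregate.
mean_age = conn.execute("""
    SELECT AVG(pets.age)
    FROM pets
    JOIN people ON pets.owner_id = people.id
    WHERE pets.species = 'Cat' AND people.language = 'German'
""").fetchone()[0]

print(mean_age)
```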

In linguistics, with all databases and catalogs, we could do complex queries such as:

Display a sorted list of the ten most frequent vowels (CLTS) in kinship terms (Lexibank/Concepticon) among non-Indo-European African languages (Glottolog) that don’t express pronominal subjects by clitics (WALS/Grambank) and with a speaking population between 1,000 and 10,000 (third-party database)


Glottolog
https://glottolog.org/

- Glottolog provides a catalogue of the world’s “languoids” (language families, languages, and dialects)
- It assigns a unique and stable identifier (the Glottocode) to (in principle) all such languoids
- Languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research
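As a rough sketch of how identifiers and a tree work together, the snippet below checks the shape of a Glottocode and walks a toy classification. It assumes that a Glottocode consists of four lowercase alphanumeric characters followed by four digits; the codes and the tiny tree are hypothetical, not real Glottolog entries.

```python
import re

# Assumed shape of a Glottocode: four lowercase alphanumerics + four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def is_glottocode(text):
    """Return True if the whole string has the shape of a Glottocode."""
    return GLOTTOCODE.fullmatch(text) is not None

# Hypothetical mini-tree: each languoid points to its parent (None = top level).
PARENT = {
    "dial1234": "lang1234",   # a dialect
    "lang1234": "fami1234",   # its language
    "fami1234": None,         # its family
}

def lineage(code):
    """List the ancestors of a languoid, from the top of the tree downwards."""
    path = []
    parent = PARENT.get(code)
    while parent is not None:
        path.append(parent)
        parent = PARENT.get(parent)
    return path[::-1]

print(is_glottocode("dial1234"))
print(is_glottocode("German"))
print(lineage("dial1234"))
```

Because every dataset can refer to the same stable code, the classification only has to be stored once, in the catalog.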

- “Doculects” are not necessarily languages
  - Think about reconstructions
- The genealogical classification is intended for navigation
  - It is by definition very conservative
- Other catalogues: Ethnologue, ISO-639
- FAIR data and Academic principles

As a Python library

Concepticon
https://concepticon.clld.org/

A resource for linking concept lists

- Concepticon links concept labels from different concept lists to concept sets.
- Each concept set is given a unique identifier, a unique label, and a human-readable definition.
- No point in discussing if it is a proper “ontology”: we use it as a normalized catalogue for linking otherwise “airtight” datasets
- It is being linked to these “proper” ontologies!
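The linking idea can be illustrated with a few lines of Python. The concept set identifiers, labels, and both concept lists below are hypothetical and only mimic the structure described above; they are not taken from Concepticon itself.

```python
# Hypothetical concept sets: identifier -> label.
CONCEPT_SETS = {
    101: "FLY (MOVE THROUGH AIR)",
    102: "FLY (INSECT)",
    103: "HAND",
}

# Two hypothetical concept lists, each mapping its own gloss to a concept set.
LIST_A = {"fly (verb)": 101, "fly (noun)": 102, "hand": 103}
LIST_B = {"to fly": 101, "the hand": 103}

def link(list_a, list_b):
    """Pair up glosses from two lists that point to the same concept set."""
    sets_b = {set_id: gloss for gloss, set_id in list_b.items()}
    return {
        CONCEPT_SETS[set_id]: (gloss_a, sets_b[set_id])
        for gloss_a, set_id in list_a.items()
        if set_id in sets_b
    }

linked = link(LIST_A, LIST_B)
for label, glosses in linked.items():
    print(label, glosses)
```

The glosses “fly (verb)” and “to fly” never match as strings, but they meet through the shared concept set, while “fly (noun)” stays unlinked because the second list has no entry for the insect.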

On-line at: https://digling.org/calc/concepticon/

CLTS
https://clts.clld.org/

“voiceless post-alveolar sibilant affricate consonant”

IPA:
- tʃ (two Unicode characters, U+0074 U+0283)
- ʧ (single Unicode character, U+02A7)
- With a tie bar on top, t͡ʃ (U+0074 U+0361 U+0283), or below, t͜ʃ (U+0074 U+035C U+0283)

Non-IPA: APA č, NAPA tᶴ, X-SAMPA tS, ASJP C

Dozens of orthographies
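To make the problem concrete, the sketch below lists the code points of the four IPA spellings and maps them to one target form. Choosing the plain two-character tʃ as the target is an assumption made for this illustration, and the lookup table is hand-written; it is not the CLTS data or API.

```python
# Target form chosen for this illustration only (an assumption).
CANONICAL = "t\u0283"

# Hand-written lookup: every IPA spelling of the affricate -> target form.
VARIANTS = {
    "t\u0283": CANONICAL,         # two characters
    "\u02a7": CANONICAL,          # single digraph character
    "t\u0361\u0283": CANONICAL,   # tie bar above
    "t\u035c\u0283": CANONICAL,   # tie bar below
}

def codepoints(text):
    """Spell out a string as a sequence of Unicode code points."""
    return " ".join(f"U+{ord(char):04X}" for char in text)

def normalize_affricate(text):
    """Map a known spelling to the target form; leave anything else alone."""
    return VARIANTS.get(text, text)

for variant in VARIANTS:
    print(codepoints(variant), "->", codepoints(normalize_affricate(variant)))
```

Without such a mapping, a simple string comparison would treat the four spellings as four different sounds.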

Orthographic Profiles

Deal with non-normal data (e.g., homoglyphs) and segments
- Latin a (U+0061), Cyrillic а (U+0430)
- Combining and pre-composed characters: é (U+0065 U+0301) and é (U+00E9)
- Labial click ʘ and Sun ☉

Segmentation looks trivial but is fundamental
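These look-alikes can be inspected with Python's standard unicodedata module. The sketch only shows how to tell the characters apart and how NFC normalization merges the combining and pre-composed forms of é; it is not part of any DLCE tool.

```python
import unicodedata

latin_a, cyrillic_a = "\u0061", "\u0430"
precomposed, combining = "\u00e9", "e\u0301"
click, sun = "\u0298", "\u2609"

def describe(text):
    """Return the official Unicode name of every character in a string."""
    return [unicodedata.name(char) for char in text]

# Homoglyphs: identical on screen, different characters for the computer.
print(latin_a == cyrillic_a, describe(latin_a), describe(cyrillic_a))

# Same letter, two encodings: one code point vs. base letter + accent.
print(precomposed == combining, describe(precomposed), describe(combining))

# Normalizing to NFC composes the sequence into the single character.
equal_after_nfc = unicodedata.normalize("NFC", combining) == precomposed
print(equal_after_nfc)

# The click letter and the astronomical symbol are unrelated characters.
print(describe(click), describe(sun))
```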

Value   Segments
b       b
baba    b ɐ b ɐ
C"eC    tʃʼ e tʃ
baz     b ɐ ?
jed     ? e d
ziz     ? ? ?
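The table can be reproduced with a small greedy, longest-match segmenter. The grapheme-to-segment mapping below is inferred from the rows of the table and is therefore an assumption, and the function is a simplified stand-in for what a profile-based tool does, not its actual implementation.

```python
# Profile inferred from the table above (an assumption): grapheme -> segment.
PROFILE = {
    "b": "b",
    "a": "ɐ",
    'C"': "tʃʼ",
    "C": "tʃ",
    "e": "e",
    "d": "d",
}

def segment(word, profile=PROFILE, missing="?"):
    """Split a word into segments, always trying the longest grapheme first."""
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    position = 0
    while position < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(word) - position), 0, -1):
            chunk = word[position:position + size]
            if chunk in profile:
                segments.append(profile[chunk])
                position += size
                break
        else:
            # No grapheme matched: flag the character and move on.
            segments.append(missing)
            position += 1
    return " ".join(segments)

for value in ["b", "baba", 'C"eC', "baz", "jed", "ziz"]:
    print(value, "->", segment(value))
```

Trying the longest grapheme first is what keeps C" together as one ejective segment instead of reading C and then failing on the quotation mark. The question marks show exactly which graphemes still need to be added to the profile.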

Exercises!

- Use Glottolog to identify the languages/dialects you work with, or your native one(s).
- Use Concepticon to find the right concept set for the following glosses:
  - “pineapple”
  - “fly (verb)”
  - “Brazilian Coral Snake”
  - “musical keyboard”

Orthographic profiles

We’ll experiment with profiles in the afternoon.

Software and data installed?

Go to your terminal, activate the environment, and type

$ cldfbench catinfo

- “command not found” (or equivalent) means that the software installation is incomplete
- no entries means you still need to install the catalogs (sorry, our fault!):

$ cldfbench catconfig

Thank you and see you later!
