Linguistic Catalogs and DLCE tools

Presentation at the RRDM-SHH Workshop

Tiago Tresoldi

January 30, 2020

Transcript

  1. Linguistic Catalogs & DLCE tools
    Tiago Tresoldi
    Jena, 30/01/2020

  2. Why catalogs?
    FAIR data are normalized data
    Primary keys, Foreign keys
    Crash-course on the relational model
    DLCE catalogs & tools
    Exercises

  3. Relational model

  5. Four pets:
    A young Yorkshire named “Brutus”
    An old Pitbull named “Fluff”
    A Manx cat named “Simba”
    An ultra-centenarian tortoise named “Darwin”
    Each one is a “tuple” or “record”

  6. Organizing in a table should be trivial
    Species   Breed      Name    Age
    Dog       Yorkshire  Brutus  1
    Dog       Pitbull    Fluff   10
    Cat       Manx       Simba   3
    Tortoise  -          Darwin  120

  7. An ID (“primary key”) makes sense
    ID  Species   Breed      Name    Age
    1   Dog       Yorkshire  Brutus  1
    2   Dog       Pitbull    Fluff   10
    3   Cat       Manx       Simba   3
    4   Tortoise  -          Darwin  120

  8. We can use the “Species” as a foreign key
    ID  Species   Name           Emoji
    1   Cat       F. catus
    2   Dog       C. familiaris
    3   Tortoise  C. nigra
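
The primary-key and foreign-key idea from the two tables above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names are assumptions, and the rejected "Parrot" row is a hypothetical example that is not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# "id" is the primary key; "species" is unique so other tables can point at it.
conn.execute(
    "CREATE TABLE species ("
    "id INTEGER PRIMARY KEY, species TEXT UNIQUE NOT NULL, name TEXT)"
)
# pets.species is a foreign key into species.species.
conn.execute(
    "CREATE TABLE pets ("
    "id INTEGER PRIMARY KEY, "
    "species TEXT NOT NULL REFERENCES species (species), "
    "breed TEXT, name TEXT, age INTEGER)"
)

conn.executemany(
    "INSERT INTO species VALUES (?, ?, ?)",
    [(1, "Cat", "F. catus"), (2, "Dog", "C. familiaris"), (3, "Tortoise", "C. nigra")],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Dog", "Yorkshire", "Brutus", 1),
        (2, "Dog", "Pitbull", "Fluff", 10),
        (3, "Cat", "Manx", "Simba", 3),
        (4, "Tortoise", None, "Darwin", 120),
    ],
)

# Hypothetical row: "Parrot" is not in the species table, so it is refused.
try:
    conn.execute("INSERT INTO pets VALUES (5, 'Parrot', NULL, 'Polly', 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("unknown species rejected:", rejected)
```

The point of the constraint is that a typo such as "Dgo" cannot enter the pets table unnoticed, because every species value must already exist in the species table.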

  9. Imagine there is an error in the data
    ID  Species   Breed      Name    Age
    1   Dog       Yorkshire  Brutus  1
    2   Dog       Pitbull    Fluff   10
    3   Cat       Manx       Simba   33
    4   Tortoise  -          Darwin  120

  10. We could extend one of the tables…
    ID  Species   Name           Emoji  Median max age
    1   Cat       F. catus              14
    2   Dog       C. familiaris         10
    3   Tortoise  C. nigra              100

  11. … and get a new table (a “JOIN” operation) with
    animals over the median max age
    ID  Species   Breed      Name    Age
    3   Cat       Manx       Simba   33
    4   Tortoise  -          Darwin  120
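
A minimal sketch of this JOIN, again using Python's built-in sqlite3 module. The data are taken from the slides; the table and column names (for instance median_max_age) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        species TEXT UNIQUE,
        name TEXT,
        median_max_age INTEGER
    );
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        species TEXT REFERENCES species (species),
        breed TEXT,
        name TEXT,
        age INTEGER
    );
    INSERT INTO species VALUES
        (1, 'Cat', 'F. catus', 14),
        (2, 'Dog', 'C. familiaris', 10),
        (3, 'Tortoise', 'C. nigra', 100);
    INSERT INTO pets VALUES
        (1, 'Dog', 'Yorkshire', 'Brutus', 1),
        (2, 'Dog', 'Pitbull', 'Fluff', 10),
        (3, 'Cat', 'Manx', 'Simba', 33),
        (4, 'Tortoise', NULL, 'Darwin', 120);
    """
)

# Join each pet to its species row, then keep pets older than the
# median maximum age recorded for that species.
rows = conn.execute(
    """
    SELECT pets.id, pets.species, pets.breed, pets.name, pets.age
    FROM pets
    JOIN species ON pets.species = species.species
    WHERE pets.age > species.median_max_age
    ORDER BY pets.id
    """
).fetchall()

for row in rows:
    print(row)
```

Fluff is not returned because an age of 10 is equal to, not over, the value stored for dogs. Simba's suspicious age of 33 is flagged without anyone having to inspect the pets table by hand.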

  12. If we had an additional table with people, we could link pets to their owners
    This allows new JOIN operations, depending on the information in the “PEOPLE” table, e.g.
    What is the most common pet species among teenagers who like apples?
    What is the mean age of cats owned by German speakers?

  13. In linguistics, with all databases and catalogs, we could do complex queries such as:
    Display a sorted list of the ten most frequent vowels (CLTS) in kinship terms (Lexibank/Concepticon) among non-Indo-European African languages (Glottolog) that don’t express pronominal subjects by clitics (WALS/Grambank) and with a speaking population between 1,000 and 10,000 (third-party database)

  15. Glottolog
    https://glottolog.org/

  17. Glottolog provides a catalogue of the world’s “languoids” (language families, languages, and dialects)
    It assigns a unique and stable identifier (the Glottocode) to (in principle) all such languoids
    Languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research
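
Because Glottocodes act as keys, it can help to check their shape before linking data to the catalogue. The sketch below assumes that a Glottocode consists of four lowercase letters or digits followed by four digits; the function name is made up for this example, and the check says nothing about whether a code actually exists in Glottolog.

```python
import re

# Assumed shape: four lowercase alphanumeric characters, then four digits.
GLOTTOCODE_SHAPE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def looks_like_glottocode(value: str) -> bool:
    """Return True if the string has the assumed Glottocode shape."""
    return GLOTTOCODE_SHAPE.fullmatch(value) is not None


for candidate in ["stan1293", "English", "stan129", "STAN1293"]:
    print(candidate, looks_like_glottocode(candidate))
```

A shape check only catches obvious slips, such as a language name entered where an identifier was expected. Confirming that an identifier is valid still requires a lookup in the catalogue itself.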

  18. “Doculects” are not necessarily languages
    Think about reconstructions
    The genealogical classification is intended for navigation
    It is by definition very conservative
    Other catalogues: Ethnologue, ISO 639
    FAIR data and academic principles

  19. As a Python library

  21. Concepticon
    https://concepticon.clld.org/

  23. A resource for linking concept lists
    Concepticon links concept labels from different concept lists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition.
    No point in discussing whether it is a proper “ontology”: we use it as a normalized catalogue for linking otherwise “airtight” datasets
    It is being linked to these “proper” ontologies!

  26. On-line at: https://digling.org/calc/concepticon/

  27. CLTS
    https://clts.clld.org/

  29. “voiceless post-alveolar sibilant affricate consonant”
    IPA:
    tʃ (two Unicode characters, U+0074 U+0283)
    ʧ (single Unicode character, U+02A7)
    With bar on top t͡ʃ (U+0074 U+0361 U+0283), or below t͜ʃ (U+0074 U+035C U+0283)
    Non-IPA: APA č, NAPA tᶴ, X-SAMPA ts\, ASJP C
    Dozens of orthographies
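
The four IPA spellings listed above look alike on screen but are different strings to a computer. A small sketch using only the Python standard library prints the code points of each variant; the dictionary labels are made up for this example.

```python
import unicodedata

# The four IPA spellings from the slide, written as explicit code points.
variants = {
    "two characters": "\u0074\u0283",
    "single ligature": "\u02A7",
    "tie bar above": "\u0074\u0361\u0283",
    "tie bar below": "\u0074\u035C\u0283",
}

for label, text in variants.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    names = ", ".join(unicodedata.name(ch, "?") for ch in text)
    print(f"{label}: {text} [{codepoints}] {names}")

# None of the spellings compare equal, so a naive string comparison
# would treat one sound as four different ones.
distinct = len(set(variants.values()))
print("distinct strings:", distinct)
```

This is why a catalogue of transcription systems is useful: it lets all these spellings be mapped to one shared description of the sound.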

  30. Orthographic Profiles
    Deal with non-normalized data (e.g., homoglyphs) and segments
    Latin a (U+0061), Cyrillic а (U+0430)
    Combining and pre-composed characters: é (U+0065 U+0301) and é (U+00E9)
    Labial click ʘ (U+0298) and Sun ☉ (U+2609)
    Segmentation looks trivial but is fundamental
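
The two problems above behave differently, which a short sketch with Python's standard unicodedata module can show. Unicode normalization reconciles combining and pre-composed characters, but it does not merge homoglyphs from different scripts.

```python
import unicodedata

# Combining sequence (e + combining acute accent) vs. pre-composed letter.
combining = "\u0065\u0301"
precomposed = "\u00E9"

same_raw = combining == precomposed
same_nfc = unicodedata.normalize("NFC", combining) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == combining
print("equal without normalization:", same_raw)
print("equal after NFC:", same_nfc)
print("equal after NFD:", same_nfd)

# Homoglyphs: Latin a and Cyrillic а look identical but stay different,
# even after compatibility normalization.
latin_a = "\u0061"
cyrillic_a = "\u0430"
homoglyphs_merged = unicodedata.normalize("NFKC", cyrillic_a) == latin_a
print("homoglyphs merged by NFKC:", homoglyphs_merged)
```

Homoglyphs therefore need an explicit mapping, which is one of the jobs an orthographic profile can take on.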

  33. Value Segments
      b     b
      baba  b ɐ b ɐ
      C"eC  tʃʼ e tʃ
      baz   b ɐ ?
      jed   ? e d
      ziz   ? ? ?
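
The table above can be reproduced with a greedy longest-match segmenter. This is a sketch, not the implementation used by any DLCE tool: the profile is inferred from the table rows, and the function name and the "?" marker for unknown graphemes are assumptions made for the example.

```python
# Hypothetical profile inferred from the table: grapheme -> segment.
PROFILE = {
    "b": "b",
    "a": "ɐ",
    'C"': "tʃʼ",
    "C": "tʃ",
    "e": "e",
    "d": "d",
}


def segment(value: str, profile: dict, missing: str = "?") -> str:
    """Split a value into segments, preferring the longest matching grapheme."""
    # Try longer graphemes first, so that 'C"' wins over plain 'C'.
    graphemes = sorted(profile, key=len, reverse=True)
    segments = []
    position = 0
    while position < len(value):
        for grapheme in graphemes:
            if value.startswith(grapheme, position):
                segments.append(profile[grapheme])
                position += len(grapheme)
                break
        else:
            # No grapheme in the profile matches here: mark it and move on.
            segments.append(missing)
            position += 1
    return " ".join(segments)


for value in ["b", "baba", 'C"eC', "baz", "jed", "ziz"]:
    print(value, "->", segment(value, PROFILE))
```

The question marks show where the profile is incomplete, so each run also works as a checklist of graphemes that still need an entry.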

  34. Exercises!
    Use Glottolog to identify the languages/dialects you work with, or your native one(s).
    Use Concepticon to find the right concept set for the following glosses:
    “pineapple”
    “fly (verb)”
    “Brazilian Coral Snake”
    “musical keyboard”

  35. Orthographic profiles
    We’ll experiment with profiles in the afternoon.

  36. Software and data installed?
    Go to your terminal, activate the environment, and type
    $ cldfbench catinfo
    “command not found” (or equivalent) means that software installation is incomplete
    “no entries” means you still need to install the catalogs (sorry, our fault!):
    $ cldfbench catconfig

  37. Thank you and see you later!
    [email protected]