Annotation and Analysis of Cross-Linguistic Lexical Data in Historical Linguistics

Talk held at the Linguistic Annotation and Philology workshop (2017/07/06-07, Leipzig University)

Johann-Mattis List

July 07, 2017

Transcript

  1. Annotation and Analysis of Cross-Linguistic Lexical
    Data in Historical Linguistics
    Toward the Establishment of Standards and Best Practices
    Johann-Mattis List
    Research Group “Computer-Assisted Language Comparison”
    Department of Linguistic and Cultural Evolution
    Max-Planck Institute for the Science of Human History
    Jena, Germany
    2017/07/07
    1 / 36

  2. Background
    Background
    data
    2 / 36

  7. Background Data
    Data in Linguistics
    Linguistics is a data-driven discipline:
    grammars
    dictionaries
    texts
    etc.
    3 / 36

  12. Background Data
    Data in Comparative Linguistics
    Comparative linguistics is also a data-driven discipline:
    historical grammars
    etymological dictionaries
    annotated texts
    etc.
    4 / 36

  16. Background Annotation
    Annotation in Linguistics
    Annotation of data is crucial in linguistics:
    interlinear glossed text
    “glossing” a word’s meaning in the literature
    etc.
    5 / 36

  20. Background Annotation
    Annotation in Comparative Linguistics
    Annotation of data is also crucial in comparative linguistics:
    indicating that words are cognate
    using phonetic transcriptions to indicate how words sound
    etc.
    6 / 36

  23. Background Analysis
    Analysis in Linguistics
    Analysis of data is crucial in linguistics:
    syntactic analysis in any syntactic framework (ideally with fewer than
    three examples and many asterisks)
    etc.
    7 / 36

  28. Background Analysis
    Analysis in Comparative Linguistics
    Analysis of data is also crucial in comparative linguistics:
    alignment analysis (show where words are cognate, i.e., in which
    segments they coincide)
    correspondence pattern analysis (show the most frequently
    corresponding sounds for a number of related languages)
    etymological analysis (show how the original meaning of a word
    evolved, etc.)
    etc.
    8 / 36
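
As a rough illustration of what an alignment analysis makes explicit, the following sketch aligns two segmented word forms with a plain Needleman-Wunsch procedure. The function name `align`, the scoring values, and the example segmentations are assumptions made for this illustration only; dedicated tools for historical linguistics use considerably more refined scoring schemes.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two lists of sound segments (Needleman-Wunsch)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # pair two segments
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


# Hypothetical segmentations of two related words, for demonstration only.
word_a = ["t", "ɔ", "x", "t", "ə", "r"]
word_b = ["d", "ɔː", "t", "ə"]
for line in align(word_a, word_b):
    print(" ".join(line))
```

Storing the two aligned rows next to the cognate judgment is what allows a reader to see where the words are claimed to correspond, rather than only that they are claimed to be related.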

  29. Problems
    Problems
    !
    9 / 36

  30. Problems Availability
    Availability
    Linguists all over the world publish papers without sharing their data and
    analyses. SO SAD for science! #MakeLinguisticsGreatAgain #FakePubs
    10 / 36

  34. Problems Availability
    Availability
    Sharing is not yet considered worthwhile in linguistic research:
    journals still accept too many papers that do not share the full
    material and code used for their analyses
    if authors share, they often make it very difficult to reuse the data
    for testing by converting data that was originally in tabular form into
    PDF, which at times may not even be searchable, etc.
    even if authors share, they often forget to share their explicit
    annotations and analyses, thus leaving readers alone with the
    raw data
    11 / 36

  35. Problems Transparency
    Transparency
    Taken from a blog post by Bengtson (2017) at http://euskararenjatorria.net/?p=26071
    12 / 36

  39. Problems Transparency
    Transparency
    Annotations and analyses often lack transparency and cannot be
    understood in the raw form in which they are provided:
    words are declared cognate, but nobody really knows WHERE
    the words are supposed to be cognate
    correspondences are labeled as “regular” without sufficient proof
    for this, either in the texts or in the data
    sources are omitted, and readers are left alone with a bunch of word
    lists whose exact origin nobody knows
    13 / 36

  40. Problems Comparability
    Comparability
    [Figure: comparison of eight reconstruction systems, Karlgren (1954),
    Li (1971), Wáng (1980), Zhèngzhāng (2003), Pān (2000), Starostin (1989),
    Baxter and Sagart (2014), and Schuessler (2007), across a series of
    Chinese characters. Legend: ? a e i o u æ ɑ ɔ ə ɯ ʊ]
    Different reconstructions for Old Chinese, taken from List et al. (2017)
    14 / 36

  44. Problems Comparability
    Comparability
    It is extremely hard to compare different opinions in comparative
    linguistics:
    data sources are insufficiently reported
    computational studies are based on different subsets of the data
    analyses exclude or focus on specific parts of a given dataset
    15 / 36

  46. Problems Summary
    Summary
    Although linguistics, and especially comparative linguistics, is
    an extremely data-driven discipline, we often still behave as
    if it were all about prose and discourse. At a time when digital
    applications make the lives of scientists a lot easier, linguists
    need to find ways to transfer their analyses into the new age.
    This can only be done if we drastically increase the availability,
    transparency, and comparability of data in linguistics.
    16 / 36

  47. Cross-Linguistic Data Formats
    Cross-Linguistic Data Formats
    17 / 36

  51. Cross-Linguistic Data Formats General Ideas
    General Ideas
    The Cross-Linguistic Data Formats initiative (Forkel et al. 2016,
    http://cldf.clld.org) aims at increasing the comparability of
    cross-linguistic data and analyses. It comes with:
    standardization efforts (linguistic meta-databases like Glottolog
    and Concepticon),
    software APIs which help to test whether data conforms to
    standards, and
    working examples of best practice.
    18 / 36

  52. Cross-Linguistic Data Formats General Ideas
    General Ideas
    So far, a few software tools (LingPy, BEASTling, EDICTOR) support
    CLDF to some degree. We hope that the number of users will grow and
    that the community will help to develop the formats further.
    19 / 36

  57. Cross-Linguistic Data Formats Technical Aspects
    Technical Aspects
    See http://github.com/glottobank/cldf for details,
    discussions, and working examples.
    Format for machine-readable specification is CSV with metadata in
    JSON, following the W3C’s Model for Tabular Data and Metadata
    on the Web
    (http://www.w3.org/TR/tabular-data-model/).
    CLDF ontology builds and expands upon the General Ontology for
    Linguistic Description (GOLD).
    A pycldf API in Python is currently in preparation and will help users
    to evaluate whether their data conforms to CLDF.
    20 / 36
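
To make the idea of "CSV with metadata in JSON" more concrete, here is a minimal sketch that uses only Python's standard library. It writes a tiny word list to a CSV file and places a JSON description of its columns next to it. The file names, the column selection, the placeholder identifiers, and the example rows are assumptions for illustration; the metadata keys (`url`, `tableSchema`, `columns`, `name`, `datatype`, `primaryKey`) are taken from the W3C tabular metadata vocabulary, but the result is not a complete or validated CLDF dataset.

```python
import csv
import json
import tempfile
from pathlib import Path

COLUMNS = ["ID", "Language_ID", "Parameter_ID", "Value"]
# Placeholder rows: a real dataset would use Glottolog and Concepticon
# identifiers in Language_ID and Parameter_ID.
ROWS = [
    {"ID": "1", "Language_ID": "lang-a", "Parameter_ID": "concept-hand", "Value": "hant"},
    {"ID": "2", "Language_ID": "lang-b", "Parameter_ID": "concept-hand", "Value": "hænd"},
]


def write_dataset(directory):
    """Write the CSV table and a JSON metadata file describing it."""
    directory = Path(directory)
    csv_path = directory / "forms.csv"
    with csv_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(ROWS)
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_path.name,
        "tableSchema": {
            "columns": [{"name": name, "datatype": "string"} for name in COLUMNS],
            "primaryKey": "ID",
        },
    }
    meta_path = directory / "forms.csv-metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return csv_path, meta_path


def check_header(csv_path, meta_path):
    """Return True if the CSV header matches the columns named in the metadata."""
    metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    expected = [column["name"] for column in metadata["tableSchema"]["columns"]]
    with csv_path.open(encoding="utf-8", newline="") as handle:
        header = next(csv.reader(handle))
    return header == expected


with tempfile.TemporaryDirectory() as tmp:
    csv_file, meta_file = write_dataset(tmp)
    HEADER_OK = check_header(csv_file, meta_file)
    print(HEADER_OK)
```

Keeping the description of the table in a separate JSON file means the data itself stays a plain CSV file that opens in any spreadsheet program, while software can still check it against the declared structure.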

  62. Cross-Linguistic Data Formats Technical Aspects
    The Zen of CLDF (following Robert Forkel)
    A lot of the guidelines put forward for Python code in The Zen of Python
    (https://www.python.org/dev/peps/pep-0020/) can also be
    used to characterize CLDF. In particular:
    Explicit is better than implicit.
    Readability counts.
    Errors should never pass silently (Unless explicitly silenced).
    Simple is better than complex.
    In the face of ambiguity, refuse the temptation to guess.
    21 / 36

  63. Cross-Linguistic Data Formats Standards
    Standards
    "The language, the meaning, and
    the form of the linguistic sign are
    the triple pillars upon which CLDF
    rests. Together, they will restore
    comparability of linguistic data in
    the world."
    — Tommen Baratheon
    (King of the Iron Throne, † in 6x10)
    22 / 36

  66. Cross-Linguistic Data Formats Standards
    Standards
    use Glottolog (Hammarström et al. 2017) to refer to language
    varieties
    use Concepticon (List et al. 2016) to refer to concepts
    develop a cross-linguistic phonetic notation system to evaluate
    whether phonetic transcriptions for word forms conform to
    cross-linguistic standards
    23 / 36

  67. Cross-Linguistic Data Formats Standards
    Standards: Concepticon
    Concepticon (List et al. 2016)
    link concept labels in published concept lists
    (questionnaires) to concept sets
    link concept sets to meta-data
    define relations between concept sets
    never link one concept in a given list to more than one
    concept set (guarantees consistency)
    provide an API to check the consistency of the data and
    to query the data
    provide a web-interface to browse through the data
    24 / 36
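
The rule that a concept in a given list is never linked to more than one concept set lends itself to a mechanical check. The sketch below is one possible way to express it; the function name `find_conflicts`, the list identifiers, and the sample links are hypothetical and do not reflect the actual Concepticon API or data.

```python
from collections import defaultdict


def find_conflicts(links):
    """Return concept labels that are linked to more than one concept set.

    `links` is an iterable of (list_id, label, concept_set) triples.
    """
    seen = defaultdict(set)
    for list_id, label, concept_set in links:
        seen[(list_id, label)].add(concept_set)
    return {key: sorted(sets) for key, sets in seen.items() if len(sets) > 1}


# Illustrative links only; the last entry is deliberately inconsistent.
LINKS = [
    ("list-1", "hand", "HAND"),
    ("list-1", "arm/hand", "ARM OR HAND"),
    ("list-2", "hand", "HAND"),
    ("list-2", "hand", "ARM"),
]
CONFLICTS = find_conflicts(LINKS)
print(CONFLICTS)  # → {('list-2', 'hand'): ['ARM', 'HAND']}
```

Note that the same label may appear in different lists with different links; only a double link within one list counts as an inconsistency here.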

  68. Cross-Linguistic Data Formats Standards
    http://concepticon.clld.org
    25 / 36

  70. Cross-Linguistic Data Formats Standards
    Standards: Cross-Linguistic Phonetic Notations
    There are too many IPAs!
    26 / 36

  71. Cross-Linguistic Data Formats Standards
    Standards: Cross-Linguistic Phonetic Notations
    Cross-Linguistic Phonetic Notations (in prep.)
    normalize ambiguities of IPA
    establish a pool of phonetic segments which are linked
    to other datasets (PHOIBLE, Moran et al. 2014; PBase,
    Mielke 2008; etc.)
    provide or link to feature systems or additional kinds of
    metadata
    provide generic transformation tables for frequently
    used phonetic transcription systems
    provide an API to check the consistency of a given
    transcription
    27 / 36
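
One source of the "too many IPAs" problem is that visually similar characters are encoded differently. The following sketch shows how such ambiguities could be normalized and how a transcription could be checked against a pool of known segments. The lookalike table, the function names, and the tiny inventory are assumptions for illustration; they are not the CLPN implementation, which was still in preparation at the time of the talk.

```python
import unicodedata

# Characters that are often typed in place of their IPA counterparts
# (an illustrative selection, not an exhaustive table).
LOOKALIKES = {
    "g": "\u0261",  # ASCII g -> IPA script g
    ":": "\u02d0",  # colon -> IPA length mark
    "'": "\u02c8",  # apostrophe -> IPA primary stress mark
}


def normalize(transcription):
    """Decompose characters (NFD) and replace common lookalikes."""
    decomposed = unicodedata.normalize("NFD", transcription)
    return "".join(LOOKALIKES.get(char, char) for char in decomposed)


def unknown_segments(segments, inventory):
    """Return the normalized segments that are not part of the inventory."""
    return [seg for seg in map(normalize, segments) if seg not in inventory]


# A hypothetical pool of accepted segments.
INVENTORY = {"\u0261", "a", "a\u02d0", "t"}

print(normalize("ga:") == "\u0261a\u02d0")
print(unknown_segments(["g", "a:", "q"], INVENTORY))
```

Running the normalization before the inventory check means that a form typed with an ASCII "g" or a colon is accepted, while a segment that is genuinely missing from the pool is still reported.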

  72. Cross-Linguistic Data Formats Standards
    GLD to CLPN
    https://github.com/LinguList/Hmong-Mien_Language_family/
    28 / 36

  73. Word Lists in CLDF
    WORDLIST
    DATA
    Word Lists in CLDF
    29 / 36

  79. Word Lists in CLDF Basic Aspects
    Basic Aspects
    ID (unique identifier for each word)
    Language and Language_ID (Glottolog)
    Concept and Parameter_ID (Concepticon)
    Value, Form, and Segments (CLPN)
    Source (key of BibTeX file)
    Comment (free text)
    30 / 36
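
The columns listed above can be checked with a few lines of code. The sketch below is a hypothetical validator, not the pycldf API: the choice of required columns, the assumed shape of a Glottocode (four alphanumeric characters followed by four digits), and the sample rows with their made-up identifiers are all assumptions for illustration.

```python
import re

REQUIRED = ["ID", "Language_ID", "Parameter_ID", "Value", "Form", "Segments"]
# Assumed shape of a Glottocode; the check is purely syntactic and does not
# look the code up in Glottolog.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def validate(rows):
    """Return a list of human-readable problems found in word list rows."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED:
            if not row.get(column):
                problems.append(f"row {number}: missing {column}")
        identifier = row.get("ID")
        if identifier:
            if identifier in seen_ids:
                problems.append(f"row {number}: duplicate ID {identifier}")
            seen_ids.add(identifier)
        code = row.get("Language_ID", "")
        if code and not GLOTTOCODE.match(code):
            problems.append(f"row {number}: malformed Language_ID {code}")
    return problems


# Made-up rows; the second one contains three deliberate errors.
ROWS = [
    {"ID": "1", "Language": "Language A", "Language_ID": "abcd1234",
     "Concept": "hand", "Parameter_ID": "1", "Value": "hant",
     "Form": "hant", "Segments": "h a n t", "Source": "Example2017",
     "Comment": ""},
    {"ID": "1", "Language": "Language B", "Language_ID": "xyz",
     "Concept": "hand", "Parameter_ID": "1", "Value": "hand",
     "Form": "hand", "Segments": "", "Source": "Example2017",
     "Comment": "segments still to be added"},
]
PROBLEMS = validate(ROWS)
for problem in PROBLEMS:
    print(problem)
```

Reporting every problem instead of stopping at the first one is in line with the principle quoted earlier in the talk that errors should never pass silently.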

  80. Word Lists in CLDF Etymological Annotation
    Etymological Annotation
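    [Diagram of the annotation fields of a word list. Group labels:
    LANGUAGE, MEANING, FORM, RELATION. Field labels: ID, Language, Concept,
    Value, Form, Segments, Morphemes, Slice, Cognate Set, Alignment.
    Legend: A requires B; important; language-internal; language-external.]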
    31 / 36

  87. Word Lists in CLDF Examples
    Examples: The Etymological Dictionary Editor
    web-based tool for the annotation of etymological data
    available online at http://edictor.digling.org
    description available in List (2017)
    works client-side and offline
    allows for morpheme annotation, alignment analysis, cognate
    annotation, and phonological analysis
    serves as a litmus test for the expressive power of wordlists in CLDF
    32 / 36

  88. Word Lists in CLDF Examples
    http://edictor.digling.org
    33 / 36

  89. Outlook
    Outlook
    34 / 36

  93. Thank you for your attention!
    36 / 36
