Annotation and Analysis of Cross-Linguistic Lexical Data in Historical Linguistics

Talk, held at the Linguistic Annotation and Philology workshop (2017/07/06-07, Leipzig University)

Johann-Mattis List

July 07, 2017

Transcript

  1. Annotation and Analysis of Cross-Linguistic Lexical Data in Historical Linguistics

    Toward the Establishment of Standards and Best Practices. Johann-Mattis List, Research Group “Computer-Assisted Language Comparison”, Department of Linguistic and Cultural Evolution, Max-Planck Institute for the Science of Human History, Jena, Germany, 2017/07/07
  2. Background

  7. Data in Linguistics: Linguistics is a data-driven discipline:

    grammars, dictionaries, texts, etc.
  12. Data in Comparative Linguistics: Comparative linguistics is also a data-driven discipline:

    historical grammars, etymological dictionaries, annotated texts, etc.
  16. Annotation in Linguistics: Annotation of data is crucial in linguistics:

    interlinear glossed text, “glossing” a word's meaning in the literature, etc.
  20. Annotation in Comparative Linguistics: Annotation of data is also crucial in comparative linguistics:

    indicating that words are cognate, using phonetic transcriptions to indicate how words sound, etc.
  23. Analysis in Linguistics: Analysis of data is crucial in linguistics:

    syntactic analysis in any syntactic framework (ideally with fewer than three examples and many asterisks), etc.
  28. Analysis in Comparative Linguistics: Analysis of data is also crucial in comparative linguistics:

    alignment analysis (show where words are cognate, i.e., in which segments they coincide); correspondence pattern analysis (show the most frequently corresponding sounds for a number of related languages); etymological analysis (show how the original meaning of a word evolved, etc.); etc.
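    The two analyses named on this slide can be sketched computationally: once cognate words are aligned, counting which segments share a column gives a first, rough picture of correspondence patterns. The alignments below are toy data invented for illustration (not taken from the talk), `-` marks a gap, and the function name `count_correspondences` is hypothetical.

```python
from collections import Counter


def count_correspondences(alignments):
    """Count how often each pair of segments occupies the same alignment column."""
    counts = Counter()
    for word_a, word_b in alignments:
        if len(word_a) != len(word_b):
            # Aligned words must have one segment (or gap) per column.
            raise ValueError("aligned words must have the same number of columns")
        counts.update(zip(word_a, word_b))
    return counts


# Toy alignments (invented): each word is a list of segments, "-" is a gap.
alignments = [
    (["t", "ɔ", "x", "t", "ə", "r"], ["d", "ɔː", "-", "t", "ə", "r"]),
    (["t", "aː", "t"], ["d", "iː", "d"]),
]

counts = count_correspondences(alignments)
for (seg_a, seg_b), frequency in counts.most_common(3):
    print(seg_a, seg_b, frequency)
```

    A real correspondence pattern analysis would also take the position in the word and more than two languages into account; the sketch only shows why explicit alignments make such counts possible at all.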
  29. Problems

  30. Availability: Linguists all over the world publish papers without sharing their data and analyses.

    SO SAD for science! #MakeLinguisticsGreatAgain #FakePubs
  34. Availability: Sharing is not yet considered worthwhile in linguistic research:

    journals still accept too many papers that do not share the full material and code used for their analyses; when authors do share, they often make the data very difficult to reuse for testing, for example by converting originally tabular data into PDF, which at times is not even searchable; and even when authors share, they often forget to include their explicit annotations and analyses, leaving readers alone with the raw data.
  35. Transparency: [Figure taken from a blog post by Bengtson (2017)

    at http://euskararenjatorria.net/?p=26071]
  39. Transparency: Annotations and analyses often lack transparency and cannot be understood in the raw form in which they are provided:

    words are declared cognate, but nobody really knows WHERE the words are supposed to be cognate; correspondences are labeled as “regular” without sufficient proof in either the texts or the data; sources are omitted, and readers are left alone with a bunch of word lists whose exact origin nobody knows.
  40. Comparability: [Figure: different reconstructions for Old Chinese, taken from List et al. (2017)]

    Reconstructions compared: Karlgren (1954), Li (1971), Wáng (1980), Zhèngzhāng (2003), Pān (2000), Starostin (1989), Baxter and Sagart (2014), Schuessler (2007). Legend: ? a e i o u æ ɑ ɔ ə ɯ ʊ
  44. Comparability: It is extremely hard to compare different opinions in comparative linguistics:

    data sources are insufficiently reported; computational studies are based on different subsets of the data; analyses exclude or focus on specific parts of a given dataset.
  46. Summary: Although linguistics, and especially comparative linguistics, is an extremely data-driven discipline,

    we often still behave as if it were all about prose and discourse. In times when digital applications make the lives of scientists a lot easier, linguists need to find ways to transfer their analyses into the new age. This can only be done if we drastically increase the availability, transparency, and comparability of data in linguistics.
  47. Cross-Linguistic Data Formats

  51. General Ideas: The Cross-Linguistic Data Formats initiative (Forkel et al. 2016, http://cldf.clld.org)

    aims at increasing the comparability of cross-linguistic data and analyses. It comes with: standardization efforts (linguistic meta-databases like Glottolog and Concepticon), software APIs which help to test whether data conforms to standards, and working examples for best practice.
  52. General Ideas: As of now, a couple of software tools (LingPy, Beastling, EDICTOR) support CLDF to some degree.

    In the future, we hope that the number of users will increase and that the community will help to develop the formats further.
  57. Technical Aspects: See http://github.com/glottobank/cldf for details, discussions, and working examples.

    The format for the machine-readable specification is CSV with metadata in JSON, following the W3C’s Model for Tabular Data and Metadata on the Web (http://www.w3.org/TR/tabular-data-model/). The CLDF ontology builds and expands upon the General Ontology for Linguistic Description (GOLD). A pycldf API in Python is currently in preparation and will help users to evaluate whether their data conforms to CLDF.
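    The idea of “CSV with metadata in JSON” can be illustrated with a minimal sketch that uses only the Python standard library. The file name `forms.csv`, the language identifiers, and the rows are assumptions made up for illustration; the metadata keys (`url`, `tableSchema`, `columns`, `name`, `datatype`, `required`, `primaryKey`) follow the W3C tabular metadata vocabulary, but the result is not meant as a complete or validated CLDF dataset.

```python
import csv
import io
import json

# Hypothetical table description using W3C "CSV on the Web" metadata keys.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "forms.csv",
    "tableSchema": {
        "columns": [
            {"name": "ID", "datatype": "string", "required": True},
            {"name": "Language_ID", "datatype": "string"},
            {"name": "Parameter_ID", "datatype": "string"},
            {"name": "Value", "datatype": "string"},
        ],
        "primaryKey": "ID",
    },
}

# Invented example rows; the identifiers are placeholders.
rows = [
    {"ID": "1", "Language_ID": "lang-a", "Parameter_ID": "hand", "Value": "hant"},
    {"ID": "2", "Language_ID": "lang-b", "Parameter_ID": "hand", "Value": "hand"},
]


def to_csv(rows, metadata):
    """Serialize rows as CSV text, taking the header from the metadata."""
    header = [column["name"] for column in metadata["tableSchema"]["columns"]]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Both parts are kept as strings here; a real dataset would write them to
# forms.csv and to a JSON metadata file next to it.
csv_text = to_csv(rows, metadata)
metadata_text = json.dumps(metadata, indent=2)
print(csv_text)
```

    Keeping the column description outside the CSV file means the data stay readable in any spreadsheet tool, while software can still learn what each column is supposed to contain.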
  62. The Zen of CLDF (following Robert Forkel): A lot of the guidelines put forward for Python code in The Zen of Python (https://www.python.org/dev/peps/pep-0020/) can also be used to characterize CLDF.

    In particular: Explicit is better than implicit. Readability counts. Errors should never pass silently (unless explicitly silenced). Simple is better than complex. In the face of ambiguity, refuse the temptation to guess.
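    Applied to data checking, “errors should never pass silently” and “refuse the temptation to guess” could look like the sketch below: the check stops at the first problem instead of repairing or skipping the row. The function `check_rows` and the list of required columns are hypothetical and do not reflect the pycldf API.

```python
REQUIRED = ("ID", "Language_ID", "Parameter_ID", "Value")


def check_rows(rows, required=REQUIRED):
    """Raise ValueError on the first problem instead of silently repairing it."""
    seen = set()
    for number, row in enumerate(rows, start=1):
        for column in required:
            if not row.get(column):
                # An empty cell is reported, never filled with a guess.
                raise ValueError(f"row {number}: missing value for {column!r}")
        if row["ID"] in seen:
            raise ValueError(f"row {number}: duplicate ID {row['ID']!r}")
        seen.add(row["ID"])
    return len(seen)


# Invented rows: the second one lacks a language identifier.
good = [
    {"ID": "1", "Language_ID": "lang-a", "Parameter_ID": "hand", "Value": "hant"},
    {"ID": "2", "Language_ID": "lang-b", "Parameter_ID": "hand", "Value": "hand"},
]
bad = good[:1] + [{"ID": "2", "Language_ID": "", "Parameter_ID": "hand", "Value": "hand"}]

print(check_rows(good))
try:
    check_rows(bad)
except ValueError as error:
    print("rejected:", error)
```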
  63. Standards: "The language, the meaning, and the form of the linguistic sign are the triple pillars upon which CLDF rests.

    Together, they will restore comparability of linguistic data in the world." (Tommen Baratheon, King of the Iron Throne, † in 6x10)
  66. Standards: use Glottolog (Hammarström et al. 2017) to refer to language varieties;

    use Concepticon (List et al. 2016) to refer to concepts; develop a cross-linguistic phonetic notation system to evaluate whether phonetic transcriptions for word forms conform to cross-linguistic standards.
  67. Standards: Concepticon (List et al. 2016):

    link concept labels in published concept lists (questionnaires) to concept sets; link concept sets to metadata; define relations between concept sets; never link one concept in a given list to more than one concept set (guarantees consistency); provide an API to check the consistency of the data and to query the data; provide a web interface to browse through the data.
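    The consistency rule on this slide (one concept in a given list is never linked to more than one concept set) is easy to check mechanically. The following is a minimal sketch with invented links; the function name `find_inconsistent_links` and the list and concept set names are hypothetical and not part of the actual Concepticon API.

```python
from collections import defaultdict


def find_inconsistent_links(links):
    """Return concept labels that are linked to more than one concept set."""
    sets_per_label = defaultdict(set)
    for list_id, label, concept_set in links:
        sets_per_label[(list_id, label)].add(concept_set)
    return {
        key: sorted(value)
        for key, value in sets_per_label.items()
        if len(value) > 1
    }


# Invented links: (concept list, label in that list, concept set).
links = [
    ("List-A", "hand", "HAND"),
    ("List-A", "arm/hand", "ARM OR HAND"),
    ("List-B", "hand", "HAND"),
    ("List-B", "hand", "ARM OR HAND"),  # violates the one-set rule
]

problems = find_inconsistent_links(links)
print(problems)
```

    The same label may point to different concept sets in different lists; only a double link within one list counts as an inconsistency here.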
  68. Standards: http://concepticon.clld.org

  70. Standards: Cross-Linguistic Phonetic Notations

    There are too many IPAs!
  71. Standards: Cross-Linguistic Phonetic Notations (in prep.):

    normalize ambiguities of IPA; establish a pool of phonetic segments which are linked to other datasets (Phoible, Moran et al. 2014; PBase, Mielke 2008; etc.); provide or link to feature systems or additional kinds of metadata; provide generic transformation tables for frequently used phonetic transcription systems; provide an API to check the consistency of a given transcription.
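    A transformation table and a consistency check of the kind listed on this slide could be sketched as follows. The segment pool, the table entries, and both function names are invented for illustration; they are assumptions and do not reproduce the actual CLPN data or API.

```python
# Hypothetical pool of accepted segments and a hypothetical transformation table.
SEGMENT_POOL = {"p", "t", "k", "a", "i", "u", "tʰ", "ŋ"}
TRANSFORM = {"th": "tʰ", "ng": "ŋ"}


def normalize(segments, table=TRANSFORM):
    """Map source-specific symbols to the standard ones; leave others unchanged."""
    return [table.get(segment, segment) for segment in segments]


def unknown_segments(segments, pool=SEGMENT_POOL):
    """Return the segments that are not part of the pool, in order of appearance."""
    return [segment for segment in segments if segment not in pool]


raw = "th a ng".split()  # transcription as found in a (made-up) source
clean = normalize(raw)

print(unknown_segments(raw))
print(unknown_segments(clean))
```

    Segments that are neither in the pool nor in the table are reported rather than changed, so that a human can decide how they should be treated.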
  72. Standards: GLD to CLPN, https://github.com/LinguList/Hmong-Mien_Language_family/

  73. Word Lists in CLDF

  79. Basic Aspects: ID (unique identifier for each word);

    Language and Language_ID (Glottolog); Concept and Parameter_ID (Concepticon); Value, Form, and Segments (CLPN); Source (key of BibTeX file); Comment (free text).
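    The columns listed on this slide can be pictured as one record per word. The sketch below uses a Python dataclass with the column names from the slide; the reading that `Value` holds the entry as given in the source, `Form` the cleaned word form, and `Segments` the segmented transcription is an assumption, and all values (including the identifiers and the source key) are placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass
class WordForm:
    ID: str
    Language: str
    Language_ID: str   # Glottolog identifier (placeholder value below)
    Concept: str
    Parameter_ID: str  # Concepticon identifier (placeholder value below)
    Value: str         # assumed: entry as given in the source
    Form: str          # assumed: cleaned word form
    Segments: list     # assumed: one item per sound segment
    Source: str = ""   # key in a BibTeX file
    Comment: str = ""  # free text

    def to_row(self):
        """Return a flat dict for CSV output, with space-separated segments."""
        row = asdict(self)
        row["Segments"] = " ".join(self.Segments)
        return row


word = WordForm(
    ID="1",
    Language="Language A",
    Language_ID="abcd1234",
    Concept="hand",
    Parameter_ID="0000",
    Value="hand (n.)",
    Form="hand",
    Segments=["h", "a", "n", "d"],
    Source="Key2017",
)
row = word.to_row()
print(row)
```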
  80. Etymological Annotation: [Diagram of the annotation model]

    Labels in the diagram: LANGUAGE, Language, MEANING, Concept, FORM, ID, Value, Form, Segments, Morphemes, Slice, RELATION, Cognate Set, Alignment. Legend: “A requires B”, important, language-internal, language-external.
  87. Examples: The Etymological Dictionary Editor (EDICTOR):

    web-based tool for the annotation of etymological data; available online at http://edictor.digling.org; description available in List (2017); works client-side and offline; allows for morpheme annotation, alignment analysis, cognate annotation, and phonological analysis; serves as a litmus test for the expressive power of wordlists in CLDF.
  88. Examples: http://edictor.digling.org

  89. Outlook

  93. Thank you for your attention!