Computer-Assisted Language Comparison. State of the art and future prospects

Talk held at the workshop "Historische und vergleichende Linguistik in Jena" (Max Planck Institute for the Science of Human History, 2018/07/13).

Schweikhard

July 13, 2018

Transcript

  1. Computer-Assisted Language Comparison: State of the art and future prospects
     T. Tresoldi, N. E. Schweikhard, M.-S. Wu, Y.-F. Lai, J.-M. List
     Max Planck Institute for the Science of Human History, Department of Linguistic and Cultural Evolution
     CALC Project, Jul 13, 2018
  2. Table of Contents
     1 Language Comparison: The Comparative Method; Computational Approaches
     2 State of the Art: The Project; Data Formats; EDICTOR; Alignments
     3 Future prospects: Germanic Word Formation; SEAGaL; Language and geography
  8. The Comparative Method
     Techniques for language comparison, the “comparative method”, haven’t changed substantially since the 19th century:
       1 conduct intensive language comparison
       2 identify regular recurring similarities
       3 reconstruct the development of languages and their families
     Issues:
       Usually done manually, by small groups, over a long time
       Crucial tasks, such as cognate identification, are partly based on non-formalized knowledge and intuition
  11. Computational Approaches
      Quantitative methods in use at least since d’Urville (1830s)
      Computational methods: Swadesh, Greenberg, S. Starostin
        lexicostatistics and data normalization vs. glottochronology and mass comparison
        research on non-consensual supra-families, such as Nostratic
      Modern methods: Ringe, G. Starostin, “New Zealand School”
        Linguistics coming back from synchronicity
        Bayesian inference in phylogenetics: alleged uninterpretability, models from biology
        Competing research by non-linguists and in non-academic settings (NLP)
  12. Meme time
      Title text: “We TOLD you it was hard.” “Yeah, but now that I’VE tried, we KNOW it’s hard.” (XKCD #1831)
  13. Traditional vs. Computational Language Comparison
      [Diagram: the comparative method and computational historical linguistics, annotated with what each lacks; labels: efficiency, consistency, accuracy, flexibility]
  17. The Current Scenario
      Historical language comparison and interdisciplinarity:
        language change not studied by and for itself
        window to human history and a bridge to other disciplines
      Quantitative (and particularly Bayesian) turn:
        classical methods reached their limits in some cases
        open access, collaboration (no more “lone wolves”)
      New languages, new questions:
        large and diverse language families
        language families like Sino-Tibetan present “almost unsurmountable obstacles” (Antoine Meillet, 1925)
  20. Guidelines of our project
      ERC-funded project: Computer-Assisted Language Comparison
      [Diagram: Linguistic data; Manual alignment, Manual sound correspondence, Manual cognate judgment, ...; CLDF, EDICTOR, LingPy, Concepticon, Cross-Linguistic Colexifications]
      Software, data, and tools should complement the traditional approach:
        interdisciplinary approach: adapt rather than transfer
        allow experts to access and understand the results
        computational methods cannot replace experts (assist, not replace)
  21. LingPy
      Programming library for historical linguistics, state of the art:
        multiple phonetic alignments (List, 2014): 98% (pair scores)
        automatic cognate detection (List et al., 2017): 89% (B-Cubed scores)
        phylogenetic reconstruction (Rama et al., 2018): 0.08 (Generalized Quartet Distance)
        correspondence pattern identification (List, under review): NP-hard (no human attempts)
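The B-Cubed scores quoted for cognate detection compare an automatic partition of words into cognate sets with an expert partition. The following is a minimal sketch of how such scores can be computed, not LingPy's own implementation; the function name `bcubed` and the toy word IDs and labels are assumptions made for illustration.

```python
from collections import defaultdict


def bcubed(gold, pred):
    """Return (precision, recall, f_score) for two cluster labelings.

    gold and pred map the same word IDs to cognate-set labels.
    """
    def cluster_of(labels):
        # Map every item to the full set of items sharing its label.
        members = defaultdict(set)
        for item, label in labels.items():
            members[label].add(item)
        return {item: members[label] for item, label in labels.items()}

    gold_sets, pred_sets = cluster_of(gold), cluster_of(pred)
    items = list(gold)
    # Per-item precision: share of the predicted cluster that is truly cognate.
    precision = sum(
        len(gold_sets[i] & pred_sets[i]) / len(pred_sets[i]) for i in items
    ) / len(items)
    # Per-item recall: share of the true cognate set that was recovered.
    recall = sum(
        len(gold_sets[i] & pred_sets[i]) / len(gold_sets[i]) for i in items
    ) / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (hypothetical): four words, expert labels vs. automatic labels.
gold = {"w1": "A", "w2": "A", "w3": "B", "w4": "B"}
pred = {"w1": 1, "w2": 1, "w3": 1, "w4": 2}
precision, recall, f_score = bcubed(gold, pred)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

In this toy case the automatic method lumps one word into the wrong set, which lowers precision more than recall.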
  22. CLDF: Cross-Linguistic Data Formats
      Data must be machine- and human-readable, with unified formats for data storage and exchange.
      Data curation is facilitated by:
        spreadsheet formats
        Validation software
        Benchmark data
        Reference catalogs
        Online publications

      Doculect         Glottocode  Concept  Concepticon ID  Form   Tokens    Source
      Anuta            anut1237    EIGHT    1705            varu   v a r u   POLLEX
      East Futunan     east2447    EIGHT    1705            valu   v a l u   POLLEX
      Hawaiian         hawa1245    EIGHT    1705            walu   w a l u   ID: 71458
      Kapingamarangi   kapi1249    EIGHT    1705            walu   w a l u   POLLEX
      Mele Fila        mele1250    EIGHT    1705            ebaru  B a r u   ID: 52375
      Nukuria          nuku1259    EIGHT    1705            varu   v a r u   Davletshin (2015)
      ...              ...         ...      ...             ...    ...       ...
      Rapanui          rapa1244    EIGHT    1705            va’u   v a P u   POLLEX
      Rennell Bellona  renn1242    EIGHT    1705            bangu  b a Ng u  POLLEX
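To make the idea of machine-readable, validated wordlists concrete, here is a small sketch that reads a few rows of the table above from comma-separated text and runs basic checks. It is not the CLDF specification or its validation software; the column names follow the slide, and `RAW`, `read_wordlist`, and `validate` are hypothetical names.

```python
import csv
import io

# Four rows copied from the slide's table, stored as plain CSV text.
RAW = """Doculect,Glottocode,Concept,Concepticon_ID,Form,Tokens
Anuta,anut1237,EIGHT,1705,varu,v a r u
East Futunan,east2447,EIGHT,1705,valu,v a l u
Hawaiian,hawa1245,EIGHT,1705,walu,w a l u
Kapingamarangi,kapi1249,EIGHT,1705,walu,w a l u
"""


def read_wordlist(text):
    """Parse CSV text into row dicts, splitting Tokens into segment lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["Tokens"] = row["Tokens"].split()
        rows.append(row)
    return rows


def validate(rows):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for number, row in enumerate(rows, start=1):
        code = row["Glottocode"]
        # Glottocodes consist of four alphanumeric characters plus four digits.
        if not (len(code) == 8 and code[:4].isalnum() and code[4:].isdigit()):
            problems.append(f"row {number}: malformed Glottocode {code!r}")
        if not row["Concepticon_ID"].isdigit():
            problems.append(f"row {number}: Concepticon ID is not numeric")
        if not row["Tokens"]:
            problems.append(f"row {number}: form has no segmented tokens")
    return problems


rows = read_wordlist(RAW)
print(len(rows), "rows;", len(validate(rows)), "problems")
```

Keeping the segmented tokens next to the orthographic form is what lets alignment and cognate detection run on the data without manual preprocessing.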
  23. The Etymological DICTionary EditOR (EDICTOR) http://edictor.digling.org/

  24. Alignments Automatic alignment of Germanic cognates of “knee”
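As a rough illustration of what automatic alignment does, the sketch below aligns two segmented words with plain Needleman-Wunsch dynamic programming. This is a deliberately simplified stand-in: LingPy's phonetic alignment uses sound classes and more refined scoring, and the scores (`match`, `mismatch`, `gap`) and the rough transcriptions of German "Knie" and English "knee" are assumptions chosen for the example.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; return (aligned_a, aligned_b, score)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # pair the two segments
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


# Illustrative, roughly transcribed segments (not data from the talk).
german = ["k", "n", "iː"]
english = ["n", "iː"]
aligned_german, aligned_english, total = align(german, english)
print(" ".join(aligned_german))
print(" ".join(aligned_english))
```

Here the initial k of the German form ends up opposite a gap, which is the kind of pattern an expert would then inspect in EDICTOR.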

  25. Sound Correspondences

  26. Concepticon and Cross-Linguistic Colexifications http://clics.clld.org/
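A colexification is a case where one word form in a language expresses two concepts. The sketch below counts, for each concept pair, how many languages colexify it. It only illustrates the counting idea and is not the CLICS pipeline; the language labels, concepts, and forms are invented placeholders, and `colexifications` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def colexifications(entries):
    """Map each concept pair to the number of languages colexifying it.

    entries is an iterable of (language, concept, form) triples.
    """
    concepts_by_form = defaultdict(set)
    for language, concept, form in entries:
        concepts_by_form[(language, form)].add(concept)
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        # Every pair of concepts sharing one form is one colexification.
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)
    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


# Invented placeholder data: three languages, four concepts.
entries = [
    ("L1", "TREE", "tam"), ("L1", "WOOD", "tam"),
    ("L2", "TREE", "kul"), ("L2", "WOOD", "kul"),
    ("L2", "HAND", "mi"), ("L2", "ARM", "mi"),
    ("L3", "TREE", "po"), ("L3", "WOOD", "sa"),
]
for pair, count in sorted(colexifications(entries).items()):
    print(pair, count)
```

Concept pairs that recur across many unrelated languages are the ones worth examining further.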

  31. Future prospects
      Our methods are already helping people by automating tedious tasks of comparative linguistics (over 50 publications citing LingPy!). Where can we go from here?
        Educate and train
        Stochastic methods, decision trees, neural networks
        New questions to explore
          likelihood of random resemblance
          morphology in cognate identification
          partial colexifications
          can intuition be weighted?
          suprasegmental relationships and segments in their setting
          less-studied languages, including sign languages
        Litmus test: Sino-Tibetan languages
  33. Germanic Word Formation - I
      Research Questions
        What are the interrelations between form, meaning, and frequency?
        Are they system-dependent, culture-specific, or universal?
        How can computer-assisted methods help answer these questions?
      Perspectives
        concept-based (onomasiological) vs. form-based (semasiological)
        cross-linguistic vs. language-specific
        quantitative vs. qualitative → computer-assisted
  36. Germanic Word Formation - II
      Morphological Complexity in Basic Vocabulary
        question: evolutionary dynamics of lexical change
        concept-based
        language(-family)-specific
      Paradigmatic Alternations in Nominal Derivations
        question: causes of paradigmatic alternations
        form-based
        language(-family)-specific
      Productivity and Promiscuity in Compounding
        question: language-specific and universal aspects of compoundhood
        concept- and form-based
        cross-linguistic (worldwide) and language-specific
  37. Germanic Word Formation - III
      [Diagram: the three subprojects (Morphological Complexity in Basic Vocabulary; Productivity and Promiscuity of Compound Heads; Paradigmatic Alternations in Nominal Derivations) placed between concept-based and form-based perspectives, with the resources Concepticon, CLICS, EDICTOR, Partial Colexifications, Automatic Morpheme Detection]
  40. SEAGaL: South East Asia Gene and Language - I
      What is the degree of correlation between language and genetic diversity?
      [Diagram: groups A, B, C and languages A, B, C]
      WHY? Geography? Population size? Bilingualism?
  41. SEAGaL: South East Asia Gene and Language - II
      Sino-Tibetan language family.
      [Diagram: Ethnic group; Genomic: Bioinformatics tools; Linguistic: Lexibank, Concepticon, LingPy; Refine, reassess and reconstruct]
  42. Language and geography: the case of Rgyalrongic - I
      Rgyalrongic languages (Sino-Tibetan) exhibit a series of orientational prefixes
      Traditional approaches
        Fully related to actual topography
        According to river, mountain, sun, etc.
        Inconsistent among scholars: wild guesses
        Not covering all uses
      Are orientational prefixes related to actual geography?
  43. Rgyalrongic languages

  44. Language and geography: the case of Rgyalrongic - II
      Our current task is to test how closely orientational prefixes are related to real-world topography
        Selection of 15 familiar places (villages or towns)
        Collection of the prefixes used between every two places
        Draw a map based on the prefixes
        This map represents the collective memory of the speakers
        Compare the inferred map with the actual map
      We may get to understand
        The original meaning of the orientational prefixes
        Evolutionary pathways of the orientational prefixes
        How Rgyalrongic (even Sino-Tibetan) ancestors understood and interpreted geography
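One simple way to approach the comparison step is to fix a hypothesis about what each prefix means and count how often elicited usage agrees with the real coordinates. The sketch below does only that; it does not draw an inferred map and is not the project's actual procedure. The place names, coordinates, prefix labels (`pfx_a`, `pfx_b`), their assumed meanings, and the function `agreement` are all hypothetical.

```python
def agreement(places, observations, meaning):
    """Share of observations whose hypothesised direction fits the real map.

    places: name -> (x, y) coordinates, x growing eastwards, y northwards
    observations: (origin, destination, prefix) triples from elicitation
    meaning: prefix -> "east", "west", "north", or "south" (the hypothesis)
    """
    checks = {
        "east": lambda dx, dy: dx > 0,
        "west": lambda dx, dy: dx < 0,
        "north": lambda dx, dy: dy > 0,
        "south": lambda dx, dy: dy < 0,
    }
    hits = 0
    for origin, destination, prefix in observations:
        dx = places[destination][0] - places[origin][0]
        dy = places[destination][1] - places[origin][1]
        if checks[meaning[prefix]](dx, dy):
            hits += 1
    return hits / len(observations)


# Invented toy data: three places and two placeholder prefixes.
places = {"P1": (0, 0), "P2": (3, 1), "P3": (1, 4)}
meaning = {"pfx_a": "east", "pfx_b": "west"}
observations = [
    ("P1", "P2", "pfx_a"),
    ("P2", "P1", "pfx_b"),
    ("P1", "P3", "pfx_a"),
    ("P3", "P2", "pfx_b"),
]
print(agreement(places, observations, meaning))
```

Running the same check under competing hypotheses (river flow, elevation, compass direction) would show which reading of the prefixes fits the topography best.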
  45. Language and geography: the case of Rgyalrongic - III
      Figure: The actual map
      Figure: The inferred map
  46. Thank you... and let’s collaborate!
      DATA, TOOLS, SOFTWARE, INTERFACES
      http://calc.digling.org/