Computer-Assisted Language Comparison. State of the art and future prospects

Talk held at the workshop "Historische und vergleichende Linguistik in Jena" (Max Planck Institute for the Science of Human History, 2018/07/13).

Schweikhard

July 13, 2018

Transcript

  1. Computer-Assisted Language Comparison
    State of the art and future prospects
    T. Tresoldi N. E. Schweikhard M.-S. Wu Y.-F. Lai
    J.-M. List
    Max Planck Institute for the Science of Human History
    Department of Linguistic and Cultural Evolution
    CALC Project
    Jul 13, 2018

  2. Table of Contents
    1 Language Comparison
    The Comparative Method
    Computational Approaches
    2 State of the Art
    The Project
    Data Formats
    EDICTOR
    Alignments
    3 Future prospects
    Germanic Word Formation
    SEAGaL
    Language and geography

  3. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:

  4. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:
    1 conduct intensive language comparison

  5. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:
    1 conduct intensive language comparison
    2 identify regular recurring similarities

  6. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:
    1 conduct intensive language comparison
    2 identify regular recurring similarities
    3 reconstruct the development of languages and their families

  7. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:
    1 conduct intensive language comparison
    2 identify regular recurring similarities
    3 reconstruct the development of languages and their families
    Issues:
    Usually done manually, by small groups, over a long time

  8. The Comparative Method
    Techniques for language comparison, the “comparative method”,
    haven’t changed substantially since the 19th century:
    1 conduct intensive language comparison
    2 identify regular recurring similarities
    3 reconstruct the development of languages and their families
    Issues:
    Usually done manually, by small groups, over a long time
    Crucial tasks, such as cognate identification, are partly based
    in non-formalized knowledge and intuition

  9. Computational Approaches
    Quantitative methods in use at least since d’Urville (1830s)

  10. Computational Approaches
    Quantitative methods in use at least since d’Urville (1830s)
    Computational methods: Swadesh, Greenberg, S. Starostin
    lexicostatistics and data normalization vs. glottochronology
    and mass comparison
    research on disputed supra-families, such as Nostratic

  11. Computational Approaches
    Quantitative methods in use at least since d’Urville (1830s)
    Computational methods: Swadesh, Greenberg, S. Starostin
    lexicostatistics and data normalization vs. glottochronology
    and mass comparison
    research on disputed supra-families, such as Nostratic
    Modern methods: Ringe, G. Starostin, “New Zealand School”
    Linguistics coming back from synchronicity
    Bayesian inference in phylogenetics: alleged uninterpretability,
    models from biology
    Competing research by non-linguists and in non-academic
    settings (NLP)

  12. Meme time
    Title text: “We TOLD you it was hard.” “Yeah, but now that I’VE
    tried, we KNOW it’s hard.” (XKCD #1831)

  13. Traditional vs. Computational Language Comparison
    [Diagram contrasting the COMPARATIVE METHOD with COMPUTATIONAL
    HISTORICAL LINGUISTICS; labels: lacks, efficiency, consistency,
    accuracy, flexibility]

  15. The Current Scenario
    Historical language comparison and interdisciplinarity:
    language change not studied by and for itself
    window to human history and a bridge to other disciplines

  16. The Current Scenario
    Historical language comparison and interdisciplinarity:
    language change not studied by and for itself
    window to human history and a bridge to other disciplines
    Quantitative (and particularly Bayesian) turn:
    classical methods reached their limits in some cases
    open access, collaboration (no more “lone wolves”)

  17. The Current Scenario
    Historical language comparison and interdisciplinarity:
    language change not studied by and for itself
    window to human history and a bridge to other disciplines
    Quantitative (and particularly Bayesian) turn:
    classical methods reached their limits in some cases
    open access, collaboration (no more “lone wolves”)
    New languages, new questions:
    large and diverse language families
    language families like Sino-Tibetan present “almost
    unsurmountable obstacles” (Antoine Meillet, 1925)

  18. Guidelines of our project
    ERC-funded project: Computer-Assisted Language Comparison
    Linguistic data
    Manual alignment
    Manual sound correspondence
    Manual cognate judgment
    ...
    CLDF
    EDICTOR
    LingPy
    Concepticon
    Cross-Linguistic Colexifications
    Software, data, and tools should complement the traditional
    approach:
    interdisciplinary approach: adapt rather than transfer

  19. Guidelines of our project
    ERC-funded project: Computer-Assisted Language Comparison
    Linguistic data
    Manual alignment
    Manual sound correspondence
    Manual cognate judgment
    ...
    CLDF
    EDICTOR
    LingPy
    Concepticon
    Cross-Linguistic Colexifications
    Software, data, and tools should complement the traditional
    approach:
    interdisciplinary approach: adapt rather than transfer
    allow experts to access and understand the results

  20. Guidelines of our project
    ERC-funded project: Computer-Assisted Language Comparison
    Linguistic data
    Manual alignment
    Manual sound correspondence
    Manual cognate judgment
    ...
    CLDF
    EDICTOR
    LingPy
    Concepticon
    Cross-Linguistic Colexifications
    Software, data, and tools should complement the traditional
    approach:
    interdisciplinary approach: adapt rather than transfer
    allow experts to access and understand the results
    computational methods cannot replace experts (assist, not
    replace)

  21. LingPy
    Programming library for historical linguistics, state of the art:
    multiple phonetic alignments (List, 2014): 98% (pair scores)
    automatic cognate detection (List et al., 2017): 89%
    (B-Cubed scores)
    phylogenetic reconstruction (Rama et al., 2018): 0.08
    (generalized quartet distance)
    correspondence pattern identification (List, under review):
    NP-hard (no human attempts)
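
The B-Cubed scores mentioned above compare an automatic cognate clustering against an expert one. The following is a minimal sketch of the standard B-Cubed definition in plain Python. It is not LingPy's own evaluation code, and the word identifiers and cluster labels are hypothetical toy data.

```python
from collections import defaultdict


def bcubed(gold, pred):
    """Return (precision, recall, f_score) for two clusterings.

    `gold` and `pred` map each word identifier to a cluster label
    (for example, a cognate set ID). Inputs must be non-empty.
    """
    if set(gold) != set(pred):
        raise ValueError("both clusterings must cover the same items")

    def member_sets(labels):
        # Map every item to the full set of items sharing its label.
        members = defaultdict(set)
        for item, label in labels.items():
            members[label].add(item)
        return {item: members[label] for item, label in labels.items()}

    gold_sets, pred_sets = member_sets(gold), member_sets(pred)
    precision = recall = 0.0
    for item in gold:
        shared = len(gold_sets[item] & pred_sets[item])
        precision += shared / len(pred_sets[item])
        recall += shared / len(gold_sets[item])
    precision /= len(gold)
    recall /= len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: five words for one concept.
gold = {"w1": "A", "w2": "A", "w3": "A", "w4": "B", "w5": "B"}
pred = {"w1": 1, "w2": 1, "w3": 2, "w4": 2, "w5": 2}
precision, recall, f_score = bcubed(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

Because every word shares at least itself with both of its clusters, precision and recall are always positive, so the F-score division is safe for non-empty input.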

  22. CLDF: Cross-Linguistic Data Formats
    Data must be machine- and human-readable, with unified formats
    for data storage and exchange.

    Doculect         Glottocode  Concept  Concepticon ID  Form   Tokens    Source
    Anuta            anut1237    EIGHT    1705            varu   v a r u   POLLEX
    East Futunan     east2447    EIGHT    1705            valu   v a l u   POLLEX
    Hawaiian         hawa1245    EIGHT    1705            walu   w a l u   ID: 71458
    Kapingamarangi   kapi1249    EIGHT    1705            walu   w a l u   POLLEX
    Mele Fila        mele1250    EIGHT    1705            ebaru  β a r u   ID: 52375
    Nukuria          nuku1259    EIGHT    1705            varu   v a r u   Davletshin (2015)
    ...              ...         ...      ...             ...    ...       ...
    Rapanui          rapa1244    EIGHT    1705            va’u   v a ʔ u   POLLEX
    Rennell Bellona  renn1242    EIGHT    1705            bangu  b a ŋg u  POLLEX

    Data curation is facilitated by: spreadsheet formats, validation
    software, benchmark data, reference catalogs, online publications.
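
To illustrate what "machine- and human-readable" can mean in practice, here is a small sketch that reads a few rows of the table above with Python's standard `csv` module. The tab-separated layout, the column name `Concepticon_ID`, and the helper names are assumptions made for this example. It is a simplified stand-in, not the full CLDF specification, which stores data as CSV files described by JSON metadata.

```python
import csv
import io
import re
from collections import defaultdict

# Rows copied from the slide; the tab-separated layout is an assumption.
RAW = """Doculect\tGlottocode\tConcept\tConcepticon_ID\tForm\tTokens\tSource
Anuta\tanut1237\tEIGHT\t1705\tvaru\tv a r u\tPOLLEX
East Futunan\teast2447\tEIGHT\t1705\tvalu\tv a l u\tPOLLEX
Hawaiian\thawa1245\tEIGHT\t1705\twalu\tw a l u\tID: 71458
Kapingamarangi\tkapi1249\tEIGHT\t1705\twalu\tw a l u\tPOLLEX
"""


def read_wordlist(text):
    """Parse the table and split the space-separated tokens into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["Tokens"] = row["Tokens"].split()
        rows.append(row)
    return rows


def valid_glottocode(code):
    """Check the shape of a Glottocode: four alphanumerics plus four digits."""
    return re.fullmatch(r"[a-z0-9]{4}[0-9]{4}", code) is not None


def forms_by_concept(rows):
    """Group forms as {concept: {doculect: form}}."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["Concept"]][row["Doculect"]] = row["Form"]
    return dict(grouped)


rows = read_wordlist(RAW)
assert all(valid_glottocode(row["Glottocode"]) for row in rows)
for doculect, form in forms_by_concept(rows)["EIGHT"].items():
    print(f"{doculect:15} {form}")
```

The shape check only confirms that a code looks like a Glottocode; verifying that it actually exists would require the Glottolog catalog itself.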

  23. The Etymological DICTionary EditOR (EDICTOR)
    http://edictor.digling.org/

  24. Alignments
    Automatic alignment of Germanic cognates of “knee”
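
As a rough idea of what automatic alignment involves, the sketch below implements a plain Needleman-Wunsch alignment over sound segments. This is a generic textbook algorithm with made-up scores (match 1, mismatch -1, gap -1), not the sound-class-based method used in LingPy or EDICTOR. The segmentations of German "Knie" and English "knee" are simplified for illustration.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; return (aligned_a, aligned_b, score)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for the segments ending at cell (i, j).
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two segments
                score[i - 1][j] + gap,            # gap in seq_b
                score[i][j - 1] + gap,            # gap in seq_a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[-1][-1]


# Simplified segmentations (an assumption): German "Knie", English "knee".
german = ["k", "n", "iː"]
english = ["n", "iː"]
aligned_de, aligned_en, total = align(german, english)
print(" ".join(aligned_de))
print(" ".join(aligned_en))
```

A realistic scorer would rate similar sounds (such as `v` and `w`) higher than unrelated ones instead of treating all mismatches alike.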

  25. Sound Correspondences

  26. Concepticon and Cross-Linguistic Colexifications
    http://clics.clld.org/

  27. Future prospects
    Our methods are already helping people by automating tedious
    tasks of comparative linguistics (over 50 publications citing
    LingPy!). Where can we move from here?

  28. Future prospects
    Our methods are already helping people by automating tedious
    tasks of comparative linguistics (over 50 publications citing
    LingPy!). Where can we move from here?
    Educate and train

  29. Future prospects
    Our methods are already helping people by automating tedious
    tasks of comparative linguistics (over 50 publications citing
    LingPy!). Where can we move from here?
    Educate and train
    Stochastic methods, decision trees, neural networks

  30. Future prospects
    Our methods are already helping people by automating tedious
    tasks of comparative linguistics (over 50 publications citing
    LingPy!). Where can we move from here?
    Educate and train
    Stochastic methods, decision trees, neural networks
    New questions to explore
    likelihood of random resemblance
    morphology in cognate identification
    partial colexifications
    can intuition be weighted?
    suprasegmental relationships and segments in their setting

  31. Future prospects
    Our methods are already helping people by automating tedious
    tasks of comparative linguistics (over 50 publications citing
    LingPy!). Where can we move from here?
    Educate and train
    Stochastic methods, decision trees, neural networks
    New questions to explore
    likelihood of random resemblance
    morphology in cognate identification
    partial colexifications
    can intuition be weighted?
    suprasegmental relationships and segments in their setting
    less-studied languages, including sign languages
    Litmus test: Sino-Tibetan languages

  32. Germanic Word Formation - I
    Research Questions
    What are the interrelations between form, meaning, and
    frequency?
    Are they system-dependent, culture-specific, or universal?
    How can computer-assisted methods help answer these
    questions?

  33. Germanic Word Formation - I
    Research Questions
    What are the interrelations between form, meaning, and
    frequency?
    Are they system-dependent, culture-specific, or universal?
    How can computer-assisted methods help answer these
    questions?
    Perspectives
    concept-based (onomasiological) vs. form-based
    (semasiological)
    cross-linguistic vs. language-specific
    quantitative vs. qualitative → computer-assisted

  34. Germanic Word Formation - II
    Morphological Complexity in Basic Vocabulary
    question: evolutionary dynamics of lexical change
    concept-based
    language(-family)-specific

  35. Germanic Word Formation - II
    Morphological Complexity in Basic Vocabulary
    question: evolutionary dynamics of lexical change
    concept-based
    language(-family)-specific
    Paradigmatic Alternations in Nominal Derivations
    question: causes of paradigmatic alternations
    form-based
    language(-family)-specific

  36. Germanic Word Formation - II
    Morphological Complexity in Basic Vocabulary
    question: evolutionary dynamics of lexical change
    concept-based
    language(-family)-specific
    Paradigmatic Alternations in Nominal Derivations
    question: causes of paradigmatic alternations
    form-based
    language(-family)-specific
    Productivity and Promiscuity in Compounding
    question: language-specific and universal aspects of
    compoundhood
    concept- and form-based
    cross-linguistic (worldwide) and language-specific

  37. Germanic Word Formation - III
    Concepticon
    concept-based
    form-based
    Morphological Complexity
    in Basic Vocabulary
    Productivity and Promiscuity
    of Compound Heads
    Paradigmatic Alternations
    in Nominal Derivations
    CLICS
    EDICTOR
    Partial Colexifications
    Automatic Morpheme
    Detection

  38. SEAGaL: South East Asia Gene and Language - I
    How strongly do linguistic and genetic diversity correlate?
    A group B group C group A language B language C language

  39. SEAGaL: South East Asia Gene and Language - I
    How strongly do linguistic and genetic diversity correlate?
    A group B group C group A language B language C language
    WHY?

  40. SEAGaL: South East Asia Gene and Language - I
    How strongly do linguistic and genetic diversity correlate?
    A group B group C group A language B language C language
    WHY?
    Geography? Population size? Bilingualism?

  41. SEAGaL: South East Asia Gene and Language - II
    Sino-Tibetan language family.
    Genomic
    Bioinformatics tools
    Ethnic group Linguistic
    Lexibank, Concepticon, LingPy
    Refine, reassess and reconstruct

  42. Language and geography: the case of Rgyalrongic - I
    Rgyalrongic languages (Sino-Tibetan) exhibit a series of
    orientational prefixes
    Traditional approaches
    Fully related to actual topography
    According to river, mountain, sun, etc.
    Inconsistent among scholars: wild guesses
    Not covering all uses
    Are orientational prefixes related to actual geography?

  43. Rgyalrongic languages

  44. Language and geography: the case of Rgyalrongic - II
    Our current job is to test how closely orientational prefixes are
    related to real-world topography
    Selection of 15 familiar places (villages or towns)
    Collection of the prefixes used between every two places
    Draw a map based on the prefixes
    This map represents the collective memory of the speakers
    Compare the inferred map with the actual map
    We may get to understand
    The original meaning of the orientational prefixes
    Evolutionary pathways of the orientational prefixes
    How Rgyalrongic (even Sino-Tibetan) ancestors understood
    and interpreted geography
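
One simple way to quantify the comparison described above is to check, for every pair of places, whether the prefix speakers use agrees with the actual direction of travel. The sketch below does this instead of inferring a full map, which is a swapped-in simplification and not necessarily the procedure used in the project. All coordinates, place names, and the compass-style glosses are hypothetical placeholders. The real prefixes may encode rivers, slopes, or other dimensions.

```python
import math

# Hypothetical place coordinates (x = east, y = north, arbitrary units).
PLACES = {"A": (0.0, 0.0), "B": (3.0, 1.0), "C": (1.0, 4.0)}

# Hypothetical elicited data: gloss of the prefix used for travel from the
# first place to the second. The glosses are placeholders, not real prefixes.
OBSERVED = {
    ("A", "B"): "east",
    ("B", "A"): "west",
    ("A", "C"): "north",
    ("C", "A"): "south",
    ("B", "C"): "north",
    ("C", "B"): "east",
}


def bearing_category(origin, target):
    """Classify the direction from origin to target into four quadrants."""
    dx = target[0] - origin[0]
    dy = target[1] - origin[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360  # 0 = east, 90 = north
    if angle < 45 or angle >= 315:
        return "east"
    if angle < 135:
        return "north"
    if angle < 225:
        return "west"
    return "south"


def agreement(places, observed):
    """Share of place pairs whose prefix gloss matches the true bearing."""
    hits = sum(
        bearing_category(places[a], places[b]) == gloss
        for (a, b), gloss in observed.items()
    )
    return hits / len(observed)


print(f"agreement: {agreement(PLACES, OBSERVED):.2f}")
```

A low agreement under every plausible gloss would suggest that the prefixes are not tied directly to present-day topography.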

  45. Language and geography: the case of Rgyalrongic - III
    Figure: The actual map
    Figure: The inferred map

  46. Thank you... and let’s collaborate!
    DATA, TOOLS, SOFTWARE, INTERFACES
    http://calc.digling.org/
