A cross-linguistic computational approach on chance resemblances

Chance resemblances can be anathema in linguistics, given the difficulty of identifying them and their occasional use as evidence against the comparative method. Conversely, they also tend to be among the objections raised against computational approaches, on the grounds that only expert knowledge and supplementary evidence can confidently distinguish among vertical transmission, horizontal transmission, and chance resemblance. Limitations of this kind have been pointed out since the beginnings of the comparative method, as in the frequently cited false correspondence between Latin *deus* and Ancient Greek *θεός* (both meaning “god”). The reduction, or rather expansion, *ad absurdum* of this difficulty is demonstrated both by the long tradition of folk etymologies, motivated by the assumption that surface similarities are too strong to be due to chance, and by the recurrent claims of amateur linguists about impossible relationships, for example between Ainu and Etruscan.

Among the sources of such difficulties is the fact that we have no clear definition of chance similarity, which is in general loosely defined as “words that sound similar”, particularly when there is limited semantic leeway, and preferably when judged as such with the support of the phonotactics of the languages involved. The uncertainty of the definition translates into limited sets of concrete examples, leading to the absence of baselines.

In this talk, we investigate how to create a baseline for the expected probability of chance resemblance according to different typological parameters. We present the results of a cross-linguistic and computational inquiry on chance resemblances, based on three experiments. The first, building on Rosenfelder (2002), is a purely mathematical model that calculates the probability of random correspondences on a set of linguistic models with very simplified phonological and semantic assumptions. The second applies state-of-the-art algorithms for automatic cognate detection (List et al., 2018b) to languages randomly generated from phonological and semantic parameters collected from real languages of different typologies, in a massive comparison that highlights which factors contribute most to the perception of similarity; we will also explore the possibility of later reusing the dataset to collect resemblance judgements from experts. The third and most important experiment uses actual linguistic data from a cross-linguistic database, Lexibank (forthcoming), applying the same methods to language pairs of varying phylogenetic relationships (Hammarström et al., 2018), combined with semantic information linked to Concepticon (List et al., 2018a) and CLICS (List et al., 2018c).

Our results support the hypothesis that chance resemblances, even across unrelated languages, are common and indeed expected even with minor leeway, and that chance resemblances need to be judged on a per-language-pair and not per-potential-cognate-pair basis. This should help orient the setup of future experiments, besides offering a preliminary baseline for the expectancy of chance resemblance according to given sets of parameters, including language proximity in terms of lineage, which could be incorporated into future work on automatic borrowing detection.

Tiago Tresoldi

August 22, 2019
Transcript

  1. A cross-linguistic computational approach on chance
    resemblances
    Tiago Tresoldi
    Max Planck Institute for the Science of Human History (MPI-SHH, Jena)
    Computer-Assisted Language Comparison (CALC) Project
    August 22nd, 2019
    Tresoldi, T. (MPI-SHH) Cross-linguistic Chance Resemblances August 22nd, 2019 1 / 33


  2. Contents
    1 Introduction
    2 Mathematical Model
    3 Simulated data
    4 Cross-Linguistic Comparison
    5 Results

  3. Fantastic cognates and where to find them

  4. Background
    Mathematical and cross-linguistic attempts at pseudocognacy
    measurement take hold with the diffusion of the quantitative methods
    (Hymes 1960), particularly in the reception of proposals such as those
    of Greenberg (Amerind) and Bomhard (Nostratic)
    Far-fetched cases, such as Quechua and Semitic (Aedo, 2001)
    Many methods, not always distinct from cognacy detection, have
    been implemented since then (Kay 1964, Guy 1994, Ringe 1992)
    While criticized by Baxter (1996), Ringe set a common approach that
    can be found in many subsequent methods, like Kessler (2007),
    Mortarino (2009), Turchin (2010), Kilani (2015) and others
    Innovations with new methods for cognate detection, as in List (2012,
    2014), Jäger (2017), Rama (2018)

  5. What I did
    Experiment with
    mathematical model
    simulated data
    real cross-linguistic data
    Consider both semantic and phonological leeway
    Allow different components in the same workflow, such as different
    scorers, different alignment methods, etc.
    Use phonology and sound classes, not edit distance on orthography
    Not only distinguish true from false cognates, but also provide a
    baseline for pseudocognate expectancy

  6. Mathematical model
    Most approaches evaluate on
    purely mathematical grounds
    first, including Ringe (1992),
    with abstract models on
    languages simulated both in
    phonology and semantics
    In practice, the solution tends to
    be based on binomial probability
    distributions with replacement,
    with some measure for accepting
    or rejecting the null hypothesis
    Figure: Distribution of matches (Kilani
    2015:Figure 1).
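
As a reference point, the binomial setting mentioned on this slide can be written out as follows. The symbols n (number of compared lexeme pairs), p (probability that a single pair matches by chance), and m (observed number of matches) are illustrative choices, not notation taken from the talk:

```latex
P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k},
\qquad
P(X \ge m) = 1 - \sum_{k=0}^{m-1} \binom{n}{k}\, p^{k} (1 - p)^{n - k}
```

Under this reading, the null hypothesis of chance would be rejected when the observed count m makes P(X ≥ m) fall below a chosen significance level.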

  7. Results
    Pattern  Cons  Vowels  Lexemes  p(0)   p(1..5)  p(x) ≥ 0.50
    CV       5     6       100      0.03   0.84     3
    CV       5     6       1000     ∼0.00  ∼0.00    33
    CV       7     6       100      0.08   0.87     3
    CV       20    14      1000     0.02   0.82     4
    CVC      5     6       100      0.57   0.42     -
    CVC      5     6       1000     ∼0.00  0.51     5
    CVC      20    14      100      0.97   0.02     -
    CVC      20    14      1000     0.77   0.22     -
    Table: Probabilities for no matches and one to five matches (=“perfect
    correspondence”) on random samples of different sizes (Lexemes) and inventory
    properties (Pattern, Consonants, Vowels). The last column reports the number of
    matches m at which the probability p(1..m) is above 50%, if any.
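
The kind of calculation behind such a table can be sketched in a few lines. This is a minimal illustration under a strong assumption that is not stated in the slides: every meaning slot in both languages holds a word shape drawn uniformly from all shapes the template allows, so a slot matches by chance with probability 1 / n_shapes. The function names are hypothetical. For a CV template with 5 consonants and 6 vowels and 100 lexemes this lands near the first row of the table, but the other rows need not be reproduced by such a simple model.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def chance_match_probabilities(n_lexemes: int, n_shapes: int, max_matches: int = 5):
    """Return (p(0), p(1..max_matches)) for a toy chance-match model.

    Assumption (ours, not the talk's): each meaning slot matches by chance
    with probability 1 / n_shapes, independently of all other slots.
    """
    p = 1.0 / n_shapes
    p_none = binomial_pmf(0, n_lexemes, p)
    p_some = sum(binomial_pmf(k, n_lexemes, p) for k in range(1, max_matches + 1))
    return p_none, p_some


# CV template with 5 consonants and 6 vowels: 5 * 6 = 30 possible shapes.
p_none, p_some = chance_match_probabilities(n_lexemes=100, n_shapes=5 * 6)
print(f"p(0) = {p_none:.3f}, p(1..5) = {p_some:.3f}")
```

Increasing `n_lexemes` or shrinking `n_shapes` pushes p(0) towards zero, which is the qualitative pattern the table shows for small inventories and large word lists.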

  8. Quantifying resemblance
    Resemblance
    The resemblance between two sequences x and y, where 0.0 expresses
    identity, is defined as the phonological correspondence between the two
    sequences. It is potentially corrected by a measure of semantic distance,
    which can by definition set the resemblance to infinity if such distance is
    above a given threshold t.
    The basic source of phonological correspondence is an alignment
    score, according to either global or local scorers, modelling either
    historical or perceptual similarity.
    In this experiment, phonological correspondence is computed from
    sequence alignment using the SCA scorer (List 2012), and defined as
    $$1 - \frac{\mathrm{ALM}_{xy}}{\max(\mathrm{ALM}_{xx}, \mathrm{ALM}_{yy})}$$
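
The normalisation above can be illustrated with a small sketch. The talk uses the SCA scorer; the code below swaps in a deliberately simple stand-in (a Needleman-Wunsch global alignment that awards one point per identical segment and applies no penalties), so the numbers it produces are not comparable to those reported in the slides. The function names, the default scores, and the `threshold` default are assumptions made for illustration only.

```python
import math


def global_alignment_score(x, y, match=1.0, mismatch=0.0, gap=0.0):
    """Needleman-Wunsch global alignment score with a toy scoring scheme.

    With the default values the score is the length of the longest common
    subsequence, which keeps the derived distance within [0, 1].
    """
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def phonological_distance(x, y, score=global_alignment_score):
    """1 - ALM_xy / max(ALM_xx, ALM_yy); 0.0 means identity.

    Both sequences are assumed to be non-empty.
    """
    return 1.0 - score(x, y) / max(score(x, x), score(y, y))


def resemblance(x, y, semantic_distance=0.0, threshold=1.0):
    """Phonological distance, set to infinity beyond the semantic threshold t."""
    if semantic_distance > threshold:
        return math.inf
    return phonological_distance(x, y)


print(resemblance("papa", "papa"))   # identical sequences
print(resemblance("papa", "epapa"))  # one extra segment
print(resemblance("papa", "epapa", semantic_distance=2.0))  # beyond threshold
```

Passing a different `score` callable is how a historically or perceptually motivated scorer could be plugged into the same normalisation.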

  9. Phonological correspondence scores
    Seqx    Seqy                      Score
    /papa/  /papa/                    0.00
    /papa/  /baba/                    0.00
    /papa/  /bubu/                    0.06
    /papa/  /bubuj/                   0.33
    /papa/  /epapa/                   0.27
    /papa/  /[email protected]˜en:o/   0.89

    Seqx    Seqy            Score
    /mano/  /mano/ (IT)     0.00
    /mano/  /mano/ (ES)     0.00
    /mano/  /m˜w/ (PT)      0.55
    /mano/  /m˜E/ (FR)      0.57
    /mano/  /hænd/ (EN)     0.65
    /mano/  /Sou3/ (ZH)     0.88
    Table: Scores for different crafted examples (first table) and words for
    concept “HAND” in different languages (second table). A score of 0.00
    indicates identity, 1.00 is the maximum theoretical distance.
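
One plausible reading of why /papa/ and /baba/ receive a score of 0.00 is that the comparison operates on sound classes rather than on individual segments. The sketch below illustrates that idea with a hypothetical and heavily simplified class inventory; it is not the SCA sound-class model, and the class labels and memberships are invented for the example.

```python
# Hypothetical, heavily simplified sound classes (NOT the SCA model).
TOY_SOUND_CLASSES = {
    "P": "pbfv",  # labial obstruents
    "T": "td",    # coronal stops
    "K": "kg",    # dorsal stops
    "S": "sz",    # sibilants
    "M": "m",
    "N": "n",
    "R": "rl",
    "H": "h",
    "J": "j",
    "W": "w",
    "A": "a",
    "E": "e",
    "I": "i",
    "O": "o",
    "U": "u",
}

# Invert the table: segment -> class label.
SEGMENT_TO_CLASS = {
    segment: label
    for label, segments in TOY_SOUND_CLASSES.items()
    for segment in segments
}


def to_sound_classes(segments, unknown="?"):
    """Map a sequence of segments to a string of class labels."""
    return "".join(SEGMENT_TO_CLASS.get(segment, unknown) for segment in segments)


for word in ("papa", "baba", "bubu", "mano"):
    print(word, "->", to_sound_classes(word))
```

In this toy inventory "papa" and "baba" collapse to the same class string, while "bubu" differs only in its vowel classes; a real scorer would additionally weight how close two different classes are.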

  10. Simulation
    Results of simulation with randomly generated data from attested
    distributions of patterns, lengths, and inventories, for different vocabulary
    sizes and cluster sizes. The tables report the scores at percentiles of 5%,
    25%, and 50%, as well as the three lowest scoring pairs in each simulation.
    Size  Cluster  5%    25%   50%   Best pairs
    5     1        0.61  0.75  0.76  kodu / zUvitu; olu / rete; semozle / uson
    40    1        0.57  0.65  0.76  bIntu / pUpoti; hondog / muvteka; nufse / hufo
    40    5        0.36  0.50  0.60  defe / tIvi; khabip / uebub; begu / pemi
    40    10       0.37  0.47  0.54  fuveo / vefni; phEbi / pemi; deksab / togapak
    100   10       0.20  0.42  0.55  phEbi / pObi; [email protected] / zize; olu / aìa
    200   10       0.20  0.43  0.53  veja / vea; saoga / zuku; pi:ba / bæpo
    1000  10       0.30  0.43  0.54  da: / taPa; ao / auj; ko: / goe
    1000  50       0.06  0.31  0.40  udu / udu; e: / eu; de: / teuj
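
The overall shape of such a simulation can be sketched as follows. Everything specific here is an assumption made for illustration: the inventories are invented rather than sampled from attested distributions, words are plain CV syllable strings, the distance is a generic string dissimilarity from the Python standard library instead of an alignment-based score, and clusters are simply consecutive blocks of concepts. The sketch only shows the mechanism by which a form may match any form inside its semantic cluster.

```python
import random
from difflib import SequenceMatcher

CONSONANTS = "ptkbdgmnszfvhl"  # invented toy inventory
VOWELS = "aeiou"


def random_word(rng, min_syllables=1, max_syllables=3):
    """Build a random word out of CV syllables."""
    n = rng.randint(min_syllables, max_syllables)
    return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(n))


def random_lexicon(rng, size):
    """One random form per concept; the list index stands for the concept."""
    return [random_word(rng) for _ in range(size)]


def distance(a, b):
    """Stand-in distance in [0, 1]; 0.0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def best_scores(lexicon_a, lexicon_b, cluster_size):
    """For each form in A, the lowest distance to any form of B in its cluster."""
    scores = []
    for i, form in enumerate(lexicon_a):
        start = (i // cluster_size) * cluster_size
        candidates = lexicon_b[start:start + cluster_size]
        scores.append(min(distance(form, other) for other in candidates))
    return scores


def percentile(values, q):
    """Simple lower nearest-rank percentile, with q between 0 and 1."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def simulate(size, cluster_size, seed=42):
    """Score percentiles for two random lexicons of the given size."""
    rng = random.Random(seed)
    lexicon_a = random_lexicon(rng, size)
    lexicon_b = random_lexicon(rng, size)
    scores = best_scores(lexicon_a, lexicon_b, cluster_size)
    return {q: percentile(scores, q) for q in (0.05, 0.25, 0.50)}


for cluster_size in (1, 5, 10):
    print(cluster_size, simulate(size=40, cluster_size=cluster_size))
```

Because every cluster contains the concept itself, the best score for a form can only stay the same or decrease as the cluster grows, which is why larger clusters yield lower percentiles for the same pair of lexicons.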

  11. Empirical Experiment
    Analysis on real data: 1479 languages, 193 different families
    (Glottolog), 416 different genera (own extension to WALS)
    587,198 forms, 2488 comparative concepts linked to Concepticon
    (min 25, max 1850)
    948 pairwise comparisons between randomly sampled language pairs,
    460 involving different families, 488 within the same family (of which
    446 in the same genus)
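
The bookkeeping behind such a sample (draw language pairs at random, then label each pair by how closely the two languages are classified) might look like the sketch below. The five-language metadata table is a toy stand-in for the Glottolog family and genus assignments used in the talk, and the function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

# Toy metadata for illustration: language -> (family, genus).
TOY_METADATA = {
    "English": ("Indo-European", "Germanic"),
    "German": ("Indo-European", "Germanic"),
    "Italian": ("Indo-European", "Romance"),
    "Mandarin": ("Sino-Tibetan", "Chinese"),
    "Erzya": ("Uralic", "Mordvin"),
}


def relation(lang_x, lang_y, metadata):
    """Classify a language pair by its closest shared grouping."""
    family_x, genus_x = metadata[lang_x]
    family_y, genus_y = metadata[lang_y]
    if family_x != family_y:
        return "different family"
    if genus_x != genus_y:
        return "same family, different genus"
    return "same genus"


def sample_pairs(metadata, n_pairs, seed=0):
    """Draw up to n_pairs distinct unordered language pairs at random."""
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(metadata), 2))
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


pairs = sample_pairs(TOY_METADATA, n_pairs=6)
counts = Counter(relation(x, y, TOY_METADATA) for x, y in pairs)
for label, count in counts.items():
    print(f"{label}: {count}")
```

Keeping the relation label next to each pair makes it straightforward to report resemblance scores separately for related and unrelated languages.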

  12. Shared cognates (LexStat)

  13. Empirical results - I
    Resemblances between Erzya erzy1239 (Uralic, Russia) and Ilit
    kuna1268 (Kunama, Eritrea):
    FIRE /tol/ and /tom/
    HAND /kedj/ and /kon/
    THOU /ton/ and /en/
    TONGUE /kelj/ and /Nel/
    TOOTH /pej/ and /m2j/
    WE /mjiñ/ and /5me/

  14. Empirical results - II
    Resemblances between English stan1293 and Mandarin mand1415:
    BACK /bæk/ and /peI/
    BLUNT /d2l/ and /tun/
    COUGH /kOf/ and /kh7/
    FOAM /[email protected]/ and /phAUmwO/
    GIVE /gIv/ and /keI/
    GO /[email protected]/ and /[email protected]/
    LEAD (GUIDE) /li:d/ and /liNtAU/
    LONG /lON/ and /úùhAN/
    PERSON /mæn/ and /õ@n/
    SAY /seI/ and /ùuo/
    SEA /si:/ and /xaI/
    SING /sN/ and /úùhANk7/
    THEY /DEI/ and /tha/
    WHO /hu:/ and /ùui/

  15. Empirical results - III
    Resemblances between Dorze dorz1235 (Ta Ne Omotic, Ethiopia)
    and Seze seze1235 (Blue Nile Mao, Ethiopia):
    EYE /5fe/ and /5wi/
    FIRE /t5m5/ and /t5ma/
    HAND /kuSe/ and /kusa/
    HORN (ANATOMY) /k5tS’e/ and /k5li/
    NOSE /sidi/ and /sinta/
    ONE /isino/ and /iSila/
    SUN /5w5/ and /5wa/
    TOOTH /[email protected]/ and /5tsa/
    TWO /n5mij5/ and /nomba/
    WATER /h5ts’e/ and /h5Nsi/

  16. Semantic leeway - I

  17. Semantic leeway - II
    Two main alternatives
    Allow a form to freely move within its semantic cluster (no matter how
    defined)
    Link all concepts in a weighted graph, accepting all colexifications
    (even from single-case homonymies), smoothing, and computing the
    distance with a shortest path algorithm
    Only reporting the first alternative here
    Experiments with random clusters, testing the hypothesis that the
    larger the semantic leeway, the easier it is to find a resemblance
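
The second alternative listed on this slide (a weighted concept graph queried with a shortest-path algorithm) can be sketched with Dijkstra's algorithm. The graph, its concepts, and its weights are invented for the example; in particular, treating a lower weight as a stronger colexification link is an assumption, and no smoothing step is shown.

```python
import heapq
import math

# Hypothetical colexification graph; weights are made up for illustration.
TOY_GRAPH = {
    "HAND": {"ARM": 1.0, "FINGER": 2.0},
    "ARM": {"HAND": 1.0, "WING": 3.0},
    "FINGER": {"HAND": 2.0, "TOE": 1.0},
    "WING": {"ARM": 3.0},
    "TOE": {"FINGER": 1.0},
    "FIRE": {},  # isolated concept: unreachable from the others
}


def semantic_distance(graph, source, target):
    """Length of the shortest weighted path between two concepts (Dijkstra).

    Returns math.inf when the target cannot be reached from the source.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return math.inf


print(semantic_distance(TOY_GRAPH, "HAND", "TOE"))
print(semantic_distance(TOY_GRAPH, "HAND", "FIRE"))
```

A distance obtained this way could then be compared against a threshold to decide whether two forms are semantically close enough to count as a potential resemblance.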

  18. Mean resemblances - No clusters

  19. Mean resemblances - CLICS2 clusters

  20. Mean resemblances - 10-concept clusters

  21. Mean resemblances - 20-concept clusters

  22. Mean resemblances - Single cluster

  23. Resemblance score percentiles
    As expected, the bigger the semantic leeway, the easier it is to find
    chance resemblances:
    Percentile  1                 10                20                Single-cluster
    5%          0.09 0.12 0.54    0.09 0.12 0.51    0.09 0.11 0.49    0.02 0.03 0.17
    25%         0.31 0.34 0.73    0.31 0.33 0.70    0.30 0.33 0.68    0.15 0.16 0.37
    50%         0.52 0.54 0.82    0.51 0.53 0.79    0.50 0.52 0.77    0.30 0.31 0.48
    75%         0.69 0.71 0.89    0.68 0.69 0.87    0.67 0.68 0.85    0.42 0.42 0.57
    95%         0.87 0.87 0.96    0.85 0.86 0.95    0.84 0.85 0.93    0.56 0.56 0.68

  24. Conclusion
    The larger analysis offered some interesting side-products:
    A matrix of phoneme correspondence in cognates (refined scoring
    matrix, local scoring matrices, asymmetric matrices)
    The weighted network for semantic exploration (to be released)
    We confirm that historical linguistics must be based on regular
    correspondences, not resemblance
    Which is more exciting than it sounds, because we can start measuring
    the probability of bad proposals (like the “Semitic Quechua”)
    Exploration is also helping
    to generate my matrix of perceptual phoneme similarity (Tresoldi 2016,
    extending Mielke 2008)
    work on feature alignment (List 2019) and sub-optimal alignment
    (Tresoldi forth.)

  25. Perceptual similarity

  26. Synthesis
    Chance resemblances are not only common, but expected
    If our methods are not finding them, they might be too strict
    If done properly, widening the semantic space can actually improve
    finding cognates
    We still need to work on methods for that
    Semantic change does not follow ontology
    A future step could be investigating the role of phonosymbolism

  27. References I
    Baxter, W. H., and Manaster Ramer, A.
    Review: On calculating the factor of chance in language comparison.
    By Donald A. Ringe, Jr. Philadelphia: The American Philosophical
    Society, 1992. Pp. 110.
    Diachronica 8, 2 (1996), 371–384.
    Baxter, W. H., and Manaster Ramer, A.
    Beyond lumping and splitting: Probabilistic issues in historical
    linguistics.
    McDonald Institute for Archaeological Research, Cambridge, 2000,
    pp. 167–188.
    Blevins, J.
    Evolutionary phonology: The emergence of sound patterns.
    Cambridge University Press, 2004.

  28. References II
    Damerau, F.
    Mechanization of cognate recognition in comparative linguistics.
    Linguistics 13, 148 (1975), 5–30.
    Guy, J. B.
    An algorithm for identifying cognates in bilingual word-lists and its
    applicability to machine translation.
    Journal of Quantitative Linguistics 1, 1 (1994), 35–42.
    Kessler, B.
    Word similarity metrics and multilateral comparison.
    In Proceedings of Ninth Meeting of the ACL Special Interest Group in
    Computational Morphology and Phonology (2007), Association for
    Computational Linguistics, pp. 6–14.

  29. References III
    Kilani, M.
    Calculating false cognates. An extension of the Baxter Manaster-Ramer
    solution and its application to the case of pre-Greek.
    Diachronica 32, 3 (2015), 331–364.
    Kondrak, G.
    Identification of cognates and recurrent sound correspondences in
    word lists.
    TAL 50, 2 (2009), 201–235.
    List, J.-M.
    Computer-Assisted Language Comparison: Reconciling computational
    and classical approaches in historical linguistics.
    Max Planck Institute for the Science of Human History, Jena, 2016.

  30. References IV
    List, J.-M., Greenhill, S. J., and Gray, R. D.
    The potential of automatic word comparison for historical linguistics.
    PLoS ONE 12, 1 (2017), e0170046.
    Mortarino, C.
    A statistical test useful in historical linguistics.
    In Proceedings of the XLII Scientific Meeting of the Italian Statistical
    Society. 2004, pp. 107–110.
    Mortarino, C.
    An improved statistical test for historical linguistics.
    Statistical Methods and Applications 18, 2 (2009), 193–204.
    Nichols, J.
    The comparative method as heuristic.
    Oxford University Press, New York, 1996, pp. 39–71.

  31. References V
    Nichols, J.
    Quasi-cognates and lexical type shifts: Rigorous distance measures for
    long-range comparison.
    In Phylogenetic methods and the prehistory of languages (Cambridge
    UK, Oxford UK, Oakville CT USA, 2006), P. Forster and
    C. Renfrew, Eds., McDonald Institute monographs, McDonald
    Institute for Archaeological Research; distributed by Oxbow Books,
    pp. 57–66.
    Rama, T., Borin, L., Mikros, G., and Macutek, J.
    Comparative evaluation of string similarity measures for automatic
    language classification. 2015.

  32. References VI
    Ringe, D. A.
    On calculating the factor of chance in language comparison.
    Transactions of the American Philosophical Society 82, 1 (1992),
    1–110.
    Tresoldi, T., Anderson, C., and List, J.-M.
    Modelling sound change with the help of multi-tiered sequence
    representations.
    In Poznań Linguistic Meeting 2018 (Poznań, 2018), ?, p. ?

  33. A cross-linguistic computational approach on chance
    resemblances
    Tiago Tresoldi
    Max Planck Institute for the Science of Human History (MPI-SHH, Jena)
    Computer-Assisted Language Comparison (CALC) Project
    August 22nd, 2019