A cross-linguistic computational approach on chance resemblances

Chance resemblances can be anathema in linguistics, given the difficulty of identifying them and their occasional use as evidence against the comparative method. They also tend to be among the objections raised against computational approaches, on the grounds that only expert knowledge and supplementary evidence can confidently distinguish among vertical transmission, horizontal transmission, and chance resemblance. Limitations of this kind have been pointed out since the beginnings of the comparative method, as in the frequently cited false correspondence between Latin *deus* and Ancient Greek *θεός* (both meaning “god”). The reduction, or rather expansion, *ad absurdum* of this difficulty is demonstrated both by the long tradition of folk etymologies, motivated by the assumption that surface similarities are too strong to be due to chance, and by the recurrent claims of amateur linguists about impossible relationships, for example between Ainu and Etruscan.

Among the sources of such difficulties is the fact that we have no clear definition of chance similarity, which is in general loosely defined as “words that sound similar”, particularly when there is limited semantic leeway and preferably when judged as such with the support of the phonotactics of the languages involved. The uncertainties of definition translate into limited sets of concrete examples, leading to the absence of baselines.

In this talk, we investigate how to create a baseline for the expected probability of chance resemblance according to different typological parameters. We present the results of a cross-linguistic and computational inquiry into chance resemblances, comprising three experiments. The first, building on Rosenfelder (2002), is a purely mathematical model that calculates the probability of random correspondences on a set of linguistic models with very simplified phonological and semantic assumptions. The second applies state-of-the-art algorithms for automatic cognate detection (List et al., 2018b) to languages randomly generated from phonological and semantic parameters collected from real languages of different typologies, in a massive comparison that highlights which factors contribute most to the perception of similarity; we will also explore the possibility of later re-using the dataset to collect resemblance judgements from experts. The third and most important experiment uses actual linguistic data from a cross-linguistic database, Lexibank (forthcoming), applying the same methods to language pairs of varying phylogenetic relationships (Hammarström et al., 2018), combined with semantic information linked to Concepticon (List et al., 2018a) and CLICS (List et al., 2018c).

Our results support the hypothesis that chance resemblances, even across unrelated languages, are common and indeed expected even with minor leeway, and that chance resemblances need to be judged on a per-language-pair basis rather than on a per-potential-cognate-pair basis. This should help orient future experimental setups, besides offering a preliminary baseline for the expectancy of chance resemblance according to given sets of parameters, including language proximity in terms of lineage, which could be incorporated into future work on automatic borrowing detection.

Tiago Tresoldi

August 22, 2019

Transcript

  1. A cross-linguistic computational approach on chance resemblances

    Tiago Tresoldi, Max Planck Institute for the Science of Human History (MPI-SHH, Jena), Computer-Assisted Language Comparison (CALC) Project, August 22nd, 2019

    Tresoldi, T. (MPI-SHH) Cross-linguistic Chance Resemblances August 22nd, 2019 1 / 33
  2. Contents

    1. Introduction
    2. Mathematical Model
    3. Simulated data
    4. Cross-Linguistic Comparison
    5. Results
  3. Fantastic cognates and where to find them
  4. Background

    - Mathematical and cross-linguistic attempts at pseudocognacy measurement took hold with the diffusion of quantitative methods (Hymes 1960), particularly in the reception of proposals such as those of Greenberg (Amerind) and Bomhard (Nostratic)
    - Far-fetched cases, such as Quechua and Semitic (Aedo, 2001)
    - Many methods, not always distinct from cognacy detection, have been implemented since then (Kay 1964, Guy 1994, Ringe 1992)
    - While criticized by Baxter (1996), Ringe set a common approach that can be found in many subsequent methods, like Kessler (2007), Mortarino (2009), Turchin (2010), Kilani (2015) and others
    - Innovations with new methods for cognate detection, as in List (2012, 2014), Jäger (2017), Rama (2018)
  5. What I did

    - Experiment with a mathematical model, simulated data, and real cross-linguistic data
    - Consider both semantic and phonological leeway
    - Allow different components in the same workflow, such as different scorers, different alignment methods, etc.
    - Use phonology and sound classes, not edit distance on orthography
    - Not only distinguish true from false cognates, but also have a baseline for pseudocognate expectancy
  6. Mathematical model

    - Most approaches evaluate on purely mathematical grounds first, including Ringe (1992), with abstract models of languages simulated both in phonology and semantics
    - In practice, the solution tends to be based on binomial probability distributions with replacement, with some measure for accepting or rejecting the null hypothesis
    - Figure: Distribution of matches (Kilani 2015: Figure 1).
  7. Results

    | Pattern | Cons | Vowels | Lexemes | p(0) | p(1..5) | p(x) ≥ 0.50 |
    |---|---|---|---|---|---|---|
    | CV | 5 | 6 | 100 | 0.03 | 0.84 | 3 |
    | CV | 5 | 6 | 1000 | ∼0.00 | ∼0.00 | 33 |
    | CV | 7 | 6 | 100 | 0.08 | 0.87 | 3 |
    | CV | 20 | 14 | 1000 | 0.02 | 0.82 | 4 |
    | CVC | 5 | 6 | 100 | 0.57 | 0.42 | - |
    | CVC | 5 | 6 | 1000 | ∼0.00 | 0.51 | 5 |
    | CVC | 20 | 14 | 100 | 0.97 | 0.02 | - |
    | CVC | 20 | 14 | 1000 | 0.77 | 0.22 | - |

    Table: Probabilities for no matches and one to five matches (= “perfect correspondence”) on random samples of different sizes (Lexemes) and inventory properties (Pattern, Consonants, Vowels). The last column reports the number of matches m at which the probability p(1..m) is above 50%, if any.
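The binomial reasoning behind a table of this kind can be sketched as follows. This is a minimal illustration, not necessarily the model used in the talk: it assumes that every form allowed by the pattern is equally likely, that each of the compared lexemes matches independently with probability 1 / (number of possible forms), and the function names (`possible_forms`, `match_pmf`, `match_range`) are hypothetical.

```python
from math import comb

def possible_forms(pattern, consonants, vowels):
    """Number of distinct forms for a pattern such as "CV" or "CVC"
    (assumption: each C and V slot is filled uniformly and independently)."""
    total = 1
    for slot in pattern:
        total *= consonants if slot == "C" else vowels
    return total

def match_pmf(k, lexemes, p):
    """Binomial probability of exactly k chance matches among `lexemes` comparisons."""
    return comb(lexemes, k) * p**k * (1 - p) ** (lexemes - k)

def match_range(lo, hi, lexemes, p):
    """Probability of between lo and hi matches, inclusive."""
    return sum(match_pmf(k, lexemes, p) for k in range(lo, hi + 1))

# CV pattern, 5 consonants, 6 vowels, 100 lexemes
p = 1 / possible_forms("CV", consonants=5, vowels=6)
p_none = match_pmf(0, 100, p)
p_one_to_five = match_range(1, 5, 100, p)
print(f"p(0) = {p_none:.3f}, p(1..5) = {p_one_to_five:.3f}")
```

Under these assumptions the CV, 5-consonant, 6-vowel, 100-lexeme case gives p(0) of roughly 0.03 and p(1..5) of roughly 0.85, close to the first row of the table; other rows need not agree, since the exact model behind the table is not specified in the transcript.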
  8. Quantifying resemblance

    Resemblance: the resemblance between two sequences x and y, where 0.0 expresses identity, is defined from the phonological correspondence between the two sequences. It is potentially corrected by a measure of semantic distance, which can by definition set the resemblance to infinity if such distance is above a given threshold t.

    The basic source of phonological correspondence is an alignment score, according to either global or local scorers, modelling either historical or perceptual similarity.

    In this experiment, phonological correspondence is computed from sequence alignment using the SCA scorer (List 2012), and defined as

    $$1 - \frac{\mathrm{ALM}_{xy}}{\max(\mathrm{ALM}_{xx}, \mathrm{ALM}_{yy})}$$
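The normalisation on this slide can be illustrated with a toy stand-in for the alignment step. This sketch does not use the SCA scorer: it assumes a plain Needleman-Wunsch global alignment over a small, made-up sound-class mapping (`TOY_CLASSES`), with hypothetical match, mismatch, and gap values, and it clamps negative alignment scores to zero so that the result stays between 0.0 and 1.0.

```python
# Toy sound classes (an assumption for illustration, not the SCA model).
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T", "s": "S", "z": "S",
    "k": "K", "g": "K", "m": "M", "n": "N",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}

def sound_class(segment):
    """Map a segment to its toy class; unknown segments form their own class."""
    return TOY_CLASSES.get(segment, segment)

def alignment_score(x, y, match=1.0, mismatch=-1.0, gap=-1.0):
    """Needleman-Wunsch global alignment score over toy sound classes."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = sound_class(x[i - 1]) == sound_class(y[j - 1])
            diagonal = table[i - 1][j - 1] + (match if same else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]

def resemblance(x, y):
    """1 - ALM_xy / max(ALM_xx, ALM_yy) for non-empty sequences."""
    self_score = max(alignment_score(x, x), alignment_score(y, y))
    pair_score = max(alignment_score(x, y), 0.0)  # clamp: an assumption of this sketch
    return 1.0 - pair_score / self_score

for other in ("papa", "baba", "epapa", "kiki"):
    print("papa", other, round(resemblance("papa", other), 2))
```

Because the toy classes merge /p/ and /b/, the pair /papa/ and /baba/ scores as identical here, which is the qualitative behaviour a sound-class approach is meant to capture; the actual values will differ from those of the SCA scorer.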
  9. Phonological correspondence scores

    | Seq x | Seq y | Score |
    |---|---|---|
    | /papa/ | /papa/ | 0.00 |
    | /papa/ | /baba/ | 0.00 |
    | /papa/ | /bubu/ | 0.06 |
    | /papa/ | /bubuj/ | 0.33 |
    | /papa/ | /epapa/ | 0.27 |
    | /papa/ | /wɨtʃəhẽnːo/ | 0.89 |

    | Seq x | Seq y | Score |
    |---|---|---|
    | /mano/ | /mano/ (IT) | 0.00 |
    | /mano/ | /mano/ (ES) | 0.00 |
    | /mano/ | /mɐ̃w̃/ (PT) | 0.55 |
    | /mano/ | /mɛ̃/ (FR) | 0.57 |
    | /mano/ | /hænd/ (EN) | 0.65 |
    | /mano/ | /ʃou3/ (ZH) | 0.88 |

    Table: Scores for different crafted examples (first table) and words for the concept “HAND” in different languages (second table). A score of 0.00 indicates identity, 1.00 is the maximum theoretical distance.
  10. Simulation

    Results of simulation with randomly generated data from attested distributions of patterns, lengths, and inventories, for different vocabulary sizes and cluster sizes. The table reports the scores at percentiles of 5%, 25%, and 50%, as well as the three lowest scoring pairs in each simulation.

    | Size | Cluster | 5% | 25% | 50% | Best pairs |
    |---|---|---|---|---|---|
    | 5 | 1 | 0.61 | 0.75 | 0.76 | kodu / zʊvitu; olu / rete; semozle / uson |
    | 40 | 1 | 0.57 | 0.65 | 0.76 | bɪntu / pʊpoti; hondog / muvteka; nufse / hufo |
    | 40 | 5 | 0.36 | 0.50 | 0.60 | defe / tɪvi; kʰabip / uebub; begu / pemi |
    | 40 | 10 | 0.37 | 0.47 | 0.54 | fuveo / vefni; pʰɛbi / pemi; deksab / togapak |
    | 100 | 10 | 0.20 | 0.42 | 0.55 | pʰɛbi / pɔbi; zasə / zize; olu / aɬa |
    | 200 | 10 | 0.20 | 0.43 | 0.53 | veja / vea; saoga / zuku; piːba / bæpo |
    | 1000 | 10 | 0.30 | 0.43 | 0.54 | daː / taʔa; ao / auj; koː / goe |
    | 1000 | 50 | 0.06 | 0.31 | 0.40 | udu / udu; eː / eu; deː / teuj |
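A simulation of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the inventories and syllable patterns are invented rather than drawn from attested distributions, the distance is a normalised edit distance over toy sound classes rather than the SCA-based score, and a semantic cluster is simply a block of consecutive concept indices.

```python
import random
import statistics

CONSONANTS = "ptkbdgmnszfvlr"          # invented inventory
VOWELS = "aeiou"
PATTERNS = ["CV", "CVC", "CVCV", "CVCVC", "VCV"]
CLASSES = {"p": "P", "b": "P", "f": "P", "v": "P", "t": "T", "d": "T",
           "s": "S", "z": "S", "k": "K", "g": "K", "m": "M", "n": "N",
           "l": "L", "r": "R"}         # every other segment (vowels) maps to "V"

def random_word(rng):
    """Draw a pattern, then fill each slot uniformly from the inventory."""
    pattern = rng.choice(PATTERNS)
    return "".join(rng.choice(CONSONANTS if slot == "C" else VOWELS) for slot in pattern)

def distance(x, y):
    """Edit distance over toy sound classes, normalised to the range 0..1."""
    a = [CLASSES.get(seg, "V") for seg in x]
    b = [CLASSES.get(seg, "V") for seg in y]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def simulate(size, cluster, seed=0):
    """Best (lowest) score per concept when a form may match any word
    in the same semantic cluster of the other random language."""
    rng = random.Random(seed)
    lang_a = [random_word(rng) for _ in range(size)]
    lang_b = [random_word(rng) for _ in range(size)]
    scores = []
    for i, word in enumerate(lang_a):
        start = (i // cluster) * cluster
        candidates = lang_b[start:start + cluster]
        scores.append(min(distance(word, other) for other in candidates))
    return scores

for cluster in (1, 10):
    cuts = statistics.quantiles(simulate(200, cluster), n=20)
    print(cluster, [round(cuts[k], 2) for k in (0, 4, 9)])  # 5%, 25%, 50%
```

Since each cluster always contains the word for the same concept, widening the cluster can only lower or keep each score, which mirrors the trend towards lower percentiles for larger clusters.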
  11. Empirical Experiment

    Analysis on real data:

    - 1479 languages, 193 different families (Glottolog), 416 different genera (own extension to WALS)
    - 587,198 forms, 2488 comparative concepts linked to Concepticon (min 25, max 1850)
    - 948 pairwise comparisons between randomly sampled language pairs, 460 involving different families, 488 within the same family (of which 446 in the same genus)
  12. Empirical results - I

    Resemblances between Erzya erzy1239 (Uralic, Russia) and Ilit kuna1268 (Kunama, Eritrea):

    - FIRE: /tol/ and /tom/
    - HAND: /kedj/ and /kon/
    - THOU: /ton/ and /en/
    - TONGUE: /kelj/ and /ŋel/
    - TOOTH: /pej/ and /mʌj/
    - WE: /mjiɲ/ and /ɐme/
  13. Empirical results - II

    Resemblances between English stan1293 and Mandarin mand1415:

    - BACK: /bæk/ and /peɪ/
    - BLUNT: /dʌl/ and /tun/
    - COUGH: /kɔf/ and /kʰɤ/
    - FOAM: /fəʊm/ and /pʰɑʊmwɔ/
    - GIVE: /gɪv/ and /keɪ/
    - GO: /gəʊ/ and /tsəʊ/
    - LEAD (GUIDE): /liːd/ and /liŋtɑʊ/
    - LONG: /lɔŋ/ and /ʈʂʰɑŋ/
    - PERSON: /mæn/ and /ɻən/
    - SAY: /seɪ/ and /ʂuo/
    - SEA: /siː/ and /xaɪ/
    - SING: /sŋ/ and /ʈʂʰɑŋkɤ/
    - THEY: /ðɛɪ/ and /tʰa/
    - WHO: /huː/ and /ʂui/
  14. Empirical results - III

    Resemblances between Dorze dorz1235 (Ta Ne Omotic, Ethiopia) and Seze seze1235 (Blue Nile Mao, Ethiopia):

    - EYE: /ɐfe/ and /ɐwi/
    - FIRE: /tɐmɐ/ and /tɐma/
    - HAND: /kuʃe/ and /kusa/
    - HORN (ANATOMY): /kɐtʃ’e/ and /kɐli/
    - NOSE: /sidi/ and /sinta/
    - ONE: /isino/ and /iʃila/
    - SUN: /ɐwɐ/ and /ɐwa/
    - TOOTH: /ɐtsə/ and /ɐtsa/
    - TWO: /nɐmijɐ/ and /nomba/
    - WATER: /hɐts’e/ and /hɐŋsi/
  15. Semantic leeway - II

    - Two main alternatives:
      - Allow a form to move freely within its semantic cluster (no matter how defined)
      - Link all concepts in a weighted graph, accepting all colexifications (even from single-case homonymies), smoothing, and computing the distance with a shortest-path algorithm
    - Only the first alternative is reported here
    - Experiments with random clusters, testing the hypothesis that the larger the semantic leeway, the easier it is to find a resemblance
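The second alternative, a semantic distance computed as a shortest path over a weighted concept graph, can be sketched as below. The graph, its concepts, and its weights are hypothetical and only stand in for a colexification network; the sketch assumes that lower edge weights represent more frequently colexified concept pairs and uses Dijkstra's algorithm as one possible shortest-path method.

```python
import heapq

# Hypothetical colexification graph (invented weights, for illustration only).
GRAPH = {
    "HAND": {"ARM": 1.0, "FINGER": 2.0},
    "ARM": {"HAND": 1.0, "WING": 3.0},
    "FINGER": {"HAND": 2.0, "TOE": 1.5},
    "WING": {"ARM": 3.0},
    "TOE": {"FINGER": 1.5},
    "FIRE": {},
}

def semantic_distance(graph, source, target):
    """Dijkstra shortest-path cost; returns infinity when no path exists."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")

print(semantic_distance(GRAPH, "HAND", "TOE"))
print(semantic_distance(GRAPH, "HAND", "FIRE"))
```

An unreachable concept yields an infinite distance, which fits the idea of a threshold beyond which two forms are not compared at all.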
  16. Resemblance score percentiles

    As expected, the bigger the semantic leeway, the easier it is to find chance resemblances:

    | Percentile | 1 | 10 | 20 | Single-cluster |
    |---|---|---|---|---|
    | 5% | 0.09 / 0.12 / 0.54 | 0.09 / 0.12 / 0.51 | 0.09 / 0.11 / 0.49 | 0.02 / 0.03 / 0.17 |
    | 25% | 0.31 / 0.34 / 0.73 | 0.31 / 0.33 / 0.70 | 0.30 / 0.33 / 0.68 | 0.15 / 0.16 / 0.37 |
    | 50% | 0.52 / 0.54 / 0.82 | 0.51 / 0.53 / 0.79 | 0.50 / 0.52 / 0.77 | 0.30 / 0.31 / 0.48 |
    | 75% | 0.69 / 0.71 / 0.89 | 0.68 / 0.69 / 0.87 | 0.67 / 0.68 / 0.85 | 0.42 / 0.42 / 0.57 |
    | 95% | 0.87 / 0.87 / 0.96 | 0.85 / 0.86 / 0.95 | 0.84 / 0.85 / 0.93 | 0.56 / 0.56 / 0.68 |

    Each cell lists the three values given on the slide for that leeway setting.
  17. Conclusion

    - The larger analysis offered some interesting side-products:
      - A matrix of phoneme correspondence in cognates (refined scoring matrix, local scoring matrices, asymmetric matrices)
      - The weighted network for semantic exploration (to be released)
    - We confirm that historical linguistics must be based on regular correspondences, not resemblance
    - Which is more exciting than it sounds, because we can start measuring the probability of bad proposals (like the “Semitic Quechua”)
    - Exploration is also helping to generate my matrix of perceptual phoneme similarity (Tresoldi 2016, extending Mielke 2008), work on feature alignment (List 2019), and sub-optimal alignment (Tresoldi forth.)
  18. Synthesis

    - Chance resemblances are not only common, but expected
    - If our methods are not finding them, they might be too strict
    - If done properly, widening the semantic space can actually improve finding cognates
    - We still need to work on methods for that
    - Semantic change does not follow ontology
    - A future step could be investigating the role of phonosymbolism
  19. References I

    - Baxter, W. H., and Manaster Ramer, A. Review: On calculating the factor of chance in language comparison. By Donald A. Ringe, Jr. Philadelphia: The American Philosophical Society, 1992. Pp. 110. Diachronica 8, 2 (1996), 371–384.
    - Baxter, W. H., and Manaster Ramer, A. Beyond lumping and splitting: Probabilistic issues in historical linguistics. McDonald Institute for Archaeological Research, Cambridge, 2000, pp. 167–188.
    - Blevins, J. Evolutionary phonology: The emergence of sound patterns. Cambridge University Press, 2004.
  20. References II

    - Damerau, F. Mechanization of cognate recognition in comparative linguistics. Linguistics 13, 148 (1975), 5–30.
    - Guy, J. B. An algorithm for identifying cognates in bilingual word-lists and its applicability to machine translation. Journal of Quantitative Linguistics 1, 1 (1994), 35–42.
    - Kessler, B. Word similarity metrics and multilateral comparison. In Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology (2007), Association for Computational Linguistics, pp. 6–14.
  21. References III

    - Kilani, M. Calculating false cognates: An extension of the Baxter & Manaster-Ramer solution and its application to the case of Pre-Greek. Diachronica 32, 3 (2015), 331–364.
    - Kondrak, G. Identification of cognates and recurrent sound correspondences in word lists. TAL 50, 2 (2009), 201–235.
    - List, J.-M. Computer-Assisted Language Comparison: Reconciling computational and classical approaches in historical linguistics. Max Planck Institute for the Science of Human History, Jena, 2016.
  22. References IV

    - List, J.-M., Greenhill, S. J., and Gray, R. D. The potential of automatic word comparison for historical linguistics. PLoS ONE 12, 1 (2017), e0170046.
    - Mortarino, C. A statistical test useful in historical linguistics. In Proceedings of the XLII Scientific Meeting of the Italian Statistical Society (2004), pp. 107–110.
    - Mortarino, C. An improved statistical test for historical linguistics. Statistical Methods and Applications 18, 2 (2009), 193–204.
    - Nichols, J. The comparative method as heuristic. Oxford University Press, New York, 1996, pp. 39–71.
  23. References V

    - Nichols, J. Quasi-cognates and lexical type shifts: Rigorous distance measures for long-range comparison. In Phylogenetic methods and the prehistory of languages (Cambridge, UK; Oxford, UK; Oakville, CT, USA, 2006), P. Forster and C. Renfrew, Eds., McDonald Institute monographs, McDonald Institute for Archaeological Research; distributed by Oxbow Books, pp. 57–66.
    - Rama, T., Borin, L., Mikros, G., and Macutek, J. Comparative evaluation of string similarity measures for automatic language classification. 2015.
  24. References VI

    - Ringe, D. A. On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82, 1 (1992), 1–110.
    - Tresoldi, T., Anderson, C., and List, J.-M. Modelling sound change with the help of multi-tiered sequence representations. In Poznań Linguistic Meeting 2018 (Poznań, 2018), ?, p. ?