A cross-linguistic computational approach on chance resemblances

Tiago Tresoldi
Max Planck Institute for the Science of Human History (MPI-SHH, Jena)
Computer-Assisted Language Comparison (CALC) Project
August 22nd, 2019
Tresoldi, T. (MPI-SHH) Cross-linguistic Chance Resemblances August 22nd, 2019 1 / 33

Contents
1 Introduction
2 Mathematical Model
3 Simulated data
4 Cross-Linguistic Comparison
5 Results

Fantastic cognates and where to find them

Background
Mathematical and cross-linguistic attempts at measuring pseudocognacy took hold with the diffusion of quantitative methods (Hymes 1960), particularly in the reception of proposals such as those of Greenberg (Amerind) and Bomhard (Nostratic)
Far-fetched cases, such as Quechua and Semitic (Aedo 2001)
Many methods, not always distinct from cognacy detection, have been implemented since then (Kay 1964, Guy 1994, Ringe 1992)
While criticized by Baxter (1996), Ringe set a common approach that can be found in many subsequent methods, such as Kessler (2007), Mortarino (2009), Turchin (2010), Kilani (2015), and others
Innovations with new methods for cognate detection, as in List (2012, 2014), Jäger (2017), Rama (2018)

What I did
Experiment with a mathematical model, simulated data, and real cross-linguistic data
Consider both semantic and phonological leeway
Allow different components in the same workflow, such as different scorers, different alignment methods, etc.
Use phonology and sound classes, not edit distance on orthography
Not only distinguish true from false cognates, but also establish a baseline for pseudocognate expectancy

Mathematical model
Most approaches, including Ringe (1992), evaluate on purely mathematical grounds first, with abstract models of languages simulated in both phonology and semantics
In practice, the solution tends to be based on binomial probability distributions with replacement, with some measure for accepting or rejecting the null hypothesis
Figure: Distribution of matches (Kilani 2015: Figure 1).

Results
Pattern  Cons  Vowels  Lexemes  p(0)   p(1..5)  p(x) ≥ 0.50
CV          5       6      100  0.03   0.84     3
CV          5       6     1000  ∼0.00  ∼0.00    33
CV          7       6      100  0.08   0.87     3
CV         20      14     1000  0.02   0.82     4
CVC         5       6      100  0.57   0.42     -
CVC         5       6     1000  ∼0.00  0.51     5
CVC        20      14      100  0.97   0.02     -
CVC        20      14     1000  0.77   0.22     -
Table: Probabilities for no matches and for one to five matches (= “perfect correspondence”) on random samples of different sizes (Lexemes) and inventory properties (Pattern, Consonants, Vowels). The last column reports the number of matches m at which the probability p(1..m) is above 50%, if any.
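The binomial reading of these columns can be sketched as follows. This is only a minimal illustration under stated assumptions: every form follows the given pattern, segments are drawn uniformly and independently, and each of the Lexemes meanings is one independent trial. The function names (`match_probability`, `binomial_pmf`, `table_row`) are hypothetical. Under these assumptions the first row (CV, 5, 6, 100) comes out at about 0.03, 0.85, and 3; other rows are not guaranteed to match the reported figures, so the model behind the table may differ in its details.

```python
from math import comb


def match_probability(pattern, n_consonants, n_vowels):
    """Chance that two uniformly random forms of this pattern are identical."""
    n_forms = 1
    for slot in pattern:
        n_forms *= n_consonants if slot == "C" else n_vowels
    return 1.0 / n_forms


def binomial_pmf(k, n, p):
    """Probability of exactly k matches in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def table_row(pattern, n_consonants, n_vowels, n_lexemes):
    """Return (p(0), p(1..5), smallest m with p(1..m) >= 0.5 or None)."""
    p = match_probability(pattern, n_consonants, n_vowels)
    p_zero = binomial_pmf(0, n_lexemes, p)
    p_one_to_five = sum(binomial_pmf(k, n_lexemes, p) for k in range(1, 6))

    cumulative, threshold = 0.0, None
    for m in range(1, n_lexemes + 1):
        cumulative += binomial_pmf(m, n_lexemes, p)
        if cumulative >= 0.5:
            threshold = m
            break
    return p_zero, p_one_to_five, threshold


if __name__ == "__main__":
    print(table_row("CV", 5, 6, 100))
```

When p(0) alone exceeds 0.5, as in the sparse CVC configurations, the cumulative sum can never reach the threshold and the function returns None, which corresponds to the "-" entries.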

Quantifying resemblance
Resemblance: The resemblance between two sequences x and y, where 0.0 expresses identity, is defined as the phonological correspondence between the two sequences. It can be corrected by a measure of semantic distance, which by definition sets the resemblance to infinity if that distance is above a given threshold t.
The basic source of phonological correspondence is an alignment score, according to either global or local scorers, modelling either historical or perceptual similarity.
In this experiment, phonological correspondence is computed from sequence alignment using the SCA scorer (List 2012), and defined as
$$1 - \frac{\mathrm{ALM}_{xy}}{\max(\mathrm{ALM}_{xx}, \mathrm{ALM}_{yy})}$$
where ALM_{xy} is the alignment score of x against y, and ALM_{xx} and ALM_{yy} are the scores of each sequence aligned with itself.
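The definition above can be sketched in a few lines. This is a hedged illustration, not the scorer used in the experiment: `toy_alignment_score` is a plain Needleman-Wunsch global alignment with made-up match, mismatch, and gap values, standing in for the SCA scorer, and the names `resemblance`, `semantic_distance`, and `threshold` are hypothetical.

```python
import math


def toy_alignment_score(x, y, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global (Needleman-Wunsch) alignment score; a toy stand-in for SCA."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def resemblance(x, y, score=toy_alignment_score,
                semantic_distance=None, threshold=None):
    """1 - ALM_xy / max(ALM_xx, ALM_yy), or infinity beyond the threshold."""
    if (semantic_distance is not None and threshold is not None
            and semantic_distance > threshold):
        return math.inf
    return 1.0 - score(x, y) / max(score(x, x), score(y, y))


if __name__ == "__main__":
    print(resemblance("papa", "papa"))
    print(resemblance("papa", "baba"))
    print(resemblance("papa", "baba", semantic_distance=3, threshold=2))
```

Because the toy scorer compares raw symbols, /papa/ and /baba/ come out as maximally different here, whereas a sound-class scorer treats them as identical. With negative alignment scores the value can also exceed 1.0, and empty sequences are not handled.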

Phonological correspondence scores
Seq x   Seq y            Score
/papa/  /papa/           0.00
/papa/  /baba/           0.00
/papa/  /bubu/           0.06
/papa/  /bubuj/          0.33
/papa/  /epapa/          0.27
/papa/  /wɨtʃəhẽnːo/     0.89

Seq x   Seq y            Score
/mano/  /mano/ (IT)      0.00
/mano/  /mano/ (ES)      0.00
/mano/  /mɐ̃w̃/ (PT)       0.55
/mano/  /mɛ̃/ (FR)        0.57
/mano/  /hænd/ (EN)      0.65
/mano/  /ʃou3/ (ZH)      0.88
Table: Scores for different crafted examples (first table) and words for the concept “HAND” in different languages (second table). A score of 0.00 indicates identity, 1.00 is the maximum theoretical distance.

Simulation
Results of simulation with randomly generated data from attested distributions of patterns, lengths, and inventories, for different vocabulary sizes and cluster sizes. The tables report the scores at percentiles of 5%, 25%, and 50%, as well as the three lowest scoring pairs in each simulation.
Size  Cluster  5%    25%   50%   Best Pairs
5     1        0.61  0.75  0.76  kodu / zʊvitu, olu / rete, semozle / uson
40    1        0.57  0.65  0.76  bɪntu / pʊpoti, hondog / muvteka, nufse / hufo
40    5        0.36  0.50  0.60  defe / tɪvi, khabip / uebub, begu / pemi
40    10       0.37  0.47  0.54  fuveo / vefni, phɛbi / pemi, deksab / togapak

Size  Cluster  5%    25%   50%   Best Pairs
100   10       0.20  0.42  0.55  phɛbi / pɔbi, zasə / zize, olu / aìa
200   10       0.20  0.43  0.53  veja / vea, saoga / zuku, piːba / bæpo
1000  10       0.30  0.43  0.54  daː / taʔa, ao / auj, koː / goe
1000  50       0.06  0.31  0.40  udu / udu, eː / eu, deː / teuj
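A simulation of this kind can be sketched roughly as below. Everything concrete here is an assumption made for illustration: the inventories and syllable patterns are invented rather than drawn from attested distributions, the distance is a stand-in based on Python's `difflib` ratio instead of the SCA-based score, and only same-concept pairs are compared (the cluster size 1 case).

```python
import math
import random
from difflib import SequenceMatcher

# Hypothetical inventories and patterns, for illustration only.
CONSONANTS = "ptkbdgmnszfvlr"
VOWELS = "aeiou"
PATTERNS = ["CV", "CVC", "CVCV", "VCV", "CVCCV"]


def random_word(rng):
    """Draw a pattern, then fill each slot uniformly from its inventory."""
    pattern = rng.choice(PATTERNS)
    return "".join(
        rng.choice(CONSONANTS if slot == "C" else VOWELS) for slot in pattern
    )


def distance(x, y):
    """Stand-in distance in [0, 1]; 0.0 means identical strings."""
    return 1.0 - SequenceMatcher(None, x, y).ratio()


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate(size, seed=0):
    """Compare two random vocabularies concept by concept."""
    rng = random.Random(seed)
    lang_a = [random_word(rng) for _ in range(size)]
    lang_b = [random_word(rng) for _ in range(size)]
    scores = [distance(a, b) for a, b in zip(lang_a, lang_b)]
    return {q: percentile(scores, q) for q in (5, 25, 50)}


if __name__ == "__main__":
    print(simulate(1000))
```

Fixing the seed keeps runs repeatable, so different vocabulary sizes can be compared on the same footing.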

Empirical Experiment
Analysis on real data:
1,479 languages, 193 different families (Glottolog), 416 different genera (own extension to WALS)
587,198 forms, 2,488 comparative concepts linked to Concepticon (min 25, max 1,850)
948 pairwise comparisons between randomly sampled language pairs: 460 involving different families, 488 within the same family (of which 446 in the same genus)

Shared cognates (LexStat)

Empirical results - I
Resemblances between Erzya erzy1239 (Uralic, Russia) and Ilit kuna1268 (Kunama, Eritrea):
FIRE /tol/ and /tom/
HAND /kedj/ and /kon/
THOU /ton/ and /en/
TONGUE /kelj/ and /ŋel/
TOOTH /pej/ and /mʌj/
WE /mjiɲ/ and /ɐme/

Empirical results - II
Resemblances between English stan1293 and Mandarin mand1415:
BACK /bæk/ and /peɪ/
BLUNT /dʌl/ and /tun/
COUGH /kɔf/ and /khɤ/
FOAM /fəʊm/ and /phɑʊmwɔ/
GIVE /gɪv/ and /keɪ/
GO /gəʊ/ and /tsəʊ/
LEAD (GUIDE) /liːd/ and /liŋtɑʊ/
LONG /lɔŋ/ and /ʈʂhɑŋ/
PERSON /mæn/ and /ɻən/
SAY /seɪ/ and /ʂuo/
SEA /siː/ and /xaɪ/
SING /sɪŋ/ and /ʈʂhɑŋkɤ/
THEY /ðɛɪ/ and /tha/
WHO /huː/ and /ʂui/

Empirical results - III
Resemblances between Dorze dorz1235 (Ta-Ne-Omotic, Ethiopia) and Seze seze1235 (Blue Nile Mao, Ethiopia):
EYE /ɐfe/ and /ɐwi/
FIRE /tɐmɐ/ and /tɐma/
HAND /kuʃe/ and /kusa/
HORN (ANATOMY) /kɐtʃ’e/ and /kɐli/
NOSE /sidi/ and /sinta/
ONE /isino/ and /iʃila/
SUN /ɐwɐ/ and /ɐwa/
TOOTH /ɐtsə/ and /ɐtsa/
TWO /nɐmĳɐ/ and /nomba/
WATER /hɐts’e/ and /hɐŋsi/

Semantic leeway - I

Semantic leeway - II
Two main alternatives:
1. Allow a form to move freely within its semantic cluster (no matter how the cluster is defined)
2. Link all concepts in a weighted graph, accepting all colexifications (even from single-case homonymies), smoothing, and computing the distance with a shortest-path algorithm
Only the first alternative is reported here
Experiments with random clusters, testing the hypothesis that the larger the semantic leeway, the easier it is to find a resemblance
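The first alternative can be sketched as taking, for each concept, the best (lowest) score against any form in the same cluster of the other language. This is a minimal sketch under assumptions: clusters are contiguous blocks of concept indices rather than random or CLICS2-derived clusters, and the function name `best_within_cluster` and the score matrix are hypothetical.

```python
def best_within_cluster(dist, cluster_size):
    """Lowest score per concept when forms may match anywhere in the cluster.

    dist[i][j] is the score between the form for concept i in language A
    and the form for concept j in language B (0.0 means identical).
    Concepts are grouped into contiguous blocks of `cluster_size` indices.
    """
    n = len(dist)
    best = []
    for i in range(n):
        start = (i // cluster_size) * cluster_size
        block = range(start, min(start + cluster_size, n))
        best.append(min(dist[i][j] for j in block))
    return best


if __name__ == "__main__":
    # Made-up scores for four concepts in two languages.
    scores = [
        [0.9, 0.2, 0.8, 0.7],
        [0.5, 0.6, 0.1, 0.4],
        [0.3, 0.9, 0.7, 0.2],
        [0.8, 0.4, 0.6, 0.5],
    ]
    for size in (1, 2, 4):
        print(size, best_within_cluster(scores, size))
```

Since each block always contains the concept itself, a larger block can only lower or keep the best score, which is the effect the hypothesis predicts for chance resemblances.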

Mean resemblances - No clusters

Mean resemblances - CLICS2 clusters

Mean resemblances - 10-concept clusters

Mean resemblances - 20-concept clusters

Mean resemblances - Single cluster

Resemblance score percentiles
As expected, the bigger the semantic leeway, the easier it is to find chance resemblances:
      1                  10                 20                 Single-cluster
5%    0.09  0.12  0.54   0.09  0.12  0.51   0.09  0.11  0.49   0.02  0.03  0.17
25%   0.31  0.34  0.73   0.31  0.33  0.70   0.30  0.33  0.68   0.15  0.16  0.37
50%   0.52  0.54  0.82   0.51  0.53  0.79   0.50  0.52  0.77   0.30  0.31  0.48
75%   0.69  0.71  0.89   0.68  0.69  0.87   0.67  0.68  0.85   0.42  0.42  0.57
95%   0.87  0.87  0.96   0.85  0.86  0.95   0.84  0.85  0.93   0.56  0.56  0.68

Conclusion
The larger analysis offered some interesting side-products:
A matrix of phoneme correspondences in cognates (refined scoring matrix, local scoring matrices, asymmetric matrices)
The weighted network for semantic exploration (to be released)
We confirm that historical linguistics must be based on regular correspondences, not resemblance
Which is more exciting than it sounds, because we can start measuring the probability of bad proposals (like the “Semitic Quechua”)
Exploration is also helping to generate my matrix of perceptual phoneme similarity (Tresoldi 2016, extending Mielke 2008), as well as work on feature alignment (List 2019) and sub-optimal alignment (Tresoldi forth.)

Perceptual similarity

Synthesis
Chance resemblances are not only common, but expected
If our methods are not finding them, they might be too strict
If done properly, widening the semantic space can actually improve finding cognates
We still need to work on methods for that
Semantic change does not follow ontology
A future step could be investigating the role of phonosymbolism

References I
Baxter, W. H., and Manaster Ramer, A. Review: On calculating the factor of chance in language comparison. By Donald A. Ringe, Jr. Philadelphia: The American Philosophical Society, 1992. Pp. 110. Diachronica 8, 2 (1996), 371–384.
Baxter, W. H., and Manaster Ramer, A. Beyond lumping and splitting: Probabilistic issues in historical linguistics. McDonald Institute for Archaeological Research, Cambridge, 2000, pp. 167–188.
Blevins, J. Evolutionary phonology: The emergence of sound patterns. Cambridge University Press, 2004.

References II
Damerau, F. Mechanization of cognate recognition in comparative linguistics. Linguistics 13, 148 (1975), 5–30.
Guy, J. B. An algorithm for identifying cognates in bilingual word-lists and its applicability to machine translation. Journal of Quantitative Linguistics 1, 1 (1994), 35–42.
Kessler, B. Word similarity metrics and multilateral comparison. In Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology (2007), Association for Computational Linguistics, pp. 6–14.

References III
Kilani, M. Calculating false cognates. An extension of the Baxter Manaster-Ramer solution and its application to the case of Pre-Greek. Diachronica 32, 3 (2015), 331–364.
Kondrak, G. Identification of cognates and recurrent sound correspondences in word lists. TAL 50, 2 (2009), 201–235.
List, J.-M. Computer-Assisted Language Comparison: Reconciling computational and classical approaches in historical linguistics. Max Planck Institute for the Science of Human History, Jena, 2016.

References IV
List, J.-M., Greenhill, S. J., and Gray, R. D. The potential of automatic word comparison for historical linguistics. PLoS ONE 12, 1 (2017), e0170046.
Mortarino, C. A statistical test useful in historical linguistics. In Proceedings of the XLII Scientific Meeting of the Italian Statistical Society. 2004, pp. 107–110.
Mortarino, C. An improved statistical test for historical linguistics. Statistical Methods and Applications 18, 2 (2009), 193–204.
Nichols, J. The comparative method as heuristic. Oxford University Press, New York, 1996, pp. 39–71.

References V
Nichols, J. Quasi-cognates and lexical type shifts: Rigorous distance measures for long-range comparison. In Phylogenetic methods and the prehistory of languages (Cambridge UK, Oxford UK, Oakville CT USA, 2006), P. Forster and C. Renfrew, Eds., McDonald Institute monographs, McDonald Institute for Archaeological Research; distributed by Oxbow Books, pp. 57–66.
Rama, T., Borin, L., Mikros, G., and Macutek, J. Comparative evaluation of string similarity measures for automatic language classification. 2015.

References VI
Ringe, D. A. On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82, 1 (1992), 1–110.
Tresoldi, T., Anderson, C., and List, J.-M. Modelling sound change with the help of multi-tiered sequence representations. In Poznań Linguistic Meeting 2018 (Poznań, 2018), ?, p. ?
