Automatic Cognate Detection: Evaluation, Benchmarks, and Probabilities

Talk by Luke Maurits, Simon Greenhill, and Johann-Mattis List, presented at the workshop "Towards a Global Language Phylogeny", 21-23 October, Max Planck Institute for the Science of Human History, Jena.

Outline
1. Evaluation
   a. Overview of LingPy’s cognate detection methods
   b. Introducing a new method based on Infomap community clustering
   c. Testing the method on a newly compiled test set
2. Benchmarks
   a. Pointing to the importance of having good benchmark databases
   b. Showing an example of what a cognate detection benchmark database could look like
   c. Introducing the idea of having a series of benchmark databases in GlottoBank
3. Probabilities
   a. Towards a Bayesian approach
   b. The Chinese Restaurant Process
   c. The MAP approach and beyond

LingPy’s Cognate Detection Methods
● LingPy currently offers four cognate detection methods:
○ Turchin et al. (2010): a simple sound class method in which the first two sounds decide about cognacy
○ Edit Distance: normalized edit distances are calculated for all word pairs and the results are partitioned into cognate sets (using flat UPGMA clustering)
○ SCA: SCA distances (List 2012) are calculated for all word pairs and the results are partitioned into cognate sets (using flat UPGMA)
○ LexStat: language-specific scoring schemes (close to sound correspondences, effectively log-odds) are calculated and the results are partitioned into cognate sets (using flat UPGMA)
● The methods show increasing accuracy, but also increasing computational cost (see the usage sketch below)
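A minimal usage sketch, assuming LingPy 2.x, where the four methods go by the names 'turchin', 'edit-dist', 'sca', and 'lexstat'; the filename and thresholds are illustrative placeholders, not recommendations, so check the current LingPy documentation:

```python
from lingpy import LexStat

# 'wordlist.tsv' is a placeholder for a wordlist in LingPy's input format
lex = LexStat('wordlist.tsv')

# sound-class method of Turchin et al. (2010) -- no threshold required
lex.cluster(method='turchin', ref='turchinid')

# normalized edit distance, partitioned with flat UPGMA
lex.cluster(method='edit-dist', threshold=0.75, ref='editid')

# SCA distances (List 2012), partitioned with flat UPGMA
lex.cluster(method='sca', threshold=0.45, ref='scaid')

# LexStat: infer the language-specific scorer first, then cluster
lex.get_scorer(runs=1000)
lex.cluster(method='lexstat', threshold=0.6, ref='lexstatid')
```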

A New Method for Cognate Detection
● Instead of using flat UPGMA for partitioning, we use Infomap (Rosvall and Bergstrom 2008), while still taking the pairwise distances from LexStat
● Infomap is a community detection algorithm designed for complex network analysis
● By switching from agglomerative clustering to network partitioning, we follow the “lead” of evolutionary biology, where network partitioning has been used during the last decade to determine homologous gene families
● Different algorithms are used in biology, and many have been tested so far, but Infomap seems to perform best (as shown in the evaluation; see the sketch below)
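A minimal sketch of the partitioning step, assuming python-igraph's community_infomap and a toy pairwise distance matrix as input; in the real pipeline the distances come from LexStat inside LingPy:

```python
import igraph

def infomap_partition(distances, threshold=0.6):
    """Link word pairs whose distance falls below the threshold,
    then let Infomap's community detection yield the cognate classes."""
    n = len(distances)
    edges, weights = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] < threshold:
                edges.append((i, j))
                weights.append(1.0 - distances[i][j])  # similarity as edge weight
    g = igraph.Graph(n=n, edges=edges)
    communities = g.community_infomap(edge_weights=weights)
    return communities.membership  # one cluster id per word
```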

Problem of Threshold Selection
● Apart from the method by Turchin et al. (2010), the three remaining methods and the new method require that we know where to set the threshold
● In order to test how well the methods perform, we need a way to determine the threshold in some “objective” way
● Luke will talk about this in greater detail
● For our tests on the new dataset and the new methods, we decided to use a training set (List 2014) to determine the “best” threshold for each of the methods (see the sketch below)
● While testing the three LingPy methods which require a threshold, we also tested the new method (LS-Infomap) and an alternative method (LS-MCL, based on Markov Clustering, van Dongen 2000)
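A sketch of the threshold-sweep idea, assuming LingPy's B-cubed evaluation (lingpy.evaluate.acd.bcubes) and a placeholder training file with gold cognate IDs in a 'cogid' column; the exact signatures should be checked against the current LingPy documentation:

```python
from lingpy import LexStat
from lingpy.evaluate.acd import bcubes

# 'training.tsv' is a placeholder; it must contain gold cognate IDs ('cogid')
lex = LexStat('training.tsv')
lex.get_scorer(runs=1000)

best_t, best_f = None, 0.0
for i in range(1, 20):
    t = i / 20.0                     # candidate thresholds 0.05 ... 0.95
    ref = 'autoid_{0}'.format(i)     # store each run under its own column
    lex.cluster(method='lexstat', threshold=t, ref=ref)
    p, r, f = bcubes(lex, 'cogid', ref, pprint=False)
    if f > best_f:
        best_t, best_f = t, f

print('best threshold:', best_t, 'with B-cubed F-score:', best_f)
```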

Intermediate Conclusion
● The results already seem good enough for computer-assisted language comparison (pre-process the data with automatic cognate detection and have the results refined by experts later)
● The results show that we might be able to push the barrier further: switching to a better partitioning algorithm improved the results by one point. What happens if we also manage to refine the calculation of the similarity scores?

Towards a Cognate Detection Benchmark Database
● Benchmark databases are essential to train, test, and evaluate our algorithms
● So far, there are not many benchmark databases that could be used for our tasks
● With the new data we prepared for this study, and the data that was compiled to test LingPy (List 2014), there are already 18 different datasets which
○ are in more or less clean phonetic encoding
○ are segmented
○ are tagged for cognacy
● We think these should be published as a benchmark database for cognate detection (BDCD, or “LexiBench”?)
● In this way, we maintain comparability with other algorithms that might be proposed, and we ease the creation of alternative approaches

Benchmark Database for Cognate Detection: Demo
● This is a first demo version of how a benchmark database could look:
● http://digling.org/lexibench/
● It is nothing special, and we do not need anything really special, but we should get the data out there, so that people know they can use it...

Benchmark Databases in General
● There is a Benchmark Database for Phonetic Alignment (List and Prokić 2014), which will be extended in 2015 (scholars are starting to provide data)
● There could be more benchmark databases for other tasks that are relevant for us
● We may think of establishing a general “brand” for benchmark databases, summarized under, say, “GlottoBench” (the alignment benchmark could be directly included there)
● Alternatively, we can go for separate benchmarks
● But what we need is to spread the word that there is data that people can use for developing and testing their algorithms
● Otherwise, computational linguists will keep on using the Dyen database in an uppercase version to check how well their algorithms detect cognates...

Toward a Bayesian approach
● Deterministic methods like those just outlined have some drawbacks (and also advantages!)
● We only get black-and-white results from them
● Two words are either definitely cognate, or they are definitely not
● By adjusting thresholds we can trade off between Type I and Type II errors
● But we must apply a single threshold to the entire dataset
● The thresholds are essentially arbitrary, and choosing one for a new dataset is difficult

Virtues of probabilistic modelling
● Probabilistic models can overcome many of these shortcomings (usually at greater computational expense)
● We can get more nuanced results like “these two words are cognate with probability 0.72”
● We can also have the model attempt to “learn its own thresholds” from the data, removing the need to guess arbitrary thresholds for new datasets

Introducing the Chinese Restaurant Process
● The CRP is a whimsically named discrete-time stochastic process
● Imagine a Chinese restaurant with an infinite number of tables
● Imagine each table has an infinite number of seats
● Customers enter the restaurant and are seated one at a time
● Where each customer sits is randomly determined

Chinese Restaurant Process (contd.)
● Sometimes customers sit at an empty table (there is always one available!)
● Otherwise they choose an occupied table, with probability proportional to how many people are already seated there
● (In this version) there is a “stickiness” parameter 0 ≤ α ≤ ∞
● When α is low, people like to sit with each other
● When α is very high, everybody sits alone
● (See the simulation sketch below)
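A minimal simulation of this seating rule in plain Python (the function and argument names are ours):

```python
import random

def crp(n, alpha):
    """Seat n customers in a Chinese restaurant and return the table
    sizes, i.e. a random partition of n. alpha > 0 is "stickiness"."""
    tables = []                        # tables[k] = customers at table k
    for i in range(n):
        # customer i opens a new table with probability alpha / (i + alpha)
        if random.random() < alpha / (i + alpha):
            tables.append(1)
        else:
            # otherwise join an occupied table, chosen with probability
            # proportional to how many people already sit there
            k = random.choices(range(len(tables)), weights=tables)[0]
            tables[k] += 1
    return tables

# low alpha -> a few big tables; high alpha -> many singletons
print(crp(20, 0.5), crp(20, 50.0))
```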

Chinese Restaurant Process (contd.)
● What does this have to do with language?!
● The CRP sorts n customers into somewhere between 1 and n tables
● More generally, it partitions n things into between 1 and n sets
● It can thus partition n wordforms into between 1 and n cognate classes
● It is simply a probabilistic way of sorting things into groups, with the option to broadly control how likely things are to group together

A generative process for matrices
Here’s a recipe for making an n×n pairwise distance matrix like one of Mattis’ from scratch by rolling dice:
1. Sample a stickiness value α from some prior.
2. Use the CRP, with α, to sort your n words into cognate classes.
3. Sample phonetic distances between cognates from a “small” distribution.
4. Sample phonetic distances between non-cognates from a “large” distribution.
This is an example of a mixture model. (You can sample the parameters of your large and small distributions from some prior too.) A sketch of this recipe follows below.
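A sketch of the recipe in numpy; the Gaussian components and all parameter values are illustrative assumptions, not the values used in the actual analysis, and step 1's prior over α is replaced here by a fixed argument:

```python
import numpy as np

def sample_distance_matrix(n, alpha=1.0, seed=None):
    """Generate a random n-by-n phonetic distance matrix by following
    the four-step recipe above."""
    rng = np.random.default_rng(seed)
    # step 2: CRP assignment of n words to cognate classes
    # (joining a uniformly chosen earlier customer is equivalent to
    # choosing a table with probability proportional to its size)
    labels = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            labels.append(len(set(labels)))          # open a new class
        else:
            labels.append(labels[rng.integers(i)])   # join an existing one
    # steps 3-4: fill the matrix from the "small"/"large" mixture components
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                d = rng.normal(0.2, 0.05)            # "small": cognates
            else:
                d = rng.normal(0.8, 0.10)            # "large": non-cognates
            D[i, j] = D[j, i] = min(max(d, 0.0), 1.0)
    return labels, D
```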

Bayes flip!
● We just outlined a way to stochastically generate phonetic distance matrices.
● But we already have (or can get) phonetic distance matrices.
● Those are our data!
● The value of generative models is that we can use Bayes’ theorem to reverse them: P(H|D) ∝ P(D|H)P(H)
● Here H is our cognate classification (together with α and any distribution parameters)

The reward
● What we actually get back from a Bayesian analysis is a posterior distribution over cognate classifications.
● If we use MCMC to do the analysis, we get a set of, say, 100 or 1000 classifications of the words into cognate sets.
● This facilitates statements like “A and B are cognate with probability 0.82” (see the sketch below).
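A sketch of how such pairwise probabilities fall out of the samples; representing each sample as a mapping from wordform to class id is our assumption:

```python
from itertools import combinations

def pairwise_cognate_probs(samples):
    """Given MCMC samples (each a dict mapping wordform -> class id),
    estimate the posterior probability that each pair is cognate as the
    fraction of samples in which the two forms share a class."""
    forms = sorted(samples[0])
    return {(a, b): sum(s[a] == s[b] for s in samples) / len(samples)
            for a, b in combinations(forms, 2)}
```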

Does it work?
● We get back uncertain estimates of cognate classes.
● So we can’t immediately compute precision, recall, or F-scores against gold standards.
● For the sake of easy and direct comparison to gold standards or existing methods, we need a way to distil our uncertain estimates into clear “consensus classes” (cf. consensus trees).
● How best to do this is an interesting research problem in itself.

A simple first effort
● The MAP results feature good precision but mediocre recall.
● We need to introduce a little more “slack”.
● Here’s an off-the-cuff approach (sketched below):
● For each wordform, build a cognate set out of all those forms which are cognate with it with p ≥ 0.75, say.
● Look for wordforms assigned to multiple cognate classes (bad).
● Keep each such form in the class with the highest product of pairwise probabilities.
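One possible reading of this heuristic, reusing the output of pairwise_cognate_probs above; the greedy resolution of multiply-assigned forms is our interpretation, not necessarily the exact procedure used:

```python
from itertools import combinations

def consensus_classes(forms, p, threshold=0.75):
    """p maps ordered pairs (a, b), with a < b, to the posterior
    probability that a and b are cognate."""
    prob = lambda a, b: p[(a, b)] if a < b else p[(b, a)]
    # one candidate class per form: everything cognate with it at >= threshold
    candidates = [frozenset({a} | {b for b in forms
                                   if b != a and prob(a, b) >= threshold})
                  for a in forms]
    # score a class by the product of its pairwise cognate probabilities
    def score(cls):
        result = 1.0
        for a, b in combinations(sorted(cls), 2):
            result *= prob(a, b)
        return result
    # greedily keep the best-scoring classes; a form already placed in a
    # stronger class is dropped from the weaker ones
    placed, classes = set(), []
    for cls in sorted(set(candidates), key=score, reverse=True):
        remaining = cls - placed
        if remaining:
            classes.append(remaining)
            placed |= remaining
    return classes
```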

A few more details
● We test a particular form of the generative model.
● Our “big” and “small” distributions are Gaussians, with separate means and variances.
● The two distributions have different priors on their means (favouring large or small values, respectively), and the same prior on their variances (favouring lower values).
● The prior over α is based on some gold standard classifications.

Thresholds strike back
● Of course, the threshold can be varied from 0.75.
● As the threshold is lowered, we trade precision for recall.
● But wait, weren’t we trying to get rid of arbitrary thresholds?!
● This is not so bad: the threshold has a principled interpretation which holds across datasets (cf. the conventional use of 0.05 significance in hypothesis testing).

Future work
● One shortcoming of this method is the independence of the restaurants (one restaurant per meaning).
● All restaurants share the same α, so they are equally “clumpy” or “spread out”.
● But there is no requirement for clumpings to be consistent across restaurants, as we might expect them to be.
● We don’t just want people to always sit at 2 or 3 crowded tables; we want roughly the same crowds each time.
● Could a modification to the CRP help with this?

More future work
● An unexplored question: how can we infer trees using this system?
● We can compute consensus cognate classes as an intermediate step and then use covarion or Dollo models in the traditional way.
● But this throws away the uncertainty information we have worked so hard to get!
● Can we do away with the intermediate step entirely and develop a tree-based generative model for distance matrices?