Automatic Cognate
Detection
Evaluation, Benchmarks, and Probabilities
Luke Maurits, Johann-Mattis List, and Simon Greenhill

Outline
1. Evaluation
a. Overview on LingPy’s cognate detection methods
b. Introducing a new method based on Infomap community clustering
c. Testing the method on a newly compiled testset
2. Benchmarks
a. Pointing to the importance of having good benchmark databases
b. Show an example of how a Cognate Detection Benchmark Database could look like
c. Introducing the idea of having a series of benchmark databases in GlottoBank
3. Probabilities
a. Towards a Bayesian approach
b. The Chinese Restaurant Process
c. MAP approach and beyond

Evaluation

LingPy’s Cognate Detection Methods
● LingPy currently offers four cognate detection methods
○ Turchin et al. (2010): simple sound class method, first two sounds decide about
cognacy
○ Edit Distance: Normalized edit distance is calculated for all word pairs and the
results are partitioned into cognate sets (using flat UPGMA clustering)
○ SCA: SCA distances (List 2012) are calculated for all word pairs and the results
are partitioned into cognate sets (using flat UPGMA)
○ LexStat: Language specific scoring schemes (close to sound correspondences,
effectively log-odds) are calculated and partitioned into cognate sets (UPGMA)
● The methods show an increasing accuracy, but also an increasing
computational cost

A New Method for Cognate Detection
● Instead of using flat UPGMA for partitioning, we use Infomap (Rosvall and
Bergstrom 2008) for partitioning but take pairwise distances from LexStat.
● Infomap is a community detection algorithm designed for complex network
analysis
● By switching to network partitioning from agglomerative clustering, we follow
the “lead” from evolutionary biology where network partitioning has been used
during the last decade to determine homologous gene families
● Different algorithms are used in biology, and many have been tested so far,
but Infomap seems to perform best (as shown in the evaluation)

Network Partitioning instead of Flat UPGMA

Network Partitioning instead of Flat UPGMA

Evaluating the Performance on a New Testset

Problem of Threshold Selection
● Apart from the Method by Turchin et al. (2010), all three remaining and the
new method require that we know where to set the threshold.
● In order to test how well the methods perform, we need to find a way to
determine some threshold in some “objective” way
● Luke will talk about this in greater detail
● For our testing of the new dataset and the new methods, we decided to use a
training set (List 2014) to determine the “best” threshold for each of the
methods.
● While testing the three LingPy methods which require a threshold, we also
tested the new method (LS-Infomap) and an alternative method (LS-MCL,
based on Markov Clustering, Dongen 2000).

Training Set Used for Threshold Selection

Results for the Threshold Selection Procedure
Accuracy
Threshold

Evaluation of the Methods: General Results

Evaluation of the Methods: General Results

Evaluation of the Methods: Specific Results

Intermediate Conclusion
● The results seem to be already good enough to be used for computer-
assisted language comparison (do a pre-processing with cognate
detection analyses and later having them refined by experts)
● The results show that we might be able to push the barrier: switching
to better partitioning algorithms increased the results by 1 point. What
happens if we also manage to refine the calculation of the similarity
scores?

Benchmarks

Towards a Cognate Detection Benchmark Database
● Benchmark databases are essential to train, test, and evaluate our algorithms
● So far, there are not many benchmark databases which could be used for our
tasks
● With the new data we prepared for this study, and the data that was compiled
to test LingPy (List 2014), there are already 18 different datasets which are
○ in more or less clean phonetic encoding
○ are segmentized
○ are tagged for cognacy
● We think these should be published as a benchmark database for cognate
detection (BDCD, or “LexiBench”?)
● In this way, we maintain comparability with other algorithms that might be
proposed, and we ease the creation of alternative approaches

Benchmark Database for Cognate Detection: Demo
● This is a first demo version, on how a benchmark database could look:
● http://digling.org/lexibench/
● It is nothing special, and we do not need anything really special, but we
should get the data out there, so that people know they can use it...

Benchmark Databases in General
● There is a Benchmark Database for Phonetic Alignment (List and Prokić
2014), which will be extended in 2015 (scholars start providing data).
● There could be more benchmark databases for other tasks that are relevant
for us.
● We may think of establishing a general “brand” for benchmark databases,
summarized under, say “GlottoBench” (and the alignment benchmark can be
directly included there).
● Alternatively, we can go for separate benchmarks.
● But what we need is to spread the word that there is data that people can use
for developing and testing their algorithms.
● Otherwise, computational linguists will keep on using the Dyen database in an
uppercase version to check how well their algorithms detect cognates...

Probabilities

Toward a Bayesian approach
● Deterministic methods like those just outlined have some drawbacks (and
also advantages!)
● We only get black-and-white results from them.
● Two words are either definitely cognate, or they are definitely not.
● By adjusting thresholds we can trade off between Type I and Type II errors.
● But we must apply a single threshold to the entire dataset.
● The thresholds are essentially arbitrary, and choosing one for a new dataset
is difficult.

Virtues of probabilistic modelling
● Probabilistic models can overcome many of these shortcomings (usually at
the greater computational expense).
● We can get more nuanced results like “these two words are cognate with
probability 0.72”
● We can also have the model attempt to “learn its own thresholds” from the
data, removing the need to guess arbitrary thresholds for new datasets.

Introducing the Chinese Restaurant Process
● The CRP is whimsically-named discrete-time stochastic process.
● Imagine a Chinese restaurant with an infinite number of tables.
● Imagine each table has an infinite number of seats.
● Customers enter the restaurant and are seated one at a time.
● Where each customer sits is randomly determined.

Chinese Restaurant Process (contd.)
● Sometimes customers sit at an empty table (there is always one available!)
● Otherwise they choose an occupied table, and they choose which one with
some probability proportional to how many people are already seated there.
● (in this version) there is a “stickiness” parameter 0 ≤ α ≤ ∞.
● When α is low, people like to sit with each other.
● When α is very high, everybody sits alone.

Chinese Restaurant Process example, n=5, α=0.5

Chinese Restaurant Process example, n=5, α=1.5

Chinese Restaurant Process example, n=5, α=2.5

Chinese Restaurant Process example, n=5, α=4.5

Chinese Restaurant Process example, n=5, α=5.5

Chinese Restaurant Process (contd.)
● What does this have to do with language?!
● The CRP sorts n customers into somewhere between 1 and n tables.
● More generally, it partitions n things into 1-n sets.
● It can partition n wordforms into 1-n cognate classes.
● It is simply a probabilistic way of sorting things into groups, with the option to
control how likely things are to group together, in a broad way.

A generative process for matrices
Here’s a recipe for making an n×n pairwise distance matrix like one of Mattis’ from
scratch by rolling dice:
1. Sample a stickiness value α from some prior.
2. Use the CRP, with α, to sort your n words into cognate classes.
3. Sample phonetic distances between cognates from a “small” distribution.
4. Sample phonetic distances between non-cognates from a “large” distribution.
This is an example of a mixture model.
(you can sample parameters for your large and small distributions from some prior
too)

Bayes flip!
● We just outlined a way to stochastically generate phonetic distance matrices.
● We we already have (or can get) phonetic distance matrices.
● Those are our data!
● The value of generative models is that we can use Bayes’ theorem to reverse
them.
P(H|D) ∝ P(D|H)
● Here H is our cognate classification (and α, and any distribution parameters)

The reward
● What we actually get back from a Bayesian analysis is a posterior distribution
over cognate classes.
● If we use MCMC to do the analysis, we get a set of say 100 or 1000
classifications of the words into cognate sets.
● This facilitates statements like “A and B are cognate with probability 0.82”.

Does it work?
● We get back uncertain estimates of cognate classes.
● So we can’t immediately compute precision, recall or F scores against gold
standards.
● For the sake of easy and direct comparison to gold standards or existing
methods, we need a way to distil our uncertain estimates into clear
“consensus classes” (cf. consensus trees).
● How best to do this is an interesting research problem in itself

A simple first effort
● The MAP results feature good precision but mediocre recall.
● We need to introduce a little more “slack”.
● Here’s an off-the-cuff approach:
● For each wordform, build a cognate set out of all those forms which are
cognate with p ≥ 0.75, say.
● Look for wordforms assigned to multiple cognate classes (bad).
● Keep the form in the class with the highest product of pairwise probabilities

A few more details
● We test a particular form of the generative model
● Our “big” and “small” distributions are Gaussians, with separate means and
variances.
● The two distributions have different priors on their means (favouring large or
small values), and the same priors on their variances (favouring lower values)
● The prior over α is based on some gold standard classifications.

Gamma priors for Gaussian means

Exponential prior for Gaussian variances

Gamma prior for , with empirical params.

p=0.75 results

Thresholds strike back
● Of course, the threshold can be varied from 0.75.
● As threshold is lowered, we trade precision for recall.
● But wait, weren’t we trying to get rid of arbitrary thresholds?!
● This is not so bad: the threshold has a principled interpretation which holds
across data sets (cf the arbitrary use of 0.05 significance in hypothesis
testing).

Future work
● One shortcoming of this method is the independence of the restaurants
● All restaurants share the same α, so are equally “clumpy” or “spread out”
● But there is no requirement for clumpings to be consistent across restaurants,
as we may expect.
● We don’t just want people to always sit at 2 or 3 crowded tables, we want
roughly the same crowds each time
● A modification to the CRP may help with this?

More future work
● An unexplored question: how can we infer trees using this system?
● We can compute consensus cognate classes as an intermediate step and
then use Covarion or Dollo models in the traditional way.
● But this throws away the uncertainty information we have worked so hard to
get!
● Can we do away with the intermediate step entirely and develop a tree-based
generative model for distance matrices?

Thank you!
● Any questions?