Automatic Cognate Detection: Evaluation, Benchmarks, and Probabilities

Slide 1

Slide 1 text

Automatic Cognate Detection: Evaluation, Benchmarks, and Probabilities
Luke Maurits, Johann-Mattis List, and Simon Greenhill

Slide 2

Slide 2 text

Outline
1. Evaluation
   a. Overview of LingPy’s cognate detection methods
   b. Introducing a new method based on Infomap community clustering
   c. Testing the method on a newly compiled test set
2. Benchmarks
   a. Pointing to the importance of having good benchmark databases
   b. Showing an example of what a Cognate Detection Benchmark Database could look like
   c. Introducing the idea of having a series of benchmark databases in GlottoBank
3. Probabilities
   a. Towards a Bayesian approach
   b. The Chinese Restaurant Process
   c. MAP approach and beyond

Slide 3

Slide 3 text

Evaluation

Slide 4

Slide 4 text

LingPy’s Cognate Detection Methods
● LingPy currently offers four cognate detection methods:
  ○ Turchin et al. (2010): a simple sound-class method in which the first two sounds decide cognacy
  ○ Edit Distance: normalized edit distance is calculated for all word pairs, and the results are partitioned into cognate sets (using flat UPGMA clustering)
  ○ SCA: SCA distances (List 2012) are calculated for all word pairs, and the results are partitioned into cognate sets (using flat UPGMA)
  ○ LexStat: language-specific scoring schemes (close to sound correspondences, effectively log-odds) are calculated, and the resulting distances are partitioned into cognate sets (using flat UPGMA)
● In this order, the methods show increasing accuracy, but also increasing computational cost
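
The edit-distance method can be illustrated with a minimal sketch. This is not LingPy's implementation: the function names, the orthographic example words, and the threshold of 0.6 are assumptions made for illustration, and real analyses operate on segmented phonetic transcriptions.

```python
from itertools import combinations


def normalized_edit_distance(a, b):
    """Levenshtein distance divided by the length of the longer sequence."""
    if not a and not b:
        return 0.0
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1] / max(len(a), len(b))


def flat_upgma(items, distance, threshold):
    """Merge the two closest clusters until their average distance reaches the threshold."""
    clusters = [[i] for i in range(len(items))]

    def average_distance(c1, c2):
        total = sum(distance(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (p, q), best = min(
            (((p, q), average_distance(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best >= threshold:
            break
        clusters[p].extend(clusters[q])
        del clusters[q]  # safe because p < q
    return [sorted(items[i] for i in c) for c in clusters]


# Hypothetical toy input: orthographic forms, not phonetic transcriptions.
words = ["hand", "hant", "hond", "mano", "main"]
cognate_sets = flat_upgma(words, normalized_edit_distance, threshold=0.6)
print(cognate_sets)
```

The threshold decides where the merging stops, which is why its choice matters so much for every method except Turchin et al. (2010).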

Slide 5

Slide 5 text

A New Method for Cognate Detection
● Instead of using flat UPGMA for partitioning, we use Infomap (Rosvall and Bergstrom 2008) for partitioning but take pairwise distances from LexStat.
● Infomap is a community detection algorithm designed for complex network analysis.
● By switching from agglomerative clustering to network partitioning, we follow the “lead” from evolutionary biology, where network partitioning has been used during the last decade to determine homologous gene families.
● Different algorithms are used in biology, and many have been tested so far, but Infomap seems to perform best (as shown in the evaluation).
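
The general idea of network partitioning can be sketched as follows. Infomap itself needs an external library, so this sketch swaps in a basic Markov Clustering loop (the algorithm behind the LS-MCL variant mentioned in these slides) purely to show the workflow: turn pairwise distances into a weighted network, then partition the network. The function names, the threshold, the edge weights (1 minus distance), and the toy distances are all assumptions.

```python
def distances_to_network(distances, threshold):
    """Keep an edge (weight 1 - distance) only where the distance is below the threshold."""
    n = len(distances)
    return [[1.0 - distances[i][j] if i != j and distances[i][j] < threshold else 0.0
             for j in range(n)] for i in range(n)]


def markov_clustering(similarity, inflation=2.0, iterations=50):
    """Basic Markov Clustering: alternate expansion and inflation, then read off clusters."""
    n = len(similarity)
    # Add self-loops, as is customary for MCL.
    matrix = [[similarity[i][j] if i != j else 1.0 for j in range(n)] for i in range(n)]

    def normalize_columns(m):
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
        return m

    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise every entry to a power, then renormalize the columns.
        matrix = normalize_columns([[value ** inflation for value in row] for row in expanded])

    # Every row that retains mass defines a cluster made of its non-zero columns.
    clusters = {frozenset(j for j in range(n) if row[j] > 1e-6) for row in matrix}
    return sorted(sorted(c) for c in clusters if c)


# Hypothetical distances for five words: 0-2 are close, 3-4 are close.
toy = [[0.0, 0.2, 0.2, 0.9, 0.9],
       [0.2, 0.0, 0.2, 0.9, 0.9],
       [0.2, 0.2, 0.0, 0.9, 0.9],
       [0.9, 0.9, 0.9, 0.0, 0.3],
       [0.9, 0.9, 0.9, 0.3, 0.0]]
partition = markov_clustering(distances_to_network(toy, threshold=0.6))
print(partition)
```

In this toy case the two groups are not connected at all, so any partitioning algorithm would find them. The algorithms differ on real networks, where weak edges link otherwise separate groups.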

Slide 6

Slide 6 text

Network Partitioning instead of Flat UPGMA

Slide 7

Slide 7 text

Network Partitioning instead of Flat UPGMA

Slide 8

Slide 8 text

Evaluating the Performance on a New Test Set

Slide 9

Slide 9 text

Problem of Threshold Selection
● Apart from the method by Turchin et al. (2010), the three remaining LingPy methods and the new method all require a threshold to be set.
● To test how well the methods perform, we need some “objective” way of determining the threshold.
● Luke will talk about this in greater detail.
● For testing on the new dataset, we used a training set (List 2014) to determine the “best” threshold for each method.
● Alongside the three LingPy methods that require a threshold, we also tested the new method (LS-Infomap) and an alternative method (LS-MCL, based on Markov Clustering, van Dongen 2000).
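
Threshold selection on a training set can be sketched as a simple grid search. This is a simplified illustration, not the procedure used in the study: it scores word pairs directly with a pairwise F-score instead of scoring full partitions, and the training pairs and candidate thresholds are invented.

```python
def pair_f_score(pairs, threshold):
    """F-score when every pair with a distance below the threshold is called cognate."""
    true_pos = sum(1 for d, cognate in pairs if d < threshold and cognate)
    false_pos = sum(1 for d, cognate in pairs if d < threshold and not cognate)
    false_neg = sum(1 for d, cognate in pairs if d >= threshold and cognate)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def select_threshold(pairs, candidates):
    """Return the candidate threshold with the highest F-score on the training pairs."""
    return max(candidates, key=lambda t: pair_f_score(pairs, t))


# Hypothetical training data: (distance, is_cognate according to the gold standard).
training_pairs = [(0.10, True), (0.25, True), (0.40, True), (0.65, True),
                  (0.35, False), (0.55, False), (0.70, False), (0.90, False)]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best = select_threshold(training_pairs, candidates)
print(best, pair_f_score(training_pairs, best))
```

The selected value is then frozen and applied unchanged to the test data, so that the test results are not tuned to the test set.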

Slide 10

Slide 10 text

Training Set Used for Threshold Selection

Slide 11

Slide 11 text

Results for the Threshold Selection Procedure
[Figure: accuracy plotted against threshold]

Slide 12

Slide 12 text

Evaluation of the Methods: General Results

Slide 13

Slide 13 text

Evaluation of the Methods: General Results

Slide 14

Slide 14 text

Evaluation of the Methods: Specific Results

Slide 15

Slide 15 text

Intermediate Conclusion
● The results already seem good enough for computer-assisted language comparison (pre-process the data with automatic cognate detection and have the results refined by experts afterwards).
● The results show that we might be able to push the barrier: switching to a better partitioning algorithm improved the results by 1 point. What happens if we also manage to refine the calculation of the similarity scores?

Slide 16

Slide 16 text

Benchmarks

Slide 17

Slide 17 text

Towards a Cognate Detection Benchmark Database
● Benchmark databases are essential to train, test, and evaluate our algorithms.
● So far, there are not many benchmark databases which could be used for our tasks.
● With the new data we prepared for this study and the data that was compiled to test LingPy (List 2014), there are already 18 different datasets which are
  ○ in more or less clean phonetic encoding
  ○ segmented
  ○ tagged for cognacy
● We think these should be published as a benchmark database for cognate detection (BDCD, or “LexiBench”?).
● In this way, we maintain comparability with other algorithms that might be proposed, and we ease the creation of alternative approaches.

Slide 18

Slide 18 text

Benchmark Database for Cognate Detection: Demo
● This is a first demo version of what a benchmark database could look like: http://digling.org/lexibench/
● It is nothing special, and we do not need anything really special, but we should get the data out there, so that people know they can use it...

Slide 19

Slide 19 text

Benchmark Databases in General
● There is a Benchmark Database for Phonetic Alignment (List and Prokić 2014), which will be extended in 2015 (scholars are starting to provide data).
● There could be more benchmark databases for other tasks that are relevant for us.
● We may think of establishing a general “brand” for benchmark databases, summarized under, say, “GlottoBench” (and the alignment benchmark can be directly included there).
● Alternatively, we can go for separate benchmarks.
● But what we need is to spread the word that there is data that people can use for developing and testing their algorithms.
● Otherwise, computational linguists will keep on using the Dyen database in an uppercase version to check how well their algorithms detect cognates...

Slide 20

Slide 20 text

Probabilities

Slide 21

Slide 21 text

Toward a Bayesian approach
● Deterministic methods like those just outlined have some drawbacks (and also advantages!)
● We only get black-and-white results from them.
● Two words are either definitely cognate, or they are definitely not.
● By adjusting thresholds we can trade off between Type I and Type II errors.
● But we must apply a single threshold to the entire dataset.
● The thresholds are essentially arbitrary, and choosing one for a new dataset is difficult.

Slide 22

Slide 22 text

Virtues of probabilistic modelling
● Probabilistic models can overcome many of these shortcomings (usually at greater computational expense).
● We can get more nuanced results like “these two words are cognate with probability 0.72”.
● We can also have the model attempt to “learn its own thresholds” from the data, removing the need to guess arbitrary thresholds for new datasets.

Slide 23

Slide 23 text

Introducing the Chinese Restaurant Process
● The CRP is a whimsically named discrete-time stochastic process.
● Imagine a Chinese restaurant with an infinite number of tables.
● Imagine each table has an infinite number of seats.
● Customers enter the restaurant and are seated one at a time.
● Where each customer sits is randomly determined.

Slide 24

Slide 24 text

Chinese Restaurant Process (contd.)
● Sometimes customers sit at an empty table (there is always one available!)
● Otherwise they choose an occupied table, and they choose which one with some probability proportional to how many people are already seated there.
● In this version, there is a “stickiness” parameter 0 ≤ α ≤ ∞.
● When α is low, people like to sit with each other.
● When α is very high, everybody sits alone.
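
The seating rule can be simulated in a few lines. The sketch below uses the standard formulation of the CRP, which matches the description above: with k customers already seated, the next one opens a new table with probability α / (k + α) and joins an occupied table with probability proportional to its current size. The function name and the `rng` parameter are assumptions for illustration.

```python
import random


def chinese_restaurant_process(n, alpha, rng=random):
    """Seat n customers one at a time; return the table index of each customer."""
    seating = []      # seating[i] is the table of customer i
    table_sizes = []  # table_sizes[t] is the number of customers at table t
    for customers_seated in range(n):
        # A point in [0, customers_seated + alpha): the first alpha units stand for
        # a new table, the remaining units for the seats that are already taken.
        r = rng.random() * (customers_seated + alpha)
        if not table_sizes or r < alpha:
            table = len(table_sizes)
            table_sizes.append(1)
        else:
            r -= alpha
            table = 0
            while table < len(table_sizes) - 1 and r >= table_sizes[table]:
                r -= table_sizes[table]
                table += 1
            table_sizes[table] += 1
        seating.append(table)
    return seating


# The parameter values used in the examples on the following slides.
for alpha in (0.5, 1.5, 2.5, 4.5, 5.5):
    print(alpha, chinese_restaurant_process(5, alpha, random.Random(1)))
```

With α = 0 every customer after the first joins the single occupied table, and with a very large α almost every customer opens a new table, which matches the two extremes described above.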

Slide 25

Slide 25 text

Chinese Restaurant Process example, n=5, α=0.5

Slide 26

Slide 26 text

Chinese Restaurant Process example, n=5, α=1.5

Slide 27

Slide 27 text

Chinese Restaurant Process example, n=5, α=2.5

Slide 28

Slide 28 text

Chinese Restaurant Process example, n=5, α=4.5

Slide 29

Slide 29 text

Chinese Restaurant Process example, n=5, α=5.5

Slide 30

Slide 30 text

Chinese Restaurant Process (contd.)
● What does this have to do with language?!
● The CRP sorts n customers into somewhere between 1 and n tables.
● More generally, it partitions n things into 1-n sets.
● It can partition n wordforms into 1-n cognate classes.
● It is simply a probabilistic way of sorting things into groups, with the option to control how likely things are to group together, in a broad way.

Slide 31

Slide 31 text

A generative process for matrices
Here’s a recipe for making an n×n pairwise distance matrix like one of Mattis’ from scratch by rolling dice:
1. Sample a stickiness value α from some prior.
2. Use the CRP, with α, to sort your n words into cognate classes.
3. Sample phonetic distances between cognates from a “small” distribution.
4. Sample phonetic distances between non-cognates from a “large” distribution.
This is an example of a mixture model. (You can sample parameters for your large and small distributions from some prior too.)
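
The four steps of the recipe can be sketched directly. All numeric choices here are assumptions for illustration: a Gamma(2, 1) prior on α, and Gaussian “small” and “large” distributions with fixed means and standard deviations. Sampled distances are not clipped to any range.

```python
import random


def sample_partition(n, alpha, rng):
    """Chinese Restaurant Process: return a cognate class label for each of n words."""
    labels, sizes = [], []
    for _ in range(n):
        if not sizes:
            choice = 0  # the first word always opens a new class
        else:
            weights = sizes + [alpha]  # existing classes, then a new class
            choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
        labels.append(choice)
    return labels


def sample_distance_matrix(n, rng, small=(0.3, 0.1), large=(0.8, 0.1)):
    """Generate (labels, matrix); small and large are hypothetical (mean, sd) pairs."""
    alpha = rng.gammavariate(2.0, 1.0)        # step 1: hypothetical Gamma(2, 1) prior
    labels = sample_partition(n, alpha, rng)  # step 2: CRP partition
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # steps 3 and 4: cognate pairs use "small", all other pairs use "large"
            mean, sd = small if labels[i] == labels[j] else large
            matrix[i][j] = matrix[j][i] = rng.gauss(mean, sd)
    return labels, matrix


labels, matrix = sample_distance_matrix(6, random.Random(42))
print(labels)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```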

Slide 32

Slide 32 text

Bayes flip!
● We just outlined a way to stochastically generate phonetic distance matrices.
● We already have (or can get) phonetic distance matrices.
● Those are our data!
● The value of generative models is that we can use Bayes’ theorem to reverse them: P(H|D) ∝ P(D|H) P(H)
● Here H is our cognate classification (and α, and any distribution parameters), and D is the observed distance matrix.
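
To make the reversal concrete, the sketch below scores a candidate classification H against a distance matrix D by computing log P(D|H) + log P(H), which equals the log posterior up to an additive constant. It is a simplification under stated assumptions: α and the two Gaussians are held fixed at invented values instead of receiving their own priors, and the function names are hypothetical. The partition prior is the standard CRP probability α^K · Π(n_k − 1)! / Π_{i=0}^{n−1}(α + i) for K classes of sizes n_k.

```python
import math
from collections import Counter


def log_crp_prior(labels, alpha):
    """Log probability of a partition under the CRP with parameter alpha."""
    sizes = Counter(labels).values()
    log_p = len(sizes) * math.log(alpha)
    log_p += sum(math.lgamma(size) for size in sizes)  # lgamma(size) = log((size - 1)!)
    log_p -= sum(math.log(alpha + i) for i in range(len(labels)))
    return log_p


def log_gaussian(x, mean, sd):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)


def log_posterior(labels, distances, alpha, small=(0.2, 0.15), large=(0.8, 0.15)):
    """Unnormalized log P(H|D); small and large are hypothetical fixed (mean, sd) pairs."""
    total = log_crp_prior(labels, alpha)
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            mean, sd = small if labels[i] == labels[j] else large
            total += log_gaussian(distances[i][j], mean, sd)
    return total


# Hypothetical data: words 0 and 1 are close, words 2 and 3 are close.
distances = [[0.0, 0.1, 0.9, 0.9],
             [0.1, 0.0, 0.9, 0.9],
             [0.9, 0.9, 0.0, 0.1],
             [0.9, 0.9, 0.1, 0.0]]
good = log_posterior([0, 0, 1, 1], distances, alpha=1.0)
bad = log_posterior([0, 1, 0, 1], distances, alpha=1.0)
print(good, bad)
```

An MCMC sampler would propose changes to the classification (and to the parameters) and use differences between such scores to decide whether to accept them.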

Slide 33

Slide 33 text

The reward
● What we actually get back from a Bayesian analysis is a posterior distribution over cognate classes.
● If we use MCMC to do the analysis, we get a set of, say, 100 or 1000 classifications of the words into cognate sets.
● This facilitates statements like “A and B are cognate with probability 0.82”.
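
Given a set of sampled classifications, a pairwise cognate probability is simply the fraction of samples in which the two words share a class. The sketch below assumes each sample is a mapping from word to class label. The four samples are invented for illustration.

```python
def cognate_probability(samples, a, b):
    """Fraction of sampled classifications that put words a and b in the same class."""
    return sum(1 for labels in samples if labels[a] == labels[b]) / len(samples)


# Hypothetical MCMC output: four sampled classifications of the words A, B and C.
samples = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 0, "C": 1},
]
print(cognate_probability(samples, "A", "B"))
print(cognate_probability(samples, "B", "C"))
```

Class labels are only compared within a sample, so it does not matter that label 0 in one sample and label 0 in another may refer to different classes.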

Slide 34

Slide 34 text

Does it work?
● We get back uncertain estimates of cognate classes.
● So we can’t immediately compute precision, recall or F scores against gold standards.
● For the sake of easy and direct comparison to gold standards or existing methods, we need a way to distil our uncertain estimates into clear “consensus classes” (cf. consensus trees).
● How best to do this is an interesting research problem in itself.

Slide 35

Slide 35 text

A simple first effort
● The MAP results feature good precision but mediocre recall.
● We need to introduce a little more “slack”.
● Here’s an off-the-cuff approach:
● For each wordform, build a cognate set out of all those forms which are cognate with p ≥ 0.75, say.
● Look for wordforms assigned to multiple cognate classes (bad).
● Keep the form in the class with the highest product of pairwise probabilities.
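
The off-the-cuff approach leaves some details open, so the sketch below is only one possible reading of it. Assumptions: each wordform's candidate set contains the form itself, the “product of pairwise probabilities” is taken between the form and the other members of a candidate set, and ties are broken alphabetically. The function names and the probabilities are invented.

```python
from math import prod


def consensus_classes(words, probability, cutoff=0.75):
    """Distil pairwise cognate probabilities into non-overlapping consensus classes."""
    # Step 1: one candidate set per wordform (the form plus everything with p >= cutoff).
    candidates = {
        frozenset([w] + [v for v in words if v != w and probability(w, v) >= cutoff])
        for w in words
    }

    def support(word, group):
        """Product of the probabilities linking the word to the other group members."""
        return prod(probability(word, other) for other in group if other != word)

    # Steps 2 and 3: a form found in several candidate sets stays in the best-supported one.
    classes = {}
    for w in words:
        options = [group for group in candidates if w in group]
        best = max(options, key=lambda group: (support(w, group), sorted(group)))
        classes.setdefault(best, []).append(w)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical posterior probabilities for four wordforms.
pairwise = {frozenset("AB"): 0.9, frozenset("BC"): 0.8, frozenset("AC"): 0.3,
            frozenset("AD"): 0.1, frozenset("BD"): 0.1, frozenset("CD"): 0.1}


def probability(a, b):
    return pairwise[frozenset((a, b))]


result = consensus_classes(["A", "B", "C", "D"], probability)
print(result)
```

Here B is a member of three candidate sets and ends up with A, because the A-B link (0.9) is stronger than the B-C link (0.8). Note that this reading favours small sets, since every extra member multiplies the support by a value below 1.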

Slide 36

Slide 36 text

A few more details
● We test a particular form of the generative model.
● Our “big” and “small” distributions are Gaussians, with separate means and variances.
● The two distributions have different priors on their means (favouring large or small values), and the same priors on their variances (favouring lower values).
● The prior over α is based on some gold standard classifications.
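
Sampling the Gaussian parameters from such priors might look like the sketch below. The prior families follow the titles of the next slides (Gamma priors for the means, an exponential prior for the variances), but every hyperparameter value here is a made-up placeholder, not a value used in the study.

```python
import random


def sample_gaussian_parameters(rng):
    """Draw (mean, variance) pairs for the "small" and "big" distance distributions."""
    # Hypothetical Gamma(shape, scale) priors with prior means 0.2 and 0.8.
    small_mean = rng.gammavariate(4.0, 0.05)
    big_mean = rng.gammavariate(16.0, 0.05)
    # The same exponential prior for both variances (hypothetical rate 20, prior
    # mean 0.05), which favours low values.
    small_variance = rng.expovariate(20.0)
    big_variance = rng.expovariate(20.0)
    return (small_mean, small_variance), (big_mean, big_variance)


small, big = sample_gaussian_parameters(random.Random(7))
print(small, big)
```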

Slide 37

Slide 37 text

Gamma priors for Gaussian means

Slide 38

Slide 38 text

Exponential prior for Gaussian variances

Slide 39

Slide 39 text

Gamma prior for α, with empirical params.

Slide 40

Slide 40 text

p=0.75 results

Slide 41

Slide 41 text

Thresholds strike back
● Of course, the threshold can be varied from 0.75.
● As the threshold is lowered, we trade precision for recall.
● But wait, weren’t we trying to get rid of arbitrary thresholds?!
● This is not so bad: the threshold has a principled interpretation which holds across datasets (cf. the arbitrary use of 0.05 significance in hypothesis testing).

Slide 42

Slide 42 text

Future work
● One shortcoming of this method is the independence of the restaurants.
● All restaurants share the same α, so are equally “clumpy” or “spread out”.
● But there is no requirement for clumpings to be consistent across restaurants, as we may expect.
● We don’t just want people to always sit at 2 or 3 crowded tables, we want roughly the same crowds each time.
● A modification to the CRP may help with this?

Slide 43

Slide 43 text

More future work
● An unexplored question: how can we infer trees using this system?
● We can compute consensus cognate classes as an intermediate step and then use Covarion or Dollo models in the traditional way.
● But this throws away the uncertainty information we have worked so hard to get!
● Can we do away with the intermediate step entirely and develop a tree-based generative model for distance matrices?

Slide 44

Slide 44 text

Thank you!
● Any questions?