Automatic Cognate Detection: Evaluation, Benchmarks, and Probabilities

Talk by Luke Maurits, Simon Greenhill, and Johann-Mattis List, presented at the workshop "Towards a Global Language Phylogeny", 21-23 October 2015, Max Planck Institute for the Science of Human History, Jena.

Johann-Mattis List

October 21, 2015

Transcript

  1. Automatic Cognate Detection: Evaluation, Benchmarks, and Probabilities
    Luke Maurits, Johann-Mattis List, and Simon Greenhill

  2. Outline
    1. Evaluation
    a. Overview of LingPy's cognate detection methods
    b. Introducing a new method based on Infomap community clustering
    c. Testing the method on a newly compiled test set
    2. Benchmarks
    a. Pointing out the importance of having good benchmark databases
    b. Showing an example of what a cognate detection benchmark database could look like
    c. Introducing the idea of having a series of benchmark databases in GlottoBank
    3. Probabilities
    a. Towards a Bayesian approach
    b. The Chinese Restaurant Process
    c. MAP approach and beyond

  3. Evaluation

  4. LingPy’s Cognate Detection Methods
    ● LingPy currently offers four cognate detection methods (see the sketch below):
    ○ Turchin et al. (2010): simple sound-class method; the first two sounds decide cognacy
    ○ Edit Distance: normalized edit distances are calculated for all word pairs and the results are partitioned into cognate sets (using flat UPGMA clustering)
    ○ SCA: SCA distances (List 2012) are calculated for all word pairs and the results are partitioned into cognate sets (using flat UPGMA)
    ○ LexStat: language-specific scoring schemes (close to sound correspondences, effectively log-odds) are calculated, and the resulting distances are partitioned into cognate sets (flat UPGMA)
    ● The methods show increasing accuracy, but also increasing computational cost
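
A minimal sketch of how these four methods can be run through LingPy's LexStat class; the input file name, the output column names (the "ref" values), and the thresholds below are illustrative assumptions, not the settings used in the talk:

```python
# Minimal sketch: the four LingPy cognate detection methods on a wordlist file.
# "wordlist.tsv", the ref column names, and the thresholds are illustrative.
from lingpy import LexStat

lex = LexStat("wordlist.tsv")

# Turchin et al. (2010): sound-class method, no threshold required
lex.cluster(method="turchin", ref="turchinid")

# Normalized edit distance, partitioned with flat UPGMA
lex.cluster(method="edit-dist", threshold=0.75, ref="editid")

# SCA distances, partitioned with flat UPGMA
lex.cluster(method="sca", threshold=0.45, ref="scaid")

# LexStat: language-specific scorers must be computed first (permutation test)
lex.get_scorer(runs=1000)
lex.cluster(method="lexstat", threshold=0.6, ref="lexstatid")
```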

  5. A New Method for Cognate Detection
    ● Instead of using flat UPGMA for partitioning, we use Infomap (Rosvall and Bergstrom 2008), but take the pairwise distances from LexStat.
    ● Infomap is a community detection algorithm designed for complex network analysis.
    ● By switching from agglomerative clustering to network partitioning, we follow the "lead" of evolutionary biology, where network partitioning has been used over the last decade to determine homologous gene families.
    ● Different algorithms are used in biology, and many have been tested so far, but Infomap seems to perform best (as shown in the evaluation; see the sketch below).
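
A hedged sketch of the idea, using python-igraph's Infomap implementation on a matrix of pairwise LexStat distances; the function name, the 0.6 cutoff, and the edge-weighting scheme are illustrative assumptions, not the talk's exact procedure:

```python
# Sketch: partition a pairwise distance matrix with Infomap instead of flat UPGMA.
import igraph

def infomap_partition(words, dists, threshold=0.6):
    """dists[i][j]: LexStat distance between words[i] and words[j] (symmetric)."""
    g = igraph.Graph()
    g.add_vertices(len(words))
    edges, weights = [], []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if dists[i][j] < threshold:
                edges.append((i, j))
                weights.append(1.0 - dists[i][j])  # turn a distance into a similarity
    g.add_edges(edges)
    communities = g.community_infomap(edge_weights=weights)
    return communities.membership  # membership[i] = cognate-set index of words[i]
```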

  6. Network Partitioning instead of Flat UPGMA

  7. Network Partitioning instead of Flat UPGMA

  8. Evaluating the Performance on a New Testset

  9. Problem of Threshold Selection
    ● Apart from the method by Turchin et al. (2010), the three remaining methods and the new one require that we know where to set the threshold.
    ● In order to test how well the methods perform, we need a way to determine the threshold in some "objective" way.
    ● Luke will talk about this in greater detail.
    ● For our testing of the new dataset and the new methods, we decided to use a training set (List 2014) to determine the "best" threshold for each of the methods (see the sketch below).
    ● While testing the three LingPy methods which require a threshold, we also tested the new method (LS-Infomap) and an alternative method (LS-MCL, based on Markov Clustering, van Dongen 2000).
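
A sketch of the kind of threshold-selection procedure meant here: sweep a grid of candidate thresholds on the training set and keep the one with the best B-cubed F-score against the gold-standard cognate judgments. The file name, the "cogid" gold column, and the candidate grid are assumptions made for illustration:

```python
# Sketch: pick the "best" threshold for a method by grid search on a training set.
from lingpy import LexStat
from lingpy.evaluate.acd import bcubes

def best_threshold(train_file, method="sca", candidates=None):
    candidates = candidates or [i / 20 for i in range(1, 20)]  # 0.05 ... 0.95
    lex = LexStat(train_file)
    if method == "lexstat":
        lex.get_scorer(runs=1000)  # LexStat needs its scorer before clustering
    best_t, best_f = None, -1.0
    for t in candidates:
        ref = "cog_" + str(t).replace(".", "_")
        lex.cluster(method=method, threshold=t, ref=ref)
        precision, recall, fscore = bcubes(lex, "cogid", ref, pprint=False)
        if fscore > best_f:
            best_t, best_f = t, fscore
    return best_t, best_f  # the winning threshold and its B-cubed F-score
```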

  10. Training Set Used for Threshold Selection

  11. Results for the Threshold Selection Procedure
    [Figure: accuracy plotted against threshold for each method]

  12. Evaluation of the Methods: General Results

  13. Evaluation of the Methods: General Results

  14. Evaluation of the Methods: Specific Results

  15. Intermediate Conclusion
    ● The results already seem good enough to be used for computer-assisted language comparison (pre-process the data with automatic cognate detection and have the results refined by experts later).
    ● The results show that we might be able to push the barrier further: switching to a better partitioning algorithm improved the scores by one point. What happens if we also manage to refine the calculation of the similarity scores?

  16. Benchmarks

  17. Towards a Cognate Detection Benchmark Database
    ● Benchmark databases are essential to train, test, and evaluate our algorithms.
    ● So far, there are not many benchmark databases which could be used for our tasks.
    ● With the new data we prepared for this study, and the data that was compiled to test LingPy (List 2014), there are already 18 different datasets which
    ○ are in more or less clean phonetic encoding,
    ○ are segmented,
    ○ are tagged for cognacy.
    ● We think these should be published as a benchmark database for cognate detection (BDCD, or "LexiBench"?).
    ● In this way, we maintain comparability with other algorithms that might be proposed, and we ease the creation of alternative approaches.

  18. Benchmark Database for Cognate Detection: Demo
    ● This is a first demo version of how a benchmark database could look:
    ● http://digling.org/lexibench/
    ● It is nothing special, and we do not need anything really special, but we should get the data out there, so that people know they can use it...

  19. Benchmark Databases in General
    ● There is a Benchmark Database for Phonetic Alignment (List and Prokić 2014), which will be extended in 2015 (scholars are starting to provide data).
    ● There could be more benchmark databases for other tasks that are relevant for us.
    ● We may think of establishing a general "brand" for benchmark databases, summarized under, say, "GlottoBench" (the alignment benchmark could be included there directly).
    ● Alternatively, we can go for separate benchmarks.
    ● But what we need is to spread the word that there is data that people can use for developing and testing their algorithms.
    ● Otherwise, computational linguists will keep on using the Dyen database in an uppercase version to check how well their algorithms detect cognates...

  20. Probabilities

  21. Toward a Bayesian approach
    ● Deterministic methods like those just outlined have some drawbacks (and
    also advantages!)
    ● We only get black-and-white results from them.
    ● Two words are either definitely cognate, or they are definitely not.
    ● By adjusting thresholds we can trade off between Type I and Type II errors.
    ● But we must apply a single threshold to the entire dataset.
    ● The thresholds are essentially arbitrary, and choosing one for a new dataset
    is difficult.

  22. Virtues of probabilistic modelling
    ● Probabilistic models can overcome many of these shortcomings (usually at greater computational expense).
    ● We can get more nuanced results like “these two words are cognate with
    probability 0.72”
    ● We can also have the model attempt to “learn its own thresholds” from the
    data, removing the need to guess arbitrary thresholds for new datasets.

  23. Introducing the Chinese Restaurant Process
    ● The CRP is a whimsically named discrete-time stochastic process.
    ● Imagine a Chinese restaurant with an infinite number of tables.
    ● Imagine each table has an infinite number of seats.
    ● Customers enter the restaurant and are seated one at a time.
    ● Where each customer sits is randomly determined.

  24. Chinese Restaurant Process (contd.)
    ● Sometimes customers sit at an empty table (there is always one available!).
    ● Otherwise they choose an occupied table, with probability proportional to how many people are already seated there.
    ● (In this version) there is a "stickiness" parameter 0 ≤ α < ∞.
    ● When α is low, people like to sit with each other.
    ● When α is very high, everybody sits alone (a seating sketch follows below).
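
A small sketch of the seating process just described (the function name and the printed example are ours, not from the talk): each new customer joins an occupied table with probability proportional to its current size, or opens a new table with probability proportional to α:

```python
# Sketch of the Chinese Restaurant Process seating scheme described above.
import random

def sample_crp(n, alpha):
    """Seat n customers; return the tables as lists of customer indices."""
    tables = []
    for customer in range(n):
        # One weight per occupied table, plus alpha for the always-available empty table
        weights = [len(table) for table in tables] + [alpha]
        choice = random.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append([customer])   # open a new table
        else:
            tables[choice].append(customer)
    return tables

# e.g. one random seating of 5 customers with alpha = 0.5 (cf. the examples below)
print(sample_crp(5, 0.5))
```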

  25. Chinese Restaurant Process example, n=5, α=0.5

  26. Chinese Restaurant Process example, n=5, α=1.5

  27. Chinese Restaurant Process example, n=5, α=2.5

  28. Chinese Restaurant Process example, n=5, α=4.5

  29. Chinese Restaurant Process example, n=5, α=5.5

  30. Chinese Restaurant Process (contd.)
    ● What does this have to do with language?!
    ● The CRP sorts n customers into somewhere between 1 and n tables.
    ● More generally, it partitions n things into 1-n sets.
    ● It can partition n wordforms into 1-n cognate classes.
    ● It is simply a probabilistic way of sorting things into groups, with the option to
    control how likely things are to group together, in a broad way.

  31. A generative process for matrices
    Here’s a recipe for making an n×n pairwise distance matrix like one of Mattis’ from
    scratch by rolling dice:
    1. Sample a stickiness value α from some prior.
    2. Use the CRP, with α, to sort your n words into cognate classes.
    3. Sample phonetic distances between cognates from a “small” distribution.
    4. Sample phonetic distances between non-cognates from a “large” distribution.
    This is an example of a mixture model.
    (you can sample parameters for your large and small distributions from some prior
    too)
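
A hedged sketch of this recipe in code; the Gamma prior on α, the Gaussian means and variances, and the clipping to [0, 1] are illustrative choices made for the sketch, not the model actually used:

```python
# Sketch of the generative "recipe": CRP partition + two-component Gaussian mixture.
import numpy as np

def generate_distance_matrix(n, seed=None):
    rng = np.random.default_rng(seed)
    # 1. Sample a stickiness value alpha from some prior
    alpha = rng.gamma(shape=2.0, scale=1.0)
    # 2. Use the CRP, with alpha, to sort the n words into cognate classes
    labels = []
    for _ in range(n):
        n_classes = (max(labels) + 1) if labels else 0
        weights = np.array([labels.count(k) for k in range(n_classes)] + [alpha])
        labels.append(int(rng.choice(n_classes + 1, p=weights / weights.sum())))
    # 3./4. Sample pairwise distances from a "small" (cognate) or a "large"
    # (non-cognate) distribution
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mean, sd = (0.2, 0.1) if labels[i] == labels[j] else (0.8, 0.1)
            dist[i, j] = dist[j, i] = float(np.clip(rng.normal(mean, sd), 0.0, 1.0))
    return labels, dist
```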

  32. Bayes flip!
    ● We just outlined a way to stochastically generate phonetic distance matrices.
    ● But we already have (or can get) phonetic distance matrices.
    ● Those are our data!
    ● The value of generative models is that we can use Bayes' theorem to reverse them:
    P(H|D) ∝ P(D|H) P(H)
    ● Here H is our cognate classification (and α, and any distribution parameters).
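
In code, the "flip" just amounts to scoring a candidate classification H against the observed matrix D by adding a log-likelihood and a log-prior. This sketch assumes the Gaussian mixture from the previous sketch; `log_prior` stands in for the CRP (Ewens) prior over partitions plus any hyperparameter priors:

```python
# Sketch: unnormalised log-posterior  log P(H|D) = log P(D|H) + log P(H) + const.
import math

def log_posterior(labels, dist, mu_small, var_small, mu_large, var_large, log_prior):
    loglik = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            mu, var = (mu_small, var_small) if labels[i] == labels[j] else (mu_large, var_large)
            # Gaussian log-density of the observed distance under the chosen component
            loglik += -0.5 * math.log(2 * math.pi * var) - (dist[i][j] - mu) ** 2 / (2 * var)
    return loglik + log_prior(labels)  # log_prior: e.g. the CRP / Ewens partition prior
```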

  33. The reward
    ● What we actually get back from a Bayesian analysis is a posterior distribution
    over cognate classes.
    ● If we use MCMC to do the analysis, we get a set of say 100 or 1000
    classifications of the words into cognate sets.
    ● This facilitates statements like “A and B are cognate with probability 0.82”.
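
A sketch of how such statements fall out of the MCMC output: count, over the sampled classifications, how often two forms land in the same cognate set (the function and variable names are ours):

```python
# Sketch: turn MCMC samples of cognate classifications into pairwise probabilities.
def pairwise_cognate_probs(samples, n):
    """samples: list of label lists (one per MCMC sample); n: number of wordforms."""
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # P(i and j cognate) is estimated as the fraction of samples grouping them together
    return [[c / len(samples) for c in row] for row in counts]
```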

  34. Does it work?
    ● We get back uncertain estimates of cognate classes.
    ● So we can’t immediately compute precision, recall or F scores against gold
    standards.
    ● For the sake of easy and direct comparison to gold standards or existing
    methods, we need a way to distil our uncertain estimates into clear
    “consensus classes” (cf. consensus trees).
    ● How best to do this is an interesting research problem in itself

  35. A simple first effort
    ● The MAP results feature good precision but mediocre recall.
    ● We need to introduce a little more "slack".
    ● Here's an off-the-cuff approach (sketched below):
    ● For each wordform, build a cognate set out of all those forms which are cognate with it with p ≥ 0.75, say.
    ● Look for wordforms assigned to multiple cognate classes (bad).
    ● Keep the form in the class with the highest product of pairwise probabilities.
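
One reading of this off-the-cuff procedure as code; `post` is the pairwise cognacy-probability matrix (e.g. built from the MCMC samples as above), and the tie-breaking product is taken over a form's pairwise probabilities with the other members of each candidate class:

```python
# Sketch of the "simple first effort" consensus procedure described above.
from math import prod

def consensus_classes(n, post, p=0.75):
    # For each wordform, build a cognate set of all forms cognate with it at >= p
    classes = {frozenset([i] + [j for j in range(n) if j != i and post[i][j] >= p])
               for i in range(n)}
    assignment = {}
    for i in range(n):
        candidates = [c for c in classes if i in c]
        # If a form falls into several classes, keep it in the class with the
        # highest product of pairwise probabilities
        assignment[i] = max(candidates, key=lambda c: prod(post[i][j] for j in c if j != i))
    return assignment
```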

  36. A few more details
    ● We test a particular form of the generative model
    ● Our “big” and “small” distributions are Gaussians, with separate means and
    variances.
    ● The two distributions have different priors on their means (favouring large or
    small values), and the same priors on their variances (favouring lower values)
    ● The prior over α is based on some gold standard classifications.
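
An illustrative prior specification along these lines, written with scipy.stats; the hyperparameter values below are invented for the sketch and are not the settings used in the talk:

```python
# Sketch: Gamma priors on the Gaussian means, Exponential priors on the variances,
# and a Gamma prior on the CRP stickiness alpha (hyperparameters are made up).
from scipy import stats

priors = {
    "mu_small": stats.gamma(a=2.0, scale=0.1),   # favours small mean distances (cognates)
    "mu_large": stats.gamma(a=8.0, scale=0.1),   # favours large mean distances (non-cognates)
    "var_small": stats.expon(scale=0.05),        # same prior on both variances,
    "var_large": stats.expon(scale=0.05),        # favouring lower values
    "alpha": stats.gamma(a=2.0, scale=1.0),      # empirical params would come from gold data
}

# Example: one draw of hyperparameters from the priors
sample = {name: float(dist.rvs()) for name, dist in priors.items()}
```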

  37. Gamma priors for Gaussian means

  38. Exponential prior for Gaussian variances

  39. Gamma prior for α, with empirical params.

  40. p=0.75 results

  41. Thresholds strike back
    ● Of course, the threshold can be varied from 0.75.
    ● As the threshold is lowered, we trade precision for recall.
    ● But wait, weren't we trying to get rid of arbitrary thresholds?!
    ● This is not so bad: the threshold has a principled interpretation which holds across datasets (cf. the arbitrary use of 0.05 significance in hypothesis testing).

  42. Future work
    ● One shortcoming of this method is the independence of the restaurants.
    ● All restaurants share the same α, so they are equally "clumpy" or "spread out".
    ● But there is no requirement for clumpings to be consistent across restaurants, as we might expect them to be.
    ● We don't just want people to always sit at 2 or 3 crowded tables; we want roughly the same crowds each time.
    ● A modification of the CRP may help with this?

  43. More future work
    ● An unexplored question: how can we infer trees using this system?
    ● We can compute consensus cognate classes as an intermediate step and
    then use Covarion or Dollo models in the traditional way.
    ● But this throws away the uncertainty information we have worked so hard to
    get!
    ● Can we do away with the intermediate step entirely and develop a tree-based
    generative model for distance matrices?

  44. Thank you!
    ● Any questions?
