Benchmark Databases in Historical Linguistics

Johann-Mattis List
Forschungszentrum Deutscher Sprachatlas, Philipps-University Marburg
2014-10-07

Benchmark Basics

What are Benchmark Databases?

The evaluation and comparison of programs dealing with quantitative tasks in historical linguistics requires a large number of accurate reference sets which can be used as test cases to identify the strong and weak points of the numerous programs. (Original Plagiarism)

A comprehensive evaluation and comparison of alignment programs requires a large number of accurate reference alignments which can be used as test cases. It has been shown (McClure et al., 1994) that the performance of alignment programs depends on the number of sequences, the degree of similarity between sequences and the number of insertions in the alignment. [...] We have constructed BAliBASE (Benchmark Alignment dataBASE) containing high-quality, documented alignments to identify the strong and weak points of the numerous alignment programs now available. (Thompson et al. 1998: 87)

Why do we need Benchmark Databases?

For comparative reasons: without them, we cannot tell whether two independently proposed algorithms perform similarly or not.

For development reasons: we should test our new algorithms on actual data in order to guarantee their applicability and accuracy.

Why should I care for Benchmark Databases?

Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or not interested in enhancing and formalizing existing methods with the help of computational approaches, then you don’t need to care about benchmark databases at all...

But please don’t jump out of the room right now, I will make some jokes you shouldn’t miss at the end of the talk. I promise!

Benchmark Critics

The Gold Standard Problem

What do you mean by “Gold Standard”? Is the data simulated?

There is no such thing as a “Gold Standard”.

How can you make a “Gold Standard” if even historical linguists do not really have a clue what was going on?

The Tuning Problem

Why should I trust your algorithm if you tuned it with the help of your “Gold Standard”?

Seriously, how can I trust your algorithm despite all the scores if it fails to find the correspondences between Armenian [jɛɾˈku] and German [t͜svai]?!?

The Creation Problem: Who Benchmarks the Benchmarks?

Oh, you used the data of XYZ to create your benchmark. Why didn’t you create your own dataset?

Oh, you created the benchmark data YOURSELF. Why didn’t you use the excellent data by XYZ?

How could you use the data of XYZ as the basis for your “Gold Standard” for Germanic languages, given that EVERYONE knows that their reconstruction of Proto-Nostratic is complete nonsense?

In Defense of Benchmarks

Stay away from my benchmarks...

Knowledge in Historical Linguistics

Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1955).

This does not mean that what we know is just a “wissenschaftliche Fiktion” (a “scientific fiction”, Schmidt 1872). There are good reasons to be confident that our traditional methods are better than random guesses.

Although we should be careful in writing Indo-European fables, there are also good reasons to be confident that our methods uncover some kind of reality (cf. Saussure’s prediction of laryngeals in 1879).

If it were not for these reasons that give us confidence in our traditional methods, then why should we bother pursuing them?

As long as we are confident that our traditional methods work more or less, we can use our traditional knowledge to compile benchmark databases and to test how “good” automatic methods are.

Training and Testing

It is impossible to write an algorithm without training it.

Testing an algorithm on the same data on which it was trained carries the danger of “overfitting” the algorithm.

The fact that benchmark databases often serve both to train and to test the algorithms may be considered problematic. Nevertheless, this is not a problem of the benchmarks themselves, but of the way benchmarks are handled by those who write and train the programs.
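One way to handle a benchmark so that training and testing stay apart is to set aside a held-out portion before any tuning starts. The following is only a minimal sketch under assumptions that are not part of the talk: the function name `split_benchmark`, the 30 % test fraction, the fixed seed, and the placeholder identifiers are all hypothetical.

```python
import random


def split_benchmark(items, test_fraction=0.3, seed=42):
    """Shuffle benchmark items reproducibly and split them into train and test parts."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible for other researchers.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for the cognate sets of a benchmark.
cognate_sets = [f"set_{i:02d}" for i in range(10)]
train, test = split_benchmark(cognate_sets)
print(len(train), "items for training,", len(test), "items held out for testing")
```

Parameters would be tuned on `train` only, and scores reported on `test` only; publishing the seed (or the split itself) with the benchmark lets others compare on the same held-out data.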

Benchmarks and Standards

Ideally, a benchmark database in historical linguistics serves as a standard for training new algorithms, testing new algorithms, and defining common formats used by the algorithms.

A benchmark database can be much more than a simple database. It can help to initiate the standardization of formats for data exchange and storage and thus force those who design new algorithms to comply with them.

Benchmarks Online

BDPA: Benchmark Database for Phonetic Alignments
http://alignments.lingpy.org

BDCD: Benchmark Database for Cognate Detection
http://sequencecomparison.github.io

BDLR: Benchmark Database for Linguistic Reconstruction (?)

Some data is already there, but it needs to be cleaned, referenced, linked, and checked before publication. Since the data should be provided in the form of multiple alignments, the alignments of all cognate sets with their proto-forms need to be checked manually. For further data, collaborations are planned (some of the collaborators do not yet know that they are among the lucky ones who have been chosen...).

Benchmark Evaluation

Evaluation Scores

Two alternative alignments of the same three words:

English  w o l - d e m o r t
Russian  v - l a d i m i r -
Chinese  f u - - d i m o t e

English  w o l - d e m o r t -
Russian  v - l a d i m i r - -
Chinese  f u - - d i m o - t e

The two alignments share 8 identical columns. Dividing by the length of the second alignment (11 columns), the length of the first (10 columns), and their mean length (10.5) gives:

8 / 11 = 0.73    8 / 10 = 0.80    8 / 10.5 = 0.76
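The three scores on this slide can be reproduced by counting the columns that both alignments have in common. The sketch below is an illustration only, not the implementation used for the talk; the function names `columns` and `column_scores` are hypothetical. Columns are compared by segment position rather than by symbol, so that two columns only count as identical if they align the very same segments.

```python
def columns(alignment):
    """Represent each column by the positions of its segments (None for gaps)."""
    counters = [0] * len(alignment)
    cols = set()
    for col in zip(*alignment):
        key = []
        for i, segment in enumerate(col):
            if segment == "-":
                key.append(None)
            else:
                key.append(counters[i])
                counters[i] += 1
        cols.add(tuple(key))
    return cols


def column_scores(aln_a, aln_b):
    """Return shared columns and their proportion per alignment length and mean length."""
    shared = len(columns(aln_a) & columns(aln_b))
    len_a, len_b = len(aln_a[0]), len(aln_b[0])
    return shared, shared / len_a, shared / len_b, 2 * shared / (len_a + len_b)


# The two alignments from the slide (10 and 11 columns).
aln_a = [row.split() for row in (
    "w o l - d e m o r t",
    "v - l a d i m i r -",
    "f u - - d i m o t e",
)]
aln_b = [row.split() for row in (
    "w o l - d e m o r t -",
    "v - l a d i m i r - -",
    "f u - - d i m o - t e",
)]

shared, per_a, per_b, per_mean = column_scores(aln_a, aln_b)
print(shared, round(per_a, 2), round(per_b, 2), round(per_mean, 2))
```

Which of the two alignments counts as the reference decides which ratio is read as precision and which as recall; the slide does not label them, so the sketch leaves them unnamed.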

[Figure: the two alignments decomposed into their pairwise alignments (English/Russian, Russian/Chinese, English/Chinese), with matching and non-matching parts set apart.]

Scores over the pairwise alignments, as reported on the successive builds of the slide:

25 / 30 = 0.83    25 / 33 = 0.76    25 / 31.5 = 0.79
26 / 30 = 0.87    26 / 33 = 0.79    26 / 31.5 = 0.83

Choosing useful evaluation scores is essential for the evaluation of a given algorithm. Standardization is of crucial importance here, since this is the only way to guarantee the comparability of alternative approaches. While for some tasks (alignment analyses, cognate detection) proper evaluation scores are well established (cf. the overview in List 2014), evaluation scores for other tasks (borrowing detection, linguistic reconstruction) are largely unexplored. Those who provide benchmark databases should also offer formal ways and code to evaluate algorithm performance.

Evaluation Scores: LingPy
http://lingpy.org

Evaluation Tools

Cognate judgments for the concept “graben” (‘dig’, 30):

Language   Orthography  Transcription  Turchin  Levenshtein  Reference
Albanian   gërmon       gərmo          1        1            1
English    digs         dɪg            2        2            2
French     creuse       krøze          1        3            3
German     gräbt        graːb          1        1            4
Hawaiian   ‘eli         ʔeli           5        5            5
Navajo     hahashgééd   hahageːd       6        6            6
Turkish    kazıyor      kaz            7        3            7

Cognate judgments for the concept “Mund” (‘mouth’, 104):

Language   Orthography  Transcription  Turchin  Levenshtein  Reference
Albanian   gojë         goj            1        1            1
English    mouth        mauθ           2        2            2
French     bouche       buʃ            3        3            3
German     Mund         mund           4        4            2
Hawaiian   waha         waha           5        5            5
Navajo     ’azéé’       zeːʔ           6        6            6
Turkish    ağız         aɣz            7        7            7
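Cognate judgments like those in the two tables can be compared with the reference automatically. The slides do not name a score for this, so the sketch below uses B-cubed precision and recall, a common measure for comparing clusterings, as an assumption; the function name `bcubed` and the variable names are hypothetical. The cognate IDs are copied from the tables.

```python
from collections import defaultdict


def bcubed(test, reference):
    """Return B-cubed precision, recall and F-score of test labels against reference labels."""
    def clusters(labels):
        groups = defaultdict(set)
        for idx, label in enumerate(labels):
            groups[label].add(idx)
        # For every item, the set of items that share its label.
        return [groups[label] for label in labels]

    t, r = clusters(test), clusters(reference)
    n = len(test)
    precision = sum(len(t[i] & r[i]) / len(t[i]) for i in range(n)) / n
    recall = sum(len(t[i] & r[i]) / len(r[i]) for i in range(n)) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Rows: Albanian, English, French, German, Hawaiian, Navajo, Turkish.
graben = {
    "Turchin": [1, 2, 1, 1, 5, 6, 7],
    "Levenshtein": [1, 2, 3, 1, 5, 6, 3],
    "Reference": [1, 2, 3, 4, 5, 6, 7],
}
mund = {
    "Turchin": [1, 2, 3, 4, 5, 6, 7],
    "Levenshtein": [1, 2, 3, 4, 5, 6, 7],
    "Reference": [1, 2, 3, 2, 5, 6, 7],
}

for name, data in (("graben", graben), ("Mund", mund)):
    for method in ("Turchin", "Levenshtein"):
        p, r, f = bcubed(data[method], data["Reference"])
        print(name, method, round(p, 2), round(r, 2), round(f, 2))
```

For “graben” both methods propose cognates that the reference does not contain, which lowers precision; for “Mund” both miss the English/German pair, which lowers recall.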

So far, there are no real tools for the expert evaluation of the results of automatic approaches. Nevertheless, if we want to increase the interaction between manual and automatic approaches in historical linguistics, it seems worthwhile to invest in proper tools for the expert evaluation of algorithms.

Beyond Benchmarks

Computer-Aided Historical Linguistics

So far, the majority of computational approaches in historical linguistics largely disregard the actual needs of historical linguistics. Despite the frequent claims that the algorithms are intended to supplement traditional research, many of them are mere attempts to prove the power of modern machine learning approaches and completely disregard the achievements of traditional research in historical linguistics.

If we really want to make a difference with computational approaches and not simply seek to replace every expert who likes books with a computer or abacus, we need to work much, much harder on a real integration of computational and traditional approaches.

[Figure: the formula P(A|B) = P(B|A) P(A) / P(B) set against a book labelled “FRANZ BOPP: VERY, VERY LONG TITLE”.]

Apart from “computational historical linguistics”, we need to establish a new discipline of “computer-aided historical linguistics”. Such a framework needs benchmark databases (no wonder) and new standards, both for traditional and computational linguistics. However, such a framework will also need additional resources that help traditional approaches to leave the realm of intuition.

Semantic Change

It is beyond question that hypotheses in historical linguistics stand and fall with a proper treatment of semantic change. So far, however, we lack the cross-linguistic data to assess the plausibility of proposed patterns of semantic shift. There is hope for improvement, though:

The Database of Semantic Shifts (DatSemShift, Burlak et al., http://semshifts.iling-ran.ru/) offers a constantly increasing number of patterns of semantic shift, drawn from the linguistic literature. Shifts are categorized and tagged for meanings and, where accessible, directions. In the future, this may turn into a very valuable resource for those interested in semantic change.

The Database of Cross-Linguistic Colexifications (CLICS, List et al., http://clics.lingpy.org) offers collections of colexification (“polysemy”) patterns in some 200 languages. The data has been crawled semi-automatically from existing sources like the Intercontinental Dictionary Series (IDS, Key & Comrie 2007, http://lingweb.eva.mpg.de/ids/) and automatically cleaned and tagged for colexification. The automatic handling of the data without proper checking is a drawback which needs to be addressed in the future. A strength of the database is its visualization of colexifications using up-to-date JavaScript libraries.

Sound Correspondences

That not all sounds are equally likely to occur in correspondence relations in historically related words has been noted by many linguists in the past.

However, only a few linguists have ever tried to substantiate this claim with data (Dolgopolsky 1964, Brown et al. 2013).

The most notable resource known to me is the one by Brown et al. (2013), who report statistics based on ASJP data. The drawbacks of this approach are the limited number of symbols in ASJP code and the fact that identity relations are not covered. The advantages are the strictness of the procedure and the large amount of data that the analysis is based upon.
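How such statistics could be collected from aligned word pairs, including the identity relations mentioned above, can be sketched in a few lines. This is not the procedure of Brown et al. (2013); the function name `correspondence_counts` is hypothetical, and the single aligned pair is the English/Russian toy alignment from the evaluation slides, used here only because it is at hand (the two words are of course not cognate).

```python
from collections import Counter


def correspondence_counts(aligned_pairs):
    """Count unordered segment pairs, gaps and identities included, over aligned word pairs."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            if a == "-" and b == "-":
                continue  # an all-gap column carries no information
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy input: one aligned pair taken from the alignment example of this talk.
pairs = [
    ("w o l - d e m o r t".split(), "v - l a d i m i r -".split()),
]
counts = correspondence_counts(pairs)
identities = {pair: n for pair, n in counts.items() if pair[0] == pair[1]}
print(counts.most_common(3))
print(identities)
```

With real benchmark data, the input would be the pairwise alignments of all cognate words, and the raw counts would still need to be compared against the frequencies expected by chance.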

Sound Change

So far, the only online resource known to me is the web-based platform for Diachronic Data and Models (DiaDM, http://www.diadm.ish-lyon.cnrs.fr), which offers a database on Diachronic Universals (UniDia). Unfortunately, the database has been under construction for almost two years now, and no real progress regarding the presentation of the data has been visible so far.

If the numbers presented on the UniDia website are true (10 349 sound changes in 302 languages), it is an invaluable resource on known sound changes.

Unfortunately, it is not evident from the website how (or if) this data will be made public in the future, and whether it can ever be used either to train our algorithms or to provide our experts with something more than intuition.

Phylogenetic Reconstruction

It is probably needless to say that with MultiTree (http://multitree.org), Ethnologue (Lewis & Fennig 2014, http://ethnologue.com), and Glottolog (v2.3, Hammarström et al. 2014, http://glottolog.org), a large number of expert classifications are publicly available.

What is lacking in quite a few current approaches to phylogenetic reconstruction is a proper evaluation of the algorithms that makes rigorous use of these resources. Just eyeballing a tree and claiming that some method “reproduces expert classifications” based on some strange criterion is simply not enough.
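A simple alternative to eyeballing is to count how many subgroups of an expert classification are recovered by an inferred tree. The sketch below is one possible criterion among several, not a method proposed in the talk; trees are written as nested tuples, and the names `clades` and `clade_recovery` as well as the four-language toy trees are hypothetical.

```python
def clades(tree):
    """Return all non-trivial clades of a nested-tuple tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the clade containing all leaves is uninformative
    return found


def clade_recovery(inferred, expert):
    """Return how many expert clades the inferred tree recovers, and how many there are."""
    reference = clades(expert)
    return len(clades(inferred) & reference), len(reference)


# Toy example: an expert classification with a Germanic and a Romance subgroup,
# and a hypothetical inferred tree that only recovers the Germanic one.
expert = (("German", "English"), ("French", "Italian"))
inferred = ((("German", "English"), "French"), "Italian")

hits, total = clade_recovery(inferred, expert)
print(f"{hits} of {total} expert subgroups recovered")
```

Because expert classifications are often not fully binary, counting recovered expert subgroups is more forgiving than a symmetric tree distance, which would also penalize every additional split proposed by the algorithm.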

Borrowing Detection

Some databases, like the Indo-European Lexical Cognacy Database (IELex, http://ielex.mpi.nl/, Dunn et al. 2012), list borrowings along with their sources.

However, given that in many areas of the world (and also in Indo-European) our knowledge in historical linguistics starts to reach its limits when it comes to distinguishing borrowings from inherited words, it is quite likely that it is impossible at the moment to provide exhaustive test sets in which all borrowings are identified.

Conclusion

Automatic approaches are constantly gaining ground in historical linguistics.

Nevertheless, the majority of the new approaches show a great lack of transparency and applicability.

One reason for this is the lack of benchmark databases in historical linguistics which help programmers to train their code but also force them to test it rigorously.

In order to increase the interaction between traditional and computational historical linguists, we need to work hard on providing high-quality benchmark databases and high-quality tools for algorithm evaluation.

Not only benchmark databases are needed, but also cross-linguistic comparative databases that help historical linguists to assess the regularity and irregularity of patterns and proposals in a less intuitive way.

Thank You for Listening!
