Slide 1

Slide 1 text

Benchmark Databases in Historical Linguistics Johann-Mattis List Forschungszentrum Deutscher Sprachatlas Philipps-University Marburg 2014-10-07 1 / 30

Slide 2

Slide 2 text

Benchmark Basics 2 / 30

Slide 3

Slide 3 text

Benchmark Basics What are Benchmarks? What are Benchmark Databases? 3 / 30

Slide 4

Slide 4 text

Benchmark Basics What are Benchmarks? What are Benchmark Databases? The evaluation and comparison of programs dealing with quantitative tasks in historical linguistics requires a large number of accurate reference sets which can be used as test cases to identify the strong and weak points of the numerous programs. (Original Plagiarism) 3 / 30

Slide 5

Slide 5 text

Benchmark Basics What are Benchmarks? What are Benchmark Databases? The evaluation and comparison of programs dealing with quantitative tasks in historical linguistics requires a large number of accurate reference sets which can be used as test cases to identify the strong and weak points of the numerous programs. (Original Plagiarism) 3 / 30

Slide 6

Slide 6 text

Benchmark Basics What are Benchmarks? What are Benchmark Databases? A comprehensive evaluation and comparison of alignment programs requires a large number of accurate reference alignments which can be used as test cases. It has been shown (McClure et al., 1994) that the performance of alignment programs depends on the number of sequences, the degree of similarity between sequences and the number of insertions in the alignment. [...] We have constructed BAliBASE (Benchmark Alignment dataBASE) containing high-quality, documented alignments to identify the strong and weak points of the numerous alignment programs now available. (Thompson et al. 1998: 87) 3 / 30

Slide 7

Slide 7 text

Benchmark Basics Why do we need Benchmarks? Why do we need Benchmark Databases? 4 / 30

Slide 8

Slide 8 text

Benchmark Basics Why do we need Benchmarks? Why do we need Benchmark Databases? For comparative reasons, since otherwise, we won’t be able to really tell whether two independently proposed algorithms exhibit a similar performance or not. 4 / 30

Slide 9

Slide 9 text

Benchmark Basics Why do we need Benchmarks? Why do we need Benchmark Databases? For comparative reasons, since otherwise, we won’t be able to really tell whether two independently proposed algorithms exhibit a similar performance or not. For development reasons, since we should test our new algorithms on actual data in order to guarantee their applicability and accuracy. 4 / 30

Slide 10

Slide 10 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? 5 / 30

Slide 11

Slide 11 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are 5 / 30

Slide 12

Slide 12 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or 5 / 30

Slide 13

Slide 13 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or 5 / 30

Slide 14

Slide 14 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or not interested in enhancing and formalizing existing methods with help of computational approaches, 5 / 30

Slide 15

Slide 15 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or not interested in enhancing and formalizing existing methods with help of computational approaches, then you don’t need to care about benchmark databases at all... . 5 / 30

Slide 16

Slide 16 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or not interested in enhancing and formalizing existing methods with help of computational approaches, then you don’t need to care about benchmark databases at all... . But please don’t jump out of the room right now, I will make some jokes you shouldn’t miss in the end of the talk. . 5 / 30

Slide 17

Slide 17 text

Benchmark Basics Why should I care for Benchmarks? Why should I care for Benchmark Databases? Who said you need to care? If you are no historical linguist or dialectologist or typologist, or not interested in quantitative applications but prefer to do everything manually, or not interested in enhancing and formalizing existing methods with help of computational approaches, then you don’t need to care about benchmark databases at all... . But please don’t jump out of the room right now, I will make some jokes you shouldn’t miss in the end of the talk. . I promise! 5 / 30

Slide 18

Slide 18 text

Benchmark Critics 6 / 30

Slide 19

Slide 19 text

Benchmark Critics The Gold Standard Problem The Gold Standard Problem 7 / 30

Slide 20

Slide 20 text

Benchmark Critics The Gold Standard Problem The Gold Standard Problem What do you mean by “Gold Standard”? Is the data simulated? 7 / 30

Slide 21

Slide 21 text

Benchmark Critics The Gold Standard Problem The Gold Standard Problem What do you mean by “Gold Standard”? Is the data simulated? There is no such thing as a “Gold Standard”. 7 / 30

Slide 22

Slide 22 text

Benchmark Critics The Gold Standard Problem The Gold Standard Problem What do you mean by “Gold Standard”? Is the data simulated? There is no such thing as a “Gold Standard”. How can you make a “Gold Standard” if even historical linguists do not really have a clue what was going on? 7 / 30

Slide 23

Slide 23 text

Benchmark Critics The Tuning Problem The Tuning Problem 8 / 30

Slide 24

Slide 24 text

Benchmark Critics The Tuning Problem The Tuning Problem Why should I trust your algorithm if you tuned it with help of your “Gold Standard”? 8 / 30

Slide 25

Slide 25 text

Benchmark Critics The Tuning Problem The Tuning Problem Why should I trust your algorithm if you tuned it with help of your “Gold Standard”? Seriously, how can I trust your algorithm despite all the scores if it fails to find the correspondences between Armenian [jɛɾˈku] and German [t͜svai]?!? 8 / 30

Slide 26

Slide 26 text

Benchmark Critics The Creation Problem The Creation Problem 9 / 30

Slide 27

Slide 27 text

Benchmark Critics The Creation Problem ////////////////////////////// The Creation Problem 9 / 30

Slide 28

Slide 28 text

Benchmark Critics The Creation Problem Who Benchmarks the Benchmarks? 9 / 30

Slide 29

Slide 29 text

Benchmark Critics The Creation Problem Who Benchmarks the Benchmarks? Oh, you used the data of XYZ to create your benchmark. Why didn’t you create your own dataset? 9 / 30

Slide 30

Slide 30 text

Benchmark Critics The Creation Problem Who Benchmarks the Benchmarks? Oh, you used the data of XYZ to create your benchmark. Why didn’t you create your own dataset? Oh, you created the benchmark data YOURSELF. Why didn’t you use the excellent data by XYZ? 9 / 30

Slide 31

Slide 31 text

Benchmark Critics The Creation Problem Who Benchmarks the Benchmarks? Oh, you used the data of XYZ to create your benchmark. Why didn’t you create your own dataset? Oh, you created the benchmark data YOURSELF. Why didn’t you use the excellent data by XYZ? How could you use the data of XYZ as the basis for your “Gold Standard” for Germanic languages, given that EVERYONE knows that their reconstruction of Proto-Nostratic is complete nonsense? 9 / 30

Slide 32

Slide 32 text

Stay away from my bench- marks... In Defense of Benchmarks 10 / 30

Slide 33

Slide 33 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics 11 / 30

Slide 34

Slide 34 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1995). 11 / 30

Slide 35

Slide 35 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1995). This does not mean that what we know is just a “wissenschaftliche Fiktion” (Schmidt 1872). There are good reasons to be confident that our traditional methods are better than random guesses. 11 / 30

Slide 36

Slide 36 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1995). This does not mean that what we know is just a “wissenschaftliche Fiktion” (Schmidt 1872). There are good reasons to be confident that our traditional methods are better than random guesses. Although we should be careful in writing Indo-European fables, there are also good reasons to be confident that our methods uncover some kind of reality (cf. Saussure’s prediction of laryngeals in 1879). 11 / 30

Slide 37

Slide 37 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1995). This does not mean that what we know is just a “wissenschaftliche Fiktion” (Schmidt 1872). There are good reasons to be confident that our traditional methods are better than random guesses. Although we should be careful in writing Indo-European fables, there are also good reasons to be confident that our methods uncover some kind of reality (cf. Saussure’s prediction of laryngeals in 1879). If it was not for these reasons that give us confidence in our traditional methods, then why should we bother pursuing them? 11 / 30

Slide 38

Slide 38 text

In Defense of Benchmarks Knowledge in Historical Linguistics Knowledge in Historical Linguistics Our knowledge of the past is a construct in the sense used in psychology (cf. Cronbach and Meehl 1995). This does not mean that what we know is just a “wissenschaftliche Fiktion” (Schmidt 1872). There are good reasons to be confident that our traditional methods are better than random guesses. Although we should be careful in writing Indo-European fables, there are also good reasons to be confident that our methods uncover some kind of reality (cf. Saussure’s prediction of laryngeals in 1879). If it was not for these reasons that give us confidence in our traditional methods, then why should we bother pursuing them? As long as we are confident that our traditional methods work more or less, we can use our traditional knowledge to compile benchmark databases and to test, how “good” automatic methods work. 11 / 30

Slide 39

Slide 39 text

In Defense of Benchmarks Training and Testing Training and Testing 12 / 30

Slide 40

Slide 40 text

In Defense of Benchmarks Training and Testing Training and Testing It is impossible to write an algorithm without training it. 12 / 30

Slide 41

Slide 41 text

In Defense of Benchmarks Training and Testing Training and Testing It is impossible to write an algorithm without training it. When testing an algorithm on the same data on which it was trained, this bears the danger of “overfitting” the algorithm. 12 / 30

Slide 42

Slide 42 text

In Defense of Benchmarks Training and Testing Training and Testing It is impossible to write an algorithm without training it. When testing an algorithm on the same data on which it was trained, this bears the danger of “overfitting” the algorithm. The fact that benchmark databases often serve both to train and to test the algorithms may be considered as problematic. Nevertheless, this is no real problem of benchmarks, but of the way how benchmarks are handled by those who write and train the programs. 12 / 30

Slide 43

Slide 43 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards 13 / 30

Slide 44

Slide 44 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards Ideally, a benchmark database in historical linguistics serves as a standard for 13 / 30

Slide 45

Slide 45 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards Ideally, a benchmark database in historical linguistics serves as a standard for training new algorithms, 13 / 30

Slide 46

Slide 46 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards Ideally, a benchmark database in historical linguistics serves as a standard for training new algorithms, testing new algorithms, and 13 / 30

Slide 47

Slide 47 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards Ideally, a benchmark database in historical linguistics serves as a standard for training new algorithms, testing new algorithms, and defining common formats used by the algorithms. 13 / 30

Slide 48

Slide 48 text

In Defense of Benchmarks Benchmarks and Standards Benchmarks and Standards Ideally, a benchmark database in historical linguistics serves as a standard for training new algorithms, testing new algorithms, and defining common formats used by the algorithms. A benchmark database can be much more than a simple database. It can help to initiate the standardization of formats for data exchange and storage and thus force those who design new algorithms to comply with them. 13 / 30

Slide 49

Slide 49 text

Benchmarks Online 14 / 30

Slide 50

Slide 50 text

Benchmarks Online BDPA Benchmark Database for Phonetic Alignments 15 / 30

Slide 51

Slide 51 text

Benchmarks Online BDPA Benchmark Database for Phonetic Alignments 15 / 30

Slide 52

Slide 52 text

Benchmarks Online BDPA Benchmark Database for Phonetic Alignments http://alignments.lingpy.org 15 / 30

Slide 53

Slide 53 text

Benchmarks Online BDCD Benchmark Database for Cognate Detection 16 / 30

Slide 54

Slide 54 text

Benchmarks Online BDCD Benchmark Database for Cognate Detection 16 / 30

Slide 55

Slide 55 text

Benchmarks Online BDCD Benchmark Database for Cognate Detection http://sequencecomparison.github.io 16 / 30

Slide 56

Slide 56 text

Benchmarks Online BDLR Benchmark Database for Linguistic Reconstruction 17 / 30

Slide 57

Slide 57 text

Benchmarks Online BDLR Benchmark Database for Linguistic Reconstruction (?) 17 / 30

Slide 58

Slide 58 text

Benchmarks Online BDLR Benchmark Database for Linguistic Reconstruction (?) Some data is already there, but it needs to be cleaned, referenced, linked, and checked before publication. Since the data should be provided in form of multiple alignments, the alignments for all cognate sets compared to the proto-forms need to be manually checked. For further data, cooperations are planned (some of the collaborators do not yet know that they are among those lucky ones that have been chosen...). 17 / 30

Slide 59

Slide 59 text

Benchmark Evaluation 18 / 30

Slide 60

Slide 60 text

Benchmark Evaluation Evaluation Scores Evaluation Scores 19 / 30

Slide 61

Slide 61 text

Benchmark Evaluation Evaluation Scores Evaluation Scores English w o l - d e m o r t Russian v - l a d i m i r - Chinese f u - - d i m o t e English w o l d e m o r t Russian v - l a d i m i r - - Chinese f u - - d i m o - t e 8 / 11 = 0.72 8 / 10 = 0.8 8 / 10.5 = 0.76 19 / 30

Slide 62

Slide 62 text

Benchmark Evaluation Evaluation Scores Evaluation Scores English w o l - d e m o r t Russian v - l a d i m i r - Chinese f u - - d i m o t e XXX XXX XXX XXX XXX XXX XXX XXX XXX XXX English w o l - d e m o r t - Russian v - l a d i m i r - - Chinese f u - - d i m o - t e 8 / 11 = 0.72 8 / 10 = 0.8 8 / 10.5 = 0.76 19 / 30

Slide 63

Slide 63 text

Benchmark Evaluation Evaluation Scores Evaluation Scores English w o l - d e m o r t Russian v - l a d i m i r - Chinese f u - - d i m o t e XXX XXX XXX XXX XXX XXX XXX XXX XXX XXX English w o l - d e m o r t - Russian v - l a d i m i r - - Chinese f u - - d i m o - t e 8 / 11 = 0.72 8 / 10 = 0.8 8 / 10.5 = 0.76 19 / 30

Slide 64

Slide 64 text

Benchmark Evaluation Evaluation Scores Evaluation Scores English w o l - d e m o r t Russian v - l a d i m i r - Chinese f u - - d i m o t e English w o l d e m o r t Russian v - l a d i m i r - - Chinese f u - - d i m o - t e 8 / 11 = 0.72 8 / 10 = 0.8 8 / 10.5 = 0.76 19 / 30

Slide 65

Slide 65 text

Benchmark Evaluation Evaluation Scores Evaluation Scores r - t e w o l d e m o r t f u - - d i m o - t e w o l d e m o r t v - l a d i m i r - - - - v - l a d i m i r - - f u - - d i m o - t e w o l d e m o f u - - d i m o w o l d e m o v - l a d i m i - v - l a d i m i f u - - d i m o r t t e r t r - English Russian English Russian Russian Chinese Russian Chinese English Chinese English Chinese 19 / 30

Slide 66

Slide 66 text

Benchmark Evaluation Evaluation Scores Evaluation Scores r - t e 25 / 30 = 0.83 25 / 33 = 0.76 25 / 31.5 = 0.80 w o l d e m o r t f u - - d i m o - t e w o l d e m o r t v - l a d i m i r - - - - v - l a d i m i r - - f u - - d i m o - t e w o l d e m o f u - - d i m o w o l d e m o v - l a d i m i - v - l a d i m i f u - - d i m o r t t e r t r - - 19 / 30

Slide 67

Slide 67 text

Benchmark Evaluation Evaluation Scores Evaluation Scores o l u - o l v - l v - l f u - r - t e 26 / 30 = 0.87 26 / 33 = 0.79 26 / 31.5 = 0.83 w o l d e m o r t f u - - d i m o - t e w o l d e m o r t v - l a d i m i r - - - - v - l a d i m i r - - f u - - d i m o - t e d e m o - d i m o d e m o a d i m i - a d i m i - d i m o r t t e r t r - - 19 / 30

Slide 68

Slide 68 text

Benchmark Evaluation Evaluation Scores Evaluation Scores Choosing useful evaluation scores is essential for the eva- luation of a given algorithm. Standardization is of crucial im- portance here, since this is the only way to guarantee the comparability of alternative approaches. While for some tasks (alignment analyses, cognate detec- tion), proper evaluation scores are well-established (cf. the overview in List 2014), evaluation scores for other tasks (bor- rowing detection, linguistics reconstruction) are largely un- explored. Those who provide benchmark databases should also offer formal ways and code to evaluate algorithm performance. 19 / 30

Slide 69

Slide 69 text

Benchmark Evaluation Evaluation Scores Evaluation Scores: LingPy 20 / 30

Slide 70

Slide 70 text

Benchmark Evaluation Evaluation Scores Evaluation Scores: LingPy 20 / 30

Slide 71

Slide 71 text

Benchmark Evaluation Evaluation Scores Evaluation Scores: LingPy http://lingpy.org 20 / 30

Slide 72

Slide 72 text

Benchmark Evaluation Evaluation Tools Evaluation Tools “graben” (30) Turchin Levensht. REFERENCE. Albanisch gërmon gərmo 1 1 1 Englisch digs dɪg 2 2 2 Französisch creuse krøze 1 3 3 Deutsch gräbt graːb 1 1 4 Hawaii ‘eli ʔeli 5 5 5 Navajo hahashgééd hahageːd 6 6 6 Türkisch kazıyor kaz 7 3 7 21 / 30

Slide 73

Slide 73 text

Benchmark Evaluation Evaluation Tools Evaluation Tools “graben” (30) Turchin Levensht. REFERENCE. Albanisch gërmon gərmo 1 1 1 Englisch digs dɪg 2 2 2 Französisch creuse krøze 1 3 3 Deutsch gräbt graːb 1 1 4 Hawaii ‘eli ʔeli 5 5 5 Navajo hahashgééd hahageːd 6 6 6 Türkisch kazıyor kaz 7 3 7 21 / 30

Slide 74

Slide 74 text

Benchmark Evaluation Evaluation Tools Evaluation Tools “Mund” (104) Turchin Levensth. REFERENCE. Albanisch gojë goj 1 1 1 Englisch mouth mauθ 2 2 2 Französisch bouche buʃ 3 3 3 Deutsch Mund mund 4 4 2 Hawaii waha waha 5 5 5 Navajo ’azéé’ zeːʔ 6 6 6 Türkisch ağız aɣz 7 7 7 21 / 30

Slide 75

Slide 75 text

Benchmark Evaluation Evaluation Tools Evaluation Tools “Mund” (104) Turchin Levensth. REFERENCE. Albanisch gojë goj 1 1 1 Englisch mouth mauθ 2 2 2 Französisch bouche buʃ 3 3 3 Deutsch Mund mund 4 4 2 Hawaii waha waha 5 5 5 Navajo ’azéé’ zeːʔ 6 6 6 Türkisch ağız aɣz 7 7 7 21 / 30

Slide 76

Slide 76 text

Benchmark Evaluation Evaluation Tools Evaluation Tools So far, there are no real tools for the evaluation of the re- sults of automatic approaches. Nevertheless, if we want to increase the interaction between manual and automatic ap- proaches in historical linguistics, it seems worthwhile to in- vest in proper tools for the expert evaluation of algorithms. 21 / 30

Slide 77

Slide 77 text

BENCHMARKS BEYOND Beyond Benchmarks 22 / 30

Slide 78

Slide 78 text

Beyond Benchmarks Computer-Aided Historical Linguistics Computer-Aided Historical Linguistics So far, the majority of computational approaches in histori- cal linguistics largely disregards the actual needs of histori- cal linguistics. Despite the frequent claims that the algorith- ms are intended to supplement traditional research, many of them are mere attempts to prove the power of modern machine learning approaches and completely disregard the achievements of traditional research in historical linguistics. 23 / 30

Slide 79

Slide 79 text

Beyond Benchmarks Computer-Aided Historical Linguistics Computer-Aided Historical Linguistics If we really want to make a difference with computational ap- proaches and not simply seek to replace every expert who likes books with a computer or abacus, we need to work much, much harder, on a real integration of computational and traditional approaches. 23 / 30

Slide 80

Slide 80 text

Beyond Benchmarks Computer-Aided Historical Linguistics Computer-Aided Historical Linguistics P(A|B)=(P(B|A)P(A))/(P(B) FRANZ BOPP VERY, VERY LONG TITLE 23 / 30

Slide 81

Slide 81 text

Beyond Benchmarks Computer-Aided Historical Linguistics Computer-Aided Historical Linguistics P(A|B)=(P(B|A)P(A))/(P(B) FRANZ BOPP VERY, VERY LONG TITLE 23 / 30

Slide 82

Slide 82 text

Beyond Benchmarks Computer-Aided Historical Linguistics Computer-Aided Historical Linguistics P(A|B)=(P(B|A)P(A))/(P(B) FRANZ BOPP VERY, VERY LONG TITLE Apart from “computational historical linguistics”, we need to establish a new discipline of “computer-aided historical linguistics”. Such a framework needs bench- mark databases (no wonder) and new standards, both for traditional and computational linguistics. However, such a framework will also need additional resources that help traditional approaches to leave the realm of intuition. 23 / 30

Slide 83

Slide 83 text

Beyond Benchmarks Semantic Change Semantic Change 24 / 30

Slide 84

Slide 84 text

Beyond Benchmarks Semantic Change Semantic Change It is beyond question that hypotheses in historical linguistics stand and fall with a proper treatment of semantic change. So far, however, we lack the cross-linguistic data to assess the plausibility of proposed patterns of semantic shift. There is, however, hope for improvement: 24 / 30

Slide 85

Slide 85 text

Beyond Benchmarks Semantic Change Semantic Change It is beyond question that hypotheses in historical linguistics stand and fall with a proper treatment of semantic change. So far, however, we lack the cross-linguistic data to assess the plausibility of proposed patterns of semantic shift. There is, however, hope for improvement: The Database of Semantic Shifts (DatSemShifs, Burlak et al. http://semshifts.iling-ran.ru/) offers a constantly increasing amount of patterns of semantic shifts, drawn from the linguistic literature. Shifts are categorized, tagged for meanings, and – where accessible – directions. In the future, this may turn into a very valuable resource for those interested in semantic change. 24 / 30

Slide 86

Slide 86 text

Beyond Benchmarks Semantic Change Semantic Change It is beyond question that hypotheses in historical linguistics stand and fall with a proper treatment of semantic change. So far, however, we lack the cross-linguistic data to assess the plausibility of proposed patterns of semantic shift. There is, however, hope for improvement: The Database of Semantic Shifts (DatSemShifs, Burlak et al. http://semshifts.iling-ran.ru/) offers a constantly increasing amount of patterns of semantic shifts, drawn from the linguistic literature. Shifts are categorized, tagged for meanings, and – where accessible – directions. In the future, this may turn into a very valuable resource for those interested in semantic change. The Database of Cross-Linguistic Collexifications (CLICS, List et al., http://clics.lingpy.org) offers collections of colexifications (“polysemy”) patterns in some 200 languages. The data has been crawled semi-automatically from existing sources like the Intercontinental Dictionary Series (IDS, Key & Comrie 2007, http://lingweb.eva.mpg.de/ids/) and automatically cleaned and tagged for colexification. The automatic handling without proper checking of the data are a drawback which needs to be handled in the future. A strong aspect of the database are the visualizations of colexifications using up-to-date JavaScript libraries. 24 / 30

Slide 87

Slide 87 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Correspondences 25 / 30

Slide 88

Slide 88 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Correspondences That not all sounds are equally likely to occur in correspondence relation in historically related words has been noted by many linguists in the past. 25 / 30

Slide 89

Slide 89 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Correspondences That not all sounds are equally likely to occur in correspondence relation in historically related words has been noted by many linguists in the past. However, only a few linguists have ever tried to substantiate this claim with data (Dolgopolsky 1964, Brown et al. 2013). 25 / 30

Slide 90

Slide 90 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Correspondences That not all sounds are equally likely to occur in correspondence relation in historically related words has been noted by many linguists in the past. However, only a few linguists have ever tried to substantiate this claim with data (Dolgopolsky 1964, Brown et al. 2013). The most notable resource known to me is the one by Brown et al. (2013), who report statistics based on ASJP data. The drawbacks of this approach are the limited number of symbols in ASJP code and the fact that identity relations are not covered. The advantages are the strictness of the procedure and the large amount of data that the analysis is based upon. 25 / 30

Slide 91

Slide 91 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Change 26 / 30

Slide 92

Slide 92 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Change So far, the only online resource known to me is the web-based platform for Diachronic Data and Models (DiaDM, http://www.diadm.ish-lyon.cnrs.fr) which offers a database on Diachronic Universals (UniDia). Unfortunately, the database has been under construction for almost two years now, and no real progress regarding the presentation of the data has been visible so far. 26 / 30

Slide 93

Slide 93 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Change So far, the only online resource known to me is the web-based platform for Diachronic Data and Models (DiaDM, http://www.diadm.ish-lyon.cnrs.fr) which offers a database on Diachronic Universals (UniDia). Unfortunately, the database has been under construction for almost two years now, and no real progress regarding the presentation of the data has been visible so far. If the numbers presented on the UniDia website are true (10 349 sound changes in 302 languages), it contains an invaluable resource on known sound changes. 26 / 30

Slide 94

Slide 94 text

Beyond Benchmarks Sound Change and Sound Correspondences Sound Change So far, the only online resource known to me is the web-based platform for Diachronic Data and Models (DiaDM, http://www.diadm.ish-lyon.cnrs.fr) which offers a database on Diachronic Universals (UniDia). Unfortunately, the database has been under construction for almost two years now, and no real progress regarding the presentation of the data has been visible so far. If the numbers presented on the UniDia website are true (10 349 sound changes in 302 languages), it contains an invaluable resource on known sound changes. Unfortunately, it is not evident from the website, how (or if) this data will be made public in the future, and whether it can ever be used to either train our algorithms, or to provide our experts with something more than intuition. 26 / 30

Slide 95

Slide 95 text

Beyond Benchmarks Phylogenetic Reconstruction Phylogenetic Reconstruction 27 / 30

Slide 96

Slide 96 text

Beyond Benchmarks Phylogenetic Reconstruction Phylogenetic Reconstruction It is probably needless to say that with MultiTree (http://multitree.org), Ethnologue (Lewis & Fennig 2014, http://ethnologue.com), and GlottoLog (v2.3, Hammarström et al. 2014, http://glottolog.org), a large number of expert classification is publicly available. 27 / 30

Slide 97

Slide 97 text

Beyond Benchmarks Phylogenetic Reconstruction Phylogenetic Reconstruction It is probably needless to say that with MultiTree (http://multitree.org), Ethnologue (Lewis & Fennig 2014, http://ethnologue.com), and GlottoLog (v2.3, Hammarström et al. 2014, http://glottolog.org), a large number of expert classification is publicly available. What is lacking in quite a few current approaches to phylogenetic reconstruction is a proper evaluation of the algorithms that makes rigorously use of these resources. Just eyeballing a tree and claiming, that some method “reproduces expert classifications” based on some strange criterion, is simply not enough. 27 / 30

Slide 98

Slide 98 text

Beyond Benchmarks Borrowing Detection Borrowing Detection 28 / 30

Slide 99

Slide 99 text

Beyond Benchmarks Borrowing Detection Borrowing Detection Some databases, like the Indo-European Lexial Cognacy Databse (IELex, http://ielex.mpi.nl/, Dunn et al. 2012) list borrowings along with their sources. 28 / 30

Slide 100

Slide 100 text

Beyond Benchmarks Borrowing Detection Borrowing Detection Some databases, like the Indo-European Lexial Cognacy Databse (IELex, http://ielex.mpi.nl/, Dunn et al. 2012) list borrowings along with their sources. However, given the fact that in many areas of the world (and also in Indo-European) our knowledge in historical linguistics starts to reach its limits when it comes to distinguish borrowings from inherited words, it is quite likely that it is impossible at the moment to provide exhaustive test sets in which all borrowings are identified. 28 / 30

Slide 101

Slide 101 text

Conclusion Conclusion 29 / 30

Slide 102

Slide 102 text

Conclusion Conclusion Automatic approaches are constantly gaining ground in historical linguistics. 29 / 30

Slide 103

Slide 103 text

Conclusion Conclusion Automatic approaches are constantly gaining ground in historical linguistics. Nevertheless, the majority of the new approaches shows a great lack in transparency and applicability. 29 / 30

Slide 104

Slide 104 text

Conclusion Conclusion Automatic approaches are constantly gaining ground in historical linguistics. Nevertheless, the majority of the new approaches shows a great lack in transparency and applicability. One reason for this is the lack of benchmark databases in historical linguistics which help programmers to train their code but also force them to test it rigorously. 29 / 30

Slide 105

Slide 105 text

Conclusion Conclusion Automatic approaches are constantly gaining ground in historical linguistics. Nevertheless, the majority of the new approaches shows a great lack in transparency and applicability. One reason for this is the lack of benchmark databases in historical linguistics which help programmers to train their code but also force them to test it rigorously. In order to increase the interaction between traditional and computational historical linguists, we need to work hard on providing high-quality benchmark databases and high-quality tools for algorithm evaluation. 29 / 30

Slide 106

Slide 106 text

Conclusion Conclusion Automatic approaches are constantly gaining ground in historical linguistics. Nevertheless, the majority of the new approaches shows a great lack in transparency and applicability. One reason for this is the lack of benchmark databases in historical linguistics which help programmers to train their code but also force them to test it rigorously. In order to increase the interaction between traditional and computational historical linguists, we need to work hard on providing high-quality benchmark databases and high-quality tools for algorithm evaluation. Not only benchmark databases are needed, but also cross-linguistics comparative databases that help historical linguists to asses the regularity and irregularity of patterns and proposals in a less intuitive way. 29 / 30

Slide 107

Slide 107 text

Thank You for Listening! 30 / 30