Beyond Cognacy: Current Chances and Future Challenges of Automatic Cognate Detection in Historical Linguistics
Johann-Mattis List
Forschungszentrum Deutscher Sprachatlas, Philipps-University Marburg
2014-09-17
1 / 30
Cognate Detection
[Word cloud of the word for "word" in several languages: word, Wort, слово, cuvînt, palabra, mot, adottszó, slovo, verbum, focal, 词, parola, λόγος, शब्द, ord]
2 / 30
Cognate Detection: Traditional Approaches: The Comparative Method
[Illustration: book cover, "Franz Bopp: Very, Very Long Title"]
– proof of relationship
– identification of cognates
– identification of sound correspondences
– reconstruction of proto-forms
– internal classification
4 / 30
Cognate Detection: Traditional Approaches: Cognate Detection

Cognate List      Alignment
German  dünn      d  ʏ   n
English thin      θ  ɪ   n
German  Ding      d  ɪ   ŋ
English thing     θ  ɪ   ŋ
German  dumm      d  ʊ   m
English dumb      d  ʌ   m
German  Dorn      d  ɔɐ  n
English thorn     θ  ɔː  n

Correspondence List
GER  ENG  Freq.
d    θ    3 x
d    d    1 x
n    n    2 x
m    m    1 x
ŋ    ŋ    1 x
5 / 30
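The step from aligned cognates to a correspondence list can be sketched in a few lines of Python. This is only an illustration of the counting idea, not the method of any particular tool: the function name `count_correspondences` and its `positions` parameter are assumptions made for this example, and the alignments are typed in by hand from the slide.

```python
from collections import Counter

# Aligned German/English cognate pairs, copied from the slide.
# Each word is a list of segments; aligned segments share an index.
ALIGNED_PAIRS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ʊ", "m"], ["d", "ʌ", "m"]),    # dumm / dumb
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]


def count_correspondences(pairs, positions=None):
    """Count how often each pair of aligned segments occurs.

    `positions` (hypothetical helper argument) restricts the count to
    the given column indices; None counts every column.
    """
    counts = Counter()
    for left, right in pairs:
        if len(left) != len(right):
            raise ValueError("aligned sequences must have equal length")
        for index, (seg_a, seg_b) in enumerate(zip(left, right)):
            if positions is None or index in positions:
                counts[(seg_a, seg_b)] += 1
    return counts


# In this toy data the consonants sit in columns 0 and 2.
consonants = count_correspondences(ALIGNED_PAIRS, positions={0, 2})
for (ger, eng), freq in consonants.most_common():
    print(ger, eng, f"{freq} x")
```

Restricting the count to columns 0 and 2 reproduces the consonant rows of the correspondence list; calling the function without `positions` would also count the vowel pairs.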
Cognate Detection: Problems: Transparency
– Results are often only reported as evaluation scores.
– Examples of individual cognate judgments are rare.
– Supplementary data
  – is often lacking, or
  – is not given in a human-readable form.
→ The results show a great lack of transparency.
11 / 30
Cognate Detection: Problems: Comparability
– Test sets (benchmarks) vary greatly.
– Often, only subsets of Dyen et al. (1992) are used.
→ It is difficult to compare the performance of the methods.
12 / 30
Cognate Detection: Problems: Accuracy
– Evaluation criteria are not very intuitive and vary greatly.
– It is difficult to communicate the results to traditional linguists.
→ Many linguists regard automatic cognate detection as
  – “impossible per se”, or
  – as useful as “rolling a dice”.
13 / 30
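To make the notion of an evaluation score more concrete, here is a minimal sketch of one measure that can be used to compare an automatic cognate clustering with expert judgments: pairwise precision and recall. The slides do not say which measures are meant, so this choice is an assumption. The word labels `w1` to `w4` and both sets of cluster IDs are invented toy data, not results.

```python
from itertools import combinations


def cognate_pairs(clusters):
    """Return all unordered word pairs that carry the same cluster ID."""
    pairs = set()
    for word_a, word_b in combinations(sorted(clusters), 2):
        if clusters[word_a] == clusters[word_b]:
            pairs.add((word_a, word_b))
    return pairs


def pairwise_scores(gold, test):
    """Pairwise precision, recall, and F-score of `test` against `gold`.

    Both arguments map the same words to cluster IDs.
    """
    gold_pairs = cognate_pairs(gold)
    test_pairs = cognate_pairs(test)
    shared = gold_pairs & test_pairs
    precision = len(shared) / len(test_pairs) if test_pairs else 1.0
    recall = len(shared) / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented toy judgments: the expert groups w1, w2, w3 together,
# the algorithm groups w1 with w2 and w3 with w4.
gold = {"w1": 1, "w2": 1, "w3": 1, "w4": 2}
test = {"w1": 1, "w2": 1, "w3": 2, "w4": 2}

precision, recall, f_score = pairwise_scores(gold, test)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")
```

A score of this kind summarizes agreement in a single number but says nothing about which individual judgments differ, which is the transparency problem raised on the earlier slide.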
Chances: Applicability
PyPI, GitHub, SourceForge, GoogleCode, CPAN, CTAN, JSAN, PEAR, LaunchPad
It has never been easier to publish and maintain code...
15 / 30
Chances: Applicability: LingPy
What is LingPy?
– Python library for automatic tasks in historical linguistics
– project homepage: http://lingpy.org
– code base: https://github.com/lingpy/lingpy
– supports Python 2 and Python 3
– works on Mac, Linux, and (basically also) Windows
– current release: 2.3
16 / 30
Chances: Transparency: Interactive Presentation of Results
– Alignments offer a unique perspective on the results of cognate detection analyses.
– JavaScript and HTML5 offer unique ways of visualizing data interactively.
– We are currently developing JavaScript tools that
  – visualize phonetic alignments of cognate sets, and
  – even allow the data to be edited online.
18 / 30
Chances: Comparability: Benchmark Databases for Historical Linguistics
First benchmark databases have been compiled and published:
– Benchmark Database of Phonetic Alignments (BDPA, List & Prokić 2014, http://alignments.lingpy.org)
– Benchmark Database for Cognate Detection (BDCD, presented in List 2014, http://sequencecomparison.github.io)
– Benchmark Database for Linguistic Reconstruction (BDLR, in preparation)
All data is
– given in phonetic transcriptions (IPA),
– tokenized into phonemic units,
– freely available for download, and
– directly usable in LingPy.
20 / 30
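What "tokenized into phonemic units" means can be illustrated with a deliberately naive sketch. It is not LingPy's tokenizer and not the procedure used for the benchmark databases; the function name `tokenize_ipa` and the rule it applies (attach combining marks and modifier letters such as the length mark to the preceding symbol) are assumptions made for this example.

```python
import unicodedata


def tokenize_ipa(word):
    """Split an IPA string into segments (naive sketch).

    Combining marks (Unicode category Mn) and modifier letters
    (category Lm, e.g. the length mark "ː" or aspiration "ʰ") are
    attached to the preceding symbol. Diphthongs, affricates, and
    tone marks are NOT grouped by this simple rule.
    """
    tokens = []
    for char in word:
        attaches = unicodedata.category(char) in ("Mn", "Lm")
        if attaches and tokens:
            tokens[-1] += char
        else:
            tokens.append(char)
    return tokens


print(tokenize_ipa("θɔːn"))  # English "thorn" as transcribed on the slides
print(tokenize_ipa("dɪŋ"))   # German "Ding"
```

A real tokenizer needs language-specific knowledge, for example to keep the German diphthong in "Dorn" together as one unit, which is why pre-tokenized benchmark data is valuable.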
Challenges: Within Cognacy
We need to enhance our
– lexical databases (amount and quality of data),
– cognate detection algorithms (accessibility and performance), and
– ways to present the results (interactive visualizations).
24 / 30
Challenges: Beyond Cognacy

            "MOON"
German      m  oː  n  t  -
English     m  uː  n  -  -
Danish      m  ɔː  n  -  ə
Swedish     m  oː  n  -  e

            "MOON"            "SHINE"           "LIGHT"
Fúzhōu      ŋ  u  o  ʔ  ⁵     -  -  -  -  -     -  -  -  -  -
Měixiàn     ŋ  i  a  t  ⁵     -  -  -  -  -     k  u  o  ŋ  ⁴⁴
Guǎngzhōu   j  -  y  t  ²     l  -  œ  ŋ  ²²    -  -  -  -  -
Běijīng     -  y  ɛ  -  ⁵¹    l  i  ɑ  ŋ  -     -  -  -  -  -
25 / 30
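The Chinese dialect words above share only some of their morphemes, which a single yes/no cognate judgment cannot express. One way to represent this, sketched here under assumptions, is to label each morpheme separately and compare the label sets. The mapping of the slide's labels "MOON", "SHINE", and "LIGHT" onto three five-column blocks of the alignment is an interpretation of the slide, and the function name `cognacy` is hypothetical.

```python
from itertools import combinations

# Morpheme-level labels per dialect word for "moon", read off the
# alignment (assumed mapping: one label per five-column block).
MOON_WORDS = {
    "Fúzhōu": ["MOON"],
    "Měixiàn": ["MOON", "LIGHT"],
    "Guǎngzhōu": ["MOON", "SHINE"],
    "Běijīng": ["MOON", "SHINE"],
}


def cognacy(morphemes_a, morphemes_b):
    """Classify two words as 'full', 'partial', or 'none' cognates."""
    set_a, set_b = set(morphemes_a), set(morphemes_b)
    if not set_a & set_b:
        return "none"
    if set_a == set_b:
        return "full"
    return "partial"


relations = {
    (a, b): cognacy(MOON_WORDS[a], MOON_WORDS[b])
    for a, b in combinations(MOON_WORDS, 2)
}
for (a, b), relation in relations.items():
    print(f"{a} / {b}: {relation}")
```

Under this representation only Guǎngzhōu and Běijīng come out as fully cognate, while every other pair shares just the first morpheme.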
Challenges: Beyond Cognacy
[Figure: the four Chinese dialects Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng, annotated with the events "innovation" (five times), "borrowing", and "loss"]
25 / 30
Challenges: Beyond Cognacy: Lexical Change
Three Dimensions of Lexical Change (Gévaudan 2007):
– semantic change
– morphological change
– stratic change
26 / 30
Challenges: Beyond Cognacy: Inferring Lexical Change Scenarios
In order to go beyond cognacy, we need methods for
– borrowing detection (stratic aspect),
– partial cognate inference (morphological aspect), and
– cross-semantic cognate inference (semantic aspect).
Following the lead of evolutionary biology, these methods should be combined under a unified framework of tree reconciliation (Page & Cotton 2002) in historical linguistics.
27 / 30
Challenges: Beyond Cognacy: Tree Reconciliation
[Figure: General Workflow for the Inference of Lexical Change Scenarios, with the components phylogenetic reconstruction, cognate (= homolog) detection, and cognate tree reconciliation]
28 / 30
Conclusion
– Automatic cognate detection is still in its infancy, yet the child is constantly growing.
– Enhancing the applicability, transparency, comparability, and accuracy of cognate detection methods is a goal that can be achieved in the near future.
– The greatest challenge arises from the complexity of lexical change processes.
– More realistic approaches that go beyond cognacy should be able to handle variation along the stratic, the morphological, and the semantic dimensions of lexical change.
– Evolutionary biology offers frameworks that could be employed to achieve these goals, yet it is not entirely clear whether and how this is possible.
29 / 30