Beyond Cognacy

Beyond Cognacy: Current Chances and Future Challenges of Automatic Cognate Detection in Historical Linguistics
Johann-Mattis List, Forschungszentrum Deutscher Sprachatlas, Philipps-University Marburg, 2014-09-17 1 / 30

Cognate Detection 2 / 30
word, Wort, слово, cuvînt, palabra, mot, adottszó, slovo, verbum, focal, 词, parola, λόγος, शब्द, ord

Cognate Detection: Traditional Approaches 3 / 30

Cognate Detection: Traditional Approaches: The Comparative Method 4 / 30
– proof of relationship
– identification of cognates
– identification of sound correspondences
– reconstruction of proto-forms
– internal classification

Cognate Detection: Traditional Approaches: Cognate Detection 5 / 30
Cognate List and Alignment:
German   dünn   d  ʏ   n
English  thin   θ  ɪ   n
German   Ding   d  ɪ   ŋ
English  thing  θ  ɪ   ŋ
German   dumm   d  ʊ   m
English  dumb   d  ʌ   m
German   Dorn   d  ɔɐ  n
English  thorn  θ  ɔː  n
Correspondence List:
GER  ENG  Frequ.
d    θ    3 x
d    d    1 x
n    n    2 x
m    m    1 x
ŋ    ŋ    1 x
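The correspondence list on this slide can be recomputed by counting which segments share a column in the aligned cognate pairs. The following is a minimal sketch, not the author's implementation: the names `ALIGNED_PAIRS`, `VOWELS`, and `count_correspondences` are hypothetical, and vowels are skipped only because the slide lists consonant correspondences.

```python
from collections import Counter

# Aligned German/English cognate pairs from the slide, one segment per column.
ALIGNED_PAIRS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ʊ", "m"], ["d", "ʌ", "m"]),    # dumm / dumb
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Assumption: the vowel segments of this toy data set, skipped when counting.
VOWELS = {"ʏ", "ɪ", "ʊ", "ʌ", "ɔɐ", "ɔː"}


def count_correspondences(pairs, skip=frozenset()):
    """Count how often two segments occupy the same alignment column."""
    counts = Counter()
    for german, english in pairs:
        if len(german) != len(english):
            raise ValueError("aligned sequences must have equal length")
        for ger_seg, eng_seg in zip(german, english):
            if ger_seg in skip or eng_seg in skip:
                continue
            counts[(ger_seg, eng_seg)] += 1
    return counts


correspondences = count_correspondences(ALIGNED_PAIRS, skip=VOWELS)
for (ger_seg, eng_seg), frequency in correspondences.most_common():
    print(ger_seg, eng_seg, f"{frequency} x")
```

Recurring pairs such as d : θ are the kind of evidence the comparative method relies on; a pair attested only once, such as d : d, carries less weight.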

Cognate Detection: Automatic Approaches 6 / 30
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

Cognate Detection: Automatic Approaches: Narrowing down the Task 7 / 30
Traditional Workflow (figure): DICTIONARIES, WORDLISTS, HISTORICAL SCENARIOS
Wordlist examples: HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk]
Scenario examples: *dent- (dente, dɑ̃, dɛnte); *tanθ (tuːθ, t͡saːn)

Cognate Detection: Automatic Approaches: Narrowing down the Task 7 / 30
Technical Workflow (figure): RAW DATA → Semantic Tagging → WORDLIST DATA → Tokenization → TOKENS, MORPHEMES → Cognate Detection → COGNATE SETS → Alignment Analysis → SOUND CORRESPONDENCES → Linguistic Reconstruction → PROTO-FORMS
Each stage is illustrated with the entries HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk].

Cognate Detection: Automatic Approaches: Narrowing down the Task 7 / 30
Technical Workflow
INPUT: Multilingual wordlist
→ semantically tagged
→ phonetically transcribed
→ tokenized into phonemes
OUTPUT: Multilingual wordlist
→ identified cognate entries assigned to clusters
→ identified cognate entries multiply aligned
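The input requirement "tokenized into phonemes" can be illustrated with a deliberately simple segmenter. This is a sketch under stated assumptions, not LingPy's tokenizer: the function name `tokenize_ipa` is hypothetical, and the only rules are that combining marks and modifier letters (such as the length mark ː) attach to the preceding symbol and that a tie bar also attaches the following symbol. Diphthongs and tone letters are not handled.

```python
import unicodedata

# Tie bars join the symbol before them with the symbol after them (as in t͡s).
TIE_BARS = {"\u0361", "\u035c"}


def tokenize_ipa(transcription):
    """Split an IPA string into segments using two simple attachment rules."""
    segments = []
    join_next = False
    for char in transcription:
        # Combining diacritics and modifier letters belong to the previous symbol.
        attaches = (
            unicodedata.combining(char) != 0
            or unicodedata.category(char) == "Lm"
        )
        if segments and (attaches or join_next):
            segments[-1] += char
        else:
            segments.append(char)
        # After a tie bar, the next symbol is part of the same segment.
        join_next = char in TIE_BARS
    return segments


for word in ["hænd", "triː", "tuːθ", "t\u0361saːn"]:
    print(word, tokenize_ipa(word))
```

A rule-based segmenter of this kind is only a starting point; real wordlists usually need language-specific decisions about affricates, diphthongs, and tones.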

Cognate Detection: Automatic Approaches: Algorithms 8 / 30
Basic Procedure for Multilingual Cognate Detection:
WORDLIST DATA → (PAIRWISE COMPARISON) → PAIRWISE DISTANCES BETWEEN WORDS → (COGNATE CLUSTERING) → COGNATE SETS
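The pairwise comparison step can be sketched with the normalized edit distance (NED), one of the baseline methods named later in the talk. This is an illustrative sketch: the function names are hypothetical, the words are split into single characters rather than properly tokenized, and the resulting values are not the ones shown in the distance matrix on the slides, which come from a different scoring method.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences of segments."""
    previous = list(range(len(seq_b) + 1))
    for i, seg_a in enumerate(seq_a, start=1):
        current = [i]
        for j, seg_b in enumerate(seq_b, start=1):
            current.append(min(
                previous[j] + 1,                     # deletion
                current[j - 1] + 1,                  # insertion
                previous[j - 1] + (seg_a != seg_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(seq_a, seq_b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(seq_a), len(seq_b))
    return edit_distance(seq_a, seq_b) / longest if longest else 0.0


def pairwise_distances(words):
    """Map every ordered pair of taxa to the NED between their words."""
    return {
        (taxon_a, taxon_b): normalized_edit_distance(word_a, word_b)
        for taxon_a, word_a in words.items()
        for taxon_b, word_b in words.items()
    }


# Three of the "woman" entries from the wordlist, split into characters.
WORDS = {
    "German": list("frau"),
    "Dutch": list("vrɑu"),
    "English": list("wʊmən"),
}
distances = pairwise_distances(WORDS)
print(distances[("German", "Dutch")], distances[("German", "English")])
```

NED treats all segment differences alike, which is why alignment-based scores that weight sound classes or correspondences tend to separate cognates from non-cognates more sharply.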

Cognate Detection: Automatic Approaches: Algorithms: Cognate Clustering 8 / 30
Pairwise distances:
                  Swedish  English  Danish  Norwegian  Dutch  German
                  kvinna   woman    kvinde  kvine      vrouw  Frau
Swedish    kvina  0.00     0.69     0.07    0.12       0.71   0.78
English    wumin  0.69     0.00     0.66    0.57       0.68   0.87
Danish     kveni  0.07     0.66     0.00    0.08       0.67   0.71
Norwegian  kwini  0.12     0.57     0.08    0.00       0.75   0.74
Dutch      frou   0.71     0.68     0.67    0.75       0.00   0.17
German     frau   0.78     0.87     0.71    0.74       0.17   0.00
Tree figure (leaf order): German Frau (frau), Dutch vrouw (vrou), English woman (wumin), Danish kvinde (kveni), Swedish kvinna (kvina), Norwegian kvine (kwini)
Wordlist (the CogID column is the output of the analysis):
ID   Taxa       Word    Gloss  GlossID  IPA     CogID
...  ...        ...     ...    ...      ...     ...
21   German     Frau    woman  20       frau    1
22   Dutch      vrouw   woman  20       vrɑu    1
23   English    woman   woman  20       wʊmən   2
24   Danish     kvinde  woman  20       kvenə   3
25   Swedish    kvinna  woman  20       kviːna  3
26   Norwegian  kvine   woman  20       kʋinə   3
...  ...        ...     ...    ...      ...     ...
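The step from the distance matrix to cognate sets can be sketched as flat average-linkage clustering with a cutoff. This is a minimal illustration, not necessarily the procedure used for the slides: the function name `flat_average_linkage` and the threshold of 0.5 are assumptions chosen for this example.

```python
from itertools import combinations

TAXA = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]

# Distance matrix for "woman" as shown on the slide (same order as TAXA).
MATRIX = [
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]


def flat_average_linkage(matrix, threshold):
    """Merge the closest clusters until the best average distance reaches the threshold."""
    clusters = [[index] for index in range(len(matrix))]

    def linkage(cluster_a, cluster_b):
        total = sum(matrix[i][j] for i in cluster_a for j in cluster_b)
        return total / (len(cluster_a) * len(cluster_b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        first, second = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[first], clusters[second]) >= threshold:
            break
        clusters[first] = clusters[first] + clusters[second]
        del clusters[second]  # safe: second is always greater than first
    return clusters


# Assumed cutoff of 0.5, chosen only for this illustration.
cognate_sets = [
    [TAXA[index] for index in cluster]
    for cluster in flat_average_linkage(MATRIX, threshold=0.5)
]
print(cognate_sets)
```

With this cutoff the three sets agree with the CogID column of the wordlist: German and Dutch, English alone, and the three Scandinavian words. The cutoff is the critical parameter: a value above 0.64 would merge English with the Scandinavian set.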

Cognate Detection: Automatic Approaches: Algorithms 8 / 30
LexStat Algorithm (List 2014), workflow figure with the labels: INPUT, TOKENIZATION, PREPROCESSING, CORRESPONDENCE DETECTION USING PHONETIC ALIGNMENT (LOOP: ATTESTED DISTRIBUTION, EXPECTED DISTRIBUTION), LOG-ODDS, DISTANCE CALCULATION, COGNATE CLUSTERING, OUTPUT

Cognate Detection: Problems 9 / 30

Cognate Detection: Problems: Applicability 10 / 30
Method                        Multilingual?  No additional requirements?  Freely available?
Mackay & Kondrak 2005         ✗              ✓                            ✗
Bergsma & Kondrak 2007        ✓              ✓                            ✗
Turchin et al. 2010           ✓              ✓                            ✓
Berg-Kirkpatrick & Klein 2011 ✗              ✓                            ✗
Hauer & Kondrak 2011          ✓              ✓                            ✗
Steiner et al. 2011           ✓              ✓                            ✗
List 2012 & 2014              ✓              ✓                            ✓
Beinborn et al. 2013          ✗              ?                            ✗
Bouchard-Côté et al. 2013     ✓              ✗                            ✗
Rama 2013                     ✗              ✓                            ✗
Ciobanu & Dinu 2014           ✗              ✓                            ✗
…                             …              …                            …

Cognate Detection: Problems: Transparency 11 / 30
Results are often only reported as evaluation scores.
Examples for individual cognate judgments are rare.
Supplementary data
– is often lacking, or
– not given in a human-readable form.
→ The results show a great lack of transparency.

Cognate Detection: Problems: Comparability 12 / 30
Test sets (benchmarks) vary greatly.
Often, only subsets of Dyen et al. (1992) are used.
→ It is difficult to compare the performance of the methods.

Cognate Detection: Problems: Accuracy 13 / 30
Evaluation criteria are not very intuitive and vary greatly.
It is difficult to communicate the results to traditional linguists.
→ Many linguists regard automatic cognate detection as
– “impossible per se”, or
– as useful as “rolling a dice”.

Chances 14 / 30

Chances: Applicability 15 / 30
PyPI, GitHub, SourceForge, GoogleCode, CPAN, CTAN, JSAN, PEAR, LaunchPad
It has never been easier to publish and maintain code.

Chances: Applicability: LingPy 16 / 30
What is LingPy?
– Python library for automatic tasks in historical linguistics
– project homepage: http://lingpy.org
– code base: https://github.com/lingpy/lingpy
– supports Python2 and Python3
– works on Mac, Linux, and (basically also) Windows
– current release: 2.3
What does LingPy offer?
– tokenization of phonetic sequences
– phonetic alignment analyses (List 2012a)
– automatic cognate detection (Turchin et al. 2010, List 2012b)
– automatic borrowing detection (List et al. 2014)
– basic routines for the evaluation of automatic methods
– plotting routines for interactive visualizations

Chances: Transparency 17 / 30

Chances: Transparency: Interactive Presentation of Results 18 / 30
Alignments offer a unique perspective on the results of cognate detection analyses.
JavaScript and HTML5 offer unique ways for interactive data visualization.
At the moment, we are developing JavaScript tools that
– visualize phonetic alignments of cognate sets, and
– even allow the data to be edited online.

Chances: Comparability 19 / 30

Chances: Comparability: Benchmark Databases for Historical Linguistics 20 / 30
First benchmark databases have been compiled and published:
– Benchmark Database of Phonetic Alignments (BDPA, List & Prokić 2014, http://alignments.lingpy.org)
– Benchmark Database for Cognate Detection (BDCD, presented in List 2014, http://sequencecomparison.github.io)
– Benchmark Database for Linguistic Reconstruction (BDLR, in preparation)
All data is given in phonetic transcriptions (IPA), tokenized into phonemic units, freely available for download, and can be directly used in LingPy.

Chances: Accuracy 21 / 30

Chances: Accuracy: Performance of Cognate Detection Algorithms 22 / 30
Figure: B-Cubed F-Scores on the BDCD Benchmark (List 2014) for the methods Turchin, NED, SCA, and LexStat (axis range 60 to 95).
Datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese Dialects)
Highlighted values: 75%, 93%, 92%, 81%, 89%, 81%
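B-Cubed scores compare a predicted clustering with a reference clustering item by item: for each word, precision is the share of its predicted cluster that truly belongs with it, and recall is the share of its reference cluster that was recovered. The sketch below is a generic implementation of that definition, not the evaluation code behind the figure; the function name `b_cubed` and the predicted labels are hypothetical, and the reference labels reuse the CogID values from the "woman" example.

```python
def b_cubed(reference, predicted):
    """Return B-Cubed precision, recall, and F-score for two clusterings.

    Both arguments map each item to a cluster label.
    """
    def members_by_label(labels):
        members = {}
        for item, label in labels.items():
            members.setdefault(label, set()).add(item)
        return members

    reference_sets = members_by_label(reference)
    predicted_sets = members_by_label(predicted)

    precision = recall = 0.0
    for item in reference:
        true_cluster = reference_sets[reference[item]]
        guessed_cluster = predicted_sets[predicted[item]]
        overlap = len(true_cluster & guessed_cluster)
        precision += overlap / len(guessed_cluster)
        recall += overlap / len(true_cluster)

    precision /= len(reference)
    recall /= len(reference)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Reference: cognate IDs from the "woman" wordlist.
REFERENCE = {"German": 1, "Dutch": 1, "English": 2,
             "Danish": 3, "Swedish": 3, "Norwegian": 3}
# Hypothetical prediction that wrongly lumps English with Scandinavian.
PREDICTED = {"German": "A", "Dutch": "A", "English": "B",
             "Danish": "B", "Swedish": "B", "Norwegian": "B"}

precision, recall, f_score = b_cubed(REFERENCE, PREDICTED)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")
```

Lumping lowers precision while leaving recall untouched, and splitting does the reverse, so the F-score penalizes both kinds of error.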

Challenges 23 / 30

Challenges: Within Cognacy 24 / 30
We need to enhance our
– lexical databases (amount and quality of data),
– cognate detection algorithms (accessibility and performance), and
– ways to present the results (interactive visualizations).

Challenges: Beyond Cognacy 25 / 30
Alignment, "MOON":
German   m  oː  n  t  -
English  m  uː  n  -  -
Danish   m  ɔː  n  -  ə
Swedish  m  oː  n  -  e
Alignment, "MOON" + "SHINE" + "LIGHT" (five columns each):
Fúzhōu     ŋ  u  o  ʔ  ⁵   -  -  -  -  -   -  -  -  -  -
Měixiàn    ŋ  i  a  t  ⁵   -  -  -  -  -   k  u  o  ŋ  ⁴⁴
Guǎngzhōu  j  -  y  t  ²   l  -  œ  ŋ  ²²  -  -  -  -  -
Běijīng    -  y  ɛ  -  ⁵¹  l  i  ɑ  ŋ  -   -  -  -  -  -
Figure: tree of Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng, annotated with the events INNOVATION (five times), BORROWING, and LOSS.

Challenges: Beyond Cognacy: Lexical Change 26 / 30
Three Dimensions of Lexical Change (Gévaudan 2007): SEMANTIC CHANGE, MORPHOLOGICAL CHANGE, STRATIC CHANGE
                                             Continuity
Relation                       Biolog. Term  Stratic  Morphological  Semantic
traditional notion of cognacy  -             +        +/-            +/-
cognacy à la Swadesh           -             +        +/-            +
automatic cognate detection    -             +/-      +/-            +
direct cognate relation        orthology     +        +              +
oblique cognate relation       paralogy      +        -              +
etymological relation          homology      +/-      +/-            +/-
oblique etymological relation  xenology      -        +/-            +/-

Challenges: Beyond Cognacy: Inferring Lexical Change Scenarios 27 / 30
In order to go beyond cognacy, we need methods for
– borrowing detection (stratic aspect),
– partial cognate inference (morphological aspect), and
– cross-semantic cognate inference (semantic aspect).
Following the lead of evolutionary biology, these methods should be combined under a unified framework of tree reconciliation (Page & Cotton 2002) in historical linguistics.

Challenges: Beyond Cognacy: Tree Reconciliation 28 / 30
Figure: trees of Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng, annotated with the events LOSS, INNOVATION (twice), and BORROWING.
General Workflow for the Inference of Lexical Change Scenarios: PHYLOGENETIC RECONSTRUCTION, COGNATE (= HOMOLOG) DETECTION, COGNATE TREE RECONCILIATION

Conclusion 29 / 30
– Automatic cognate detection is still in its infancy, yet the child is constantly growing.
– Enhancing the applicability, transparency, comparability, and accuracy of cognate detection methods is a goal that can be achieved in the near future.
– The greatest challenge arises from the complexity of lexical change processes.
– More realistic approaches that go beyond cognacy should be able to handle variation along the stratic, the morphological, and the semantic dimensions of lexical change.
– Evolutionary biology offers frameworks that could be employed to achieve these goals, yet it is not entirely clear whether and how this is possible.

Thank You for Listening! 30 / 30
