world, we gain invaluable insights
• into the past of the languages spoken in the world
• into the history of ancestral populations
• into human prehistory in general
in the world
• we need to prove that two or more languages are genetically related by
  • identifying elements they have inherited from their common ancestors
words and morphemes)
• we can calculate phylogenetic trees and networks,
• we can reconstruct the words in the unattested ancestral languages, and
• we can try to learn more about these language families (when they existed, how they developed, etc.)
languages in the world necessitate the use of automatic methods for language comparison, but unfortunately
• available methods work well on small language families with moderate time depths,
• they completely fail when it comes to the detection of words which are only partially cognate
Germanic            Chinese dialects
English   moon      Fúzhōu    moon
German    Mond      Měixiàn   moon shine
Dutch     maan      Wēnzhōu   moon shine suff.
Swedish   måne      Běijīng   moon gloss

So far, no algorithm can detect these shared similarities across words in language families like Sino-Tibetan, Austro-Asiatic, Tai-Kadai, etc.
Germanic            ID      Chinese dialects            ID
English   moon      1       Fúzhōu    moon              ?
German    Mond      1       Měixiàn   moon shine        ?
Dutch     maan      1       Wēnzhōu   moon shine suff.  ?
Swedish   måne      1       Běijīng   moon gloss        ?

But not only are our algorithms incapable of detecting these structures: we also have huge problems actually using this information in phylogenetic tree reconstruction and related tasks.
Most algorithms require binary (yes/no) cognate decisions as input. But given the data for the Chinese dialects, should we
1. label them all as cognate words, as they share one element?
2. label them all as different, as their strings all differ?

Loose coding of partial cognates:
Germanic            ID      Chinese dialects            ID
English   moon      1       Fúzhōu    moon              1
German    Mond      1       Měixiàn   moon shine        1
Dutch     maan      1       Wēnzhōu   moon shine suff.  1
Swedish   måne      1       Běijīng   moon gloss        1

Strict coding of partial cognates:
Germanic            ID      Chinese dialects            ID
English   moon      1       Fúzhōu    moon              1
German    Mond      1       Měixiàn   moon shine        2
Dutch     maan      1       Wēnzhōu   moon shine suff.  3
Swedish   måne      1       Běijīng   moon gloss        4
Exact coding of partial cognates:
Germanic            ID      Chinese dialects            IDs
English   moon      1       Fúzhōu    moon              1
German    Mond      1       Měixiàn   moon shine        1 2
Dutch     maan      1       Wēnzhōu   moon shine suff.  1 2 3
Swedish   måne      1       Běijīng   moon gloss        1 4

Ideally, we label them exactly as they are, since the exact coding enables us to switch to strict or loose coding automatically (see the sketch below).
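The switch from exact coding to the two binary codings can be done mechanically. Below is a minimal Python sketch (not the reference implementation): each word entry is given as the list of morpheme cognate IDs from one meaning slot, as in the table above.

```python
def strict_coding(exact_ids):
    """Strict coding: words are cognate only if their full ID sequences match."""
    classes = {}
    return [classes.setdefault(tuple(ids), len(classes) + 1) for ids in exact_ids]


def loose_coding(exact_ids):
    """Loose coding: words are cognate if they are linked by shared morpheme IDs
    (connected components, computed here with a simple union-find)."""
    parent = list(range(len(exact_ids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_word_with_id = {}
    for word, ids in enumerate(exact_ids):
        for morpheme_id in ids:
            if morpheme_id in first_word_with_id:
                parent[find(word)] = find(first_word_with_id[morpheme_id])
            else:
                first_word_with_id[morpheme_id] = word

    labels = {}
    return [labels.setdefault(find(word), len(labels) + 1)
            for word in range(len(exact_ids))]


# the Chinese dialect words from the table above:
chinese = [[1], [1, 2], [1, 2, 3], [1, 4]]   # Fúzhōu, Měixiàn, Wēnzhōu, Běijīng
print(strict_coding(chinese))   # [1, 2, 3, 4]  -> strict coding
print(loose_coding(chinese))    # [1, 1, 1, 1]  -> loose coding
```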
                    Dataset 1      Dataset 2              Dataset 3
Words               1028           3653                   513
Concepts            110            180                    109
Strict Cognates     285            1231                   247
Partial Cognates    309            1408                   348
Sounds              94             122                    57
Source              Wang 2006      Běijīng Dàxué 1964     Starostin 2013

This is the first time that a valid gold standard has been created for the task of partial cognate detection!
all words in the same meaning slot in a wordlist
2. create a similarity network in which nodes represent morphemes and edges represent similarities between the morphemes
3. use an algorithm for network partitioning to cluster the nodes of the network into groups of cognate morphemes (see the toy sketch below)
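The following is a toy end-to-end sketch of the three steps, not the authors' implementation: the real system uses the similarity measures and partitioning algorithms described below, while here difflib's string matcher and connected components merely serve as stand-ins.

```python
from difflib import SequenceMatcher
from itertools import combinations
import networkx as nx


def partial_cognates(words, threshold=0.5):
    """words: dict mapping a language to its morpheme list for one meaning slot."""
    # step 1: collect the morphemes of all words in the same meaning slot
    nodes = [(lang, i, m) for lang, morphemes in words.items()
             for i, m in enumerate(morphemes)]
    # step 2: build a similarity network over the morphemes
    graph = nx.Graph()
    graph.add_nodes_from(range(len(nodes)))
    for (a, (la, _, ma)), (b, (lb, _, mb)) in combinations(enumerate(nodes), 2):
        if la == lb:   # toy assumption: one word per language, so skip same-word pairs
            continue
        score = SequenceMatcher(None, ma, mb).ratio()
        if score >= threshold:
            graph.add_edge(a, b, weight=score)
    # step 3: partition the network into groups of cognate morphemes
    return [[nodes[i][:2] for i in component]
            for component in nx.connected_components(graph)]
```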
on the strings that are compared only
   b. any further information, like recurring similarities of sounds (sound correspondences), is ignored
2. language-specific measures
   a. previously identified regularities between languages are used to create a scoring function
   b. alignment algorithms use the scoring function to evaluate word similarity
(List 2012a & 2014)
2. Language-specific measures
   • LexStat algorithm (List 2012b & 2014)

All algorithms are implemented as part of the LingPy software package (http://lingpy.org, List and Forkel 2016, version 2.5).
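A hedged usage sketch of the two measures as exposed by LingPy 2.5 is given below; the file name 'wordlist.tsv' and the chosen thresholds are placeholders, and the input must follow LingPy's tab-separated wordlist format.

```python
from lingpy import LexStat

lex = LexStat('wordlist.tsv')   # placeholder file in LingPy's wordlist format

# language-independent measure: clustering based on SCA similarity
lex.cluster(method='sca', threshold=0.45, ref='scaid')

# language-specific measure: infer sound-correspondence-based scores first,
# then cluster with the LexStat method
lex.get_scorer(runs=1000)
lex.cluster(method='lexstat', threshold=0.6, ref='lexstatid')
```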
tools for exploratory data analysis (Méheust et al. 2016, Corel et al. 2016)
• sequences (gene sequences in biology, words in linguistics) represent nodes in a network
• weighted edges represent similarities between the nodes
a. draw no edges between morphemes in the same word
b. in each word pair, link each morpheme only to one other morpheme (choose the most similar pair)
c. only draw edges whose similarity exceeds a certain threshold (see the sketch below)
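A minimal sketch of the three constraints (a)–(c) follows; it is not the reference code. The pairwise_similarity function is assumed to return a score in [0, 1], and each word is given as (word_id, [morphemes]) for a single meaning slot.

```python
from itertools import combinations
import networkx as nx


def build_similarity_network(words, pairwise_similarity, threshold=0.55):
    graph = nx.Graph()
    graph.add_nodes_from((wid, i) for wid, morphemes in words
                         for i in range(len(morphemes)))
    # (a) morphemes of the same word are never linked: only word pairs are compared
    for (wid_a, ms_a), (wid_b, ms_b) in combinations(words, 2):
        for i, morpheme_a in enumerate(ms_a):
            if not ms_b:
                continue
            # (b) link each morpheme to at most one morpheme of the other word,
            #     namely its most similar counterpart
            best_score, best_j = max(
                (pairwise_similarity(morpheme_a, morpheme_b), j)
                for j, morpheme_b in enumerate(ms_b))
            # (c) only draw the edge if the similarity exceeds the threshold
            if best_score >= threshold:
                graph.add_edge((wid_a, i), (wid_b, best_j), weight=best_score)
    return graph
```

Note that linking every morpheme of the first word to its single best match in the second word is only one possible reading of constraint (b); a stricter variant would enforce a one-to-one matching between the two words.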
and Michener 1958), which terminates when a user-defined threshold is reached
2. Markov Clustering (van Dongen 2000) uses matrix multiplication techniques to inflate and expand the edge weights in a given network
3. Infomap (Rosvall and Bergstrom 2008) was designed for community detection in complex networks and uses random walks to partition a network into communities
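As an illustration of the third option, a hedged sketch using python-igraph's Infomap routine (community_infomap) is shown below; the nodes and weighted edges would come from the similarity network built in the previous step.

```python
import igraph


def infomap_partition(num_nodes, weighted_edges):
    """weighted_edges: list of (i, j, similarity) triples over node indices."""
    graph = igraph.Graph(n=num_nodes)
    graph.add_edges([(i, j) for i, j, _ in weighted_edges])
    graph.es['weight'] = [w for _, _, w in weighted_edges]
    clustering = graph.community_infomap(edge_weights='weight')
    # map every node index to the ID of its community (= cognate morpheme class)
    return {node: cid for cid, members in enumerate(clustering) for node in members}
```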
since our gold-standard data is too small to split into test and training sets, we carried out an exhaustive evaluation with a large range of thresholds varying between 0.05 and 0.95 in steps of 0.05
• B-Cubed scores (Bagga and Baldwin 1998) were used as the evaluation measure, since they have been shown to yield sensible results (Hauer and Kondrak 2011)
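For illustration, a self-contained B-Cubed computation (Bagga and Baldwin 1998) might look like the following; gold and test map each word to its cluster label, and cluster_at is a hypothetical stand-in for whichever detection method is evaluated at a given threshold. (LingPy's evaluation module also ships B-Cubed routines.)

```python
def bcubes(gold, test):
    """B-Cubed precision, recall, and F-score for two clusterings of the same items."""
    def average_overlap(first, second):
        scores = []
        for item in first:
            cluster_first = {x for x in first if first[x] == first[item]}
            cluster_second = {x for x in second if second[x] == second[item]}
            scores.append(len(cluster_first & cluster_second) / len(cluster_first))
        return sum(scores) / len(scores)

    precision = average_overlap(test, gold)
    recall = average_overlap(gold, test)
    return precision, recall, 2 * precision * recall / (precision + recall)


# exhaustive threshold sweep, 0.05 to 0.95 in steps of 0.05
# (cluster_at is a hypothetical stand-in for any of the detection methods)
# for threshold in [round(0.05 * k, 2) for k in range(1, 20)]:
#     test = cluster_at(threshold)
#     print(threshold, bcubes(gold, test))
```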
cognate detection (SCA and LexStat) against their refined variants sensitive to partial cognates (SCA-Partial, LexStat-Partial)
• since SCA and LexStat yield full cognate judgments, we need to convert the partial (exact) cognate judgments to full cognate judgments, using the criterion of full identity (strict encoding as shown before)
• we also tested the accuracy of SCA-Partial and LexStat-Partial on partial cognacy, but cannot compare these scores with the other algorithms
the LingPy library (version 2.5, List and Forkel 2016, http://lingpy.org). The igraph software package (Csárdi and Nepusz 2006) is needed to apply the Infomap algorithm.
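Assuming the LingPy 2.5 API is as documented, the partial variants can be invoked roughly as follows; the file name, thresholds, and reference column names are placeholders, and the Infomap option requires igraph to be installed.

```python
from lingpy.compare.partial import Partial

part = Partial('wordlist.tsv')   # placeholder file in LingPy's wordlist format

# SCA-Partial: language-independent scores
part.partial_cluster(method='sca', threshold=0.45, ref='scaids',
                     cluster_method='infomap')

# LexStat-Partial: infer language-specific scores first
part.get_partial_scorer(runs=1000)
part.partial_cluster(method='lexstat', threshold=0.6, ref='lexstatids',
                     cluster_method='infomap')
```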
detection of partial cognates in multilingual word lists
• further improvements are needed
• we should test on additional datasets (new language families) and increase our data (testing and training)
• our approach can be easily adjusted to employ different string similarity measures or partitioning algorithms: let's try and see whether alternative measures can improve upon our current version