Slide 1

Slide 1 text

Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual Wordlists
Johann-Mattis List, Philippe Lopez, and Eric Bapteste

Slide 2

Slide 2 text

Introduction

Slide 3

Slide 3 text

Keys to the Past
By comparing the languages of the world, we gain invaluable insights
● into the past of the languages spoken in the world
● into the history of ancestral populations
● into human prehistory in general

Slide 4

Slide 4 text

Keys to the Past
In order to compare the languages of the world,
● we need to prove that two or more languages are genetically related by
● identifying elements they have inherited from their common ancestors

Slide 5

Slide 5 text

Keys to the Past
Having identified these cognate elements (usually words and morphemes),
● we can calculate phylogenetic trees and networks,
● we can reconstruct the words in the unattested ancestral languages, and
● we can try to learn more about these language families (when they existed, how they developed, etc.)

Slide 6

Slide 6 text

Finding Cognate Words
Increasing amounts of digital data on the languages of the world necessitate the use of automatic methods for language comparison, but unfortunately
● available methods work well on small language families with moderate time depths,
● but they completely fail when it comes to the detection of words which are only partially cognate

Slide 7

Slide 7 text

Finding Cognate Words
“moon” in Germanic languages
English   moon
German    Mond
Dutch     maan
Swedish   måne

Slide 8

Slide 8 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon                      Fúzhōu    ŋuoʔ⁵
German    Mond                      Měixiàn   ŋiat⁵ kuoŋ⁴⁴
Dutch     maan                      Wēnzhōu   ȵy²¹ kuɔ³⁵ vai¹³
Swedish   måne                      Běijīng   yɛ⁵¹ liɑŋ¹

Slide 9

Slide 9 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon                      Fúzhōu    ŋuoʔ⁵
German    Mond                      Měixiàn   ŋiat⁵ kuoŋ⁴⁴
Dutch     maan                      Wēnzhōu   ȵy²¹ kuɔ³⁵ vai¹³
Swedish   måne                      Běijīng   yɛ⁵¹ liɑŋ¹

Slide 10

Slide 10 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon                      Fúzhōu    ŋuoʔ⁵
German    Mond                      Měixiàn   ŋiat⁵ kuoŋ⁴⁴
Dutch     maan                      Wēnzhōu   ȵy²¹ kuɔ³⁵ vai¹³
Swedish   måne                      Běijīng   yɛ⁵¹ liɑŋ¹

Slide 11

Slide 11 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon                      Fúzhōu    moon
German    Mond                      Měixiàn   moon shine
Dutch     maan                      Wēnzhōu   moon shine suff.
Swedish   måne                      Běijīng   moon gloss

Slide 12

Slide 12 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon                      Fúzhōu    moon
German    Mond                      Měixiàn   moon shine
Dutch     maan                      Wēnzhōu   moon shine suff.
Swedish   måne                      Běijīng   moon gloss
So far, no algorithm can detect these shared similarities across words in language families like Sino-Tibetan, Austro-Asiatic, Tai-Kadai, etc.

Slide 13

Slide 13 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon   ?                  Fúzhōu    moon               ?
German    Mond   ?                  Měixiàn   moon shine         ?
Dutch     maan   ?                  Wēnzhōu   moon shine suff.   ?
Swedish   måne   ?                  Běijīng   moon gloss         ?
But not only are our algorithms not capable of detecting the structures: we also have huge problems actually using this information in phylogenetic tree reconstruction and related tasks.

Slide 14

Slide 14 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon   1                  Fúzhōu    moon               ?
German    Mond   1                  Měixiàn   moon shine         ?
Dutch     maan   1                  Wēnzhōu   moon shine suff.   ?
Swedish   måne   1                  Běijīng   moon gloss         ?
But not only are our algorithms not capable of detecting the structures: we also have huge problems actually using this information in phylogenetic tree reconstruction and related tasks.

Slide 15

Slide 15 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon   1                  Fúzhōu    moon               1
German    Mond   1                  Měixiàn   moon shine         1
Dutch     maan   1                  Wēnzhōu   moon shine suff.   1
Swedish   måne   1                  Běijīng   moon gloss         1
Most algorithms require binary (yes/no) cognate decisions as input. But given the data for the Chinese dialects, should we
1. label them all as cognate words, as they share one element?
2. label them all as different, as their strings all differ?
loose coding of partial cognates

Slide 16

Slide 16 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon   1                  Fúzhōu    moon               1
German    Mond   1                  Měixiàn   moon shine         2
Dutch     maan   1                  Wēnzhōu   moon shine suff.   3
Swedish   måne   1                  Běijīng   moon gloss         4
Most algorithms require binary (yes/no) cognate decisions as input. But given the data for the Chinese dialects, should we
1. label them all as cognate words, as they share one element?
2. label them all as different, as their strings all differ?
strict coding of partial cognates

Slide 17

Slide 17 text

Finding Cognate Words
“moon” in Germanic languages        “moon” in Chinese dialects
English   moon   1                  Fúzhōu    moon               1
German    Mond   1                  Měixiàn   moon shine         1 2
Dutch     maan   1                  Wēnzhōu   moon shine suff.   1 2 3
Swedish   måne   1                  Běijīng   moon gloss         1 4
Ideally, we label them exactly as they are, as the exact coding enables us to switch to strict or loose coding automatically.
exact coding of partial cognates
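
To make the last point concrete, here is a minimal Python sketch (with function names of our own choosing) of how strict and loose binary codings can be derived automatically from the exact, morpheme-level coding; the input mirrors the Chinese dialect column above:

    def strict_coding(partial_ids):
        """Words count as cognate only if their complete morpheme ID sequences match."""
        classes = {}
        return [classes.setdefault(tuple(ids), len(classes) + 1) for ids in partial_ids]

    def loose_coding(partial_ids):
        """Words count as cognate if they are linked by any shared morpheme ID."""
        labels = list(range(len(partial_ids)))
        def find(i):                       # union-find with path compression
            while labels[i] != i:
                labels[i] = labels[labels[i]]
                i = labels[i]
            return i
        for i, ids_i in enumerate(partial_ids):
            for j, ids_j in enumerate(partial_ids[:i]):
                if set(ids_i) & set(ids_j):
                    labels[find(i)] = find(j)   # merge the two classes
        return [find(i) + 1 for i in range(len(partial_ids))]

    # the Chinese dialect column from this slide
    chinese = [[1], [1, 2], [1, 2, 3], [1, 4]]
    print(strict_coding(chinese))  # [1, 2, 3, 4] -> strict coding
    print(loose_coding(chinese))   # [1, 1, 1, 1] -> loose coding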

Slide 18

Slide 18 text

Materials

Slide 19

Slide 19 text

New Gold Standards
Dataset            Bai         Chinese              Tujia
Languages          9           18                   5
Words              1028        3653                 513
Concepts           110         180                  109
Strict Cognates    285         1231                 247
Partial Cognates   309         1408                 348
Sounds             94          122                  57
Source             Wang 2006   Běijīng Dàxué 1964   Starostin 2013

Slide 20

Slide 20 text

New Gold Standards
Dataset            Bai         Chinese              Tujia
Languages          9           18                   5
Words              1028        3653                 513
Concepts           110         180                  109
Strict Cognates    285         1231                 247
Partial Cognates   309         1408                 348
Sounds             94          122                  57
Source             Wang 2006   Běijīng Dàxué 1964   Starostin 2013
This is the first time that a valid gold standard has been created for the task of partial cognate detection!

Slide 21

Slide 21 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)

Slide 22

Slide 22 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Language Identifier

Slide 23

Slide 23 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Concept Identifier

Slide 24

Slide 24 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Phonetic Transcription
Segmented Transcription

Slide 25

Slide 25 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Cognate Identifier

Slide 26

Slide 26 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Cognate Identifier

Slide 27

Slide 27 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Partial Cognate ID

Slide 28

Slide 28 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Partial Cognate ID

Slide 29

Slide 29 text

New Gold Standards
Text file of the data (originally CSV format), visualized with the help of the EDICTOR tool (http://edictor.digling.org)
Partial Cognate ID
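
Putting the column walk-through together, a hedged sketch of reading such a wordlist file with Python's standard csv module; the column headers used below (DOCULECT, CONCEPT, TOKENS, COGID, COGIDS) are our assumption about the file and should be checked against the actual data:

    import csv

    def read_wordlist(path):
        """Read the tab-separated wordlist into a list of dictionaries."""
        rows = []
        with open(path, encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                rows.append({
                    "language": row["DOCULECT"],           # language identifier
                    "concept": row["CONCEPT"],             # concept identifier
                    "segments": row["TOKENS"].split(),     # segmented transcription
                    "cognate_id": row["COGID"],            # (full) cognate identifier
                    "partial_ids": row["COGIDS"].split(),  # one ID per morpheme
                })
        return rows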

Slide 30

Slide 30 text

Methods

Slide 31

Slide 31 text

Workflow
1. compute pairwise sequence similarities between all morphemes of all words in the same meaning slot in a wordlist
2. create a similarity network in which nodes represent morphemes and edges represent similarities between the morphemes
3. use an algorithm for network partitioning to cluster the nodes of the network into groups of cognate morphemes
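
A compact sketch of the whole workflow over one wordlist; the helpers (morpheme_similarity, build_morpheme_network, partition_network) are hypothetical names of ours, corresponding to the three steps and sketched on the following slides:

    from collections import defaultdict

    def detect_partial_cognates(wordlist, morpheme_similarity,
                                build_morpheme_network, partition_network):
        """Sketch of the three-step workflow; the step implementations are passed in."""
        by_concept = defaultdict(list)
        for row in wordlist:
            # group the morpheme-segmented words by their meaning slot
            by_concept[row["concept"]].append(row["morphemes"])
        cognate_sets = {}
        for concept, words in by_concept.items():
            # steps 1 + 2: pairwise morpheme similarities, filtered into a network
            nodes, edges = build_morpheme_network(words, morpheme_similarity)
            # step 3: partition the network into groups of cognate morphemes
            cognate_sets[concept] = partition_network(nodes, edges)
        return cognate_sets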

Slide 32

Slide 32 text

Sequence Similarity (Step 1)
1. language-independent measures
   a. scores depend only on the strings that are compared
   b. any further information, like recurring similarities of sounds (sound correspondences), is ignored
2. language-specific measures
   a. previously identified regularities between languages are used to create a scoring function
   b. alignment algorithms use the scoring function to evaluate word similarity
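
As an illustration of a purely language-independent measure, a minimal sketch of a normalized edit-distance similarity over segmented morphemes (a toy function of our own; the SCA and LexStat measures on the next slide are more refined):

    def normalized_edit_similarity(seq_a, seq_b):
        """Similarity in [0, 1] derived from the edit distance of two segment lists."""
        m, n = len(seq_a), len(seq_b)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return 1.0 - dist[m][n] / max(m, n)

    # the score depends only on the two strings, not on the languages compared
    print(normalized_edit_similarity(["m", "oː", "n"], ["m", "aː", "n"]))  # ≈ 0.67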

Slide 33

Slide 33 text

Sequence Similarity (Step 1)
1. language-independent measures
   ● SCA algorithm (List 2012a & 2014)
2. language-specific measures
   ● LexStat algorithm (List 2012b & 2014)
All algorithms are implemented as part of the LingPy software package (http://lingpy.org, List and Forkel 2016, version 2.5).
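
For full (word-level) cognate detection, the two measure types are typically invoked through LingPy's LexStat class roughly as below; the input file name and the thresholds are placeholders, and the exact arguments should be checked against the LingPy 2.5 documentation:

    from lingpy import LexStat

    lex = LexStat("wordlist.tsv")   # placeholder path to a LingPy wordlist file

    # language-independent: cluster words with SCA-based distances only
    lex.cluster(method="sca", threshold=0.45, ref="scaid")

    # language-specific: infer sound-correspondence scores first (LexStat),
    # then cluster with the resulting language-specific scoring function
    lex.get_scorer(runs=1000)
    lex.cluster(method="lexstat", threshold=0.6, ref="lexstatid")

    lex.output("tsv", filename="wordlist-cognates")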

Slide 34

Slide 34 text

Sequence Similarity Networks (Step 2)
● sequence similarity networks are tools for exploratory data analysis (Méheust et al. 2016, Corel et al. 2016)
● sequences (gene sequences in biology, words in linguistics) represent nodes in a network
● weighted edges represent similarities between the nodes

Slide 35

Slide 35 text

Sequence Similarity Networks (Step 2)
Filtering edges drawn between sequences:
a. draw no edges between morphemes in the same word
b. in each word pair, link each morpheme only to one other morpheme (choose the most similar pair)
c. only draw edges whose similarity exceeds a certain threshold
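
A minimal sketch of filtering rules (a)-(c), assuming the words of one concept are already split into morphemes and that similarity is a pairwise measure as in Step 1; function and variable names are ours, and rule (b) is read here as linking each morpheme to its single best counterpart in the other word:

    from itertools import combinations

    def build_morpheme_network(words, similarity, threshold=0.45):
        """words: list of morpheme lists, one entry per word of the same concept.
        Returns nodes (word index, morpheme index) and weighted, filtered edges."""
        edges = {}
        # (a) no edges within a word: only morphemes of different words are compared
        for (i, word_a), (j, word_b) in combinations(enumerate(words), 2):
            best = {}
            for a, morph_a in enumerate(word_a):
                for b, morph_b in enumerate(word_b):
                    score = similarity(morph_a, morph_b)
                    # (c) only keep links whose similarity exceeds the threshold
                    if score < threshold:
                        continue
                    # (b) per word pair, keep only the best match for each morpheme
                    if score > best.get(a, (None, -1.0))[1]:
                        best[a] = (b, score)
            for a, (b, score) in best.items():
                edges[(i, a), (j, b)] = score
        nodes = [(i, a) for i, word in enumerate(words) for a in range(len(word))]
        return nodes, edges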

Slide 36

Slide 36 text

Network Partitioning (Step 3)
1. a flat version of UPGMA (Sokal and Michener 1958), which terminates when a user-defined threshold is reached
2. Markov Clustering (van Dongen 2000) uses techniques for matrix multiplication to inflate and expand the edge weights in a given network
3. Infomap (Rosvall and Bergstrom 2008) was designed for community detection in complex networks and uses random walks to partition a network into communities
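
As an example of the third option, a sketch that hands the filtered network to the Infomap implementation of the python-igraph package (mentioned again under Implementation below); the graph construction assumes the node/edge format of the previous sketch:

    import igraph

    def partition_network(nodes, edges):
        """Partition a weighted morpheme network into cognate sets via Infomap."""
        index = {node: k for k, node in enumerate(nodes)}
        graph = igraph.Graph()
        graph.add_vertices(len(nodes))
        graph.add_edges([(index[u], index[v]) for (u, v) in edges])
        communities = graph.community_infomap(edge_weights=list(edges.values()))
        # map every morpheme back to the identifier of its community (cognate set)
        return {node: communities.membership[index[node]] for node in nodes}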

Slide 37

Slide 37 text

Workflow Example

Slide 38

Slide 38 text

Workflow Example

Slide 39

Slide 39 text

Workflow Example

Slide 40

Slide 40 text

Workflow Example

Slide 41

Slide 41 text

Results

Slide 42

Slide 42 text

Analyses and Evaluation
● all analyses require user-defined thresholds
● since our gold-standard data is too small to split into test and training sets, we carried out an exhaustive evaluation with a large range of thresholds, varying between 0.05 and 0.95 in steps of 0.05
● B-Cubed scores (Bagga and Baldwin 1998) were used as the evaluation measure, since they have been shown to yield sensible results (Hauer and Kondrak 2011)
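
For reference, B-Cubed scores compare, for every item, the cluster it was assigned to with its gold-standard class and average the resulting precision and recall; a minimal sketch of the computation (our own helper, not the exact code used for the evaluation):

    def b_cubed(predicted, gold):
        """predicted, gold: dicts mapping every item to a cluster label.
        Returns B-Cubed precision, recall, and F-score averaged over all items."""
        items = list(gold)
        precision = recall = 0.0
        for item in items:
            pred_cluster = {i for i in items if predicted[i] == predicted[item]}
            gold_cluster = {i for i in items if gold[i] == gold[item]}
            overlap = len(pred_cluster & gold_cluster)
            precision += overlap / len(pred_cluster)
            recall += overlap / len(gold_cluster)
        precision /= len(items)
        recall /= len(items)
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, f_score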

Slide 43

Slide 43 text

Analyses and Evaluation
● we tested two classical methods for cognate detection (SCA and LexStat) against their refined variants sensitive to partial cognates (SCA-Partial, LexStat-Partial)
● since SCA and LexStat yield full cognate judgments, we need to convert the partial (exact) cognate judgments to full cognate accounts, using the criterion of full identity (the strict coding shown before)
● we also tested the accuracy of SCA-Partial and LexStat-Partial on partial cognacy, but cannot compare these scores with other algorithms

Slide 44

Slide 44 text

Implementation
The code was implemented in Python as part of the LingPy library (version 2.5, List and Forkel 2016, http://lingpy.org). The igraph software package (Csárdi and Nepusz 2006) is needed to apply the Infomap algorithm.
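
For the partial variants, LingPy 2.5 provides a Partial class built on top of LexStat; the sketch below follows the usage shown in the LingPy documentation, but the method names, thresholds, and input file are assumptions that should be verified against the installed version:

    from lingpy.compare.partial import Partial

    part = Partial("wordlist.tsv")   # placeholder; words need morpheme segmentation

    # LexStat-Partial: infer language-specific scores, then cluster the morphemes
    part.get_partial_scorer(runs=1000)
    part.partial_cluster(method="lexstat", threshold=0.6,
                         cluster_method="infomap", ref="cogids")

    part.output("tsv", filename="wordlist-partial-cognates")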

Slide 45

Slide 45 text

Results: General

Slide 46

Slide 46 text

Results: General

Slide 47

Slide 47 text

Results: SCA vs. LexStat

Slide 48

Slide 48 text

Results: Datasets

Slide 49

Slide 49 text

Discussion

Slide 50

Slide 50 text

Discussion
● our method is a pilot approach for the detection of partial cognates in multilingual word lists
● further improvements are needed
● we should test on additional datasets (new language families) and increase our data (testing and training)
● our approach can be easily adjusted to employ different string similarity measures or partitioning algorithms: let’s try and see whether alternative measures can improve upon our current version

Slide 51

Slide 51 text

Code and data at: https://github.com/lingpy/partial-cognate-detection or https://zenodo.org/record/51328?ln=en

Slide 52

Slide 52 text

Thanks for your attention!
Code and data at: https://github.com/lingpy/partial-cognate-detection or https://zenodo.org/record/51328?ln=en