Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Using sequence similarity networks to identify ...

Using sequence similarity networks to identify partial cognates in multililngual wordlists

Talk held at the Annual Meeting of the Association of Computational Linguistics (2016/08/07-12, Berlin, Association of Computational Linguistics)

Johann-Mattis List

August 10, 2016
Tweet

More Decks by Johann-Mattis List

Other Decks in Science

Transcript

  1. Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual

    Wordlists Johann-Mattis List, Philippe Lopez, and Eric Bapteste
  2. Keys to the Past by comparing the languages of the

    world, we gain invaluable insights • into the past of the languages spoken in the world • into the history of ancestral populations • into human prehistory in general
  3. Keys to the Past in order to compare the languages

    in the world • we need to prove that two or more languages are genetically related by • identifying elements they have inherited from their common ancestors
  4. Keys to the Past having identified these cognate elements (usually

    words and morphemes) • we can calculate phylogenetic trees and networks, • we can reconstruct the words in the unattested ancestral languages, and • we can try to learn more about these language families (when they existed, how they developed, etc.)
  5. Finding Cognate Words increasing amounts of digital data of the

    languages in the world necessitate the use of automatic methods for language comparison, but unfortunately • available methods work well on small language families with moderate time depths, • but they completely fail when it comes to the detection of words which are only partially cognate
  6. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon Fúzhōu ŋuoʔ⁵ German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴ Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³ Swedish måne Běijīng yɛ⁵¹ liɑŋ¹
  7. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon Fúzhōu ŋuoʔ⁵ German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴ Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³ Swedish måne Běijīng yɛ⁵¹ liɑŋ¹
  8. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon Fúzhōu ŋuoʔ⁵ German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴ Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³ Swedish måne Běijīng yɛ⁵¹ liɑŋ¹
  9. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon Fúzhōu moon German Mond Měixiàn moon shine Dutch maan Wēnzhōu moon shine suff. Swedish måne Běijīng moon gloss
  10. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon Fúzhōu moon German Mond Měixiàn moon shine Dutch maan Wēnzhōu moon shine suff. Swedish måne Běijīng moon gloss So far, no algorithm can detect these shared similarities across words in language families like Sino-Tibetan, Austro-Asiatic, Tai-Kadai, etc.
  11. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon ? Fúzhōu moon ? German Mond ? Měixiàn moon shine ? Dutch maan ? Wēnzhōu moon shine suff. ? Swedish måne ? Běijīng moon gloss ? But not only are our algorithms not capable of detecting the structures: We also have huge problems to actually use this information in phylogenetic tree reconstruction and related tasks.
  12. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon 1 Fúzhōu moon ? German Mond 1 Měixiàn moon shine ? Dutch maan 1 Wēnzhōu moon shine suff. ? Swedish måne 1 Běijīng moon gloss ? But not only are our algorithms not capable of detecting the structures: We also have huge problems to actually use this information in phylogenetic tree reconstruction and related tasks.
  13. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon 1 Fúzhōu moon 1 German Mond 1 Měixiàn moon shine 1 Dutch maan 1 Wēnzhōu moon shine suff. 1 Swedish måne 1 Běijīng moon gloss 1 Most algorithms require binary (yes/no) cognate decisions as input. But given the data for Chinese dialects, should we 1. label them all as cognate words, as they share one element? 2. label them all as different, as their strings all differ? loose coding of partial cognates
  14. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon 1 Fúzhōu moon 1 German Mond 1 Měixiàn moon shine 2 Dutch maan 1 Wēnzhōu moon shine suff. 3 Swedish måne 1 Běijīng moon gloss 4 Most algorithms require binary (yes/no) cognate decisions as input. But given the data for Chinese dialects, should we 1. label them all as cognate words, as they share one element? 2. label them all as different, as their strings all differ? strict coding of partial cognates
  15. Finding Cognate Words “moon” in Germanic languages “moon” in Chinese

    dialects English moon 1 Fúzhōu moon 1 German Mond 1 Měixiàn moon shine 1 2 Dutch maan 1 Wēnzhōu moon shine suff. 1 2 3 Swedish måne 1 Běijīng moon gloss 1 4 Ideally, we label them exactly as they are, as the exact coding enables us to switch to strict or loose coding automatically. exact coding of partial cognates
  16. New Gold Standards Dataset Bai Chinese Tujia Languages 9 18

    5 Words 1028 3653 513 Concepts 110 180 109 Strict Cognates 285 1231 247 Partial Cognates 309 1408 348 Sounds 94 122 57 Source Wang 2006 Běijīng Dàxué 1964 Starostin 2013
  17. New Gold Standards Dataset Bai Chinese Tujia Languages 9 18

    5 Words 1028 3653 513 Concepts 110 180 109 Strict Cognates 285 1231 247 Partial Cognates 309 1408 348 Sounds 94 122 57 Source Wang 2006 Běijīng Dàxué 1964 Starostin 2013 This is the first time that a valid gold standard was created for the task of partial cognate detection!
  18. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org)
  19. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Language Identifier
  20. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Concept Identifier
  21. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Phonetic Transcription Segmented Transcription
  22. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Cognate Identifier
  23. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Cognate Identifier
  24. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Partial Cognate ID
  25. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Partial Cognate ID
  26. New Gold Standards Text file of the data (originally CSV

    format), visualized with help of the EDICTOR tool (http://edictor.digling.org) Partial Cognate ID
  27. Workflow 1. compute pairwise sequence similarities between all morphemes of

    all words in the same meaning slot in a wordlist 2. create a similarity network in which nodes represent morphemes and edges represent similarities between the morphemes 3. use an algorithm for network partitioning to cluster the nodes of the network into groups of cognate morphemes
  28. Sequence Similarity (Step 1) 1. language-independent measures a. scores depend

    on the strings that are compared only b. any further information, like recurring similarities of sounds (sound correspondences) are ignored 2. language-specific measures a. previously identified regularities between languages are used to create a scoring function b. alignment algorithms use the scoring function to evaluate word similarity
  29. Sequence Similarity (Step 1) 1. language-independent measures • SCA algorithm

    (List 2012a & 2014) 2. Language-specific measures • LexStat algorithm (List 2012b & 2014) All algorithms are implemented as part of the LingPy software package (http://lingpy.org, List and Forkel 2016, version 2.5).
  30. Sequence Similarity Networks (Step 2) • sequence similarity networks are

    tools for exploratory data analysis (Méheust et al. 2016, Corel et al. 2016) • sequences (gene sequences in biology, words in linguistics) represent nodes in a network • weighted edges represent similarities between the nodes
  31. Sequence Similarity Networks (Step 2) filtering edges drawn between sequences:

    a. draw no edges between morphemes in the same word b. in each word pair, link each morpheme only to one other morpheme (choose the most similar pair) c. only draw edges whose similarity exceeds a certain threshold
  32. Network Partitioning (Step 3) 1. flat version of UGPMA (Sokal

    and Michener 1958) which terminates when a user-defined threshold is reached 2. Markov Clustering (van Dongen 2000) uses techniques for matrix multiplication to inflate and expand the edge weights in a given network 3. Infomap (Rosvall and Bergstrom 2008) was designed for community detection in complex networks and uses random walks to partition a network into communities
  33. Analyses and Evaluation • all analyses require user-defined thresholds •

    since our gold-standard data is too small to split it into test and training set, we carried out an exhaustive evaluation with a large range of thresholds varying between 0.05 and 0.95 in steps of 0.05 • B-Cubed scores (Bagga and Baldwin 1998) were used as evaluation measure, since they have been shown to yield sensible results (Hauer and Kondrak 2011)
  34. Analyses and Evaluation • we tested two classical methods for

    cognate detection (SCA and LexStat) against their refined variants sensitive for partial cognates (SCA-Partial, LexStat-Partial) • since SCA and LexStat yield full cognate judgments, we need to convert the partial (exact) cognate judgments to full cognate accounts, using the criterion of full identity (strict encoding as shown before) • we tested also the accuracy of SCA-Partial and LexStat-Partial on partial cognacy, but cannot compare these scores with other algorithms
  35. Implementation The code was implemented in Python, as part of

    the LingPy library (Version 2.5, List and Forkel (2016), http://lingpy.org). The igraph software package (Csárdi and Nepusz 2006) is needed to apply the Infomap algorithm.
  36. Discussion • our method is a pilot approach for the

    detection of partial cognates in multilingual word lists • further improvements are needed • we should test on additional datasets (new language families) and increase our data (testing and training) • our approach can be easily adjusted to employ different string similarity measures or partitioning algorithms: let’s try and see whether alternative measures can improve upon our current version