Using sequence similarity networks to identify partial cognates in multilingual wordlists

Talk held at the Annual Meeting of the Association for Computational Linguistics (2016/08/07-12, Berlin)

Johann-Mattis List

August 10, 2016
Transcript

  1. Using Sequence Similarity Networks
    to Identify Partial Cognates in
    Multilingual Wordlists
    Johann-Mattis List, Philippe Lopez, and Eric Bapteste

  2. Introduction

  3. Keys to the Past
    by comparing the languages of the world, we gain invaluable
    insights
    ● into the past of the languages spoken in the world
    ● into the history of ancestral populations
    ● into human prehistory in general

  4. Keys to the Past
    in order to compare the languages of the world
    ● we need to prove that two or more languages are genetically related
    ● by identifying elements they have inherited from their common ancestors

  5. Keys to the Past
    having identified these cognate elements (usually words and
    morphemes)
    ● we can calculate phylogenetic trees and networks,
    ● we can reconstruct the words in the unattested ancestral
    languages, and
    ● we can try to learn more about these language families
    (when they existed, how they developed, etc.)

  6. Finding Cognate Words
    increasing amounts of digital data on the languages of the world necessitate
    the use of automatic methods for language comparison, but unfortunately
    ● available methods work well on small language families
    with moderate time depths,
    ● but they completely fail when it comes to the detection of
    words which are only partially cognate

  7. Finding Cognate Words
    “moon” in Germanic languages
    English moon
    German Mond
    Dutch maan
    Swedish måne

  8. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon Fúzhōu ŋuoʔ⁵
    German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴
    Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³
    Swedish måne Běijīng yɛ⁵¹ liɑŋ¹

  9. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon Fúzhōu ŋuoʔ⁵
    German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴
    Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³
    Swedish måne Běijīng yɛ⁵¹ liɑŋ¹

  10. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon Fúzhōu ŋuoʔ⁵
    German Mond Měixiàn ŋiat⁵ kuoŋ⁴⁴
    Dutch maan Wēnzhōu ȵy²¹ kuɔ³⁵ vai¹³
    Swedish måne Běijīng yɛ⁵¹ liɑŋ¹

  11. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon Fúzhōu moon
    German Mond Měixiàn moon shine
    Dutch maan Wēnzhōu moon shine suff.
    Swedish måne Běijīng moon gloss

  12. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon Fúzhōu moon
    German Mond Měixiàn moon shine
    Dutch maan Wēnzhōu moon shine suff.
    Swedish måne Běijīng moon gloss
    So far, no algorithm can detect these shared similarities across
    words in language families like Sino-Tibetan, Austro-Asiatic,
    Tai-Kadai, etc.

  13. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon ? Fúzhōu moon ?
    German Mond ? Měixiàn moon shine ?
    Dutch maan ? Wēnzhōu moon shine suff. ?
    Swedish måne ? Běijīng moon gloss ?
    But not only are our algorithms incapable of detecting these
    structures: we also have huge problems actually using this
    information in phylogenetic tree reconstruction and related tasks.

  14. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon 1 Fúzhōu moon ?
    German Mond 1 Měixiàn moon shine ?
    Dutch maan 1 Wēnzhōu moon shine suff. ?
    Swedish måne 1 Běijīng moon gloss ?
    But not only are our algorithms incapable of detecting these
    structures: we also have huge problems actually using this
    information in phylogenetic tree reconstruction and related tasks.

  15. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon 1 Fúzhōu moon 1
    German Mond 1 Měixiàn moon shine 1
    Dutch maan 1 Wēnzhōu moon shine suff. 1
    Swedish måne 1 Běijīng moon gloss 1
    Most algorithms require binary (yes/no) cognate decisions as input.
    But given the data for Chinese dialects, should we
    1. label them all as cognate words, as they share one element?
    2. label them all as different, as their strings all differ?
    loose coding of partial cognates

  16. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon 1 Fúzhōu moon 1
    German Mond 1 Měixiàn moon shine 2
    Dutch maan 1 Wēnzhōu moon shine suff. 3
    Swedish måne 1 Běijīng moon gloss 4
    Most algorithms require binary (yes/no) cognate decisions as input.
    But given the data for Chinese dialects, should we
    1. label them all as cognate words, as they share one element?
    2. label them all as different, as their strings all differ?
    strict coding of partial cognates

  17. Finding Cognate Words
    “moon” in Germanic languages “moon” in Chinese dialects
    English moon 1 Fúzhōu moon 1
    German Mond 1 Měixiàn moon shine 1 2
    Dutch maan 1 Wēnzhōu moon shine suff. 1 2 3
    Swedish måne 1 Běijīng moon gloss 1 4
    Ideally, we label them exactly as they are, as the exact coding
    enables us to switch to strict or loose coding automatically.
    exact coding of partial cognates
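    A minimal sketch of how an exact partial-cognate coding can be converted into strict and loose codings automatically; the data and helper names below are illustrative assumptions, not the authors' implementation.

        # Sketch: deriving strict and loose cognate codes from exact partial codings.
        # The IDs follow the "moon" example above and are illustrative only.
        from itertools import combinations

        # exact coding: one list of partial cognate IDs per word
        words = {
            "Fuzhou":  [1],
            "Meixian": [1, 2],
            "Wenzhou": [1, 2, 3],
            "Beijing": [1, 4],
        }

        # strict coding: two words are cognate only if all their morphemes match,
        # so identical ID tuples define the cognate classes
        strict = {lang: tuple(ids) for lang, ids in words.items()}

        # loose coding: two words are cognate if they share at least one morpheme;
        # classes are the connected components of this sharing relation (union-find)
        parent = {lang: lang for lang in words}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for a, b in combinations(words, 2):
            if set(words[a]) & set(words[b]):
                parent[find(a)] = find(b)

        loose = {lang: find(lang) for lang in words}
        print(strict)  # four distinct classes (strict coding)
        print(loose)   # one single class (loose coding)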

  18. Materials

  19. New Gold Standards
    Dataset Bai Chinese Tujia
    Languages 9 18 5
    Words 1028 3653 513
    Concepts 110 180 109
    Strict Cognates 285 1231 247
    Partial Cognates 309 1408 348
    Sounds 94 122 57
    Source Wang 2006 Běijīng Dàxué 1964 Starostin 2013

  20. New Gold Standards
    Dataset Bai Chinese Tujia
    Languages 9 18 5
    Words 1028 3653 513
    Concepts 110 180 109
    Strict Cognates 285 1231 247
    Partial Cognates 309 1408 348
    Sounds 94 122 57
    Source Wang 2006 Běijīng Dàxué 1964 Starostin 2013
    This is the first time that a valid gold standard has been
    created for the task of partial cognate detection!

  21. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)

  22. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Language
    Identifier

  23. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Concept
    Identifier

  24. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Phonetic
    Transcription
    Segmented
    Transcription

  25. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Cognate
    Identifier

  26. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Cognate
    Identifier

  27. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Partial
    Cognate ID

  28. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Partial
    Cognate ID

  29. New Gold Standards
    Text file of the data (originally in CSV format), visualized with the help
    of the EDICTOR tool (http://edictor.digling.org)
    Partial
    Cognate ID
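    A hedged sketch of what one row of such a wordlist file might look like and how it can be parsed; the column names follow the labels on the slides above, but the concrete layout (tab-separated columns, "+" as morpheme boundary, space-separated partial cognate IDs) is an assumption for illustration, not the published gold-standard format.

        # Sketch: parsing wordlist rows with the columns highlighted above.
        # Column names and layout are assumptions for illustration only.
        import csv

        rows = [
            "ID\tDOCULECT\tCONCEPT\tIPA\tTOKENS\tCOGID\tPARTIALIDS",
            "1\tBeijing\tmoon\tyɛ⁵¹liɑŋ\ty ɛ ⁵¹ + l i ɑ ŋ\t1\t1 4",
            "2\tFuzhou\tmoon\tŋuoʔ⁵\tŋ u o ʔ ⁵\t1\t1",
        ]

        for row in csv.DictReader(rows, delimiter="\t"):
            # PARTIALIDS holds one cognate ID per morpheme of the segmented form
            partial_ids = [int(i) for i in row["PARTIALIDS"].split()]
            print(row["DOCULECT"], row["CONCEPT"], partial_ids)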

  30. Methods

  31. Workflow
    1. compute pairwise sequence similarities between all
    morphemes of all words in the same meaning slot in a
    wordlist
    2. create a similarity network in which nodes represent
    morphemes and edges represent similarities between the
    morphemes
    3. use an algorithm for network partitioning to cluster the
    nodes of the network into groups of cognate morphemes
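    A compressed sketch of this three-step workflow on toy data, using a generic string similarity (difflib) and connected components as a stand-in for the partitioning algorithms named later; networkx and the toy morphemes are illustrative assumptions, not part of the method itself.

        # Sketch of the workflow: (1) pairwise morpheme similarities,
        # (2) a similarity network, (3) partitioning into cognate sets.
        from difflib import SequenceMatcher
        from itertools import combinations
        import networkx as nx

        # toy data: words for one concept, already split into morphemes
        words = {
            "Lang1": ["ma"],
            "Lang2": ["ma", "ne"],
            "Lang3": ["mo", "ne"],
        }

        def sim(a, b):
            # crude placeholder for an SCA/LexStat similarity score
            return SequenceMatcher(None, a, b).ratio()

        # step 1 + 2: nodes are morphemes, edges are similarities above a threshold;
        # morphemes of the same word are never compared with each other
        G = nx.Graph()
        G.add_nodes_from((lang, i) for lang, ms in words.items() for i in range(len(ms)))
        THRESHOLD = 0.5
        for la, lb in combinations(words, 2):
            for i, ma in enumerate(words[la]):
                for j, mb in enumerate(words[lb]):
                    score = sim(ma, mb)
                    if score >= THRESHOLD:
                        G.add_edge((la, i), (lb, j), weight=score)

        # step 3: partition the network; connected components stand in for
        # flat UPGMA, Markov Clustering, or Infomap here
        for k, component in enumerate(nx.connected_components(G), 1):
            print(k, sorted(component))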

  32. Sequence Similarity (Step 1)
    1. language-independent measures
    a. scores depend on the strings that are compared only
    b. any further information, such as recurring similarities of
    sounds (sound correspondences), is ignored
    2. language-specific measures
    a. previously identified regularities between languages
    are used to create a scoring function
    b. alignment algorithms use the scoring function to
    evaluate word similarity

  33. Sequence Similarity (Step 1)
    1. language-independent measures
    ● SCA algorithm (List 2012a & 2014)
    2. language-specific measures
    ● LexStat algorithm (List 2012b & 2014)
    All algorithms are implemented as part of the LingPy
    software package (http://lingpy.org, List and Forkel 2016,
    version 2.5).
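    As a stand-in for a language-independent measure, the sketch below scores two segmented words with a normalized edit distance; SCA and LexStat themselves operate on sound classes and sound correspondences and are available through LingPy, so this only illustrates the interface (two segment sequences in, one score out), not the algorithms.

        # Sketch of a simple language-independent similarity
        # (normalized edit distance over segments).
        def edit_distance(a, b):
            # classic dynamic-programming Levenshtein distance
            m, n = len(a), len(b)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i
            for j in range(n + 1):
                d[0][j] = j
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            return d[m][n]

        def similarity(a, b):
            # 1.0 for identical segment sequences, 0.0 for maximally different ones
            return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

        print(similarity(list("moon"), list("maan")))  # 0.5
        print(similarity(list("moon"), list("måne")))  # 0.25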

  34. Sequence Similarity Networks (Step 2)
    ● sequence similarity networks are tools for
    exploratory data analysis (Méheust et al. 2016,
    Corel et al. 2016)
    ● sequences (gene sequences in biology, words in
    linguistics) represent nodes in a network
    ● weighted edges represent similarities between
    the nodes

  35. Sequence Similarity Networks (Step 2)
    filtering edges drawn between sequences:
    a. draw no edges between morphemes in the
    same word
    b. in each word pair, link each morpheme only to
    one other morpheme (choose the most similar
    pair)
    c. only draw edges whose similarity exceeds a
    certain threshold
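    A sketch of how the three filtering rules can be applied when drawing edges for one pair of words; the toy similarity and the "best match per morpheme" reading of rule (b) are simplifying assumptions.

        # Sketch of the edge-filtering rules (a)-(c) for one word pair.
        def edges_for_word_pair(word_a, word_b, sim, threshold=0.45):
            """Return filtered morpheme edges between two words.

            word_a, word_b -- lists of morphemes; morphemes of the same word
            are never linked with each other (rule a)
            rule b -- each morpheme is linked only to its single best match
            rule c -- edges below the similarity threshold are dropped
            """
            edges = []
            for i, morpheme_a in enumerate(word_a):
                scored = [(sim(morpheme_a, m), j) for j, m in enumerate(word_b)]
                best_score, best_j = max(scored)
                if best_score >= threshold:
                    edges.append((i, best_j, best_score))
            return edges

        def toy_sim(a, b):
            # crude segment-overlap score, for illustration only
            return len(set(a) & set(b)) / max(len(set(a)), len(set(b)))

        # a two-morpheme word against a one-morpheme word (dummy morphemes)
        print(edges_for_word_pair(["la", "kip"], ["kib"], toy_sim))  # [(1, 0, 0.66...)]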

  36. Network Partitioning (Step 3)
    1. flat version of UPGMA (Sokal and Michener 1958), which
    terminates when a user-defined threshold is reached
    2. Markov Clustering (van Dongen 2000) uses techniques
    for matrix multiplication to inflate and expand the edge
    weights in a given network
    3. Infomap (Rosvall and Bergstrom 2008) was designed for
    community detection in complex networks and uses
    random walks to partition a network into communities
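    The Infomap variant of step 3 can be illustrated with python-igraph, whose community_infomap method implements the Rosvall and Bergstrom algorithm; the tiny weighted network below is illustrative, and in the real workflow the nodes and weights come from steps 1 and 2.

        # Sketch: partitioning a small weighted morpheme network with Infomap,
        # using python-igraph. Nodes and edge weights are illustrative only.
        from igraph import Graph

        edges = [
            ("Fuzhou:ŋuoʔ", "Meixian:ŋiat", 0.8),
            ("Meixian:ŋiat", "Wenzhou:ȵy", 0.6),
            ("Meixian:kuoŋ", "Wenzhou:kuɔ", 0.7),
        ]
        g = Graph.TupleList(edges, weights=True)

        # Infomap detects communities via random walks on the weighted network
        communities = g.community_infomap(edge_weights="weight")
        for k, members in enumerate(communities, 1):
            print(k, [g.vs[i]["name"] for i in members])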

  37. Workflow Example

  38. Workflow Example

  39. Workflow Example

  40. Workflow Example

  41. Results

  42. Analyses and Evaluation
    ● all analyses require user-defined thresholds
    ● since our gold-standard data is too small to split into test
    and training sets, we carried out an exhaustive evaluation
    across a large range of thresholds, varying from 0.05 to
    0.95 in steps of 0.05
    ● B-Cubed scores (Bagga and Baldwin 1998) were used as the
    evaluation measure, since they have been shown to yield
    sensible results (Hauer and Kondrak 2011)
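    For reference, a compact implementation of B-Cubed precision, recall, and F-score over two clusterings; the formulation follows Bagga and Baldwin (1998), while the function and variable names are assumptions of this sketch.

        # Sketch: B-Cubed precision, recall and F-score of a predicted clustering
        # against a gold clustering; both map item -> cluster label.
        def b_cubed(gold, predicted):
            items = list(gold)
            precision = recall = 0.0
            for i in items:
                same_pred = [j for j in items if predicted[j] == predicted[i]]
                same_gold = [j for j in items if gold[j] == gold[i]]
                correct = sum(1 for j in same_pred if gold[j] == gold[i])
                precision += correct / len(same_pred)
                recall += correct / len(same_gold)
            p, r = precision / len(items), recall / len(items)
            return p, r, 2 * p * r / (p + r)

        gold = {"a": 1, "b": 1, "c": 2, "d": 2}
        pred = {"a": 1, "b": 1, "c": 1, "d": 2}
        print(b_cubed(gold, pred))  # approximately (0.67, 0.75, 0.71)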

  43. Analyses and Evaluation
    ● we tested two classical methods for cognate detection
    (SCA and LexStat) against their refined variants sensitive
    to partial cognates (SCA-Partial, LexStat-Partial)
    ● since SCA and LexStat yield full cognate judgments, we
    need to convert the partial (exact) cognate judgments to
    full cognate judgments, using the criterion of full identity
    (the strict coding shown before)
    ● we also tested the accuracy of SCA-Partial and
    LexStat-Partial on partial cognacy, but these scores cannot
    be compared with those of the other algorithms

  44. Implementation
    The code was implemented in Python, as part of the
    LingPy library (Version 2.5, List and Forkel (2016),
    http://lingpy.org). The igraph software package
    (Csárdi and Nepusz 2006) is needed to apply the
    Infomap algorithm.
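    A hedged usage sketch of how this implementation can be called; the Partial class and the partial_cluster call follow my reading of the LingPy 2.5 documentation, so the exact names and parameters should be treated as assumptions and checked against http://lingpy.org.

        # Hedged sketch: running partial cognate detection via LingPy 2.5.
        # Class, method and parameter names are assumptions; the input file
        # "chinese.tsv" is a hypothetical wordlist with segmented TOKENS.
        from lingpy.compare.partial import Partial

        part = Partial("chinese.tsv")
        part.get_scorer(runs=1000)        # language-specific (LexStat) scoring function
        part.partial_cluster(
            method="lexstat",             # or "sca" for the language-independent variant
            threshold=0.55,               # user-defined threshold (cf. the evaluation)
            cluster_method="infomap",     # or "upgma", "mcl"
            ref="partialids",             # column receiving the partial cognate IDs
        )
        part.output("tsv", filename="chinese-partial")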

  45. Results: General

  46. Results: General

  47. Results: SCA vs. LexStat

  48. Results: Datasets

  49. Discussion

  50. Discussion
    ● our method is a pilot approach for the detection of
    partial cognates in multilingual wordlists
    ● further improvements are needed
    ● we should test on additional datasets (new language
    families) and increase our data (testing and training)
    ● our approach can be easily adjusted to employ different
    string similarity measures or partitioning algorithms: let’s
    try and see whether alternative measures can improve
    upon our current version

  51. Code and data at:
    https://github.com/lingpy/partial-cognate-detection
    or
    https://zenodo.org/record/51328?ln=en

  52. Thanks for your attention!
    Code and data at:
    https://github.com/lingpy/partial-cognate-detection
    or
    https://zenodo.org/record/51328?ln=en
