Using sequence similarity networks to identify partial cognates in multilingual wordlists

Talk held at the Annual Meeting of the Association for Computational Linguistics (2016/08/07-12, Berlin)

Johann-Mattis List

August 10, 2016

Transcript

  1. Using Sequence Similarity Networks
    to Identify Partial Cognates in
    Multilingual Wordlists
    Johann-Mattis List, Philippe Lopez, and Eric Bapteste

  2. Introduction

  3. Keys to the Past
    by comparing the languages of the world, we gain invaluable
    insights
    ● into the past of the languages spoken in the world
    ● into the history of ancestral populations
    ● into human prehistory in general

  4. Keys to the Past
    in order to compare the languages in the world
    ● we need to prove that two or more languages are
    genetically related by
    ● identifying elements they have inherited from their
    common ancestors

  5. Keys to the Past
    having identified these cognate elements (usually words and
    morphemes)
    ● we can calculate phylogenetic trees and networks,
    ● we can reconstruct the words in the unattested ancestral
    languages, and
    ● we can try to learn more about these language families
    (when they existed, how they developed, etc.)

  6. Finding Cognate Words
    increasing amounts of digital data on the languages of the
    world necessitate the use of automatic methods for language
    comparison, but unfortunately
    ● available methods work well on small language families
    with moderate time depths,
    ● but they completely fail when it comes to the detection of
    words which are only partially cognate

  7. Finding Cognate Words
    “moon” in Germanic languages
    English moon
    German Mond
    Dutch maan
    Swedish måne

  8. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon                  Fúzhōu    ŋuoʔ⁵
    German    Mond                  Měixiàn   ŋiat⁵ kuoŋ⁴⁴
    Dutch     maan                  Wēnzhōu   ȵy²¹ kuɔ³⁵ vai¹³
    Swedish   måne                  Běijīng   yɛ⁵¹ liɑŋ¹

  11. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon                  Fúzhōu    moon
    German    Mond                  Měixiàn   moon shine
    Dutch     maan                  Wēnzhōu   moon shine suff.
    Swedish   måne                  Běijīng   moon gloss

  12. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon                  Fúzhōu    moon
    German    Mond                  Měixiàn   moon shine
    Dutch     maan                  Wēnzhōu   moon shine suff.
    Swedish   måne                  Běijīng   moon gloss
    So far, no algorithm can detect these shared similarities across
    words in language families like Sino-Tibetan, Austro-Asiatic,
    Tai-Kadai, etc.

  13. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon   ?              Fúzhōu    moon               ?
    German    Mond   ?              Měixiàn   moon shine         ?
    Dutch     maan   ?              Wēnzhōu   moon shine suff.   ?
    Swedish   måne   ?              Běijīng   moon gloss         ?
    It is not only that our algorithms cannot detect these
    structures: we also have great difficulty using this
    information in phylogenetic tree reconstruction and related
    tasks.

  14. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon   1              Fúzhōu    moon               ?
    German    Mond   1              Měixiàn   moon shine         ?
    Dutch     maan   1              Wēnzhōu   moon shine suff.   ?
    Swedish   måne   1              Běijīng   moon gloss         ?
    It is not only that our algorithms cannot detect these
    structures: we also have great difficulty using this
    information in phylogenetic tree reconstruction and related
    tasks.

  15. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon   1              Fúzhōu    moon               1
    German    Mond   1              Měixiàn   moon shine         1
    Dutch     maan   1              Wēnzhōu   moon shine suff.   1
    Swedish   måne   1              Běijīng   moon gloss         1
    Most algorithms require binary (yes/no) cognate decisions as input.
    But given the data for Chinese dialects, should we
    1. label them all as cognate words, as they share one element?
    2. label them all as different, as their strings all differ?
    loose coding of partial cognates

  16. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon   1              Fúzhōu    moon               1
    German    Mond   1              Měixiàn   moon shine         2
    Dutch     maan   1              Wēnzhōu   moon shine suff.   3
    Swedish   måne   1              Běijīng   moon gloss         4
    Most algorithms require binary (yes/no) cognate decisions as input.
    But given the data for Chinese dialects, should we
    1. label them all as cognate words, as they share one element?
    2. label them all as different, as their strings all differ?
    strict coding of partial cognates

  17. Finding Cognate Words
    “moon” in Germanic languages    “moon” in Chinese dialects
    English   moon   1              Fúzhōu    moon               1
    German    Mond   1              Měixiàn   moon shine         1 2
    Dutch     maan   1              Wēnzhōu   moon shine suff.   1 2 3
    Swedish   måne   1              Běijīng   moon gloss         1 4
    Ideally, we label them exactly as they are, as the exact coding
    enables us to switch to strict or loose coding automatically.
    exact coding of partial cognates

  18. New Gold Standards
    Dataset            Bai         Chinese              Tujia
    Languages          9           18                   5
    Words              1028        3653                 513
    Concepts           110         180                  109
    Strict Cognates    285         1231                 247
    Partial Cognates   309         1408                 348
    Sounds             94          122                  57
    Source             Wang 2006   Běijīng Dàxué 1964   Starostin 2013

  19. New Gold Standards
    This is the first time that a valid gold standard was created
    for the task of partial cognate detection!

  20. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)

  21. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)
    Language Identifier

  22. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)
    Concept Identifier

  23. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)
    Phonetic Transcription
    Segmented Transcription

  24. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)
    Cognate Identifier

  26. New Gold Standards
    Text file of the data (originally CSV format), visualized with help of
    the EDICTOR tool (http://edictor.digling.org)
    Partial Cognate ID

  29. Workflow
    1. compute pairwise sequence similarities between all
    morphemes of all words in the same meaning slot in a
    wordlist
    2. create a similarity network in which nodes represent
    morphemes and edges represent similarities between the
    morphemes
    3. use an algorithm for network partitioning to cluster the
    nodes of the network into groups of cognate morphemes

  30. Sequence Similarity (Step 1)
    1. language-independent measures
    a. scores depend only on the strings that are compared
    b. any further information, such as recurring similarities of
    sounds (sound correspondences), is ignored
    2. language-specific measures
    a. previously identified regularities between languages
    are used to create a scoring function
    b. alignment algorithms use the scoring function to
    evaluate word similarity
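
A language-independent measure in the sense of point 1 can be illustrated with a normalized edit distance over sound segments. This is a deliberately simple stand-in and not the SCA or LexStat algorithm: the function names are hypothetical, and the segmentation of the two morphemes in the usage note is an assumption made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (0 if segments match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

For example, comparing the assumed segmentations `["ŋ", "u", "o", "ʔ", "⁵"]` (Fúzhōu) and `["ŋ", "i", "a", "t", "⁵"]` (first morpheme of Měixiàn) requires three substitutions over five segments, giving a distance of 0.6. The score depends only on the two strings, which is exactly what makes it language-independent.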

  31. Sequence Similarity (Step 1)
    1. language-independent measures
    ● SCA algorithm (List 2012a & 2014)
    2. language-specific measures
    ● LexStat algorithm (List 2012b & 2014)
    All algorithms are implemented as part of the LingPy
    software package (http://lingpy.org, List and Forkel 2016,
    version 2.5).

  32. Sequence Similarity Networks (Step 2)
    ● sequence similarity networks are tools for
    exploratory data analysis (Méheust et al. 2016,
    Corel et al. 2016)
    ● sequences (gene sequences in biology, words in
    linguistics) represent nodes in a network
    ● weighted edges represent similarities between
    the nodes

  33. Sequence Similarity Networks (Step 2)
    filtering edges drawn between sequences:
    a. draw no edges between morphemes in the
    same word
    b. in each word pair, link each morpheme only to
    one other morpheme (choose the most similar
    pair)
    c. only draw edges whose similarity exceeds a
    certain threshold
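
The three filters can be sketched as follows. This is one possible reading, not the implementation from the talk: `similarity` uses Python's `difflib.SequenceMatcher` as a stand-in for SCA or LexStat scores, filter (b) is interpreted as a greedy best-first matching within each word pair, and the function names, the default threshold, and the toy data in the usage note are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity between two morphemes (tuples of segments)."""
    return SequenceMatcher(None, a, b).ratio()


def build_edges(words, threshold=0.5):
    """Return weighted edges between morphemes of different words.

    words: dict mapping a word identifier to a list of morphemes.
    Nodes are (word identifier, morpheme index) pairs.
    """
    edges = {}
    # (a) only morphemes of different words are ever compared
    for (w1, morphs1), (w2, morphs2) in combinations(words.items(), 2):
        scored = sorted(
            ((similarity(m1, m2), i, j)
             for i, m1 in enumerate(morphs1)
             for j, m2 in enumerate(morphs2)),
            key=lambda item: -item[0],
        )
        used1, used2 = set(), set()
        for score, i, j in scored:
            # (c) keep only edges whose similarity exceeds the threshold
            if score <= threshold:
                break
            # (b) each morpheme is linked at most once per word pair
            if i in used1 or j in used2:
                continue
            used1.add(i)
            used2.add(j)
            edges[(w1, i), (w2, j)] = score
    return edges
```

With the toy input `{"A": [("m", "a", "n")], "B": [("m", "a", "n"), ("k", "o")], "C": [("m", "o", "n"), ("k", "o")]}`, the first morphemes of all three words are linked to each other and the second morphemes of B and C are linked, while the weak cross-link between the second morpheme of B and the first of C is dropped.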

  34. Network Partitioning (Step 3)
    1. flat version of UPGMA (Sokal and Michener 1958), which
    terminates when a user-defined threshold is reached
    2. Markov Clustering (van Dongen 2000) uses techniques
    for matrix multiplication to inflate and expand the edge
    weights in a given network
    3. Infomap (Rosvall and Bergstrom 2008) was designed for
    community detection in complex networks and uses
    random walks to partition a network into communities
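
The first option, a flat average-linkage clustering that stops at a threshold, can be sketched as below. This is a minimal illustration and not LingPy's implementation: the function name `flat_upgma`, the choice to stop as soon as the smallest average distance exceeds the threshold, and the toy distances in the usage note are assumptions.

```python
def flat_upgma(nodes, distance, threshold):
    """Merge the closest clusters until none are closer than the threshold.

    nodes: list of hashable node identifiers.
    distance: function returning the distance between two nodes.
    """
    clusters = [[node] for node in nodes]

    def average_distance(c1, c2):
        # UPGMA uses the unweighted mean over all pairs across two clusters
        return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best > threshold:
            break  # assumption: stop once the closest pair is too distant
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For four hypothetical nodes where `a`/`b` are 0.1 apart, `c`/`d` are 0.2 apart and all other pairs are 0.9 apart, a threshold of 0.5 yields the two clusters `[a, b]` and `[c, d]`. Lower thresholds produce more and smaller clusters, which is why the threshold has to be explored systematically.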

  35. Workflow Example

  39. Analyses and Evaluation
    ● all analyses require user-defined thresholds
    ● since our gold-standard data is too small to split into
    training and test sets, we carried out an exhaustive evaluation
    over a large range of thresholds, varying between 0.05 and
    0.95 in steps of 0.05
    ● B-Cubed scores (Bagga and Baldwin 1998) were used as the
    evaluation measure, since they have been shown to yield
    sensible results (Hauer and Kondrak 2011)
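
B-Cubed precision, recall and F-score, together with the threshold grid, can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the talk: the names `bcubed` and `THRESHOLDS` are hypothetical, and cluster assignments are assumed to be given as dictionaries from items to cluster labels.

```python
def bcubed(gold, test):
    """Return B-Cubed (precision, recall, F-score) for two clusterings.

    gold, test: dicts mapping each item to its cluster label.
    """
    items = list(gold)

    def average_overlap(a, b):
        # For each item: share of its cluster in `a` that is also in its cluster in `b`
        total = 0.0
        for x in items:
            cluster_a = {y for y in items if a[y] == a[x]}
            cluster_b = {y for y in items if b[y] == b[x]}
            total += len(cluster_a & cluster_b) / len(cluster_a)
        return total / len(items)

    precision = average_overlap(test, gold)
    recall = average_overlap(gold, test)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Thresholds from 0.05 to 0.95 in steps of 0.05 (19 values)
THRESHOLDS = [round(0.05 * i, 2) for i in range(1, 20)]
```

As a toy check, if the gold standard puts four words into four separate cognate sets and an algorithm lumps them all into one set, precision is 0.25, recall is 1.0 and the F-score is 0.4. An exhaustive evaluation would compute these scores once per value in `THRESHOLDS`.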

  40. Analyses and Evaluation
    ● we tested two classical methods for cognate detection
    (SCA and LexStat) against their refined variants, which are
    sensitive to partial cognates (SCA-Partial, LexStat-Partial)
    ● since SCA and LexStat yield full cognate judgments, we
    need to convert the partial (exact) cognate judgments to
    full cognate judgments, using the criterion of full identity
    (strict coding as shown before)
    ● we also tested the accuracy of SCA-Partial and
    LexStat-Partial on partial cognacy, but cannot compare
    these scores with those of other algorithms

  41. Implementation
    The code was implemented in Python, as part of the
    LingPy library (Version 2.5, List and Forkel (2016),
    http://lingpy.org). The igraph software package
    (Csárdi and Nepusz 2006) is needed to apply the
    Infomap algorithm.

  42. Results: General

  44. Results: SCA vs. LexStat

  45. Results: Datasets

  46. Discussion
    ● our method is a pilot approach for the detection of
    partial cognates in multilingual word lists
    ● further improvements are needed
    ● we should test on additional datasets (new language
    families) and increase our data (testing and training)
    ● our approach can be easily adjusted to employ different
    string similarity measures or partitioning algorithms: let’s
    try and see whether alternative measures can improve
    upon our current version

  47. Code and data at:
    https://github.com/lingpy/partial-cognate-detection
    or
    https://zenodo.org/record/51328?ln=en

  48. Thanks for your attention!