Investigating the impact of sample size on cognate detection

Talk held at the conference Comparative-Historical Linguistics Of the XXIst Century: Issues and Perspectives, March 20-22, Russian State University for the Humanities, Moscow.

Johann-Mattis List

March 21, 2013

Transcript

  1. Investigating the Impact of Sample Size on Cognate Detection
    Johann-Mattis List
    Research Unit Quantitative Language Comparison
    Philipps-University Marburg
    March 17, 2013

  2. Sanscruta and Italian
    Sono scritte le loro scienze tutte in una lingua, che dimandano
    Sanscruta, che vuol dire bene articolata. [...] et ha
    la lingua d’oggi molte cose comuni con quella, nella quale
    sono molti de’ nostri nomi, e particularmente de’ numeri il 6,
    7, 8 e 9, Dio, serpe, et altri assai. (Sassetti 1855: 415)
    Translation: Everything that is related to science is written in a language
    which they call “Sanscruta”, which means “well-articulated”. [...] Our
    language has much in common with it, among others many of our words,
    especially the numbers 6, 7, 8, and 9, “God”, “snake”, and many more.

  3. The Comparative Method

  4. The Comparative Method: Working Procedure
    proof of relationship
    identification of cognates
    identification of sound correspondences
    reconstruction of proto-forms
    internal classification

  6. The Comparative Method: Cognate Detection

  7. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    German   Dorn   d ɔɐ n
    English  thorn  d ɔː n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x
    n    n    1 x
    m    m    1 x
    ŋ    ŋ    1 x

  9. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    German   Dorn   d ɔɐ n
    English  thorn  d ɔː n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    2 x
    d    d    1 x
    n    n    1 x
    m    m    1 x
    ŋ    ŋ    1 x

  10. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    German   Dorn   d ɔɐ n
    English  thorn  θ ɔː n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    2 x
    d    d    1 x
    n    n    1 x
    m    m    1 x
    ŋ    ŋ    1 x

  11. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    German   Dorn   d ɔɐ n
    English  thorn  θ ɔː n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x ?
    n    n    2 x
    m    m    1 x
    ŋ    ŋ    1 x

  12. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    German   Dorn   d ɔɐ n
    English  thorn  θ ɔː n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x
    n    n    2 x
    m    m    1 x
    ŋ    ŋ    1 x

  13. The Comparative Method: Cognate Detection
    Cognate List and Alignment
    German   dünn   d ʏ n
    English  thin   θ ɪ n
    German   Ding   d ɪ ŋ
    English  thing  θ ɪ ŋ
    German   Dorn   d ɔɐ n
    English  thorn  θ ɔː n
    German   dumm   d ʊ m
    English  dumb   d ʌ m
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    n    n    2 x
    ŋ    ŋ    1 x
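
The correspondence list on the slide above can be derived mechanically from the alignments. The following is a minimal sketch, not the procedure used in the talk: it counts how often two segments share an alignment column in the three word pairs that remain in the final correspondence list. The names `alignments`, `VOWELS`, and `count_correspondences` are made up for this illustration, and leaving out the vowels is an assumption that mirrors the slide, which lists consonant correspondences only.

```python
from collections import Counter

# Aligned word pairs from the slide (German, English), one list entry per segment.
alignments = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Assumption: vowels are ignored, because the slide lists consonants only.
VOWELS = {"ʏ", "ɪ", "ɔɐ", "ɔː"}


def count_correspondences(pairs):
    """Count how often each (German, English) segment pair shares a column."""
    counts = Counter()
    for german, english in pairs:
        for g, e in zip(german, english):
            if g in VOWELS or e in VOWELS:
                continue
            counts[(g, e)] += 1
    return counts


correspondences = count_correspondences(alignments)
for (g, e), freq in correspondences.most_common():
    print(g, e, f"{freq} x")
```

Under these assumptions the counts match the final list on the slide: d/θ three times, n/n twice, ŋ/ŋ once.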

  14. The Comparative Method: Summary
    Important Aspects
    language-specific notion of word similarity
    regular sound correspondences
    iterative character
    Unspecified Parameters
    number of languages
    semantic similarity of the words
    size of the word lists

  15. The Comparative Method: Summary
    The Problem of the Sample Size
              Albanian  English  French  German
    Albanian            0.07     0.10    0.10
    English   14                 0.23    0.56
    French    20        46               0.23
    German    20        111      46
    Numbers (below the diagonal) and proportions (above the diagonal) of
    shared cognates in the Swadesh-200 list (Swadesh 1952), taken from
    Kessler (2001).

  16. Automatic Cognate Detection

  17. Automatic Cognate Detection: Similarity
    Two Types of Similarity
    “Phenotypic” Similarity (Lass 1997)
    based on surface resemblances of phonetic segments
    only depends on the words under comparison
    “Genotypic” Similarity (ibid.)
    based on sound correspondences
    depends on the words and the languages under comparison

  18. Automatic Cognate Detection: Similarity
    Two Types of Similarity
    German Mund [mʊnt]
    English mouth [mauθ]

  19. Automatic Cognate Detection: Similarity
    Two Types of Similarity
    German Mund [mʊnt]
    English mouth [mauθ]
    German                        English
    Milch    [mɪlç]    m   m   [mɪlk]     milk
    rund     [rʊnt]    ʊ   au  [raund]    round
    anders   [andərs]  n   -   [ʌ(-)θər]  other
    südlich  [sytlɪç]  t   θ   [sʌθərn]   southern

  20. Automatic Cognate Detection: Language-Independent Approaches
    Normalized Edit Distance
    align two words and calculate their Hamming distance
    normalize by dividing by the length of the longer word
    assume cognacy for distances below a certain threshold
    Turchin et al. (2010)
    convert two (or more) words to Dolgopolsky (1966) consonant classes
    assume cognacy if the first two classes match
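
The normalized edit distance described above can be sketched in a few lines. This is an illustration under stated assumptions, not the LingPy implementation: the function names are made up, the words are given as lists of segments, and the cutoff of 0.7 is an arbitrary placeholder. The Hamming distance of two optimally aligned words equals their Levenshtein distance, which is what the sketch computes.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer word."""
    longer = max(len(a), len(b))
    return edit_distance(a, b) / longer if longer else 0.0


def is_cognate(a, b, threshold=0.7):
    """Judge two words cognate if their distance stays under the cutoff.

    The default threshold is an assumption made for this sketch.
    """
    return normalized_edit_distance(a, b) < threshold


mund = ["m", "ʊ", "n", "t"]   # German Mund
mouth = ["m", "au", "θ"]      # English mouth
print(normalized_edit_distance(mund, mouth))  # 0.75
print(is_cognate(mund, mouth))                # False
```

For German Mund and English mouth the sketch arrives at three edits over four positions, that is 0.75.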

  21. Automatic Cognate Detection: Language-Independent Approaches
    German Mund [mʊnt]
    English mouth [mauθ]

  22. Automatic Cognate Detection: Language-Independent Approaches
    German Mund [mʊnt]
    English mouth [mauθ]
    Turchin                  NED
    mʊnt → M N T             m ʊ  n t
    mauθ → M T               m au - θ
    Matches: x               0 1  1 1
    1 match => not cognate   3/4 = 0.75 => not cognate
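
The consonant-class comparison in the Turchin column can be sketched as follows. The class table is a hypothetical, heavily reduced stand-in that covers only the four consonants of the example; it is not the full set of Dolgopolsky classes, and skipping every segment without a table entry (here, the vowels) is a simplification made for this sketch.

```python
# Hypothetical, reduced class table: only the consonants of the example.
CONSONANT_CLASSES = {"m": "M", "n": "N", "t": "T", "θ": "T"}


def consonant_classes(segments):
    """Map segments to consonant classes, skipping segments without an entry."""
    return [CONSONANT_CLASSES[s] for s in segments if s in CONSONANT_CLASSES]


def turchin_match(a, b):
    """Judge two words cognate if their first two consonant classes agree."""
    first_a = consonant_classes(a)[:2]
    first_b = consonant_classes(b)[:2]
    return len(first_a) == 2 and first_a == first_b


mund = ["m", "ʊ", "n", "t"]   # German Mund  -> M N T
mouth = ["m", "au", "θ"]      # English mouth -> M T
print(consonant_classes(mund), consonant_classes(mouth))
print(turchin_match(mund, mouth))  # False
```

Only the first class (M) agrees, so the pair is rejected, as on the slide.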

  23. Automatic Cognate Detection: Language-Specific Approaches
    LexStat (List 2012a)
    represent words as tuples of sound classes and prosodic strings
    use the SCA approach (List 2012b) to guess initial correspondences
    use a Monte-Carlo permutation test to derive language-specific similarity scores
    use the language-specific scores to calculate distance between words
    cluster words into cognate sets using a flat cluster algorithm
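
The last step, flat clustering, can be illustrated with a small sketch. The slide does not say which flat cluster algorithm LexStat uses, so the code swaps in plain single-linkage clustering with a cutoff; the function name, the threshold, and the distance values are all invented for illustration.

```python
from itertools import combinations


def flat_clusters(items, distance, threshold):
    """Single-linkage flat clustering (a stand-in, not LexStat's own method).

    Every pair of items closer than `threshold` is linked; the function
    returns one cluster label per item, numbered in order of appearance.
    """
    parent = {item: item for item in items}

    def find(item):
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b in combinations(items, 2):
        if distance(a, b) < threshold:
            parent[find(a)] = find(b)

    labels = {}
    return [labels.setdefault(find(item), len(labels) + 1) for item in items]


# Invented distances for four words meaning "mouth"; unlisted pairs count as 1.0.
DISTANCES = {frozenset(("mauθ", "mund")): 0.3}
words = ["mauθ", "mund", "buʃ", "waha"]
labels = flat_clusters(
    words,
    lambda a, b: DISTANCES.get(frozenset((a, b)), 1.0),
    threshold=0.6,
)
print(labels)  # [1, 1, 2, 3]
```

Words that end up with the same label are treated as one cognate set.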

  24. Automatic Cognate Detection: Language-Specific Approaches
    LexStat

  25. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Sound Classes
    Sounds which frequently occur in correspondence relations in
    genetically related languages can be divided into classes (types).
    It is thereby assumed that “phonetic correspondences inside a ‘type’
    are more regular than those between different ‘types’”
    (Dolgopolsky 1986[1966]: 35).

  26. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Sound Classes
    Sounds which frequently occur in correspondence relations in
    genetically related languages can be divided into classes (types).
    It is thereby assumed that “phonetic correspondences inside a ‘type’
    are more regular than those between different ‘types’”
    (Dolgopolsky 1986[1966]: 35).
    k g   p b
    ʧ ʤ   f v
    t d   ʃ ʒ
    θ ð   s z

  29. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Sound Classes
    Sounds which frequently occur in correspondence relations in
    genetically related languages can be divided into classes (types).
    It is thereby assumed that “phonetic correspondences inside a ‘type’
    are more regular than those between different ‘types’”
    (Dolgopolsky 1986[1966]: 35).
    K  T  P  S

  30. Automatic Cognate Detection: Language-Specific Approaches
    LexStat

  31. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.

  32. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.
    j a b ə l k a

  33. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.
    j a b ə l k a
    ↑ ↑ ↓ ↑
    o strong
    weak

  34. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.
    j a b ə l k a
    ↑ ↑ ↓ ↑
    ↑ ascending
    maximum
    ↓ descending

  35. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.
    sonority increases
    j a b ə l k a

  36. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Prosodic Strings
    Sound change occurs more frequently in weak positions of sound
    sequences (Geisler 1992). Based on a sonority profile of sound
    sequences, one can distinguish sound positions according to their
    prosodic contexts. Prosodic context can be modeled as a prosodic
    string in which different contexts are coded by different symbols.
    j a b ə l k a
    # v C v c C >

  37. Automatic Cognate Detection: Language-Specific Approaches
    LexStat

  38. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    External Representation
    IPA                 j a b ə l k a
    Internal Representation
    Sound-Class String  J A P E L K A
    Prosodic String     # V C V c C >
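
The two internal representations shown above can be reproduced for this one word with a short sketch. It is only an illustration: the sound-class table contains nothing beyond the six sounds on the slide, the sonority ranks are hypothetical values, and the coding rules in `prosodic_string` are a simplified reading of the example (first segment `#`, last segment `>`, vowels `V`, consonants in ascending position `C`, all other consonants `c`), not the rules implemented in LingPy.

```python
# Sound classes exactly as shown on the slide for this word; not a full model.
SOUND_CLASSES = {"j": "J", "a": "A", "b": "P", "ə": "E", "l": "L", "k": "K"}

# Hypothetical sonority ranks (higher = more sonorous), for illustration only.
SONORITY = {"k": 1, "b": 1, "l": 5, "j": 6, "ə": 7, "a": 7}
VOWEL_RANK = 7


def sound_class_string(segments):
    """Replace every segment by its sound class."""
    return [SOUND_CLASSES[s] for s in segments]


def prosodic_string(segments):
    """Code each position by its prosodic context (simplified assumption)."""
    ranks = [SONORITY[s] for s in segments]
    symbols = []
    for i, rank in enumerate(ranks):
        if i == 0:
            symbols.append("#")      # first segment of the word
        elif i == len(ranks) - 1:
            symbols.append(">")      # last segment of the word
        elif rank == VOWEL_RANK:
            symbols.append("V")      # sonority peak (vowel)
        elif ranks[i + 1] > rank:
            symbols.append("C")      # consonant in ascending position
        else:
            symbols.append("c")      # consonant in descending position
    return symbols


word = ["j", "a", "b", "ə", "l", "k", "a"]
print(" ".join(sound_class_string(word)))  # J A P E L K A
print(" ".join(prosodic_string(word)))     # # V C V c C >
```

Pairing the two strings position by position gives the tuples of sound class and prosodic context that the later slides write as `C/#`, `N/c`, and so on.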

  39. Automatic Cognate Detection: Language-Specific Approaches
    LexStat

  40. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Cognate List and Alignment
    German   Zunge   ʦ ʊ ŋ ə
    English  tongue  t ʌ ŋ -
    German   Zahn    ʦ aː n -
    English  tooth   t ʊː - θ
    German   heiß    h ai s
    English  hot     h ɔ t
    German   Fuß     f uː s
    English  foot    f ʊ t
    Correspondence List
    GER  ENG  Frequ.
    ʦ    t    2 x
    s    t    2 x
    h    h    1 x
    f    f    1 x
    n    -    1 x
    …    …    …

  42. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Cognate List and Alignment
    German   Zunge   C U N E
    English  tongue  T A N -
    German   Zahn    C A N -
    English  tooth   T U - T
    German   heiß    H A S
    English  hot     H O T
    German   Fuß     B U S
    English  foot    B U T
    Correspondence List
    GER  ENG  Frequ.
    C/#  T/#  2 x
    S/$  T/$  2 x
    H/$  H/#  1 x
    B/$  B/#  1 x
    N/c  -    1 x
    …    …    …

  43. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Dataset of Kessler (2001)
    “to dig” (30)                    Turchin  NED  LexStat
    Albanian  gërmon      gərmo      1        1    1
    English   digs        dɪg        2        2    2
    French    creuse      krøze      1        3    3
    German    gräbt       graːb      1        1    4
    Hawaiian  ‘eli        ʔeli       5        5    5
    Navajo    hahashgééd  hahageːd   6        6    6
    Turkish   kazıyor     kaz        7        3    7

  44. Automatic Cognate Detection: Language-Specific Approaches
    LexStat
    Dataset of Kessler (2001)
    “mouth” (104)             Turchin  NED  LexStat
    Albanian  gojë    goj     1        1    1
    English   mouth   mauθ    2        2    2
    French    bouche  buʃ     3        3    3
    German    Mund    mund    4        4    2
    Hawaiian  waha    waha    5        5    5
    Navajo    ’azéé’  zeːʔ    6        6    6
    Turkish   ağız    aɣz     7        7    7

  45. Testing the Impact of Sample Size on Cognate Detection

  46. Testing the Impact of Sample Size on Cognate Detection: Materials
    Gold Standard
    IDS-Testset
    4 languages (German, English, Dutch, French)
    550 items (glosses)
    translations taken from the IDS (Key & Comrie 2009)
    orthographic entries converted into IPA transcriptions
    cognate judgments follow traditional literature

  47. Testing the Impact of Sample Size on Cognate Detection: Materials
    Subsets of Varying Sample Size
    Creating the Subsets
    Starting from the basic dataset, subsets of the data were created by
    randomly deleting 5, 10, 15, etc. items from the original dataset, and
    taking 5 different samples for each distinct number of deletions.
    This process yielded 550 datasets, covering the whole range of
    possible sample sizes between 5 and 550 in steps of 5.
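
The sampling scheme can be sketched as below. This is a guess at the mechanics, not the script used for the talk: the function name, the fixed random seed, and the placeholder concept names are assumptions. Keeping `size` randomly chosen items is treated as equivalent to deleting the remaining ones.

```python
import random


def make_subsets(items, step=5, samples_per_size=5, seed=0):
    """Return (size, subset) pairs for every sample size in steps of `step`."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    subsets = []
    for size in range(step, len(items) + 1, step):
        for _ in range(samples_per_size):
            # keeping `size` random items = deleting len(items) - size items
            subsets.append((size, rng.sample(items, size)))
    return subsets


# Placeholder names standing in for the 550 glosses of the test set.
concepts = [f"concept_{i}" for i in range(1, 551)]
subsets = make_subsets(concepts)
print(len(subsets))  # 550
```

With 110 distinct sizes (5, 10, ..., 550) and 5 samples each, the sketch yields the 550 datasets mentioned on the slide.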

  48. Testing the Impact of Sample Size on Cognate Detection: Methods
    Automatic Cognate Detection
    Methods for Cognate Detection
    Normalized Edit Distance (NED)
    Turchin et al. (2010, Turchin)
    SCA Distance (List 2012b)
    LexStat (List 2012a)
    Implementation
    All methods are implemented as part of LingPy-1.0 (see
    http://lingpy.org), a Python library for quantitative
    tasks in historical linguistics.

  49. Testing the Impact of Sample Size on Cognate Detection: Methods
    Evaluation Measures
    B-Cubed Precision and Recall (Amigó et al. 2009)
    Given a test (result of an analysis) and a reference (the gold standard),
    precision is the proportion of items in the test that also occur in
    the reference, and
    recall is the proportion of items in the reference that also occur
    in the test.
    Low precision corresponds to a high rate of false positives; low recall
    corresponds to a high rate of false negatives (missed cognates).
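
B-Cubed scores are computed per item and then averaged. The sketch below follows the usual item-wise definition and is not taken from LingPy; the function name is made up. As input it reuses the cluster labels for "mouth" from the Kessler example: the Turchin column serves as the test, and a labeling that groups English and German together serves as a purely illustrative reference, not as the gold standard of the study.

```python
def b_cubed(test, reference):
    """B-Cubed precision and recall for two clusterings given as label lists."""
    n = len(test)
    precision = recall = 0.0
    for i in range(n):
        same_test = {j for j in range(n) if test[j] == test[i]}
        same_ref = {j for j in range(n) if reference[j] == reference[i]}
        shared = len(same_test & same_ref)
        precision += shared / len(same_test)  # share of i's test cluster that is correct
        recall += shared / len(same_ref)      # share of i's reference cluster that was found
    return precision / n, recall / n


# Albanian, English, French, German, Hawaiian, Navajo, Turkish
turchin = [1, 2, 3, 4, 5, 6, 7]    # every word in its own cluster
reference = [1, 2, 3, 2, 5, 6, 7]  # illustrative reference: English and German together

precision, recall = b_cubed(turchin, reference)
print(precision, round(recall, 4))
```

The test clustering proposes no wrong cognates, so precision is 1.0, but it misses the English and German pair, which lowers recall to 6/7.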

  50. Results

  51. Results
    B-Cubed Recall
    Items  Turchin  NED    SCA    LexStat
    50     86.10    85.55  92.44  90.88
    100    86.55    85.77  92.20  93.89
    200    86.88    86.61  92.68  95.02
    300    87.13    86.64  92.90  95.05
    400    87.14    86.81  92.89  94.94
    500    87.07    86.77  92.75  94.90

  52. Results
    [Figure: four panels titled Turchin, NED, SCA, and LexStat; x-axis ticks from 100 to 500, y-axis ticks from 84 to 96]

  53. Discussion

  54. Discussion
    Are 200 words enough?
    Although
    the representativity of the data is limited, and
    the number of languages investigated is small,
    the test shows that
    sample size has a definite impact on the results of
    language-specific methods, and
    using 200 words is surely better than using 100 words.

  56. Sanscruta  sarpá-  s a r p a
    Italian    serpe   s ɛ r p ə
    Sanscruta  devá-   d e v a
    Italian    Dio     d i - o
    Sanscruta  saptá-  s a p t a
    Italian    sette   s ɛ - tː ə

  57. Thank you for your attention!