Beyond Cognacy

Talk presented at the workshop "Towards a Global Language Phylogeny", 17-19 September, Max Planck Institute for History and the Sciences, Jena.

Johann-Mattis List

September 17, 2014

Transcript

  1. Beyond Cognacy
    Current Chances and Future Challenges of Automatic Cognate
    Detection in Historical Linguistics
    Johann-Mattis List
    Forschungszentrum Deutscher Sprachatlas
    Philipps-University Marburg
    2014-09-17
    1 / 30

  2. Cognate Detection
    [Word cloud: words meaning "word" in various languages: word, Wort, слово, cuvînt, palabra, mot, adottszó, slovo, verbum, focal, 词, parola, λόγος, शब्द, ord; the forms word and ord are singled out]
    2 / 30

  3. Cognate Detection Traditional Approaches
    Traditional Approaches
    [Image: book cover reading "FRANZ BOPP: VERY, VERY LONG TITLE"]
    3 / 30

  4. Cognate Detection Traditional Approaches
    The Comparative Method
    – proof of relationship
    – identification of cognates
    – identification of sound correspondences
    – reconstruction of proto-forms
    – internal classification
    4 / 30

  11. Cognate Detection Traditional Approaches
    Cognate Detection
    Cognate List and Alignment
    German   dünn    d  ʏ   n
    English  thin    θ  ɪ   n
    German   Ding    d  ɪ   ŋ
    English  thing   θ  ɪ   ŋ
    German   dumm    d  ʊ   m
    English  dumb    d  ʌ   m
    German   Dorn    d  ɔɐ  n
    English  thorn   θ  ɔː  n
    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x ?
    n    n    2 x
    m    m    1 x
    ŋ    ŋ    1 x
    5 / 30
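
The correspondence list on this slide can be reproduced by counting which segments share a column in the alignments. The following is a minimal Python sketch, assuming the alignments are already given as equal-length segment lists; the names `ALIGNMENTS` and `count_correspondences` are illustrative and do not come from the talk or from any library.

```python
from collections import Counter

# Aligned German-English cognate pairs from the slide (one segment per column).
ALIGNMENTS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn : thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding : thing
    (["d", "ʊ", "m"], ["d", "ʌ", "m"]),    # dumm : dumb
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn : thorn
]


def count_correspondences(alignments):
    """Count how often each (German, English) segment pair shares a column."""
    counts = Counter()
    for german, english in alignments:
        if len(german) != len(english):
            raise ValueError("aligned sequences must have equal length")
        counts.update(zip(german, english))
    return counts


counts = count_correspondences(ALIGNMENTS)
for (ger, eng), freq in counts.most_common():
    print(f"{ger}\t{eng}\t{freq} x")
```

Unlike the slide, this sketch also counts the vowel columns; filtering them out would leave exactly the five consonant correspondences listed above.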

  13. Cognate Detection Automatic Approaches
    Automatic Approaches
    P(A|B) = P(B|A) P(A) / P(B)
    6 / 30

  17. Cognate Detection Automatic Approaches
    Narrowing down the Task
    Traditional Workflow
    [Diagram with the stages DICTIONARIES, WORDLISTS, HISTORICAL SCENARIOS]
    Wordlist example: HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk]
    Proto-forms and reflexes: *dent- (dente, dɑ̃, dɛnte); *tanθ (tuːθ, t͡saːn)
    7 / 30

  19. Cognate Detection Automatic Approaches
    Narrowing down the Task
    Technical Workflow
    [Diagram: every stage is illustrated with the sample wordlist HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk]]
    Data stages: RAW DATA; WORDLIST DATA; TOKENS, MORPHEMES; COGNATE SETS; SOUND CORRESPONDENCES; PROTO-FORMS
    Processing steps: Semantic Tagging; Tokenization; Cognate Detection; Alignment Analysis; Linguistic Reconstruction
    7 / 30

  20. Cognate Detection Automatic Approaches
    Narrowing down the Task
    Technical Workflow
    INPUT:
    Multilingual wordlist
    → semantically tagged
    → phonetically transcribed
    → tokenized into phonemes
    OUTPUT:
    Multilingual wordlist
    → identified cognate entries assigned to clusters
    → identified cognate entries multiply aligned
    7 / 30
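
The "tokenized into phonemes" requirement can be illustrated with a deliberately naive tokenizer. This is only a sketch under simplifying assumptions: combining marks and modifier letters (such as the length mark ː) are attached to the preceding base character, and a tie bar glues the next character onto the current segment. The function name `tokenize_ipa` is hypothetical; dedicated tools apply far richer rules.

```python
import unicodedata

# Tie bars that join two characters into one segment (e.g. affricates).
TIE_BARS = {"\u0361", "\u035c"}


def tokenize_ipa(ipa):
    """Split an IPA string into segments (naive, illustrative rules only)."""
    segments = []
    join_next = False  # True right after a tie bar: next char joins the segment
    for ch in ipa:
        category = unicodedata.category(ch)
        if ch in TIE_BARS and segments:
            segments[-1] += ch
            join_next = True
        elif join_next:
            segments[-1] += ch
            join_next = False
        elif category in ("Mn", "Lm") and segments:
            # combining diacritics and modifier letters such as the length mark
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


print(tokenize_ipa("tu\u02d0\u03b8"))    # English "tooth" as transcribed in the talk
print(tokenize_ipa("t\u0361sa\u02d0n"))  # German "Zahn" as transcribed in the talk
```

The escapes spell out the length mark (U+02D0) and the tie bar (U+0361) explicitly so that the example does not depend on how an editor renders combining characters.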

  24. Cognate Detection Automatic Approaches
    Algorithms
    Basic Procedure for Multilingual Cognate Detection
    WORDLIST DATA → (PAIRWISE COMPARISON) → PAIRWISE DISTANCES BETWEEN WORDS → (COGNATE CLUSTERING) → COGNATE SETS
    8 / 30
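
The pairwise comparison step needs some distance between two tokenized words. As one simple possibility (not necessarily the measure behind the figures shown in this talk), the sketch below computes a normalized edit distance: the Levenshtein distance over segments, divided by the length of the longer word. The function names and the choice of example words are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two segment sequences."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Simplified forms for "woman", one segment per list element.
words = {
    "German": ["f", "r", "a", "u"],
    "Dutch": ["f", "r", "o", "u"],
    "Swedish": ["k", "v", "i", "n", "a"],
    "Danish": ["k", "v", "e", "n", "i"],
}
taxa = list(words)
for i, a in enumerate(taxa):
    for b in taxa[i + 1:]:
        print(a, b, round(normalized_edit_distance(words[a], words[b]), 2))
```

Because every substitution costs the same here, the resulting values are cruder than distances derived from sound classes or language-specific scoring.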

  26. Cognate Detection Automatic Approaches
    Algorithms
    Cognate Clustering
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA
    ...  ...        ...     ...    ...      ...
    21   German     Frau    woman  20       frau
    22   Dutch      vrouw   woman  20       vrɑu
    23   English    woman   woman  20       wʊmən
    24   Danish     kvinde  woman  20       kvenə
    25   Swedish    kvinna  woman  20       kviːna
    26   Norwegian  kvine   woman  20       kʋinə
    ...  ...        ...     ...    ...      ...
    Pairwise distances
                      Swedish  English  Danish  Norwegian  Dutch  German
                      kvinna   woman    kvinde  kvine      vrouw  Frau
    Swedish    kvina  0.00     0.69     0.07    0.12       0.71   0.78
    English    wumin  0.69     0.00     0.66    0.57       0.68   0.87
    Danish     kveni  0.07     0.66     0.00    0.08       0.67   0.71
    Norwegian  kwini  0.12     0.57     0.08    0.00       0.75   0.74
    Dutch      frou   0.71     0.68     0.67    0.75       0.00   0.17
    German     frau   0.78     0.87     0.71    0.74       0.17   0.00
    8 / 30

  29. Cognate Detection Automatic Approaches
    Algorithms
    Cognate Clustering
    German     Frau    frau
    Dutch      vrouw   vrou
    English    woman   wumin
    Danish     kvinde  kveni
    Swedish    kvinna  kvina
    Norwegian  kvine   kwini
    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA     CogID
    ...  ...        ...     ...    ...      ...     ...
    21   German     Frau    woman  20       frau    1
    22   Dutch      vrouw   woman  20       vrɑu    1
    23   English    woman   woman  20       wʊmən   2
    24   Danish     kvinde  woman  20       kvenə   3
    25   Swedish    kvinna  woman  20       kviːna  3
    26   Norwegian  kvine   woman  20       kʋinə   3
    ...  ...        ...     ...    ...      ...     ...
    8 / 30
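
How a distance matrix turns into cognate IDs can be sketched with a small flat clustering routine. The matrix below is the one shown for "woman" in this talk; the average-linkage strategy and the threshold of 0.5 are assumptions chosen for illustration, and `flat_cluster` is a hypothetical helper, not a function of any particular library.

```python
TAXA = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]
DISTANCES = [
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]


def flat_cluster(taxa, matrix, threshold):
    """Average-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[i] for i in range(len(taxa))]

    def average_distance(c1, c2):
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        distance, x, y = min(
            (average_distance(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if distance >= threshold:
            break  # remaining clusters are too far apart to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(taxa[i] for i in cluster) for cluster in clusters]


clusters = flat_cluster(TAXA, DISTANCES, threshold=0.5)
for cog_id, members in enumerate(clusters, start=1):
    print(cog_id, members)
```

With this matrix the three Scandinavian words, the Dutch and German words, and the English word end up in separate clusters, which matches the grouping in the CogID column (the numbering of the IDs may differ).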

  30. Cognate Detection Automatic Approaches
    Algorithms
    LexStat Algorithm (List 2014)
    [Flowchart: INPUT, TOKENIZATION, PREPROCESSING, CORRESPONDENCE DETECTION USING PHONETIC ALIGNMENT (LOOP: ATTESTED DISTRIBUTION, EXPECTED DISTRIBUTION, LOG-ODDS), DISTANCE CALCULATION, COGNATE CLUSTERING, OUTPUT]
    8 / 30
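
The LOG-ODDS box in the flowchart refers to comparing how often a sound pair is attested in aligned words with how often it would be expected by chance. A generic sketch of such a score follows. It is not LexStat's actual implementation: the pseudo-count, the function name `log_odds`, and the toy counts are all assumptions made for illustration.

```python
import math


def log_odds(attested, expected, pseudo=1.0):
    """Return log2((attested + pseudo) / (expected + pseudo)) per sound pair.

    The pseudo-count avoids division by zero for pairs that never occur
    in one of the two distributions.
    """
    pairs = set(attested) | set(expected)
    return {
        pair: math.log2(
            (attested.get(pair, 0.0) + pseudo)
            / (expected.get(pair, 0.0) + pseudo)
        )
        for pair in pairs
    }


# Toy counts, invented for this example.
attested = {("d", "θ"): 7, ("d", "d"): 1}
expected = {("d", "θ"): 1, ("d", "d"): 3}
scores = log_odds(attested, expected)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

A positive score marks a pair that occurs more often than chance would suggest, a negative score one that occurs less often.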

  31. Cognate Detection Problems
    Problems
    9 / 30

  36. Cognate Detection Problems
    Applicability
    Method                         Multilingual?  No additional requirements?  Freely available?
    Mackay & Kondrak 2005          ✗              ✓                            ✗
    Bergsma & Kondrak 2007         ✓              ✓                            ✗
    Turchin et al. 2010            ✓              ✓                            ✓
    Berg-Kirkpatrick & Klein 2011  ✗              ✓                            ✗
    Hauer & Kondrak 2011           ✓              ✓                            ✗
    Steiner et al. 2011            ✓              ✓                            ✗
    List 2012 & 2014               ✓              ✓                            ✓
    Beinborn et al. 2013           ✗              ?                            ✗
    Bouchard-Côté et al. 2013      ✓              ✗                            ✗
    Rama 2013                      ✗              ✓                            ✗
    Ciobanu & Dinu 2014            ✗              ✓                            ✗
    …                              …              …                            …
    10 / 30

  42. Cognate Detection Problems
    Transparency
    Results are often only reported as evaluation scores.
    Examples of individual cognate judgments are rare.
    Supplementary data
    – is often lacking, or
    – is not given in a human-readable form.
    → The results show a great lack of transparency.
    11 / 30

  46. Cognate Detection Problems
    Comparability
    Test sets (benchmarks) vary greatly.
    Often, only subsets of Dyen et al. (1992) are used.
    → It is difficult to compare the performance of the methods.
    12 / 30

  51. Cognate Detection Problems
    Accuracy
    Evaluation criteria are not very intuitive and vary greatly.
    It is difficult to communicate the results to traditional linguists.
    → Many linguists regard automatic cognate detection as
    – “impossible per se”, or
    – as useful as “rolling a dice”.
    13 / 30

  52. Chances
    14 / 30

  56. Chances Applicability
    Applicability
    [Background: names of code hosting platforms and package archives: PyPI, GitHub, SourceForge, Google Code, CPAN, CTAN, JSAN, PEAR, Launchpad]
    It has never been easier to publish and maintain code...
    15 / 30

  58. Chances Applicability
    LingPy
    What is LingPy?
    Python library for automatic tasks in historical linguistics
    project homepage: http://lingpy.org
    code base: https://github.com/lingpy/lingpy
    supports Python 2 and Python 3
    works on Mac, Linux, and (basically also) Windows
    current release: 2.3
    16 / 30

  59. Chances Applicability
    LingPy
    What does LingPy offer?
    tokenization of phonetic sequences
    phonetic alignment analyses (List 2012a)
    automatic cognate detection (Turchin et al. 2010, List 2012b)
    automatic borrowing detection (List et al. 2014)
    basic routines for the evaluation of automatic methods
    plotting routines for interactive visualizations
    16 / 30

  60. Chances Transparency
    Transparency
    17 / 30

  62. Chances Transparency
    Interactive Presentation of Results
    Alignments offer a unique perspective on the results of cognate detection analyses.
    JavaScript and HTML5 offer unique ways for interactive data visualization.
    We are currently developing JavaScript tools that
    – visualize phonetic alignments of cognate sets, and
    – even allow the data to be edited online.
    18 / 30

  63. Chances Comparability
    Comparability
    19 / 30

  65. Chances Comparability
    Benchmark Databases for Historical Linguistics
    The first benchmark databases have been compiled and published:
    – Benchmark Database of Phonetic Alignments (BDPA, List & Prokić 2014, http://alignments.lingpy.org)
    – Benchmark Database for Cognate Detection (BDCD, presented in List 2014, http://sequencecomparison.github.io)
    – Benchmark Database for Linguistic Reconstruction (BDLR, in preparation)
    20 / 30

  66. Chances Comparability
    Benchmark Databases for Historical Linguistics
    All data is
    – given in phonetic transcriptions (IPA),
    – tokenized into phonemic units,
    – freely available for download, and
    – directly usable in LingPy.
    20 / 30

  67. Chances Accuracy
    Accuracy
    21 / 30

  70. Chances Accuracy
    Performance of Cognate Detection Algorithms
    B-Cubed F-Scores on BDCD Benchmark (List 2014)
    [Bar chart. Datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese Dialects). Methods: Turchin, NED, SCA, LexStat. Axis ticks: 60 to 95. Labelled values: 75%, 93%, 92%, 81%, 89%, 81%]
    22 / 30
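
B-Cubed scores compare a proposed clustering with a reference clustering item by item. The following minimal sketch shows the standard definition: per-item precision and recall, averaged over all items, and their harmonic mean as the F-score. The function name `b_cubed` and the use of the "woman" cognate IDs from earlier in the talk as a stand-in reference are assumptions made for illustration.

```python
from collections import defaultdict


def b_cubed(gold, test):
    """Return B-Cubed (precision, recall, F-score) for two clusterings.

    `gold` and `test` map each word ID to a cluster label.
    """
    def members(assignment):
        groups = defaultdict(set)
        for item, label in assignment.items():
            groups[label].add(item)
        # Map every item to the full set of items in its cluster.
        return {item: groups[label] for item, label in assignment.items()}

    gold_sets, test_sets = members(gold), members(test)
    items = list(gold)
    precision = sum(
        len(gold_sets[i] & test_sets[i]) / len(test_sets[i]) for i in items
    ) / len(items)
    recall = sum(
        len(gold_sets[i] & test_sets[i]) / len(gold_sets[i]) for i in items
    ) / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Word IDs and cognate IDs from the "woman" example, used as reference here.
gold = {21: 1, 22: 1, 23: 2, 24: 3, 25: 3, 26: 3}
lumped = {word_id: 1 for word_id in gold}  # everything in a single cluster

print(b_cubed(gold, gold))
print(b_cubed(gold, lumped))
```

Lumping all words into one cluster keeps recall perfect but is punished in precision, which is why the F-score is reported rather than either value alone.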

  72. Challenges
    23 / 30

  77. Challenges Within Cognacy
    Within Cognacy
    We need to enhance our
    – lexical databases (amount and quality of data),
    – cognate detection algorithms (accessibility and performance), and
    – ways to present the results (interactive visualizations).
    24 / 30

  81. Challenges Beyond Cognacy
    Beyond Cognacy
    German     m  oː  n  t  -
    English    m  uː  n  -  -
    Danish     m  ɔː  n  -  ə
    Swedish    m  oː  n  -  e
    "MOON"
    Fúzhōu     ŋ  u  o  ʔ  ⁵   -  -  -  -  -   -  -  -  -  -
    Měixiàn    ŋ  i  a  t  ⁵   -  -  -  -  -   k  u  o  ŋ  ⁴⁴
    Guǎngzhōu  j  -  y  t  ²   l  -  œ  ŋ  ²²  -  -  -  -  -
    Běijīng    -  y  ɛ  -  ⁵¹  l  i  ɑ  ŋ  -   -  -  -  -  -
    "MOON"                     "SHINE"         "LIGHT"
    25 / 30
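
The Chinese dialect example shows why a single cognate ID per word is too coarse: the words share the "MOON" morpheme but differ in what follows. One way to model this, sketched below, is to give each word a list of morpheme-level cognate IDs. The ID scheme (1 = "MOON", 2 = "SHINE", 3 = "LIGHT"), the assignment of IDs read off the aligned columns, and the helper `relation` are assumptions made for illustration.

```python
from itertools import combinations

# Morpheme-level cognate IDs, read off the aligned columns on the slide.
PARTIAL_COGIDS = {
    "Fúzhōu": [1],
    "Měixiàn": [1, 3],
    "Guǎngzhōu": [1, 2],
    "Běijīng": [1, 2],
}


def relation(ids_a, ids_b):
    """Classify two words as 'full', 'partial', or 'none' by shared morphemes."""
    if ids_a == ids_b:
        return "full"
    if set(ids_a) & set(ids_b):
        return "partial"
    return "none"


for (taxon_a, ids_a), (taxon_b, ids_b) in combinations(PARTIAL_COGIDS.items(), 2):
    print(taxon_a, taxon_b, relation(ids_a, ids_b))
```

Under this representation only the Guǎngzhōu and Běijīng words count as fully cognate, while every other pair is partially cognate through the shared first morpheme.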

  83. Challenges Beyond Cognacy
    Beyond Cognacy
    [Tree diagram with the leaves Fúzhōu, Měixiàn, Guǎngzhōu, Běijīng; branches are annotated with INNOVATION (five times), BORROWING, and LOSS]
    25 / 30

  84. Challenges Beyond Cognacy
    Lexical Change
    [Diagram with three axes: SEMANTIC CHANGE, MORPHOLOGICAL CHANGE, STRATIC CHANGE]
    Three Dimensions of Lexical Change (Gévaudan 2007)
    26 / 30

  85. Challenges Beyond Cognacy
    Lexical Change
                                                  Continuity
    Relation                       Biolog. Term   Stratic  Morphological  Semantic
    traditional notion of cognacy  -              +        +/-            +/-
    cognacy à la Swadesh           -              +        +/-            +
    automatic cognate detection    -              +/-      +/-            +
    direct cognate relation        orthology      +        +              +
    oblique cognate relation       paralogy       +        -              +
    etymological relation          homology       +/-      +/-            +/-
    oblique etymological relation  xenology       -        +/-            +/-
    26 / 30

  91. Challenges Beyond Cognacy
    Inferring Lexical Change Scenarios
    In order to go beyond cognacy, we need methods for
    – borrowing detection (stratic aspect),
    – partial cognate inference (morphological aspect), and
    – cross-semantic cognate inference (semantic aspect).
    Following the lead of evolutionary biology, these methods should be combined under a unified framework of tree reconciliation (Page & Cotton 2002) in historical linguistics.
    27 / 30

  93. Challenges Beyond Cognacy
    Tree Reconciliation
    [Figure: two trees, each with the leaves Fúzhōu, Měixiàn, Guǎngzhōu, Běijīng]
    28 / 30

  95. Challenges Beyond Cognacy
    Tree Reconciliation
    [Figure: a single tree with the leaves Fúzhōu, Měixiàn, Guǎngzhōu, Běijīng]
    28 / 30

  96. Challenges Beyond Cognacy
    Tree Reconciliation
    [Figure: tree annotated with the events LOSS, INNOVATION (twice), and BORROWING]
    28 / 30

  97. Challenges Beyond Cognacy
    Tree Reconciliation
    [Diagram with the steps PHYLOGENETIC RECONSTRUCTION, COGNATE (= HOMOLOG) DETECTION, COGNATE TREE RECONCILIATION]
    General Workflow for the Inference of Lexical Change Scenarios
    28 / 30

  103. Conclusion
    Automatic cognate detection is still in its infancy, yet the child is constantly growing.
    Enhancing the applicability, transparency, comparability, and accuracy of cognate detection methods is a goal that can be achieved in the near future.
    The greatest challenge arises from the complexity of lexical change processes.
    More realistic approaches that go beyond cognacy should be able to handle variation along the stratic, the morphological, and the semantic dimensions of lexical change.
    Evolutionary biology offers frameworks that could be employed to achieve these goals, yet it is not entirely clear whether and how this is possible.
    29 / 30

  104. Thank You for Listening!
    30 / 30
