Benchmark Databases in Historical Linguistics

Talk held at the workshop "Language Comparison with Linguistic Databases" (Max Planck Institute, Nijmegen)

Johann-Mattis List

October 07, 2014

Transcript

  1. Benchmark Databases in Historical Linguistics
    Johann-Mattis List
    Forschungszentrum Deutscher Sprachatlas
    Philipps-University Marburg
    2014-10-07

  2. Benchmark Basics

  4. Benchmark Basics What are Benchmarks?
    What are Benchmark Databases?
    The evaluation and comparison of programs dealing with
    quantitative tasks in historical linguistics requires a large
    number of accurate reference sets which can be used as test
    cases to identify the strong and weak points of the numerous
    programs. (Original Plagiarism)

  6. Benchmark Basics What are Benchmarks?
    What are Benchmark Databases?
    A comprehensive evaluation and comparison of alignment
    programs requires a large number of accurate reference
    alignments which can be used as test cases. It has been
    shown (McClure et al., 1994) that the performance of
    alignment programs depends on the number of sequences, the
    degree of similarity between sequences and the number of
    insertions in the alignment. [...] We have constructed BAliBASE
    (Benchmark Alignment dataBASE) containing high-quality,
    documented alignments to identify the strong and weak points
    of the numerous alignment programs now available.
    (Thompson et al. 1998: 87)

  9. Benchmark Basics Why do we need Benchmarks?
    Why do we need Benchmark Databases?
    For comparative reasons, since otherwise, we won’t be able to
    really tell whether two independently proposed algorithms exhibit a
    similar performance or not.
    For development reasons, since we should test our new algorithms
    on actual data in order to guarantee their applicability and
    accuracy.

  17. Benchmark Basics Why should I care for Benchmarks?
    Why should I care for Benchmark Databases?
    Who said you need to care? If you are
    not a historical linguist, dialectologist, or typologist, or
    not interested in quantitative applications but prefer to do
    everything manually, or
    not interested in enhancing and formalizing existing methods with
    the help of computational approaches,
    then you don’t need to care about benchmark databases at all...
    But please don’t jump out of the room right now: I will make some jokes
    at the end of the talk that you shouldn’t miss.
    I promise!

  18. Benchmark Critics

  22. Benchmark Critics The Gold Standard Problem
    The Gold Standard Problem
    What do you mean by “Gold Standard”? Is the data simulated?
    There is no such thing as a “Gold Standard”.
    How can you make a “Gold Standard” if even historical
    linguists do not really have a clue what was going on?

  25. Benchmark Critics The Tuning Problem
    The Tuning Problem
    Why should I trust your algorithm if you tuned it with help of
    your “Gold Standard”?
    Seriously, how can I trust your algorithm despite all the scores
    if it fails to find the correspondences between Armenian [jɛɾˈku]
    and German [t͜svai]?!?

  31. Benchmark Critics The Creation Problem
    Who Benchmarks the Benchmarks?
    Oh, you used the data of XYZ to create your benchmark. Why
    didn’t you create your own dataset?
    Oh, you created the benchmark data YOURSELF. Why didn’t
    you use the excellent data by XYZ?
    How could you use the data of XYZ as the basis for your “Gold
    Standard” for Germanic languages, given that EVERYONE
    knows that their reconstruction of Proto-Nostratic is complete
    nonsense?

  32. In Defense of Benchmarks
    Stay away from my benchmarks...

  38. In Defense of Benchmarks Knowledge in Historical Linguistics
    Knowledge in Historical Linguistics
    Our knowledge of the past is a construct in the sense used in
    psychology (cf. Cronbach and Meehl 1955).
    This does not mean that what we know is just a “wissenschaftliche
    Fiktion” (“scientific fiction”; Schmidt 1872). There are good reasons
    to be confident that our traditional methods are better than random
    guesses.
    Although we should be careful in writing Indo-European fables,
    there are also good reasons to be confident that our methods
    uncover some kind of reality (cf. Saussure’s prediction of
    laryngeals in 1879).
    If it were not for these reasons that give us confidence in our
    traditional methods, why should we bother pursuing them?
    As long as we are confident that our traditional methods more or less
    work, we can use our traditional knowledge to compile benchmark
    databases and to test how well automatic methods work.

  42. In Defense of Benchmarks Training and Testing
    Training and Testing
    It is impossible to write an algorithm without training it.
    Testing an algorithm on the same data on which it was trained
    carries the danger of “overfitting” the algorithm.
    The fact that benchmark databases often serve both to train and to test
    the algorithms may be considered problematic. Nevertheless, this is
    not really a problem of benchmarks, but of the way benchmarks are
    handled by those who write and train the programs.
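
One way to handle a benchmark so that training and testing stay apart is to set aside part of it before any tuning starts. The following is a minimal sketch under stated assumptions: the dataset identifiers, the function name `split_benchmark`, the 30 % test share, and the fixed seed are all hypothetical and do not come from the talk.

```python
import random


def split_benchmark(items, test_fraction=0.3, seed=42):
    """Shuffle benchmark items and split them into a training and a test part."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for the test sets of a benchmark.
datasets = [f"dataset_{i:02d}" for i in range(10)]
train, test = split_benchmark(datasets)

# Parameters would be tuned on `train` only; `test` is scored once at the end.
print("train:", sorted(train))
print("test:", sorted(test))
```

The point of the design is only that the held-out part is fixed before tuning and is not consulted while parameters are adjusted.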

  48. In Defense of Benchmarks Benchmarks and Standards
    Benchmarks and Standards
    Ideally, a benchmark database in historical linguistics serves as a
    standard for
    training new algorithms,
    testing new algorithms, and
    defining common formats used by the algorithms.
    A benchmark database can be much more than a simple database. It
    can help to initiate the standardization of formats for data exchange and
    storage and thus force those who design new algorithms to comply with
    them.

  49. Benchmarks Online

  52. Benchmarks Online BDPA
    Benchmark Database for Phonetic Alignments
    http://alignments.lingpy.org

  55. Benchmarks Online BDCD
    Benchmark Database for Cognate Detection
    http://sequencecomparison.github.io

  58. Benchmarks Online BDLR
    Benchmark Database for Linguistic Reconstruction (?)
    Some data is already there, but it needs to be cleaned, referenced,
    linked, and checked before publication.
    Since the data should be provided in the form of multiple alignments,
    the alignments for all cognate sets compared to the proto-forms
    need to be manually checked.
    For further data, collaborations are planned (some of the
    collaborators do not yet know that they are among the lucky
    ones who have been chosen...).

  59. Benchmark Evaluation

  64. Benchmark Evaluation Evaluation Scores
    Evaluation Scores
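    First alignment (10 columns):
    English w o l - d e m o r t
    Russian v - l a d i m i r -
    Chinese f u - - d i m o t e
    Second alignment (11 columns):
    English w o l - d e m o r t -
    Russian v - l a d i m i r - -
    Chinese f u - - d i m o - t e
    Column scores (8 columns are identical in both alignments):
    8 / 11 = 0.73
    8 / 10 = 0.8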
    8 / 10.5 = 0.76
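
The three ratios above can be recomputed from the two alignments. The sketch below is an illustration under assumptions, not the scoring code used for the talk: the function names are made up, and a column counts as shared only when it groups the same residue positions of every sequence, so that equal symbols at different positions are not confused.

```python
def columns(alignment, gap="-"):
    """Encode each column as a tuple of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        entry = []
        for row, symbol in enumerate(column):
            if symbol == gap:
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        result.append(tuple(entry))
    return result


def column_scores(first, second):
    """Return the shared-column count and its ratio to each column total."""
    cols_a, cols_b = columns(first), columns(second)
    shared = len(set(cols_a) & set(cols_b))
    mean_length = (len(cols_a) + len(cols_b)) / 2
    return shared, shared / len(cols_b), shared / len(cols_a), shared / mean_length


first = [row.split() for row in (
    "w o l - d e m o r t",
    "v - l a d i m i r -",
    "f u - - d i m o t e",
)]
second = [row.split() for row in (
    "w o l - d e m o r t -",
    "v - l a d i m i r - -",
    "f u - - d i m o - t e",
)]

shared, by_second, by_first, by_mean = column_scores(first, second)
print(shared, round(by_second, 2), round(by_first, 2), round(by_mean, 2))
```

The sketch assumes that no column consists of gaps only, which holds for both alignments shown here.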

  66. Benchmark Evaluation Evaluation Scores
    Evaluation Scores
    [Figure: the two alignments decomposed into their pairwise alignments
    (English/Russian, Russian/Chinese, English/Chinese)]
    Scores over all pairwise alignments:
    25 / 30 = 0.83
    25 / 33 = 0.76
    25 / 31.5 = 0.79

  67. Benchmark Evaluation Evaluation Scores
    Evaluation Scores
    [Figure: the pairwise alignments from the previous slide in a second
    arrangement]
    26 / 30 = 0.87
    26 / 33 = 0.79
    26 / 31.5 = 0.83

  68. Benchmark Evaluation Evaluation Scores
    Evaluation Scores
    Choosing useful evaluation scores is essential for the evaluation
    of a given algorithm. Standardization is of crucial importance
    here, since this is the only way to guarantee the comparability of
    alternative approaches.
    While for some tasks (alignment analyses, cognate detection),
    proper evaluation scores are well-established (cf. the overview in
    List 2014), evaluation scores for other tasks (borrowing detection,
    linguistic reconstruction) are largely unexplored.
    Those who provide benchmark databases should also offer formal
    ways and code to evaluate algorithm performance.

  71. Benchmark Evaluation Evaluation Scores
    Evaluation Scores: LingPy
    http://lingpy.org

  73. Benchmark Evaluation Evaluation Tools
    Evaluation Tools
    “graben” (“dig”, 30)
    Language  Word        IPA       Turchin  Levenshtein  Reference
    Albanian  gërmon      gərmo     1        1            1
    English   digs        dɪg       2        2            2
    French    creuse      krøze     1        3            3
    German    gräbt       graːb     1        1            4
    Hawaiian  ‘eli        ʔeli      5        5            5
    Navajo    hahashgééd  hahageːd  6        6            6
    Turkish   kazıyor     kaz       7        3            7

  75. Benchmark Evaluation Evaluation Tools
    Evaluation Tools
    “Mund” (“mouth”, 104)
    Language  Word    IPA   Turchin  Levenshtein  Reference
    Albanian  gojë    goj   1        1            1
    English   mouth   mauθ  2        2            2
    French    bouche  buʃ   3        3            3
    German    Mund    mund  4        4            2
    Hawaiian  waha    waha  5        5            5
    Navajo    ’azéé’  zeːʔ  6        6            6
    Turkish   ağız    aɣz   7        7            7
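
Cluster assignments like the ones in the two tables can be compared with the reference numerically instead of by eye. The sketch below uses B-cubed precision and recall, a common measure for comparing clusterings; that this is the measure intended in the talk is an assumption, and the function name `bcubed` is made up for the illustration. The cluster IDs are copied from the Turchin, Levenshtein, and Reference columns.

```python
from collections import defaultdict


def bcubed(test, reference):
    """Return B-cubed precision, recall and F-score of a test clustering."""
    def member_sets(labels):
        groups = defaultdict(set)
        for index, label in enumerate(labels):
            groups[label].add(index)
        # For every item, the set of items sharing its cluster.
        return [groups[label] for label in labels]

    test_sets, ref_sets = member_sets(test), member_sets(reference)
    pairs = list(zip(test_sets, ref_sets))
    precision = sum(len(t & r) / len(t) for t, r in pairs) / len(pairs)
    recall = sum(len(t & r) / len(r) for t, r in pairs) / len(pairs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Order: Albanian, English, French, German, Hawaiian, Navajo, Turkish.
dig = {
    "Turchin": [1, 2, 1, 1, 5, 6, 7],
    "Levenshtein": [1, 2, 3, 1, 5, 6, 3],
    "Reference": [1, 2, 3, 4, 5, 6, 7],
}
mouth = {
    "Turchin": [1, 2, 3, 4, 5, 6, 7],
    "Levenshtein": [1, 2, 3, 4, 5, 6, 7],
    "Reference": [1, 2, 3, 2, 5, 6, 7],
}

for name, table in (("dig", dig), ("mouth", mouth)):
    for method in ("Turchin", "Levenshtein"):
        p, r, f = bcubed(table[method], table["Reference"])
        print(name, method, round(p, 2), round(r, 2), round(f, 2))
```

Low precision signals words that were wrongly lumped together (as for “dig”), while low recall signals cognates that were missed (English and German for “mouth”).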

  76. Benchmark Evaluation Evaluation Tools
    Evaluation Tools
    So far, there are no real tools for the evaluation of the results
    of automatic approaches. Nevertheless, if we want to increase the
    interaction between manual and automatic approaches in historical
    linguistics, it seems worthwhile to invest in proper tools for the
    expert evaluation of algorithms.

  77. Beyond Benchmarks

  78. Beyond Benchmarks Computer-Aided Historical Linguistics
    Computer-Aided Historical Linguistics
    So far, the majority of computational approaches in historical
    linguistics largely disregard the actual needs of historical
    linguistics. Despite the frequent claims that the algorithms are
    intended to supplement traditional research, many of them are mere
    attempts to prove the power of modern machine learning approaches
    and completely disregard the achievements of traditional research
    in historical linguistics.

  79. Beyond Benchmarks Computer-Aided Historical Linguistics
    Computer-Aided Historical Linguistics
    If we really want to make a difference with computational approaches
    and not simply seek to replace every expert who likes books with a
    computer or abacus, we need to work much, much harder on a real
    integration of computational and traditional approaches.

  82. Beyond Benchmarks Computer-Aided Historical Linguistics
    Computer-Aided Historical Linguistics
    [Figure: Bayes’ theorem, P(A|B) = P(B|A) P(A) / P(B), alongside the
    text “Franz Bopp: Very, Very Long Title”]
    Apart from “computational historical linguistics”, we need to
    establish a new discipline of “computer-aided historical linguistics”.
    Such a framework needs benchmark databases (no wonder) and new
    standards, both for traditional and computational linguistics.
    However, such a framework will also need additional resources that
    help traditional approaches to leave the realm of intuition.

  86. Beyond Benchmarks Semantic Change
    Semantic Change
    It is beyond question that hypotheses in historical linguistics stand or fall with
    a proper treatment of semantic change. So far, however, we lack the
    cross-linguistic data to assess the plausibility of proposed patterns of semantic
    shift. There is, however, hope for improvement:
    The Database of Semantic Shifts (DatSemShift, Burlak et al.,
    http://semshifts.iling-ran.ru/) offers a constantly increasing number
    of patterns of semantic shift, drawn from the linguistic literature. Shifts are
    categorized and tagged for meanings and, where accessible, directions. In the
    future, this may turn into a very valuable resource for those interested in semantic
    change.
    The Database of Cross-Linguistic Colexifications (CLICS, List et al.,
    http://clics.lingpy.org) offers collections of colexification (“polysemy”)
    patterns in some 200 languages. The data has been crawled semi-automatically
    from existing sources like the Intercontinental Dictionary Series (IDS, Key &
    Comrie 2007, http://lingweb.eva.mpg.de/ids/) and automatically
    cleaned and tagged for colexification. The automatic handling of the data without
    proper checking is a drawback that needs to be addressed in the future. A
    strength of the database is its visualization of colexifications using
    up-to-date JavaScript libraries.

  90. Beyond Benchmarks Sound Change and Sound Correspondences
    Sound Correspondences
    That not all sounds are equally likely to occur in correspondence
    relation in historically related words has been noted by many
    linguists in the past.
    However, only a few linguists have ever tried to substantiate this
    claim with data (Dolgopolsky 1964, Brown et al. 2013).
    The most notable resource known to me is the one by Brown et al.
    (2013), who report statistics based on ASJP data. The drawbacks
    of this approach are the limited number of symbols in ASJP code
    and the fact that identity relations are not covered. The advantages
    are the strictness of the procedure and the large amount of data
    that the analysis is based upon.
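
To make concrete what substantiating such claims with data could involve, the sketch below counts how often two segments face each other in aligned word pairs, identity relations included. It is a toy illustration under assumptions: the function name is made up, the input is the small example alignment from the evaluation slides rather than real cognate data, and it is not the procedure of Brown et al. (2013).

```python
from collections import Counter


def correspondences(aligned_pairs, gap="-"):
    """Count how often two segments face each other in aligned word pairs."""
    counts = Counter()
    for first, second in aligned_pairs:
        for a, b in zip(first.split(), second.split()):
            if a == gap and b == gap:
                continue  # a position that is empty in both rows says nothing
            counts[(a, b)] += 1
    return counts


english = "w o l - d e m o r t"
russian = "v - l a d i m i r -"
chinese = "f u - - d i m o t e"

counts = correspondences([(english, russian), (english, chinese)])
identical = sum(n for (a, b), n in counts.items() if a == b)

for pair, n in counts.most_common(3):
    print(pair, n)
print("identity relations:", identical, "of", sum(counts.values()))
```

With real benchmark data, the same counts would be taken over many aligned cognate sets and normalized by how frequent each sound is.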

  94. Beyond Benchmarks Sound Change and Sound Correspondences
    Sound Change
    So far, the only online resource known to me is the web-based
    platform for Diachronic Data and Models (DiaDM,
    http://www.diadm.ish-lyon.cnrs.fr) which offers a
    database on Diachronic Universals (UniDia). Unfortunately, the
    database has been under construction for almost two years now,
    and no real progress regarding the presentation of the data has
    been visible so far.
    If the numbers presented on the UniDia website are true (10 349
    sound changes in 302 languages), it is an invaluable
    resource on known sound changes.
    Unfortunately, it is not evident from the website how (or if) this
    data will be made public in the future, and whether it can ever be
    used either to train our algorithms or to provide our experts with
    something more than intuition.

  97. Beyond Benchmarks Phylogenetic Reconstruction
    Phylogenetic Reconstruction
    It probably goes without saying that with MultiTree
    (http://multitree.org), Ethnologue (Lewis & Fennig 2014,
    http://ethnologue.com), and Glottolog (v2.3, Hammarström
    et al. 2014, http://glottolog.org), a large number of expert
    classifications are publicly available.
    What is lacking in quite a few current approaches to phylogenetic
    reconstruction is a proper evaluation of the algorithms that makes
    rigorous use of these resources. Just eyeballing a tree and
    claiming that some method “reproduces expert classifications”
    based on some strange criterion is simply not enough.
    27 / 30
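
    As a rough illustration of what a more rigorous comparison could
    look like, the sketch below counts the clades that an inferred
    rooted tree shares with a reference classification. It is not a
    method proposed in the talk: the nested-tuple tree format, the
    function names, and the six-language toy trees are assumptions
    made only for this example.

```python
from typing import FrozenSet, Set, Tuple, Union

# A rooted tree is either a leaf name or a tuple of subtrees (hypothetical format).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf set below every internal node, excluding the root."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is trivially shared by all trees
    return found


def clade_scores(reference: Tree, inferred: Tree) -> Tuple[float, float]:
    """Return (precision, recall) of inferred clades against reference clades."""
    ref, inf = clades(reference), clades(inferred)
    shared = ref & inf
    precision = len(shared) / len(inf) if inf else 1.0
    recall = len(shared) / len(ref) if ref else 1.0
    return precision, recall


# Toy data: a flat two-group reference and a fully resolved inferred tree.
reference = (("German", "Dutch", "English"), ("French", "Italian", "Spanish"))
inferred = ((("German", "Dutch"), "English"), (("French", "Italian"), "Spanish"))

precision, recall = clade_scores(reference, inferred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

    Expert classifications often contain unresolved nodes, so a low
    precision may only mean that the inferred tree is more resolved
    than the reference. Recall, the share of expert groupings that
    are recovered, is the more informative of the two in that case.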


  100. Beyond Benchmarks Borrowing Detection
    Borrowing Detection
    Some databases, like the Indo-European Lexical Cognacy Database
    (IELex, http://ielex.mpi.nl/, Dunn et al. 2012), list
    borrowings along with their sources.
    However, given that in many areas of the world (and also
    in Indo-European) our knowledge in historical linguistics starts to
    reach its limits when it comes to distinguishing borrowings from
    inherited words, it is quite likely impossible at the moment
    to provide exhaustive test sets in which all borrowings are
    identified.
    28 / 30
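
    A non-exhaustive test set can still be used if the scoring is
    restricted to the words that experts have actually annotated.
    The sketch below is one possible way to do this, not a procedure
    from the talk: the function name, the set-based input format, and
    the word identifiers are hypothetical.

```python
from typing import Dict, Set


def borrowing_scores(
    predicted: Set[str],
    known_borrowings: Set[str],
    known_inherited: Set[str],
) -> Dict[str, float]:
    """Score predicted borrowings against a partial gold standard.

    Words that appear in neither gold set are not counted as errors,
    because the experts have not decided on them; they are only tallied.
    """
    true_pos = predicted & known_borrowings
    false_pos = predicted & known_inherited
    unjudged = predicted - known_borrowings - known_inherited
    judged = len(true_pos) + len(false_pos)
    return {
        "recall": len(true_pos) / len(known_borrowings) if known_borrowings else 1.0,
        "precision": len(true_pos) / judged if judged else 1.0,
        "unjudged": float(len(unjudged)),
    }


# Hypothetical word identifiers, for illustration only.
scores = borrowing_scores(
    predicted={"w1", "w2", "w4", "w6"},
    known_borrowings={"w1", "w2", "w3"},
    known_inherited={"w4", "w5"},
)
print(scores)
```

    Reporting the number of unjudged predictions alongside precision
    and recall makes it visible how much of a method's output the
    test set cannot assess.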


  106. Conclusion
    Conclusion
    Automatic approaches are constantly gaining ground in historical
    linguistics.
    Nevertheless, the majority of the new approaches show a great
    lack of transparency and applicability.
    One reason for this is the lack of benchmark databases in historical
    linguistics, which would help programmers to train their code and
    also force them to test it rigorously.
    In order to increase the interaction between traditional and
    computational historical linguists, we need to work hard on
    providing high-quality benchmark databases and high-quality tools
    for algorithm evaluation.
    We need not only benchmark databases but also
    cross-linguistic comparative databases that help historical
    linguists to assess the regularity and irregularity of patterns and
    proposals in a less intuitive way.
    29 / 30


  107. Thank You for Listening!
    30 / 30
