
Datasets and Software Tools for Computer-Assisted Language Comparison

Talk, held at the workshop "Databases in historical linguistics" (2015-08-20/21, Santa Fe Institute, Santa Fe).

Johann-Mattis List

August 20, 2015

Transcript

  1. Datasets and Software Tools for Computer-Assisted
    Language Comparison
    Johann-Mattis List
    DFG research fellow
    Centre des recherches linguistiques sur l’Asie Orientale
    Team Adaptation, Integration, Reticulation, Evolution
    EHESS and UPMC, Paris
    2015-08-20
    1 / 50


  2. Traditional Language Comparison
    Traditional
    Language Comparison
    2 / 50


  3. Traditional Language Comparison Characteristics
    Characteristics
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    3 / 50


  4. Traditional Language Comparison Characteristics
    Research Object
    4 / 50


  5. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n -
    * Proto-Germanic t a n d
    English t ʊː θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    * Proto-Romance d e n t
    French d ɑ̃ - -
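The rows above are aligned sound sequences in which "-" marks a gap. As a minimal sketch of how such a pairwise alignment can be computed, the following uses a plain Needleman-Wunsch procedure with unit scores. The scoring values and the function name `align` are assumptions made for illustration; they are not the scoring scheme or the software presented in the talk.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two lists of sound segments (Needleman-Wunsch)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # (mis)match
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


german = ["ʦ", "aː", "n"]
proto_germanic = ["t", "a", "n", "d"]
row_a, row_b = align(german, proto_germanic)
print(" ".join(row_a))  # ʦ aː n -
print(" ".join(row_b))  # t a n d
```

With unit scores only identical segments count as matches, so the result depends on the shared "n" alone. A linguistically informed method would score similar sounds (such as ʦ and t) more favourably than unrelated ones.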
    4 / 50


  7. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n - -
    * Proto-Germanic t a n d
    English t ʊː - θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    * Proto-Romance d e n t
    French d ɑ̃ - - -
    4 / 50


  8. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n - -
    Proto-Germanic t a n θ -
    English t ʊː - θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃ - - -
    4 / 50


  9. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n -
    Proto-Germanic t a n θ -
    English t ʊː - θ
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃ - -
    4 / 50


  10. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n -
    Proto-Germanic t a n θ -
    English t ʊː - θ
    Proto-Indo-European d e n t -
    Italian d ɛ n t ə
    Proto-Romance d e n t e
    French d ɑ̃ - -
    4 / 50


  11. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n -
    * Proto-Germanic t a n d
    English t ʊː - θ
    Proto-Indo-European d e n t
    Italian d ɛ n t ə
    * Proto-Romance d e n t
    French d ɑ̃ - -
    4 / 50


  12. Traditional Language Comparison Characteristics
    Research Object
    German ʦ aː n
    Proto-Germanic t a n θ
    English t ʊː θ
    Proto-Indo-European d e n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃
    4 / 50


  13. Traditional Language Comparison Characteristics
    Origins
    Uniformitarianism
    “universality of change” – change is independent of time and space
    “graduality of change” – change is neither abrupt nor chaotic
    “uniformity of change” – change is not heterogeneous, but uniform
    Founding Fathers
    Franz Bopp (1791–1867): language comparison (Bopp 1816)
    Rasmus Rask (1787–1832) and Jacob Grimm (1785–1863): sound
    law (Rask 1818, Grimm 1822)
    August Schleicher (1821–1868): family tree and linguistic
    reconstruction (Schleicher 1853 & 1861)
    5 / 50


  14. Traditional Language Comparison Achievements
    Achievements
    6 / 50


  15. Traditional Language Comparison Achievements
    Methods and Theories
    Comparative Method (Meillet 1925)
    Basic procedure for proving language relationship and reconstructing
    unattested ancestral language states, etymologies, and genetic
    classifications.
    Family Tree Model and Wave Theory (Schleicher 1853, Schmidt 1872)
    Two partially incompatible models to describe historical language
    relations.
    Regularity Hypothesis (Osthoff & Brugmann 1878)
    Fundamental working hypothesis that states that certain sound change
    processes proceed regularly (universally, gradually, and in a uniform
    manner).
    7 / 50


  16. Traditional Language Comparison Achievements
    Insights
    Internal Language History
    Thanks to historical linguistics, the history of a considerable (but still
    small) number of languages has been thoroughly investigated.
    External Language History
    Thanks to historical linguistics, a considerable number of the world's
    languages have been genetically classified (although many questions
    remain unsolved or controversial).
    General Language History
    Some work on general processes of language history has been done,
    yet many questions still remain unsolved or are controversially debated.
    8 / 50


  17. Traditional Language Comparison Problems
    Problems
    9 / 50


  18. Traditional Language Comparison Problems
    Transparency
    Part of the process of “becoming” a competent
    Indo-Europeanist has always been recognized as coming to
    grasp “intuitively” concepts and types of changes in language
    so as to be able to pick and choose between alternative
    explanations for the history and development of specific
    features of the reconstructed language and its offspring.
    Schwink (1994)
    10 / 50


  19. Traditional Language Comparison Problems
    Transparency: Philological Knowledge Representation
    Frucht. Sf std. (9. Jh.), mhd. vruht, ahd. fruht, as. fruht. Entlehnt
    aus l. frūctus m. gleicher Bedeutung (zu l. fruī “genieße”). Das
    deutsche Wort ist Femininum geworden im Anschluß an die
    ti-Abstrakta wie Flucht² usw. Adjektive: fruchtig, fruchtbar;
    Verb: (be-)fruchten. Ebenso nndl. vrucht, ne. fruit, nfrz. fruit,
    nschw. frukt, nnorw. frukt; frugal.
    (Kluge & Seebold 2002)
    11 / 50


  20. Traditional Language Comparison Problems
    Applicability
    – 7,106 languages (Lewis & Fennig 2013)
    – 147 language families (ibid.)
    – 25,244,065 language pairs that could be compared
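The last figure on this slide equals the number of unordered pairs that can be formed from 7,106 languages, i.e. n(n - 1)/2. A short check of that arithmetic (the variable names are chosen here for illustration):

```python
from math import comb

n_languages = 7106  # language count quoted on the slide
n_pairs = comb(n_languages, 2)  # n * (n - 1) / 2 unordered pairs

print(f"{n_pairs:,}")  # 25,244,065
```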
    12 / 50


  21. Traditional Language Comparison Problems
    Applicability
    The amount of digitally available data for the languages of the
    world is growing from day to day, while there are only a few
    historical linguists who are trained to carry out the comparison
    of these languages. It seems impossible to handle this task when
    relying only on the traditional, time-consuming manual procedures
    developed in traditional historical linguistics.
    12 / 50


  22. Traditional Language Comparison Problems
    Adequacy
    One time is never, two times is ever!
    (a mathematician friend on the treatment of probability in
    Indo-European linguistics)
    13 / 50


  23. Traditional Language Comparison Problems
    Summary
    Despite its achievements, traditional historical linguistics has some
    clear shortcomings, such as
    a lack of transparency in methodology,
    the “philological” form of knowledge representation, and
    the questionable validity of certain results.
    14 / 50


  24. Computational Language Comparison
    Computational
    Language Comparison
    15 / 50


  25. Computational Language Comparison Characteristics
    Characteristics
    P(A|B) = P(B|A) P(A) / P(B)
    16 / 50


  26. Computational Language Comparison Characteristics
    Characteristics
    “Indo-European and computational cladistics” (Ringe, Warnow and Taylor
    2002)
    “Language-tree divergence times support the Anatolian theory of
    Indo-European origin” (Gray and Atkinson 2003)
    “Language classification by numbers” (McMahon and McMahon 2005)
    “Curious Parallels and Curious Connections: Phylogenetic Thinking in
    Biology and Historical Linguistics” (Atkinson and Gray 2005)
    “Automated classification of the world’s languages” (Brown et al. 2008)
    “Computational Feature-Sensitive Reconstruction of Language
    Relationships: Developing the ALINE Distance for Comparative Historical
    Linguistic Reconstruction” (Downey et al. 2008)
    “Networks uncover hidden lexical borrowing in Indo-European language
    evolution” (Nelson-Sathi et al. 2011)
    “A pipeline for computational historical linguistics” (Steiner, Stadler, and
    Cysouw 2011)
    17 / 50


  28. Computational Language Comparison Characteristics
    Points of Interest and Goals
    phylogenetic reconstruction
    sequence comparison
    general questions of language development
    Primary Goal
    If we cannot guarantee getting the same results from the same data
    considered by different linguists, we jeopardize the essential scientific
    criterion of repeatability. (McMahon & McMahon 2005)
    18 / 50


  29. Computational Language Comparison Characteristics
    Methods and Theories
    phylogenetic reconstruction (cf., among others, Gray & Atkinson
    2003, Ringe et al. 2002, Brown et al. 2008)
    phonetic alignment (cf., among others, Kondrak 2000, Prokić et al.
    2009, List 2012a)
    cognate detection (cf. Steiner et al. 2011, List 2012b)
    borrowing detection (cf. Nelson-Sathi et al. 2011, List et al. 2014a)
    19 / 50


  30. Computational Language Comparison Achievements
    Achievements
    20 / 50


  31. Computational Language Comparison Achievements
    New Perspectives
    external language history receives more attention than before
    “Indo-Euro-Centrism” is replaced by a more cross-linguistic
    paradigm
    new questions regarding general language history
    new proposals to model language history
    21 / 50


  32. Computational Language Comparison Achievements
    New Approaches
    empirical data becomes the center of interest
    probabilistic approaches replace “historical” approaches
    databases replace philological knowledge representation
    “informal” methods are formalized and automatized
    22 / 50


  33. Computational Language Comparison Problems
    Problems
    23 / 50


  41. Computational Language Comparison Problems
    Transparency
    Evaluation criteria for applied automatic methods are not
    very intuitive and vary greatly.
    Benchmark databases are rarely used; especially in
    phylogenetic approaches, eyeballing of phylogenetic trees is
    sold as proof for “valid approaches”.
    It is difficult to communicate the results to traditional linguists.
    → Many linguists regard automatic approaches as
    – not trustworthy and error-prone, or
    – “impossible per se”, or
    – as useful as “rolling a dice”.
    24 / 50


  42. Computational Language Comparison Problems
    Applicability
    25 / 50


  43. Computational Language Comparison Problems
    Applicability
    Method                          Multilingual?   No additional   Freely
                                                    requirements?   available?
    Mackay & Kondrak 2005           ✗               ✓               ✗
    Bergsma & Kondrak 2007          ✓               ✓               ✗
    Turchin et al. 2010             ✓               ✓               ✓
    Berg-Kirkpatrick & Klein 2011   ✗               ✓               ✗
    Hauer & Kondrak 2011            ✓               ✓               ✗
    Steiner et al. 2011             ✓               ✓               ✗
    List 2012 & List 2014           ✓               ✓               ✓
    Beinborn et al. 2013            ✗               ?               ✗
    Bouchard-Côté et al. 2013       ✓               ✗               ✗
    Rama 2013                       ✗               ✓               ✗
    Ciobanu & Dinu 2014             ✗               ✓               ✗
    …                               …               …               …
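The table can be read as a simple filter: which of the listed methods meet all three criteria at once? The sketch below encodes only the rows shown on the slide (the table ends in "…", so it is not exhaustive) and treats the "?" entry as unknown. The names `METHODS` and `fully_applicable` are introduced here for illustration.

```python
# Rows copied from the slide:
# (method, multilingual, no additional requirements, freely available).
# None stands for the "?" entry.
METHODS = [
    ("Mackay & Kondrak 2005", False, True, False),
    ("Bergsma & Kondrak 2007", True, True, False),
    ("Turchin et al. 2010", True, True, True),
    ("Berg-Kirkpatrick & Klein 2011", False, True, False),
    ("Hauer & Kondrak 2011", True, True, False),
    ("Steiner et al. 2011", True, True, False),
    ("List 2012 & List 2014", True, True, True),
    ("Beinborn et al. 2013", False, None, False),
    ("Bouchard-Côté et al. 2013", True, False, False),
    ("Rama 2013", False, True, False),
    ("Ciobanu & Dinu 2014", False, True, False),
]


def fully_applicable(rows):
    """Return the names of the methods for which all three criteria hold."""
    return [name for name, *criteria in rows
            if all(value is True for value in criteria)]


print(fully_applicable(METHODS))
# ['Turchin et al. 2010', 'List 2012 & List 2014']
```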
    25 / 50


  48. Computational Language Comparison Problems
    Accuracy
    Data Problems (Geisler & List forthcoming)
    Comparing two independently produced lexicostatistical datasets:
    database            # languages   # concepts
    Dyen et al. 1997    95            200
    Tower of Babel      98            110
    intersection        46            103
    Results
    up to 10 % difference in concept translations
    many undetected borrowings in both datasets
    up to 30 % differences in tree topologies for Bayesian analyses
    26 / 50


  49. Computational Language Comparison Problems
    Accuracy
    [Figure: comparison of the two datasets (DKB, TOB); labels: Greek, Slavic, Celtic, Indo-Iranian, Romance, Germanic, Baltic, Armenian, Albanian]
    26 / 50


  50. Computational Language Comparison Problems
    Accuracy
    [Figure: comparison of the two datasets (DKB, TOB); labels: South-Slavic, East-Slavic, West-Slavic]
    26 / 50


  51. Computational Language Comparison Problems
    Summary
    Many quantitative methods which are based on manually compiled
    datasets cannot cope with errors resulting from inconsistent data
    compilation. They are only as objective as the data being fed to
    them!
    Many quantitative approaches are insufficiently tested, and
    scholars are often content with results traditional linguists would
    never accept.
    Additionally, quantitative approaches are often presented in a way
    that makes it hard (not only for traditional linguists) to understand
    what they are based upon. Results are reported in a non-transparent
    way, supplementary data is often lacking, concrete examples are
    seldom provided, and source code (essential to check and replicate
    analyses) is missing in almost all recent publications.
    27 / 50


  58. PRO:
    - intuition
    - background knowledge
    - can juggle with multiple types of evidence
    CONTRA:
    - has to sleep and rest
    - does not like to count and do boring work
    - can overlook facts when doing boring work
    CONTRA:
    - no intuition
    - no background knowledge
    - can't juggle with multiple types of evidence
    PRO:
    - doesn't need to sleep
    - is very good at counting and boring work
    - doesn't make errors in boring work
    P(A|B) = P(B|A) P(A) / P(B)
    FRANZ BOPP
    VERY,
    VERY
    LONG
    TITLE
    COMPUTER-ASSISTED LANGUAGE COMPARISON
    28 / 50


  59. Computer-Assisted Language Comparison
    Computer-Assisted
    Language Comparison
    29 / 50


  60. Computer-Assisted Language Comparison Datasets
    Datasets
    30 / 50


  63. Computer-Assisted Language Comparison Datasets
    Benchmark Databases
    http://alignments.lingpy.org
    31 / 50


  66. Computer-Assisted Language Comparison Datasets
    Benchmark Databases
    http://sequencecomparison.github.io
    32 / 50


  67. Computer-Assisted Language Comparison Datasets
    Benchmark Databases
    More?
    Originally, I planned to publish a benchmark database for linguistic
    reconstruction in addition to the two benchmark databases
    mentioned before.
    Owing to a variety of problems, this undertaking has been delayed
    ever since I started to collect the first datasets.
    In the future, the initial ideas for the benchmark, along with the
    datasets created so far, will be included as part of a larger
    collaborative effort to launch a database for cross-linguistic
    historical phonology (PhonoBank, MPI Jena).
    33 / 50


  68. Computer-Assisted Language Comparison Datasets
    Database of Cross-Linguistic Colexifications (CLICS)
    Key     Concept        Russian     German        ...
    1.1     world          mir, svet   Welt          ...
    1.21    earth, land    zemlja      Erde, Land    ...
    1.212   ground, soil   počva       Erde, Boden   ...
    1.420   tree           derevo      Baum          ...
    1.430   wood           derevo      Wald          ...
    ...     ...            ...         ...           ...
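A colexification in this table is a case where one language uses the same form for two concepts, for example Russian derevo for both "tree" and "wood". The following is a minimal sketch of how such cases could be collected from the rows shown above. The data structure and the function name `colexifications` are assumptions for illustration and do not reproduce the actual CLICS code.

```python
from collections import defaultdict
from itertools import combinations

# Rows from the table above: concept -> {language: [forms]}
WORDLIST = {
    "world": {"Russian": ["mir", "svet"], "German": ["Welt"]},
    "earth, land": {"Russian": ["zemlja"], "German": ["Erde", "Land"]},
    "ground, soil": {"Russian": ["počva"], "German": ["Erde", "Boden"]},
    "tree": {"Russian": ["derevo"], "German": ["Baum"]},
    "wood": {"Russian": ["derevo"], "German": ["Wald"]},
}


def colexifications(wordlist):
    """Map each concept pair to the (language, form) items expressing both."""
    # First collect, for every form in every language, the concepts it covers
    concepts_by_form = defaultdict(set)
    for concept, languages in wordlist.items():
        for language, forms in languages.items():
            for form in forms:
                concepts_by_form[(language, form)].add(concept)
    # A form covering two or more concepts links each pair of them
    links = defaultdict(list)
    for (language, form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            links[pair].append((language, form))
    return dict(links)


for pair, evidence in colexifications(WORDLIST).items():
    print(pair, evidence)
# ('earth, land', 'ground, soil') [('German', 'Erde')]
# ('tree', 'wood') [('Russian', 'derevo')]
```

Counting how many languages support each concept pair would then give weighted links between concepts.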
    34 / 50


  69. Computer-Assisted Language Comparison Datasets
    Database of Cross-Linguistic Colexifications (CLICS)
    CLICS: Crosslinguistic Colexifications
    - 221 languages
    - 64 language families
    - 1280 concepts
    - 301,498 words
    - 45,667 polysemies (colexifications)
    - 16,239 different links between concepts
    - http://clics.lingpy.org
    34 / 50


  70. Computer-Assisted Language Comparison Datasets
    Database of Cross-Linguistic Colexifications (CLICS)
    [Figure: network of CLICS concepts; numeric node labels omitted]
    34 / 50

    View Slide

  71. Computer-Assisted Language Comparison Datasets
    Database of Cross-Linguistic Colexifications (CLICS)
    Concept "money" is part of a cluster with the central concept "fishscale" with a total of 10 nodes.
    Nodes: silver, leather, fishscale, bark, coin, fur, snail, skin/hide, money, shell
    49 links for "silver" and "money":
    No. | Language | Family | Form
    1 | Ignaciano | Arawakan | ne
    2 | Aymara, Central | Aymaran | ḳulʸḳi
    3 | Tsafiki | Barbacoan | kaˈla
    4 | Seselwa Creole French | Creole | larzan
    5 | Miao, White | Hmong-Mien | nyiaj
    6 | Breton | Indo-European | arhant
    7 | French | Indo-European | argent
    8 | Gaelic, Irish | Indo-European | airgead
    9 | Welsh | Indo-European | arian
    10 | Cofán | Isolate | koriΦĩʔdi
    34 / 50

    View Slide

  72. Computer-Assisted Language Comparison Datasets
    Database of Cross-Linguistic Colexifications (CLICS)
    Concept "wheel" is part of a cluster with the central concept "leg" with a total of 11 nodes.
    Nodes: sphere/ball, round, footprint, foot, calf of leg, circle, thigh, wheel, leg, hip, buttocks
    6 links for "foot" and "wheel":
    No. | Language | Family | Form
    1 | Cofán | Isolate | c̷ɨʔtʰe
    2 | Puinave | Isolate | sim
    3 | Yaminahua | Panoan | taɨ
    4 | Wayampi | Tupi | pɨ
    5 | Pumé | Unclassified | taɔ
    6 | Ninam | Yanomam | mãhuk
    34 / 50

    View Slide

  73. Computer-Assisted Language Comparison Standards
    Standards
    35 / 50

    View Slide

  74. Computer-Assisted Language Comparison Standards
    Standardizing Concept Labeling
    36 / 50

    View Slide

  75. Computer-Assisted Language Comparison Standards
    Standardizing Concept Labeling: Background
    36 / 50

    View Slide

  76. Computer-Assisted Language Comparison Standards
    Standardizing Concept Labeling: Background
    Concept List | # Items | Concept Label | Concept ID
    Allen (2007) | 500 | animal oil; 动物油(脂肪) | GREASE (CONCEPTICON-ID: 323)
    Gregersen (1976) | 217 | fat-grease*fat-grease | GREASE (CONCEPTICON-ID: 323)
    Heggarty (2005) | 150 | fat (grease); grasa | GREASE (CONCEPTICON-ID: 323)
    Swadesh (1955) | 100 | fat (grease) | GREASE (CONCEPTICON-ID: 323)
    Alpher and Nash (1999) | 151 | fat, grease | GREASE (CONCEPTICON-ID: 323)
    Hale (1961) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
    O'Grady and Klokeid (1969) | 100 | fat, grease | GREASE (CONCEPTICON-ID: 323)
    Blust (2008) | 210 | fat/grease | GREASE (CONCEPTICON-ID: 323)
    Matisoff (1978) | 200 | fat/grease | GREASE (CONCEPTICON-ID: 323)
    Samarin (1969) | 218 | fat/grease | GREASE (CONCEPTICON-ID: 323)
    Dunn et al. (2012) | 207 | fat | GREASE (CONCEPTICON-ID: 323)
    Swadesh (1950) | 215 | fat | GREASE (CONCEPTICON-ID: 323)
    Zgraggen (1980) | 380 | fat | GREASE (CONCEPTICON-ID: 323)
    Jachontov (1991) | 100 | fat n. | GREASE (CONCEPTICON-ID: 323)
    Wiktionary (2003) | 207 | fat (noun) | GREASE (CONCEPTICON-ID: 323)
    Starostin (1991) | 110 | fat n.; жир | GREASE (CONCEPTICON-ID: 323)
    Teil-Dautrey et al. (2008) | 430 | fat, oil | GREASE (CONCEPTICON-ID: 323)
    Swadesh (1952) | 200 | fat (organic substance) | GREASE (CONCEPTICON-ID: 323)
    Shiro (1973) | 200 | grease (fat) | GREASE (CONCEPTICON-ID: 323)
    Samarin (1969) | 100 | grease; graisse; Fett; grasa | GREASE (CONCEPTICON-ID: 323)
    Wang (2006) | 200 | pig oil; 猪油 | GREASE (CONCEPTICON-ID: 323)
    Haspelmath and Tadmor (2009) | 1460 | the grease or fat | GREASE (CONCEPTICON-ID: 323)
    Concept labels for “GREASE” in 22 different concept lists (see List et al. 2015,
    online at http://concepticon.clld.org)
    36 / 50

    View Slide

  79. Computer-Assisted Language Comparison Standards
    Standardizing Concept Labeling: The Concepticon
    The Concepticon is an attempt to link the many different concept lists
    (“Swadesh lists”) that are used in the linguistic literature. In practice,
    every entry from the various concept lists is linked to a concept set, which
    serves as an intermediate reference point for the concepts. The Concepticon
    - links 9611 concepts from 51 concept lists to 2206 concept sets, and
    - defines 243 relations between the concept sets.
    List, Cysouw & Forkel (2015): Concepticon. Version 0.1, http://concepticon.clld.org.
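    The linking idea can be illustrated with a minimal Python sketch. The labels
    and the Concepticon ID 323 are taken from the GREASE slides; the dictionary
    layout and the function names `concept_set` and `labels_for` are assumptions
    made for illustration, not the Concepticon's actual data model.

```python
# Hypothetical in-memory mapping: (concept list, label) -> concept set.
# Only the labels and the ID 323 (GREASE) are taken from the slides.
LINKS = {
    ("Swadesh (1955)", "fat (grease)"): (323, "GREASE"),
    ("Blust (2008)", "fat/grease"): (323, "GREASE"),
    ("Wang (2006)", "pig oil; 猪油"): (323, "GREASE"),
    ("Haspelmath and Tadmor (2009)", "the grease or fat"): (323, "GREASE"),
}


def concept_set(concept_list, label):
    """Return (ID, gloss) of the linked concept set, or None if unlinked."""
    return LINKS.get((concept_list, label))


def labels_for(concept_id):
    """Collect every source label that is linked to one concept set."""
    return sorted(
        label for (_, label), (cid, _) in LINKS.items() if cid == concept_id
    )
```

    Because every list points to the same concept set, two word lists that use
    different labels can still be compared through the shared ID.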
    37 / 50

    View Slide

  80. Computer-Assisted Language Comparison Standards
    Standardizing Lexical Representation
    38 / 50

    View Slide

  81. Computer-Assisted Language Comparison Standards
    Standardizing Lexical Representation
    Dialect | Entry | IPA | Segments | Morphemes
    Beijing | 大油 | ta⁵¹ iou³⁵ | t a ⁵¹ i o u ³⁵ | t a ⁵¹ + i o u ³⁵
    Changsha | 油 | tɕy³³ iəu¹³ | tɕ y ³³ i ə u ¹³ | tɕ y ³³ + i ə u ¹³
    Chengdu | 猪油 | tsu⁴⁴iəu³¹ | ts u ⁴⁴ i ə u ³¹ | ts u ⁴⁴ + i ə u ³¹
    Fuzhou | 猪油 | ty⁴⁴iu⁵² | t y ⁴⁴ i u ⁵² | t y ⁴⁴ + i u ⁵²
    Guangzhou | 猪膏 | tʃy⁵⁵kou⁵³ | tʃ y ⁵⁵ k ou ⁵³ | tʃ y ⁵⁵ + k ou ⁵³
    Meixian | 油 | jiu¹² | j i u ¹² | j i u ¹ ²
    Nanchang | 油 | iu⁵⁵ | i u ⁵⁵ | i u ⁵⁵
    Taibei | ti44 iu13豬油 | ti⁴⁴ iu¹³ | t i ⁴⁴ i u ¹³ | t i ⁴⁴ + i u ¹³
    Wenzhou | 猪油 | tsei⁴⁴ ɦiau³¹ | ts e i ⁴⁴ ɦ i a u ³¹ | ts e i ⁴⁴ + ɦ i a u ³¹
    Xiamen | 油 | iu²⁴ | i u ²⁴ | i u ²⁴
    Lexical entries for “GREASE” (“pork fat”) in 10 Chinese dialect varieties
    (data taken from Wang and Hamed 2006)
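    The step from the Segments column to the Morphemes column can be sketched
    in a few lines of Python. This is a simplified illustration, not the
    procedure used to prepare the data: it assumes that every syllable ends in
    a superscript tone and that each syllable is one morpheme, and the function
    names are hypothetical.

```python
TONES = set("⁰¹²³⁴⁵⁶⁷⁸⁹")


def is_tone(segment):
    """True if the segment consists only of superscript tone digits."""
    return bool(segment) and all(ch in TONES for ch in segment)


def mark_morphemes(segments):
    """Insert '+' after every tone segment except the last one.

    Assumption: each syllable ends in a tone and corresponds to one morpheme.
    """
    parts = segments.split()
    out = []
    for i, seg in enumerate(parts):
        out.append(seg)
        if is_tone(seg) and i < len(parts) - 1:
            out.append("+")
    return " ".join(out)
```

    For the Beijing entry, `mark_morphemes("t a ⁵¹ i o u ³⁵")` yields
    `t a ⁵¹ + i o u ³⁵`; monosyllabic entries such as Xiamen `i u ²⁴` are
    returned unchanged.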
    38 / 50

    View Slide

  85. Computer-Assisted Language Comparison Standards
    Standardizing the Representation of Judgments
    39 / 50

    View Slide

  86. Computer-Assisted Language Comparison Standards
    Standardizing the Representation of Judgments
    Language | Lexical Entry | Cognacy | Alignment
    Central Amis | simar | 2 | s i m a r
    Thao | lhimash | 2 | lh i m a sh
    Hanunóo | tabáʔ | 23 | t a b á ʔ
    Nias | tawõ | 23 | t a w õ -
    Mailu | mona | 1 | m o n a -
    Maloh | -iñak | 1 | - i ñ a k
    Tetum | mina | 1 | m i n a -
    Banggi | laːna | 24 | l aː n a -
    Berawan (Long Terawan) | ləməʔ | 24 | l ə m ə ʔ
    Iban | lemak | 24 | l e m a k
    Cognate judgments for “grease/fat” across 10 Austronesian languages
    (data taken from Greenhill et al. 2008, online at
    http://language.psy.auckland.ac.nz/austronesian/)
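    A machine-readable version of such judgments could look like the following
    Python sketch. The rows are taken from the table; the tuple layout and the
    helper names `cognate_sets` and `misaligned` are assumptions for
    illustration. The check encodes one property a standard could enforce: all
    aligned words within a cognate set have the same number of columns.

```python
from collections import defaultdict

# (language, cognate ID, alignment) as given on the slide.
ROWS = [
    ("Central Amis", 2, "s i m a r"),
    ("Thao", 2, "lh i m a sh"),
    ("Hanunóo", 23, "t a b á ʔ"),
    ("Nias", 23, "t a w õ -"),
    ("Mailu", 1, "m o n a -"),
    ("Maloh", 1, "- i ñ a k"),
    ("Tetum", 1, "m i n a -"),
]


def cognate_sets(rows):
    """Group (language, segments) pairs by their cognate ID."""
    sets = defaultdict(list)
    for language, cogid, alignment in rows:
        sets[cogid].append((language, alignment.split()))
    return dict(sets)


def misaligned(sets):
    """Return IDs of cognate sets whose aligned rows differ in length."""
    return [
        cogid
        for cogid, members in sets.items()
        if len({len(alignment) for _, alignment in members}) > 1
    ]
```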
    39 / 50

    View Slide

  88. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    40 / 50

    View Slide

  89. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    JENA WORDLIST STANDARD
    The Jena Wordlist Standard is being developed by the NESCent-style working group
    “GlottoBank: Towards a Global Language Phylogeny” under the direction of Russell Gray.
    40 / 50

    View Slide

  90. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    DEFINE STANDARDS FOR
    - Wordlists
    - Cognate Sets
    - Alignments
    PROVIDE TOOLS FOR
    - Data Validation
    - Data Exchange
    - Data Enrichment
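    What a data validation tool might check can be sketched as follows. The
    slides do not specify the fields of the standard, so the column names in
    `REQUIRED` and the function `validate` are purely hypothetical; the sketch
    only shows the general shape of such a check on a tab-separated wordlist.

```python
import csv
import io

# Hypothetical required columns; the slides do not name the standard's fields.
REQUIRED = ("ID", "LANGUAGE", "CONCEPT", "FORM")


def validate(tsv_text, required=REQUIRED):
    """Return a list of human-readable problems found in a TSV wordlist."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for name in required:
            if not (row[name] or "").strip():
                problems.append(f"line {line_no}: empty {name}")
    return problems
```

    An empty result means the file passed these (deliberately minimal) checks.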
    40 / 50

    View Slide

  91. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    INTEGRATE EXISTING STANDARDS
    - Glottolog, http://glottolog.clld.org
    - Phoible, http://phoible.clld.org
    - Concepticon, http://concepticon.clld.org
    40 / 50

    View Slide

  92. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    PROVIDE TOOLS FOR EDITING AND ANALYSIS
    - LingPy, http://lingpy.org
    - TSV EDICTOR, http://tsv.lingpy.org
    40 / 50

    View Slide

  93. Computer-Assisted Language Comparison Standards
    Jena Wordlist Standard
    USE THE STANDARD TO BUILD NEW DATABASES
    - LexiBank: Cross-Linguistic Database of Lexical Cognate Sets
    - PhonoBank: Cross-Linguistic Database of Regular Sound Change Patterns
    40 / 50

    View Slide

  94. Computer-Assisted Language Comparison Software Tools
    Software Tools
    41 / 50

    View Slide

  95. Computer-Assisted Language Comparison Software Tools
    Background: Computer-Assisted Workflows
    42 / 50

    View Slide

  96. Computer-Assisted Language Comparison Software Tools
    Background: Computer-Assisted Workflows
    [Figure: workflow diagram]
    Processing steps: Semantic Tagging, Segmentation, Cognate Detection, Alignment Analysis,
    Linguistic Reconstruction, Phylogenetic Reconstruction
    Data stages: RAW DATA, WORDLIST DATA, TOKENS/MORPHEMES, COGNATE SETS,
    SOUND CORRESPONDENCES, PROTO-FORMS, PHYLOGENIES
    Example entries shown at each stage: HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk]
    Arrow labels: PROVIDES AUTOMATIC ANALYSES, REVISES AUTOMATIC ANALYSES
    Icons: Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B); a book labeled "FRANZ BOPP: VERY, VERY LONG TITLE"
    A possible computer-assisted, iterative workflow with automatic and manual components.
    42 / 50

    View Slide

  97. Computer-Assisted Language Comparison Software Tools
    LingPy and EDICTOR
    43 / 50

    View Slide

  98. Computer-Assisted Language Comparison Software Tools
    LingPy and EDICTOR
    LingPy
    http://lingpy.org
    TSV EDICTOR
    http://tsv.lingpy.org
    43 / 50

    View Slide

  99. Computer-Assisted Language Comparison Software Tools
    LingPy and EDICTOR
    LingPy and EDICTOR: Two tools for computer-assisted language comparison.
    LingPy (http://lingpy.org): Software Library for Automatic
    Tasks in Historical Linguistics
    - phonetic segmentation
    - phonetic alignment
    - cognate detection
    - ancestral state reconstruction
    - borrowing detection
    - phylogenetic reconstruction
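    To give an impression of what phonetic alignment involves, here is a
    deliberately simplified, standard-library-only sketch of global pairwise
    alignment (Needleman-Wunsch) over sound segments. It is not LingPy's
    algorithm or API: the uniform scores (match 1, mismatch -1, gap -1) and the
    function name `align` are assumptions, whereas real phonetic alignment
    methods score similar sounds more favorably than dissimilar ones.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; gaps are written as '-'."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]
```

    Aligning Iban `l e m a k` with Banggi `l aː n a` places the gap at the end
    of the Banggi word, as in the Austronesian cognate table earlier in the deck.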
    43 / 50

    View Slide

  100. Computer-Assisted Language Comparison Software Tools
    LingPy and EDICTOR
    LingPy and EDICTOR: Two tools for computer-assisted language comparison.
    TSV EDICTOR (http://tsv.lingpy.org): Online Tool for Computer-
    Assisted Language Comparison
    - server- and client-based
    - data validation
    - phonetic segmentation
    - cognate set editor
    - alignment editor
    - correspondence evaluation
    43 / 50

    View Slide

  101. Computer-Assisted Language Comparison Software Tools
    Demo
    Testfile: rom.xls, from the Global Lexicostatistical Database (GLD,
    Starostin 2014), downloadable from
    http://starling.rinet.ru/new100/rom.xls.
    Spreadsheet: Tool for data conversion from the GLD format (Excel
    spreadsheet) to the LingPy format (TSV), available at
    http://dighl.github.com/spreadsheet.
    LingPy: Use LingPy to tokenize the data (currently not
    implemented in Spreadsheet), compute a phylogenetic tree
    (Neighbor-Joining or UPGMA), test automatic cognate detection,
    align the data, and convert the data to Nexus format.
    Edictor: Use Edictor to inspect the data, carry out manual
    alignment analyses, and check and edit the cognate judgments.
    Additional scripts accompanying the demo are available online at
    https://gist.github.com/LinguList/17548931a1aa8862c408
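    The last LingPy step, conversion to Nexus format, can be illustrated
    without LingPy. The sketch below is an assumption-laden stand-in, not the
    demo script: it encodes cognate judgments for a single concept as a binary
    presence/absence matrix, and the function name `cognates_to_nexus` is
    hypothetical. The toy input reuses cognate IDs from the Austronesian table
    earlier in the deck rather than the Romance data of the demo.

```python
def cognates_to_nexus(cogids):
    """Encode {language: cognate ID} as a binary presence/absence NEXUS matrix."""
    taxa = sorted(cogids)
    sets = sorted(set(cogids.values()))
    width = max(len(taxon) for taxon in taxa)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(taxa)} NCHAR={len(sets)};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "MATRIX",
    ]
    for taxon in taxa:
        # One character per cognate set: 1 if the language has it, else 0.
        row = "".join("1" if cogids[taxon] == s else "0" for s in sets)
        lines.append(f"{taxon.ljust(width)} {row}")
    lines += [";", "END;"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(cognates_to_nexus({"Mailu": 1, "Tetum": 1, "Thao": 2, "Iban": 24}))
```

    Taxon names containing spaces or parentheses would need quoting or
    renaming before being written to a Nexus file; the sketch does not handle
    that case.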
    44 / 50

    View Slide

  102. Challenges
    Challenges
    45 / 50

    View Slide

  103. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    46 / 50

    View Slide

  104. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    46 / 50

    View Slide

  105. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    46 / 50

    View Slide

  106. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    soːl-
    SUN
    Romance
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    46 / 50

    View Slide

  107. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    soːl-
    SUN
    soːlikul-
    SMALL SUN
    Romance
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    46 / 50

    View Slide

  108. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    soːl-
    SUN
    soːlikul-
    SMALL SUN
    Romance
    solej
    SUN
    French
    sol
    SUN
    Spanish
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    46 / 50

    View Slide

  109. Challenges Modeling Language Change
    Modeling Language Change
    'soh₂-wl̩- sh₂uˈen-
    SUN
    Indo-European
    soːwel- sunːoː-
    SUN
    Germanic
    soːl-
    SUN
    soːlikul-
    SMALL SUN
    Romance
    solej
    SUN
    French
    sol
    SUN
    Spanish
    zɔnə
    SUN
    German
    suːl
    SUN
    Swedish
    Edge labels: SEMANTIC SHIFT (once), MORPHOLOGICAL CHANGE (four times)
    46 / 50

    View Slide

  110. Challenges Modeling Language Change
    Modeling Language Change
    So far, our linguistic databases mostly model the relations
    between different linguistic entities (cognacy, borrowing,
    etc.). To fully reflect what is “philologically” encoded in
    etymological dictionaries, however, we need to start thinking of
    how to model processes.
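    One way to think about modeling processes rather than relations is to
    store each change as a directed, typed link between two forms. The Python
    sketch below is only an illustration under that assumption: the class
    names, the helper `history`, and the assignment of labels to the two links
    are an illustrative reading of the SUN diagram, not an authoritative
    etymology or an existing database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form: str
    meaning: str
    language: str


@dataclass(frozen=True)
class Process:
    source: Form
    target: Form
    kind: str  # e.g. "morphological change", "semantic shift"


sol = Form("soːl-", "SUN", "Romance")
solikul = Form("soːlikul-", "SMALL SUN", "Romance")
solej = Form("solej", "SUN", "French")

# Illustrative reading of the diagram, not an authoritative etymology.
PROCESSES = [
    Process(sol, solikul, "morphological change"),
    Process(solikul, solej, "semantic shift"),
]


def history(form, processes):
    """Walk back from a form to its earliest recorded source (acyclic data assumed)."""
    steps = []
    current = form
    while True:
        incoming = [p for p in processes if p.target == current]
        if not incoming:
            return list(reversed(steps))
        steps.append(incoming[0])
        current = incoming[0].source
```

    Here `history(solej, PROCESSES)` returns the two steps in chronological
    order, which a purely relational "cognate with" link could not express.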
    46 / 50

    View Slide

  111. Challenges Limitations of Knowledge
    Limitations of Knowledge
    "wolf"
    lupus
    Latin
    "wolf"
    lupus
    Latin
    ?
    47 / 50

    View Slide

  112. Challenges Limitations of Knowledge
    Limitations of Knowledge
    "wolf"
    lupus
    Latin
    "wolf"
    *ulkʷo-
    Indo-European
    "wolf"
    *lupo-
    Sabellic
    "wolf"
    *lukʷo-
    Italic
    47 / 50

    View Slide

  113. Challenges Limitations of Knowledge
    Limitations of Knowledge
    "wolf"
    lupus
    Latin
    "wolf"
    lupus
    Latin
    "marten"
    *ulp-
    Indo-European
    "fox"
    *ulp-
    Italic
    "fox"
    volpes
    Latin
    47 / 50

    View Slide

  114. Challenges Limitations of Knowledge
    Limitations of Knowledge
    "wolf"
    *lukʷo-
    Italic
    "wolf"
    lupus
    Latin
    "wolf"
    *ulkʷo-
    Indo-European
    "wolf"
    lupus
    Latin
    "marten"
    *ulp-
    Indo-European
    "fox"
    *ulp-
    Italic
    "fox"
    volpes
    Latin
    ??
    "wolf"
    *lupo-
    Sabellic
    47 / 50

    View Slide

  115. Challenges Limitations of Knowledge
    Limitations of Knowledge
    Although our computational algorithms are getting better and
    better at modeling uncertainties, our databases still give the
    impression that they represent fully proven facts. We need
    to find a way to include uncertainties when modeling and
    representing our data.
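    A first step toward representing uncertainty is to allow more than one
    derivation per word instead of a single "proven" one. The sketch below
    assumes such a design; the class `Etymology` and its fields are
    hypothetical, and the two chains are one possible reading of the competing
    accounts of Latin lupus shown on the preceding slides. No probabilities
    are attached, because the slides give none.

```python
from dataclasses import dataclass, field


@dataclass
class Etymology:
    word: str
    language: str
    # Each hypothesis is a chain of (form, meaning, language), oldest first.
    hypotheses: list = field(default_factory=list)

    def is_contested(self):
        """True when more than one derivation is on record."""
        return len(self.hypotheses) > 1


lupus = Etymology("lupus", "Latin")
# Reading 1 (assumed order): inherited 'wolf' word passed on via Sabellic.
lupus.hypotheses.append([
    ("*ulkʷo-", "wolf", "Indo-European"),
    ("*lukʷo-", "wolf", "Italic"),
    ("*lupo-", "wolf", "Sabellic"),
])
# Reading 2 (assumed order): connection with the 'marten'/'fox' word.
lupus.hypotheses.append([
    ("*ulp-", "marten", "Indo-European"),
    ("*ulp-", "fox", "Italic"),
])
```

    A database built this way can flag `lupus.is_contested()` to its users
    instead of silently presenting one of the two accounts as fact.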
    47 / 50

    View Slide

  116. Challenges Reconciliation of Evidence
    Reconciliation of Evidence
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    48 / 50

    View Slide

  118. Challenges Reconciliation of Evidence
    Reconciliation of Evidence
    Fúzhōu
    Měixiàn
    Guǎngzhōu
    Běijīng
    48 / 50

    View Slide

  120. Challenges Reconciliation of Evidence
    Reconciliation of Evidence
    LOSS
    INNOVATION
    INNOVATION
    BORROWING
    48 / 50

    View Slide

  121. Challenges Reconciliation of Evidence
    Reconciliation of Evidence
    Despite the massive body of data which has been accumulated
    during the last two centuries of research, we are still far
    away from being able to sufficiently reconcile all the different
    types of evidence which are important for our discipline.
    48 / 50

    View Slide

  122. P(A|B) = P(B|A) P(A) / P(B)
    It’s a very long way up to the top...
    49 / 50

    View Slide

  123. P(A|B) = P(B|A) P(A) / P(B)
    ... but together we can make it!
    49 / 50

    View Slide

  124. P(A|B) = P(B|A) P(A) / P(B)
    Concluding Remarks
    The possibilities for research in historical linguistics are nowadays
    greater than ever before. But so are our challenges. In order to be up
    to the job, we cannot do without computers, but likewise, we cannot do
    without the intuition and experience of trained historical linguists.
    What we further need are combined efforts of standardization and
    knowledge exchange. We need to bridge disciplines and break down the
    frontiers between different schools.
    50 / 50

    View Slide

  125. P(A|B) = P(B|A) P(A) / P(B)
    Thanks for Your Attention!
    50 / 50

    View Slide