Datasets and Software Tools for Computer-Assisted Language Comparison

Johann-Mattis List
DFG research fellow
Centre des recherches linguistiques sur l’Asie Orientale
Team Adaptation, Integration, Reticulation, Evolution
EHESS and UPMC, Paris
2015-08-20

Traditional Language Comparison

Characteristics


Research Object

Words for “tooth” in four languages, with their reconstructed ancestral forms:

Language             Form
German               ʦ aː n
Proto-Germanic       t a n θ
English              t ʊː θ
Proto-Indo-European  d e n t
Italian              d ɛ n t e
Proto-Romance        d e n t e
French               d ɑ̃

Origins

Uniformitarianism
- “universality of change”: change is independent of time and space
- “graduality of change”: change is neither abrupt nor chaotic
- “uniformity of change”: change is not heterogeneous, but uniform

Founding Fathers
- Franz Bopp (1791–1867): language comparison (Bopp 1816)
- Rasmus Rask (1787–1832) and Jacob Grimm (1785–1863): sound law (Rask 1818, Grimm 1822)
- August Schleicher (1821–1868): family tree and linguistic reconstruction (Schleicher 1853 & 1861)

Achievements

Methods and Theories
- Comparative Method (Meillet 1925): basic procedure for proving language relationship and reconstructing unattested ancestral language states, etymologies, and genetic classifications.
- Family Tree Model and Wave Theory (Schleicher 1853, Schmidt 1872): two partially incompatible models to describe historical language relations.
- Regularity Hypothesis (Osthoff & Brugmann 1878): fundamental working hypothesis stating that certain sound change processes proceed regularly (universally, gradually, and in a uniform manner).

Insights
- Internal Language History: thanks to historical linguistics, the history of a considerable (but still small) number of languages has been thoroughly investigated.
- External Language History: thanks to historical linguistics, a considerable number of the world's languages have been genetically classified (although many questions remain unsolved or controversial).
- General Language History: some work on general processes of language history has been done, yet many questions remain unsolved or are controversially debated.

Problems

Transparency

“Part of the process of ‘becoming’ a competent Indo-Europeanist has always been recognized as coming to grasp ‘intuitively’ concepts and types of changes in language so as to be able to pick and choose between alternative explanations for the history and development of specific features of the reconstructed language and its offspring.” (Schwink 1994)

Transparency: Philological Knowledge Representation

Frucht. Noun (f.), standard vocabulary (9th c.), MHG vruht, OHG fruht, OS fruht. Borrowed from Latin frūctus (m.) with the same meaning (related to Latin fruī “enjoy”). The German word became feminine by analogy with the ti-abstracts such as Flucht², etc. Adjectives: fruchtig, fruchtbar; verb: (be-)fruchten. Likewise Modern Dutch vrucht, Modern English fruit, Modern French fruit, Modern Swedish frukt, Modern Norwegian frukt; frugal. (Kluge and Seebold 2002, translated from German)

Applicability
- 7,106 languages (Lewis & Fennig 2013)
- 147 language families (ibid.)
- 25,244,065 language pairs which could be compared
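The last figure on this slide is the number of unordered pairs that can be formed from 7,106 languages, n(n - 1)/2. A minimal sketch of the arithmetic (the variable names are illustrative and not from the talk):

```python
from math import comb

# Number of languages reported on the slide (Lewis & Fennig 2013)
n_languages = 7106

# Unordered language pairs: n choose 2 = n * (n - 1) / 2
n_pairs = comb(n_languages, 2)

# The closed-form expression gives the same count
assert n_pairs == n_languages * (n_languages - 1) // 2

print(f"{n_pairs:,}")  # 25,244,065
```

The count grows quadratically with the number of languages, which is why purely manual pairwise comparison does not scale.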

The amount of digitally available data for the languages of the world is growing from day to day, while there are only a few historical linguists who are trained to carry out the comparison of these languages. It seems impossible to handle this task when relying only on the time-consuming manual procedures developed in traditional historical linguistics.

Adequacy

“One time is never, two times is ever!” (a mathematician friend on the treatment of probability in Indo-European linguistics)

Summary

Despite its achievements, traditional historical linguistics has some clear shortcomings, such as a lack of transparency in methodology, the “philological” form of knowledge representation, and the questionable validity of certain results.

Computational Language Comparison

Characteristics

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

- “Indo-European and computational cladistics” (Ringe, Warnow and Taylor 2002)
- “Language-tree divergence times support the Anatolian theory of Indo-European origin” (Gray and Atkinson 2003)
- “Language classification by numbers” (McMahon and McMahon 2005)
- “Curious Parallels and Curious Connections: Phylogenetic Thinking in Biology and Historical Linguistics” (Atkinson and Gray 2005)
- “Automated classification of the world’s languages” (Brown et al. 2008)
- “Computational Feature-Sensitive Reconstruction of Language Relationships: Developing the ALINE Distance for Comparative Historical Linguistic Reconstruction” (Downey et al. 2008)
- “Networks uncover hidden lexical borrowing in Indo-European language evolution” (Nelson-Sathi et al. 2011)
- “A pipeline for computational historical linguistics” (Steiner, Stadler, and Cysouw 2011)

Points of Interest and Goals
- phylogenetic reconstruction
- sequence comparison
- general questions of language development

Primary Goal

“If we cannot guarantee getting the same results from the same data considered by different linguists, we jeopardize the essential scientific criterion of repeatability.” (McMahon & McMahon 2005)

Methods and Theories
- phylogenetic reconstruction (cf., among others, Gray & Atkinson 2003, Ringe et al. 2002, Brown et al. 2008)
- phonetic alignment (cf., among others, Kondrak 2000, Prokić et al. 2009, List 2012a)
- cognate detection (cf. Steiner et al. 2011, List 2012b)
- borrowing detection (cf. Nelson-Sathi et al. 2011, List et al. 2014a)

Achievements

New Perspectives
- external language history receives more attention than before
- “Indo-Euro-Centrism” is replaced by a more cross-linguistic paradigm
- new questions regarding general language history
- new proposals to model language history

New Approaches
- empirical data becomes the center of interest
- probabilistic approaches replace “historical” approaches
- databases replace philological knowledge representation
- “informal” methods are formalized and automated

Problems

Transparency
- Evaluation criteria for applied automatic methods are not very intuitive and vary greatly.
- Benchmark databases are rarely used. Especially in phylogenetic approaches, eyeballing of phylogenetic trees is sold as proof for “valid approaches”.
- It is difficult to communicate the results to traditional linguists.

→ Many linguists regard automatic approaches as
- not trustworthy and error-prone, or
- “impossible per se”, or
- as useful as “rolling dice”.

Applicability

Method                         Multilingual?  No additional requirements?  Freely available?
Mackay & Kondrak 2005          ✗              ✓                            ✗
Bergsma & Kondrak 2007         ✓              ✓                            ✗
Turchin et al. 2010            ✓              ✓                            ✓
Berg-Kirkpatrick & Klein 2011  ✗              ✓                            ✗
Hauer & Kondrak 2011           ✓              ✓                            ✗
Steiner et al. 2011            ✓              ✓                            ✗
List 2012 & 2014               ✓              ✓                            ✓
Beinborn et al. 2013           ✗              ?                            ✗
Bouchard-Côté et al. 2013      ✓              ✗                            ✗
Rama 2013                      ✗              ✓                            ✗
Ciobanu & Dinu 2014            ✗              ✓                            ✗
…                              …              …                            …

Accuracy

Data Problems (Geisler & List forthcoming)

Comparing two independently produced lexicostatistical datasets:

Database          # Languages  # Concepts
Dyen et al. 1997  95           200
Tower of Babel    98           110
Intersection      46           103

Results
- up to 10 % difference in concept translations
- many undetected borrowings in both datasets
- up to 30 % differences in tree topologies for Bayesian analyses

[Figure: DKB and TOB compared across the branches Greek, Slavic, Celtic, Indo-Iranian, Romance, Germanic, Baltic, Armenian, and Albanian]

[Figure: DKB and TOB compared for South-Slavic, East-Slavic, and West-Slavic]

Summary
- Many quantitative methods which are based on manually compiled datasets cannot cope with errors resulting from inconsistent data compilation. They are only as objective as the data being fed to them!
- Many quantitative approaches are insufficiently tested, and scholars are often content with results traditional linguists would never accept.
- Additionally, quantitative approaches are often presented in a way that makes it hard (not only for traditional linguists) to understand what they are based upon. Results are reported in a non-transparent way, supplementary data is often lacking, concrete examples are seldom provided, and source code (essential to check and replicate analyses) is missing in almost all recent publications.

Traditional comparison (the human expert)
PRO:
- intuition
- background knowledge
- can juggle multiple types of evidence
CONTRA:
- has to sleep and rest
- does not like to count and do boring work
- can overlook facts when doing boring work

Computational comparison (the computer)
CONTRA:
- no intuition
- no background knowledge
- can't juggle multiple types of evidence
PRO:
- doesn't need to sleep
- is very good at counting and boring work
- doesn't make errors in boring work

→ COMPUTER-ASSISTED LANGUAGE COMPARISON

Computer-Assisted Language Comparison

Datasets

Benchmark Databases
- http://alignments.lingpy.org
- http://sequencecomparison.github.io

Benchmark Databases: More?

Originally, I planned to publish a benchmark database for linguistic reconstruction in addition to the two benchmark databases mentioned before. Owing to various problems, this undertaking has been delayed ever since I started to collect the first datasets. In the future, the initial ideas for the benchmark, along with the datasets created so far, will be included as part of a larger collaborative effort to launch a database for cross-linguistic historical phonology (PhonoBank, MPI Jena).

Database of Cross-Linguistic Colexifications (CLICS)

Key    Concept       Russian    German
1.1    world         mir, svet  Welt
1.21   earth, land   zemlja     Erde, Land
1.212  ground, soil  počva      Erde, Boden
1.420  tree          derevo     Baum
1.430  wood          derevo     Wald
...    ...           ...        ...
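The table above contains two colexifications: Russian derevo covers both “tree” and “wood”, and German Erde covers both “earth, land” and “ground, soil”. The following is a minimal sketch of how such links could be extracted from a wordlist. The data structure and the function name `colexifications` are assumptions made for illustration; this is not the CLICS implementation.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist copied from the table above: concept -> language -> forms
wordlist = {
    "world": {"Russian": ["mir", "svet"], "German": ["Welt"]},
    "earth, land": {"Russian": ["zemlja"], "German": ["Erde", "Land"]},
    "ground, soil": {"Russian": ["počva"], "German": ["Erde", "Boden"]},
    "tree": {"Russian": ["derevo"], "German": ["Baum"]},
    "wood": {"Russian": ["derevo"], "German": ["Wald"]},
}

def colexifications(wordlist):
    """Return concept pairs that share a form within one language."""
    # Collect, for every (language, form), the concepts it expresses
    concepts_by_form = defaultdict(set)
    for concept, languages in wordlist.items():
        for language, forms in languages.items():
            for form in forms:
                concepts_by_form[(language, form)].add(concept)
    # Every pair of concepts expressed by the same form is one link
    links = defaultdict(list)
    for (language, form), concepts in concepts_by_form.items():
        for a, b in combinations(sorted(concepts), 2):
            links[(a, b)].append((language, form))
    return dict(links)

links = colexifications(wordlist)
for pair, evidence in links.items():
    print(pair, evidence)
```

Counting how many languages and families support each link would give weights for a concept network; that step is omitted here.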

CLICS: Crosslinguistic Colexifications
- 221 languages
- 64 language families
- 1280 concepts
- 301,498 words
- 45,667 polysemies (colexifications)
- 16,239 different links between concepts
- http://clics.lingpy.org

[Figure: CLICS network visualization with numbered nodes]

Example from the CLICS web interface: the concept “money” is part of a cluster with the central concept “fishscale”, with a total of 10 nodes: silver, leather, fishscale, bark, coin, fur, snail, skin/hide, money, shell.

49 links for “silver” and “money” (first 10 shown):

No.  Language               Family         Form
1    Ignaciano              Arawakan       ne
2    Aymara, Central        Aymaran        ḳulʸḳi
3    Tsafiki                Barbacoan      kaˈla
4    Seselwa Creole French  Creole         larzan
5    Miao, White            Hmong-Mien     nyiaj
6    Breton                 Indo-European  arhant
7    French                 Indo-European  argent
8    Gaelic, Irish          Indo-European  airgead
9    Welsh                  Indo-European  arian
10   Cofán                  Isolate        koriΦĩʔdi

Example from the CLICS web interface: the concept “wheel” is part of a cluster with the central concept “leg”, with a total of 11 nodes: sphere/ball, round, footprint, foot, calf of leg, circle, thigh, wheel, leg, hip, buttocks.

6 links for “foot” and “wheel”:

No.  Language   Family        Form
1    Cofán      Isolate       c̷ɨʔtʰe
2    Puinave    Isolate       sim
3    Yaminahua  Panoan        taɨ
4    Wayampi    Tupi          pɨ
5    Pumé       Unclassified  taɔ
6    Ninam      Yanomam       mãhuk

Standards

Standardizing Concept Labeling: Background

Concept labels for “GREASE” in 22 different concept lists, all linked to the concept set GREASE (CONCEPTICON-ID: 323); see List et al. 2015, online at http://concepticon.clld.org.

Concept List                  # Items  Concept Label
Allen (2007)                  500      animal oil; 动物油(脂肪)
Gregersen (1976)              217      fat-grease*fat-grease
Heggarty (2005)               150      fat (grease); grasa
Swadesh (1955)                100      fat (grease)
Alpher and Nash (1999)        151      fat, grease
Hale (1961)                   100      fat, grease
O'Grady and Klokeid (1969)    100      fat, grease
Blust (2008)                  210      fat/grease
Matisoff (1978)               200      fat/grease
Samarin (1969)                218      fat/grease
Dunn et al. (2012)            207      fat
Swadesh (1950)                215      fat
Zgraggen (1980)               380      fat
Jachontov (1991)              100      fat n.
Wiktionary (2003)             207      fat (noun)
Starostin (1991)              110      fat n.; жир
Teil-Dautrey et al. (2008)    430      fat, oil
Swadesh (1952)                200      fat (organic substance)
Shiro (1973)                  200      grease (fat)
Samarin (1969)                100      grease; graisse; Fett; grasa
Wang (2006)                   200      pig oil; 猪油
Haspelmath and Tadmor (2009)  1460     the grease or fat

Standardizing Concept Labeling: The Concepticon

The Concepticon is an attempt to link the many different concept lists (“Swadesh Lists”) which are used in the linguistic literature. In practice, all entries from the various concept lists are linked to a concept set as an intermediate way to reference the concepts. The Concepticon links 9611 concepts from 51 concept lists to 2206 concept sets and defines 243 relations between the concept sets.

List, Cysouw & Forkel (2015): Concepticon. Version 0.1, http://concepticon.clld.org.
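The linking idea can be sketched as a lookup from (concept list, label) pairs to a shared concept-set identifier. The dictionaries and function names below are hypothetical and only illustrate the principle; they are not the Concepticon data model or API. The identifier 323 for GREASE is taken from the slides.

```python
# Concept sets act as the intermediate reference point (ID -> gloss)
CONCEPT_SETS = {323: "GREASE"}

# A few links copied from the GREASE example: (list, label) -> concept set ID
LINKS = {
    ("Swadesh (1955)", "fat (grease)"): 323,
    ("Blust (2008)", "fat/grease"): 323,
    ("Dunn et al. (2012)", "fat"): 323,
    ("Wang (2006)", "pig oil; 猪油"): 323,
}

def concept_set(concept_list, label):
    """Return (ID, gloss) for a label in a list, or None if it is unlinked."""
    set_id = LINKS.get((concept_list, label))
    if set_id is None:
        return None
    return set_id, CONCEPT_SETS[set_id]

def same_concept(entry_a, entry_b):
    """Two list entries are comparable if they link to the same concept set."""
    set_a, set_b = concept_set(*entry_a), concept_set(*entry_b)
    return set_a is not None and set_a == set_b

print(concept_set("Wang (2006)", "pig oil; 猪油"))  # (323, 'GREASE')
print(same_concept(("Swadesh (1955)", "fat (grease)"),
                   ("Dunn et al. (2012)", "fat")))  # True
```

Because every label points to a concept set rather than to other labels, adding a new concept list requires one link per entry instead of pairwise mappings between lists.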

Standardizing Lexical Representation

Lexical entries for “GREASE” (“pork fat”) in 10 Chinese dialect varieties (data taken from Wang and Hamed 2006):

Dialect    Entry          IPA            Segments              Morphemes
Beijing    大油           ta⁵¹ iou³⁵     t a ⁵¹ i o u ³⁵       t a ⁵¹ + i o u ³⁵
Changsha   油             tɕy³³ iəu¹³    tɕ y ³³ i ə u ¹³      tɕ y ³³ + i ə u ¹³
Chengdu    猪油           tsu⁴⁴iəu³¹     ts u ⁴⁴ i ə u ³¹      ts u ⁴⁴ + i ə u ³¹
Fuzhou     猪油           ty⁴⁴iu⁵²       t y ⁴⁴ i u ⁵²         t y ⁴⁴ + i u ⁵²
Guangzhou  猪膏           tʃy⁵⁵kou⁵³     tʃ y ⁵⁵ k ou ⁵³       tʃ y ⁵⁵ + k ou ⁵³
Meixian    油             jiu¹²          j i u ¹²              j i u ¹ ²
Nanchang   油             iu⁵⁵           i u ⁵⁵                i u ⁵⁵
Taibei     ti44 iu13豬油  ti⁴⁴ iu¹³      t i ⁴⁴ i u ¹³         t i ⁴⁴ + i u ¹³
Wenzhou    猪油           tsei⁴⁴ ɦiau³¹  ts e i ⁴⁴ ɦ i a u ³¹  ts e i ⁴⁴ + ɦ i a u ³¹
Xiamen     油             iu²⁴           i u ²⁴                i u ²⁴
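The step from the IPA column to the Segments and Morphemes columns can be sketched as follows. This is a simplified illustration and not the procedure used to produce the table: the set of multi-character segments (`MULTI`) is a hypothetical mini-inventory, and morpheme boundaries are simply assumed to follow every non-final tone.

```python
TONES = set("⁰¹²³⁴⁵⁶⁷⁸⁹")       # superscript digits used as tone letters
MULTI = ("tɕ", "tʃ", "ts")       # hypothetical list of multi-character segments

def segment(ipa):
    """Split an IPA string into segments, keeping each tone as one token."""
    tokens, i = [], 0
    while i < len(ipa):
        if ipa[i].isspace():
            i += 1
        elif ipa[i] in TONES:
            # Consume the whole run of superscript digits as one tone token
            j = i
            while j < len(ipa) and ipa[j] in TONES:
                j += 1
            tokens.append(ipa[i:j])
            i = j
        else:
            # Prefer a known multi-character segment, else take one character
            match = next((m for m in MULTI if ipa.startswith(m, i)), ipa[i])
            tokens.append(match)
            i += len(match)
    return tokens

def morphemes(tokens):
    """Insert '+' after every tone that is not the last token."""
    out = []
    for k, token in enumerate(tokens):
        out.append(token)
        if token[0] in TONES and k < len(tokens) - 1:
            out.append("+")
    return out

beijing = segment("ta⁵¹ iou³⁵")
print(" ".join(beijing))             # t a ⁵¹ i o u ³⁵
print(" ".join(morphemes(beijing)))  # t a ⁵¹ + i o u ³⁵
```

A fixed inventory like `MULTI` will not cover every transcription choice in the table (for example the Guangzhou segment ou), which is one reason why the segmented forms are stored explicitly rather than derived on the fly.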

Standardizing the Representation of Judgments

Cognate judgments for “grease/fat” across 10 Austronesian languages (data taken from Greenhill et al. 2008, online at http://language.psy.auckland.ac.nz/austronesian/):

Language                Lexical Entry  Cognacy  Alignment
Central Amis            simar          2        s i m a r
Thao                    lhimash        2        lh i m a sh
Hanunóo                 tabáʔ          23       t a b á ʔ
Nias                    tawõ           23       t a w õ -
Mailu                   mona           1        m o n a -
Maloh                   -iñak          1        - i ñ a k
Tetum                   mina           1        m i n a -
Banggi                  laːna          24       l aː n a -
Berawan (Long Terawan)  ləməʔ          24       l ə m ə ʔ
Iban                    lemak          24       l e m a k
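One way to make such judgments machine-readable is to store each row as a record and group the records by cognate identifier. The sketch below uses the rows of the table above; the tuple layout and the function names are assumptions made for illustration. It also checks a basic consistency condition, namely that all aligned forms within one cognate set have the same number of columns.

```python
from collections import defaultdict

# (language, lexical entry, cognate ID, space-separated alignment)
JUDGMENTS = [
    ("Central Amis", "simar", 2, "s i m a r"),
    ("Thao", "lhimash", 2, "lh i m a sh"),
    ("Hanunóo", "tabáʔ", 23, "t a b á ʔ"),
    ("Nias", "tawõ", 23, "t a w õ -"),
    ("Mailu", "mona", 1, "m o n a -"),
    ("Maloh", "-iñak", 1, "- i ñ a k"),
    ("Tetum", "mina", 1, "m i n a -"),
    ("Banggi", "laːna", 24, "l aː n a -"),
    ("Berawan (Long Terawan)", "ləməʔ", 24, "l ə m ə ʔ"),
    ("Iban", "lemak", 24, "l e m a k"),
]

def cognate_sets(judgments):
    """Group aligned forms by cognate ID."""
    sets = defaultdict(list)
    for language, _entry, cogid, alignment in judgments:
        sets[cogid].append((language, alignment.split()))
    return dict(sets)

def misaligned(sets):
    """Return the cognate IDs whose rows differ in alignment length."""
    return [cogid for cogid, rows in sets.items()
            if len({len(columns) for _language, columns in rows}) > 1]

sets = cognate_sets(JUDGMENTS)
bad = misaligned(sets)
print(sorted(sets))  # [1, 2, 23, 24]
print(bad)           # []
```

Checks of this kind are cheap to automate and catch formatting errors before the data is passed on to later analysis steps.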

Computer-Assisted Language Comparison Standards Jena Wordlist Standard 40 / 50

Computer-Assisted Language Comparison Standards Jena Wordlist Standard
The Jena Wordlist Standard is being developed by the NESCent-style working group “GlottoBank: Towards a Global Language Phylogeny” under the direction of Russell Gray.
DEFINE STANDARDS FOR
- Wordlists
- Cognate Sets
- Alignments
PROVIDE TOOLS FOR
- Data Validation
- Data Exchange
- Data Enrichment
INTEGRATE EXISTING STANDARDS
- Glottolog http://glottolog.clld.org
- Phoible http://phoible.clld.org
- CONCEPTICON http://concepticon.clld.org
PROVIDE TOOLS FOR EDITING AND ANALYSIS
- LingPy http://lingpy.org
- EDICTOR http://tsv.lingpy.org
USE THE STANDARD TO BUILD NEW DATABASES
- LexiBank: Cross-Linguistic Database of Lexical Cognate Sets
- PhonoBank: Cross-Linguistic Database of Regular Sound Change Patterns
40 / 50
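What a data validation tool for such a standard might check can be sketched as follows. The column names (LANGUAGE, ENTRY, COGID, ALIGNMENT) and the two checks are assumptions made for illustration; the slides do not specify the actual format or rules of the Jena Wordlist Standard.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, not taken from the standard itself.
REQUIRED_COLUMNS = ("LANGUAGE", "ENTRY", "COGID", "ALIGNMENT")


def validate_wordlist(tsv_text):
    """Return a list of human-readable problems found in a tab-separated wordlist."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    lengths = defaultdict(set)
    for line_number, row in enumerate(reader, start=2):
        cogid = row["COGID"] or ""
        if not cogid.isdigit():
            problems.append(f"line {line_number}: cognate ID is not a number")
            continue
        # Aligned words in one cognate set must have the same number of columns.
        lengths[cogid].add(len((row["ALIGNMENT"] or "").split()))
    for cogid, seen in sorted(lengths.items()):
        if len(seen) > 1:
            problems.append(f"cognate set {cogid}: alignments differ in length")
    return problems


SAMPLE = "\n".join([
    "LANGUAGE\tENTRY\tCOGID\tALIGNMENT",
    "Mailu\tmona\t1\tm o n a -",
    "Tetum\tmina\t1\tm i n a",  # deliberately one column short
    "Iban\tlemak\t24\tl e m a k",
])
print(validate_wordlist(SAMPLE))
```

Checks of this kind are cheap to run on every edit, which is one reason to fix the representation of cognate sets and alignments first.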

Computer-Assisted Language Comparison Software Tools Software Tools 41 / 50

Computer-Assisted Language Comparison Software Tools Background: Computer-Assisted Workflows 42 / 50

Computer-Assisted Language Comparison Software Tools Background: Computer-Assisted Workflows
[Figure: workflow diagram.
Analysis steps: Semantic Tagging, Segmentation, Cognate Detection, Alignment Analysis, Linguistic Reconstruction, Phylogenetic Reconstruction.
Data stages: Raw Data, Wordlist Data, Tokens and Morphemes, Cognate Sets, Sound Correspondences, Proto-Forms, Phylogenies.
Example items at each stage: HAND [hænd], FOOT [fʊt], EARTH [ɜːrθ], TREE [triː], BARK [bɑːrk].
Labels: "provides automatic analyses", "revises automatic analyses".]
A possible computer-assisted, iterative workflow with automatic and manual components. 42 / 50
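The segmentation step of this workflow turns a transcribed word into a sequence of sound segments. The following is a minimal sketch, assuming that combining diacritics and a small, hand-picked set of modifier letters always attach to the preceding symbol; it is not the segmentation routine of LingPy or any other tool, and it does not handle orthographic digraphs such as "lh" or "sh".

```python
import unicodedata

# Modifier letters treated as part of the preceding segment (an illustrative choice).
MODIFIERS = {"ː", "ʰ", "ʷ", "ʲ"}


def segment(word):
    """Split a transcribed word into segments, keeping diacritics with their base."""
    segments = []
    # Decompose first so that precomposed letters expose their combining marks.
    for char in unicodedata.normalize("NFD", word):
        if segments and (unicodedata.combining(char) or char in MODIFIERS):
            segments[-1] += char
        else:
            segments.append(char)
    # Recompose each segment for a compact representation.
    return [unicodedata.normalize("NFC", s) for s in segments]


print(segment("laːna"))
print(segment("tawõ"))
```

Output of this kind is what a human would then revise, for example by merging digraphs or splitting affricates.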

Computer-Assisted Language Comparison Software Tools LingPy and EDICTOR 43 / 50

Computer-Assisted Language Comparison Software Tools LingPy and EDICTOR
LingPy and EDICTOR: Two tools for computer-assisted language comparison.
LingPy (http://lingpy.org): Software Library for Automatic Tasks in Historical Linguistics
- phonetic segmentation
- phonetic alignment
- cognate detection
- ancestral state reconstruction
- borrowing detection
- phylogenetic reconstruction
EDICTOR (http://tsv.lingpy.org): Online Tool for Computer-Assisted Language Comparison
- server- and client-based
- data validation
- phonetic segmentation
- cognate set editor
- alignment editor
- correspondence evaluation
43 / 50
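To give an idea of what phonetic alignment involves, here is a minimal sketch of pairwise alignment with the generic Needleman-Wunsch algorithm. It scores segments only as identical or different, with assumed scores of 1, -1, and -1 for match, mismatch, and gap. It is not LingPy's API, and the alignment methods in LingPy are more elaborate than this.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists; gaps are written as '-'."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two segments
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


print(align("t a n θ".split(), "t a n".split()))
```

With identity scoring, similar but non-identical sounds count as plain mismatches, which is the main limitation a linguistically informed scoring scheme would address.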

Computer-Assisted Language Comparison Software Tools Demo
- Testfile: rom.xls, from the Global Lexicostatistical Database (GLD, Starostin 2014), downloadable from http://starling.rinet.ru/new100/rom.xls.
- Spreadsheet: Tool for data conversion from GLD format (Excel spreadsheet) to LingPy (tsv), available at http://dighl.github.com/spreadsheet.
- LingPy: Use LingPy to tokenize the data (currently not implemented in Spreadsheet), compute a phylogenetic tree (Neighbor-Joining or UPGMA), test automatic cognate detection, align the data, and convert the data to Nexus format.
- EDICTOR: Use EDICTOR to inspect the data, carry out manual alignment analyses, and check and edit the cognate judgments.
Additional scripts accompanying the demo are available online at https://gist.github.com/LinguList/17548931a1aa8862c408 44 / 50
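The last LingPy step, conversion to Nexus format, amounts to coding each cognate set as a binary character. The following is a minimal sketch of that idea, not LingPy's own export routine; the function name and the input layout (language name mapped to the set of cognate IDs it attests) are assumptions, and the sample uses a few rows of the "grease/fat" table rather than the demo file.

```python
def to_nexus(cognates):
    """Write presence/absence of cognate sets per language as a Nexus data block."""
    characters = sorted({cogid for ids in cognates.values() for cogid in ids})
    taxa = sorted(cognates)
    width = max(len(taxon) for taxon in taxa)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(taxa)} NCHAR={len(characters)};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "MATRIX",
    ]
    for taxon in taxa:
        # Taxon names must not contain spaces, so replace them with underscores.
        name = taxon.replace(" ", "_").ljust(width)
        states = "".join("1" if c in cognates[taxon] else "0" for c in characters)
        lines.append(f"{name} {states}")
    lines += [";", "END;"]
    return "\n".join(lines)


sample = {
    "Central Amis": {2},
    "Thao": {2},
    "Mailu": {1},
    "Tetum": {1},
    "Iban": {24},
}
nexus = to_nexus(sample)
print(nexus)
```

With more than one concept, a language that lacks a word for a concept would need to be coded as missing ("?") for that concept's cognate sets rather than as "0"; the sketch leaves this out.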

Challenges Challenges 45 / 50

Challenges Modeling Language Change
[Figure: development of words for SUN from Indo-European to modern languages, with changes labeled SEMANTIC SHIFT and MORPHOLOGICAL CHANGE.]
- Indo-European: 'soh₂-wl̩-, sh₂uˈen- SUN
- Germanic: soːwel-, sunːoː- SUN
- German: zɔnə SUN
- Swedish: suːl SUN
- Romance: soːl- SUN, soːlikul- SMALL SUN
- French: solej SUN
- Spanish: sol SUN
So far, our linguistic databases mostly model the relations between different linguistic entities (cognacy, borrowing, etc.). To fully reflect what is “philologically” encoded in etymological dictionaries, however, we need to start thinking of how to model processes. 46 / 50
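One way to start modeling processes rather than relations is to store each change as a directed, labeled step between two forms. The sketch below is only an illustration of that idea: the class and function names are hypothetical, and the assignment of the labels to the two Romance-to-French steps is one reading of the figure, not something the slide states explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    language: str
    form: str
    meaning: str


@dataclass(frozen=True)
class Process:
    source: Form
    target: Form
    kind: str  # e.g. "morphological change" or "semantic shift"


romance_sun = Form("Romance", "soːl-", "SUN")
romance_small_sun = Form("Romance", "soːlikul-", "SMALL SUN")
french_sun = Form("French", "solej", "SUN")

# Assumed labeling of the steps (a reading of the figure).
history = [
    Process(romance_sun, romance_small_sun, "morphological change"),
    Process(romance_small_sun, french_sun, "semantic shift"),
]


def trace(target, processes):
    """Follow processes backwards from a form; return the steps oldest first."""
    by_target = {p.target: p for p in processes}
    chain = []
    while target in by_target:
        step = by_target[target]
        chain.append(step)
        target = step.source
    return chain[::-1]


for step in trace(french_sun, history):
    print(step.source.form, "->", step.target.form, ":", step.kind)
```

A plain cognate ID would only say that the Romance and French forms are related; the list of steps also records how one became the other.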

Challenges Limitations of Knowledge
[Figure: possible sources of Latin lupus "wolf", with the connections marked by question marks.]
- Indo-European *ulkʷo- "wolf"
- Italic *lukʷo- "wolf"
- Sabellic *lupo- "wolf"
- Latin lupus "wolf"
- Indo-European *ulp- "marten"
- Italic *ulp- "fox"
- Latin volpes "fox"
Although our computational algorithms are getting better and better at modeling uncertainties, our databases still give the impression that they represent fully proven facts. We need to find a way to include uncertainties when modeling and representing our data. 47 / 50

Challenges Reconciliation of Evidence
[Figure: Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng, with events labeled LOSS, INNOVATION, INNOVATION, and BORROWING.]
Despite the massive body of data which has been accumulated during the last two centuries of research, we are still far away from being able to sufficiently reconcile all the different types of evidence which are important for our discipline. 48 / 50

[Figure: illustration labeled P(A|B) = P(B|A)P(A)/P(B) and “Franz Bopp: Very, Very Long Title”.]
It’s a very long way up to the top... but together we can make it! 49 / 50

Concluding Remarks
The possibilities for research in historical linguistics are nowadays greater than ever before. But so are our challenges. In order to be up to the job, we cannot do without computers, but likewise, we cannot do without the intuition and experience of trained historical linguists. What we further need are combined efforts of standardization and knowledge exchange. We need to bridge disciplines and break down the frontiers between different schools. 50 / 50

Thanks for Your Attention! 50 / 50
