Computer-Assisted Language Comparison

Talk held at the CLT Seminar (Centre for Language Technology, University of Gothenburg)

Johann-Mattis List

October 09, 2014

Transcript

  1. Computer-Assisted Language Comparison Bridging the Gap between Traditional and Quantitative

    Approaches in Historical Linguistics Johann-Mattis List Forschungszentrum Deutscher Sprachatlas Philipps-University Marburg 2014-10-09 1 / 50
  2. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - * Proto-Germanic t a n d English t ʊː θ - ** Proto-Indo-European d o n t Italian d ɛ n t e * Proto-Romance d e n t French d ɑ̃ - - 4 / 50
  4. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - - * Proto-Germanic t a n d English t ʊː - θ - ** Proto-Indo-European d o n t Italian d ɛ n t e * Proto-Romance d e n t French d ɑ̃ - - - 4 / 50
  5. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - - Proto-Germanic t a n θ - English t ʊː - θ - ** Proto-Indo-European d o n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ - - - 4 / 50
  6. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - Proto-Germanic t a n θ - English t ʊː - θ ** Proto-Indo-European d o n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ - - 4 / 50
  7. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - Proto-Germanic t a n θ - English t ʊː - θ Proto-Indo-European d e n t - Italian d ɛ n t ə Proto-Romance d e n t e French d ɑ̃ - - 4 / 50
  8. Traditional Historical Linguistics Characteristics Research Object German ʦ aː n

    - * Proto-Germanic t a n d English t ʊː - θ Proto-Indo-European d e n t Italian d ɛ n t ə * Proto-Romance d e n t French d ɑ̃ - - 4 / 50
  9. Traditional Historical Linguistics Characteristics Research Object

    German               ʦ aː n
    Proto-Germanic       t a n θ
    English              t ʊː θ
    Proto-Indo-European  d e n t
    Italian              d ɛ n t e
    Proto-Romance        d e n t e
    French               d ɑ̃
    4 / 50
  10. Traditional Historical Linguistics Characteristics Research Object History individual events (description)

    individual processes (description) general processes (modeling, analysis) Language History individual language states (description of sound system, grammar, lexicon) individual instances of language development (description of sound change patterns, grammaticalization, lexical change) general language development (modeling and analysis of sound change processes, grammaticalization, lexical change) 5 / 50
  11. Traditional Historical Linguistics Characteristics Research Object Internal Language History (ontogenesis)

    etymology historical grammar historical phonology External Language History (phylogenesis) linguistic reconstruction proof of language relationship genetic classification General Tendencies in Language History processes and mechanisms of sound change grammaticalization lexical change 6 / 50
  12. Traditional Historical Linguistics Characteristics Origins Uniformitarianism “universality of change” –

    change is independent of time and space “graduality of change” – change is neither abrupt nor chaotic “uniformity of change” – change is not heterogeneous, but uniform Founding Fathers Franz Bopp (1791–1867): language comparison (Bopp 1816) Rasmus Rask (1787-1832) and Jacob Grimm (1785-1863): sound law (Rask 1818, Grimm 1822) August Schleicher (1821–1868): family tree and linguistic reconstruction (Schleicher 1853 & 1861) 7 / 50
  13. Traditional Historical Linguistics Achievements Methods and Theories Comparative Method (Meillet

    1925) Basic procedure for proving language relationship and reconstructing unattested ancestral language states, etymologies, and genetic classifications. Family Tree Model and Wave Theory (Schleicher 1853, Schmidt 1872) Two partially incompatible models to describe historical language relations. Regularity Hypothesis (Osthoff & Brugmann 1878) Fundamental working hypothesis that states that certain sound change processes proceed regularly (universally, gradually, and in a uniform manner). 9 / 50
  14. Traditional Historical Linguistics Achievements Comparative Method proof of relationship identification

    of cognates identification of sound correspondences reconstruction of proto-forms internal classification 10 / 50
  16. Traditional Historical Linguistics Achievements Comparative Method Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 3 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn d ɔː n 10 / 50
  18. Traditional Historical Linguistics Achievements Comparative Method Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 2 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn d ɔː n 10 / 50
  19. Traditional Historical Linguistics Achievements Comparative Method Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 2 x d d 1 x n n 1 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn θ ɔː n 10 / 50
  20. Traditional Historical Linguistics Achievements Comparative Method Cognate List Alignment Correspondence

    List German dünn d ʏ n GER ENG Frequ. d θ 3 x d d 1 x ? n n 2 x m m 1 x ŋ ŋ 1 x English thin θ ɪ n German Ding d ɪ ŋ English thing θ ɪ ŋ German dumm d ʊ m English dumb d ʌ m German Dorn d ɔɐ n English thorn θ ɔː n 10 / 50
  21. Traditional Historical Linguistics Achievements Comparative Method

    Cognate List and Alignment
    German dünn    d ʏ n     English thin    θ ɪ n
    German Ding    d ɪ ŋ     English thing   θ ɪ ŋ
    German dumm    d ʊ m     English dumb    d ʌ m
    German Dorn    d ɔɐ n    English thorn   θ ɔː n

    Correspondence List
    GER  ENG  Frequ.
    d    θ    3 x
    d    d    1 x
    n    n    2 x
    m    m    1 x
    ŋ    ŋ    1 x
    10 / 50
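The step from aligned cognates to a correspondence list can be sketched in a few lines of Python. This is only an illustration of the counting idea, not a tool used in the talk: the function name `count_correspondences`, the helper `is_vowel`, and the decision to skip vowel columns are assumptions made here so that the output matches the consonant-only table on the slide.

```python
from collections import Counter

# Aligned cognate pairs (German, English) taken from the slide,
# one segment per alignment column.
ALIGNED_PAIRS = [
    (["d", "ʏ", "n"], ["θ", "ɪ", "n"]),    # dünn / thin
    (["d", "ɪ", "ŋ"], ["θ", "ɪ", "ŋ"]),    # Ding / thing
    (["d", "ʊ", "m"], ["d", "ʌ", "m"]),    # dumm / dumb
    (["d", "ɔɐ", "n"], ["θ", "ɔː", "n"]),  # Dorn / thorn
]

# Hypothetical helper data: vowel symbols occurring in this toy example.
VOWELS = set("ʏɪʊʌɔɐ")


def is_vowel(segment):
    """Treat a segment as a vowel if its first symbol is a vowel symbol."""
    return segment[0] in VOWELS


def count_correspondences(pairs, skip=is_vowel):
    """Count how often each pair of segments shares an alignment column."""
    counts = Counter()
    for left, right in pairs:
        if len(left) != len(right):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(left, right):
            if skip(a) or skip(b):
                continue  # the slide only tabulates consonants
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    for (ger, eng), freq in count_correspondences(ALIGNED_PAIRS).most_common():
        print(ger, eng, f"{freq} x")
```

The recurring pair d : θ stands out against the single d : d case, which is the kind of evidence the comparative method relies on when it separates regular correspondences from isolated ones.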
  22. Traditional Historical Linguistics Achievements Insights Internal Language History Thanks to

    historical linguistics, the history of a considerable (but still small) number of languages has been thoroughly investigated. External Language History Thanks to historical linguistics, a considerable number of the world's languages have been genetically classified (although many questions remain unsolved or controversial). General Language History Some work on general processes of language history has been done, yet many questions remain unsolved or are controversially debated. 11 / 50
  23. Traditional Historical Linguistics Problems Transparency Part of the process of

    “becoming” a competent Indo-Europeanist has always been recognized as coming to grasp “intuitively” concepts and types of changes in language so as to be able to pick and choose between alternative explanations for the history and development of specific features of the reconstructed language and its offspring. Schwink (1994) 13 / 50
  24. Traditional Historical Linguistics Problems Applicability – 7,106 languages (Lewis &

    Fennig 2013) – 147 language families (ibid.) – 25,244,065 language pairs (7,106 × 7,105 / 2) which could be compared 14 / 50
  25. Traditional Historical Linguistics Problems Applicability The amount of digitally available

    data for the languages of the world is growing from day to day, while there are only a few historical linguists who are trained to carry out the comparison of these languages. It seems impossible to handle this task when relying only on the traditional, time-consuming manual procedures developed in traditional historical linguistics. 14 / 50
  26. Traditional Historical Linguistics Problems Adequacy One time is never, two

    times is ever! (a mathematician friend on the treatment of probability in Indo-European linguistics) 15 / 50
  27. Traditional Historical Linguistics Problems Summary Despite its achievements, traditional historical

    linguistics has some clear shortcomings, such as a lack of transparency in methodology, the “philological” form of knowledge representation, and the questionable validity of certain results. 16 / 50
  28. Traditional Historical Linguistics Problems Example on “Philological Knowledge Representation” Frucht.

    Sf std. (9. Jh.), mhd. vruht, ahd. fruht, as. fruht. Entlehnt aus l. frūctus m. gleicher Bedeutung (zu l. fruī “genieße”). Das deutsche Wort ist Femininum geworden im Anschluß an die ti-Abstrakta wie Flucht² usw. Adjektive: fruchtig, fruchtbar; Verb: (be-)fruchten. Ebenso nndl. vrucht, ne. fruit, nfrz. fruit, nschw. frukt, nnorw. frukt; frugal. (Kluge and Seebold 2002) 17 / 50
  29. Quantitative Historical Linguistics Characteristics Characteristics “Indo-European and computational cladistics” (Ringe,

    Warnow and Taylor 2002) “Language-tree divergence times support the Anatolian theory of Indo-European origin” (Gray and Atkinson 2003) “Language classification by numbers” (McMahon and McMahon 2005) “Curious Parallels and Curious Connections: Phylogenetic Thinking in Biology and Historical Linguistics” (Atkinson and Gray 2005) “Automated classification of the world’s languages” (Brown et al. 2008) “Computational Feature-Sensitive Reconstruction of Language Relationships: Developing the ALINE Distance for Comparative Historical Linguistic Reconstruction” (Downey et al. 2008) “Networks uncover hidden lexical borrowing in Indo-European language evolution” (Nelson-Sathi et al. 2011) “A pipeline for computational historical linguistics” (Steiner, Stadler, and Cysouw 2011) 20 / 50
  30. Quantitative Historical Linguistics Characteristics Points of Interest and Goals phylogenetic

    reconstruction sequence comparison general questions of language development 21 / 50
  31. Quantitative Historical Linguistics Characteristics Points of Interest and Goals phylogenetic

    reconstruction sequence comparison general questions of language development Goals If we cannot guarantee getting the same results from the same data considered by different linguists, we jeopardize the essential scientific criterion of repeatability. (McMahon & McMahon 2005) 21 / 50
  32. Quantitative Historical Linguistics Characteristics Methods and Theories phylogenetic reconstruction (cf.,

    among others, Gray & Atkinson 2003, Ringe et al. 2002, Brown et al. 2008) phonetic alignment (cf., among others, Kondrak 2000, Prokić et al. 2009, List 2012a) cognate detection (cf. Steiner et al. 2011, List 2012b) borrowing detection (cf. Nelson-Sathi et al. 2011, List et al. 2014a) 22 / 50
  33. Quantitative Historical Linguistics Achievements New Perspectives external language history receives

    more attention than before “Indo-Euro-Centrism” is replaced by a more cross-linguistic paradigm new questions regarding general language history new proposals to model language history 24 / 50
  34. Quantitative Historical Linguistics Achievements New Approaches empirical data becomes the

    center of interest probabilistic approaches replace “historical” approaches databases replace philological knowledge representation “informal” methods are formalized and automatized 25 / 50
  35. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Alignment Analyses Alignment

    analyses display sequence similarities by representing multiple sequences as rows of a matrix in which common segments are placed in the same column. Alignments are a formal way to deal with general tasks of sequence comparison. Although never explicitly labeled or displayed, alignments are virtually present in all analyses in historical linguistics dealing with the comparison of sound sequences (words, morphemes). 26 / 50
  36. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Alignment Analyses Alignment

    analyses display sequence similarities by representing multiple sequences as rows of a matrix in which common segments are placed in the same column. Alignments are a formal way to deal with general tasks of sequence comparison. Although never explicitly labeled or displayed, alignments are virtually present in all analyses in historical linguistics dealing with the comparison of sound sequences (words, morphemes). t ɔ x t ə r d ɔː t ə r 26 / 50
  38. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Alignment Analyses Alignment

    analyses display sequence similarities by representing multiple sequences as rows of a matrix in which common segments are placed in the same column. Alignments are a formal way to deal with general tasks of sequence comparison. Although never explicitly labeled or displayed, alignments are virtually present in all analyses in historical linguistics dealing with the comparison of sound sequences (words, morphemes). t ɔ x t ə r d ɔː - t ə r 26 / 50
  39. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Sound Classes Sounds

    which frequently occur in correspondence relations in genetically related languages can be clustered into classes. It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1964]: 35). 27 / 50
  40. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Sound Classes Sounds

    which frequently occur in correspondence relations in genetically related languages can be clustered into classes. It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1964]: 35). k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z 1 27 / 50
  43. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Sound Classes Sounds

    which frequently occur in correspondence relations in genetically related languages can be clustered into classes. It is thereby assumed that “phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986[1964]: 35). K T P S 1 27 / 50
  44. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Sound-Class-Based Phonetic Alignment

    (SCA, List 2012a) Sound classes and alignment analyses can be combined. Sound sequences are internally represented as sound classes. Alignments are carried out using standard algorithms developed in evolutionary biology. 28 / 50
  45. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment Sound-Class-Based Phonetic Alignment

    (SCA, List 2012a) Sound classes and alignment analyses can be combined. Sound sequences are internally represented as sound classes. Alignments are carried out using standard algorithms developed in evolutionary biology.

    INPUT         tɔxtər                 dɔːtər
    TOKENIZATION  t, ɔ, x, t, ə, r       d, ɔː, t, ə, r
    CONVERSION    t ɔ x … → T O G …      d ɔː t … → T O T …
    ALIGNMENT     T O G T E R            T O - T E R
    CONVERSION    T O G … → t ɔ x …      T O - … → d ɔː - …
    OUTPUT        t ɔ x t ə r            d ɔː - t ə r
    28 / 50
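The pipeline on this slide (tokenize, convert to sound classes, align, convert back) can be imitated with a small self-contained sketch. It is not the SCA implementation shipped with LingPy: the class table `SOUND_CLASSES` only covers the two example words, the tokenizer merely attaches the length mark to the preceding symbol, and the alignment is a plain Needleman-Wunsch with uniform scores (an assumption made here), without the prosodic weighting that SCA uses.

```python
# Toy sound-class table covering only the two example words.
# The class letters follow the slide; the table itself is an assumption.
SOUND_CLASSES = {
    "t": "T", "d": "T", "ɔ": "O", "ɔː": "O",
    "x": "G", "ə": "E", "r": "R",
}


def tokenize(word):
    """Split a word into segments, attaching the length mark to its vowel."""
    segments = []
    for char in word:
        if char == "ː" and segments:
            segments[-1] += char
        else:
            segments.append(char)
    return segments


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two symbol lists; return columns as index pairs."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    columns, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            columns.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((i - 1, None))  # gap in the second sequence
            i -= 1
        else:
            columns.append((None, j - 1))  # gap in the first sequence
            j -= 1
    columns.reverse()
    return columns


def sound_class_align(word_a, word_b):
    """Align two words on their sound classes, then restore the segments."""
    seg_a, seg_b = tokenize(word_a), tokenize(word_b)
    cls_a = [SOUND_CLASSES[s] for s in seg_a]
    cls_b = [SOUND_CLASSES[s] for s in seg_b]
    columns = needleman_wunsch(cls_a, cls_b)
    row_a = [seg_a[i] if i is not None else "-" for i, _ in columns]
    row_b = [seg_b[j] if j is not None else "-" for _, j in columns]
    return row_a, row_b


if __name__ == "__main__":
    for row in sound_class_align("tɔxtər", "dɔːtər"):
        print(" ".join(row))
```

Because t and d share the class T and ɔ and ɔː share the class O, the alignment treats them as matches, and the only unmatched segment (x) ends up opposite a gap, as in the OUTPUT row of the slide.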
  46. Quantitative Historical Linguistics Achievements Examples: Phonetic Alignment SCA reaches an

    accuracy of more than 90% for multiple alignment analyses, using the conservative column scores as evaluation scores. SCA can be applied to almost all languages, including tone languages (clicks are not yet supported), provided the data is given in regular phonetic transcription. SCA models prosodic properties of sound sequences and scores sound segments differently, depending on their position in the sequence, thereby accounting for general theories of prosodic strength. SCA is integrated in LingPy (http://lingpy.org, List & Moran 2013), an open source Python toolkit for quantitative tasks in historical linguistics, and has been successfully tested on all major platforms (Mac, Linux, Windows). 29 / 50
  47. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection LexStat (List

    2012, List 2014) LexStat is a method for automatic cognate detection in multilingual wordlists. It uses sound-class-based sequence alignment (SCA) analyses as a proxy to infer language-specific sound similarities (similar to the notion of sound correspondences in historical linguistics). Using the automatically inferred sound similarities, LexStat partitions words into cognate sets. 30 / 50
  48. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Basic Procedure

    for Multilingual Cognate Detection WORDLIST DATA PAIRWISE DISTANCES BETWEEN WORDS PAIRWISE COMPARISON 31 / 50
  49. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Basic Procedure

    for Multilingual Cognate Detection WORDLIST DATA PAIRWISE DISTANCES BETWEEN WORDS COGNATE SETS COGNATE CLUSTERING PAIRWISE COMPARISON 31 / 50
  50. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Cognate Clustering

    Analysis ID Taxa Word Gloss GlossID IPA ... ... ... ... ... ... 21 German Frau woman 20 frau 22 Dutch vrouw woman 20 vrɑu 23 English woman woman 20 wʊmən 24 Danish kvinde woman 20 kvenə 25 Swedish kvinna woman 20 kviːna 26 Norwegian kvine woman 20 kʋinə ... ... ... ... ... ... 31 / 50
  51. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Cognate Clustering

    Swedish English Danish Norwegian Dutch German kvinna woman kvinde kvine vrouw Frau Swedish kvina 0.00 0.69 0.07 0.12 0.71 0.78 English wumin 0.69 0.00 0.66 0.57 0.68 0.87 Danish kveni 0.07 0.66 0.00 0.08 0.67 0.71 Norwegian kwini 0.12 0.57 0.08 0.00 0.75 0.74 Dutch frou 0.71 0.68 0.67 0.75 0.00 0.17 German frau 0.78 0.87 0.71 0.74 0.17 0.00 Analysis ID Taxa Word Gloss GlossID IPA ... ... ... ... ... ... 21 German Frau woman 20 frau 22 Dutch vrouw woman 20 vrɑu 23 English woman woman 20 wʊmən 24 Danish kvinde woman 20 kvenə 25 Swedish kvinna woman 20 kviːna 26 Norwegian kvine woman 20 kʋinə ... ... ... ... ... ... 31 / 50
  52. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Cognate Clustering

                      Swedish  English  Danish  Norwegian  Dutch  German
                      kvinna   woman    kvinde  kvine      vrouw  Frau
    Swedish    kvina  0.00     0.69     0.07    0.12       0.71   0.78
    English    wumin  0.69     0.00     0.66    0.57       0.68   0.87
    Danish     kveni  0.07     0.66     0.00    0.08       0.67   0.71
    Norwegian  kwini  0.12     0.57     0.08    0.00       0.75   0.74
    Dutch      frou   0.71     0.68     0.67    0.75       0.00   0.17
    German     frau   0.78     0.87     0.71    0.74       0.17   0.00

    German     Frau    frau
    Dutch      vrouw   vrou
    English    woman   wumin
    Danish     kvinde  kveni
    Swedish    kvinna  kvina
    Norwegian  kvine   kwini
    31 / 50
  54. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection Cognate Clustering

    German     Frau    frau
    Dutch      vrouw   vrou
    English    woman   wumin
    Danish     kvinde  kveni
    Swedish    kvinna  kvina
    Norwegian  kvine   kwini

    Analysis
    ID   Taxa       Word    Gloss  GlossID  IPA     CogID
    ...  ...        ...     ...    ...      ...     ...
    21   German     Frau    woman  20       frau    1
    22   Dutch      vrouw   woman  20       vrɑu    1
    23   English    woman   woman  20       wʊmən   2
    24   Danish     kvinde  woman  20       kvenə   3
    25   Swedish    kvinna  woman  20       kviːna  3
    26   Norwegian  kvine   woman  20       kʋinə   3
    ...  ...        ...     ...    ...      ...     ...
    31 / 50
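How a pairwise distance matrix turns into cognate sets can be illustrated with a flat average-linkage clustering that stops merging at a cutoff. The distances below are the ones shown on the slides; the function name `flat_upgma` and the threshold of 0.5 are assumptions chosen for this sketch and are not taken from the talk.

```python
LANGUAGES = ["Swedish", "English", "Danish", "Norwegian", "Dutch", "German"]

# Pairwise distances between the words for "woman", as shown on the slides.
DISTANCES = [
    [0.00, 0.69, 0.07, 0.12, 0.71, 0.78],
    [0.69, 0.00, 0.66, 0.57, 0.68, 0.87],
    [0.07, 0.66, 0.00, 0.08, 0.67, 0.71],
    [0.12, 0.57, 0.08, 0.00, 0.75, 0.74],
    [0.71, 0.68, 0.67, 0.75, 0.00, 0.17],
    [0.78, 0.87, 0.71, 0.74, 0.17, 0.00],
]


def flat_upgma(labels, matrix, threshold):
    """Merge the closest clusters until the average distance hits the cutoff."""
    clusters = [[i] for i in range(len(labels))]

    def average_distance(first, second):
        total = sum(matrix[i][j] for i in first for j in second)
        return total / (len(first) * len(second))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        distance, x, y = min(
            (average_distance(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if distance >= threshold:
            break  # remaining clusters are too dissimilar to be merged
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[labels[i] for i in cluster] for cluster in clusters]


if __name__ == "__main__":
    for cog_id, members in enumerate(flat_upgma(LANGUAGES, DISTANCES, 0.5), 1):
        print(cog_id, ", ".join(members))
```

With this cutoff the Scandinavian words form one set, Dutch and German another, and English stays on its own, which agrees with the grouping in the CogID column (the numeric labels themselves are arbitrary).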
  55. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection

    [Flowchart: LexStat Algorithm (List 2014). Stages shown: input, tokenization, preprocessing, correspondence detection using phonetic alignment (a loop over the attested and expected distributions), log-odds, distance calculation, cognate clustering, output.] 31 / 50
  56. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection B-Cubed F-Scores

    on BDCD Benchmark (List 2014) [Bar chart: B-Cubed F-scores (axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on six datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese dialects).] 32 / 50
  57. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection B-Cubed F-Scores

    on BDCD Benchmark (List 2014) [Bar chart: B-Cubed F-scores (axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on six datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese dialects). Highlighted values: 75%, 93%, 92%, 81%, 89%, 81%.] 32 / 50
  58. Quantitative Historical Linguistics Achievements Examples: Automatic Cognate Detection B-Cubed F-Scores

    on BDCD Benchmark (List 2014) [Bar chart: B-Cubed F-scores (axis from 60 to 95) of the methods Turchin, NED, SCA, and LexStat on six datasets: Bai (Tibeto-Burman), Indo-European, Japanese and Ryukyu, Ob-Ugrian, Austronesian, Sinitic (Chinese dialects). Highlighted values: 75%, 93%.] 32 / 50
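For readers unfamiliar with the evaluation measure, the following sketch computes B-Cubed precision, recall, and F-score for a cognate partition against a gold standard. The gold partition reuses the CogID column from the "woman" example; the predicted partition, which wrongly lumps English with the Scandinavian words, is invented here purely for illustration, and the function name `b_cubed` is an assumption.

```python
def b_cubed(gold, predicted):
    """Return B-Cubed precision, recall, and F-score for two partitions.

    Both arguments map each item to a cluster label.
    """
    items = list(gold)
    precision = recall = 0.0
    for item in items:
        same_predicted = {o for o in items if predicted[o] == predicted[item]}
        same_gold = {o for o in items if gold[o] == gold[item]}
        overlap = len(same_predicted & same_gold)
        # Share of the predicted cluster that is truly cognate with the item.
        precision += overlap / len(same_predicted)
        # Share of the item's true cognates that the prediction recovered.
        recall += overlap / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Gold standard: cognate IDs from the "woman" example.
GOLD = {"German": 1, "Dutch": 1, "English": 2,
        "Danish": 3, "Swedish": 3, "Norwegian": 3}

# Hypothetical prediction that merges English into the Scandinavian set.
PREDICTED = {"German": 1, "Dutch": 1, "English": 3,
             "Danish": 3, "Swedish": 3, "Norwegian": 3}

if __name__ == "__main__":
    p, r, f = b_cubed(GOLD, PREDICTED)
    print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

Lumping lowers precision while leaving recall untouched; splitting a true cognate set would do the opposite, so the F-score penalizes both kinds of error.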
  59. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks Hugo Schuchardt (1842-1927)

    “We connect the branches and twigs of the tree with countless horizontal lines and it ceases to be a tree.” (Schuchardt 1870 [1900]: 11) 33 / 50
  60. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks Biological Workflow (Dagan

    & Martin 2007, Dagan et al. 2008)
    1. collect phyletic pattern data (shared gene families) of the taxa that shall be investigated
    2. use gain-loss mapping techniques with different weighting models, allowing for different amounts of gain events to analyze how the gene families evolved along a given reference tree
    3. use ancestral genome sizes as an external criterion to determine the best weighting model
    4. assume that all patterns for which the best model yields more than one gain event result from lateral gene transfer
    5. reconstruct a minimal lateral network (MLN) by connecting multiple gains for the same gene family by lateral edges
    34 / 50
  61. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks Linguistic Workflow (Nelson-Sathi

    et al. 2011, List et al. 2014a)
    1. collect phyletic pattern data (shared cognates) of the languages that shall be investigated
    2. use gain-loss mapping techniques with different weighting models, allowing for different amounts of gain events to analyze how the cognates evolved along a given reference tree
    3. use ancestral vocabulary size distributions as an external criterion to determine the best weighting model
    4. allow for a substantial amount (5%) of parallel evolution
    5. assume that all patterns for which the best model yields more than one gain event result from lateral transfer (borrowing)
    6. reconstruct a minimal lateral network by connecting multiple gains of the same cognate by lateral edges
    35 / 50
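The idea behind gain-loss mapping with different weighting models can be sketched with weighted parsimony on a rooted reference tree. This is a simplified stand-in, not the procedure of the cited studies: the toy tree, the cognate pattern, the function name `gain_loss_mapping`, and the two weight settings are all assumptions made here. Each cognate set is treated as a binary character (present or absent), and the function returns the cheapest scenario together with the number of gain events it needs.

```python
def gain_loss_mapping(tree, present, gain=1.0, loss=1.0):
    """Return (cost, gains) of the cheapest gain-loss scenario on a tree.

    tree:    nested tuples with language names (strings) as leaves
    present: set of languages in which the cognate set is attested
    The character is assumed to be absent above the root, so a root that
    carries the cognate also counts as one gain (the origin).
    Ties in cost are broken in favor of fewer gains.
    """
    infinity = float("inf")

    def visit(node):
        # Map each state (0 absent, 1 present) to (cost, gains) below node.
        if isinstance(node, str):
            attested = node in present
            return {0: (infinity if attested else 0.0, 0),
                    1: (0.0 if attested else infinity, 0)}
        table = {0: (0.0, 0), 1: (0.0, 0)}
        for child in node:
            sub = visit(child)
            for state in (0, 1):
                options = []
                for child_state in (0, 1):
                    cost, gains = sub[child_state]
                    if state == 0 and child_state == 1:
                        cost, gains = cost + gain, gains + 1  # gain on branch
                    elif state == 1 and child_state == 0:
                        cost = cost + loss                    # loss on branch
                    options.append((cost, gains))
                best_cost, best_gains = min(options)
                table[state] = (table[state][0] + best_cost,
                                table[state][1] + best_gains)
        return table

    root = visit(tree)
    return min(root[0], (root[1][0] + gain, root[1][1] + 1))


# Toy reference tree and a cognate attested in only two distant languages.
TREE = ((("German", "Dutch"), "English"),
        (("Danish", "Swedish"), "Norwegian"))
PATTERN = {"German", "Swedish"}

if __name__ == "__main__":
    for gain_weight in (1.0, 5.0):
        cost, gains = gain_loss_mapping(TREE, PATTERN, gain=gain_weight)
        print(f"gain weight {gain_weight}: cost {cost}, gains {gains}")
```

With equal weights, the cheapest explanation for this pattern uses two independent gains, which in the workflow above would be read as a candidate lateral link between German and Swedish; with expensive gains, a single origin at the root followed by four losses becomes cheaper. Choosing between such weightings is where the external criterion in step 3 comes in.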
  62. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks List et al.

    (2014b) [Network figure over Chinese dialect points: Lánzhōu, Fùzhōu, Xiāngtàn, Měixiàn, Hongkong, Wǔhàn, Běijīng, Kùnmíng, Hángzhōu, Xiàmén, Chéngdū, Sùzhōu, Shànghǎi, Táiběi, Zhèngzhōu, Shèxiàn, Nánjīng, Guìyáng, Wénzhōu, Nánníng, Tūnxī, Tiānjìn, Shāntóu, Xīníng, Qīngdǎo, Ürümqi, Píngyáo, Nánchàng, Tàiyuán, Chángshā, Hǎikǒu, Héfèi, Jiàn'ǒu, Yīnchuàn, Hohhot, Táoyuán, Xī'ān, Guǎngzhōu, Harbin, Jìnán. Legend: Inferred Links, scale 0 / 0 / 0.] 36 / 50
  64. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks List et al.

    (2014b) [Same network figure over the Chinese dialect points. Legend: Inferred Links, scale 1 / 10 / 20.] 36 / 50
  65. Quantitative Historical Linguistics Achievements Examples: Phylogenetic Networks List et al.

    (2014b) [Same network figure over the Chinese dialect points. Legend: Inferred Links, scale 1 / 4 / 8.] 36 / 50
  71. Quantitative Historical Linguistics Problems Transparency Evaluation criteria for applied automatic

    methods are not very intuitive and vary greatly. Benchmark databases are rarely used; especially in phylogenetic approaches, eyeballing of phylogenetic trees is sold as proof for “valid approaches”. It is difficult to communicate the results to traditional linguists. → Many linguists regard automatic approaches as – not trustworthy and error-prone, or – “impossible per se”, or – as useful as “rolling a dice”. 38 / 50
  72. Quantitative Historical Linguistics Problems Applicability

    Method                         Multilingual?  No additional requirements?  Freely available?
    Mackay & Kondrak 2005          ✗              ✓                            ✗
    Bergsma & Kondrak 2007         ✓              ✓                            ✗
    Turchin et al. 2010            ✓              ✓                            ✓
    Berg-Kirkpatrick & Klein 2011  ✗              ✓                            ✗
    Hauer & Kondrak 2011           ✓              ✓                            ✗
    Steiner et al. 2011            ✓              ✓                            ✗
    List 2012 & List 2014          ✓              ✓                            ✓
    Beinborn et al. 2013           ✗              ?                            ✗
    Bouchard-Côté et al. 2013      ✓              ✗                            ✗
    Rama 2013                      ✗              ✓                            ✗
    Ciobanu & Dinu 2014            ✗              ✓                            ✗
    …                              …              …                            …
    39 / 50
  77. Quantitative Historical Linguistics Problems Accuracy Data Problems (Geisler & List

    forthcoming) Comparing two independently produced lexicostatistical datasets:

    Database          # languages  # concepts
    Dyen et al. 1997  95           200
    Tower of Babel    98           110
    Intersection      46           103

    Results:
    – up to 10 % difference in concept translations
    – many undetected borrowings in both datasets
    – up to 30 % differences in tree topologies for Bayesian analyses
    40 / 50
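
A comparison of this kind can be sketched in a few lines. The snippet below is only a minimal illustration and not the procedure used by Geisler & List: it assumes that each dataset is a mapping from (language, concept) pairs to a single word form, and both the toy entries and the function name `compare_translations` are made up for this example.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (language, concept)


def compare_translations(a: Dict[Key, str], b: Dict[Key, str]) -> Tuple[int, int, float]:
    """Return (shared entries, differing entries, share of differing entries)."""
    shared = set(a) & set(b)  # entries present in both datasets
    differing = sum(1 for key in shared if a[key] != b[key])
    share = differing / len(shared) if shared else 0.0
    return len(shared), differing, share


# Hypothetical toy data, not taken from either real database.
dataset_a = {
    ("German", "tooth"): "Zahn",
    ("German", "moon"): "Mond",
    ("English", "tooth"): "tooth",
    ("English", "moon"): "moon",
    ("French", "tooth"): "dent",
}
dataset_b = {
    ("German", "tooth"): "Zahn",
    ("German", "moon"): "Mond",
    ("English", "tooth"): "tooth",
    ("English", "moon"): "month",  # deliberately divergent toy entry
    ("Italian", "tooth"): "dente",
}

shared, differing, share = compare_translations(dataset_a, dataset_b)
print(f"{differing} of {shared} shared entries differ ({share:.0%})")
```

Real datasets would additionally need a mapping between differing language names and concept labels before the intersection can be computed; that step is left out here.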
  78. Quantitative Historical Linguistics Problems Summary Many quantitative methods which are

    based on manually compiled datasets cannot cope with errors resulting from inconsistent data compilation. They are only as objective as the data fed to them! Many quantitative approaches are insufficiently tested, and scholars are often content with results that traditional linguists would never accept. Additionally, quantitative approaches are often presented in a way that makes it hard (not only for traditional linguists) to understand what they are based on. Results are reported in a non-transparent way, supplementary data is often lacking, concrete examples are seldom provided, and source code (essential for checking and replicating analyses) is missing from almost all recent publications. 41 / 50
  79. Computer-Assisted Language Comparison Bridging the Gap Bridging the Gap So

    far, the majority of computational approaches in historical linguistics largely disregards the actual needs of historical linguistics. Despite the frequent claims that the algorithms are intended to supplement traditional research, many of them are mere attempts to prove the power of modern machine learning approaches and completely disregard the achievements of traditional research in historical linguistics. 43 / 50
  80. Computer-Assisted Language Comparison Bridging the Gap Bridging the Gap If

    we really want to make a difference with computational approaches and not simply seek to replace every expert who likes books with a computer or abacus, we need to work much, much harder on a real integration of computational and traditional approaches. 43 / 50
  81. Computer-Assisted Language Comparison Bridging the Gap Bridging the Gap

    [Slide graphics: the formula P(A|B) = P(B|A) P(A) / P(B) and an image labelled “FRANZ BOPP VERY, VERY LONG TITLE”] Apart from “computational historical linguistics”, we need to establish a new discipline of “computer-aided historical linguistics”. Such a framework needs benchmarks and new standards to cope with general problems of quantitative approaches. However, such a framework will also need additional resources that help traditional approaches to leave the “realm of intuition”. 43 / 50
  82. Computer-Assisted Language Comparison Examples Benchmark Databases for Historical Linguistics First

    benchmark databases have been compiled and published:
    – Benchmark Database of Phonetic Alignments (BDPA, List & Prokić 2014, http://alignments.lingpy.org)
    – Benchmark Database for Cognate Detection (BDCD, presented in List 2014, http://sequencecomparison.github.io)
    – Benchmark Database for Linguistic Reconstruction (BDLR, in preparation)
    45 / 50
  83. Computer-Assisted Language Comparison Examples Benchmark Databases for Historical Linguistics All

    data is given in phonetic transcriptions (IPA), tokenized into phonemic units, freely available for download, and can be directly used in LingPy. 45 / 50
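
One way a gold standard of this kind might be used is to score automatic cognate judgments against the expert judgments. The sketch below uses B-cubed precision and recall, a common measure for comparing clusterings; the talk itself does not name an evaluation measure, so this choice is an assumption, as are the function names and the toy cluster labels. It is written in plain Python and does not use the LingPy API.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple


def clusters_by_item(labels: Dict[str, Hashable]) -> Dict[str, Set[str]]:
    """Map every item to the set of items that share its cluster label."""
    members = defaultdict(set)
    for item, label in labels.items():
        members[label].add(item)
    return {item: members[label] for item, label in labels.items()}


def b_cubed(gold: Dict[str, Hashable], test: Dict[str, Hashable]) -> Tuple[float, float, float]:
    """Return B-cubed (precision, recall, F-score) of `test` against `gold`."""
    if not gold or set(gold) != set(test):
        raise ValueError("gold and test must cover the same non-empty set of items")
    gold_sets = clusters_by_item(gold)
    test_sets = clusters_by_item(test)
    precision = 0.0
    recall = 0.0
    for item in gold:
        # Items that are grouped with `item` in both clusterings (includes `item`).
        overlap = len(gold_sets[item] & test_sets[item])
        precision += overlap / len(test_sets[item])
        recall += overlap / len(gold_sets[item])
    precision /= len(gold)
    recall /= len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: four words for "tooth"; the gold standard treats them as one
# cognate set, while a hypothetical method splits them into two sets.
gold = {"German": 1, "English": 1, "Italian": 1, "French": 1}
test = {"German": "a", "English": "a", "Italian": "b", "French": "b"}

precision, recall, f_score = b_cubed(gold, test)
print(f"precision={precision:.2f} recall={recall:.2f} f-score={f_score:.2f}")
```

In this toy case the method never groups unrelated words (perfect precision) but misses half of each word's true relatives (reduced recall), which is the kind of detail a single aggregated number would hide.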
  84. Computer-Assisted Language Comparison Examples Visualizations and Interactive Applications Often, automatic

    approaches hide essential aspects of their analyses. These aspects are important not only for testing the power of methods, but also for getting the best out of the results. Aggregation of results is useful for publications, but we know that “every word has its own history”, and traditional research has always been concerned with this. Visualizing and reporting all detailed decisions and findings of automatic methods will not only increase their transparency, it may also help convince traditional scholars that computational approaches can provide valuable insights. Apart from static visualizations, JavaScript and HTML5 offer unique ways for interactive data visualization and make it easy to produce, share, and explore what automatic methods have produced. So far, we have developed JavaScript prototype tools that
    – visualize phonetic alignments of cognate sets (JavaScript Cognate Viewer, JCOV, http://github.com/dighl/jcov/),
    – allow users to edit and refine alignments and cognate sets online (Etymological Dictionary Editor, EDICTOR, http://tsv.lingpy.org), and
    – visualize phylogenetic trees in geographic space (together with T. Mayer, Tree Explorer, TREX, http://github.com/dighl/TREX).
    46 / 50
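
To give an idea of how little is needed to make an alignment inspectable in a browser, the following sketch renders an alignment as a plain HTML table. It is not how JCOV or EDICTOR are implemented; the function name `alignment_to_html` and the CSS class names are assumptions chosen for this illustration, and the rows are the Germanic words for "moon" shown later in this talk.

```python
from html import escape
from typing import List, Tuple


def alignment_to_html(rows: List[Tuple[str, List[str]]]) -> str:
    """Render aligned segments as an HTML table, one row per language."""
    if len({len(segments) for _, segments in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    lines = ['<table class="alignment">']
    for language, segments in rows:
        cells = "".join(
            '<td class="{}">{}</td>'.format(
                "gap" if segment == "-" else "segment", escape(segment)
            )
            for segment in segments
        )
        lines.append(f"  <tr><th>{escape(language)}</th>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


moon = [
    ("German", ["m", "oː", "n", "t", "-"]),
    ("English", ["m", "uː", "n", "-", "-"]),
    ("Danish", ["m", "ɔː", "n", "-", "ə"]),
    ("Swedish", ["m", "oː", "n", "-", "e"]),
]

markup = alignment_to_html(moon)
print(markup)
```

Marking gaps with their own class lets a stylesheet grey them out, so that the reader's eye falls on the segments that the method actually matched.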
  87. Computer-Assisted Language Comparison Challenges Challenges

               "MOON"
    German     m   oː  n   t   -
    English    m   uː  n   -   -
    Danish     m   ɔː  n   -   ə
    Swedish    m   oː  n   -   e

               "MOON"                "SHINE"               "LIGHT"
    Fúzhōu     ŋ   u   o   ʔ   ⁵     -   -   -   -   -     -   -   -   -   -
    Měixiàn    ŋ   i   a   t   ⁵     -   -   -   -   -     k   u   o   ŋ   ⁴⁴
    Guǎngzhōu  j   -   y   t   ²     l   -   œ   ŋ   ²²    -   -   -   -   -
    Běijīng    -   y   ɛ   -   ⁵¹    l   i   ɑ   ŋ   -     -   -   -   -   -
    47 / 50
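
The Chinese rows illustrate partial cognacy: the words are compounds, and only some of their parts are shared across the varieties. A minimal sketch of how this could be made explicit is given below. It assumes that each morpheme occupies a block of exactly five alignment columns and that the three blocks correspond to the labels "MOON", "SHINE", and "LIGHT" in that order; both the fixed block width and the function names are assumptions made for this toy example, not a general method.

```python
from typing import Dict, List

LABELS = ["MOON", "SHINE", "LIGHT"]
BLOCK_WIDTH = 5  # assumption: every morpheme spans five alignment columns

ALIGNMENT = {
    "Fúzhōu": "ŋ u o ʔ ⁵ - - - - - - - - - -".split(),
    "Měixiàn": "ŋ i a t ⁵ - - - - - k u o ŋ ⁴⁴".split(),
    "Guǎngzhōu": "j - y t ² l - œ ŋ ²² - - - - -".split(),
    "Běijīng": "- y ɛ - ⁵¹ l i ɑ ŋ - - - - - -".split(),
}


def morphemes(row: List[str], width: int = BLOCK_WIDTH) -> List[str]:
    """Split an aligned row into blocks and drop the gap symbols in each."""
    blocks = [row[i:i + width] for i in range(0, len(row), width)]
    return ["".join(seg for seg in block if seg != "-") for block in blocks]


def partial_cognates(alignment: Dict[str, List[str]], labels: List[str]) -> Dict[str, List[str]]:
    """For every morpheme label, list the varieties that attest it."""
    attested: Dict[str, List[str]] = {label: [] for label in labels}
    for variety, row in alignment.items():
        for label, form in zip(labels, morphemes(row)):
            if form:  # an empty string means the block contains only gaps
                attested[label].append(variety)
    return attested


for label, varieties in partial_cognates(ALIGNMENT, LABELS).items():
    print(label, varieties)
```

A binary "cognate or not" judgment on whole words cannot express this pattern, which is why the varieties are listed per morpheme here.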
  88. Computer-Assisted Language Comparison Challenges Challenges

    [Figure: diagram over Fúzhōu, Měixiàn, Guǎngzhōu, and Běijīng with events labelled INNOVATION (five times), BORROWING, and LOSS] 47 / 50
  89. Computer-Assisted Language Comparison Challenges Challenges

    [Figure: Three Dimensions of Lexical Change (Gévaudan 2007), with the dimensions SEMANTIC CHANGE, MORPHOLOGICAL CHANGE, and STRATIC CHANGE] 47 / 50
  90. Computer-Assisted Language Comparison Challenges Challenges In order to cope with

    the multiple dimensions of lexical change, we need new methods and models in historical linguistics, which explicitly deal with borrowing, partial cognacy, and semantic change. Following the lead of evolutionary biology, these methods could be combined under a unified framework of tree reconciliation (Page & Cotton 2002) in historical linguistics. 48 / 50
  95. Conclusion Conclusion Automatic approaches are constantly gaining ground in historical

    linguistics. Nevertheless, the majority of the new approaches shows a great lack of transparency and applicability. One reason for this is the gap between traditional and computational approaches, which are mostly applied independently of each other. In order to increase the interaction between traditional and computational historical linguists, we need a paradigm shift in historical linguistics. Computational linguists need to increase the transparency of their results, focusing on their detailed and interactive presentation instead of hiding behind numbers. Traditional linguists need to increase the transparency of their methods, focusing on formalizing their intuitions instead of hiding behind their “expert insights”. 49 / 50