Studying language contact within a computer-assisted framework

Johann-Mattis List
June 01, 2019

Talk held at the 64th Annual Conference of the International Linguistic Association (2019-05-30/2019-06-01, Universidad Nacional de San Martín, Buenos Aires).

Transcript

  1. Studying Language Contact within a Computer-Assisted Framework
    Johann-Mattis List
    Research Group “Computer-Assisted Language Comparison”
    Department of Linguistic and Cultural Evolution
    Max-Planck Institute for the Science of Human History, Jena, Germany
    2019-06-01
  2. Introduction: Language contact and lexical borrowing
  3-4. Language Contact and Language History: Language History
    August Schleicher (1821-1868)
    “These assumptions, which follow logically from the results of our research, can be best illustrated by the image of a branching tree.” (Schleicher 1853: 787)
  5. Language Contact and Language History: Language History
    [Figure: Schleicher (1853)]
  6. Language Contact and Language History: Language Contact
    Johannes Schmidt (1843-1901)
    “I want to replace [the tree] by the image of a wave that spreads out from the center in concentric circles becoming weaker and weaker the farther they get away from the center.” (Schmidt 1872: 27, my translation)
  7. Language Contact and Language History: Language Contact
    [Figure: Schmidt (1875)]
  8-9. Language Contact and Language History: Language History and Language Contact
    Hugo Schuchardt (1842-1927)
    “We connect the branches and twigs of the tree with countless horizontal lines and it ceases to be a tree.” (Schuchardt 1870 [1900]: 11)
  10-11. Language Contact and Language History: Language History and Language Contact
  12. Studying Language Contact: Similarities between Languages
    similarities
      coincidental: Grk. theós, Lat. deus ‘god’
      non-coincidental
        natural: Chi. māma, Ger. Mama ‘mother’
        non-natural
          genealogical: Eng. tooth, Ger. Zahn ‘tooth’
          non-genealogical: Eng. Marlboro, Chi. wànbǎolù (proper name)
    (List 2014, Düsseldorf: DUP; List forthcoming)
  13. Studying Language Contact: Detecting Language Contact
    Evidence: Example
    - direct: Cantonese [tʰai³³-iœŋ²¹] 太陽 (Mandarin tàiyáng)
    - phylogeny-related: English mountain vs. French montagne, Spanish montaña
    - trait-related: German Damm vs. English dam
    - distribution-based: German Job, Joker, Junkie, Journal
    (List forthcoming)
  14. Studying Language Contact: Detecting Language Contact
    Convenient shortcuts:
    - treat lookalikes between Chinese and Hmong-Mien as borrowings from Chinese, for historical reasons (Ratliff 2010)
    - assume all vocabulary from a specific semantic field to be borrowed (e.g., religion or seafaring)
  15. Computational Historical Linguistics
    - starting in the early 21st century with phylogenetic approaches (Gray and Atkinson 2003, Ringe et al. 2002)
    - accompanied by pioneering work on sequence comparison (Kondrak 2000)
    - later followed by more and more approaches on different topics (phylogenetic networks, Nakhleh et al. 2005; automatic cognate detection, Hauer and Kondrak 2011)
    - now a fully established sub-field of historical linguistics
  16. Computational Historical Linguistics: Computational Approaches to Language Contact
    Proposed solutions:
    - search for conflicts in the phylogeny and explain them by invoking borrowings (MLN approach, Nelson-Sathi et al. 2011, List et al. 2014)
    - search for similar words among unrelated languages (Mennecier et al. 2016)
    - tree reconciliation methods (Willems et al. 2016)
    - borrowability statistics (Sergey Yakhontov, as reported by Starostin 1990, Chén 1996, McMahon et al. 2005)
  17. Computational Historical Linguistics: Computational Approaches to Language Contact
    Performance of proposed solutions:
    - approaches based on conflicts in the phylogeny tend to overestimate the amount of borrowing, since phylogenies can conflict for multiple reasons, not only borrowing (Morrison 2011)
    - sequence comparison on unrelated languages seems solid, but one needs to be careful with chance resemblances based on onomatopoetic words (mama, papa, etc., Jakobson 1960, Blasi et al. 2016)
    - tree reconciliation methods are unrealistic if word trees are derived from simple edit distances
    - sublist approaches may be useful, but they require large collections of known borrowings, which we usually lack
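The "simple edit distances" mentioned on this slide can be illustrated with a minimal sketch. The function names and the toy segmentations are assumptions made for illustration; they are not taken from the talk. The example reuses the māma / Mama pair from slide 12 (tone ignored) to show why a purely form-based distance cannot separate natural resemblances from borrowing or inheritance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Toy segmentations (illustrative only, tone ignored for Chinese).
german_mama = ["m", "a", "m", "a"]
chinese_mama = ["m", "a", "m", "a"]
english_mother = ["m", "ʌ", "ð", "ə"]

# Identical forms get distance 0 although the words are neither
# borrowed nor inherited from a common ancestor.
print(normalized_edit_distance(german_mama, chinese_mama))
print(normalized_edit_distance(german_mama, english_mother))
```

The distance treats every segment mismatch alike, which is one reason why word trees built from it are hard to interpret historically.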
  18. Computer-Assisted Language Comparison
  19. Background: Historical Linguistics in the Digital Age
    - the amount of data in linguistics is steadily increasing
    - our qualitative methods reach their practical limits
    - we need to take computational methods into account
    - but computational methods are not very accurate and may yield wrong results
  20-30. Project
    [Figure, shown in eleven build steps]

  31. CALC
    - ERC Starting Grant (2017-2022)
    - Host: MPI-SHH (Jena)
    - Current team: 2 post-docs, 2 doctoral students, and myself
    - Objectives go beyond historical linguistics and Sino-Tibetan (but they are our starting point)
    - http://calc.digling.org
  32. Studying Language Contact with CALC: Studying Borrowing with CALC
  33. Computer-Assisted Problem Solving
    1. identify the core class of your problem (modeling, inference, analysis)
    2. formalize the problem in a way that allows one to test it (specify data and techniques for evaluation)
    3. do not hesitate to define sub-problems, given that qualitative solutions are often holistic
    4. look at existing qualitative solutions
    5. search for inspiration in neighboring disciplines (graph theory, computer science, evolutionary biology) by looking for similar processes that could be addressed in an analogous or similar way
    6. accept a qualitative or semi-automatic solution for inference processes, but make sure that the results are annotated in a machine-readable way
    7. insist on transparent output (no black boxes) to allow for an immediate review of results by experts
  34. Computer-Assisted Problem Solving: Modeling, Inference, and Analysis
    [Figure: Modeling, Inference, Analysis; labels: 20 x, 10 x, 5 x, ?]
  35. Computer-Assisted Problem Solving: Identify Core Problems (1-3)
    - borrowing is a process that happened at different stages in time, reflected in the form of borrowing or contact layers
    - identifying the source and target of borrowings is almost impossible without knowing the history of a given area
    - distinguishing borrowing from inheritance, chance, and typological patterns of denotation is also difficult for classical linguistics
    - contact areas may overlap
  36. Computer-Assisted Problem Solving: Look at Existing Solutions (4-6)
    - recent borrowings can be detected with automatic sequence comparison approaches (Mennecier et al. 2016)
    - searching for borrowings in unrelated languages spoken in similar regions can control for inheritance (Mennecier et al. 2016)
    - highly advanced techniques for cognate detection are available by now (List 2014, List et al. 2017)
    - methods for clustering and partitioning are well advanced, but need to be applied correctly
  37. Computer-Assisted Problem Solving: Insist on Transparent Output (7)
    - lift the data to a high standard of phonetic transcription
    - use interactive applications to share the findings transparently
    - test and train rigorously on datasets from different languages of the world
    - prefer direct output (concrete items identifying a contact area) for initial studies
  38. Example from South-East Asian Languages
    [Figure with language labels: Burmish_Achang; Baheng, East; Baheng_West; Bana; Biao Min; Sui_Banliang; Sinitic_Changsha; Sinitic_Chaozhou; Sinitic_Chengdu; Chuanqiandian; Chuanqiandian_Central_Guizhou; Chuanqiandian_Northeast_Yunnan; Chuanqiandian_Southern_Guizhou; Dongnu; Sinitic_Guangzhou; Bai_Jianchuan; Jiongnai; Kim_Mun; Sinitic_Kunming; Bai_Luobenzhuo; Luobuohe_Eastern; Luobuohe_Western; Sinitic_Meixian; Mien; Sinitic_Nanchang; Numao; Nunu; Sui_Pandong; Qiandong_East; Qiandong_North; Qiandong_South; Qiandong_West; Sui_Sandong; She; Sinitic_Xi_an; Xiangxi_East; Xiangxi_West; Bai_Xiangyun; Sinitic_Yangjiang; Yi_Dafang; Yi_Mile; Yi_Mojiang; Yi_Nanhua; Yi_Nanjian; Yi_Xide; Younuo; Zao_Min]
  39. Data Preparation: Language Data
    - 48 SEA languages from three different families (Sino-Tibetan, Hmong-Mien, Tai-Kadai), aggregated from four different sources (Beijing University 1964, Sun et al. 1991, Chen 2012, Castro 2015)
    - unified phonetic transcriptions following the Cross-Linguistic Transcription Systems framework (Anderson et al. forthcoming, https://clts.clld.org)
    - unification of elicitation glosses with the help of Concepticon (List et al. 2016, https://concepticon.clld.org)
    - data curation following the principles of the Cross-Linguistic Data Formats initiative (Forkel et al. 2018, https://cldf.clld.org)
    - first inspection of data with the help of EDICTOR (List 2017, http://edictor.digling.org)
  40. Borrowing Inference
    A. within-family cognate detection using LexStat as implemented in LingPy (List 2014, List et al. 2017)
    B. cross-family borrowing detection using a new feature-based, prosody-aware approach for pronunciation distance calculation and a flat clustering approach
    C. interactive analysis of inferences
    D. partition cognate sets into groups indicative of a contact zone
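Step B can be pictured with a rough sketch. It is not the feature-based, prosody-aware method of the talk, which is described as work in progress. As a stand-in it uses a normalized edit distance and a simple single-linkage flat clustering, and it links only words from different families. All names, the threshold of 0.5, and the toy forms are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two segment sequences."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def distance(a, b):
    """Stand-in pronunciation distance: normalized edit distance."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


def flat_clusters(words, threshold=0.5):
    """Link words from different families whose distance is below the
    threshold and return the connected components (single linkage)."""
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(words)), 2):
        (_, fam_i, form_i), (_, fam_j, form_j) = words[i], words[j]
        if fam_i != fam_j and distance(form_i, form_j) < threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)
    return list(clusters.values())


def cross_family_clusters(clusters):
    """Keep clusters spanning more than one family (borrowing candidates)."""
    return [c for c in clusters if len({family for _, family, _ in c}) > 1]


# Invented toy entries: (language, family, segments).
words = [
    ("LangA1", "FamilyA", ["m", "a"]),
    ("LangB1", "FamilyB", ["m", "a"]),
    ("LangB2", "FamilyB", ["m", "o"]),
    ("LangC1", "FamilyC", ["t", "i", "k"]),
]
for cluster in cross_family_clusters(flat_clusters(words)):
    print([language for language, _, _ in cluster])
```

Every cluster returned by `cross_family_clusters` is only a candidate; step C, the interactive inspection by an expert, decides whether it reflects borrowing.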
  41. Borrowing Inference: Pronunciation Distance Calculation
    - pronunciation distance depends on prosody (with weak and strong positions in each word, see List 2014)
    - feature systems for huge numbers of sounds were lacking so far, but are available now with CLTS (Anderson et al. forthcoming)
    - alignment methods are well-developed and can be used to compare words beforehand (List 2014)
    The approach is work in progress, contact me for more information.
  42-43. Data Analysis: Contact Areas
    [Figure: language labels as on slide 38, plus Sinitic_Beijing; legend: Burmish, Hmongic, Sinitic, Mienic, Sui, Bai, Nesu]
  44. Data Analysis: Contact Areas
    - two major contact areas (Hmong-Mien and Sui, Sinitic/Bai and Hmong-Mien)
    - not all languages are under similar influence
    - inspection shows that most borrowings can be confirmed
  45. Data Analysis: Contact Areas
    - partitioning cognate sets and their associated meanings based on their distribution across languages yields about 6 groups in which five or more concepts are consistently shared
    - the groups show different distributions and offer additional insights into the distribution of shared lexical traits
    - as some problems are not yet handled (missing data, specific coding errors), a manual analysis should ideally start from here
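One simple way to read "partitioning cognate sets based on their distribution across languages" is to group cognate sets that occur in exactly the same set of languages. The sketch below does only that. It is an assumption about the general idea, not the partitioning procedure used in the talk, and the identifiers and toy data are hypothetical.

```python
from collections import defaultdict


def partition_by_distribution(cognate_sets, min_concepts=5):
    """Group cognate sets that occur in exactly the same languages.

    cognate_sets maps a cognate set ID to (concept, languages).
    Returns {frozenset(languages): sorted concepts} for every group
    that contains at least min_concepts distinct concepts.
    """
    groups = defaultdict(set)
    for concept, languages in cognate_sets.values():
        groups[frozenset(languages)].add(concept)
    return {
        languages: sorted(concepts)
        for languages, concepts in groups.items()
        if len(concepts) >= min_concepts
    }


# Invented toy data; the threshold is lowered to 2 to keep it small.
toy = {
    1: ("C1", ["L1", "L2"]),
    2: ("C2", ["L2", "L1"]),
    3: ("C3", ["L1", "L3"]),
}
for languages, concepts in partition_by_distribution(toy, min_concepts=2).items():
    print(sorted(languages), concepts)
```

Exact matching of distributions is brittle when data are missing, which fits the slide's remark that missing data and coding errors are not yet handled and that manual analysis should start from the groups.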
  46. Data Analysis: Contact Areas
    Concepts: ASK (INQUIRE), BEAN, BIG, BIRD, CHICKEN, CRY, DAY (NOT NIGHT), DIE, DRINK, DUCK, EGG, FAECES (EXCREMENT), FAR, HORSE, HUNDRED, KILL, OLD (USED), ROPE, THIS
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  47. Data Analysis: Contact Areas
    Concepts: BEAR, BITE, CHILI PEPPER, CHOOSE, FAST, HOE, PEAR, POOR, SALTY, WASH
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  48. Data Analysis: Contact Areas
    Concepts: BE HUNGRY, FIREWOOD, HARD, JUMP, MOUTH, SOUP, THIN (SLIM), WELL
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  49. Data Analysis: Contact Areas
    Concepts: ANT, CLAW, MONKEY, SPARROW, SWEET POTATO, YOUNGER BROTHER
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  50. Data Analysis: Contact Areas
    Concepts: DRINK, FAST, NOSE, THICK, THUNDER
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  51. Data Analysis: Contact Areas
    Concepts: GRASS, PEAR, RIDE, TIRED, WALK
    [Figure: language labels as on slide 38; legend: Sui, Sinitic, Nesu, Bai, Mienic, Burmish, Hmongic]
  52. Data Analysis: Concept Statistics
    - by checking the purity of cognate sets with respect to the families across which they occur, we can rank concepts according to their relative borrowability in our dataset
    - borrowability is often thought of as a stable characteristic of concepts, partly due to Swadesh’s doctrine of basic vocabulary; but concepts evolve with culture, so terms for technical innovations may be highly borrowable while they are new and not be borrowed again later
    - therefore, all statistics on borrowability have to be taken with care, as they also reflect the history of a given region and not necessarily general patterns of language change
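The slide does not define purity, so the following sketch rests on an assumed definition: the purity of a cognate set is the share of its reflexes that belong to its most frequent family, and the purity of a concept is the mean over its cognate sets. Function names, placeholder concept labels, and the toy rows are hypothetical; only the three family names come from the talk.

```python
from collections import Counter, defaultdict


def cognate_set_purity(families):
    """Share of reflexes that belong to the most frequent family."""
    counts = Counter(families)
    return max(counts.values()) / len(families)


def concept_purity(rows):
    """rows: iterable of (concept, cognate_set_id, family) triples.
    Returns {concept: mean purity of its cognate sets}."""
    by_set = defaultdict(list)
    for concept, cogid, family in rows:
        by_set[(concept, cogid)].append(family)
    per_concept = defaultdict(list)
    for (concept, _), families in by_set.items():
        per_concept[concept].append(cognate_set_purity(families))
    return {c: sum(v) / len(v) for c, v in per_concept.items()}


# Invented toy rows, not taken from the dataset.
rows = [
    ("CONCEPT_A", 1, "Sino-Tibetan"),
    ("CONCEPT_A", 1, "Sino-Tibetan"),
    ("CONCEPT_A", 1, "Hmong-Mien"),
    ("CONCEPT_A", 1, "Tai-Kadai"),
    ("CONCEPT_B", 2, "Sino-Tibetan"),
    ("CONCEPT_B", 2, "Sino-Tibetan"),
    ("CONCEPT_B", 3, "Hmong-Mien"),
]
purity = concept_purity(rows)
# Lowest purity first: these concepts are shared across families most.
ranking = sorted(purity, key=purity.get)
print(ranking, purity)
```

Under this definition a low score means that a concept's cognate sets cut across families, which is why low purity is read here as a sign of higher borrowability within this dataset only.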
  53-54. Data Analysis: Concept Statistics
  55. Data Analysis: Concept Statistics
    - a weak negative correlation (Spearman rank correlation: -0.18, p < 0.005) between the purity of concepts with respect to potential borrowings and the borrowing statistics of the World Loanword Database (WOLD, Haspelmath and Tadmor 2008)
    - a weak positive correlation (Spearman rank correlation: 0.19, p < 0.005) with the WOLD project’s age score
    - the new ranks based on concept purity could be used to expand the limited scope of the WOLD project systematically
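For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. The following dependency-free sketch shows the computation with average ranks for ties. The function names and the toy numbers are assumptions for illustration and do not reproduce the values reported on the slide.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy scores: purity per concept vs. an external borrowing score.
purity_scores = [0.9, 0.8, 0.6, 0.5]
borrowed_scores = [0.1, 0.3, 0.2, 0.7]
print(spearman(purity_scores, borrowed_scores))
```

The sketch computes only the coefficient; a library routine such as `scipy.stats.spearmanr` would also return a p-value like the one quoted on the slide.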
  56. Outlook
    *deh3 - ?
  57. Outlook
    - enhance the accuracy of our contact inference workflow
    - apply to more language families (esp. South American languages)
    - work on inference of more ancient borrowings
    - work on inference of borrowing directions
    - enhance the interactive output
  58. Outlook