of occupation word vector and female gender count
Distribution of effect sizes across languages
Language semantics reflects (and influences) culture. Languages vary in their semantics. (Lewis & Lupyan, 2018)
[Figure panels: German, Japanese]
Career-family bias in language
of the general principles that govern variability: “topographies” of variability
- COLOR: Kay & Regier (2006)
- KINSHIP: Kemp et al. (2007)
- NATURAL KINDS: Youn et al. (2016)
Cross-linguistic semantic variability
of English
35 L1s
~360 words/essay
28 prompts, e.g.:
- People's behavior is largely determined by forces not of their own making.
- A nation should require all of its students to study the same national curriculum until they enter college.
- Nations should pass laws to preserve any remaining wilderness areas in their natural state, even if these areas could be developed for economic gain.
Modern Greek, Gujarati, Hindi, Igbo, Indonesian, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Nepali, Eastern Panjabi, Polish, Portuguese, Romanian, Russian, Spanish, Tagalog, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese, Yoruba, Western Farsi
[Figure legend. Language Family: Afro-Asiatic, Atlantic-Congo, Austroasiatic, Austronesian, Dravidian, Indo-European, Japonic, Korean, Sino-Tibetan, Tai-Kadai, Turkic. Language Typology: VSO, SVO, SOV, other.]
Languages
[Figure: Language centroids, colored by language family (Afro-Asiatic, Atlantic-Congo, Austroasiatic, Austronesian, Dravidian, Indo-European, Japonic, Korean, Sino-Tibetan, Tai-Kadai, Turkic). Languages shown: Italian, French, Spanish, Portuguese, Russian, Polish, Romanian, Bulgarian, Modern Greek, German, Dutch, Tagalog, Vietnamese, Thai, Indonesian, Mandarin Chinese, Korean, Japanese, Turkish, Western Farsi, Standard Arabic, Yoruba, Igbo, English, Tamil, Malayalam, Kannada, Urdu, Eastern Panjabi, Telugu, Marathi, Hindi, Bengali, Nepali, Gujarati.]
(even though the essays were written in English!)
The relationships between the semantic spaces of different languages are structured
The structure is predicted by linguistic and cultural dimensions of the native speakers
[Figure, repeated: Language centroids, colored by language family]
How do the languages differ from each other?
2: Characterizing the topography of differences
Semantics are represented in memory as clusters of related meanings (e.g., Nelson, McEvoy, & Schreiber, 1998)
H3: Universal
H2: Universal macrostructure
[Figure: example word clusters. animal, dog, zoo, farm, mammal, pets, cat; school, university, community, teacher, bus, recess, college]
H4: No universality
Topographies of cross-linguistic variability
crowdsourced ratings for 40K English words (Brysbaert et al., 2013)
Study 2: Method
"Some words refer to things or actions in reality, which you can experience directly through one of the five senses. We call these words concrete words. Other words refer to meanings that cannot be experienced directly but which we know because the meanings can be defined by other words. These are abstract words."
Concreteness appears to be an organizing principle in the system
Semantics are highly correlated with concreteness
If true, this should generalize to native language semantics
trained on corpus of Wikipedia in each of 35 languages (fastText; Bojanowski, Grave, Joulin, & Mikolov, 2016)
1000 words from each decile from the concreteness norms
Translated them into all 35 languages using Google Translate API
Same analysis as second language text, predict same pattern
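The decile-sampling step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `sample_by_decile`, the dictionary input format, and the use of rank-based deciles are hypothetical choices, not details given on the slide.

```python
import random

def sample_by_decile(ratings, per_decile, seed=0):
    """Draw up to `per_decile` words from each concreteness decile.

    `ratings` maps word -> concreteness rating (hypothetical input format).
    Deciles are formed by rank, which is an assumption; the slide does not
    say how the deciles were defined.
    """
    rng = random.Random(seed)
    ranked = sorted(ratings, key=ratings.get)  # least to most concrete
    n = len(ranked)
    samples = []
    for d in range(10):
        # Slice out the d-th tenth of the ranked word list.
        bucket = ranked[d * n // 10:(d + 1) * n // 10]
        samples.append(rng.sample(bucket, min(per_decile, len(bucket))))
    return samples

# Toy usage with made-up ratings (not real norms).
toy_ratings = {f"w{i}": i / 10 for i in range(100)}
toy_samples = sample_by_decile(toy_ratings, per_decile=3)
```

With the real norms, `per_decile` would be 1000, giving the stratified word set that is then translated into each language.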
2011):
- area of region in which the language is spoken (km²)
- mean/sd temperature (Celsius) and precipitation (cm)
[2] Grammatical distance (Dryer & Haspelmath, 2013; Dediu, in prep.): Set of morphosyntactic features coded for each language (e.g., word order, plural marking, tenses, etc.)
[3] Lexical distance (Bakker et al., 2009; Wichmann, Rama, & Holman, 2011): Distances between words using Levenshtein’s edit distance
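For readers unfamiliar with the lexical-distance measure, here is a minimal sketch of Levenshtein's edit distance between two word forms. The length-normalized variant is an assumption added for illustration; the slide does not say which normalization, if any, the cited work applies, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

A language-level lexical distance could then be obtained by averaging such word-level distances over a shared word list, though the exact aggregation is not specified here.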
[Figure: Coherence by native language. Axes: Native language; Coherence (0.00 to 1.25). Regions labeled "between > within" and "within > between". Languages in axis order: Tagalog, English, Mandarin Chinese, Romanian, Portuguese, Standard Arabic, Turkish, Western Farsi, Polish, Korean, Indonesian, Bulgarian, Thai, Igbo, Urdu, Russian, Gujarati, Spanish, Japanese, Telugu, French, Malayalam, Modern Greek, Tamil, Dutch, Italian, Hindi, Bengali, German, Nepali, Vietnamese, Kannada, Marathi, Eastern Panjabi.]
coherence = between / ((within1 + within2) / 2)
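Reading the slide's coherence formula as a ratio of between-group distance to average within-group distance, a rough sketch might look like the following. The slide does not state what the distances are measured over, so the two groups of vectors, the Euclidean metric, and all function names here are assumptions made purely for illustration.

```python
from itertools import combinations, product
from statistics import mean

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def mean_within(points):
    """Mean pairwise distance among the points of one group."""
    return mean(euclidean(u, v) for u, v in combinations(points, 2))

def mean_between(group_1, group_2):
    """Mean distance over all cross-group pairs."""
    return mean(euclidean(u, v) for u, v in product(group_1, group_2))

def coherence(group_1, group_2):
    """between / ((within1 + within2) / 2); values above 1 mean between > within."""
    within = (mean_within(group_1) + mean_within(group_2)) / 2
    return mean_between(group_1, group_2) / within

# Toy usage with made-up 2-D points: two tight, well-separated groups.
tight_1 = [(0.0, 0.0), (0.0, 1.0)]
tight_2 = [(10.0, 0.0), (10.0, 1.0)]
separated = coherence(tight_1, tight_2)  # greater than 1
```

Under this reading, a value of 1 marks the boundary between the "between > within" and "within > between" regions of the figure.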