[Figure caption fragment:] Coefficients are shown for each regressor individually; error bars are 95% CIs across folds of the data.

Figure 4: Test Perplexity vs. llh (mGPT): We do not find a significant correlation between the llh and mGPT’s perplexity for a language or language family.

et al., 2020). However, studies on Japanese have failed to replicate these results, suggesting that the relationship does not hold for all languages (Kuribayashi et al., 2021). Further, Oh and Schuler (2023) and Shain et al. (2022) show that this relationship may not hold even in English for the most recent language models.

Figure 2: Increase in PPP (from the full-context to 2-gram settings) in each model type (ordered by their parameter size). The bar colors correspond to those in Figure 1.

The small difference for the largest model (GPT2-xl) does not imply that the difference is valueless; the score is simply divided by the number of data points (e.g., 212,649 in the Dundee corpus) to facilitate inter-corpora comparison. As a statistical test, we compared the by-token squared residual errors from 2-gram models with those from full-context models using paired permutation tests (α = 0.05). The short-context, 2-gram models had significantly smaller fitting errors than the full-context models (p < 0.001) when using relatively large LMs (GPT2-md-Wiki, GPT2-sm, GPT2-md, GPT2-lg, and GPT2-xl); the smaller LMs (LSTM-xs-Wiki and GPT2-xs-Wiki) showed no significant differences (p ≈ 0.4). Notably, we also observed that larger GPT-2s have less human-like behavior in the full setting (right-most column in Table 4). This trend was weakened by introducing our context limitation.

Cross-linguistic consistency. Figure 1 and Table 4 show that the gain in PPP from the context limitation (full-context vs.
bigram) was larger in the largest LMs (GPT2-md in Japanese and GPT2-xl in English) than in the smallest LMs (LSTM-xs). Specifically, we compared the by-token decrease in squared residual errors; the large model exhibited a larger error decrease than the small model (p = 0.024 < 0.05 in Japanese, and p < 0.001 in English). In addition, the rank correlation between model size and PPP gain by context limitation was 0.50 in Japanese and 0.96 in English.

General effectiveness of surprisal. Note that, in all the LMs, the PPP scores (equivalent to logLik) were significantly higher than 0 under the chi-square test (p < 10⁻³¹ even in the worst case); surprisal was an effective factor, as existing studies have reported. On top of this, we newly showed that its effect size differs with the level of context limitation.

5.2 Does the potential training-inference mismatch bias our results?

Vanilla LMs slightly underestimate the short-context advantage. We additionally trained Wiki-LMs (LSTM-xs-Wiki, GPT2-xs-Wiki, and GPT2-sm-Wiki) without the data modification handling the training-inference gap (Section 4.1) (henceforth, vanilla LMs). Figure 3 shows the results of the models with and without the training modification. The vanilla LMs slightly underestimated the short-context advantage; the PPP of 2-gram surprisal improved when we adopted the modified training. That is, mitigating the train-inference gap made the trend that context limitation increases PPP clearer. Carefully training n-gram neural

be properly compared only in the context of a fixed reference vocabulary (Wilcox et al., 2020). Technically, XGLM models produce a conditional probability distribution over the same whole vocabulary, regardless of the language of the specific text they are processing. However, the models have received strong evidence during pre-training that some subportions of the vocabulary (e.g. Cyrillic tokens) should be essentially ignored while processing text in some languages (e.g.
English), thus reducing their actual reference vocabulary. Hence, while we report the perplexity-based results in Appendix B, we focused on the link between the linguistic and psychological accuracy of the models by observing how the LogLik was affected by the parameter size of the model. The choice of employing parameter size as a proxy of linguistic accuracy is supported by the results in the original XGLM paper, where the authors reported better results in almost all downstream tasks with the bigger versions of the XGLM model family (Lin et al., 2021). The code employed in this study is publicly available2.

5 Results

The first main finding of our study is that surprisal is a solid predictor of reading times across the languages considered, confirming the previous observation that context-dependent probabilistic processing generalizes beyond the Germanic language sample typically considered in the literature (de Varda and Marelli, 2022). The XGLM-based

Appendix A). The increase in goodness of fit that could be attributed to surprisal is displayed in Figure 1, grouped by model type and fixation measure. Concerning FF (1a), we observed a general decrease in LogLik when increasing the number of parameters, with the smallest XGLM564M variant outperforming the bigger models in terms of psychological accuracy. A similar trend can be observed in GD (1b), although the difference in psychological accuracy between XGLM564M and XGLM1.7B appears to be rather small3. The results are different when considering TT as the dependent variable (1c), as in this case the model that provided the highest average increase in goodness of fit was XGLM1.7B4.

6 Discussion

In this experiment, we showed that large multilingual Transformer-based models were outperformed by their smaller variants in predicting early eye movement measurements of processing difficulty.
These measurements are thought to reflect predictive processes, lexical access, and early semantic integration. This result corroborates the previous claims that cognitive modelling might constitute an exception to empirical scaling laws in NLP (Oh and Schuler, 2022). However, predictability estimates computed by relatively larger variants of the same architecture – but not the largest – provided surprisal estimates that better captured late

Abstract

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universals within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization —the lower perplexity a language model has, the more human-like the language model is— in Japanese, which has typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.

1 Introduction

It is well known that the probability of a word in context (i.e., surprisal) impacts its processing difficulty in human language processing. For example, recent studies reported that LMs with better performance for next-word prediction could also better predict human reading behavior (i.e., are more human-like) (Fossum and Levy, 2012; Goodkind and Bicknell, 2018; Wilcox et al., 2020).
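As a concrete illustration, surprisal is simply the negative log probability a language model assigns to a word given its context. A minimal sketch with a hypothetical bigram model (all probabilities below are invented for illustration):

```python
import math

# Hypothetical bigram conditional probabilities P(word | previous word).
# These numbers are invented purely for illustration.
bigram_prob = {
    ("the", "cat"): 0.10,
    ("the", "quasar"): 0.0001,
}

def surprisal(prev: str, word: str) -> float:
    """Surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(bigram_prob[(prev, word)])

# A predictable continuation carries low surprisal; a surprising one, high.
print(surprisal("the", "cat"))     # ≈ 3.32 bits
print(surprisal("the", "quasar"))  # ≈ 13.29 bits
```

Studies of the kind reviewed here then test how well such per-word surprisal values, estimated by real LMs, predict human reading times.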
In this paper, we re-examine whether the recent findings on human-like computational models can be generalized across languages. Despite the community’s ongoing search for a language-independent model (Bender, 2011), existing studies have focused almost exclusively on the English language. That said, broad-coverage cross-linguistic evaluation of the existing reports is prohibitively difficult; data on human reading behavior (e.g., eye movement) is available only in a limited number of languages. As an initial foray, this study focuses on the Japanese language as a representative of languages that have typologically different characteristics from the English language. If the observations differ between English and Japanese, the current findings on English data might lack universality across languages. We specifically revisit the recent report—the lower perplexity a LM has, the more human-like the LM is—in the English and Japanese languages (Fossum and Levy, 2012; Goodkind and Bicknell, 2018; Wilcox et al., 2020). In addition to the importance

[Slide annotation (translated from garbled Japanese): ~300M params, refuted in a language-dependent way: "Lower perplexity is not always human-like" (Kuribayashi+, 21); ~1.5B params, refuted in English: "Context limitations Make Neural Language Models More Human-Like" (Kuribayashi+, 22); ~4.5B, refuted in 13 languages: "Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times" (Varda+, 23); ~13B?, refuted in 11 languages: "Testing the Predictions of Surprisal Theory in 11 Languages" (Wilcox+, 23). However, these works do not compare multiple models within a single language; humans do not seem to predict the next word this accurately (the surprisal computation method is in a sense impoverished).]

was more equivocal. The relationship between online and offline measures of comprehension difficulty is currently poorly understood, and we leave this discrepancy to future investigation. With respect to Hoover et al. (2022), their claims of superlogarithmicity are based on visual estimates (and descriptive statistics derived from those estimates) from models fitted only to the Natural Stories SPR dataset.
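For concreteness, the competing linking functions can be written down directly: the logarithmic hypothesis takes reading time to be linear in surprisal s = −log p, while a superlogarithmic model such as SURP4/3 uses s^(4/3) as the predictor. A small sketch (the probabilities are invented for illustration):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in nats: -ln p."""
    return -math.log(p)

# Over the long right tail (rare, high-surprisal words) the superlogarithmic
# predictor s**(4/3) pulls away from the logarithmic predictor s, which is
# exactly where the two hypotheses make diverging predictions.
for p in (0.5, 0.01, 1e-6):
    s = surprisal(p)
    print(f"p={p:g}  s={s:.2f}  s^(4/3)={s ** (4 / 3):.2f}")
```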
Our results in fact partially replicate theirs, since estimates tend to be visually superlogarithmic in Natural Stories SPR (especially over the long right tail of surprisal values, see Supplementary Figure A5), and a slightly superlogarithmic model (SURP4/3) outperforms a logarithmic one on that dataset, aggregating over all language models. However, this outcome appears to be largely restricted to Natural Stories SPR and does not generalize to a broader sample of reading data. In the absence of reasons to think that Natural Stories SPR is an especially reliable source of evidence on this question (see Supplementary Information G for counterarguments), our results suggest that the Hoover et al. (2022) pattern may not be characteristic of reading in general.

3.2 Implications for Statistical Modeling of Human Subjective Word Probabilities

Our results additionally differentiate computational models of human next-word prediction. Surprisal estimates from GPT-2(-small) (Radford et al., 2019) substantially outperform surprisal estimates from n-gram, PCFG, GPT-J, and GPT-3 models. GPT-2 therefore appears to reside in a “Goldilocks” region of psychometric performance between language models that are too constrained on the one hand (n-gram and PCFG models) and too powerful on the other (GPT-J and GPT-3). This outcome challenges the notion that previously reported correlations between the linguistic and psychometric performance of language models (e.g., Goodkind and Bicknell, 2018; Hao et al., 2020; Wilcox et al., 2020) will extrapolate to models of ever-increasing size, complexity, and quantity of training data (see also Oh, Clark, and Schuler, 2022). Instead, the task of using language model predictions to estimate human reading times may be akin to tasks in natural language processing that show an “inverse scaling” property, whereby task performance is inversely related to model size (McKenzie et al., 2022b,a, 2023).
This result has both methodological and scientific implications. From a methodological standpoint, bigger is not always better; the selection of a language model for psycholinguistic research may need to consider additional dimensions (beyond perplexity). From a scientific standpoint, homing in on classes of models that best mimic human processing patterns offers the opportunity for new insights into the learning and processing mechanisms that underlie human language abilities (Schrimpf et al., 2020; Heilbron et al., 2022), a direction that we leave to future work.

[Slide annotation (translated from garbled Japanese): ~175B params, refuted in English: "Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time" (Shain+, 23)]

When each story or article did not fit into a single context window for the LMs, the second half of the previous context window served as the first half of a new context window to calculate surprisal estimates for the remaining tokens. In practice, most stories and articles fit completely within two context windows for the GPT-2 models that have a context size of 1,024 tokens, and within one context window for the GPT-Neo and OPT models that have a context size of 2,048 tokens. Additionally, when a single word wt was tokenized into multiple subword tokens, negative log probabilities of subword tokens corresponding to wt were added together to calculate S(wt) = − log P(wt | w1..t−1).

3.3 Regression Modeling

Subsequently, following the methods of Oh et al. (2022), a ‘baseline’ LME model that contains baseline predictors capturing low-level cognitive processing and seventeen ‘full’ LME models that contain the baseline predictors and each LM surprisal predictor were fit to the exploratory set of self-paced reading times and go-past durations using lme4 (Bates et al., 2015).
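The two mechanics described in the preceding subsection (half-overlapping context windows, and summing subword log probabilities into a word surprisal) can be sketched as follows; `make_windows` and the toy numbers are illustrative stand-ins, not the authors' code:

```python
def make_windows(tokens, window_size):
    """Split a token sequence into context windows in which the second half of
    the previous window is reused as the first half of the next, so tokens
    scored in later windows always see at least half a window of context."""
    windows = [tokens[:window_size]]
    start = window_size
    half = window_size // 2
    while start < len(tokens):
        windows.append(tokens[start - half:start + half])
        start += half
    return windows

def word_surprisal(subword_logprobs):
    """S(w_t) = -log P(w_t | context): sum the negative log probabilities
    of the subword tokens that make up the word."""
    return -sum(subword_logprobs)

tokens = list(range(10))
print(make_windows(tokens, 4))  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
# A word split into two subwords with log probabilities -1.0 and -0.5:
print(word_surprisal([-1.0, -0.5]))  # 1.5
```

In each window after the first, only the second half of the tokens receive new surprisal estimates; the first half serves purely as context.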
The baseline predictors include word length measured in characters and index of word position within each sentence (both self-paced reading and eye-tracking), as well as saccade length and whether or not the previous word was fixated (eye-tracking only). All predictors were centered and scaled prior to model fitting, and the LME models included by-subject random slopes for all fixed effects as well as random intercepts for each subject and each word type. Additionally, for self-paced reading times collected from 181 subjects, a random intercept for each subject-sentence interaction was included. For eye-gaze durations collected from a much smaller number of 10 subjects, a random intercept for each sentence was included. After the regression models were fit, the ∆LL values were first calculated for each regression model by subtracting the log-likelihood of the baseline model from that of a full regression model. Moreover, to examine the trend between LM perplexity and the predictive power of surprisal estimates, the perplexity of each LM variant was calculated on the two corpora.

3.4 Results

Figure 1: Perplexity measures from each LM variant, and improvements in regression model log-likelihood from including each surprisal estimate on the exploratory set of Natural Stories (top) and Dundee data (bottom). Dotted lines indicate the least-squares regression line for each LM family.

Surprisal estimates from the smallest LM variants in each family (e.g., GPT-Neo 125M and OPT 125M) made the biggest contribution to regression model fit on both self-paced reading times and eye-gaze durations for the three LM families.
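The ∆LL computation described above can be sketched with ordinary least squares standing in for the paper's lme4 mixed-effects models; the reading-time data below are synthetic, and `delta_ll` is an illustrative stand-in for the full regression comparison:

```python
import math
import random

def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood given a model's residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # MLE of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def delta_ll(rt, surprisal):
    """Log-likelihood improvement from adding a surprisal predictor to an
    intercept-only baseline (closed-form simple linear regression)."""
    n = len(rt)
    mean_rt = sum(rt) / n
    mean_s = sum(surprisal) / n
    cov = sum((s - mean_s) * (r - mean_rt) for s, r in zip(surprisal, rt))
    var = sum((s - mean_s) ** 2 for s in surprisal)
    slope = cov / var
    intercept = mean_rt - slope * mean_s
    base = gaussian_loglik([r - mean_rt for r in rt])
    full = gaussian_loglik([r - (intercept + slope * s) for s, r in zip(surprisal, rt)])
    return full - base

# Synthetic reading times that partly track surprisal.
random.seed(0)
surp = [random.uniform(0, 15) for _ in range(500)]
rt = [200 + 10 * s + random.gauss(0, 50) for s in surp]
print(delta_ll(rt, surp))  # positive: adding surprisal improves the fit
```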
More notably, surprisal estimates from larger LM variants within each family yielded strictly poorer fits to reading times, robustly replicating the trend observed by Oh et al. (2022). Interestingly, the three LM families also seem to demonstrate a strong log-linear relationship between perplexity and ∆LL, as can be seen by the least-squares regression lines. All regression lines had a slope significantly greater than 0 at the p < 0.05 level according to a one-tailed t-test, with the exception of the regression line for GPT-2 on Natural Stories (p = 0.07). This trend is highly significant overall by a binomial test (five results with p < 0.05 out of six trials), and directly contradicts the findings of recent studies that report a positive relationship between language model quality and psychometric predictive power.
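The binomial test cited above is easy to verify with the standard library: under the null hypothesis that each of the six regression lines independently reaches p < 0.05 with probability 0.05, five or more significant results are vanishingly unlikely:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Five or more of six tests significant at alpha = 0.05, under the null:
print(binom_tail(5, 6, 0.05))  # ≈ 1.8e-06
```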