Slide 6
Slide 6 text
The breakdown of the scaling law in cognitive modeling (model PPL ∝ explanatory power for reading times),
revealed as language model performance improves
2023/8/26
Figure 3: Model Coefficients: coefficients for a linear model […]. Coefficients are shown for each regressor word individually; error bars are 95% CIs across folds of data.
Figure 4: Test Perplexity vs. llh (mGPT): We do not find a significant correlation between the llh and mGPT's perplexity for a language or language family.
However, studies on Japanese have failed to replicate these results, suggesting that
the relationship does not hold for all languages (Kuribayashi et al., 2021). Further, Oh and Schuler
(2023) and Shain et al. (2022) show that this relationship may not hold even in English for the
most recent language models.
Figure 2: Increase in PPP (from the full-gram to 2-gram
settings) in each model type (ordered by their parameter
size). The bar colors correspond to those in Figure 1.
xl) does not imply that the difference is valueless,
but this is just because the score is divided by
the number of data points (e.g., 212,649 in the
Dundee corpus) to facilitate inter-corpora compar-
ison. As a statistical test, we compared the by-
token squared residual errors from 2-gram models
with those from full-context models using paired
permutation tests (α = 0.05). The short-context, 2-gram models had significantly smaller fitting errors
than the full-context models (p < 0.001) when using relatively large LMs (GPT2-md-Wiki, GPT2-sm,
GPT2-md, GPT2-lg, and GPT2-xl); the smaller LMs (LSTM-xs-Wiki and GPT2-xs-Wiki) showed no significant differences (p ≈ 0.4).
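To make the test concrete, the following is a minimal sketch of a paired (sign-flip) permutation test on per-token squared residuals; the arrays, the sample size, and the sign-flip scheme are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on the mean of a - b."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a) - np.asarray(b)  # per-token difference in squared residuals
    observed = diff.mean()
    # Randomly flip the sign of each paired difference and recompute the mean.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = (signs * diff).mean(axis=1)
    # Two-sided p-value: fraction of sign-flipped means at least as extreme.
    return (np.abs(null) >= abs(observed)).mean()

# Hypothetical per-token squared residual errors from the two regressions.
rng = np.random.default_rng(1)
resid_sq_2gram = rng.gamma(2.0, 1.0, size=1000)
resid_sq_full = resid_sq_2gram + rng.normal(0.05, 0.3, size=1000)
print(paired_permutation_test(resid_sq_2gram, resid_sq_full))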
Notably, we also observed that larger GPT-2s
have less human-like behavior in the full setting
(right-most column in Table 4). This trend was
weakened by introducing our context limitation.
Cross-linguistic consistency. Figure 1 and Table 4 show that the PPP gain from the context limitation (full-context vs. bigram) was larger
in the largest LMs (GPT2-md in Japanese and
GPT2-xl in English) than in the smallest LMs
(LSTM-xs). Specifically, we compared the by-
token decrease in squared residual errors; the large
model exhibited a larger error decrease than the
small model (p = 0.024 < 0.05 in Japanese, and
p < 0.001 in English). In addition, the rank corre-
lation between model size and PPP gain by context
limitation was 0.50 in Japanese and 0.96 in English.
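The rank correlation reported here is a Spearman correlation over (model size, PPP gain) pairs and can be computed as in the sketch below; the parameter counts and gains are placeholders, not the paper's numbers.

from scipy.stats import spearmanr

# Hypothetical (parameter count, PPP gain from context limitation) pairs,
# ordered from the smallest to the largest model.
model_sizes = [50e6, 120e6, 350e6, 770e6, 1500e6]
ppp_gains = [0.010, 0.030, 0.025, 0.050, 0.080]

rho, p = spearmanr(model_sizes, ppp_gains)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")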
General effectiveness of surprisal. Note that, in
all the LMs, the PPP scores (equivalent to logLik)
were significantly higher than 0 with the chi-square
test (p < 10⁻³¹ even in the worst case); surprisal
was an effective factor as existing studies reported.
On top of this, we newly showed that their effect
size differs due to the context limitation levels.
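As a rough illustration of this check, PPP corresponds to the log-likelihood gained by adding surprisal to the regression, and a nested-model comparison can be evaluated with a chi-square (likelihood-ratio) statistic; the log-likelihood values and the single degree of freedom below are assumptions for the sketch, not numbers from the paper.

from scipy.stats import chi2

# Hypothetical log-likelihoods of the baseline and baseline+surprisal regressions.
loglik_baseline = -152_340.0
loglik_with_surprisal = -152_250.0  # per-token PPP = difference / number of tokens

lr_stat = 2 * (loglik_with_surprisal - loglik_baseline)  # likelihood-ratio statistic
p_value = chi2.sf(lr_stat, df=1)  # df = number of added surprisal predictors
print(f"LR = {lr_stat:.1f}, p = {p_value:.2e}")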
5.2 Does the potential training-inference
mismatch bias our results?
Vanilla LMs slightly underestimate the short-
context advantage. We additionally trained Wiki-
LMs (LSTM-xs-Wiki, GPT2-xs-Wiki, and GPT2-
sm-Wiki) without the data modification handling
the training-inference gap (Section 4.1) (henceforth, vanilla LMs). Figure 3 shows the results
of the models with and without the training modi-
fication. The vanilla LMs slightly underestimated
the short-context advantage; the PPP of 2-gram
surprisal improved when we adopted the modified
training. That is, mitigating the training-inference gap made the trend that the context limitation
increases PPP clearer. Carefully training n-gram neural
be properly compared only in the context of a fixed
reference vocabulary (Wilcox et al., 2020). Techni-
cally, XGLM models produce a conditional proba-
bility distribution over the same whole vocabulary,
regardless of the language of the specific text they
are processing. However, the models have received
strong evidence during pre-training that some sub-
portions of the vocabulary (e.g. Cyrillic tokens)
should be essentially ignored while processing text
in some languages (e.g. English), thus reducing
their actual reference vocabulary. Hence, while we
report the perplexity-based results in Appendix B,
we focused on the link between the linguistic and
psychological accuracy of the models by observing
how the LogLik was affected by the parameter
size of the model. The choice of employing param-
eter size as a proxy of linguistic accuracy is sup-
ported by the results in the original XGLM paper,
where the authors reported better results in almost
all downstream tasks with the bigger versions of
the XGLM model family (Lin et al., 2021).
The code employed in this study is publicly available.²
5 Results
The first main finding of our study is that sur-
prisal is a solid predictor of reading times across
the languages considered, confirming the previous
observation that context-dependent probabilistic
processing generalizes beyond the Germanic lan-
guage sample typically considered in the literature
(de Varda and Marelli, 2022). The XGLM-based […] (Appendix A).
The increase in goodness of fit that could be
attributed to surprisal is displayed in Figure 1,
grouped by model type and fixation measure. Con-
cerning FF (1a), we reported a general decrease in
LogLik when increasing the number of parame-
ters, with the smallest XGLM564M variant outper-
forming the bigger models in terms of psycholog-
ical accuracy. A similar trend can be observed in
GD (1b), although the difference in psychologi-
cal accuracy between XGLM564M and XGLM1.7B
appears to be rather small.³ The results are differ-
ent when considering TT as the dependent variable
(1c), as in this case the model that provided the
highest average increase in goodness of fit was XGLM1.7B.⁴
6 Discussion
In this experiment, we showed that large multilin-
gual Transformer-based models were outperformed
by their smaller variants in predicting early eye
movement measurements of processing difficulty.
These measurements are thought to reflect predic-
tive processes, lexical access, and early semantic
integration. This result corroborates the previous
claims that cognitive modelling might constitute
an exception to empirical scaling laws in NLP (Oh
and Schuler, 2022). However, predictability es-
timates computed by relatively larger variants of
the same architecture – but not the largest – pro-
vided surprisal estimates that better captured late
Abstract
In computational psycholinguistics, various
language models have been evaluated against
human reading behavior (e.g., eye movement)
to build human-like computational models.
However, most previous efforts have focused almost exclusively on English, despite the recent
trend towards linguistic universality within the general community. To fill this gap,
this paper investigates whether the established
results in computational psycholinguistics can
be generalized across languages. Specifically,
we re-examine an established generalization
—the lower perplexity a language model has,
the more human-like the language model is—
in Japanese, a language with typologically different structures from English. Our experiments demon-
strate that this established generalization ex-
hibits a surprising lack of universality; namely,
lower perplexity is not always human-like.
Moreover, this discrepancy between English
and Japanese is further explored from the
perspective of (non-)uniform information den-
sity. Overall, our results suggest that a cross-
lingual evaluation will be necessary to con-
struct human-like computational models.
1 Introduction
It is well known that the probability of a word
in context (i.e., surprisal) impacts its processing difficulty in incremental
human language processing. For example, recent
studies reported that LMs with better performance
for next-word prediction could also better predict
the human reading behavior (i.e., be more human-
like) (Fossum and Levy, 2012; Goodkind and Bick-
nell, 2018; Wilcox et al., 2020).
In this paper, we re-examine whether the re-
cent findings on human-like computational mod-
els can be generalized across languages. Despite
the community’s ongoing search for a language-
independent model (Bender, 2011), existing stud-
ies have focused almost exclusively on the English
language. Having said that, broad-coverage cross-
linguistic evaluation of the existing reports is pro-
hibitively difficult. In fact, data on human reading
behavior (e.g., eye movement) is available in only a
limited number of languages. As an initial foray, this study
focuses on the Japanese language as a representa-
tive of languages that have typologically different
characteristics from the English language. If the ob-
servation is different between English and Japanese,
the current findings on English data might lack a
universality across languages.
We specifically revisit the recent report—the
lower perplexity an LM has, the more human-like the
LM is—in the English and Japanese languages (Fos-
sum and Levy, 2012; Goodkind and Bicknell, 2018;
Wilcox et al., 2020). In addition to the importance
~300M params. Breakdown is language-dependent
Lower perplexity is not always human-like
(Kuribayashi+,21)
~1.5B params. Breaks down in English
Context limitations Make Neural Language
Models More Human-Like (Kuribayashi+,22)
~4.5B params. Breaks down in 13 languages
Scaling in Cognitive Modelling: a Multilingual
Approach to Human Reading Times (Varda+,23)
~13B? params. Breaks down in 11 languages
Testing the Predictions of Surprisal Theory in 11
Languages (Wilcox+23)
Note, however, that this does not compare multiple models within the same language.
Humans do not seem to predict the next word that accurately
(in a sense, the surprisal computation method is impoverished).
was more equivocal. The relationship between online and offline measures of comprehension difficulty is cur-
rently poorly understood, and we leave this discrepancy to future investigation. With respect to Hoover et al.
(2022), their claims of superlogarithmicity are based on visual estimates (and descriptive statistics derived
from those estimates) from models fitted only to the Natural Stories SPR dataset. Our results in fact partially
replicate theirs, since estimates tend to be visually superlogarithmic in Natural Stories SPR (especially over
the long right tail of surprisal values, see Supplementary Figure A5), and a slightly superlogarithmic model
(SURP4/3) outperforms a logarithmic one on that dataset, aggregating over all language models. However,
this outcome appears to be largely restricted to Natural Stories SPR and does not generalize to a broader
sample of reading data. In the absence of reasons to think that Natural Stories SPR is an especially reliable
source of evidence on this question (see Supplementary Information G for counterarguments), our results
suggest that the Hoover et al. (2022) pattern may not be characteristic of reading in general.
3.2 Implications for Statistical Modeling of Human Subjective Word Proba-
bilities
Our results additionally differentiate computational models of human next-word prediction. Surprisal esti-
mates from GPT-2(-small) (Radford et al., 2019) substantially outperform surprisal estimates from n-gram,
PCFG, GPT-J, and GPT-3 models. GPT-2 therefore appears to reside in a “Goldilocks” region of psycho-
metric performance between language models that are too constrained on the one hand (n-gram and PCFG
models) and too powerful on the other (GPT-J and GPT-3). This outcome challenges the notion that previ-
ously reported correlations between the linguistic and psychometric performance of language models (e.g.,
Goodkind and Bicknell, 2018; Hao et al., 2020; Wilcox et al., 2020) will extrapolate to models of ever-
increasing size, complexity, and quantity of training data (see also Oh, Clark, and Schuler, 2022). Instead,
the task of using language model predictions to estimate human reading times may be akin to tasks in natural
language processing that show an “inverse scaling” property, whereby task performance is inversely related
to model size (McKenzie et al., 2022b,a, 2023). This result has both methodological and scientific implica-
tions. From a methodological standpoint, bigger is not always better; the selection of a language model for
psycholinguistic research may need to consider additional dimensions (beyond perplexity). From a scientific
standpoint, homing in on classes of models that best mimic human processing patterns offers the opportu-
nity for new insights into the learning and processing mechanisms that underlie human language abilities
(Schrimpf et al., 2020; Heilbron et al., 2022), a direction that we leave to future work.
~175B params. Breaks down in English
Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time (Shain+,23)
When each story or article did not fit into a single
context window for the LMs, the second half
of the previous context window served as the
first half of a new context window to calculate
surprisal estimates for the remaining tokens. In
practice, most stories and articles fit completely
within two context windows for the GPT-2 mod-
els that have a context size of 1,024 tokens, and
within one context window for the GPT-Neo and
OPT models that have a context size of 2,048
tokens. Additionally, when a single word w_t was tokenized into multiple subword tokens, the negative log probabilities of the subword tokens corresponding to w_t were added together to calculate S(w_t) = −log P(w_t | w_{1..t−1}).
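A minimal sketch of this surprisal computation is shown below, assuming a HuggingFace GPT-2 checkpoint and a text short enough to fit in one context window (the half-overlapping-window trick for longer texts is only noted in a comment); it is an illustration, not the authors' code.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(words):
    # Tokenize word by word so we know which subword tokens belong to which word.
    groups = [tokenizer.encode((" " if i > 0 else "") + w) for i, w in enumerate(words)]
    ids = [tok for g in groups for tok in g]
    # NOTE: for texts longer than the context window, a real pipeline would score
    # the text window by window, reusing the second half of the previous window
    # as the first half of the next one and keeping surprisal only for new tokens.
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    # Token i is predicted from position i-1; the first token has no context.
    tok_surprisal = [float("nan")] + [
        -logprobs[0, i - 1, ids[i]].item() for i in range(1, len(ids))
    ]
    # Sum subword surprisals within each word: S(w_t) = -log P(w_t | w_1..t-1).
    out, pos = [], 0
    for g in groups:
        out.append(sum(tok_surprisal[pos:pos + len(g)]))
        pos += len(g)
    return out

print(word_surprisals("The cat sat on the mat .".split()))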
3.3 Regression Modeling
Subsequently, following the methods of Oh et al.
(2022), a ‘baseline’ LME model that contains
baseline predictors capturing low-level cognitive
processing and seventeen ‘full’ LME models that
contain the baseline predictors and each LM sur-
prisal predictor were fit to the exploratory set of
self-paced reading times and go-past durations
using lme4 (Bates et al., 2015). The baseline pre-
dictors include word length measured in characters
and index of word position within each sentence
(both self-paced reading and eye-tracking), as well
as saccade length and whether or not the previous
word was fixated (eye-tracking only).
All predictors were centered and scaled prior
to model fitting, and the LME models included
by-subject random slopes for all fixed effects as
well as random intercepts for each subject and
each word type. Additionally, for self-paced read-
ing times collected from 181 subjects, a random
intercept for each subject-sentence interaction was
included. For eye-gaze durations collected from
a much smaller number of 10 subjects, a random
intercept for each sentence was included.
After the regression models were fit, the ∆LL
values were first calculated for each regression
model by subtracting the log-likelihood of the
baseline model from that of a full regression
model. Moreover, to examine the trend between
LM perplexity and predictive power of surprisal
estimates, the perplexity of each LM variant was
calculated on the two corpora.
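The regressions above are fit with lme4 in R; as a simplified sketch of how the ΔLL values are obtained, the Python snippet below fits a baseline and a full mixed model and takes the log-likelihood difference. It keeps only a by-subject random intercept (statsmodels cannot express the crossed random intercepts and by-subject random slopes described above), and the DataFrame and column names are hypothetical.

import statsmodels.formula.api as smf

def delta_ll(df):
    # Baseline: low-level predictors only (assumed already centered and scaled).
    baseline = smf.mixedlm(
        "rt ~ word_len + word_pos",
        data=df, groups=df["subject"],
    ).fit(reml=False)  # ML fit so that log-likelihoods are comparable across models
    # Full: baseline predictors plus one LM's surprisal estimates.
    full = smf.mixedlm(
        "rt ~ word_len + word_pos + surprisal",
        data=df, groups=df["subject"],
    ).fit(reml=False)
    return full.llf - baseline.llf  # ΔLL contributed by the surprisal predictor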
3.4 Results
Figure 1: Perplexity measures from each LM variant,
and improvements in regression model log-likelihood
from including each surprisal estimate on the ex-
ploratory set of Natural Stories (top) and Dundee data
(bottom). Dotted lines indicate the least-squares re-
gression line for each LM family.
Surprisal from the smallest variants of each family (GPT-2 124M, GPT-Neo 125M, and OPT 125M) made the biggest contri-
bution to regression model fit on both self-paced
reading times and eye-gaze durations for the
three LM families. More notably, surprisal esti-
mates from larger LM variants within each family
yielded strictly poorer fits to reading times, ro-
bustly replicating the trend observed by Oh et al.
(2022). Interestingly, the three LM families also
seem to demonstrate a strong log-linear relation-
ship between perplexity and ∆LL, as can be seen
by the least-squares regression lines. All regres-
sion lines had a slope significantly greater than 0
at p < 0.05 level according to a one-tailed t-test,
with the exception of the regression line for GPT-2
on Natural Stories (p = 0.07). This trend is highly
significant overall by a binomial test (five results
with p < 0.05 out of six trials), and directly con-
tradicts the findings of recent studies that report a
Figures are taken from the papers introduced above.