[Figure caption fragment:] Coefficients are shown for each regressor individually; error bars are 95% CIs across folds of the data.

Figure 4: Test Perplexity vs. llh (mGPT): We do not find a significant correlation between the llh and mGPT’s perplexity for a language or language family.

et al., 2020). However, studies on Japanese have failed to replicate these results, suggesting that the relationship does not hold for all languages (Kuribayashi et al., 2021). Further, Oh and Schuler (2023) and Shain et al. (2022) show that this relationship may not hold even in English for the most recent language models.

Figure 2: Increase in PPP (from the full-context to 2-gram settings) in each model type (ordered by their parameter size). The bar colors correspond to those in Figure 1.

The small difference for the largest model (GPT2-xl) does not imply that the difference is valueless; the score is simply divided by the number of data points (e.g., 212,649 in the Dundee corpus) to facilitate inter-corpora comparison. As a statistical test, we compared the by-token squared residual errors from 2-gram models with those from full-context models using paired permutation tests (α = 0.05). The short-context, 2-gram models had significantly smaller fitting errors than the full-context models (p < 0.001) when using relatively large LMs (GPT2-md-Wiki, GPT2-sm, GPT2-md, GPT2-lg, and GPT2-xl); the smaller LMs (LSTM-xs-Wiki and GPT2-xs-Wiki) showed no significant differences (p ≈ 0.4). Notably, we also observed that larger GPT-2s have less human-like behavior in the full setting (right-most column in Table 4). This trend was weakened by introducing our context limitation.

Cross-linguistic consistency. Figure 1 and Table 4 show that the gain in PPP from the context limitation (full-context vs.
bigram) was larger in the largest LMs (GPT2-md in Japanese and GPT2-xl in English) than in the smallest LMs (LSTM-xs). Specifically, we compared the by-token decrease in squared residual errors; the large model exhibited a larger error decrease than the small model (p = 0.024 < 0.05 in Japanese, and p < 0.001 in English). In addition, the rank correlation between model size and PPP gain by context limitation was 0.50 in Japanese and 0.96 in English.

General effectiveness of surprisal. Note that, in all the LMs, the PPP scores (equivalent to logLik) were significantly higher than 0 under the chi-square test (p < 10⁻³¹ even in the worst case); surprisal was an effective factor, as existing studies have reported. On top of this, we newly showed that its effect size differs with the level of context limitation.

5.2 Does the potential training-inference mismatch bias our results?

Vanilla LMs slightly underestimate the short-context advantage. We additionally trained Wiki-LMs (LSTM-xs-Wiki, GPT2-xs-Wiki, and GPT2-sm-Wiki) without the data modification handling the training-inference gap (Section 4.1) (henceforth, vanilla LMs). Figure 3 shows the results of the models with and without the training modification. The vanilla LMs slightly underestimated the short-context advantage; the PPP of 2-gram surprisal improved when we adopted the modified training. That is, mitigating the train-inference gap made the trend that context limitation increases PPP clearer. Carefully training n-gram neural

be properly compared only in the context of a fixed reference vocabulary (Wilcox et al., 2020). Technically, XGLM models produce a conditional probability distribution over the same whole vocabulary, regardless of the language of the specific text they are processing. However, the models have received strong evidence during pre-training that some subportions of the vocabulary (e.g. Cyrillic tokens) should be essentially ignored while processing text in some languages (e.g.
English), thus reducing their actual reference vocabulary. Hence, while we report the perplexity-based results in Appendix B, we focused on the link between the linguistic and psychological accuracy of the models by observing how the LogLik was affected by the parameter size of the model. The choice of employing parameter size as a proxy of linguistic accuracy is supported by the results in the original XGLM paper, where the authors reported better results in almost all downstream tasks with the bigger versions of the XGLM model family (Lin et al., 2021). The code employed in this study is publicly available2.

5 Results

The first main finding of our study is that surprisal is a solid predictor of reading times across the languages considered, confirming the previous observation that context-dependent probabilistic processing generalizes beyond the Germanic language sample typically considered in the literature (de Varda and Marelli, 2022). The XGLM-based

Appendix A). The increase in goodness of fit that could be attributed to surprisal is displayed in Figure 1, grouped by model type and fixation measure. Concerning FF (1a), we observed a general decrease in LogLik when increasing the number of parameters, with the smallest XGLM564M variant outperforming the bigger models in terms of psychological accuracy. A similar trend can be observed in GD (1b), although the difference in psychological accuracy between XGLM564M and XGLM1.7B appears to be rather small3. The results are different when considering TT as the dependent variable (1c), as in this case the model that provided the highest average increase in goodness of fit was XGLM1.7B4.

6 Discussion

In this experiment, we showed that large multilingual Transformer-based models were outperformed by their smaller variants in predicting early eye movement measurements of processing difficulty.
These measurements are thought to reflect predictive processes, lexical access, and early semantic integration. This result corroborates the previous claims that cognitive modelling might constitute an exception to empirical scaling laws in NLP (Oh and Schuler, 2022). However, predictability estimates computed by relatively larger variants of the same architecture – but not the largest – provided surprisal estimates that better captured late

Abstract

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universals within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization —the lower perplexity a language model has, the more human-like the language model is— in Japanese, which has typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.

1 Introduction

It is well known that the probability of a word in context (i.e., surprisal) impacts its processing difficulty in human language processing. For example, recent studies reported that LMs with better performance for next-word prediction could also better predict human reading behavior (i.e., are more human-like) (Fossum and Levy, 2012; Goodkind and Bicknell, 2018; Wilcox et al., 2020).
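As a concrete illustration, surprisal is simply the negative log probability a language model assigns to a word given its context. A minimal sketch with a hypothetical bigram model (all probabilities below are invented for illustration):

```python
import math

# Hypothetical bigram conditional probabilities P(word | previous word).
# These numbers are invented purely for illustration.
bigram_prob = {
    ("the", "cat"): 0.10,
    ("the", "quasar"): 0.0001,
}

def surprisal(prev: str, word: str) -> float:
    """Surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(bigram_prob[(prev, word)])

# A predictable continuation carries low surprisal; a surprising one, high.
print(surprisal("the", "cat"))     # ≈ 3.32 bits
print(surprisal("the", "quasar"))  # ≈ 13.29 bits
```

Studies of the kind reviewed here then test how well such per-word surprisal values, estimated by real LMs, predict human reading times.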
In this paper, we re-examine whether the recent findings on human-like computational models can be generalized across languages. Despite the community’s ongoing search for a language-independent model (Bender, 2011), existing studies have focused almost exclusively on the English language. That said, broad-coverage cross-linguistic evaluation of the existing reports is prohibitively difficult; data on human reading behavior (e.g., eye movement) is available only in a limited number of languages. As an initial foray, this study focuses on the Japanese language as a representative of languages that have typologically different characteristics from the English language. If the observations differ between English and Japanese, the current findings on English data might lack universality across languages. We specifically revisit the recent report—the lower perplexity a LM has, the more human-like the LM is—in the English and Japanese languages (Fossum and Levy, 2012; Goodkind and Bicknell, 2018; Wilcox et al., 2020). In addition to the importance

[Slide annotation (translated from garbled Japanese): ~300M params, refuted in a language-dependent way: "Lower perplexity is not always human-like" (Kuribayashi+, 21); ~1.5B params, refuted in English: "Context limitations Make Neural Language Models More Human-Like" (Kuribayashi+, 22); ~4.5B, refuted in 13 languages: "Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times" (Varda+, 23); ~13B?, refuted in 11 languages: "Testing the Predictions of Surprisal Theory in 11 Languages" (Wilcox+, 23). However, these works do not compare multiple models within a single language; humans do not seem to predict the next word this accurately (the surprisal computation method is in a sense impoverished).]

was more equivocal. The relationship between online and offline measures of comprehension difficulty is currently poorly understood, and we leave this discrepancy to future investigation. With respect to Hoover et al. (2022), their claims of superlogarithmicity are based on visual estimates (and descriptive statistics derived from those estimates) from models fitted only to the Natural Stories SPR dataset.
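For concreteness, the competing linking functions can be written down directly: the logarithmic hypothesis takes reading time to be linear in surprisal s = −log p, while a superlogarithmic model such as SURP4/3 uses s^(4/3) as the predictor. A small sketch (the probabilities are invented for illustration):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in nats: -ln p."""
    return -math.log(p)

# Over the long right tail (rare, high-surprisal words) the superlogarithmic
# predictor s**(4/3) pulls away from the logarithmic predictor s, which is
# exactly where the two hypotheses make diverging predictions.
for p in (0.5, 0.01, 1e-6):
    s = surprisal(p)
    print(f"p={p:g}  s={s:.2f}  s^(4/3)={s ** (4 / 3):.2f}")
```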
Our results in fact partially replicate theirs, since estimates tend to be visually superlogarithmic in Natural Stories SPR (especially over the long right tail of surprisal values, see Supplementary Figure A5), and a slightly superlogarithmic model (SURP4/3) outperforms a logarithmic one on that dataset, aggregating over all language models. However, this outcome appears to be largely restricted to Natural Stories SPR and does not generalize to a broader sample of reading data. In the absence of reasons to think that Natural Stories SPR is an especially reliable source of evidence on this question (see Supplementary Information G for counterarguments), our results suggest that the Hoover et al. (2022) pattern may not be characteristic of reading in general.

3.2 Implications for Statistical Modeling of Human Subjective Word Probabilities

Our results additionally differentiate computational models of human next-word prediction. Surprisal estimates from GPT-2(-small) (Radford et al., 2019) substantially outperform surprisal estimates from n-gram, PCFG, GPT-J, and GPT-3 models. GPT-2 therefore appears to reside in a “Goldilocks” region of psychometric performance between language models that are too constrained on the one hand (n-gram and PCFG models) and too powerful on the other (GPT-J and GPT-3). This outcome challenges the notion that previously reported correlations between the linguistic and psychometric performance of language models (e.g., Goodkind and Bicknell, 2018; Hao et al., 2020; Wilcox et al., 2020) will extrapolate to models of ever-increasing size, complexity, and quantity of training data (see also Oh, Clark, and Schuler, 2022). Instead, the task of using language model predictions to estimate human reading times may be akin to tasks in natural language processing that show an “inverse scaling” property, whereby task performance is inversely related to model size (McKenzie et al., 2022b,a, 2023).
This result has both methodological and scientific implications. From a methodological standpoint, bigger is not always better; the selection of a language model for psycholinguistic research may need to consider additional dimensions (beyond perplexity). From a scientific standpoint, homing in on classes of models that best mimic human processing patterns offers the opportunity for new insights into the learning and processing mechanisms that underlie human language abilities (Schrimpf et al., 2020; Heilbron et al., 2022), a direction that we leave to future work.

[Slide annotation (translated from garbled Japanese): ~175B params, refuted in English: "Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time" (Shain+, 23)]

When each story or article did not fit into a single context window for the LMs, the second half of the previous context window served as the first half of a new context window to calculate surprisal estimates for the remaining tokens. In practice, most stories and articles fit completely within two context windows for the GPT-2 models that have a context size of 1,024 tokens, and within one context window for the GPT-Neo and OPT models that have a context size of 2,048 tokens. Additionally, when a single word wt was tokenized into multiple subword tokens, negative log probabilities of subword tokens corresponding to wt were added together to calculate S(wt) = − log P(wt | w1..t−1).

3.3 Regression Modeling

Subsequently, following the methods of Oh et al. (2022), a ‘baseline’ LME model that contains baseline predictors capturing low-level cognitive processing and seventeen ‘full’ LME models that contain the baseline predictors and each LM surprisal predictor were fit to the exploratory set of self-paced reading times and go-past durations using lme4 (Bates et al., 2015).
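The two mechanics described in the preceding subsection (half-overlapping context windows, and summing subword log probabilities into a word surprisal) can be sketched as follows; `make_windows` and the toy numbers are illustrative stand-ins, not the authors' code:

```python
def make_windows(tokens, window_size):
    """Split a token sequence into context windows in which the second half of
    the previous window is reused as the first half of the next, so tokens
    scored in later windows always see at least half a window of context."""
    windows = [tokens[:window_size]]
    start = window_size
    half = window_size // 2
    while start < len(tokens):
        windows.append(tokens[start - half:start + half])
        start += half
    return windows

def word_surprisal(subword_logprobs):
    """S(w_t) = -log P(w_t | context): sum the negative log probabilities
    of the subword tokens that make up the word."""
    return -sum(subword_logprobs)

tokens = list(range(10))
print(make_windows(tokens, 4))  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
# A word split into two subwords with log probabilities -1.0 and -0.5:
print(word_surprisal([-1.0, -0.5]))  # 1.5
```

In each window after the first, only the second half of the tokens receive new surprisal estimates; the first half serves purely as context.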
The baseline predictors include word length measured in characters and index of word position within each sentence (both self-paced reading and eye-tracking), as well as saccade length and whether or not the previous word was fixated (eye-tracking only). All predictors were centered and scaled prior to model fitting, and the LME models included by-subject random slopes for all fixed effects as well as random intercepts for each subject and each word type. Additionally, for self-paced reading times collected from 181 subjects, a random intercept for each subject-sentence interaction was included. For eye-gaze durations collected from a much smaller number of 10 subjects, a random intercept for each sentence was included. After the regression models were fit, the ∆LL values were first calculated for each regression model by subtracting the log-likelihood of the baseline model from that of a full regression model. Moreover, to examine the trend between LM perplexity and the predictive power of surprisal estimates, the perplexity of each LM variant was calculated on the two corpora.

3.4 Results

Figure 1: Perplexity measures from each LM variant, and improvements in regression model log-likelihood from including each surprisal estimate on the exploratory set of Natural Stories (top) and Dundee data (bottom). Dotted lines indicate the least-squares regression line for each LM family.

Surprisal estimates from the smallest LM variants in each family (e.g., GPT-Neo 125M and OPT 125M) made the biggest contribution to regression model fit on both self-paced reading times and eye-gaze durations for the three LM families.
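The ∆LL computation described above can be sketched with ordinary least squares standing in for the paper's lme4 mixed-effects models; the reading-time data below are synthetic, and `delta_ll` is an illustrative stand-in for the full regression comparison:

```python
import math
import random

def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood given a model's residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # MLE of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def delta_ll(rt, surprisal):
    """Log-likelihood improvement from adding a surprisal predictor to an
    intercept-only baseline (closed-form simple linear regression)."""
    n = len(rt)
    mean_rt = sum(rt) / n
    mean_s = sum(surprisal) / n
    cov = sum((s - mean_s) * (r - mean_rt) for s, r in zip(surprisal, rt))
    var = sum((s - mean_s) ** 2 for s in surprisal)
    slope = cov / var
    intercept = mean_rt - slope * mean_s
    base = gaussian_loglik([r - mean_rt for r in rt])
    full = gaussian_loglik([r - (intercept + slope * s) for s, r in zip(surprisal, rt)])
    return full - base

# Synthetic reading times that partly track surprisal.
random.seed(0)
surp = [random.uniform(0, 15) for _ in range(500)]
rt = [200 + 10 * s + random.gauss(0, 50) for s in surp]
print(delta_ll(rt, surp))  # positive: adding surprisal improves the fit
```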
More notably, surprisal estimates from larger LM variants within each family yielded strictly poorer fits to reading times, robustly replicating the trend observed by Oh et al. (2022). Interestingly, the three LM families also seem to demonstrate a strong log-linear relationship between perplexity and ∆LL, as can be seen by the least-squares regression lines. All regression lines had a slope significantly greater than 0 at the p < 0.05 level according to a one-tailed t-test, with the exception of the regression line for GPT-2 on Natural Stories (p = 0.07). This trend is highly significant overall by a binomial test (five results with p < 0.05 out of six trials), and directly contradicts the findings of recent studies that report a positive relationship between language model quality and psychometric predictive power.
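The binomial test cited above is easy to verify with the standard library: under the null hypothesis that each of the six regression lines independently reaches p < 0.05 with probability 0.05, five or more significant results are vanishingly unlikely:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Five or more of six tests significant at alpha = 0.05, under the null:
print(binom_tail(5, 6, 0.05))  # ≈ 1.8e-06
```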