
最先端NLP論文紹介 (Cutting-Edge NLP Paper Introduction): Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?


Tatsuki Kuribayashi

August 27, 2023


Transcript

  1. Why Does Surprisal From Larger Transformer-Based
    Language Models Provide a Poorer Fit to Human
    Reading Times?
    Byung-Doh Oh, William Schuler (TACL2023)
    Presenter: Tatsuki Kuribayashi (MBZUAI)
    2023/8/26 最先端NLP勉強会 (Cutting-Edge NLP Study Group)


  2. We want to understand human language acquisition and processing (a demand from cognitive science)
    Difficulty
    • If humans introspect about humans (themselves), the objectivity of the science is lost
    • Direct tests of a hypothesis are often impossible
    - Opening the head and observing the brain directly does not reveal a written-down grammar
    - Splitting children into two groups and raising them under controlled conditions… (ethically infeasible)
    Compromise
    • A constructive approach
    - Build something that behaves like a human; the way it is built suggests candidate hypotheses
    - Whether such a hypothesis is valid as an explanation of humans must be argued at a separate level


  3. We want to understand human language processing (a demand from cognitive science)
    • The level of computational theory
    - What is being computed? What is the purpose of the computation? (the objective function)
    • The level of data structures and algorithms
    - How is it being computed? (model architecture, internal representations)
    • The level of physical implementation
    - How is it realized and implemented (in the brain)? (hardware-level implementation)
    [Marr, 1982]


  4. What do humans compute while reading a sentence?
    (the level of computational theory)
    • Surprisal theory [Levy08, Smith&Levy13, Shain+23]
    - While reading, humans predict upcoming words; when a prediction fails, processing incurs a load
    - The processing load of each word w_t is proportional to −log p(w_t | w_<t)
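To make this quantity concrete, here is a minimal sketch (not from the slides or the paper) of how per-word surprisal is usually estimated with an off-the-shelf autoregressive language model. The choice of GPT-2, whitespace word segmentation, and the subword-to-word mapping are illustrative assumptions; subword surprisals are summed within each word, which matches common practice in this literature.

```python
# Minimal sketch: per-word surprisal from an autoregressive LM.
# Assumptions (not from the slides): GPT-2 small via Hugging Face transformers,
# whitespace word segmentation, subword surprisals summed within each word.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisals(words):
    """Return surprisal in bits, -log2 p(w_t | w_<t), for each word."""
    enc = tokenizer(" ".join(words), return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        logits = model(enc.input_ids).logits[0]
    # Logits at position t-1 give the distribution over the token at position t.
    logprobs = torch.log_softmax(logits[:-1], dim=-1)
    tok_surprisal = -logprobs[torch.arange(len(ids) - 1), ids[1:]] / math.log(2.0)
    # Fold subword surprisals back onto words (the first word gets no estimate
    # because it has no preceding context here).
    out, w_idx = [0.0] * len(words), 0
    for pos, tok_id in enumerate(ids[1:]):
        if tokenizer.convert_ids_to_tokens(int(tok_id)).startswith("Ġ"):
            w_idx += 1  # "Ġ" marks a token that starts a new whitespace-separated word
        out[w_idx] += float(tok_surprisal[pos])
    return out

print(word_surprisals("The editor quietly rewrote the garbled transcript".split()))
```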
    [Figure residue, apparently Figure 5 of Wilcox+23: surprisal vs. reading-time relationship across Dutch, English, Finnish, German, Greek, Hebrew, Italian, Korean, Russian, Spanish, and Turkish, for mGPT (long/short) and monolingual Transformer (30m/all) models. Caption: non-linear GAMs in green, linear control GAMs in dotted blue; shaded regions are bootstrapped 95% confidence intervals; grey subplots show the distribution of surprisal values; the GAMs recover a linear relationship between surprisal of word (bits) and slowdown due to surprisal (ms). The presenter's axis annotations read "reading time" and "surprisal".]
    • Recently, the high explanatory power of entropy (the expected value of surprisal) has also been reported anew, but viewed broadly this too is grouped under surprisal theory
    • On whether a strictly linear relationship is appropriate, see also last year's study-group slides
    Testing the Predictions of Surprisal Theory in 11 Languages (Wilcox+23)
    On the Effect of Anticipation on Reading Times (Pimentel+23)
    https://speakerdeck.com/kuribayashi4/zui-xian-duan-nlplun-wen-shao-jie-revisiting-the-uniform-information-density-hypothesis-emnlp2021-linguistic-dependencies-and-statistical-dependence-emnlp2021


  5. • The level of data structures and algorithms
    - How is it being computed? (model architecture, internal representations)
    - How accurate does a language model have to be to explain human sentence processing?
    With what kind of model should surprisal be computed?
    (the level of the computation method)
    [Page-extraction residue: text and figure captions apparently excerpted from the two papers cited below, including Goodkind & Bicknell (2018) Figures 1-2 (improvements in log likelihood scale linearly with decreases in perplexity, R² = 0.94, while the current word's coefficient stays stable regardless of perplexity, R² = 0.007) and passages from Meister et al. (2022) on language models as predictors of psychometric data and on sentence wrap-up effects.]
    Predictive power of word surprisal for reading times
    is a linear function of language model quality (Goodkind+,18)
    Analyzing Wrap-Up Effects through an
    Information-Theoretic Lens (Meister+,22)
    It was believed, on empirical grounds, that as language models get better they also explain human reading behavior better (up to ~2022)
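The "explanatory power" at stake here is usually operationalized as ∆LL: the gain in regression log-likelihood from adding an LM's surprisal to a baseline reading-time model. The following is a hedged sketch of that Goodkind & Bicknell-style analysis with synthetic data; the column names (rt, word_length, log_freq, surprisal_*) and perplexity values are illustrative assumptions, not the authors' code or data.

```python
# Hedged sketch of the ΔLL analysis: the log-likelihood gain from adding one
# LM's surprisal to a baseline reading-time regression is taken as that LM's
# psychometric predictive power. All column names and numbers are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def delta_loglik(df: pd.DataFrame, surprisal_col: str) -> float:
    baseline = smf.ols("rt ~ word_length + log_freq", data=df).fit()
    full = smf.ols(f"rt ~ word_length + log_freq + {surprisal_col}", data=df).fit()
    return full.llf - baseline.llf  # ΔLL for this LM's surprisal

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "rt": rng.normal(250, 50, n),                  # per-word reading time (ms)
    "word_length": rng.integers(1, 12, n),
    "log_freq": rng.normal(-8, 2, n),
    "surprisal_small_lm": rng.gamma(2.0, 3.0, n),  # surprisal from a weaker LM
    "surprisal_large_lm": rng.gamma(2.0, 2.0, n),  # surprisal from a stronger LM
})
perplexity = {"surprisal_small_lm": 45.0, "surprisal_large_lm": 18.0}  # hypothetical
gains = {lm: delta_loglik(df, lm) for lm in perplexity}
# The "scaling law" of cognitive modeling is the claim that lower perplexity
# goes with larger ΔLL; the breakdown on the next slide is this pattern reversing.
print(gains)
```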


  6. The breakdown of the scaling law in cognitive modeling (model PPL ∝ explanatory power for reading times), revealed as language-model performance improved
    [Page-extraction residue: figures and text apparently excerpted from the papers cited below, e.g. no significant correlation between mGPT perplexity and ∆LL per language or language family (Wilcox+23); context-limited (bigram) surprisal from larger GPT-2 models fitting reading times better than full-context surprisal (Kuribayashi+,22); the smallest XGLM variant outperforming larger ones on early eye-movement measures (Varda+,23); and lower perplexity not always being more human-like in Japanese (Kuribayashi+,21).]
    ~300M params: breaks down depending on the language
    Lower perplexity is not always human-like (Kuribayashi+,21)
    ~1.5B params: breaks down in Japanese and English
    Context Limitations Make Neural Language Models More Human-Like (Kuribayashi+,22)
    ~4.5B params: breaks down in 13 languages
    Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times (Varda+,23)
    ~13B? params: breaks down in 11 languages
    Testing the Predictions of Surprisal Theory in 11 Languages (Wilcox+23)
    Note, however, that these studies do not compare multiple models within a single language
    Humans do not seem to predict the next word all that accurately
    (their way of computing surprisal is, in a sense, impoverished)
    [Page-extraction residue: excerpt apparently from Shain+23, reporting that GPT-2 (small) surprisal substantially outperforms n-gram, PCFG, GPT-J, and GPT-3 surprisal, suggesting a "Goldilocks" region of psychometric performance and an "inverse scaling" pattern: bigger is not always better for psycholinguistic modeling.]
    ~175B params: breaks down in English
    Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time (Shain+,23)
    [Page-extraction residue: partly duplicated excerpt from the presented paper (Oh & Schuler, 2023): subword surprisals are summed per word as S(wt) = −log P(wt | w1..t−1); baseline and full LME regressions are fit with lme4; the smallest GPT-2, GPT-Neo, and OPT variants contribute most to regression fit on Natural Stories and Dundee; larger variants within each family yield strictly poorer fits; and perplexity and ∆LL show a strong log-linear relationship.]
    Figures are from the presented paper


  7. This work: why do the probabilities computed by language models drift away from human reading behavior?
    • Basic approach: find the subsets of the corpus where scaling breaks down and inspect their linguistic properties (a rough sketch of this recipe follows after the list)
    - Linear model: reading time ~ surprisal + baseline features (word frequency, word length, ...)
    - Observe the MSE separately for each sub-corpus sharing a particular linguistic property (e.g., a particular POS)
    - Vary the language model used to compute surprisal, train the same kind of linear model, and check in which sub-corpora the scaling law (a positive correlation between model PPL and MSE) breaks down
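A minimal sketch of this recipe under assumed data: a per-token table with reading time (rt), baseline features (word_length, log_freq), a linguistic label per token (category, e.g. POS or named-entity status), and one surprisal column per language model. This illustrates the described analysis, not the authors' implementation.

```python
# Minimal sketch of the slide's analysis recipe (not the authors' code).
# For each LM, fit reading time ~ surprisal + baseline features, then check,
# per linguistic category, whether lower LM perplexity still means lower MSE.
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def per_category_mse(df: pd.DataFrame, surprisal_col: str) -> pd.Series:
    """Fit a linear RT model for one LM's surprisal; return MSE per category."""
    X = df[[surprisal_col, "word_length", "log_freq"]]
    reg = LinearRegression().fit(X, df["rt"])
    sq_err = (df["rt"] - reg.predict(X)) ** 2
    return sq_err.groupby(df["category"]).mean()

def scaling_breakdown(df: pd.DataFrame, lm_perplexity: dict) -> pd.DataFrame:
    """Correlate LM perplexity with per-category MSE across LMs.
    A clearly positive correlation means scaling holds there (better LM, better fit);
    a near-zero or negative one flags sub-corpora where the scaling law breaks down."""
    mse = pd.DataFrame({lm: per_category_mse(df, lm) for lm in lm_perplexity})
    ppl = pd.Series(lm_perplexity)
    rows = []
    for category, row in mse.iterrows():
        rho, p = spearmanr(ppl[row.index], row.values)
        rows.append({"category": category, "spearman_rho": rho, "p_value": p})
    return pd.DataFrame(rows)
```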


  8. Main analysis results
    • Scaling breaks down at specific data points consistently across models and corpora
    - Named Entity
    - Predicative ADJ
    - Nouns before REL (e.g., that)
    • The models basically underestimate reading load
    - Their "surprise" is too small compared with humans
    - (the figure showing this is omitted here)
    - The observation that language models are not surprised enough has also been made in syntactic analyses [Wilcox+,21]
    Figures are from the presented paper
    [Figure residue: the presenter's overlay labels marking individual data points as NE (named entity) and PrAdj (predicative adjective).]
    (I am currently thinking about working toward a somewhat more comprehensive explanation…)


  9. Impressions
    • Reads somewhat like a position paper
    - It states explicitly that scaling breaks down in this setting
    • The findings are highly observational
    - No generalization is offered beyond the point that language-model surprisal is too low on open-class words (nouns, verbs, adjectives, etc.)
    - Still, it is a useful lesson that, in this area, trying the largest model first is not the right move
    • Cognitive and linguistic interpretation, and evidence backing it, remain future work
    - Hopefully this leads to explanations of certain constraints (e.g., memory access, lexical access) that cannot be accounted for by the efficiency of human language processing (good-enough processing) or by prediction alone
    • As a problem that scaling does not solve, and as one direction for a "science of next-word prediction", I hope cognitive modeling attracts even more attention
    - The area seems quite active in TACL recently
