Cognitive (Im)plausibility of Large Language Models

Tatsuki Kuribayashi
March 19, 2024

Talk at the CBS seminar, Hong Kong Polytechnic University (2024/3/19)

Transcript

  1. Hello!
     • Tatsuki Kuribayashi
     • Postdoc in the NLP department, MBZUAI @ Abu Dhabi
       • Rapidly growing international NLP team! (ranked 18th in the world in NLP; 14 faculty members)
     • Ph.D. at Tohoku University, Japan (check the Tohoku NLP Group)
     • Organizer of CMCL 2024 @ ACL 2024 (workshop on cognitive modeling and computational linguistics)
       • Emmanuele was the past organizer
     • Visited Hong Kong for EMNLP 2019!
  2. My research (https://kuribayashi4.github.io/)
     • Cognitive modeling w/ NLP techniques
       • Lower perplexity is not always human-like [Kuribayashi+,ACL2021]
       • Context limitations make neural language models more human-like [Kuribayashi+,EMNLP2022]
       • Psychometric predictive power of large language models [Kuribayashi+,Findings NAACL2024]
       • Emergent word order universals from cognitively-motivated language models [Kuribayashi+,arXiv]
     • Writing assistance
       • Human-machine collaborative writing tool [Ito+,EMNLP2019 demo][Ito+,UIST2023]
       • Parsing argumentative texts [Kuribayashi+,ACL2019]
     • Model interpretability
       • Mechanistic understanding of Transformers [Kobayashi+,EMNLP2020][Kobayashi+,EMNLP2021][Kobayashi+,ICLR2024]
     • Japanese-focused research
       • Word-order preferences [Kuribayashi+,ACL2020]
       • Topicalization preferences [Fujihara+,COLING2022]
       • Ellipsis preferences [Ishiduki+,COLING2024]
     (Slide figure: this work sits at the intersection of linguistics and NLP, asking "What is reading cost?" and "What do humans compute during reading?" across NLP models and humans.)
  3. Motivation from the artificial intelligence field
     • Going back to "artificial intelligence" in the dictionary [Shapiro, 2008]
     1. Machine intelligence: "…push outwards the frontier of what we know how to program on computers, especially in the direction of tasks that, although we don’t know how to program them, people can perform…" → has progressed (e.g., better machine translation, chatbots)
  4. Motivation from the artificial intelligence field
     • Going back to "artificial intelligence" in the dictionary [Shapiro, 2008]
     1. Machine intelligence: "…push outwards the frontier of what we know how to program on computers, especially in the direction of tasks that, although we don’t know how to program them, people can perform…" → has progressed (e.g., better machine translation, chatbots)
     2. Computational philosophy: "…form a computational understanding of human-level intelligent behavior, without being restricted to the algorithms and data structures that the human mind actually does (or conceivably might) use…" → has progressed (e.g., scaling Transformer LMs)
  5. Motivation from the artificial intelligence field
     • Going back to "artificial intelligence" in the dictionary [Shapiro, 2008]
     1. Machine intelligence: "…push outwards the frontier of what we know how to program on computers, especially in the direction of tasks that, although we don’t know how to program them, people can perform…" → has progressed (e.g., better machine translation, chatbots)
     2. Computational philosophy: "…form a computational understanding of human-level intelligent behavior, without being restricted to the algorithms and data structures that the human mind actually does (or conceivably might) use…" → has progressed (e.g., scaling Transformer LMs)
     3. Computational psychology: "…understand human intelligent behavior by creating computer programs that behave in the same way that people do. For this goal it is important that the algorithm expressed by the program be the same algorithm that people actually use, and the data structure…" → an often unstated, but pivotal goal (with merits orthogonal to other approaches such as introspection, e.g., ensuring objectivity and quantifiability…)
  6. Human sentence processing
     • Sentence processing (i.e., sentence comprehension, online reading)
       • What do humans compute during reading, and how (ultimately, why)?
       • What model/metric can explain word-by-word cognitive load?
     • Example: humans incur some cognitive load at each token while reading "If you were to journey to the North of England, …"
     • Formally: y = f(x), where x = tokens and y = reading behavior (typically measured token-by-token)
  7. Reading behavior data (x, y)
     • Self-paced reading time [Smith&Levy,13][Futrell+,18]…
     • Eye-tracking data [Kennedy+,03][Luke&Christianson,18][Hollenstein+,20]…
     • Longer reading time indicates heavier cognitive load (referred to simply as "reading time" in this talk)
     • Out of scope today:
       • Electrocorticography (ECoG) [Fedorenko+,16]…
       • Magnetoencephalography (MEG) [Brennan&Pylkkanen,17]
       • Electroencephalography (EEG) [Thornhill+,12][Frank+,13][Frank+,15][Hale+,18][Hollenstein+,20][Michaelov+,23]…
       • Functional magnetic resonance imaging (fMRI) [Wehbe+,14][Blank+,14][Brennan+,16][Pereira+,18][Shain+,20][Schrimpf+,21]…
  8. Computational approach
     • Testing hypotheses via computational simulation (implementation & evaluation)
     • i.e., exploring a computational model that simulates humans well
     (Slide figure: Model 1 computes measurement A and Model 2 computes measurement B over "If you were to journey to the North of England, …"; if Model 2's predictions are more similar to human reading times, this suggests humans compute measurement B (the "what") in the way Model 2 computes it (the "how").)
  9. What do humans compute? Surprisal theory
     • The processing cost of a word is proportional to its surprisal, −log p(word|context) [Levy,08][Smith&Levy,13][Shain+,22] (a computation sketch follows this slide)
     • When a word is unpredictable from context, humans exhibit greater cognitive load
       • "Although my friends left the party I enjoyed …" vs. "Although my friends left the party continues to…" ❗ (NP/Z ambiguity)
       • "My hobby is reading a book, and…" vs. "My hobby is reading a music sheet, and…" ❗
     • What = surprisal; how = ? (tentatively, a next-word prediction model, e.g., Model A vs. Model B given "If you were to …")
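To make the core quantity concrete: a minimal sketch (my addition, not from the talk) of per-token surprisal computation with an off-the-shelf causal LM, assuming the Hugging Face transformers package and GPT-2 weights.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisals(text):
    """Return (token, surprisal in bits) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    pairs = []
    for t in range(1, ids.size(1)):
        # Surprisal of token t comes from the distribution predicted at t-1.
        nll = -log_probs[0, t - 1, ids[0, t]].item()
        pairs.append((tokenizer.decode(ids[0, t].item()), nll / math.log(2)))
    return pairs

print(surprisals("If you were to journey to the North of England"))
```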
  10. What models can compute human-like surprisals?
      • Probabilistic Earley parser [Hale,01]
      • Linear vs. hierarchical models [Frank&Bod,11]…: simple RNN vs. PCFG estimation
      • Lexicalized vs. unlexicalized surprisal [Fossum&Levy,12]…: PoS-based vs. token-based estimation
      • RNN? LSTM? Transformer? [Aurnhammer&Frank,19][Wilcox+,20][Merkx&Frank,21]
      • Are simply accurate language models (LMs) more human-like…? [Frank&Bod,11][Fossum&Levy,12][Goodkind&Bicknell,18]
      (Slide figures: an LM is a model trained to predict the next word given a context, e.g., "Paris is the capital of ___" → France/Algeria/Morocco/Haiti; plots from [Goodkind&Bicknell,18] and [Merkx&Frank,21] show psychometric fit improving with next-word prediction accuracy.)
  11. Accurate next-word prediction ∝ cognitive plausibility?
      • Large (>GPT-2 small) LMs poorly explain human reading behavior
        • Language-dependent results (scaling does not appear in Japanese) [Kuribayashi+,21]
        • Even in English, far larger LMs (e.g., OPT, GPT-Neo, GPT-3) are less human-like [Oh&Schuler,23][Shain+,23]
        • Probable cause: the superhuman prediction ability of LMs (i.e., human expectation is noisier)
      • 💡 Scaling does hold cross-lingually, at least when using a small (6-layer) Transformer and varying the training data size [Wilcox+,23]
      (Slide figures from [Kuribayashi+,21] and [Oh&Schuler,23]: how accurate the LM's prediction is (PPL) plotted against how well surprisal explains human reading behavior; better PPL, less human-like.)
  12. Exploring "human-likeness" by filling the LM-human gaps
      • Actively explored:
        • Superhuman prediction on specific words (named entities [Oh&Schuler,23], low-frequency words [Oh+,24])
        • Superhuman context access of LMs? [Futrell+,20][Kuribayashi+,22]
        • Cognitive plausibility of LM tokenization? [Nair&Resnik,23]
        • Contamination of reading time corpora? [Wilcox+,23]
        • Need for a re-analysis (slow, syntactic) system? [van Schijndel&Linzen,21][Wilcox+,21][Huang+,24]
      • The humans-LMs diff = distinctiveness of human language processing; how can one bring LMs closer to humans? (an AI-alignment problem)
      • Orthogonal/related theories (may give a hint?):
        • Dependency locality theory (DLT) [Gibson,98]: long dependencies incur more cost
        • Lossy-context surprisal [Futrell+,20][Kuribayashi+,22]
        • Anti-locality theory [Konieczny,00]
        • Cue-based retrieval theory [Lewis+,06]: direct retrieval of context information during reading; connection to the Transformer architecture [Merkx&Frank,21]
  13. Psychometric predictive power of large language models [Kuribayashi+,24]
      • Further exploring human-LLM gaps in sentence processing (with some curiosities)
        • Do instruction-tuned LLMs offer human-like surprisals?
        • Do some prompts alleviate this gap?
        • Is a particular LLM family more human-like?
      • No evidence (within our experiments) that these advancements, or specific modern LLMs, provide better measurements for cognitive modeling than bare probabilities from base LMs
        • Human expectation-based reading seems to be simply tuned to corpus statistics
      • A kind of position paper toward AI-human alignment
        • Aims to stimulate the (typically engineering-oriented, young) LLM community to take an interest in cognitive modeling
  14. Experiment 1: instruction tuning
      • Does instruction tuning of LMs improve the fit of their surprisal to human reading behavior? Which is likely?
      • "Yes", probably, because:
        • Human readers predict upcoming text of the kind generally preferred by humans (e.g., less hallucination)
      • "No", probably, because:
        • Human reading is just tuned to corpus statistics (surprisal, frequency); additional tuning may collapse the LMs' next-word distribution
        • The instruction-tuning objective is to create a superhuman chatbot, which is not aligned with the goal of cognitive modeling
      (Slide figure: surprisal from base LMs vs. instruction-tuned LMs; https://openai.com/research/instruction-following)
  15. Experimental setting
      • Explain reading time with surprisal and baseline factors:
        Reading_time(word) ~ surprisal(word) + baseline_factors(word)
        where the baseline factors are word frequency and length of the t, t−1, and t−2 tokens [Wilcox+,23]
      • Metric: increase in log-likelihood (model fit; psychometric predictive power; PPP) between the regression models with and without the surprisal factor (a computation sketch follows this slide)
      • 2 corpora:
        • Dundee corpus: eye-tracking (first-pass gaze duration) [Kennedy+,03]
        • Natural Stories corpus: self-paced reading [Futrell+,18]
      • 3 measurements: surprisal, Shannon entropy, and Rényi entropy (α = 0.5) [Wilcox+,23][Liu+,23]
      • 26 models: GPT-2 (177M-1.5B), OPT (125M-66B), GPT-3 (babbage-002, davinci-002), GPT-3.5 (text-davinci-002/003), Llama-2 (7B-70B), Llama-2-instruct (7B-70B), Falcon (7B, 40B), Falcon-instruct (7B, 40B)
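A rough sketch of the PPP computation described above (my addition, not the paper's code): fit the regression with and without surprisal and take the per-word log-likelihood gain. The CSV file, column names, and the plain-OLS choice are assumptions; the paper's exact regression specification may differ.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-word data: rt, surprisal, plus frequency/length of the
# current and two preceding tokens, as in the baseline of [Wilcox+,23].
df = pd.read_csv("reading_times.csv")

baseline = "freq0 + len0 + freq1 + len1 + freq2 + len2"
m_base = smf.ols("rt ~ " + baseline, data=df).fit()
m_full = smf.ols("rt ~ surprisal + " + baseline, data=df).fit()

# PPP: per-word increase in log-likelihood from adding the surprisal factor.
ppp = (m_full.llf - m_base.llf) / len(df)
print(f"PPP = {ppp:.4f}")
```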
  16. Experiment 1: instruction tuning results
      • Instruction tuning frequently hurt the PPP (not always, though)
      • No clear trend of a specific LLM family having a high PPP
      → No evidence for positive effects of instruction tuning in cognitive modeling
      • Note: entropy metrics are omitted for GPT-3/3.5 because their APIs do not provide the probability distribution over the entire vocabulary
      • Table 1: PPL and PPP scores of the tested LMs. "IT" = instruction-tuned (the cell coloring on the slide marked whether PPP increased or decreased vs. the non-instruction-tuned version; GPT-3.5 models are compared to GPT-3s); h = surprisal, H = Shannon entropy, H0.5 = Rényi entropy (α = 0.5); higher PPP and lower PPL are better; DC = Dundee Corpus, NS = Natural Stories Corpus; B2 = babbage-002, D2 = (text-)davinci-002, D3 = text-davinci-003.

        Model       | IT | DC h  | DC H  | DC H0.5 | DC PPL | NS h  | NS H  | NS H0.5 | NS PPL
        GPT-2 177M  |    | 15.23 | 12.32 | 15.55   | 209.37 | 15.61 | 10.20 | 18.19   |  93.81
        GPT-2 355M  |    |  9.63 | 11.20 | 15.37   | 222.17 | 13.62 |  8.91 | 16.96   |  75.67
        GPT-2 774M  |    | 10.98 |  9.66 | 14.79   | 165.81 | 12.04 |  7.01 | 14.52   |  66.87
        GPT-2 1.5B  |    | 10.18 |   -   | 14.15   | 158.75 | 10.94 |  6.99 | 14.69   |  65.14
        GPT-3 B2    |    | 12.47 |   -   |   -     | 108.77 | 10.58 |   -   |   -     |  57.91
        GPT-3 D2    |    |  9.93 |   -   |   -     |  79.65 |  6.45 |   -   |   -     |  44.79
        GPT-3.5 D2  | X  |  9.35 |   -   |   -     |  72.95 |  5.30 |   -   |   -     |  38.23
        GPT-3.5 D3  | X  |  8.91 |   -   |   -     |  84.17 |  5.83 |   -   |   -     |  44.38
        LLaMA-2 7B  |    | 10.33 |  8.58 | 13.45   |  76.40 |  6.41 |  3.06 |  9.97   |  45.21
        LLaMA-2 7B  | X  |  8.97 |  5.57 | 12.03   | 153.46 |  7.07 |  2.42 |  8.33   |  63.74
        LLaMA-2 13B |    |  9.44 |  8.04 | 13.77   |  75.28 |  5.44 |  2.44 |  9.23   |  41.62
        LLaMA-2 13B | X  |  9.13 |  5.30 | 11.97   | 123.35 |  5.93 |  1.99 |  7.53   |  56.05
        LLaMA-2 70B |    |  8.21 |  5.14 | 10.47   |  78.28 |  4.51 |  1.80 |  6.79   |  37.61
        LLaMA-2 70B | X  |  8.67 |  4.53 | 10.67   | 112.07 |  5.60 |  1.75 |  7.34   |  52.05
        Falcon 7B   |    |  9.08 |  7.75 | 11.81   |  97.86 |  7.61 |  3.95 | 12.17   |  49.64
        Falcon 7B   | X  | 11.18 |  8.57 | 12.31   | 131.53 |  8.54 |  4.38 | 12.63   |  62.99
        Falcon 40B  |    |  8.53 |  6.93 | 10.99   |  77.72 |  5.35 |  2.41 |  9.36   |  41.46
        Falcon 40B  | X  |  9.06 |  6.76 | 10.43   |  92.53 |  5.49 |  2.89 |  8.49   |  47.27
        OPT 125M    |    | 15.65 | 13.72 | 17.18   | 231.80 | 15.54 | 12.27 | 19.41   | 109.11
        OPT 350M    |    | 14.81 | 11.89 | 16.07   | 196.02 | 14.86 | 10.35 | 18.11   |  94.51
        OPT 1.3B    |    | 10.51 | 10.16 | 15.55   | 160.95 | 11.81 |  7.43 | 16.53   |  67.59
        OPT 2.7B    |    |  9.52 |  9.65 | 14.38   | 150.78 | 11.66 |  6.60 | 15.51   |  63.98
        OPT 6.7B    |    |  9.43 |  9.06 | 13.63   | 130.01 |  9.59 |  5.56 | 13.64   |  57.86
        OPT 13B     |    |  9.06 |  8.57 | 13.15   | 130.44 |  9.51 |  4.96 | 12.84   |  56.74
        OPT 30B     |    |  9.62 |  8.58 | 13.17   | 119.42 |  8.55 |  4.16 | 10.39   |  54.91
        OPT 66B     |    | 10.30 |  7.42 | 12.73   |  94.15 |  7.78 |  4.33 | 11.92   |  49.11
  17. Experiment 1: results
      • Instruction-tuned models cannot balance PPL and PPP
      • Replicated the inverse PPL-PPP scaling (better PPL, less human-like)
      • Instruction-tuned LLMs always fall below the PPL-PPP trade-off line of base models
      (Figure 2: the relationship between PPL and PPP on the Dundee and Natural Stories corpora (exact scores in Table 1); each point is an LM (families: GPT-2, LLaMA-2, Falcon, GPT-3/3.5, OPT; marker size = model size; black-edged points = instruction-tuned). The regression line is estimated from base LLMs, with a shaded 95% confidence interval; IT-LLMs sit below the line, i.e., they are relatively poor at balancing PPL and PPP.)
  18. Experiment 2: prompting
      • How is human sentence processing biased? RT ∝ −log p(word|context, human_bias)
      • What kind of prompt fills the LM-human gap? RT ∝ −log p(word|context, prompt), i.e., prompt-conditioned surprisal [Brown+,20] (sketched below)
      • e.g., instructing an instruction-tuned LM: "Generate a grammatically simple sentence"
      • Insights into how we can bring LMs closer to humans
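A sketch of how prompt-conditioned surprisal could be computed (my addition), reusing the surprisals() helper sketched earlier: prepend the prompt, score the concatenation, and keep only the sentence tokens. Note that BPE may merge tokens across the prompt/sentence boundary, so the offset below is an approximation.

```python
def prompted_surprisals(prompt, sentence):
    """-log p(token | prompt, preceding sentence tokens) for sentence tokens."""
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    scored = surprisals(prompt + sentence)
    # surprisals() scores positions 1..L-1, so the first sentence token
    # (absolute position n_prompt) sits at list index n_prompt - 1.
    return scored[n_prompt - 1:]

# Only these sentence-token surprisals enter the reading-time regression;
# the prompt merely conditions the LM's next-word distribution.
print(prompted_surprisals(
    "Please complete the following sentence with a careful focus on grammar:\n",
    "If you were to journey to the North of England"))
```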
  19. Experiment 2: prompting
      • RT ∝ −log p(word|context, prompt); each prompt precedes the sentence prefix w0, w1, …, w(t−1)
      • Grammar (syntax):
        • "Please complete the following sentence to make it as grammatically simple as possible:" w0, w1, …, w(t−1)
        • "Please complete the following sentence with a careful focus on grammar:" w0, w1, …, w(t−1)
        • "Please complete the following sentence to make it as grammatically complex as possible:" w0, w1, …, w(t−1)
      • Vocabulary:
        • "Please complete the following sentence using the simplest vocabulary possible:" w0, w1, …, w(t−1)
        • "Please complete the following sentence with a careful focus on word choice:" w0, w1, …, w(t−1)
        • "Please complete the following sentence using the most difficult vocabulary possible:" w0, w1, …, w(t−1)
      • Task-oriented:
        • "Please complete the following sentence in a human-like manner. It has been reported that human ability to predict next words is weaker than language models and that humans often make noisy predictions, such as careless grammatical errors." w0, w1, …, w(t−1)
        • "Please complete the following sentence. We are trying to reproduce human reading times with the word prediction probabilities you calculate, so please predict the next word like a human. It has been reported that human ability to predict next words is weaker than language models and that humans often make noisy predictions, such as careless grammatical errors." w0, w1, …, w(t−1)
  20. Preliminary analysis: prompting
      • Does prompting properly bias the generation? At least within our observation, yes
      (Slide figure: statistics of prompted generations, with panels for dependency length, sentence length, and word frequency.)
  21. Experiment 2: prompting
      • Particular prompts improve the fit to reading time: those mentioning grammar and/or simplicity (consistent with theories such as good-enough processing)
      • Table 2: PPP scores per prompt, averaged across the seven IT-LLMs (on the slide, the highest scores other than the baselines for each corpus/metric are boldfaced); h = surprisal, H = Shannon entropy, H0.5 = Rényi entropy (α = 0.5); prompt texts as on the previous slide; DC = Dundee, NS = Natural Stories.

        ID | Prompt (abbreviated)                            | DC h | DC H | DC H0.5 | NS h | NS H | NS H0.5
        1  | grammatically simple (Grammar)                  | 8.23 | 7.46 | 12.26   | 6.55 | 2.62 | 8.26
        2  | careful focus on grammar (Grammar)              | 8.24 | 7.19 | 11.99   | 6.20 | 2.99 | 8.72
        3  | grammatically complex (Grammar)                 | 7.77 | 6.99 | 11.74   | 5.66 | 2.54 | 7.75
        4  | simplest vocabulary (Vocabulary)                | 7.82 | 7.48 | 12.15   | 5.70 | 3.11 | 8.90
        5  | careful focus on word choice (Vocabulary)       | 7.87 | 6.86 | 11.50   | 6.06 | 2.94 | 8.60
        6  | most difficult vocabulary (Vocabulary)          | 7.31 | 6.71 | 11.38   | 4.73 | 2.43 | 7.57
        7  | human-like manner… (Task-oriented)              | 7.86 | 7.30 | 12.34   | 4.60 | 3.03 | 8.78
        8  | reproduce human reading times… (Task-oriented)  | 8.17 | 7.36 | 12.42   | 4.83 | 3.11 | 8.73
        9  | plain "Please complete the following sentence:" | 8.34 | 7.12 | 11.88   | 5.77 | 3.01 | 8.74
        10 | w/o prompting                                   | 9.32 | 6.15 | 11.48   | 6.25 | 2.69 | 8.86
  22. Experiment 2: prompting
      • Prompt-conditioned surprisal cannot outperform base LLMs with a similar PPL
      (Slide figure: PPL vs. PPP on the Dundee and Natural Stories corpora, as in Figure 2 (families: GPT-2, LLaMA-2, Falcon, GPT-3/3.5, OPT; marker size = model size; tuned (IT) vs. not-tuned (base)), with an additional "Tuned & Prompt" marker for prompt-conditioned IT-LLMs (red-lined); these stay below the base-LLM trade-off line.)
  23. Experiment 3: meta-linguistic prompting
      • "Hey LLMs, tell me the reading time/surprisal of this word in this sentence" (this may not work, though…) [Hu&Levy,23]
      • Simplified as a token-sorting problem in the order of processing costs; 1-shot and 3-shot settings (prompt construction sketched after this slide)
      • Example (1-shot):
        Suppose humans read the following sentence: "'No, it's fine. I love it,' said Lucy knowing that affording the phone had been no small thing for her mother." List the tokens and their IDs in order of their reading cost (high to low) during sentence processing.
        Token ID: 0: 'No,, 1: it's, 2: fine., 3: I, 4: love, 5: it,', 6: said, 7: Lucy, 8: knowing, 9: that, 10: affording, 11: the, 12: phone, 13: had, 14: been, 15: no, 16: small, 17: thing, 18: for, 19: her, 20: mother.,
        Answer: 20: mother., 10: affording, 6: said, 11: the, 0: 'No,, 7: Lucy, 1: it's, 9: that, 17: thing, 5: it,', 2: fine., 15: no, 14: been, 3: I, 13: had, 8: knowing, 12: phone, 19: her, 16: small, 4: love, 18: for,
        Suppose humans read the following sentence: "A clear and joyous day it was and out on the wide open sea, thousands upon thousands of sparkling water drops, excited by getting to play in the ocean, danced all around." List the tokens and their IDs in order of their reading cost (high to low) during sentence processing. Token ID: 0: A, 1: clear, 2: and, 3: joyous, 4: day, 5: it, 6: was, 7: and, 8: out, 9: on, 10: the, 11: wide, 12: open, 13: sea,, 14: thousands, 15: …
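For concreteness, a small sketch (my reconstruction from the slide, not the paper's code) of how such a token-sorting prompt can be assembled from a stimulus sentence; the exact formatting used in the paper may differ.

```python
def build_cost_ranking_prompt(sentence):
    """Build the 'list tokens by reading cost' query shown on this slide."""
    tokens = sentence.split()  # the slide enumerates whitespace-separated tokens
    listing = ", ".join(f"{i}: {tok}" for i, tok in enumerate(tokens))
    return (
        f'Suppose humans read the following sentence: "{sentence}" '
        "List the tokens and their IDs in order of their reading cost "
        f"(high to low) during sentence processing. Token ID: {listing}, Answer:"
    )

print(build_cost_ranking_prompt("A clear and joyous day it was."))
```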
  24. Experiment 3: meta-linguistic prompting
      • Results: no correlation between the models' predictions and actual reading time (Spearman's rank correlation; see the sketch after this slide)
      • Even using only the first three tokens listed by the models yielded no correlation
      • Table 3: rank correlations between estimated cognitive load and reading time of words (simplified prompts; DC = Dundee, NS = Natural Stories):
        Method 1: "Suppose humans read the following sentence: [SENT]. List the tokens in order of their reading cost (high to low) during sentence processing."
          LLaMA-2 7B:  DC 0.09±0.02,  NS -0.04±0.06
          LLaMA-2 13B: DC 0.06±0.02,  NS -0.03±0.06
          Falcon 7B:   DC 0.12±0.01,  NS 0.01±0.09
          Falcon 40B:  DC 0.03±0.04,  NS -0.03±0.11
          GPT-3.5 D2:  DC 0.05±0.03,  NS 0.05±0.03
          GPT-3.5 D3:  DC 0.08±0.03,  NS 0.03±0.02
        Method 2: "Suppose you read the following sentence: [SENT]. List the tokens in order of their probability in context (low to high)."
          LLaMA-2 7B:  DC 0.05±0.06,  NS 0.00±0.02
          LLaMA-2 13B: DC 0.04±0.03,  NS 0.06±0.04
          Falcon 7B:   DC 0.08±0.05,  NS 0.05±0.02
          Falcon 40B:  DC 0.02±0.07,  NS 0.13±0.10
          GPT-3.5 D2:  DC 0.03±0.00,  NS 0.02±0.00
          GPT-3.5 D3:  DC -0.01±0.02, NS 0.06±0.03
        Baseline: surprisal-based estimation
          LLaMA-2 7B:  DC 0.28, NS 0.19
          LLaMA-2 13B: DC 0.27, NS 0.19
          Falcon 7B:   DC 0.32, NS 0.18
          Falcon 40B:  DC 0.28, NS 0.17
          GPT-3.5 D2:  DC 0.28, NS 0.16
          GPT-3.5 D3:  DC 0.25, NS 0.17
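The evaluation behind Table 3 boils down to a rank correlation; here is a toy sketch (my addition, with made-up numbers, not data from the paper) of how it can be computed.

```python
from scipy.stats import spearmanr

# Toy per-token values: a cost score derived from the model's verbalized
# "high-to-low" list (higher = claimed costlier) and the corresponding
# measured reading times in milliseconds.
claimed_cost = [6, 4, 1, 5, 2, 3]
reading_time = [310, 240, 180, 290, 200, 260]

rho, p = spearmanr(claimed_cost, reading_time)
print(rho, p)  # near-zero rho would correspond to the paper's negative result
```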
  25. Experiment 3: meta-linguistic prompting
      • "Hey LLMs, tell me the surprisal of this word in this sentence"
      • Analysis: only weak correlation between the probabilities the models verbalize and their actual surprisals
        • i.e., a lack of meta-cognition of their own surprisal: prompted estimates of word probability are not an accurate measure of actual surprisal
      • Table 4: rank correlations between the word probability (rank) estimated by the prompt and the actual surprisal values computed by the corresponding model:
        Model       | DC        | NS
        LLaMA-2 7B  | 0.12±0.13 | 0.15±0.08
        LLaMA-2 13B | 0.02±0.10 | 0.06±0.07
        Falcon 7B   | 0.15±0.08 | 0.30±0.09
        Falcon 40B  | 0.09±0.09 | 0.17±0.00
        GPT-3.5 D2  | 0.15±0.02 | 0.22±0.07
        GPT-3.5 D3  | 0.18±0.05 | 0.24±0.02
  26. Summary
      • The current advancements of LLMs do not offer better measurements for cognitive modeling than simple bare word probabilities
      • Human-AI alignment has been argued for, but the perspective of cognitive modeling has been overlooked
      • The cognitive plausibility of direct probability measurement from base LMs is supported
        • Humans seem to be simply tuned to language statistics in corpora (in other words, accumulated linguistic exposure)
      • At least within our experiments (naturalistic reading)
  27. Open questions
      • What is additionally needed to fill the gap between accurate surprisal and human reading behavior?
      • What type of instruction tuning affects which words' surprisal?
      • Is this gap truly due to inherent limitations of surprisal theory + LLMs, or perhaps due to technical issues (e.g., tokenization)?
      • Representational alignment vs. behavioral alignment [Aw+,23]
      • Let's explore the intersection of NLP and cognitive modeling!
  28. CMCL 2024 (https://cmclorg.github.io/)
      • CMCL 2024 will be co-located with the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
      • The research interests/questions include, but are not limited to:
        • Human-like language acquisition/learning: how is language acquisition by language models (LMs) (dis)similar to humans', and why?
        • Contrasting/aligning NLP models with human behavioral data: what do humans compute during language comprehension/production, and how/why?
        • Linguistic probing of NLP models: how well do current language models understand/represent/generalize language, behaviorally and internally?
        • Linguistically-motivated data modeling/analysis: how can one quantify a particular aspect of language?
        • Emergent communication/language: what are the sufficient conditions for the emergence of language?
      • Important dates (deadlines at 11:59 pm AoE):
        • May 17, 2024: paper submission/commitment deadline (cf. May 15, 2024: notification of ACL 2024)
        • June 17, 2024: notification of acceptance
        • July 1, 2024: camera-ready papers due
        • August 15, 2024: workshop date
  29. References
      • Shapiro, Stuart C. 2003. "Artificial Intelligence (AI)." In Encyclopedia of Computer Science, 89–93. GBR: John Wiley and Sons Ltd.
      • Smith, Nathaniel J., and Roger Levy. 2013. "The Effect of Word Predictability on Reading Time Is Logarithmic." Cognition 128 (3): 302–19.
      • Futrell, Richard, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, and Evelina Fedorenko. n.d. "The Natural Stories Corpus." http://github.com/languageMIT/naturalstories.
      • Kennedy, Alan, Robin Hill, and Joël Pynte. 2003. "The Dundee Corpus." In Proceedings of the 12th European Conference on Eye Movements.
      • Luke, Steven G., and Kiel Christianson. 2018. "The Provo Corpus: A Large Eye-Tracking Corpus with Predictability Norms." Behavior Research Methods 50 (2): 826–33.
      • Hollenstein, Nora, Marius Troendle, Ce Zhang, and Nicolas Langer. 2020. "ZuCo 2.0: A Dataset of Physiological Recordings During Natural Reading and Annotation." In Proceedings of the Twelfth Language Resources and Evaluation Conference, 138–46. Marseille, France: European Language Resources Association.
      • Fedorenko, Evelina, Terri L. Scott, Peter Brunner, William G. Coon, Brianna Pritchett, Gerwin Schalk, and Nancy Kanwisher. 2016. "Neural Correlate of the Construction of Sentence Meaning." Proceedings of the National Academy of Sciences of the United States of America 113 (41): E6256–62.
      • Brennan, Jonathan R., and Liina Pylkkänen. 2017. "MEG Evidence for Incremental Sentence Composition in the Anterior Temporal Lobe." Cognitive Science 41 Suppl 6 (May): 1515–31.
      • Thornhill, Dianne E., and Cyma Van Petten. 2012. "Lexical versus Conceptual Anticipation during Sentence Processing: Frontal Positivity and N400 ERP Components." International Journal of Psychophysiology 83 (3): 382–92.
      • Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2013. "Word Surprisal Predicts N400 Amplitude during Reading." In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 878–83. Sofia, Bulgaria: Association for Computational Linguistics.
      • Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. "The ERP Response to the Amount of Information Conveyed by Words in Sentences." Brain and Language 140 (January): 1–11.
      • Hale, John, Chris Dyer, Adhiguna Kuncoro, and Jonathan Brennan. 2018. "Finding Syntax in Human Encephalography with Beam Search." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2727–36. Melbourne, Australia: Association for Computational Linguistics.
  30. References
      • Michaelov, James A., Megan D. Bardolph, Cyma K. Van Petten, Benjamin K. Bergen, and Seana Coulson. 2023. "Strong Prediction: Language Model Surprisal Explains Multiple N400 Effects." Neurobiology of Language, June, 1–29.
      • Wehbe, Leila, Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas, and Tom Mitchell. 2014. "Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses." PloS One 9 (11): e112575.
      • Blank, Idan, Nancy Kanwisher, and Evelina Fedorenko. 2014. "A Functional Dissociation between Language and Multiple-Demand Systems Revealed in Patterns of BOLD Signal Fluctuations." Journal of Neurophysiology 112 (5): 1105–18.
      • Brennan, Jonathan R., Edward P. Stabler, Sarah E. Van Wagenen, Wen-Ming Luh, and John T. Hale. 2016. "Abstract Linguistic Structure Correlates with Temporal Activity during Naturalistic Comprehension." Brain and Language 157-158 (June): 81–94.
      • Pereira, Francisco, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J. Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. "Toward a Universal Decoder of Linguistic Meaning from Brain Activation." Nature Communications 9 (1): 963.
      • Shain, Cory, Idan Asher Blank, Marten van Schijndel, William Schuler, and Evelina Fedorenko. 2020. "fMRI Reveals Language-Specific Predictive Coding during Naturalistic Sentence Comprehension." Neuropsychologia 138 (February): 107307.
      • Schrimpf, Martin, Idan Blank, Greta Tuckute, Carina Kauf, Eghbal A. Hosseini, Nancy Kanwisher, Joshua Tenenbaum, and Evelina Fedorenko. 2020. "The Neural Architecture of Language: Integrative Modeling Converges on Predictive Processing." bioRxiv. https://doi.org/10.1101/2020.06.26.174482.
      • Levy, Roger. 2008. "Expectation-Based Syntactic Comprehension." Cognition 106 (3): 1126–77.
      • Shain, Cory, Clara Meister, Tiago Pimentel, Ryan Cotterell, and Roger P. Levy. 2022. "Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time." https://doi.org/10.31234/osf.io/4hyna.
      • Hale, John. 2001. "A Probabilistic Earley Parser as a Psycholinguistic Model." In Proceedings of NAACL, 159–66.
      • Frank, Stefan L., and Rens Bod. 2011. "Insensitivity of the Human Sentence-Processing System to Hierarchical Structure." Psychological Science 22 (6): 829–34.
  31. References
      • Fossum, Victoria, and Roger Levy. 2012. "Sequential vs. Hierarchical Syntactic Models of Human Incremental Sentence Processing." In Proceedings of CMCL, 61–69. Montréal, Canada.
      • Merkx, Danny, and Stefan L. Frank. 2020. "Comparing Transformers and RNNs on Predicting Human Sentence Processing Data." arXiv preprint arXiv:2005.09471. http://arxiv.org/abs/2005.09471.
      • Merkx, Danny, and Stefan L. Frank. 2021. "Human Sentence Processing: Recurrence or Attention?" In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 12–22. Online: Association for Computational Linguistics.
      • Wilcox, Ethan Gotlieb, Jon Gauthier, Jennifer Hu, Peng Qian, and Roger Levy. 2020. "On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior." In Proceedings of CogSci, 1707–13.
      • Goodkind, Adam, and Klinton Bicknell. 2018. "Predictive Power of Word Surprisal for Reading Times Is a Linear Function of Language Model Quality." In Proceedings of CMCL 2018, 10–18.
      • Kuribayashi, Tatsuki, Yohei Oseki, Takumi Ito, Ryo Yoshida, Masayuki Asahara, and Kentaro Inui. 2021. "Lower Perplexity Is Not Always Human-Like." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 5203–17. Online: Association for Computational Linguistics.
      • Oh, Byung-Doh, and William Schuler. 2023. "Why Does Surprisal from Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?" Transactions of the Association for Computational Linguistics 11 (March): 336–50.
      • Wilcox, Ethan, Clara Meister, Ryan Cotterell, and Tiago Pimentel. 2023. "Language Model Quality Correlates with Psychometric Predictive Power in Multiple Languages." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7503–11. Singapore: Association for Computational Linguistics.
      • Oh, Byung-Doh, Shisen Yue, and William Schuler. 2024. "Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times." In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2644–63. St. Julian's, Malta: Association for Computational Linguistics.
  32. References
      • Futrell, Richard, Edward Gibson, and Roger P. Levy. 2020. "Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing." Cognitive Science.
      • Kuribayashi, Tatsuki, Yohei Oseki, Ana Brassard, and Kentaro Inui. 2022. "Context Limitations Make Neural Language Models More Human-Like." In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 10421–36. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
      • Nair, Sathvik, and Philip Resnik. 2023. "Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?" In Findings of the Association for Computational Linguistics: EMNLP 2023, 11251–60. Singapore: Association for Computational Linguistics.
      • Schijndel, Marten van, and Tal Linzen. 2020. "Single-Stage Prediction Models Do Not Explain the Magnitude of Syntactic Disambiguation Difficulty." https://doi.org/10.31234/osf.io/sgbqy.
      • Wilcox, Ethan, Pranali Vani, and Roger Levy. 2021. "A Targeted Assessment of Incremental Processing in Neural Language Models and Humans." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 939–52. Online: Association for Computational Linguistics.
      • Huang, Kuan-Jung, Suhas Arehalli, Mari Kugemoto, Christian Muxica, Grusha Prasad, Brian Dillon, and Tal Linzen. 2024. "Large-Scale Benchmark Yields No Evidence That Language Model Surprisal Explains Syntactic Disambiguation Difficulty." Journal of Memory and Language 137 (August): 104510.
      • Gibson, Edward. 1998. "Linguistic Complexity: Locality of Syntactic Dependencies." Cognition 68 (1): 1–76.
      • Lewis, Richard L., Shravan Vasishth, and Julie A. Van Dyke. 2006. "Computational Principles of Working Memory in Sentence Comprehension." Trends in Cognitive Sciences 10 (10): 447–54.
      • Pimentel, Tiago, Clara Meister, Ethan G. Wilcox, Roger P. Levy, and Ryan Cotterell. 2023. "On the Effect of Anticipation on Reading Times." Transactions of the Association for Computational Linguistics 11 (December): 1624–42.
  33. References
      • Liu, Tong, Iza Škrjanec, and Vera Demberg. 2023. "Improving Fit to Human Reading Times via Temperature-Scaled Surprisal." arXiv [cs.CL]. http://arxiv.org/abs/2311.09325.
      • Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. "Language Models Are Few-Shot Learners." In Advances in Neural Information Processing Systems. https://proceedings.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
      • Hu, Jennifer, and Roger Levy. 2023. "Prompt-Based Methods May Underestimate Large Language Models' Linguistic Generalizations." arXiv [cs.CL]. http://arxiv.org/abs/2305.13264.
      • Aw, Khai Loong, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, and Antoine Bosselut. 2023. "Instruction-Tuning Aligns LLMs to the Human Brain." arXiv [cs.CL]. http://arxiv.org/abs/2312.00575.