Slide 1

Slide 1 text

Time Travel with Large Language Models Danushka Bollegala

Slide 2

Slide 2 text

2 Mad scientist

Slide 3

Slide 3 text

3 Xiaohang Tang, Yi Zhou, Yoichi Ishibashi, Taichi Aida — Mad scientist with Large Language Models

Slide 4

Slide 4 text

Time and Meaning — Cell: Robert Hooke (1665), Martin Cooper (1973)

Slide 5

Slide 5 text

Time and Meaning — Corona 5

Slide 6

Slide 6 text

Why do word meanings change?
• New concepts/entities are associated with existing words (e.g. cell)
• Word re-usage promotes efficiency in human communication [cf. Polysemy, Ravin+Leacock '00]
• 40% of the words in the Webster dictionary have more than two senses, while run has 29!
• Totally new words (neologisms) are coined to describe previously non-existent concepts/entities (e.g. ChatGPT)
• Semantics, morphology and syntax are strongly interrelated [Langacker+87, Hock+Joseph 19]
• What counts as coherent, grammatical change over time? [Giulianelli+21]
[Embedded paper excerpt: "Grammatical Profiling for Semantic Change Detection", Giulianelli, Kutuzov & Pivovarova — grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words, can be used for semantic change detection and even outperforms some distributional semantic methods. Example: lass — "young woman" → "sweetheart"; drop in the plural form (lasses); Pokémon trainer class ("girl in mini-skirt").]

Slide 7

Slide 7 text

A Brief History of Word Embeddings
• Static Word Embeddings: word2vec [Mikolov+13], GloVe [Pennington+14], fastText [Bojanowski+17], …
• Contextualised Word Embeddings: BERT [Devlin+19], RoBERTa [Liu+19], ALBERT [Lan+20], …
• Dynamic Word Embeddings: Bernoulli embeddings [Rudolph+Blei 17], Diachronic word embeddings [Hamilton+16], …
• Dynamic Contextualised Word Embeddings: TempoBERT [Rosin+22], HistBERT [Qiu+22], TimeLMs [Loureiro+22], …

Slide 8

Slide 8 text

Diachronic Word Embeddings
• Given multiple snapshots of corpora collected at different time steps, we could separately learn word embeddings from each snapshot. [Hamilton+16, Kulkarni+15, Loureiro+22]
• Pros: any word embedding learning method can be used
• Cons:
  • Many models trained at different snapshots.
  • Difficult to compare word embeddings learnt from different corpora because no natural alignment exists (cf. even the sets of word embeddings obtained from different runs of the same algorithm cannot be compared due to random initialisations)
[Embedded figure from Kulkarni+15, "Statistically Significant Detection of Linguistic Change": the semantic trajectory of the word gay from 1900 to 2005, moving from neighbours such as cheerful and dapper towards homosexual, lesbian and transgender.]

Slide 9

Slide 9 text

Learning Alignments
• Different methods can be used to learn alignments between separately learnt vector spaces
• Canonical Correlation Analysis (CCA) was used by Pražák+20 (ranked 1st for the SemEval 2020 Task 1 binary semantic change detection task)
• Projecting source embeddings into the target space: $\hat{X}_s = W_{s\to t} X_s$
• CCA learns two projections $W_{s\to o}$ and $W_{t\to o}$ of the source and target spaces into a shared space $o$ by minimising the negative correlation
  $\arg\min_{W_{s\to o}, W_{t\to o}} -\sum_{i=1}^{n} \frac{\mathrm{cov}(W_{s\to o} x^s_i,\, W_{t\to o} x^t_i)}{\sqrt{\mathrm{var}(W_{s\to o} x^s_i)\,\mathrm{var}(W_{t\to o} x^t_i)}}$,
  and the source-to-target map is then recovered as $W_{s\to t} = W_{s\to o}\,(W_{t\to o})^{-1}$.
• Further orthogonal constraints can be imposed on $W_{s\to t}$ (cf. the Orthogonal Transformation of VecMap [Artetxe+18])
• However, aligning contextualised word embeddings is hard [Takahashi+Bollegala'22]
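A minimal sketch of the two alignment strategies mentioned above (not the code of Pražák+20): an orthogonal Procrustes map and a CCA-based map realised through a pseudo-inverse. The matrices are random placeholders standing in for embeddings of the same vocabulary trained on C1 and C2.

```python
# Sketch: align two separately trained embedding spaces whose i-th rows are
# assumed to embed the same word in corpus C1 and corpus C2 respectively.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_s = rng.normal(size=(2000, 100))   # embeddings learnt from C1 (placeholder)
X_t = rng.normal(size=(2000, 100))   # embeddings learnt from C2 (placeholder)

# (a) Orthogonal map: W = argmin_{W: W W^T = I} ||X_s W - X_t||_F  (Procrustes)
U, _, Vt = np.linalg.svd(X_s.T @ X_t)
W_orth = U @ Vt
X_s_aligned = X_s @ W_orth

# (b) CCA: project both spaces into a shared space o, then map s -> t with
#     W_{s->t} = W_{s->o} (W_{t->o})^{-1}, here via a pseudo-inverse.
cca = CCA(n_components=20, scale=False).fit(X_s, X_t)
W_s_to_t = cca.x_rotations_ @ np.linalg.pinv(cca.y_rotations_)
X_s_in_t = (X_s - X_s.mean(axis=0)) @ W_s_to_t + X_t.mean(axis=0)

print(X_s_aligned.shape, X_s_in_t.shape)
```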

Slide 10

Slide 10 text

Dynamic Embeddings
• Exponential Family Embeddings [Rudolph+16]: $x_i \mid x_{c_i} \sim \mathrm{ExpFam}\big(\eta_i(x_{c_i}),\, t(x_i)\big)$
• Bernoulli Embeddings [Rudolph+Blei 17]: $x_{iv} \mid x_{c_i} \sim \mathrm{Bern}\big(\rho^{(t)}_{iv}\big)$, where $\eta_{iv} = \rho^{(t_i)\top}_{v} \sum_{j \in c_i} \sum_{v'} \alpha_{v'}\, x_{jv'}$
• Dynamic Embeddings: the embedding vectors $\rho^{(t)}_v$ are time-specific, while the context vectors (parametrised by $\alpha_v$) are shared over time
[Embedded Figure 2 from Rudolph+Blei '17: graphical representation of a dynamic Bernoulli embedding for text data in T time slices X(1), …, X(T); the embedding vectors of each term evolve over time, while the context vectors are shared across all time slices.]
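As a toy illustration of the Bernoulli-embedding parameterisation above (random values, nothing is trained here), the natural parameter combines a time-specific embedding vector with context vectors that are shared across time:

```python
import numpy as np

V, K, T = 1000, 50, 3                          # vocabulary size, dimension, time slices
rng = np.random.default_rng(0)
rho = rng.normal(scale=0.1, size=(T, V, K))    # time-specific embedding vectors
alpha = rng.normal(scale=0.1, size=(V, K))     # context vectors, shared over time

def eta(v, context_ids, t):
    """Natural parameter eta_{iv} for word v at time slice t, given its context word ids."""
    context_sum = alpha[context_ids].sum(axis=0)   # sum_{j in c_i} sum_{v'} alpha_{v'} x_{jv'}
    return float(rho[t, v] @ context_sum)

def bernoulli_prob(v, context_ids, t):
    """Probability of observing word v in this context (sigmoid link, an assumption)."""
    return 1.0 / (1.0 + np.exp(-eta(v, context_ids, t)))

print(bernoulli_prob(v=42, context_ids=[3, 17, 256, 999], t=2))
```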

Slide 11

Slide 11 text

Dynamic Embeddings
[Figure: the dynamic embedding of the word "intelligence" computed from (a) ACM abstracts (1951–2014) and (b) U.S. Senate speeches (1858–2009), projected to a single dimension (y-axis).]

Slide 12

Slide 12 text

Time Masking (TempoBERT) [Rosin+22]
• Prepend the time stamp to each sentence in a corpus written at a specific time: "<2021> Joe Biden is the President of the USA"
• Mask out the time token in the same way as other tokens during MLM training
• Masking time tokens with a higher probability (e.g. 0.2) performs better
• Predicting the time of a sentence: "[MASK] Joe Biden is the President of the USA"
• The probability distributions of the predicted time tokens can be used to compute semantic change scores for words

Table 3: Semantic change detection results on LiverpoolFC, SemEval-English, and SemEval-Latin (Pearson / Spearman).
Method                     LiverpoolFC      SemEval-Eng      SemEval-Lat
Del Tredici et al. [5]     0.490 / –        – / –            – / –
Schlechtweg et al. [37]    0.428 / 0.425    0.512 / 0.321    0.458 / 0.372
Gonen et al. [10]          – / –            0.504 / 0.277    0.417 / 0.273
Martinc et al. [26]        0.473 / 0.492    – / 0.315        – / 0.496
Montariol et al. [28]      0.378 / 0.376    0.566 / 0.437    – / 0.448
TempoBERT                  0.637 / 0.620    0.538 / 0.467    0.485 / 0.512

Works surprisingly well on multiple datasets and languages!
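A rough sketch of the time-masking idea with HuggingFace Transformers (this is not the TempoBERT code; the custom MLM collator that masks time tokens with a higher probability, and the fine-tuning loop itself, are omitted — without that training the printed probabilities are not yet meaningful):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

time_tokens = ["<2010>", "<2021>"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_tokens(time_tokens)
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tok))        # new embedding rows for the time tokens

# Training data: prepend the snapshot's time stamp to every sentence.
train_example = "<2021> Joe Biden is the President of the USA"   # would be fed to MLM training

# Inference: mask the time position and read off the predicted time distribution.
probe = f"{tok.mask_token} Joe Biden is the President of the USA"
inputs = tok(probe, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = logits.softmax(dim=-1)
for t in time_tokens:
    print(t, float(probs[tok.convert_tokens_to_ids(t)]))
```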

Slide 13

Slide 13 text

Temporal Attention [Rosin+Radinsky 22]
• Instead of changing the input text, change the attention mechanism in the Transformer to incorporate time.
• Input sequence $x^t_1, x^t_2, \dots, x^t_n$, with input embeddings $x^t_i \in \mathbb{R}^D$ arranged as the rows of $X^t \in \mathbb{R}^{n \times D}$
• Query $Q = X^t W_Q$, Key $K = X^t W_K$, Value $V = X^t W_V$, Time $T = X^t W_T$ (here, $Q, K, V, T \in \mathbb{R}^{n \times d_k}$)
• $\mathrm{TemporalAttention}(Q, K, V, T) = \mathrm{softmax}\!\left(\dfrac{Q\, T^{\top} T\, K^{\top}}{\lVert T \rVert\, \sqrt{d_k}}\right) V$
• Increases the number of parameters (memory), but empirical results show this overhead is negligible.
[Embedded Figure 2 from Rosin+Radinsky '22: illustration of the proposed temporal attention mechanism.]
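A single-head PyTorch sketch of the scoring function above (an illustration of the formula, not the authors' implementation; batching, attention masks and multi-head handling are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_k, bias=False)
        self.W_K = nn.Linear(d_model, d_k, bias=False)
        self.W_V = nn.Linear(d_model, d_k, bias=False)
        self.W_T = nn.Linear(d_model, d_k, bias=False)   # extra time projection
        self.d_k = d_k

    def forward(self, X_t: torch.Tensor) -> torch.Tensor:
        # X_t: (n, d_model) token embeddings of a sentence written at time t
        Q, K, V, T = self.W_Q(X_t), self.W_K(X_t), self.W_V(X_t), self.W_T(X_t)
        # softmax( Q T^T T K^T / (||T|| sqrt(d_k)) ) V
        scores = Q @ T.transpose(-2, -1) @ T @ K.transpose(-2, -1)
        scores = scores / (torch.linalg.norm(T) * self.d_k ** 0.5)
        return F.softmax(scores, dim=-1) @ V

attn = TemporalAttention(d_model=768, d_k=64)
print(attn(torch.randn(12, 768)).shape)      # -> torch.Size([12, 64])
```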

Slide 14

Slide 14 text

Dynamic Contextualised Word Embeddings [Hofmann+21]
• First, incorporate time $t_j$ and social context $s_i$ into the static word embedding of the $k$-th word: $e^{(k)}_{ij} = d(x^{(k)}, s_i, t_j)$
  • $x^{(k)}$: BERT input embeddings
  • $s_i$: learnt using a Graph Attention Network (GAT) [Veličković+18] applied to the social network
  • $t_j$: sampled from a zero-mean diagonal Gaussian
• Next, use these dynamic non-contextualised embeddings with BERT to create a contextualised version of them: $h^{(k)}_{ij} = \mathrm{BERT}(e^{(k)}_{ij}, s_i, t_j)$

Slide 15

Slide 15 text

Learn vs. Adapt
• Temporal Adaptation: instead of training separate word embedding models from each snapshot taken at different time stamps, adapt a model from one point (current/past) in time to another (future) point in time. [Kulkarni+15, Hamilton+16, Loureiro+22]
• Benefits
  • Parameter efficiency: models trained on different snapshots share the same set of parameters, leading to smaller total model sizes.
  • Data efficiency: we might not have sufficient data at each snapshot (especially when the time intervals are short) to accurately train large models

Slide 16

Slide 16 text

Problem Setting
• Given a Masked Language Model (MLM) M and two corpora (snapshots) C1 and C2, taken at two different times T1 and T2 (T2 > T1), adapt M from T1 to T2 such that it can represent the meanings of words at T2.
• Remarks
  • M does not have to be trained on C1 (or C2).
  • We do not care whether M can accurately represent the meanings of words at T1.
  • M is both contextualised as well as dynamic (time-sensitive) — hence, a Dynamic Contextualised Word Embedding (DCWE)!

Slide 17

Slide 17 text

Prompt-based Temporal Adaptation
• How do we connect two corpora collected at two different points in time?
• Pivots (w): words that occur in both C1 as well as C2
• Anchors (u, v): words that are associated with pivots in either C1 or C2, but not both.
  • u is associated with w in C1, whereas v is associated with w in C2
• Temporal Prompt: "w is associated with u in T1, whereas it is associated with v in T2"
• Example: (mask, hide, vaccine), T1 = 2010, T2 = 2020
  • "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"

Slide 18

Slide 18 text

Frequency-based Tuple Selection
• Pivot selection: if a word occurs a lot in both corpora, it is likely to be time-invariant (domain-independent) [Bollegala+15]
  • $\mathrm{score}(w) = \min\big(f(w, C_1),\, f(w, C_2)\big)$, where $f(w, C)$ is the frequency of $w$ in corpus $C$
• Anchor selection: words in each corpus that have high pointwise mutual information with pivots are likely to be good anchors
  • $\mathrm{PMI}(w, x; C) = \log\dfrac{p(w, x)}{p(w)\, p(x)}$
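A small sketch of these two scores, assuming each corpus is a list of tokenised sentences and using sentence-level co-occurrence for PMI (the actual co-occurrence window may differ; the corpora below are toy examples):

```python
from collections import Counter
from math import log

def unigram_counts(corpus):
    return Counter(tok for sent in corpus for tok in sent)

def pivot_score(w, f1, f2):
    """score(w) = min(f(w, C1), f(w, C2)); frequent in both corpora -> likely time-invariant."""
    return min(f1[w], f2[w])

def pmi(w, x, corpus):
    """PMI(w, x; C) with sentence-level co-occurrence probabilities."""
    n = len(corpus)
    p_w = sum(w in s for s in corpus) / n
    p_x = sum(x in s for s in corpus) / n
    p_wx = sum((w in s) and (x in s) for s in corpus) / n
    if min(p_w, p_x, p_wx) == 0:
        return float("-inf")
    return log(p_wx / (p_w * p_x))

C1 = [["wear", "a", "mask", "to", "hide", "your", "face"],
      ["a", "mask", "can", "hide", "your", "identity"],
      ["she", "went", "to", "the", "shop"],
      ["he", "wore", "a", "scary", "mask"]]
C2 = [["mask", "mandates", "and", "vaccine", "rollout"],
      ["get", "a", "vaccine", "and", "wear", "a", "mask"],
      ["the", "vaccine", "was", "approved"],
      ["stay", "home", "and", "stay", "safe"]]

f1, f2 = unigram_counts(C1), unigram_counts(C2)
print(pivot_score("mask", f1, f2))                            # pivot candidate
print(pmi("mask", "hide", C1), pmi("mask", "vaccine", C2))    # anchor candidates
```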

Slide 19

Slide 19 text

Diversity-based Tuple Selection
• The anchors that have high PMI with the pivots in both corpora could be similar, resulting in useless prompts for temporal adaptation
• Add a diversity penalty on pivots:
  • $\mathrm{diversity}(w) = 1 - \dfrac{|\mathcal{U}(w) \cap \mathcal{V}(w)|}{|\mathcal{U}(w) \cup \mathcal{V}(w)|}$
  • $\mathcal{U}(w)$: set of anchors associated with $w$ in $C_1$
  • $\mathcal{V}(w)$: set of anchors associated with $w$ in $C_2$
• Select pivots $w$ that score high on diversity and create tuples $(w, u, v)$ by selecting the corresponding anchors.
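The diversity penalty itself is a Jaccard-style comparison of the two anchor sets; a tiny sketch with illustrative anchor sets (not taken from the paper):

```python
def diversity(anchors_c1, anchors_c2):
    union = anchors_c1 | anchors_c2
    if not union:
        return 0.0
    return 1.0 - len(anchors_c1 & anchors_c2) / len(union)

U = {"hide", "costume", "halloween"}    # anchors of "mask" in C1 (illustrative)
V = {"vaccine", "covid", "hide"}        # anchors of "mask" in C2 (illustrative)
print(diversity(U, V))                  # 1 - 1/5 = 0.8 -> a usefully diverse pivot
```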

Slide 20

Slide 20 text

Context-based Tuple Selection
• Two issues in the frequency- and diversity-based tuple selection methods:
  • co-occurrences can be sparse (esp. in small corpora), and can make PMI overestimate the association between words.
  • the contexts of the co-occurrences are not considered.
• Solution — use contextualised word embeddings
  • A word $x$ is represented by averaging its token embedding $M(x, d)$ over all of its occurrences $d \in \mathcal{D}(x)$: $\;\bar{x} = \dfrac{1}{|\mathcal{D}(x)|} \sum_{d \in \mathcal{D}(x)} M(x, d)$
  • Compute two embeddings $x_1$ and $x_2$ for $x$, respectively from $C_1$ and $C_2$
  • $\mathrm{score}(w, u, v) = g(w_1, u_1) + g(w_2, v_2) - g(w_2, u_2) - g(w_1, v_1)$
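A sketch (not the paper's code) of the context-based score with a pretrained BERT, taking g to be cosine similarity (an assumption; g is not specified on the slide). Sub-word pieces of an occurrence are mean-pooled, and the example sentences are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModel.from_pretrained("bert-base-uncased").eval()

def avg_embedding(word, sentences):
    """Average contextualised embedding of `word` over all its occurrences."""
    pieces = tok.tokenize(word)
    vecs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        toks = tok.convert_ids_to_tokens(enc["input_ids"][0])
        with torch.no_grad():
            hidden = mlm(**enc).last_hidden_state[0]
        for i in range(len(toks) - len(pieces) + 1):
            if toks[i:i + len(pieces)] == pieces:
                vecs.append(hidden[i:i + len(pieces)].mean(0))   # mean-pool sub-word pieces
    return torch.stack(vecs).mean(0)

def g(a, b):
    return float(torch.cosine_similarity(a, b, dim=0))

occurrences = {                       # occurrences of each word in each corpus (illustrative)
    ("mask", "C1"): ["wear a mask to hide your face", "a mask can hide who you are"],
    ("mask", "C2"): ["mask rules and the vaccine rollout", "get a vaccine and wear a mask"],
    ("hide", "C1"): ["wear a mask to hide your face"],
    ("hide", "C2"): ["people hide indoors during a lockdown"],
    ("vaccine", "C1"): ["an early vaccine trial"],
    ("vaccine", "C2"): ["get a vaccine and wear a mask"],
}
emb = {key: avg_embedding(key[0], sents) for key, sents in occurrences.items()}

score = (g(emb["mask", "C1"], emb["hide", "C1"]) + g(emb["mask", "C2"], emb["vaccine", "C2"])
         - g(emb["mask", "C2"], emb["hide", "C2"]) - g(emb["mask", "C1"], emb["vaccine", "C1"]))
print(score)
```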

Slide 21

Slide 21 text

Automatic Template Learning
• Given a tuple (extracted by any of the previously described methods), can we generate the templates?
  • "mask is associated with hide in 2010 and associated with vaccine in 2020"
• Find two sentences $S_1$ and $S_2$ containing $u$ and $v$, and use T5 [Raffel+'20] to generate the slots Z1, Z2, Z3 and Z4 of the template $T_g(u, v, T_1, T_2)$:
  $S_1, S_2 \;\to\; S_1\, \langle Z_1\rangle\, u\, \langle Z_2\rangle\, T_1\, \langle Z_3\rangle\, v\, \langle Z_4\rangle\, T_2\, S_2$
  The length of each slot is not required to be predefined; one token is generated at a time until the next non-slot token (i.e. $u$, $T_1$, $v$, $T_2$) is encountered.
• Select the templates that have high likelihood with all tuples. [Gao+'21]
• Use beam search with a large beam width (e.g. 100) to generate a diverse set of templates.
• We substitute tuples into the generated templates to create Automatic prompts.
• Example fillers: mask, hide, 2010, vaccine, 2020
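A hedged sketch of the T5 slot-filling step, in the spirit of Gao+'21: the slots ⟨Z1⟩–⟨Z4⟩ are realised as T5 sentinel tokens and filled with beam search. The two sentences are illustrative, and the selection of templates by their likelihood over all tuples is omitted.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

w, u, v, T1, T2 = "mask", "hide", "vaccine", "2010", "2020"
S1 = "Please wear a mask to hide your face."             # a sentence containing u (illustrative)
S2 = "A mask and a vaccine protect you from the virus."  # a sentence containing v (illustrative)

# S1 <Z1> u <Z2> T1 <Z3> v <Z4> T2 S2
prompt = (f"{S1} <extra_id_0> {u} <extra_id_1> {T1} "
          f"<extra_id_2> {v} <extra_id_3> {T2} {S2}")

inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=100, num_return_sequences=10,
                         max_new_tokens=32)
for o in outputs:
    print(tok.decode(o, skip_special_tokens=False))
```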

Slide 22

Slide 22 text

Examples of Prompts

Table 1: Experimented templates. "Manual" denotes that the template is manually written, whereas "Automatic" denotes that the template is automatically generated.
Template                                                                                 Type
⟨w⟩ is associated with ⟨u⟩ in ⟨T1⟩, whereas it is associated with ⟨v⟩ in ⟨T2⟩.           Manual
Unlike in ⟨T1⟩, where ⟨u⟩ was associated with ⟨w⟩, in ⟨T2⟩ ⟨v⟩ is associated with ⟨w⟩.   Manual
The meaning of ⟨w⟩ changed from ⟨T1⟩ to ⟨T2⟩ respectively from ⟨u⟩ to ⟨v⟩.               Manual
⟨u⟩ in ⟨T1⟩ ⟨v⟩ in ⟨T2⟩                                                                  Automatic
⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩                                                              Automatic
The ⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩                                                          Automatic

• Automatic prompts tend to be short and less diverse.
• Emphasising high likelihood results in shorter prompts.

Slide 23

Slide 23 text

Fine-tuning on Temporal Prompts
• Add a language modelling head to the pre-trained MLM and fine-tune it such that it can correctly predict the masked-out tokens in a prompt:
  "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"
• We mask all tokens at random during fine-tuning.
• Masking only anchors did not improve performance significantly.
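A minimal sketch of this fine-tuning step with HuggingFace's Trainer and the standard MLM collator (15% random masking); the prompts and hyperparameters are placeholders, not the ones used in the paper.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

prompts = [
    "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020.",
    "phone is associated with nokia in 2010, whereas it is associated with iphone in 2020.",
]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenise the prompts; the collator masks tokens at random on the fly.
ds = Dataset.from_dict({"text": prompts}).map(
    lambda ex: tok(ex["text"], truncation=True), remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tmp_temporal_ft", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```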

Slide 24

Slide 24 text

Experiments
• Datasets
  • Yelp: we select publicly available reviews covering the years 2010 (T1) and 2020 (T2).
  • Reddit: we take all comments from September 2019 (T1) and April 2020 (T2), which reflects the effects of the COVID-19 pandemic.
  • ArXiv: we obtain abstracts of papers published in the years 2010 (T1) and 2020 (T2).
  • Ciao: we select reviews from the years 2010 (T1) and 2020 (T2) [Tang+'12]
• Baselines
  • Original BERT: pre-trained BERT-base-uncased
  • BERT(T1): the original BERT fine-tuned on the training data sampled at T1.
  • BERT(T2): the original BERT fine-tuned on the training data sampled at T2.
  • Proposed: FT(model, template)

Slide 25

Slide 25 text

Results — Temporal Adaptation
• Evaluation metric: perplexity (lower is better) for generating the test sentences in T2.
• The best result in each block is in bold, while the overall best is indicated by †.

MLM                      Yelp     Reddit   ArXiv    Ciao
Original BERT            15.125   25.277   11.142   12.669
FT (BERT, Manual)        14.562   24.109   10.849   12.371
FT (BERT, Auto)          14.458   23.382   10.903   12.394
BERT(T1)                 5.543    9.287    5.854    7.423
FT (BERT(T1), Manual)    5.534    9.327    5.817    7.334
FT (BERT(T1), Auto)      5.541    9.303    5.818    7.347
BERT(T2)                 4.718    8.927    3.500    5.840
FT (BERT(T2), Manual)    4.714    8.906†   3.500    5.813†
FT (BERT(T2), Auto)      4.708†   8.917    3.499†   5.827
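For reference, a common perplexity-style score for an MLM is pseudo-perplexity, masking one token at a time; the exact computation behind the table above may differ from this sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn and exponentiate the mean negative log-likelihood."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nll, n = 0.0, 0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
        n += 1
    return float(torch.exp(torch.tensor(nll / n)))

print(pseudo_perplexity("Masks are associated with vaccines in 2020."))
```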

Slide 26

Slide 26 text

Results — Comparisons against SoTA
• FT (Proposed) has the lowest perplexities across all datasets.
• CWE (Contextualised Word Embeddings) used by Hofmann+21 [BERT]
• DCWE (Dynamic CWE) proposed by Hofmann+21

MLM                                     Yelp     Reddit   ArXiv    Ciao
FT (BERT(T2), Manual)                   4.714    8.906†   3.499    5.813†
FT (BERT(T2), Auto)                     4.708†   8.917    3.499†   5.827
TempoBERT [Rosin+2022]                  5.516    12.561   3.709    6.126
CWE [Hofmann+2021]                      4.723    9.555    3.530    5.910
DCWE [temp. only] [Hofmann+2021]        4.723    9.631    3.515    5.899
DCWE [temp. + social] [Hofmann+2021]    4.720    9.596    3.513    5.902

Slide 27

Slide 27 text

Pivots and Anchors
• Anecdote:
  • burgerville and joes are restaurants that were popular in 2010, but due to the lockdowns, takeaways such as dominos have become associated with place in 2020.
  • clerk is less used now and is getting replaced by administrator, operator, etc.

Pivot (w)   Anchors (u, v)
place       (burgerville, takeaway), (burgerville, dominos), (joes, dominos)
service     (doorman, staffs), (clerks, personnel), (clerks, administration)
phone       (nokia, iphone), (nokia, ipod), (nokia, blackberry)
service     (clerk, administrator), (doorman, staff), (clerk, operator)

Slide 28

Slide 28 text

We ❤ Prompts 28

Slide 29

Slide 29 text

Let's talk about Prompting
• There are many types of prompts currently in use
• Few-shot prompting
  • Give some examples and ask the LLM to generalise from them (cf. in-context learning)
  • e.g. "If man is to woman then king is to what?"
• Zero-shot/instruction prompting
  • Describe the task that needs to be performed by the LLM
  • e.g. "Translate the following sentence from Japanese to English: 言語モデルはすごいです。" ("Language models are amazing.")

Slide 30

Slide 30 text

Robustness of Prompting?
• Humans have a latent intent that they want to express using a short text snippet to an LLM, and a prompt is a surface realisation of this latent intent
• Prompting is a many-to-one mapping, with multiple surface realisations possible for a single latent intent inside the human brain
• It is OK for prompts to be different as long as they all align to the same latent intent (and hopefully give the same level of performance)
• Robustness of a Prompt Learning Method [Ishibashi+, https://aclanthology.org/2023.eacl-main.174/]
  • If the performance of an MLM ($M$), measured by a metric $g$, on a task $T$, with prompts learnt by a method $\Gamma$ remains stable under a small random perturbation $\delta$, then $\Gamma$ is defined to be robust w.r.t. $g$ on $T$ for $M$:
  $\mathbb{E}_{d \sim \Gamma}\big[\,|g(T, M(d)) - g(T, M(d + \delta))|\,\big] < \epsilon$

Slide 31

Slide 31 text

AutoPrompts are not Robust!
• Prompts learnt by AutoPrompt [Shin+2020] for fact extraction (on T-REx) using BERT and RoBERTa.
• Compared to Manual prompts, AP BERT/RoBERTa have much better performance.
• However, AutoPrompts are difficult to interpret (cf. humans would never write this stuff)

Slide 32

Slide 32 text

Token ordering
• Randomly re-order the tokens in a prompt and measure the drop in performance

Slide 33

Slide 33 text

Cross-dataset Evaluation
• If the prompts learnt from one dataset can also perform well on another dataset annotated for the same task, then the prompts generalise well

Slide 34

Slide 34 text

Lexical Semantic Changes
• Instead of adapting an entire LLM (costly), can we just predict the semantic change of a single word over a time period?
"Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings" [Aida+Bollegala, Findings of ACL 2023] https://arxiv.org/abs/2305.08654
[Embedded Figure 1 from the paper: t-SNE projections of BERT token vectors (dotted) of (a) gay and (b) cell in two time periods, with the average vector (starred) for each period; gay has lost its original meaning related to happy, while cell has gained the mobile-phone sense.]

Slide 35

Slide 35 text

Siblings are all you need
• Challenges
  • How to model the meaning of a word in a corpus?
    • Meaning depends on the context [Harris 1954]
  • How to compare the meaning of a word across corpora?
    • Depends on the representations learnt.
  • Lack of large-scale labelled datasets to learn semantic change prediction models
    • Must resort to unsupervised methods
• Solution
  • Each occurrence of a target word in a corpus can be represented by its own contextualised token embedding, obtained from a pre-trained/fine-tuned MLM.
  • The set of such vector embeddings can be approximated by a multivariate Gaussian (a full covariance is expensive, but can be approximated well with the diagonal)
  • We can sample from the two Gaussians representing the meaning of the target word in each corpus and then use any distance/divergence measure
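A minimal sketch of the sibling-distribution idea: fit a diagonal Gaussian to the target word's token embeddings in each corpus and compare the two, here with a closed-form KL divergence (the paper also considers sampling from the Gaussians and other distances, e.g. Chebyshev). The embeddings below are random placeholders for real sibling embeddings.

```python
import numpy as np

def fit_diag_gaussian(X):
    """X: (n_occurrences, dim) sibling embeddings of the target word."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) in closed form."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

rng = np.random.default_rng(0)
siblings_c1 = rng.normal(0.0, 1.0, size=(300, 768))   # occurrences of w in C1 (placeholder)
siblings_c2 = rng.normal(0.5, 1.2, size=(250, 768))   # occurrences of w in C2 (placeholder)

mu1, var1 = fit_diag_gaussian(siblings_c1)
mu2, var2 = fit_diag_gaussian(siblings_c2)
print("semantic change score:", kl_diag_gaussians(mu1, var1, mu2, var2))
```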

Slide 36

Slide 36 text

Comparisons against SoTA

Model                                                                    Spearman
Word2Gauss_light (averages word2vec, KL)                                 0.358
Word2Gauss (learnt from scratch, rotation, KL)                           0.399
MLM_temp, Cosine (FT by time masking BERT, avg. cosine distance)         0.467
MLM_temp, APD (avg. pairwise cosine distance over all siblings)          0.479
MLM_pre w/ Temp. Att. (pretrained BERT + temporal attention)             0.520
MLM_temp w/ Temp. Att. (FT by time masking BERT + temporal attention)    0.548
Proposed (sibling embeddings, multivariate full cov., Chebyshev)         0.529

Slide 37

Slide 37 text

Word Senses and Semantic Changes
• Hypothesis: if the distribution of word senses associated with a particular word has changed between two corpora, that word's meaning has changed.
[Figure: sense distributions in corpus-1 vs. corpus-2 for plane (Jensen-Shannon divergence = 0.221) and pin (Jensen-Shannon divergence = 0.027).]
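Scoring this hypothesis amounts to a divergence between two discrete sense distributions; a sketch with made-up sense counts (not taken from the paper):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# WSD-assigned sense counts of "plane" in each corpus (illustrative numbers).
senses_c1 = np.array([120.0, 30.0, 5.0])   # e.g. aircraft, flat surface, tool
senses_c2 = np.array([40.0, 90.0, 60.0])

p = senses_c1 / senses_c1.sum()
q = senses_c2 / senses_c2.sum()

# scipy returns the Jensen-Shannon *distance* (the square root of the divergence).
jsd = jensenshannon(p, q, base=2) ** 2
print("semantic change score:", jsd)
```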

Slide 38

Slide 38 text

Swapping is all you need!
• Hypothesis: if the meaning of a word has not changed between two corpora, the sibling distributions will be similar to those in the original corpora upon a random swapping of sentences.
[Figure: sentences s1 and s2 are randomly swapped between Corpus 1 (D1) and Corpus 2 (D2) to produce D1,swap and D2,swap.]
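A toy sketch of the swapping check with placeholder embeddings: exchange a random subset of occurrences between the two corpora and compare a simple distance before and after the swap (the test statistic used in the paper may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.normal(0.0, 1.0, size=(200, 768))   # sibling embeddings of w in corpus 1 (placeholder)
D2 = rng.normal(0.3, 1.0, size=(180, 768))   # sibling embeddings of w in corpus 2 (placeholder)

def swap(D1, D2, n_swaps=50):
    """Randomly exchange n_swaps occurrences between the two corpora."""
    i1 = rng.choice(len(D1), n_swaps, replace=False)
    i2 = rng.choice(len(D2), n_swaps, replace=False)
    D1_swap = np.vstack([np.delete(D1, i1, axis=0), D2[i2]])
    D2_swap = np.vstack([np.delete(D2, i2, axis=0), D1[i1]])
    return D1_swap, D2_swap

def mean_dist(A, B):
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))

D1s, D2s = swap(D1, D2)
# If w's meaning is stable, the swapped distributions should look much like the
# originals; a large change after swapping signals a semantic difference.
print(mean_dist(D1, D2), mean_dist(D1s, D2s))
```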

Slide 39

Slide 39 text

What is next…
• LLMs are trained to predict only the single choice made by the human writer and are unaware of the alternatives considered
  • Can we use LLMs to predict the output distributions considered by the human writer instead of the selected one?
• Time adaptation still requires fine-tuning, which is costly for LLMs.
  • Parameter Efficient Fine-Tuning (PEFT) methods (e.g. Adapters, LoRA, etc.) should be considered.
• Most words do not change their meaning (at least within shorter time intervals)
  • On-demand updates — only update words (and their contexts) that changed in meaning
• Periodic Temporal Shifts

Slide 40

Slide 40 text

Where are we going? 40

Slide 41

Slide 41 text

Where are we should we be going? 41

Slide 42

Slide 42 text

Where are we should we be going?
• Danushka's hot take
  • LLMs are great, and (some amount of) hype is good for the field. We could/should analyse the texts generated by LLMs to see how they differ (or not) from texts written by humans.
• But
  • I do not believe LLMs are "models" of language (rather, models that can generate language)
  • We need to love the exceptions! and not sweep them under the carpet. The types of mistakes made by a model tell more about what it understands than the ones it gets correct.
  • We are scared our papers will get rejected if we talk more about the mistakes our models make … this is bad science.

Slide 43

Slide 43 text

Questions
Danushka Bollegala
https://danushka.net
[email protected]
@Bollegala
Thank You