
Time Travel with Large Language Models


The meaning associated with a word is a dynamic phenomenon that varies with time. New meanings are constantly assigned to existing words, while new words are proposed to describe novel concepts. Despite this dynamic nature of lexical semantics, most NLP systems remain agnostic to the temporal effects of meaning change. For example, Large Language Models (LLMs), which act as the backbone of modern-day NLP systems, are often trained once, using a fixed snapshot of a corpus collected at some specific point in time. It is both costly and time-consuming to retrain LLMs from scratch on recent data. On the other hand, if we could somehow predict which words have had their meanings altered over time, we could perform on-demand fine-tuning of LLMs to reflect those changes in a timely manner. In this talk, I will first review various techniques that have been proposed in NLP research to predict the semantic change of words over time. I will then describe a lightweight prompt-based approach for the temporal adaptation of LLMs.

These are the slides from the keynote given at *SEM 2023 [https://sites.google.com/view/starsem2023/speakers]

Danushka Bollegala

August 05, 2023



Transcript

  1. Collaborators: Xiaohang Tang, Yi Zhou, Yoichi Ishibashi, Taichi Aida. "Mad scientist with Large Language Models"
  2. Why do word meanings change?
• New concepts/entities are associated with existing words (e.g. cell).
• Word re-usage promotes efficiency in human communication [cf. Polysemy, Ravin+Leacock'00]: 40% of the words in the Webster dictionary have more than two senses, while run has 29!
• Totally new words (neologisms) are coined to describe previously non-existent concepts/entities (e.g. ChatGPT).
• Semantics, morphology and syntax are strongly interrelated [Langacker+87, Hock+Joseph 19]: what counts as coherent, grammatical change over time? [Giulianelli+21]
[Slide shows the abstract of "Grammatical Profiling for Semantic Change Detection" (Giulianelli, Kutuzov and Pivovarova), with an example: lass (young woman → sweetheart), showing a drop in the plural form (lasses), and the Pokémon trainer class (girl in mini-skirt).]
  3. A Brief History of Word Embeddings
• Static Word Embeddings: word2vec [Mikolov+13], GloVe [Pennington+14], fastText [Bojanowski+17], …
• Contextualised Word Embeddings: BERT [Devlin+19], RoBERTa [Liu+19], ALBERT [Lan+20], …
• Dynamic Word Embeddings: Bernoulli embeddings [Rudolph+Blei 17], Diachronic word embeddings [Hamilton+16], …
• Dynamic Contextualised Word Embeddings: TempoBERT [Rosin+22], HistBERT [Qiu+22], TimeLMs [Loureiro+22], …
  4. Diachronic Word Embeddings
• Given multiple snapshots of corpora collected at different time steps, we could separately learn word embeddings from each snapshot. [Hamilton+16, Kulkarni+15, Loureiro+22]
• Pros: any word embedding learning method can be used.
• Cons:
  • Many models trained at different snapshots.
  • Difficult to compare word embeddings learnt from different corpora because no natural alignment exists (cf. even the sets of word embeddings obtained from different runs of the same algorithm cannot be compared, due to random initialisations).
[Slide shows a figure from Kulkarni+15 ("Statistically Significant Detection of Linguistic Change") tracing the nearest neighbours of gay from 1900 (cheerful, dapper) through 1950, 1975 and 1990 to 2005 (homosexual, lesbian, transgender).]
  5. Learning Alignments
• Different methods can be used to learn alignments between separately learnt vector spaces.
• Canonical Correlation Analysis (CCA) was used by Pražák+20 (ranked 1st in the SemEval 2020 Task 1 binary semantic change detection task), alongside a modification of the Orthogonal Transformation from VecMap [Artetxe+18].
• Projecting source embeddings into the target space: $\hat{X}_s = W_{s \to t} X_s$, where $X_s$ and $X_t$ are the source- and target-space embedding matrices.
• CCA computes two projections, $W_{s \to o}$ for the source space and $W_{t \to o}$ for the target space, into a shared space $o$, by minimising the negative correlation between the projected vector pairs:
$\operatorname*{argmin}_{W_{s \to o},\, W_{t \to o}} \sum_{i=1}^{n} -\frac{\operatorname{cov}(W_{s \to o} x^s_i,\ W_{t \to o} x^t_i)}{\sqrt{\operatorname{var}(W_{s \to o} x^s_i)\,\operatorname{var}(W_{t \to o} x^t_i)}}$
The source-to-target projection is then recovered as $W_{s \to t} = W_{s \to o} (W_{t \to o})^{-1}$.
• Further orthogonal constraints can be imposed on $W_{s \to t}$.
• However, aligning contextualised word embeddings is hard [Takahashi+Bollegala'22].
(A minimal alignment sketch follows below.)
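As a hedged illustration of the alignment idea, here is a minimal sketch of the orthogonal route (the orthogonal-constraint variant mentioned above, not Pražák+20's CCA implementation): learn an orthogonal map between two separately trained embedding spaces over their shared vocabulary, then score semantic change by the cosine distance between a word's aligned vectors. The embedding dictionaries `emb_src` and `emb_tgt` are assumed inputs.

```python
# A minimal sketch, assuming emb_src / emb_tgt map words to numpy vectors.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(emb_src, emb_tgt):
    """Learn an orthogonal W such that X_src @ W ~= X_tgt over the shared vocabulary."""
    shared = sorted(set(emb_src) & set(emb_tgt))   # anchor vocabulary
    X_src = np.stack([emb_src[w] for w in shared])
    X_tgt = np.stack([emb_tgt[w] for w in shared])
    W, _ = orthogonal_procrustes(X_src, X_tgt)     # minimises ||X_src W - X_tgt||_F
    return W

def change_score(word, emb_src, emb_tgt, W):
    """Cosine distance between a word's aligned source vector and its target vector."""
    u, v = emb_src[word] @ W, emb_tgt[word]
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```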
  6. Dynamic Embeddings
• Exponential Family Embeddings [Rudolph+16]: $x_i \mid x_{c_i} \sim \mathrm{ExpFam}(\eta_i(x_{c_i}),\, t(x_i))$
• Bernoulli Embeddings [Rudolph+Blei 17]: $x_{iv} \mid x_{c_i} \sim \mathrm{Bern}(\rho^{(t)}_{iv})$, where $\eta_{iv} = \rho^{(t_i)\top}_{v} \sum_{j \in c_i} \sum_{v'} \alpha_{v'} x_{j v'}$
• Dynamic Embeddings: embedding vectors $\rho^{(t)}_v$ are time-specific, while context vectors (parametrised by $\alpha_v$) are shared over time. A Gaussian random walk captures the drift of the embedding vectors across time slices $X^{(1)}, \dots, X^{(T)}$. (A toy sketch follows below.)
[Slide shows the graphical model from Rudolph+Blei 17: the embedding vectors of each term evolve over T time slices, while the context vectors are shared across all time slices.]
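As a toy illustration of this structure (not the Rudolph+Blei implementation; all shapes and names are assumptions), the sketch below sets up time-specific embedding vectors with a Gaussian random-walk log-prior and context vectors shared across time slices.

```python
# A toy sketch, assuming a small vocabulary and random parameters.
import numpy as np

rng = np.random.default_rng(0)
T, V, D = 5, 1000, 50              # time slices, vocabulary size, dimensionality
rho = rng.normal(size=(T, V, D))   # time-specific embedding vectors rho^(t)_v
alpha = rng.normal(size=(V, D))    # context vectors alpha_v, shared over time

def random_walk_log_prior(rho, sigma=0.1):
    """log p(rho) under rho^(t) ~ N(rho^(t-1), sigma^2 I), up to a constant."""
    diffs = rho[1:] - rho[:-1]
    return -0.5 * float(np.sum(diffs ** 2)) / sigma ** 2

def natural_parameter(t, v, context_word_ids):
    """eta: inner product of rho^(t)_v with the summed context vectors of the context words."""
    return float(rho[t, v] @ alpha[context_word_ids].sum(axis=0))
```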
  7. Dynamic Embeddings
[Slide shows the dynamic embedding of the word "intelligence", projected to a single dimension (y-axis), computed from (a) ACM abstracts (1951–2014) and (b) U.S. Senate speeches (1858–2009).]
  8. Time Masking (TempoBERT) [Rosin+22]
• Prepend the time stamp to each sentence in a corpus written at a specific time: "<2021> Joe Biden is the President of the USA"
• Mask out the time token like any other token during MLM training.
• Masking time tokens with a higher probability (e.g. 0.2) performs better.
• Predicting the time of a sentence: "[MASK] Joe Biden is the President of the USA"
• Probability distributions of the predicted time tokens can be used to compute semantic change scores for words. (A preprocessing sketch follows below, after the table.)

Table 3: Semantic change detection results on LiverpoolFC, SemEval-English, and SemEval-Latin (Pearson / Spearman correlations).

Method | LiverpoolFC | SemEval-Eng | SemEval-Lat
Del Tredici et al. [5] | 0.490 / – | – / – | – / –
Schlechtweg et al. [37] | 0.428 / 0.425 | 0.512 / 0.321 | 0.458 / 0.372
Gonen et al. [10] | – / – | 0.504 / 0.277 | 0.417 / 0.273
Martinc et al. [26] | 0.473 / 0.492 | – / 0.315 | – / 0.496
Montariol et al. [28] | 0.378 / 0.376 | 0.566 / 0.437 | – / 0.448
TempoBERT | 0.637 / 0.620 | 0.538 / 0.467 | 0.485 / 0.512

Works surprisingly well on multiple datasets and languages!
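A small preprocessing sketch of the time-masking idea. The `<year>` token format and the masking probabilities mirror the slide; the corpus format (a list of (year, sentence) pairs) and everything else are illustrative assumptions.

```python
# A minimal sketch of TempoBERT-style preprocessing, not the authors' code.
import random

def add_time_tokens(corpus):
    """Prepend a <year> time token to every sentence."""
    return [f"<{year}> {sentence}" for year, sentence in corpus]

def mask_for_mlm(tokens, time_mask_prob=0.2, token_mask_prob=0.15):
    """Mask the leading time token with a higher probability than ordinary tokens."""
    out = []
    for i, tok in enumerate(tokens):
        p = time_mask_prob if i == 0 else token_mask_prob   # position 0 = time token
        out.append("[MASK]" if random.random() < p else tok)
    return out

corpus = [(2021, "Joe Biden is the President of the USA")]
print(mask_for_mlm(add_time_tokens(corpus)[0].split()))
```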
  9. Temporal Attention [Rosin+Radinsky 22]
• Instead of changing the input text, change the attention mechanism in the Transformer to incorporate time.
• Input sequence $x^t_1, x^t_2, \ldots, x^t_n$, with input embeddings $x^t_i \in \mathbb{R}^D$ arranged as rows in $X^t \in \mathbb{R}^{n \times D}$.
• Query $Q = X^t W_Q$, Key $K = X^t W_K$, Value $V = X^t W_V$, Time $T = X^t W_T$ (here, $Q, K, V, T \in \mathbb{R}^{n \times d_k}$).
• $\mathrm{TemporalAttention}(Q, K, V, T) = \mathrm{softmax}\!\left(\frac{Q \frac{T^\top T}{\lVert T \rVert} K^\top}{\sqrt{d_k}}\right) V$
• Increases the parameters (memory), but empirical results show this overhead is negligible. (A numpy sketch follows below.)
[Slide shows Figure 2 from the paper: an illustration of the proposed temporal attention mechanism.]
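A numpy sketch of the formula above. This is illustrative rather than the authors' code; in particular, reading $\lVert T \rVert$ as the Frobenius norm is an assumption.

```python
# A minimal sketch of the temporal attention formula, with assumed shapes.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(Q, K, V, T):
    """softmax( Q (T^T T / ||T||) K^T / sqrt(d_k) ) V, with Q, K, V, T in R^{n x d_k}."""
    d_k = Q.shape[-1]
    time_term = (T.T @ T) / np.linalg.norm(T)     # (d_k, d_k) time interaction
    scores = Q @ time_term @ K.T / np.sqrt(d_k)   # (n, n) attention scores
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V, T = (rng.normal(size=(n, d_k)) for _ in range(4))
print(temporal_attention(Q, K, V, T).shape)       # (4, 8)
```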
  10. Dynamic Contextualised Word Embeddings [Hofmann+21]
• First, incorporate time $t_j$ and social context $s_i$ into the static word embedding of the $k$-th word: $e^{(k)}_{ij} = d(x^{(k)}, s_i, t_j)$
  • $x^{(k)}$: BERT input embeddings
  • $s_i$: learnt using a Graph Attention Network (GAT) [Veličković+18] applied to the social network
  • $t_j$: sampled from a zero-mean diagonal Gaussian
• Next, use these dynamic non-contextualised embeddings with BERT to create a contextualised version of them: $h^{(k)}_{ij} = \mathrm{BERT}(e^{(k)}_{ij}, s_i, t_j)$
  11. Learn vs. Adapt
• Temporal Adaptation: instead of training separate word embedding models from each snapshot taken at different time stamps, adapt a model from one point (current/past) in time to another (future) point in time. [Kulkarni+15, Hamilton+16, Loureiro+22]
• Benefits
  • Parameter efficiency: models trained on different snapshots share the same set of parameters, leading to smaller total model sizes.
  • Data efficiency: we might not have sufficient data at each snapshot (especially when the time intervals are short) to accurately train large models.
  12. Problem Setting
• Given a Masked Language Model (MLM) M, and two corpora (snapshots) $C_1$ and $C_2$, taken at two different times $T_1$ and $T_2$ ($T_2 > T_1$), adapt M from $T_1$ to $T_2$ such that it can represent the meanings of words at $T_2$.
• Remarks
  • M does not have to be trained on $C_1$ (or $C_2$).
  • We do not care whether M can accurately represent the meanings of words at $T_1$.
  • M is both contextualised as well as dynamic (time-sensitive); hence, a Dynamic Contextualised Word Embedding (DCWE)!
  13. Prompt-based Temporal Adaptation
• How do we connect two corpora collected at two different points in time?
• Pivots ($w$): words that occur in both $C_1$ and $C_2$.
• Anchors ($u$, $v$): words that are associated with a pivot in either $C_1$ or $C_2$, but not both.
  • $u$ is associated with $w$ in $C_1$, whereas $v$ is associated with $w$ in $C_2$.
• Temporal Prompt: "$w$ is associated with $u$ in $T_1$, whereas it is associated with $v$ in $T_2$"
• Example: (mask, hide, vaccine), $T_1$ = 2010, $T_2$ = 2020
  • "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020" (see the sketch below)
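A tiny sketch instantiating a temporal prompt from a (pivot, anchor, anchor) tuple, using the manual template from this slide:

```python
# Template string mirrors the manual temporal prompt on this slide.
TEMPLATE = "{w} is associated with {u} in {t1}, whereas it is associated with {v} in {t2}"

def make_prompt(w, u, v, t1, t2):
    return TEMPLATE.format(w=w, u=u, v=v, t1=t1, t2=t2)

print(make_prompt("mask", "hide", "vaccine", 2010, 2020))
# -> mask is associated with hide in 2010, whereas it is associated with vaccine in 2020
```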
  14. Frequency-based Tuple Selection
• Pivot selection: if a word occurs frequently in both corpora, it is likely to be time-invariant (domain-independent) [Bollegala+15]:
$\mathrm{score}(w) = \min(f(w, C_1),\, f(w, C_2))$, where $f(w, C)$ is the frequency of $w$ in corpus $C$.
• Anchor selection: words in each corpus that have high pointwise mutual information (PMI) with a pivot are likely to be good anchors:
$\mathrm{PMI}(w, x; C) = \log\left(\frac{p(w, x)}{p(w)\,p(x)}\right)$
(A counting-based sketch follows below.)
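A counting-based sketch of both scores. The input format (tokenised corpora as lists of token lists) and the probability estimates (sentence-level co-occurrence counts) are assumptions; the slide does not fix these details.

```python
# A minimal sketch, assuming corpora are lists of token lists.
from collections import Counter
from itertools import combinations
import math

def count_corpus(sentences):
    """Unigram counts and sentence-level co-occurrence counts."""
    unigrams, cooc, n_pairs = Counter(), Counter(), 0
    for sent in sentences:
        unigrams.update(sent)
        for a, b in combinations(sorted(set(sent)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
            n_pairs += 2
    return unigrams, cooc, n_pairs

def pivot_score(word, unigrams1, unigrams2):
    """score(w) = min(f(w, C1), f(w, C2)); high for time-invariant pivots."""
    return min(unigrams1[word], unigrams2[word])

def pmi(w, x, unigrams, cooc, n_pairs):
    """PMI(w, x; C) = log( p(w, x) / (p(w) p(x)) )."""
    n = sum(unigrams.values())
    p_wx = cooc[(w, x)] / n_pairs
    if p_wx == 0.0:
        return float("-inf")
    return math.log(p_wx / ((unigrams[w] / n) * (unigrams[x] / n)))
```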
  15. Diversity-based Tuple Selection
• The anchors that have high PMI with a pivot in both corpora could be similar, resulting in useless prompts for temporal adaptation.
• Add a diversity penalty on pivots (see the sketch below):
$\mathrm{diversity}(w) = 1 - \frac{|\mathcal{U}(w) \cap \mathcal{V}(w)|}{|\mathcal{U}(w) \cup \mathcal{V}(w)|}$
  • $\mathcal{U}(w)$: set of anchors associated with $w$ in $C_1$
  • $\mathcal{V}(w)$: set of anchors associated with $w$ in $C_2$
• Select pivots $w$ that score high on diversity, and create tuples $(w, u, v)$ by selecting the corresponding anchors.
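The diversity penalty is one minus the Jaccard similarity between the pivot's two anchor sets; a one-function sketch:

```python
def diversity(anchors_c1: set, anchors_c2: set) -> float:
    """diversity(w) = 1 - |U(w) ∩ V(w)| / |U(w) ∪ V(w)|"""
    union = anchors_c1 | anchors_c2
    if not union:
        return 0.0
    return 1.0 - len(anchors_c1 & anchors_c2) / len(union)

print(diversity({"hide", "cover"}, {"vaccine", "cover"}))   # 1 - 1/3 ~= 0.67
```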
  16. Context-based Tuple Selection
• Two issues with the frequency- and diversity-based tuple selection methods:
  • Co-occurrences can be sparse (especially in small corpora), which can make PMI overestimate the association between words.
  • The contexts of the co-occurrences are not considered.
• Solution: use contextualised word embeddings.
• A word $x$ is represented by averaging its token embedding $M(x, d)$ over all of its occurrences $d \in \mathcal{D}(x)$:
$x = \frac{1}{|\mathcal{D}(x)|} \sum_{d \in \mathcal{D}(x)} M(x, d)$
• Compute two embeddings for $x$, namely $x_1$ and $x_2$, respectively from $C_1$ and $C_2$, and score tuples as (see the sketch below):
$\mathrm{score}(w, u, v) = g(w_1, u_1) + g(w_2, v_2) - g(w_2, u_2) - g(w_1, v_1)$
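A sketch of the context-based score, taking the similarity function $g$ to be cosine similarity (an assumption; the slide leaves $g$ unspecified):

```python
# A minimal sketch; occurrence_vectors are contextualised token embeddings from an MLM.
import numpy as np

def g(a, b):
    """Similarity between two word embeddings; cosine similarity assumed here."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def word_embedding(occurrence_vectors):
    """Average a word's contextualised token embeddings over all its occurrences."""
    return np.mean(occurrence_vectors, axis=0)

def tuple_score(w1, w2, u1, u2, v1, v2):
    """score(w, u, v) = g(w1, u1) + g(w2, v2) - g(w2, u2) - g(w1, v1)"""
    return g(w1, u1) + g(w2, v2) - g(w2, u2) - g(w1, v1)
```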
  17. Automatic Template Learning
• Given a tuple (extracted by any of the previously described methods), can we generate the templates?
  • e.g. mask <is associated with> hide <in> 2010 <and associated with> vaccine <in> 2020
• Find two sentences $S_1$ and $S_2$ containing $u$ and $v$, and use T5 [Raffel+'20] to generate the slots $\langle Z_1 \rangle$, $\langle Z_2 \rangle$, $\langle Z_3 \rangle$ and $\langle Z_4 \rangle$ in
$T_g(u, v, T_1, T_2):\quad S_1, S_2 \;\to\; S_1 \,\langle Z_1 \rangle\, u \,\langle Z_2 \rangle\, T_1 \,\langle Z_3 \rangle\, v \,\langle Z_4 \rangle\, T_2 \, S_2$
• The length of each slot is not predefined; one token is generated at a time until the next non-slot token (i.e. $u$, $T_1$, $v$, $T_2$) is encountered.
• Select the templates that have high likelihood with all tuples. [Gao+'21]
• Use beam search with a large beam width (e.g. 100) to generate a diverse set of templates.
• Substitute tuples into the generated templates to create Automatic prompts.
  18. Examples of Prompts

Template | Type
⟨w⟩ is associated with ⟨u⟩ in ⟨T1⟩, whereas it is associated with ⟨v⟩ in ⟨T2⟩. | Manual
Unlike in ⟨T1⟩, where ⟨u⟩ was associated with ⟨w⟩, in ⟨T2⟩ ⟨v⟩ is associated with ⟨w⟩. | Manual
The meaning of ⟨w⟩ changed from ⟨T1⟩ to ⟨T2⟩ respectively from ⟨u⟩ to ⟨v⟩. | Manual
⟨u⟩ in ⟨T1⟩ ⟨v⟩ in ⟨T2⟩ | Automatic
⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩ | Automatic
The ⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩ | Automatic

Table 1: Experimented templates. "Manual" denotes that the template is manually written, whereas "Automatic" denotes that it is automatically generated.

• Automatic prompts tend to be short and less diverse.
• Emphasising high likelihood results in shorter prompts.
  19. Fine-tuning on Temporal Prompts
• Add a language modelling head to the pre-trained MLM and fine-tune it such that it can correctly predict the masked-out tokens in a prompt:
  • "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"
• We mask all tokens at random during fine-tuning. (A minimal fine-tuning sketch follows below.)
• Masking only anchors did not improve performance significantly.
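A minimal fine-tuning sketch (an assumed setup, not the paper's exact training code) using the Hugging Face transformers and datasets libraries; the standard MLM collator implements the "mask all tokens at random" choice described above.

```python
# A minimal sketch of fine-tuning an MLM on temporal prompts.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

prompts = ["mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": prompts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True), remove_columns=["text"])

# Randomly masks 15% of the tokens in each prompt; every token is a candidate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="ft-temporal", num_train_epochs=1),
                  train_dataset=dataset, data_collator=collator)
trainer.train()
```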
  20. Experiments
• Datasets
  • Yelp: publicly available reviews covering the years 2010 (T1) and 2020 (T2).
  • Reddit: all comments from September 2019 (T1) and April 2020 (T2), which reflects the effects of the COVID-19 pandemic.
  • ArXiv: abstracts of papers published in 2010 (T1) and 2020 (T2).
  • Ciao: reviews from the years 2010 (T1) and 2020 (T2). [Tang+'12]
• Baselines
  • Original BERT: pre-trained BERT-base-uncased
  • BERT(T1): the original BERT fine-tuned on the training data sampled at T1.
  • BERT(T2): the original BERT fine-tuned on the training data sampled at T2.
  • Proposed: FT(model, template)
  21. Results — Temporal Adaptation
• Evaluation metric: perplexity (lower is better) for generating test sentences in T2.
• The best result in each block is shown in bold on the slide, while the overall best is indicated by †.

MLM | Yelp | Reddit | ArXiv | Ciao
Original BERT | 15.125 | 25.277 | 11.142 | 12.669
FT (BERT, Manual) | 14.562 | 24.109 | 10.849 | 12.371
FT (BERT, Auto) | 14.458 | 23.382 | 10.903 | 12.394

BERT(T1) | 5.543 | 9.287 | 5.854 | 7.423
FT (BERT(T1), Manual) | 5.534 | 9.327 | 5.817 | 7.334
FT (BERT(T1), Auto) | 5.541 | 9.303 | 5.818 | 7.347

BERT(T2) | 4.718 | 8.927 | 3.500 | 5.840
FT (BERT(T2), Manual) | 4.714 | 8.906† | 3.500 | 5.813†
FT (BERT(T2), Auto) | 4.708† | 8.917 | 3.499† | 5.827
  22. Results — Comparisons against SoTA
• FT (Proposed) has the lowest perplexities across all datasets.
• CWE (Contextualised Word Embeddings): the BERT model used by Hofmann+21
• DCWE (Dynamic CWE): proposed by Hofmann+21

MLM | Yelp | Reddit | ArXiv | Ciao
FT (BERT(T2), Manual) | 4.714 | 8.906† | 3.499 | 5.813†
FT (BERT(T2), Auto) | 4.708† | 8.917 | 3.499† | 5.827
TempoBERT [Rosin+2022] | 5.516 | 12.561 | 3.709 | 6.126
CWE [Hofmann+2021] | 4.723 | 9.555 | 3.530 | 5.910
DCWE [temp. only] [Hofmann+2021] | 4.723 | 9.631 | 3.515 | 5.899
DCWE [temp. + social] [Hofmann+2021] | 4.720 | 9.596 | 3.513 | 5.902
  23. Pivots and Anchors
• Anecdote:
  • burgerville and joes are restaurants that were popular in 2010, but due to the lockdowns, takeaways such as dominos have become associated with place in 2020.
  • clerk is less used now and is getting replaced by administrator, operator, etc.

Pivot (w) | Anchors (u, v)
place | (burgerville, takeaway), (burgerville, dominos), (joes, dominos)
service | (doorman, staffs), (clerks, personnel), (clerks, administration)
phone | (nokia, iphone), (nokia, ipod), (nokia, blackberry)
service | (clerk, administrator), (doorman, staff), (clerk, operator)
  24. Let's talk about Prompting
• There are many types of prompts currently in use:
• Few-shot prompting: give some examples and ask the LLM to generalise from them (cf. in-context learning).
  • e.g. "If man is to woman then king is to what?"
• Zero-shot/instruction prompting: describe the task that needs to be performed by the LLM.
  • e.g. "Translate the following sentence from Japanese to English: 言語モデルはすごいです。" (Language models are amazing.)
  25. Robustness of Prompting?
• Humans have a latent intent that they want to express to an LLM using a short text snippet; a prompt is a surface realisation of this latent intent.
• Prompting is a many-to-one mapping, with multiple surface realisations possible for a single latent intent inside the human brain.
• It is OK for prompts to be different, as long as they all align to the same latent intent (and hopefully give the same level of performance).
• Robustness of a Prompt Learning Method [Ishibashi+ https://aclanthology.org/2023.eacl-main.174/]
  • If the performance of an MLM $M$, measured by a metric $g$ on a task $T$ with prompts learnt by a method $\Gamma$, remains stable under a small random perturbation $\delta$, then $\Gamma$ is defined to be robust w.r.t. $g$ on $T$ for $M$ (see the sketch below):
$\mathbb{E}_{d \sim \Gamma}\left[\, \lvert g(T, M(d)) - g(T, M(d + \delta)) \rvert \,\right] < \epsilon$
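A toy sketch of estimating the robustness quantity empirically. The `evaluate` callback (returning the metric $g$ on task $T$ via the MLM $M$) and the particular choice of perturbation $\delta$ are assumptions.

```python
# A minimal Monte-Carlo sketch of the robustness definition above.
import random

def perturb(prompt: str) -> str:
    """One possible perturbation delta: delete a random token from the prompt."""
    tokens = prompt.split()
    if len(tokens) > 1:
        tokens.pop(random.randrange(len(tokens)))
    return " ".join(tokens)

def robustness_gap(prompts, evaluate, n_trials=10):
    """Estimate E_{d~Gamma}[ |g(T, M(d)) - g(T, M(d + delta))| ]."""
    gaps = [abs(evaluate(d) - evaluate(perturb(d)))
            for d in prompts for _ in range(n_trials)]
    return sum(gaps) / len(gaps)   # robust if this falls below a tolerance epsilon
```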
  26. AutoPrompts are not Robust!
• Prompts learnt by AutoPrompt [Shin+2020] for fact extraction (on T-REx) using BERT and RoBERTa.
• Compared to manual prompts, AP BERT/RoBERTa achieve much better performance.
• However, AutoPrompts are difficult to interpret (cf. humans would never write this stuff).
  27. Cross-dataset Evaluation
• If the prompts learnt from one dataset can also perform well on another dataset annotated for the same task, then the prompts generalise well.
  28. Lexical Semantic Changes
• Instead of adapting an entire LLM (costly), can we just predict the semantic change of a single word over a time period?
• Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings [Aida+Bollegala, Findings of ACL 2023] https://arxiv.org/abs/2305.08654
[Slide shows the paper's Figure 1: t-SNE projections of BERT token vectors in two time periods and the average vector for each period; gay has lost its original meaning related to happy, while cell has gained the mobile-phone sense.]
  29. Siblings are all you need
• Challenges
  • How to model the meaning of a word in a corpus? Meaning depends on the context. [Harris 1954]
  • How to compare the meaning of a word across corpora? Depends on the representations learnt.
  • Lack of large-scale labelled datasets to learn semantic change prediction models; we must resort to unsupervised methods.
• Solution
  • Each occurrence of a target word in a corpus can be represented by its own contextualised token embedding, obtained from a pre-trained/fine-tuned MLM.
  • The set of vector embeddings can be approximated by a multivariate Gaussian (the full covariance is expensive, but can be approximated well with the diagonal).
  • We can sample from the two Gaussians representing the meaning of the target word in each corpus, and then use any distance/divergence measure. (See the sketch below.)
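A sketch of the sibling-distribution idea, under assumed inputs (per-occurrence contextualised embeddings of one target word from each corpus): fit a diagonal Gaussian per corpus, sample, and average a distance over sample pairs. Chebyshev distance is used here, as one of the measures mentioned on the next slide.

```python
# A minimal sketch; sib_c1 / sib_c2 are (n_occurrences, dim) embedding matrices.
import numpy as np
from scipy.spatial.distance import chebyshev

def fit_diag_gaussian(siblings):
    """Mean and per-dimension variance of a word's sibling embeddings."""
    return siblings.mean(axis=0), siblings.var(axis=0) + 1e-8

def change_score(sib_c1, sib_c2, n_samples=1000, seed=0):
    """Average Chebyshev distance between samples drawn from the two Gaussians."""
    rng = np.random.default_rng(seed)
    mu1, var1 = fit_diag_gaussian(sib_c1)
    mu2, var2 = fit_diag_gaussian(sib_c2)
    s1 = rng.normal(mu1, np.sqrt(var1), size=(n_samples, mu1.size))
    s2 = rng.normal(mu2, np.sqrt(var2), size=(n_samples, mu2.size))
    return float(np.mean([chebyshev(a, b) for a, b in zip(s1, s2)]))
```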
  30. Comparisons against SoTA

Model | Spearman
Word2Gauss-light (averaged word2vec, KL) | 0.358
Word2Gauss (learnt from scratch, rotation, KL) | 0.399
MLM-temp, Cosine (FT by time-masking BERT, avg. cosine distance) | 0.467
MLM-temp, APD (avg. pairwise cosine distance over all siblings) | 0.479
MLM-pre w/ Temp. Att. (pretrained BERT + temporal attention) | 0.520
MLM-temp w/ Temp. Att. (FT by time-masking BERT + temporal attention) | 0.548
Proposed (Sibling embeddings, multivariate full cov., Chebyshev) | 0.529
  31. Word Senses and Semantic Changes
• Hypothesis: if the distribution of word senses associated with a particular word has changed between two corpora, that word's meaning has changed. (See the sketch below.)
[Slide compares sense distributions in corpus 1 and corpus 2 for plane (Jensen-Shannon divergence = 0.221) and pin (Jensen-Shannon divergence = 0.027).]
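A short sketch scoring semantic change as the Jensen-Shannon divergence between a word's sense distributions in two corpora; the sense probabilities below are made-up illustrative numbers.

```python
# A minimal sketch; the sense distributions are illustrative, not real data.
import numpy as np
from scipy.spatial.distance import jensenshannon

senses_c1 = np.array([0.70, 0.20, 0.10])   # p(sense | word, corpus 1)
senses_c2 = np.array([0.15, 0.25, 0.60])   # p(sense | word, corpus 2)

# scipy returns the Jensen-Shannon *distance*, the square root of the divergence.
jsd = jensenshannon(senses_c1, senses_c2, base=2) ** 2
print(f"JSD = {jsd:.3f}")
```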
  32. Swapping is all you need!
• Hypothesis: if the meaning of a word has not changed between two corpora, the sibling distributions will remain similar to those in the original corpora after a random swapping of sentences between the corpora.
[Slide shows a diagram: sentences s1 and s2 are swapped between Corpus 1 (D1) and Corpus 2 (D2) to produce D1,swap and D2,swap.]
  33. What is next…
• LLMs are trained to predict only the single choice made by the human writer, and are unaware of the alternatives considered.
  • Can we use LLMs to predict the output distributions considered by the human writer, instead of the selected one?
• Temporal adaptation still requires fine-tuning, which is costly for LLMs.
  • Parameter-Efficient Fine-Tuning (PEFT) methods (e.g. Adapters, LoRA, etc.) should be considered.
• Most words do not change their meaning (at least within shorter time intervals).
  • On-demand updates: only update words (and their contexts) that changed in meaning.
• Periodic temporal shifts.
  34. Where are we, and where should we be going?
• Danushka's hot take
  • LLMs are great, and (some amount of) hype is good for the field. We could/should analyse the texts generated by LLMs to see how they differ (or not) from those written by humans.
• But
  • I do not believe LLMs are "models" of language (rather, models that can generate language).
  • We need to love the exceptions, and not sweep them under the carpet. The types of mistakes made by a model tell us more about what it understands than the ones it gets correct.
  • We are scared our papers will get rejected if we talk more about the mistakes our models make … this is bad science.