
Time Travel with Large Language Models


The meaning associated with a word is a dynamic phenomenon that varies with time. New meanings are constantly assigned to existing words, while new words are proposed to describe novel concepts. Despite this dynamic nature of lexical semantics, most NLP systems remain agnostic to the temporal effects of meaning change. For example, Large Language Models (LLMs) that act as the backbone of modern-day NLP systems are often trained once, using a fixed snapshot of a corpus collected at some specific point in time. It is both costly and time-consuming to retrain LLMs from scratch on recent data. On the other hand, if we can somehow predict which words have their meanings altered over time, we could perform on-demand fine-tuning of LLMs to reflect those changes in a timely manner. In this talk, I will first review various techniques that have been proposed in NLP research to predict the semantic change of words over time. I will then describe a lightweight prompt-based approach for the temporal adaptation of LLMs.

These are the slides from the keynote given at *SEM 2023 [https://sites.google.com/view/starsem2023/speakers]

Danushka Bollegala

August 05, 2023


Transcript

  1. Time Travel with Large Language Models
    Danushka Bollegala


  2. 2
    Mad scientist


  3. 3
    Mad scientist with Large Language Models
    Xiaohang Tang, Yi Zhou, Yoichi Ishibashi, Taichi Aida


  4. Time and Meaning — Cell
    4
    Robert Hooke (1665)
    Martin Cooper (1973)


  5. Time and Meaning — Corona
    5


  6. Why do word meanings change?
    • New concepts/entities are associated with existing words (e.g. cell)

    • Word re-usage promotes efficiency in human communication [cf. Polysemy, Ravin+Leacock'00]

    • 40% of words in the Webster dictionary have more than two senses, while run has 29!

    • Totally new words (neologisms) are coined to describe previously non-existent concepts/entities (e.g. ChatGPT)

    • Semantics, morphology and syntax are strongly interrelated [Langacker+87, Hock+Joseph 19]

    • what counts as coherent, grammatical change over time [Giulianelli+21]
    6
    [Screenshot of Giulianelli+21, "Grammatical Profiling for Semantic Change Detection" (Mario Giulianelli, ILLC, University of Amsterdam; Andrey Kutuzov, University of Oslo; Lidia Pivovarova, University of Helsinki). Abstract: semantics, morphology and syntax are strongly interdependent, yet the majority of computational methods for semantic change detection use distributional word representations, which encode mostly semantics. Grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words, can be used for semantic change detection and even outperforms some distributional semantic methods.]

    Example — lass: Young Woman → sweetheart (drop in the plural form lasses); Pokémon trainer class (girl in mini-skirt)


  7. A Brief History of Word Embeddings
    7
    Static Word Embeddings
    word2vec [Mikolov+13], GloVe [Pennington+14], fastText [Bojanowski+17],…
    Contextualised Word Embeddings
    BERT [Devlin+19], RoBERTa [Liu+19], ALBERT [Lan+20], …
    Dynamic Word Embeddings
    Bernoulli embeddings [Rudolph+Blei 17],


    Diachronic word embeddings [Hamilton+16], …
    Dynamic Contextualised Word Embeddings
    TempoBERT [Rosin+22], HistBERT [Qiu+22], TimeLMs [Loureiro+22], …


  8. Diachronic Word Embeddings
    • Given multiple snapshots of corpora collected at different time steps, we could separately learn word embeddings from each snapshot. [Hamilton+16, Kulkarni+15, Loureiro+22]

    • Pros: Any word embedding learning method can be used

    • Cons:

    • Many models trained at different snapshots.

    • Difficult to compare word embeddings learnt from different corpora because no natural alignment exists (cf. even the sets of word embeddings obtained from different runs of the same algorithm cannot be compared due to random initialisations)
    8
    [Screenshot of Kulkarni+15, "Statistically Significant Detection of Linguistic Change" (Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, Steven Skiena; Stony Brook University, USA). The figure shows the semantic trajectory of the word gay across 1900, 1950, 1975, 1990 and 2005: its neighbourhood drifts from words such as cheerful, dapper, courteous and sublimely towards words such as homosexual, lesbian and transgender.]


  9. Learning Alignments
    • Different methods can be used to learn alignments between separately learnt vector spaces

    • Canonical Correlation Analysis (CCA) was used by Pražák+20 (ranked 1st for the SemEval 2020 Task 1 binary semantic change detection task)

    • Projecting source to target embeddings: X̂s = Ws→t Xs

    • CCA: project both spaces into a shared space o via Ws→o and Wt→o, then Ws→t = Ws→o (Wt→o)⁻¹

    • Further orthogonal constraints can be used on Ws→t

    • However, aligning contextualised word embeddings is hard [Takahashi+Bollegala'22]
    [Excerpt from Pražák+20, system description:] First, two semantic spaces are trained from corpora C1 and C2 using word2vec skip-gram with negative sampling (Mikolov et al., 2013), represented by a source-space matrix Xs and a target-space matrix Xt. A cross-lingual mapping method projects the two spaces into a shared space, giving X̂s and X̂t: either Canonical Correlation Analysis (CCA), using the implementation of Brychcín et al. (2019), or a modification of the Orthogonal Transformation from VecMap (Artetxe et al., 2018b). Both methods are linear transformations:

    X̂s = Ws→t Xs   (1)

    where Ws→t performs a linear transformation from the source space s into the target space t (Xt does not have to be transformed because it is already in the target space, i.e. X̂t = Xt).

    Generally, CCA transforms both spaces Xs and Xt into a third shared space o (where Xs ≠ X̂s and Xt ≠ X̂t). Thus, CCA computes two transformation matrices, Ws→o for the source space and Wt→o for the target space, by minimising the negative correlation between the vectors xsᵢ ∈ Xs and xtᵢ ∈ Xt projected into the shared space o. The negative correlation is defined as:

    argmin_{Ws→o, Wt→o} Σᵢ₌₁ⁿ ρ(Ws→o xsᵢ, Wt→o xtᵢ) = Σᵢ₌₁ⁿ cov(Ws→o xsᵢ, Wt→o xtᵢ) / √(var(Ws→o xsᵢ) × var(Wt→o xtᵢ))   (2)

    where cov is the covariance, var is the variance and n is the number of vectors. In this implementation of CCA, X̂t = Xt because only the source space s is transformed into the target space t from the common shared space with a pseudo-inversion; the target space does not change. The matrix Ws→t for this transformation is then given by:

    Ws→t = Ws→o (Wt→o)⁻¹   (3)

    Submissions using CCA are referred to as cca-nn, cca-bin, cca-nn-r and cca-bin-r, where -r means the source and target spaces are reversed, and -nn/-bin refer to the type of threshold used in Sub-task 1. For the Orthogonal Transformation, the submissions are referred to as ort & uns, using a supervised seed dictionary consisting of all words common to both corpora.
    9

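Not part of the slides: the alignment step can be sketched with the closed-form orthogonal Procrustes solution, one way of imposing the orthogonal constraint mentioned above. All names are illustrative, and NumPy stands in for an actual embedding toolkit.

```python
import numpy as np

def orthogonal_align(X_s, X_t):
    """Learn an orthogonal map W minimising ||X_s W - X_t||_F
    (closed-form Procrustes solution via the SVD of X_s^T X_t)."""
    U, _, Vt = np.linalg.svd(X_s.T @ X_t)
    return U @ Vt  # W such that X_s @ W approximates X_t

# Toy example: rows are embeddings of the same anchor words in two spaces.
rng = np.random.default_rng(0)
X_t = rng.normal(size=(100, 8))
R = np.linalg.qr(rng.normal(size=(8, 8)))[0]  # a random orthogonal map
X_s = X_t @ R.T                               # source = rotated target
W = orthogonal_align(X_s, X_t)
print(np.allclose(X_s @ W, X_t, atol=1e-8))   # True: the rotation is recovered
```

Because W is constrained to be orthogonal, distances within the source space are preserved; only the coordinate frame is aligned with the target space.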

  10. Dynamic Embeddings
    • Exponential Family Embeddings [Rudolph+16]

      xᵢ | x_cᵢ ∼ ExpFam(ηᵢ(x_cᵢ), t(xᵢ))

    • Bernoulli Embeddings [Rudolph+Blei 17]

      x_iv | x_cᵢ ∼ Bern(ρ⁽ᵗ⁾_iv), where η_iv = ρ(tᵢ)ᵥ⊤ ( Σ_{j∈cᵢ} Σ_{v′} α_{v′} x_{jv′} )

    • Dynamic Embeddings

    • embedding vectors ρ⁽ᵗ⁾ᵥ are time-specific, while context vectors (parametrised by αᵥ) are shared over time
    10
    [Screenshot of Rudolph+Blei 17, §2: dynamic embeddings are a type of exponential family embedding that captures sequential changes in the representation of the data, focusing on text and the Bernoulli embedding model. Related dynamic topic models use a Gaussian random walk to capture drift in the underlying language model (Blei and Lafferty; Wang et al.; Gerrish and Blei; Frermann and Lapata). Figure 2 caption: graphical representation for text data in T time slices X(1), …, X(T); the embedding vectors of each term evolve over time, while the context vectors are shared across all time slices.]


  11. Dynamic Embeddings
    11
    The dynamic embedding of the word "intelligence" computed from (a) the ACM abstracts (1951–2014) and (b) U.S. Senate speeches (1858–2009), projected to a single dimension (y-axis).


  12. Time Masking (TempoBERT) [Rosin+22]
    • Prepend the time stamp to each sentence in a corpus written at a specific time.

    • Mask out the time token similar to other tokens during MLM training

    • Masking time tokens with a higher probability (e.g. 0.2) performs better

    • Predicting the time of a sentence

    • [MASK] Joe Biden is the President of the USA

    • Probability distributions of the predicted time-tokens can be used to compute semantic change scores for words
    12
    <2021> Joe Biden is the President of the USA
    Table 3: Semantic change detection results on LiverpoolFC, SemEval-English, and SemEval-Latin.
    Method
    LiverpoolFC SemEval-Eng SemEval-Lat
    Pearson Spearman Pearson Spearman Pearson Spearman
    Del Tredici et al. [5] 0.490 – – – – –
    Schlechtweg et al. [37] 0.428 0.425 0.512 0.321 0.458 0.372
    Gonen et al. [10] – – 0.504 0.277 0.417 0.273
    Martinc et al. [26] 0.473 0.492 – 0.315 – 0.496
    Montariol et al. [28] 0.378 0.376 0.566 0.437 – 0.448
    TempoBERT 0.637 0.620 0.538 0.467 0.485 0.512
    works surprisingly well on multiple datasets and languages!

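As a rough illustration of the time-masking idea (not the TempoBERT implementation), the preprocessing step can be sketched as: prepend a time token to each sentence, then mask it with an elevated probability during MLM training. The function name and the probability value are illustrative.

```python
import random

def time_mask(sentence, year, p_time_mask=0.2):
    """Sketch of TempoBERT-style preprocessing: prepend a <year> time token,
    and replace it with [MASK] with probability p_time_mask (higher than the
    usual 0.15 used for ordinary tokens)."""
    time_token = f"<{year}>"
    if random.random() < p_time_mask:
        time_token = "[MASK]"
    return f"{time_token} {sentence}"

random.seed(0)
print(time_mask("Joe Biden is the President of the USA", 2021))
# "<2021> Joe Biden is the President of the USA"
```

At inference, feeding the sentence with a masked time slot and reading off the MLM's distribution over time tokens gives the time-prediction probabilities mentioned on the slide.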

  13. Temporal Attention [Rosin+Radinsky 22]
    • Instead of changing the input text, change the attention mechanism in the Transformer to incorporate time.

    • Input sequence xᵗ₁, xᵗ₂, …, xᵗₙ, with input embeddings xᵗᵢ ∈ ℝᴰ arranged as rows in Xᵗ ∈ ℝⁿˣᴰ

    • Query Q = XᵗWQ, Key K = XᵗWK, Value V = XᵗWV, Time T = XᵗWT (Here, Q, K, V, T ∈ ℝⁿˣᵈᵏ)

    • TemporalAttention(Q, K, V, T) = softmax( Q (T⊤T / ||T||) K⊤ / √dk ) V

    • Increases the parameters (memory) but empirical results show this overhead is negligible.
    13
    [Screenshot of Rosin+Radinsky 22: earlier work trains embeddings per time point and then compares them between different time points (Jatowt and Duh, 2014; Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016; Dubossarsky et al., 2019; Del Tredici et al., 2019). Gonen et al. (2020) used a simple nearest-neighbours-based approach to detect semantically-changed words. Others learned time-aware embeddings simultaneously over all time points to resolve the alignment problem, by regularisation (Yao et al., 2018), modelling word usage as a function of time (Rosenfeld and Erk, 2018), Bayesian skip-gram (Bamler and Mandt, 2017), or exponential family embeddings (Rudolph and Blei, 2018). All these methods limit the representation of each word to a single meaning, ignoring ambiguity. Figure 2: illustration of the proposed temporal attention mechanism.]
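A minimal NumPy sketch of an attention score modulated by a time projection, following the formula on the slide; the exact normalisation placement in the published model may differ, and all weight matrices here are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(X, W_Q, W_K, W_V, W_T):
    """Scaled dot-product attention with an extra time projection T = X W_T
    inserted between queries and keys, as in the slide's formula."""
    Q, K, V, T = X @ W_Q, X @ W_K, X @ W_V, X @ W_T
    d_k = Q.shape[-1]
    scores = Q @ (T.T @ T / np.linalg.norm(T)) @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, D, d_k = 5, 16, 8
X = rng.normal(size=(n, D))
W_Q, W_K, W_V, W_T = (rng.normal(size=(D, d_k)) for _ in range(4))
out = temporal_attention(X, W_Q, W_K, W_V, W_T)
print(out.shape)  # (5, 8)
```

Note that only W_T is new relative to vanilla attention, which is why the parameter overhead reported on the slide is small.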

  14. Dynamic Contextualised Word Embeddings
    • First, incorporate time tⱼ and social context sᵢ into the static word embedding x⁽ᵏ⁾ of the k-th word:

      e⁽ᵏ⁾ᵢⱼ = d(x⁽ᵏ⁾, sᵢ, tⱼ)

    • x⁽ᵏ⁾: BERT input embeddings

    • sᵢ: Learnt using a Graph Attention Network (GAT) [Veličković+18] applied to the social network

    • tⱼ: Sampled from a zero-mean diagonal Gaussian

    • Next, use these dynamic non-contextualised embeddings with BERT to create a contextualised version of them:

      h⁽ᵏ⁾ᵢⱼ = BERT(e⁽ᵏ⁾ᵢⱼ, sᵢ, tⱼ)
    14
    Hofmann+21


  15. Learn vs. Adapt
    • Temporal Adaptation:

    • Instead of training separate word embedding models from each snapshot taken at different time stamps, adapt a model from one point (current/past) in time to another (future) point in time. [Kulkarni+15, Hamilton+16, Loureiro+22]

    • Benefits

    • Parameter efficiency

    • Models trained on different snapshots share the same set of parameters, leading to smaller total model sizes.

    • Data efficiency

    • We might not have sufficient data at each snapshot (especially when the time intervals are short) to accurately train large models
    15


  16. Problem Setting
    • Given a Masked Language Model (MLM), M, and two corpora (snapshots) C1 and C2, taken at two different times T1 and T2 (> T1), adapt M from T1 to T2 such that it can represent the meanings of words at T2.

    • Remarks

    • M does not have to be trained on C1 (or C2).

    • We do not care whether M can accurately represent the meanings of words at T1.

    • M is both contextualised as well as dynamic (time-sensitive)

    • Hence, a Dynamic Contextualised Word Embedding (DCWE)!
    16


  17. Prompt-based Temporal Adaptation
    • How to connect two corpora collected at two different points in time?

    • Pivots (w): words that occur in both C1 as well as C2

    • Anchors (u, v): words that are associated with pivots in either C1 or C2, but not both.

    • u is associated with w in C1, whereas v is associated with w in C2

    • Temporal Prompt:

    • w is associated with u in T1, whereas it is associated with v in T2

    • Example: (mask, hide, vaccine), T1 = 2010, T2 = 2020

    • mask is associated with hide in 2010, whereas it is associated with vaccine in 2020
    17


  18. Frequency-based Tuple Selection
    • Pivot selection: If a word occurs a lot in both corpora, it is likely to be time-invariant (domain-independent) [Bollegala+15]

      score(w) = min(f(w, C1), f(w, C2))

    • f(w, C): frequency of w in corpus C

    • Anchor selection: words in each corpus that have high pointwise mutual information (PMI) with pivots are likely to be good anchors

      PMI(w, x; C) = log( p(w, x) / (p(w) p(x)) )
    18

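The two selection scores can be sketched on a toy pair of corpora. The sentence-level co-occurrence counting and the toy sentences are illustrative assumptions, not the paper's exact counting scheme.

```python
import math
from collections import Counter

# Toy snapshots: lists of tokenised "sentences" from two time periods.
C1 = [["mask", "hide", "face"], ["mask", "hide"], ["face", "paint"]]
C2 = [["mask", "vaccine", "covid"], ["mask", "vaccine"], ["covid", "test"]]

def pivot_scores(corpus1, corpus2):
    """score(w) = min(f(w, C1), f(w, C2)): high only for words that are
    frequent in BOTH snapshots, i.e. likely time-invariant pivots."""
    f1 = Counter(w for s in corpus1 for w in s)
    f2 = Counter(w for s in corpus2 for w in s)
    return {w: min(f1[w], f2[w]) for w in f1.keys() & f2.keys()}

def pmi(w, x, corpus):
    """PMI(w, x; C) = log p(w, x) / (p(w) p(x)), using sentence-level
    co-occurrence as the (illustrative) event space."""
    n = len(corpus)
    p_w = sum(w in s for s in corpus) / n
    p_x = sum(x in s for s in corpus) / n
    p_wx = sum(w in s and x in s for s in corpus) / n
    return math.log(p_wx / (p_w * p_x)) if p_wx > 0 else float("-inf")

print(pivot_scores(C1, C2))  # {'mask': 2}: the only word shared by C1 and C2
print(round(pmi("mask", "hide", C1), 3), round(pmi("mask", "vaccine", C2), 3))
```

Here "mask" is the pivot, and "hide"/"vaccine" emerge as candidate anchors because each has high PMI with the pivot in only one snapshot.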

  19. Diversity-based Tuple Selection
    • The anchors that have high PMI with pivots in both corpora could be similar, resulting in useless prompts for temporal adaptation

    • Add a diversity penalty on pivots…

      diversity(w) = 1 − |𝒰(w) ∩ 𝒱(w)| / |𝒰(w) ∪ 𝒱(w)|

    • 𝒰(w): Set of anchors associated with w in C1

    • 𝒱(w): Set of anchors associated with w in C2

    • Select w that scores high on diversity and create tuples (w, u, v) by selecting the corresponding anchors.
    19

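The diversity penalty is one minus the Jaccard coefficient of the two anchor sets; a minimal sketch over toy anchor sets (the words are illustrative):

```python
def diversity(U, V):
    """diversity(w) = 1 - |U(w) ∩ V(w)| / |U(w) ∪ V(w)|, i.e. 1 - Jaccard.
    U and V are the anchor sets of a pivot w in C1 and C2 respectively."""
    return 1 - len(U & V) / len(U | V)

print(diversity({"hide", "face"}, {"vaccine", "covid"}))  # 1.0: anchors fully shifted
print(diversity({"hide", "face"}, {"hide", "face"}))      # 0.0: no shift at all
```

A pivot whose anchors barely overlap across the two snapshots (diversity near 1) yields informative temporal prompts; a pivot with identical anchors in both snapshots (diversity 0) would produce prompts that teach the model nothing new.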

  20. Context-based Tuple Selection
    • Two issues in frequency- and diversity-based tuple selection methods

    • co-occurrences can be sparse (esp. in small corpora), which can make PMI overestimate the association between words.

    • contexts of the co-occurrences are not considered.

    • Solution — use contextualised word embeddings

    • A word x is represented by averaging its token embedding M(x, d) over all occurrences d ∈ 𝒟(x):

      x̄ = (1 / |𝒟(x)|) Σ_{d∈𝒟(x)} M(x, d)

    • Compute two embeddings for x, x1 and x2, respectively from C1 and C2

      score(w, u, v) = g(w1, u1) + g(w2, v2) − g(w2, u2) − g(w1, v1)
    20

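A sketch of the context-based score with cosine similarity standing in for g (the slide does not fix the choice of g); the toy vectors mimic a pivot whose averaged embedding shifts between C1 and C2.

```python
import numpy as np

def g(a, b):
    """Cosine similarity between two averaged token embeddings
    (an illustrative choice for the similarity function g)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tuple_score(w1, w2, u1, u2, v1, v2):
    """score(w, u, v) = g(w1, u1) + g(w2, v2) - g(w2, u2) - g(w1, v1):
    rewards anchors u close to w only in C1, and v close to w only in C2."""
    return g(w1, u1) + g(w2, v2) - g(w2, u2) - g(w1, v1)

# Toy averaged embeddings: the pivot w ("mask") drifts between corpora.
w1, w2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u1 = u2 = np.array([0.9, 0.1])   # "hide": close to w in C1 only
v1 = v2 = np.array([0.1, 0.9])   # "vaccine": close to w in C2 only
score = tuple_score(w1, w2, u1, u2, v1, v2)
print(round(score, 3))
```

A well-chosen (w, u, v) tuple drives the first two terms up and the last two down, so a high score flags exactly the kind of meaning shift the temporal prompts are meant to express.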

  21. Automatic Template Learning
    • Given a tuple (extracted by any of the previously described methods), can we generate the templates?

    • mask is associated with hide in 2010 and associated with vaccine in 2020

    • Find two sentences S1 and S2 containing u and v, and use T5 [Raffel+ '20] to generate the slots Z1, Z2, Z3, and Z4:

      Tg(u, v, T1, T2): S1, S2 → S1 ⟨Z1⟩ u ⟨Z2⟩ T1 ⟨Z3⟩ v ⟨Z4⟩ T2 S2   (6)

    • The length of each slot to be generated is not required to be predefined; one token is generated at a time until the next non-slot token (i.e. u, T1, v, T2) is encountered. The generated templates must cover all tuples in S.

    • Select the templates that have high likelihood with all tuples. [Gao+ '21]

    • Use beam search with a large (e.g. 100) beam width to generate a diverse set of templates.

    • We substitute tuples in the generated templates to create Automatic prompts
    21
    mask hide 2010 vaccine 2020


  22. Examples of Prompts
    22
    Template | Type
    ⟨w⟩ is associated with ⟨u⟩ in ⟨T1⟩, whereas it is associated with ⟨v⟩ in ⟨T2⟩. | Manual
    Unlike in ⟨T1⟩, where ⟨u⟩ was associated with ⟨w⟩, in ⟨T2⟩ ⟨v⟩ is associated with ⟨w⟩. | Manual
    The meaning of ⟨w⟩ changed from ⟨T1⟩ to ⟨T2⟩ respectively from ⟨u⟩ to ⟨v⟩. | Manual
    ⟨u⟩ in ⟨T1⟩ ⟨v⟩ in ⟨T2⟩ | Automatic
    ⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩ | Automatic
    The ⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩ | Automatic
    Table 1: Experimented templates. "Manual" denotes that the template is manually written, whereas "Automatic" denotes that it is automatically generated.

    [Excerpt from the paper: prompts are designed such that M captures the semantic variation of a word w from T1 to T2. A language-modelling head is added on top of M, one token at a time is randomly masked out from each prompt, and M is required to correctly predict the masked-out tokens from the remaining tokens in the context. A variant masking only the anchor tokens was also tried. Baselines: BERT(T1) fine-tunes the Original BERT on the training data sampled at T1; BERT(T2) fine-tunes it on the data sampled at T2 (the same data used for selecting tuples in §3.2); FT denotes the BERT models fine-tuned by the proposed method, written FT(model, template).]

    - Automatic prompts tend to be short and less diverse.

    - Emphasising high likelihood results in shorter prompts

  23. Fine-tuning on Temporal Prompts
    • Add a language modelling head to the pre-trained MLM and fine-tune it such that it can correctly predict the masked-out tokens in a prompt.
    23
    mask is associated with hide in 2010, whereas it is associated with vaccine in 2020
    • We mask all tokens at random during fine-tuning.

    • Masking only anchors did not improve performance significantly
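The masking scheme can be sketched without any ML library: mask prompt tokens at random and record the labels an MLM head would be trained to recover. The masking probability and function name are illustrative.

```python
import random

PROMPT = ("mask is associated with hide in 2010 , "
          "whereas it is associated with vaccine in 2020")

def mask_tokens(tokens, p=0.15, rng=None):
    """Sketch of the MLM fine-tuning objective on a temporal prompt: each
    token is masked with probability p; masked positions keep their original
    token as the prediction target, all others are ignored in the loss."""
    rng = rng or random
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            masked.append("[MASK]")
            labels.append(tok)   # target the model must recover
        else:
            masked.append(tok)
            labels.append("-")   # ignored position
    return masked, labels

random.seed(1)
masked, labels = mask_tokens(PROMPT.split())
print(" ".join(masked))
```

In the actual fine-tuning this corrupted prompt would be fed to the MLM, with cross-entropy computed only over the masked positions.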

  24. Experiments
    • Datasets

    • Yelp: We select publicly available reviews covering the years 2010 (T1) and 2020 (T2).

    • Reddit: We take all comments from September 2019 (T1) and April 2020 (T2), which reflects the effects of the COVID-19 pandemic.

    • ArXiv: We obtain abstracts of papers published in the years 2010 (T1) and 2020 (T2)

    • Ciao: We select reviews from the years 2010 (T1) and 2020 (T2) [Tang+'12]

    • Baselines

    • Original BERT: pre-trained BERT-base-uncased

    • BERT(T1): fine-tune the original BERT on the training data sampled at T1.

    • BERT(T2): fine-tune the original BERT on the training data sampled at T2.

    • Proposed: FT(model, template)
    24


  25. Results — Temporal Adaptation
    • Evaluation Metric: Perplexity (the lower the better) for generating test sentences in T2.

    • Best result in each block is in bold, while the overall best is indicated by †
    25
    MLM Yelp Reddit ArXiv Ciao
    Original BERT 15.125 25.277 11.142 12.669
    FT (BERT, Manual) 14.562 24.109 10.849 12.371
    FT (BERT, Auto) 14.458 23.382 10.903 12.394
    BERT (T1) 5.543 9.287 5.854 7.423
    FT (BERT(T1), Manual) 5.534 9.327 5.817 7.334
    FT (BERT(T1), Auto) 5.541 9.303 5.818 7.347
    BERT(T2) 4.718 8.927 3.500 5.840
    FT (BERT(T2), Manual) 4.714 8.906† 3.500 5.813†
    FT (BERT(T2), Auto) 4.708† 8.917 3.499† 5.827


  26. Results — Comparisons against SoTA
    • FT (Proposed) has the lowest perplexities across all datasets.


    • CWE (Contextualised Word Embeddings) used by Hofmann+21 [BERT]


    • DCWE (Dynamic CWE) proposed by Hofmann+21
    26
    MLM Yelp Reddit ArXiv Ciao
    FT (BERT(T2), Manual) 4.714 8.906† 3.499 5.813†
    FT (BERT(T2), Auto) 4.708† 8.917 3.499† 5.827
    TempoBERT [Rosin+2022] 5.516 12.561 3.709 6.126
    CWE [Hofmann+2021] 4.723 9.555 3.530 5.910
    DCWE [temp. only] [Hofmann+2021] 4.723 9.631 3.515 5.899
    DCWE [temp. + social] [Hofmann+2021] 4.720 9.596 3.513 5.902


  27. Pivots and Anchors
    • Anecdote:

    • burgerville and joes are restaurants, which were popular in 2010, but due to the lockdowns, takeaways such as dominos have become associated with place in 2020.

    • clerk is less used now and is getting replaced by administrator, operator etc.
    27
    Pivot (w) | Anchors (u, v)
    place | (burgerville, takeaway), (burgerville, dominos), (joes, dominos)
    service | (doorman, staffs), (clerks, personnel), (clerks, administration)
    phone | (nokia, iphone), (nokia, ipod), (nokia, blackberry)
    service | (clerk, administrator), (doorman, staff), (clerk, operator)


  28. We ❤ Prompts
    28


  29. Let's talk about Prompting
    • There are many types of prompts currently in use

    • Few-shot prompting

    • Give some examples and ask the LLM to generalise from them (cf. in-context learning)

    • e.g. If man is to woman then king is to what?

    • Zero-shot/instruction prompting

    • Describe the task that needs to be performed by the LLM

    • e.g. Translate the following sentence from Japanese to English: 言語モデルはすごいです。("Language models are amazing.")
    29


  30. Robustness of Prompting?
    • Humans have a latent intent that they want to express using a short text snippet to an LLM, and a prompt is a surface realisation of this latent intent

    • Prompting is a many-to-one mapping, with multiple surface realisations possible for a single latent intent inside the human brain

    • It is OK for prompts to be different as long as they all align to the same latent intent (and hopefully give the same level of performance)

    • Robustness of a Prompt Learning Method [Ishibashi+ https://aclanthology.org/2023.eacl-main.174/]

    • If the performance of an MLM (M), measured by a metric g, on a task T, with prompts learnt by a method Γ, remains stable under a small random perturbation δ, then Γ is defined to be robust w.r.t. g on T for M:

      𝔼_{d∼Γ}[ |g(T, M(d)) − g(T, M(d + δ))| ] < ϵ
    30

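An empirical reading of the robustness definition: estimate the expectation by averaging the metric change over sampled prompts. The toy metric and the perturbation (random token re-ordering) are illustrative stand-ins for a real MLM and evaluation task.

```python
import random

def is_robust(metric, prompts, perturb, epsilon, rng):
    """Empirical robustness check: the average absolute change in the metric
    under a small perturbation must stay below epsilon."""
    diffs = [abs(metric(p) - metric(perturb(p, rng))) for p in prompts]
    return sum(diffs) / len(diffs) < epsilon

# Hypothetical metric: fraction of prompt tokens a toy "model" recognises.
VOCAB = {"mask", "is", "associated", "with", "hide", "in", "2010"}
def g(prompt):
    toks = prompt.split()
    return sum(t in VOCAB for t in toks) / len(toks)

def shuffle_tokens(prompt, rng):
    """Perturbation: randomly re-order the tokens of the prompt."""
    toks = prompt.split()
    rng.shuffle(toks)
    return " ".join(toks)

rng = random.Random(0)
prompts = ["mask is associated with hide in 2010"]
print(is_robust(g, prompts, shuffle_tokens, epsilon=0.01, rng=rng))
# True, because this bag-of-words metric is order-invariant
```

With a real MLM, g would be task accuracy and the same harness would reveal whether learnt prompts survive small perturbations.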

  31. AutoPrompts are not Robust!
    • Prompts learnt by AutoPrompt [Shin+2020] for fact extraction (on T-REx) using BERT and RoBERTa.

    • Compared to Manual prompts, AP BERT/RoBERTa have much better performance.

    • However, AutoPrompts are difficult to interpret (cf. humans would never write this stuff)
    31


  32. Token ordering
    • Randomly re-order the tokens in a prompt and measure the drop in performance
    32


  33. Cross-dataset Evaluation
    • If the prompts learnt from one dataset can also perform well on another dataset annotated for the same task, then the prompts generalise well
    33


  34. Lexical Semantic Changes
    • Instead of adapting an entire LLM (costly), can we just predict the semantic change of a single word over a time period?
    34
    Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings
    [Aida+Bollegala, Findings of ACL 2023] https://arxiv.org/abs/2305.08654
    [Screenshot of the paper (Danushka Bollegala; Amazon, University of Liverpool). Figure 1: t-SNE projections of BERT token vectors (dotted) for gay and cell in two time periods, with the average vector (starred) for each period; gay has lost its original meaning related to happy. Abstract excerpt: previously associated meanings of a word can become obsolete over time (e.g. the meaning of gay as happy), while novel usages of existing words appear (e.g. the use of cell as a mobile phone). Mean representations alone cannot accurately capture such semantic variations, so the proposed method uses the entire cohort of contextualised embeddings of the target word, referred to as the sibling distribution. Experimental results on the SemEval-2020 benchmark dataset for semantic variation prediction show that the method outperforms prior work that considers only the mean embeddings and is comparable to the current state of the art.]


  35. Siblings are all you need
    • Challenges


    • How to model the meaning of a word in a corpus?


    • Meaning depends on the context [Harris 1954]


    • How to compare meaning of a word across corpora?


    • Depends on the representations learnt.


    • Lack of large-scale labelled datasets to learn semantic change prediction models


    • Must resort to unsupervised methods


    • Solution


    • Each occurrence of a target word in a corpus can be represented by its own contextualised token
    embedding, obtained from a pre-trained/fine-tuned MLM.


    • Set of vector embeddings can be approximated by a multivariate Gaussian (full covariance is
    expensive, can be approximated well with the diagonal)


    • We can sample from two Gaussians representing the meaning of the target word in each corpus and
    then use any distance/divergence measure
    35
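The solution above can be sketched end-to-end with NumPy. The random vectors below are toy stand-ins for real contextualised BERT embeddings, and the Chebyshev distance over paired samples is one of the distance choices mentioned on the next slide (the dimensions, sample counts, and shift are illustrative):

```python
import numpy as np

# Sketch of the sibling-distribution approach: approximate the set of
# contextualised embeddings ("siblings") of a target word in each corpus by a
# diagonal-covariance Gaussian, sample from both, and compare the samples with
# a distance measure (Chebyshev here).

rng = np.random.default_rng(0)

def fit_diag_gaussian(embeddings):
    """Mean and per-dimension variance of an (n, dim) embedding matrix."""
    return embeddings.mean(axis=0), embeddings.var(axis=0) + 1e-8

def semantic_change_score(emb1, emb2, n_samples=1000):
    mu1, var1 = fit_diag_gaussian(emb1)
    mu2, var2 = fit_diag_gaussian(emb2)
    s1 = rng.normal(mu1, np.sqrt(var1), size=(n_samples, mu1.shape[0]))
    s2 = rng.normal(mu2, np.sqrt(var2), size=(n_samples, mu2.shape[0]))
    # Chebyshev distance between paired samples, averaged over samples.
    return np.abs(s1 - s2).max(axis=1).mean()

# Toy "siblings": same distribution (stable word) vs. a shifted one (changed word).
stable  = semantic_change_score(rng.normal(0, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
changed = semantic_change_score(rng.normal(0, 1, (200, 8)), rng.normal(3, 1, (200, 8)))
print(stable < changed)  # a shifted meaning yields a larger score
```

Using the full covariance instead of the diagonal is more faithful but, as noted above, more expensive at the dimensionality of real MLM embeddings.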


  36. Comparisons against SoTA
    36
    Model | Spearman
    Word2Gauss-light (averages word2vec, KL) | 0.358
    Word2Gauss (learnt from scratch, rotation, KL) | 0.399
    MLM-temp, Cosine (FT by time-masking BERT, avg. cosine distance) | 0.467
    MLM-temp, APD (avg. pairwise cosine distance over all siblings) | 0.479
    MLM-pre w/ Temp. Att. (pretrained BERT + temporal attention) | 0.520
    MLM-temp w/ Temp. Att. (FT by time-masking BERT + temporal attention) | 0.548
    Proposed (Sibling embeddings, Multivariate full cov., Chebyshev) | 0.529


  37. Word Senses and Semantic Changes
    • Hypothesis: If the distribution of word senses associated with a particular
    word has changed between two corpora, that word’s meaning has changed.
    37
    [Figure: sense distributions of "plane" and "pin" in corpus-1 vs. corpus-2; Jensen-Shannon divergence = 0.221 for "plane" and 0.027 for "pin".]
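A minimal sketch of this hypothesis, comparing sense distributions with Jensen-Shannon divergence. The toy distributions below are made up for illustration and do not reproduce the 0.221 / 0.027 values shown on the slide:

```python
import math

# Sketch: if a word's distribution over senses differs between two corpora
# (large Jensen-Shannon divergence), its meaning is predicted to have changed.

def jensen_shannon(p, q):
    """JS divergence (base e) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy sense distributions over three senses of each word.
plane_c1, plane_c2 = [0.7, 0.2, 0.1], [0.2, 0.1, 0.7]   # sense usage shifts
pin_c1, pin_c2     = [0.5, 0.3, 0.2], [0.48, 0.32, 0.2] # nearly unchanged

print(jensen_shannon(plane_c1, plane_c2) > jensen_shannon(pin_c1, pin_c2))  # → True
```

JSD is symmetric and bounded, which makes it a convenient change score when the two corpora play interchangeable roles.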


  38. Swapping is all you need!
    • Hypothesis: If the meaning of a word has not changed between two corpora, the sibling
    distributions after a random swapping of sentences will remain similar to those in the
    original corpora.
    38
    [Figure: corpora D1 and D2, with sentences s1 and s2 randomly swapped between them to produce D1,swap and D2,swap.]
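The swapping test can be sketched as a simple permutation check. The 1-D toy "embeddings" and the mean-shift statistic below are illustrative choices, standing in for real sibling embeddings and a proper distributional distance:

```python
import random
import statistics

# Sketch of the swapping test: exchange random occurrences between the two
# corpora and measure how far each corpus's distribution moves. For a stable
# word the swapped corpora look like the originals; for a changed word the
# swap mixes two different distributions and the shift is large.

def swap_and_shift(d1, d2, n_swaps=50, seed=0):
    rng = random.Random(seed)
    d1, d2 = d1[:], d2[:]  # copy so the originals survive for comparison
    for _ in range(n_swaps):
        i, j = rng.randrange(len(d1)), rng.randrange(len(d2))
        d1[i], d2[j] = d2[j], d1[i]
    return d1, d2

def mean_shift(before, after):
    return abs(statistics.mean(before) - statistics.mean(after))

rng = random.Random(1)
stable_1  = [rng.gauss(0, 1) for _ in range(100)]  # same meaning in both corpora
stable_2  = [rng.gauss(0, 1) for _ in range(100)]
changed_1 = [rng.gauss(0, 1) for _ in range(100)]  # meaning shifted in corpus 2
changed_2 = [rng.gauss(5, 1) for _ in range(100)]

s1s, _ = swap_and_shift(stable_1, stable_2)
c1s, _ = swap_and_shift(changed_1, changed_2)
print(mean_shift(stable_1, s1s) < mean_shift(changed_1, c1s))
```

In practice one would repeat the swap many times and compare full sibling distributions rather than a single mean.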


  39. What is next…
    • LLMs are trained to predict only the single choice made by the human writer and are unaware
    of the alternatives considered


    • Can we use LLMs to predict the output distributions considered by the human
    writer instead of the selected one?


    • Time adaptation still requires fine-tuning, which is costly for LLMs.


    • Parameter-Efficient Fine-Tuning (PEFT) methods (e.g. Adapters, LoRA, etc.) should
    be considered.


    • Most words do not change their meaning (at least within shorter time intervals)


    • On-demand updates — only update words (and their contexts) that changed in
    meaning


    • Periodic Temporal Shifts

    39
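To make the PEFT point concrete, here is a minimal NumPy sketch of a LoRA-style update, one of the methods named above. The shapes and rank are illustrative, and this stands in for what libraries such as Hugging Face PEFT do inside a transformer layer:

```python
import numpy as np

# Sketch of a LoRA-style parameter-efficient update: instead of retraining the
# full weight matrix W, learn a low-rank correction B @ A, so the adapted layer
# computes x @ (W + B @ A).T. Only A and B are trainable; W stays frozen.

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 32, 4

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weights
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # B starts at zero: adapter is a no-op at init

def adapted_forward(x):
    return x @ (W + B @ A).T

x = rng.normal(size=(8, d_in))
print(np.allclose(adapted_forward(x), x @ W.T))  # → True (B is still zero)

# The adapter adds only rank * (d_in + d_out) parameters instead of d_in * d_out.
print(rank * (d_in + d_out), "<", d_in * d_out)
```

This is what makes on-demand temporal updates plausible: only the small A and B matrices need to be fine-tuned when a batch of words changes meaning.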


  40. Where are we going?
    40


  41. Where should we be going?
    41


  42. Where should we be going?
    • Danushka’s hot take


    • LLMs are great, and (some amount of) hype is good for the field. We
    could/should analyse the texts generated by LLMs to see how they differ
    (or not) from those written by humans.


    • But


    • I do not believe LLMs are “models” of language (rather models that
    can generate language)


    • We need to love the exceptions, not sweep them under the
    carpet! The types of mistakes a model makes tell more about
    what it understands than the ones it gets correct.


    • We are scared our papers will get rejected if we talk more about the
    mistakes our models make … this is bad science.
    42


  43. Questions
    Danushka Bollegala


    https://danushka.net


    [email protected]


    @Bollegala
    Thank You