Slide 1

Slide 1 text

Time Travel with Large Language Models Danushka Bollegala

Slide 2

Slide 2 text

2 Mad scientist

Slide 3

Slide 3 text

3 Xiaohang Tang, Yi Zhou, Yoichi Ishibashi, Taichi Aida — Mad scientist with Large Language Models

Slide 4

Slide 4 text

Time and Meaning — Cell: Robert Hooke (1665), Martin Cooper (1973)

Slide 5

Slide 5 text

Time and Meaning — Corona 5

Slide 6

Slide 6 text

Why do word meanings change?
• New concepts/entities are associated with existing words (e.g. cell)
• Word re-usage promotes efficiency in human communication [cf. Polysemy, Ravin+Leacock '00]
• 40% of the words in the Webster dictionary have more than two senses, while run has 29!
• Totally new words (neologisms) are coined to describe previously non-existent concepts/entities (e.g. ChatGPT)
• Semantics, morphology and syntax are strongly interrelated [Langacker+87, Hock+Joseph 19]
• What counts as coherent, grammatical change over time? [Giulianelli+21]
[Embedded paper excerpt: "Grammatical Profiling for Semantic Change Detection", Giulianelli, Kutuzov & Pivovarova — grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words, can be used for semantic change detection and even outperforms some distributional semantic methods. Example: lass — "young woman" → "sweetheart"; drop in the plural form (lasses); Pokémon trainer class ("girl in mini-skirt").]

Slide 7

Slide 7 text

A Brief History of Word Embeddings
• Static Word Embeddings: word2vec [Mikolov+13], GloVe [Pennington+14], fastText [Bojanowski+17], …
• Contextualised Word Embeddings: BERT [Devlin+19], RoBERTa [Liu+19], ALBERT [Lan+20], …
• Dynamic Word Embeddings: Bernoulli embeddings [Rudolph+Blei 17], Diachronic word embeddings [Hamilton+16], …
• Dynamic Contextualised Word Embeddings: TempoBERT [Rosin+22], HistBERT [Qiu+22], TimeLMs [Loureiro+22], …

Slide 8

Slide 8 text

Diachronic Word Embeddings
• Given multiple snapshots of corpora collected at different time steps, we could separately learn word embeddings from each snapshot. [Hamilton+16, Kulkarni+15, Loureiro+22]
• Pros: any word embedding learning method can be used
• Cons:
  • Many models trained at different snapshots.
  • Difficult to compare word embeddings learnt from different corpora because no natural alignment exists (cf. even the sets of word embeddings obtained from different runs of the same algorithm cannot be compared due to random initialisations)
[Embedded figure from Kulkarni+15, "Statistically Significant Detection of Linguistic Change": the semantic trajectory of the word gay from 1900 to 2005, moving from neighbours such as cheerful and dapper towards homosexual, lesbian and transgender.]

Slide 9

Slide 9 text

Learning Alignments
• Different methods can be used to learn alignments between separately learnt vector spaces
• Canonical Correlation Analysis (CCA) was used by Pražák+20 (ranked 1st for the SemEval 2020 Task 1 binary semantic change detection task)
• Projecting source embeddings into the target space: $\hat{X}_s = W_{s\to t} X_s$
• CCA learns two projections $W_{s\to o}$ and $W_{t\to o}$ of the source and target spaces into a shared space $o$ by minimising the negative correlation
  $\arg\min_{W_{s\to o}, W_{t\to o}} -\sum_{i=1}^{n} \frac{\mathrm{cov}(W_{s\to o} x^s_i,\, W_{t\to o} x^t_i)}{\sqrt{\mathrm{var}(W_{s\to o} x^s_i)\,\mathrm{var}(W_{t\to o} x^t_i)}}$,
  and the source-to-target map is then recovered as $W_{s\to t} = W_{s\to o}\,(W_{t\to o})^{-1}$.
• Further orthogonal constraints can be imposed on $W_{s\to t}$ (cf. the Orthogonal Transformation of VecMap [Artetxe+18])
• However, aligning contextualised word embeddings is hard [Takahashi+Bollegala'22]
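A minimal sketch of the two alignment strategies mentioned above (not the code of Pražák+20): an orthogonal Procrustes map and a CCA-based map realised through a pseudo-inverse. The matrices are random placeholders standing in for embeddings of the same vocabulary trained on C1 and C2.

```python
# Sketch: align two separately trained embedding spaces whose i-th rows are
# assumed to embed the same word in corpus C1 and corpus C2 respectively.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_s = rng.normal(size=(2000, 100))   # embeddings learnt from C1 (placeholder)
X_t = rng.normal(size=(2000, 100))   # embeddings learnt from C2 (placeholder)

# (a) Orthogonal map: W = argmin_{W: W W^T = I} ||X_s W - X_t||_F  (Procrustes)
U, _, Vt = np.linalg.svd(X_s.T @ X_t)
W_orth = U @ Vt
X_s_aligned = X_s @ W_orth

# (b) CCA: project both spaces into a shared space o, then map s -> t with
#     W_{s->t} = W_{s->o} (W_{t->o})^{-1}, here via a pseudo-inverse.
cca = CCA(n_components=20, scale=False).fit(X_s, X_t)
W_s_to_t = cca.x_rotations_ @ np.linalg.pinv(cca.y_rotations_)
X_s_in_t = (X_s - X_s.mean(axis=0)) @ W_s_to_t + X_t.mean(axis=0)

print(X_s_aligned.shape, X_s_in_t.shape)
```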

Slide 10

Slide 10 text

Dynamic Embeddings
• Exponential Family Embeddings [Rudolph+16]: $x_i \mid x_{c_i} \sim \mathrm{ExpFam}\big(\eta_i(x_{c_i}),\, t(x_i)\big)$
• Bernoulli Embeddings [Rudolph+Blei 17]: $x_{iv} \mid x_{c_i} \sim \mathrm{Bern}\big(\rho^{(t)}_{iv}\big)$, where $\eta_{iv} = \rho^{(t_i)\top}_{v} \sum_{j \in c_i} \sum_{v'} \alpha_{v'}\, x_{jv'}$
• Dynamic Embeddings: the embedding vectors $\rho^{(t)}_v$ are time-specific, while the context vectors (parametrised by $\alpha_v$) are shared over time
[Embedded Figure 2 from Rudolph+Blei '17: graphical representation of a dynamic Bernoulli embedding for text data in T time slices X(1), …, X(T); the embedding vectors of each term evolve over time, while the context vectors are shared across all time slices.]
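As a toy illustration of the Bernoulli-embedding parameterisation above (random values, nothing is trained here), the natural parameter combines a time-specific embedding vector with context vectors that are shared across time:

```python
import numpy as np

V, K, T = 1000, 50, 3                          # vocabulary size, dimension, time slices
rng = np.random.default_rng(0)
rho = rng.normal(scale=0.1, size=(T, V, K))    # time-specific embedding vectors
alpha = rng.normal(scale=0.1, size=(V, K))     # context vectors, shared over time

def eta(v, context_ids, t):
    """Natural parameter eta_{iv} for word v at time slice t, given its context word ids."""
    context_sum = alpha[context_ids].sum(axis=0)   # sum_{j in c_i} sum_{v'} alpha_{v'} x_{jv'}
    return float(rho[t, v] @ context_sum)

def bernoulli_prob(v, context_ids, t):
    """Probability of observing word v in this context (sigmoid link, an assumption)."""
    return 1.0 / (1.0 + np.exp(-eta(v, context_ids, t)))

print(bernoulli_prob(v=42, context_ids=[3, 17, 256, 999], t=2))
```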

Slide 11

Slide 11 text

Dynamic Embeddings
[Figure: the dynamic embedding of the word "intelligence" computed from (a) ACM abstracts (1951–2014) and (b) U.S. Senate speeches (1858–2009), projected to a single dimension (y-axis).]

Slide 12

Slide 12 text

Time Masking (TempoBERT) [Rosin+22]
• Prepend the time stamp to each sentence in a corpus written at a specific time: "<2021> Joe Biden is the President of the USA"
• Mask out the time token in the same way as other tokens during MLM training
• Masking time tokens with a higher probability (e.g. 0.2) performs better
• Predicting the time of a sentence: "[MASK] Joe Biden is the President of the USA"
• The probability distributions of the predicted time tokens can be used to compute semantic change scores for words

Table 3: Semantic change detection results on LiverpoolFC, SemEval-English, and SemEval-Latin (Pearson / Spearman).
Method                     LiverpoolFC      SemEval-Eng      SemEval-Lat
Del Tredici et al. [5]     0.490 / –        – / –            – / –
Schlechtweg et al. [37]    0.428 / 0.425    0.512 / 0.321    0.458 / 0.372
Gonen et al. [10]          – / –            0.504 / 0.277    0.417 / 0.273
Martinc et al. [26]        0.473 / 0.492    – / 0.315        – / 0.496
Montariol et al. [28]      0.378 / 0.376    0.566 / 0.437    – / 0.448
TempoBERT                  0.637 / 0.620    0.538 / 0.467    0.485 / 0.512

Works surprisingly well on multiple datasets and languages!
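A rough sketch of the time-masking idea with HuggingFace Transformers (this is not the TempoBERT code; the custom MLM collator that masks time tokens with a higher probability, and the fine-tuning loop itself, are omitted — without that training the printed probabilities are not yet meaningful):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

time_tokens = ["<2010>", "<2021>"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_tokens(time_tokens)
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tok))        # new embedding rows for the time tokens

# Training data: prepend the snapshot's time stamp to every sentence.
train_example = "<2021> Joe Biden is the President of the USA"   # would be fed to MLM training

# Inference: mask the time position and read off the predicted time distribution.
probe = f"{tok.mask_token} Joe Biden is the President of the USA"
inputs = tok(probe, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = logits.softmax(dim=-1)
for t in time_tokens:
    print(t, float(probs[tok.convert_tokens_to_ids(t)]))
```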

Slide 13

Slide 13 text

Temporal Attention [Rosin+Radinsky 22]
• Instead of changing the input text, change the attention mechanism in the Transformer to incorporate time.
• Input sequence $x^t_1, x^t_2, \dots, x^t_n$, with input embeddings $x^t_i \in \mathbb{R}^D$ arranged as the rows of $X^t \in \mathbb{R}^{n \times D}$
• Query $Q = X^t W_Q$, Key $K = X^t W_K$, Value $V = X^t W_V$, Time $T = X^t W_T$ (here, $Q, K, V, T \in \mathbb{R}^{n \times d_k}$)
• $\mathrm{TemporalAttention}(Q, K, V, T) = \mathrm{softmax}\!\left(\dfrac{Q\, T^{\top} T\, K^{\top}}{\lVert T \rVert\, \sqrt{d_k}}\right) V$
• Increases the number of parameters (memory), but empirical results show this overhead is negligible.
[Embedded Figure 2 from Rosin+Radinsky '22: illustration of the proposed temporal attention mechanism.]
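A single-head PyTorch sketch of the scoring function above (an illustration of the formula, not the authors' implementation; batching, attention masks and multi-head handling are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_k, bias=False)
        self.W_K = nn.Linear(d_model, d_k, bias=False)
        self.W_V = nn.Linear(d_model, d_k, bias=False)
        self.W_T = nn.Linear(d_model, d_k, bias=False)   # extra time projection
        self.d_k = d_k

    def forward(self, X_t: torch.Tensor) -> torch.Tensor:
        # X_t: (n, d_model) token embeddings of a sentence written at time t
        Q, K, V, T = self.W_Q(X_t), self.W_K(X_t), self.W_V(X_t), self.W_T(X_t)
        # softmax( Q T^T T K^T / (||T|| sqrt(d_k)) ) V
        scores = Q @ T.transpose(-2, -1) @ T @ K.transpose(-2, -1)
        scores = scores / (torch.linalg.norm(T) * self.d_k ** 0.5)
        return F.softmax(scores, dim=-1) @ V

attn = TemporalAttention(d_model=768, d_k=64)
print(attn(torch.randn(12, 768)).shape)      # -> torch.Size([12, 64])
```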

Slide 14

Slide 14 text

Dynamic Contextualised Word Embeddings [Hofmann+21]
• First, incorporate time $t_j$ and social context $s_i$ into the static word embedding of the $k$-th word: $e^{(k)}_{ij} = d(x^{(k)}, s_i, t_j)$
  • $x^{(k)}$: BERT input embeddings
  • $s_i$: learnt using a Graph Attention Network (GAT) [Veličković+18] applied to the social network
  • $t_j$: sampled from a zero-mean diagonal Gaussian
• Next, use these dynamic non-contextualised embeddings with BERT to create a contextualised version of them: $h^{(k)}_{ij} = \mathrm{BERT}(e^{(k)}_{ij}, s_i, t_j)$

Slide 15

Slide 15 text

Learn vs. Adapt
• Temporal Adaptation: instead of training separate word embedding models from each snapshot taken at different time stamps, adapt a model from one point (current/past) in time to another (future) point in time. [Kulkarni+15, Hamilton+16, Loureiro+22]
• Benefits
  • Parameter efficiency: models trained on different snapshots share the same set of parameters, leading to smaller total model sizes.
  • Data efficiency: we might not have sufficient data at each snapshot (especially when the time intervals are short) to accurately train large models

Slide 16

Slide 16 text

Problem Setting
• Given a Masked Language Model (MLM) M and two corpora (snapshots) C1 and C2, taken at two different times T1 and T2 (T2 > T1), adapt M from T1 to T2 such that it can represent the meanings of words at T2.
• Remarks
  • M does not have to be trained on C1 (or C2).
  • We do not care whether M can accurately represent the meanings of words at T1.
  • M is both contextualised as well as dynamic (time-sensitive) — hence, a Dynamic Contextualised Word Embedding (DCWE)!

Slide 17

Slide 17 text

Prompt-based Temporal Adaptation
• How do we connect two corpora collected at two different points in time?
• Pivots (w): words that occur in both C1 as well as C2
• Anchors (u, v): words that are associated with pivots in either C1 or C2, but not both.
  • u is associated with w in C1, whereas v is associated with w in C2
• Temporal Prompt: "w is associated with u in T1, whereas it is associated with v in T2"
• Example: (mask, hide, vaccine), T1 = 2010, T2 = 2020
  • "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"

Slide 18

Slide 18 text

Frequency-based Tuple Selection
• Pivot selection: if a word occurs a lot in both corpora, it is likely to be time-invariant (domain-independent) [Bollegala+15]
  • $\mathrm{score}(w) = \min\big(f(w, C_1),\, f(w, C_2)\big)$, where $f(w, C)$ is the frequency of $w$ in corpus $C$
• Anchor selection: words in each corpus that have high pointwise mutual information with pivots are likely to be good anchors
  • $\mathrm{PMI}(w, x; C) = \log\dfrac{p(w, x)}{p(w)\, p(x)}$
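A small sketch of these two scores, assuming each corpus is a list of tokenised sentences and using sentence-level co-occurrence for PMI (the actual co-occurrence window may differ; the corpora below are toy examples):

```python
from collections import Counter
from math import log

def unigram_counts(corpus):
    return Counter(tok for sent in corpus for tok in sent)

def pivot_score(w, f1, f2):
    """score(w) = min(f(w, C1), f(w, C2)); frequent in both corpora -> likely time-invariant."""
    return min(f1[w], f2[w])

def pmi(w, x, corpus):
    """PMI(w, x; C) with sentence-level co-occurrence probabilities."""
    n = len(corpus)
    p_w = sum(w in s for s in corpus) / n
    p_x = sum(x in s for s in corpus) / n
    p_wx = sum((w in s) and (x in s) for s in corpus) / n
    if min(p_w, p_x, p_wx) == 0:
        return float("-inf")
    return log(p_wx / (p_w * p_x))

C1 = [["wear", "a", "mask", "to", "hide", "your", "face"],
      ["a", "mask", "can", "hide", "your", "identity"],
      ["she", "went", "to", "the", "shop"],
      ["he", "wore", "a", "scary", "mask"]]
C2 = [["mask", "mandates", "and", "vaccine", "rollout"],
      ["get", "a", "vaccine", "and", "wear", "a", "mask"],
      ["the", "vaccine", "was", "approved"],
      ["stay", "home", "and", "stay", "safe"]]

f1, f2 = unigram_counts(C1), unigram_counts(C2)
print(pivot_score("mask", f1, f2))                            # pivot candidate
print(pmi("mask", "hide", C1), pmi("mask", "vaccine", C2))    # anchor candidates
```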

Slide 19

Slide 19 text

Diversity-based Tuple Selection
• The anchors that have high PMI with the pivots in both corpora could be similar, resulting in useless prompts for temporal adaptation
• Add a diversity penalty on pivots:
  • $\mathrm{diversity}(w) = 1 - \dfrac{|\mathcal{U}(w) \cap \mathcal{V}(w)|}{|\mathcal{U}(w) \cup \mathcal{V}(w)|}$
  • $\mathcal{U}(w)$: set of anchors associated with $w$ in $C_1$
  • $\mathcal{V}(w)$: set of anchors associated with $w$ in $C_2$
• Select pivots $w$ that score high on diversity and create tuples $(w, u, v)$ by selecting the corresponding anchors.
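The diversity penalty itself is a Jaccard-style comparison of the two anchor sets; a tiny sketch with illustrative anchor sets (not taken from the paper):

```python
def diversity(anchors_c1, anchors_c2):
    union = anchors_c1 | anchors_c2
    if not union:
        return 0.0
    return 1.0 - len(anchors_c1 & anchors_c2) / len(union)

U = {"hide", "costume", "halloween"}    # anchors of "mask" in C1 (illustrative)
V = {"vaccine", "covid", "hide"}        # anchors of "mask" in C2 (illustrative)
print(diversity(U, V))                  # 1 - 1/5 = 0.8 -> a usefully diverse pivot
```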

Slide 20

Slide 20 text

Context-based Tuple Selection
• Two issues in the frequency- and diversity-based tuple selection methods:
  • co-occurrences can be sparse (esp. in small corpora), and can make PMI overestimate the association between words.
  • the contexts of the co-occurrences are not considered.
• Solution — use contextualised word embeddings
  • A word $x$ is represented by averaging its token embedding $M(x, d)$ over all of its occurrences $d \in \mathcal{D}(x)$: $\;\bar{x} = \dfrac{1}{|\mathcal{D}(x)|} \sum_{d \in \mathcal{D}(x)} M(x, d)$
  • Compute two embeddings $x_1$ and $x_2$ for $x$, respectively from $C_1$ and $C_2$
  • $\mathrm{score}(w, u, v) = g(w_1, u_1) + g(w_2, v_2) - g(w_2, u_2) - g(w_1, v_1)$
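A sketch (not the paper's code) of the context-based score with a pretrained BERT, taking g to be cosine similarity (an assumption; g is not specified on the slide). Sub-word pieces of an occurrence are mean-pooled, and the example sentences are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModel.from_pretrained("bert-base-uncased").eval()

def avg_embedding(word, sentences):
    """Average contextualised embedding of `word` over all its occurrences."""
    pieces = tok.tokenize(word)
    vecs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        toks = tok.convert_ids_to_tokens(enc["input_ids"][0])
        with torch.no_grad():
            hidden = mlm(**enc).last_hidden_state[0]
        for i in range(len(toks) - len(pieces) + 1):
            if toks[i:i + len(pieces)] == pieces:
                vecs.append(hidden[i:i + len(pieces)].mean(0))   # mean-pool sub-word pieces
    return torch.stack(vecs).mean(0)

def g(a, b):
    return float(torch.cosine_similarity(a, b, dim=0))

occurrences = {                       # occurrences of each word in each corpus (illustrative)
    ("mask", "C1"): ["wear a mask to hide your face", "a mask can hide who you are"],
    ("mask", "C2"): ["mask rules and the vaccine rollout", "get a vaccine and wear a mask"],
    ("hide", "C1"): ["wear a mask to hide your face"],
    ("hide", "C2"): ["people hide indoors during a lockdown"],
    ("vaccine", "C1"): ["an early vaccine trial"],
    ("vaccine", "C2"): ["get a vaccine and wear a mask"],
}
emb = {key: avg_embedding(key[0], sents) for key, sents in occurrences.items()}

score = (g(emb["mask", "C1"], emb["hide", "C1"]) + g(emb["mask", "C2"], emb["vaccine", "C2"])
         - g(emb["mask", "C2"], emb["hide", "C2"]) - g(emb["mask", "C1"], emb["vaccine", "C1"]))
print(score)
```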

Slide 21

Slide 21 text

Automatic Template Learning
• Given a tuple (extracted by any of the previously described methods), can we generate the templates?
  • "mask is associated with hide in 2010 and associated with vaccine in 2020"
• Find two sentences $S_1$ and $S_2$ containing $u$ and $v$, and use T5 [Raffel+'20] to generate the slots Z1, Z2, Z3 and Z4 of the template $T_g(u, v, T_1, T_2)$:
  $S_1, S_2 \;\to\; S_1\, \langle Z_1\rangle\, u\, \langle Z_2\rangle\, T_1\, \langle Z_3\rangle\, v\, \langle Z_4\rangle\, T_2\, S_2$
  The length of each slot is not required to be predefined; one token is generated at a time until the next non-slot token (i.e. $u$, $T_1$, $v$, $T_2$) is encountered.
• Select the templates that have high likelihood with all tuples. [Gao+'21]
• Use beam search with a large beam width (e.g. 100) to generate a diverse set of templates.
• We substitute tuples into the generated templates to create Automatic prompts.
• Example fillers: mask, hide, 2010, vaccine, 2020
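A hedged sketch of the T5 slot-filling step, in the spirit of Gao+'21: the slots ⟨Z1⟩–⟨Z4⟩ are realised as T5 sentinel tokens and filled with beam search. The two sentences are illustrative, and the selection of templates by their likelihood over all tuples is omitted.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

w, u, v, T1, T2 = "mask", "hide", "vaccine", "2010", "2020"
S1 = "Please wear a mask to hide your face."             # a sentence containing u (illustrative)
S2 = "A mask and a vaccine protect you from the virus."  # a sentence containing v (illustrative)

# S1 <Z1> u <Z2> T1 <Z3> v <Z4> T2 S2
prompt = (f"{S1} <extra_id_0> {u} <extra_id_1> {T1} "
          f"<extra_id_2> {v} <extra_id_3> {T2} {S2}")

inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=100, num_return_sequences=10,
                         max_new_tokens=32)
for o in outputs:
    print(tok.decode(o, skip_special_tokens=False))
```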

Slide 22

Slide 22 text

Examples of Prompts

Table 1: Experimented templates. "Manual" denotes that the template is manually written, whereas "Automatic" denotes that the template is automatically generated.
Template                                                                                 Type
⟨w⟩ is associated with ⟨u⟩ in ⟨T1⟩, whereas it is associated with ⟨v⟩ in ⟨T2⟩.           Manual
Unlike in ⟨T1⟩, where ⟨u⟩ was associated with ⟨w⟩, in ⟨T2⟩ ⟨v⟩ is associated with ⟨w⟩.   Manual
The meaning of ⟨w⟩ changed from ⟨T1⟩ to ⟨T2⟩ respectively from ⟨u⟩ to ⟨v⟩.               Manual
⟨u⟩ in ⟨T1⟩ ⟨v⟩ in ⟨T2⟩                                                                  Automatic
⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩                                                              Automatic
The ⟨u⟩ in ⟨T1⟩ and ⟨v⟩ in ⟨T2⟩                                                          Automatic

• Automatic prompts tend to be short and less diverse.
• Emphasising high likelihood results in shorter prompts.

Slide 23

Slide 23 text

Fine-tuning on Temporal Prompts
• Add a language modelling head to the pre-trained MLM and fine-tune it such that it can correctly predict the masked-out tokens in a prompt:
  "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020"
• We mask all tokens at random during fine-tuning.
• Masking only anchors did not improve performance significantly.
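A minimal sketch of this fine-tuning step with HuggingFace's Trainer and the standard MLM collator (15% random masking); the prompts and hyperparameters are placeholders, not the ones used in the paper.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

prompts = [
    "mask is associated with hide in 2010, whereas it is associated with vaccine in 2020.",
    "phone is associated with nokia in 2010, whereas it is associated with iphone in 2020.",
]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenise the prompts; the collator masks tokens at random on the fly.
ds = Dataset.from_dict({"text": prompts}).map(
    lambda ex: tok(ex["text"], truncation=True), remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tmp_temporal_ft", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```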

Slide 24

Slide 24 text

Experiments
• Datasets
  • Yelp: we select publicly available reviews covering the years 2010 (T1) and 2020 (T2).
  • Reddit: we take all comments from September 2019 (T1) and April 2020 (T2), which reflects the effects of the COVID-19 pandemic.
  • ArXiv: we obtain abstracts of papers published in the years 2010 (T1) and 2020 (T2).
  • Ciao: we select reviews from the years 2010 (T1) and 2020 (T2) [Tang+'12]
• Baselines
  • Original BERT: pre-trained BERT-base-uncased
  • BERT(T1): the original BERT fine-tuned on the training data sampled at T1.
  • BERT(T2): the original BERT fine-tuned on the training data sampled at T2.
  • Proposed: FT(model, template)

Slide 25

Slide 25 text

Results — Temporal Adaptation
• Evaluation metric: perplexity (lower is better) for generating the test sentences in T2.
• The best result in each block is in bold, while the overall best is indicated by †.

MLM                      Yelp     Reddit   ArXiv    Ciao
Original BERT            15.125   25.277   11.142   12.669
FT (BERT, Manual)        14.562   24.109   10.849   12.371
FT (BERT, Auto)          14.458   23.382   10.903   12.394
BERT(T1)                 5.543    9.287    5.854    7.423
FT (BERT(T1), Manual)    5.534    9.327    5.817    7.334
FT (BERT(T1), Auto)      5.541    9.303    5.818    7.347
BERT(T2)                 4.718    8.927    3.500    5.840
FT (BERT(T2), Manual)    4.714    8.906†   3.500    5.813†
FT (BERT(T2), Auto)      4.708†   8.917    3.499†   5.827
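For reference, a common perplexity-style score for an MLM is pseudo-perplexity, masking one token at a time; the exact computation behind the table above may differ from this sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn and exponentiate the mean negative log-likelihood."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nll, n = 0.0, 0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
        n += 1
    return float(torch.exp(torch.tensor(nll / n)))

print(pseudo_perplexity("Masks are associated with vaccines in 2020."))
```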

Slide 26

Slide 26 text

Results — Comparisons against SoTA
• FT (Proposed) has the lowest perplexities across all datasets.
• CWE (Contextualised Word Embeddings) used by Hofmann+21 [BERT]
• DCWE (Dynamic CWE) proposed by Hofmann+21

MLM                                     Yelp     Reddit   ArXiv    Ciao
FT (BERT(T2), Manual)                   4.714    8.906†   3.499    5.813†
FT (BERT(T2), Auto)                     4.708†   8.917    3.499†   5.827
TempoBERT [Rosin+2022]                  5.516    12.561   3.709    6.126
CWE [Hofmann+2021]                      4.723    9.555    3.530    5.910
DCWE [temp. only] [Hofmann+2021]        4.723    9.631    3.515    5.899
DCWE [temp. + social] [Hofmann+2021]    4.720    9.596    3.513    5.902

Slide 27

Slide 27 text

Pivots and Anchors
• Anecdote:
  • burgerville and joes are restaurants that were popular in 2010, but due to the lockdowns, takeaways such as dominos have become associated with place in 2020.
  • clerk is less used now and is getting replaced by administrator, operator, etc.

Pivot (w)   Anchors (u, v)
place       (burgerville, takeaway), (burgerville, dominos), (joes, dominos)
service     (doorman, staffs), (clerks, personnel), (clerks, administration)
phone       (nokia, iphone), (nokia, ipod), (nokia, blackberry)
service     (clerk, administrator), (doorman, staff), (clerk, operator)

Slide 28

Slide 28 text

We ❤ Prompts 28

Slide 29

Slide 29 text

Let's talk about Prompting
• There are many types of prompts currently in use
• Few-shot prompting
  • Give some examples and ask the LLM to generalise from them (cf. in-context learning)
  • e.g. "If man is to woman then king is to what?"
• Zero-shot/instruction prompting
  • Describe the task that needs to be performed by the LLM
  • e.g. "Translate the following sentence from Japanese to English: 言語モデルはすごいです。" ("Language models are amazing.")

Slide 30

Slide 30 text

Robustness of Prompting?
• Humans have a latent intent that they want to express using a short text snippet to an LLM, and a prompt is a surface realisation of this latent intent
• Prompting is a many-to-one mapping, with multiple surface realisations possible for a single latent intent inside the human brain
• It is OK for prompts to be different as long as they all align to the same latent intent (and hopefully give the same level of performance)
• Robustness of a Prompt Learning Method [Ishibashi+, https://aclanthology.org/2023.eacl-main.174/]
  • If the performance of an MLM ($M$), measured by a metric $g$, on a task $T$, with prompts learnt by a method $\Gamma$ remains stable under a small random perturbation $\delta$, then $\Gamma$ is defined to be robust w.r.t. $g$ on $T$ for $M$:
  $\mathbb{E}_{d \sim \Gamma}\big[\,|g(T, M(d)) - g(T, M(d + \delta))|\,\big] < \epsilon$

Slide 31

Slide 31 text

AutoPrompts are not Robust!
• Prompts learnt by AutoPrompt [Shin+2020] for fact extraction (on T-REx) using BERT and RoBERTa.
• Compared to Manual prompts, AP BERT/RoBERTa have much better performance.
• However, AutoPrompts are difficult to interpret (cf. humans would never write this stuff)

Slide 32

Slide 32 text

Token ordering
• Randomly re-order the tokens in a prompt and measure the drop in performance

Slide 33

Slide 33 text

Cross-dataset Evaluation
• If the prompts learnt from one dataset can also perform well on another dataset annotated for the same task, then the prompts generalise well

Slide 34

Slide 34 text

Lexical Semantic Changes
• Instead of adapting an entire LLM (costly), can we just predict the semantic change of a single word over a time period?
"Unsupervised Semantic Variation Prediction using the Distribution of Sibling Embeddings" [Aida+Bollegala, Findings of ACL 2023] https://arxiv.org/abs/2305.08654
[Embedded Figure 1 from the paper: t-SNE projections of BERT token vectors (dotted) of (a) gay and (b) cell in two time periods, with the average vector (starred) for each period; gay has lost its original meaning related to happy, while cell has gained the mobile-phone sense.]

Slide 35

Slide 35 text

Siblings are all you need
• Challenges
  • How to model the meaning of a word in a corpus?
    • Meaning depends on the context [Harris 1954]
  • How to compare the meaning of a word across corpora?
    • Depends on the representations learnt.
  • Lack of large-scale labelled datasets to learn semantic change prediction models
    • Must resort to unsupervised methods
• Solution
  • Each occurrence of a target word in a corpus can be represented by its own contextualised token embedding, obtained from a pre-trained/fine-tuned MLM.
  • The set of such vector embeddings can be approximated by a multivariate Gaussian (a full covariance is expensive, but can be approximated well with the diagonal)
  • We can sample from the two Gaussians representing the meaning of the target word in each corpus and then use any distance/divergence measure
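A minimal sketch of the sibling-distribution idea: fit a diagonal Gaussian to the target word's token embeddings in each corpus and compare the two, here with a closed-form KL divergence (the paper also considers sampling from the Gaussians and other distances, e.g. Chebyshev). The embeddings below are random placeholders for real sibling embeddings.

```python
import numpy as np

def fit_diag_gaussian(X):
    """X: (n_occurrences, dim) sibling embeddings of the target word."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) in closed form."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

rng = np.random.default_rng(0)
siblings_c1 = rng.normal(0.0, 1.0, size=(300, 768))   # occurrences of w in C1 (placeholder)
siblings_c2 = rng.normal(0.5, 1.2, size=(250, 768))   # occurrences of w in C2 (placeholder)

mu1, var1 = fit_diag_gaussian(siblings_c1)
mu2, var2 = fit_diag_gaussian(siblings_c2)
print("semantic change score:", kl_diag_gaussians(mu1, var1, mu2, var2))
```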

Slide 36

Slide 36 text

Comparisons against SoTA

Model                                                                    Spearman
Word2Gauss_light (averages word2vec, KL)                                 0.358
Word2Gauss (learnt from scratch, rotation, KL)                           0.399
MLM_temp, Cosine (FT by time masking BERT, avg. cosine distance)         0.467
MLM_temp, APD (avg. pairwise cosine distance over all siblings)          0.479
MLM_pre w/ Temp. Att. (pretrained BERT + temporal attention)             0.520
MLM_temp w/ Temp. Att. (FT by time masking BERT + temporal attention)    0.548
Proposed (sibling embeddings, multivariate full cov., Chebyshev)         0.529

Slide 37

Slide 37 text

Word Senses and Semantic Changes
• Hypothesis: if the distribution of word senses associated with a particular word has changed between two corpora, that word's meaning has changed.
[Figure: sense distributions in corpus-1 vs. corpus-2 for plane (Jensen-Shannon divergence = 0.221) and pin (Jensen-Shannon divergence = 0.027).]
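Scoring this hypothesis amounts to a divergence between two discrete sense distributions; a sketch with made-up sense counts (not taken from the paper):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# WSD-assigned sense counts of "plane" in each corpus (illustrative numbers).
senses_c1 = np.array([120.0, 30.0, 5.0])   # e.g. aircraft, flat surface, tool
senses_c2 = np.array([40.0, 90.0, 60.0])

p = senses_c1 / senses_c1.sum()
q = senses_c2 / senses_c2.sum()

# scipy returns the Jensen-Shannon *distance* (the square root of the divergence).
jsd = jensenshannon(p, q, base=2) ** 2
print("semantic change score:", jsd)
```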

Slide 38

Slide 38 text

Swapping is all you need!
• Hypothesis: if the meaning of a word has not changed between two corpora, the sibling distributions will be similar to those in the original corpora upon a random swapping of sentences.
[Figure: sentences s1 and s2 are randomly swapped between Corpus 1 (D1) and Corpus 2 (D2) to produce D1,swap and D2,swap.]
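A toy sketch of the swapping check with placeholder embeddings: exchange a random subset of occurrences between the two corpora and compare a simple distance before and after the swap (the test statistic used in the paper may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.normal(0.0, 1.0, size=(200, 768))   # sibling embeddings of w in corpus 1 (placeholder)
D2 = rng.normal(0.3, 1.0, size=(180, 768))   # sibling embeddings of w in corpus 2 (placeholder)

def swap(D1, D2, n_swaps=50):
    """Randomly exchange n_swaps occurrences between the two corpora."""
    i1 = rng.choice(len(D1), n_swaps, replace=False)
    i2 = rng.choice(len(D2), n_swaps, replace=False)
    D1_swap = np.vstack([np.delete(D1, i1, axis=0), D2[i2]])
    D2_swap = np.vstack([np.delete(D2, i2, axis=0), D1[i1]])
    return D1_swap, D2_swap

def mean_dist(A, B):
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))

D1s, D2s = swap(D1, D2)
# If w's meaning is stable, the swapped distributions should look much like the
# originals; a large change after swapping signals a semantic difference.
print(mean_dist(D1, D2), mean_dist(D1s, D2s))
```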

Slide 39

Slide 39 text

What is next…
• LLMs are trained to predict only the single choice made by the human writer and are unaware of the alternatives considered
  • Can we use LLMs to predict the output distributions considered by the human writer instead of the selected one?
• Time adaptation still requires fine-tuning, which is costly for LLMs.
  • Parameter Efficient Fine-Tuning (PEFT) methods (e.g. Adapters, LoRA, etc.) should be considered.
• Most words do not change their meaning (at least within shorter time intervals)
  • On-demand updates — only update words (and their contexts) that changed in meaning
• Periodic Temporal Shifts

Slide 40

Slide 40 text

Where are we going? 40

Slide 41

Slide 41 text

Where are we should we be going? 41

Slide 42

Slide 42 text

Where are we should we be going?
• Danushka's hot take
  • LLMs are great, and (some amount of) hype is good for the field. We could/should analyse the texts generated by LLMs to see how they differ (or not) from texts written by humans.
• But
  • I do not believe LLMs are "models" of language (rather, models that can generate language)
  • We need to love the exceptions! and not sweep them under the carpet. The types of mistakes made by a model tell more about what it understands than the ones it gets correct.
  • We are scared our papers will get rejected if we talk more about the mistakes our models make … this is bad science.

Slide 43

Slide 43 text

Questions
Danushka Bollegala
https://danushka.net
[email protected]
@Bollegala
Thank You