SSL: Self-Supervised Learning

The Department of Information Science in the Faculty of Science at the University of Tokyo (commonly known as 理情) offers a course called Information Science Exercises III (commonly known as 演習3), in which students visit a total of three laboratories and work on a short-term project at each. These slides are the result of my survey of self-supervised learning (SSL), carried out during my visit to the Sugiyama Laboratory. In Japanese, SSL is called 自己教師あり学習.

These slides explain self-supervised learning (SSL).

Masayuki Usui

November 08, 2023

Transcript

  1. Agenda
     1. Concise overview of self-supervised learning
     2. Basic framework of self-supervised learning
     3. Recent methods for self-supervised learning
     4. Theoretical analysis of self-supervised learning
  2. Concise overview of self-supervised learning
     Self-supervised learning is often used for unsupervised representation learning.
     In natural language processing, pre-training using masked language modeling (MLM) is widely adopted, and it can be viewed as self-supervised learning (a tiny illustration follows below).
     In computer vision, the use of self-supervised learning is relatively limited but actively researched.
     I focus on self-supervised learning for CNNs in this presentation, but self-supervised learning for ViTs is also studied.
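A tiny illustration of why masked language modeling counts as self-supervised learning: the training targets are created from the input itself by masking, so no human labels are needed. This sketch is not from the slides; the token IDs, masking rate, and function name are made up.

```python
import random

def mask_tokens(token_ids, mask_id=0, mask_prob=0.15, seed=None):
    """Turn a plain token sequence into (masked input, prediction targets)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for t in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # replace the token with a [MASK] placeholder
            labels.append(t)         # the model must predict the original token here
        else:
            inputs.append(t)
            labels.append(-100)      # conventionally ignored by the loss
    return inputs, labels

# Example: labels are derived from the unlabeled input itself.
inputs, labels = mask_tokens([101, 2023, 2003, 1037, 7953, 102], seed=0)
```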
  3. Basic framework of self-supervised learning
     Popular methods in self-supervised learning for CNNs are mostly based on maximizing the similarity between features of different views of the same image.
     Views are generated by data augmentation such as random cropping.
     The features of the views are extracted by neural networks (often referred to as encoders), and the similarity between these features is then maximized (see the sketch below).
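A minimal sketch of this two-view pipeline in PyTorch. The particular augmentations and the ResNet-50 backbone are illustrative assumptions, not the setup of any specific paper.

```python
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Data augmentation that produces a random "view" each time it is applied.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.ToTensor(),
])

# Encoder: a CNN backbone whose classification head is replaced by the identity.
encoder = models.resnet50()
encoder.fc = nn.Identity()

def two_view_features(pil_image):
    """Return the features of two augmented views of one image."""
    v1 = augment(pil_image).unsqueeze(0)   # view 1, shape (1, 3, 224, 224)
    v2 = augment(pil_image).unsqueeze(0)   # view 2
    # These are the features whose similarity the SSL objective maximizes.
    return encoder(v1), encoder(v2)
```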
  4. Basic framework of self-supervised learning (cont’d)
     [Figure 3 of Chen and He, 2021: comparison of Siamese architectures (SimCLR, BYOL, SwAV, SimSiam). The encoder includes all layers that can be shared between both branches; dashed lines indicate gradient propagation, and the lack of a dashed line implies stop-gradient.]
     Figure: cited from Chen and He, 2021
  5. Recent methods for self-supervised learning
     There is some influential work in the field of self-supervised learning.
     From recent work, I picked out BYOL (Grill et al., 2020), SimSiam (Chen and He, 2021), and Barlow Twins (Zbontar et al., 2021).
     BYOL was presented at NeurIPS 2020, SimSiam at CVPR 2021, and Barlow Twins at ICML 2021.
     These methods will be explained in chronological order.
  6. BYOL
     [Figure 2 of Grill et al., 2020: BYOL's architecture. BYOL minimizes a similarity loss between $q_\theta(z_\theta)$ and $\mathrm{sg}(z'_\xi)$, where $\theta$ are the trained weights, $\xi$ is an exponential moving average of $\theta$, and $\mathrm{sg}$ means stop-gradient. At the end of training, everything but the encoder $f_\theta$ is discarded, and $y_\theta$ is used as the image representation.]
     Figure: cited from Grill et al., 2020
     The online and target networks have the same architecture (except for the predictor), but the parameters of the target network are an exponential moving average of the parameters of the online network.
  7. BYOL (cont’d)
     The loss is defined as follows:
     $$ L = \left\| \frac{q_\theta(z_\theta)}{\left\| q_\theta(z_\theta) \right\|_2} - \frac{z'_\xi}{\left\| z'_\xi \right\|_2} \right\|_2^2 = 2 - 2 \cdot \frac{\left\langle q_\theta(z_\theta),\, z'_\xi \right\rangle}{\left\| q_\theta(z_\theta) \right\|_2 \left\| z'_\xi \right\|_2} $$
     To symmetrize the loss, we sum the loss obtained by feeding $v$ to the online network and $v'$ to the target network, and the loss obtained by feeding $v'$ to the online network and $v$ to the target network.
     The loss is then minimized w.r.t. $\theta$ (via SGD), while $\xi$ is updated as an exponential moving average of $\theta$ (a sketch follows below).
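A minimal PyTorch-style sketch of this loss and the EMA target update. It assumes `online` and `target` objects with `encoder` and `projector` submodules (and a `predictor` on the online side only); these names and the decay rate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # L = 2 - 2 * <q_theta(z_theta), z'_xi> / (||q_theta(z_theta)||_2 * ||z'_xi||_2)
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1)

def symmetrized_byol_loss(online, target, v1, v2):
    # Feed v1 to the online network and v2 to the target network, then swap.
    p1 = online.predictor(online.projector(online.encoder(v1)))
    p2 = online.predictor(online.projector(online.encoder(v2)))
    with torch.no_grad():                       # stop-gradient on the target branch
        z1 = target.projector(target.encoder(v1))
        z2 = target.projector(target.encoder(v2))
    return (byol_loss(p1, z2) + byol_loss(p2, z1)).mean()

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied after each optimizer step.
    # The predictor is excluded because only the online network has one.
    online_params = list(online.encoder.parameters()) + list(online.projector.parameters())
    target_params = list(target.encoder.parameters()) + list(target.projector.parameters())
    for p_o, p_t in zip(online_params, target_params):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)
```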
  8. SimSiam
     [Figure 1 of Chen and He, 2021: SimSiam architecture. Two augmented views of one image are processed by the same encoder network $f$ (a backbone plus a projection MLP). A prediction MLP $h$ is then applied on one side, and a stop-gradient operation on the other. The model maximizes the similarity between both sides, using neither negative pairs nor a momentum encoder.]
     Figure: cited from Chen and He, 2021
     SimSiam is very similar to BYOL but uses the same network for both encoders (i.e., weight sharing).
     Let $z_i = f(x_i)$ and $p_i = h(z_i)$ for $i = 1, 2$.
  9. SimSiam (cont’d)
     The loss is defined as follows:
     $$ L = \frac{1}{2} D(p_1, z_2) + \frac{1}{2} D(p_2, z_1) $$
     where $D(x, y)$ denotes the negative cosine similarity.
     Minimization of this objective, however, leads to collapsing solutions (constant representations are trivially optimal solutions).
     Chen and He (2021) found that adding a stop-gradient operation prevents the encoders from converging to collapsing solutions.
     The loss is then modified as follows:
     $$ L = \frac{1}{2} D(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2} D(p_2, \mathrm{stopgrad}(z_1)) $$
     where the stop-gradient operation $\mathrm{stopgrad}$ means that no gradient is propagated through its argument during backpropagation (see the sketch below).
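A minimal PyTorch sketch of this objective, assuming `f` is the encoder (backbone plus projection MLP) and `h` is the prediction MLP as in the slide; `detach()` plays the role of the stop-gradient operation.

```python
import torch.nn.functional as F

def negative_cosine_similarity(p, z):
    # D(p, stopgrad(z)) = -<p, z> / (||p|| * ||z||); z.detach() is the stop-gradient.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)   # features of the two views (same weights for both)
    p1, p2 = h(z1), h(z2)   # predictions
    # L = 1/2 D(p1, stopgrad(z2)) + 1/2 D(p2, stopgrad(z1))
    return 0.5 * negative_cosine_similarity(p1, z2) + \
           0.5 * negative_cosine_similarity(p2, z1)
```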
  10. Barlow Twins
     [Figure 1 of Zbontar et al., 2021: Barlow Twins' objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This makes the embedding vectors of distorted versions of a sample similar, while minimizing the redundancy between the components of these vectors.]
     Figure: cited from Zbontar et al., 2021
     Barlow Twins is essentially based on comparing the cross-correlation matrix of the two embeddings against the identity matrix.
     This distinguishes Barlow Twins from BYOL and SimSiam.
  11. Barlow Twins (cont’d)
     After centering $Z^A$ and $Z^B$ (subtracting from each variate its mean over the batch so that it has zero mean), the cross-correlation matrix $C$ is defined as follows:
     $$ C_{ij} = \frac{\sum_{b \in B} z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_{b \in B} \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_{b \in B} \left(z^B_{b,j}\right)^2}} $$
     where $B$ denotes the batch.
     The loss function $L$ is then defined as follows:
     $$ L = \sum_i \left(1 - C_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2 $$
     where $\lambda$ is a hyperparameter.
     Minimizing this objective makes the cross-correlation matrix close to the identity matrix (a sketch follows below).
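A minimal PyTorch sketch of this loss, assuming `z_a` and `z_b` are the (batch, dim) embeddings of the two distorted views; the value of `lambda_` is an illustrative placeholder, not a recommended setting.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_=5e-3, eps=1e-12):
    # Center each embedding dimension over the batch.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    # C_ij = sum_b z^A_{b,i} z^B_{b,j} / (||z^A_{:,i}|| * ||z^B_{:,j}||)
    c = (z_a.T @ z_b) / (z_a.norm(dim=0).unsqueeze(1) * z_b.norm(dim=0).unsqueeze(0) + eps)
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()                 # (1 - C_ii)^2 terms
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()    # C_ij^2 terms, i != j
    return on_diag + lambda_ * off_diag
```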
  12. Theoretical analysis of self-supervised learning
     Some work tries to explain self-supervised learning in theoretical terms.
     The theoretical analysis presented in Lyu et al. (2022) is described here.
     Lyu et al. (2022) define a generative model, formulate an optimization problem, and prove several theorems.
  13. Generative model
     We consider the following multiview generative model:
     $$ x^{(q)}_l = g^{(q)}\!\left(\begin{bmatrix} z_l \\ c^{(q)}_l \end{bmatrix}\right) \quad \text{for } q = 1, 2 $$
     where $x^{(q)}_l$ is the $l$th sample of the $q$th view, $z_l$ is the component shared across views, $c^{(q)}_l$ is the private information of the $q$th view, and $g^{(q)}$ is an invertible and smooth function.
     $z_l$ and $c^{(q)}_l$ are regarded as the $l$th samples of random variables $z$ and $c^{(q)}$, respectively.
     We use the following independence assumption:
     $$ p(z, c^{(1)}, c^{(2)}) = p(z)\, p(c^{(1)})\, p(c^{(2)}) $$
     In this formulation, the goal is to extract $z_l$ and $c^{(q)}_l$ from $x^{(q)}_l$ (a toy instantiation follows below).
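A toy NumPy instantiation of this generative model. The dimensions, distributions, and the particular invertible mixing function are all illustrative assumptions, not part of Lyu et al. (2022).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_z, d_c = 1000, 4, 2

z  = rng.standard_normal((N, d_z))   # shared component z_l
c1 = rng.standard_normal((N, d_c))   # private component of view 1
c2 = rng.standard_normal((N, d_c))   # private component of view 2

def make_invertible_mixing(dim):
    # An invertible linear map followed by a strictly increasing pointwise
    # nonlinearity gives a simple smooth invertible g^(q).
    A = rng.standard_normal((dim, dim)) + dim * np.eye(dim)
    return lambda u: np.tanh(u @ A) + 0.1 * (u @ A)

g1 = make_invertible_mixing(d_z + d_c)
g2 = make_invertible_mixing(d_z + d_c)

x1 = g1(np.concatenate([z, c1], axis=1))   # x_l^(1) = g^(1)([z_l; c_l^(1)])
x2 = g2(np.concatenate([z, c2], axis=1))   # x_l^(2) = g^(2)([z_l; c_l^(2)])
```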
  14. Optimization problem
     To achieve the aforementioned goal, we consider the following latent correlation maximization problem:
     $$
     \begin{aligned}
     \underset{f^{(1)},\, f^{(2)}}{\text{maximize}} \quad & \operatorname{Tr}\!\left( \frac{1}{N} \sum_{l=1}^{N} f^{(1)}_S\!\left(x^{(1)}_l\right) f^{(2)}_S\!\left(x^{(2)}_l\right)^{\!\top} \right) \\
     \text{subject to} \quad & \frac{1}{N} \sum_{l=1}^{N} f^{(q)}_S\!\left(x^{(q)}_l\right) = 0, \\
     & \frac{1}{N} \sum_{l=1}^{N} f^{(q)}_S\!\left(x^{(q)}_l\right) f^{(q)}_S\!\left(x^{(q)}_l\right)^{\!\top} = I, \\
     & f^{(q)}_S\!\left(x^{(q)}_l\right) \perp\!\!\!\perp f^{(q)}_P\!\left(x^{(q)}_l\right), \\
     & f^{(q)} \text{ is invertible,}
     \end{aligned}
     $$
     where $f^{(q)}$ is the feature extractor of view $q$.
  15. Optimization problem (cont’d)
     We used the following notation:
     $$ f^{(q)}\!\left(x^{(q)}_l\right) = \begin{bmatrix} f^{(q)}_S\!\left(x^{(q)}_l\right) \\ f^{(q)}_P\!\left(x^{(q)}_l\right) \end{bmatrix} $$
     where $f^{(q)}_S(x^{(q)}_l)$ and $f^{(q)}_P(x^{(q)}_l)$ correspond to the shared and private components of the extracted features, respectively.
     Under the above constraints, the following equivalence holds:
     $$ \max_{f^{(q)}} \operatorname{Tr}\!\left( \frac{1}{N} \sum_{l=1}^{N} f^{(1)}_S\!\left(x^{(1)}_l\right) f^{(2)}_S\!\left(x^{(2)}_l\right)^{\!\top} \right) \iff \min_{f^{(q)}} \frac{1}{N} \sum_{l=1}^{N} \left\| f^{(1)}_S\!\left(x^{(1)}_l\right) - f^{(2)}_S\!\left(x^{(2)}_l\right) \right\|_2^2 $$
     The left-hand side is similar to the objective used in Barlow Twins, whereas the right-hand side is similar to the objectives used in BYOL and SimSiam (a numerical check follows below).
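A small numerical check of this equivalence (illustrative only): under the zero-mean and identity-covariance constraints, the squared-distance objective equals a constant minus twice the trace objective, so maximizing one is equivalent to minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3

def whiten(F):
    # Enforce the constraints: zero mean and (1/N) * F^T F = I.
    F = F - F.mean(axis=0)
    cov = (F.T @ F) / N
    L = np.linalg.cholesky(cov)
    return F @ np.linalg.inv(L).T

F1 = whiten(rng.standard_normal((N, d)))              # rows play the role of f_S^(1)(x_l^(1))
F2 = whiten(F1 + 0.3 * rng.standard_normal((N, d)))   # a correlated "second view"

trace_obj = np.trace((F1.T @ F2) / N)
dist_obj = np.mean(np.sum((F1 - F2) ** 2, axis=1))

# Under the constraints, dist_obj = 2 d - 2 * trace_obj, so maximizing the
# trace objective is the same as minimizing the distance objective.
assert np.isclose(dist_obj, 2 * d - 2 * trace_obj)
```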
  16. Theorem 1
     Denote by $\hat{f}^{(q)}$ any solution of the above optimization problem, and assume that the first-order derivative of $\hat{f}^{(q)} \circ g^{(q)}$ exists.
     Then there exists an invertible function $\gamma$ such that $\hat{z} = \hat{f}^{(q)}_S(x^{(q)}) = \gamma(z)$ holds.
     Due to the invertibility of $\gamma$, $\hat{z} = \gamma(z)$ means that $\hat{z}$ retains all the information of $z$.
     Theorem 1 holds whether or not the constraint $f^{(q)}_S(x^{(q)}_l) \perp\!\!\!\perp f^{(q)}_P(x^{(q)}_l)$ is enforced.
  17. Theorem 2
     Assume the same conditions as in Theorem 1.
     Then there exists an invertible function $\delta^{(q)}$ such that $\hat{c}^{(q)} = \hat{f}^{(q)}_P(x^{(q)}) = \delta^{(q)}(c^{(q)})$ holds.
     Due to the invertibility of $\delta^{(q)}$, $\hat{c}^{(q)} = \delta^{(q)}(c^{(q)})$ means that $\hat{c}^{(q)}$ retains all the information of $c^{(q)}$.
     Theorem 2 requires that the constraint $f^{(q)}_S(x^{(q)}_l) \perp\!\!\!\perp f^{(q)}_P(x^{(q)}_l)$ be enforced.
  18. Limitations
     Theorem 1 and Theorem 2 consider the limiting case $N \to \infty$.
     The above generative model and optimization problem are similar, but not identical, to the settings and methods actually used in practice.
  19. References
     Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15750–15758, June 2021.
     Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf.
     Qi Lyu, Xiao Fu, Weiran Wang, and Songtao Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5FUq05QRc5b.
     Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12310–12320. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/zbontar21a.html.