SSL: Self-Supervised Learning

The Department of Information Science in the Faculty of Science at the University of Tokyo (commonly known as 理情) offers a course called Information Science Exercises III (commonly known as 演習3), in which students visit a total of three laboratories and work on a short-term project at each. These slides are the result of my survey of self-supervised learning (SSL), carried out during my visit to the Sugiyama Laboratory. In Japanese, SSL is called 自己教師あり学習.

These slides explain self-supervised learning (SSL).

Masayuki Usui

November 08, 2023

Transcript

  1. Agenda
     1. Concise overview of self-supervised learning
     2. Basic framework of self-supervised learning
     3. Recent methods for self-supervised learning
     4. Theoretical analysis of self-supervised learning
  2. Concise overview of self-supervised learning
     Self-supervised learning is often used for unsupervised representation learning.
     In natural language processing, pre-training using masked language modeling (MLM) is widely adopted, and it can be viewed as self-supervised learning (a tiny illustration follows below).
     In computer vision, the use of self-supervised learning is relatively limited but actively researched.
     I focus on self-supervised learning for CNNs in this presentation, but self-supervised learning for ViTs is also studied.
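A tiny illustration of why masked language modeling counts as self-supervised learning: the training targets are created from the input itself by masking, so no human labels are needed. This sketch is not from the slides; the token IDs, masking rate, and function name are made up.

```python
import random

def mask_tokens(token_ids, mask_id=0, mask_prob=0.15, seed=None):
    """Turn a plain token sequence into (masked input, prediction targets)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for t in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # replace the token with a [MASK] placeholder
            labels.append(t)         # the model must predict the original token here
        else:
            inputs.append(t)
            labels.append(-100)      # conventionally ignored by the loss
    return inputs, labels

# Example: labels are derived from the unlabeled input itself.
inputs, labels = mask_tokens([101, 2023, 2003, 1037, 7953, 102], seed=0)
```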
  3. Basic framework of self-supervised learning
     Popular methods in self-supervised learning for CNNs are mostly based on maximizing the similarity between features of different views of the same image.
     Views are generated by data augmentation such as random cropping.
     The features of the views are extracted by neural networks (often referred to as encoders), and the similarity between these features is then maximized (see the sketch below).
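A minimal sketch of this two-view pipeline in PyTorch. The particular augmentations and the ResNet-50 backbone are illustrative assumptions, not the setup of any specific paper.

```python
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Data augmentation that produces a random "view" each time it is applied.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.ToTensor(),
])

# Encoder: a CNN backbone whose classification head is replaced by the identity.
encoder = models.resnet50()
encoder.fc = nn.Identity()

def two_view_features(pil_image):
    """Return the features of two augmented views of one image."""
    v1 = augment(pil_image).unsqueeze(0)   # view 1, shape (1, 3, 224, 224)
    v2 = augment(pil_image).unsqueeze(0)   # view 2
    # These are the features whose similarity the SSL objective maximizes.
    return encoder(v1), encoder(v2)
```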
  4. Basic framework of self-supervised learning (cont’d)
     [Figure 3 of Chen and He, 2021: comparison of Siamese architectures (SimCLR, BYOL, SwAV, SimSiam). The encoder includes all layers that can be shared between both branches; dashed lines indicate gradient propagation, and the lack of a dashed line implies stop-gradient.]
     Figure: cited from Chen and He, 2021
  5. Recent methods for self-supervised learning
     There is some influential work in the field of self-supervised learning.
     From recent work, I picked out BYOL (Grill et al., 2020), SimSiam (Chen and He, 2021), and Barlow Twins (Zbontar et al., 2021).
     BYOL was presented at NeurIPS 2020, SimSiam at CVPR 2021, and Barlow Twins at ICML 2021.
     These methods will be explained in chronological order.
  6. BYOL
     [Figure 2 of Grill et al., 2020: BYOL's architecture. BYOL minimizes a similarity loss between $q_\theta(z_\theta)$ and $\mathrm{sg}(z'_\xi)$, where $\theta$ are the trained weights, $\xi$ is an exponential moving average of $\theta$, and $\mathrm{sg}$ means stop-gradient. At the end of training, everything but the encoder $f_\theta$ is discarded, and $y_\theta$ is used as the image representation.]
     Figure: cited from Grill et al., 2020
     The online and target networks have the same architecture (except for the predictor), but the parameters of the target network are an exponential moving average of the parameters of the online network.
  7. BYOL (cont’d)
     The loss is defined as follows:
     $$ L = \left\| \frac{q_\theta(z_\theta)}{\left\| q_\theta(z_\theta) \right\|_2} - \frac{z'_\xi}{\left\| z'_\xi \right\|_2} \right\|_2^2 = 2 - 2 \cdot \frac{\left\langle q_\theta(z_\theta),\, z'_\xi \right\rangle}{\left\| q_\theta(z_\theta) \right\|_2 \left\| z'_\xi \right\|_2} $$
     To symmetrize the loss, we sum the loss obtained by feeding $v$ to the online network and $v'$ to the target network, and the loss obtained by feeding $v'$ to the online network and $v$ to the target network.
     The loss is then minimized w.r.t. $\theta$ (via SGD), while $\xi$ is updated as an exponential moving average of $\theta$ (a sketch follows below).
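A minimal PyTorch-style sketch of this loss and the EMA target update. It assumes `online` and `target` objects with `encoder` and `projector` submodules (and a `predictor` on the online side only); these names and the decay rate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # L = 2 - 2 * <q_theta(z_theta), z'_xi> / (||q_theta(z_theta)||_2 * ||z'_xi||_2)
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1)

def symmetrized_byol_loss(online, target, v1, v2):
    # Feed v1 to the online network and v2 to the target network, then swap.
    p1 = online.predictor(online.projector(online.encoder(v1)))
    p2 = online.predictor(online.projector(online.encoder(v2)))
    with torch.no_grad():                       # stop-gradient on the target branch
        z1 = target.projector(target.encoder(v1))
        z2 = target.projector(target.encoder(v2))
    return (byol_loss(p1, z2) + byol_loss(p2, z1)).mean()

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied after each optimizer step.
    # The predictor is excluded because only the online network has one.
    online_params = list(online.encoder.parameters()) + list(online.projector.parameters())
    target_params = list(target.encoder.parameters()) + list(target.projector.parameters())
    for p_o, p_t in zip(online_params, target_params):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)
```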
  8. SimSiam
     [Figure 1 of Chen and He, 2021: SimSiam architecture. Two augmented views of one image are processed by the same encoder network $f$ (a backbone plus a projection MLP). A prediction MLP $h$ is then applied on one side, and a stop-gradient operation on the other. The model maximizes the similarity between both sides, using neither negative pairs nor a momentum encoder.]
     Figure: cited from Chen and He, 2021
     SimSiam is very similar to BYOL but uses the same network for both encoders (i.e., weight sharing).
     Let $z_i = f(x_i)$ and $p_i = h(z_i)$ for $i = 1, 2$.
  9. SimSiam (cont’d)
     The loss is defined as follows:
     $$ L = \frac{1}{2} D(p_1, z_2) + \frac{1}{2} D(p_2, z_1) $$
     where $D(x, y)$ denotes the negative cosine similarity.
     Minimization of this objective, however, leads to collapsing solutions (constant representations are trivially optimal solutions).
     Chen and He (2021) found that adding a stop-gradient operation prevents the encoders from converging to collapsing solutions.
     The loss is then modified as follows:
     $$ L = \frac{1}{2} D(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2} D(p_2, \mathrm{stopgrad}(z_1)) $$
     where the stop-gradient operation $\mathrm{stopgrad}$ means that no gradient is propagated through its argument during backpropagation (see the sketch below).
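A minimal PyTorch sketch of this objective, assuming `f` is the encoder (backbone plus projection MLP) and `h` is the prediction MLP as in the slide; `detach()` plays the role of the stop-gradient operation.

```python
import torch.nn.functional as F

def negative_cosine_similarity(p, z):
    # D(p, stopgrad(z)) = -<p, z> / (||p|| * ||z||); z.detach() is the stop-gradient.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)   # features of the two views (same weights for both)
    p1, p2 = h(z1), h(z2)   # predictions
    # L = 1/2 D(p1, stopgrad(z2)) + 1/2 D(p2, stopgrad(z1))
    return 0.5 * negative_cosine_similarity(p1, z2) + \
           0.5 * negative_cosine_similarity(p2, z1)
```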
  10. Barlow Twins
     [Figure 1 of Zbontar et al., 2021: Barlow Twins' objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This makes the embedding vectors of distorted versions of a sample similar, while minimizing the redundancy between the components of these vectors.]
     Figure: cited from Zbontar et al., 2021
     Barlow Twins is essentially based on comparing the cross-correlation matrix of the two embeddings against the identity matrix.
     This distinguishes Barlow Twins from BYOL and SimSiam.
  11. Barlow Twins (cont’d)
     After centering $Z^A$ and $Z^B$ (subtracting from each variate its mean over the batch so that it has zero mean), the cross-correlation matrix $C$ is defined as follows:
     $$ C_{ij} = \frac{\sum_{b \in B} z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_{b \in B} \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_{b \in B} \left(z^B_{b,j}\right)^2}} $$
     where $B$ denotes the batch.
     The loss function $L$ is then defined as follows:
     $$ L = \sum_i \left(1 - C_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2 $$
     where $\lambda$ is a hyperparameter.
     Minimizing this objective makes the cross-correlation matrix close to the identity matrix (a sketch follows below).
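A minimal PyTorch sketch of this loss, assuming `z_a` and `z_b` are the (batch, dim) embeddings of the two distorted views; the value of `lambda_` is an illustrative placeholder, not a recommended setting.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_=5e-3, eps=1e-12):
    # Center each embedding dimension over the batch.
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    # C_ij = sum_b z^A_{b,i} z^B_{b,j} / (||z^A_{:,i}|| * ||z^B_{:,j}||)
    c = (z_a.T @ z_b) / (z_a.norm(dim=0).unsqueeze(1) * z_b.norm(dim=0).unsqueeze(0) + eps)
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()                 # (1 - C_ii)^2 terms
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()    # C_ij^2 terms, i != j
    return on_diag + lambda_ * off_diag
```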
  12. Theoretical analysis of self-supervised learning
     Some work tries to explain self-supervised learning in theoretical terms.
     The theoretical analysis presented in Lyu et al. (2022) is described here.
     Lyu et al. (2022) define a generative model, formulate an optimization problem, and prove several theorems.
  13. Generative model
     We consider the following multiview generative model:
     $$ x^{(q)}_l = g^{(q)}\!\left(\begin{bmatrix} z_l \\ c^{(q)}_l \end{bmatrix}\right) \quad \text{for } q = 1, 2 $$
     where $x^{(q)}_l$ is the $l$th sample of the $q$th view, $z_l$ is the component shared across views, $c^{(q)}_l$ is the private information of the $q$th view, and $g^{(q)}$ is an invertible and smooth function.
     $z_l$ and $c^{(q)}_l$ are regarded as the $l$th samples of random variables $z$ and $c^{(q)}$, respectively.
     We use the following independence assumption:
     $$ p(z, c^{(1)}, c^{(2)}) = p(z)\, p(c^{(1)})\, p(c^{(2)}) $$
     In this formulation, the goal is to extract $z_l$ and $c^{(q)}_l$ from $x^{(q)}_l$ (a toy instantiation follows below).
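A toy NumPy instantiation of this generative model. The dimensions, distributions, and the particular invertible mixing function are all illustrative assumptions, not part of Lyu et al. (2022).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_z, d_c = 1000, 4, 2

z  = rng.standard_normal((N, d_z))   # shared component z_l
c1 = rng.standard_normal((N, d_c))   # private component of view 1
c2 = rng.standard_normal((N, d_c))   # private component of view 2

def make_invertible_mixing(dim):
    # An invertible linear map followed by a strictly increasing pointwise
    # nonlinearity gives a simple smooth invertible g^(q).
    A = rng.standard_normal((dim, dim)) + dim * np.eye(dim)
    return lambda u: np.tanh(u @ A) + 0.1 * (u @ A)

g1 = make_invertible_mixing(d_z + d_c)
g2 = make_invertible_mixing(d_z + d_c)

x1 = g1(np.concatenate([z, c1], axis=1))   # x_l^(1) = g^(1)([z_l; c_l^(1)])
x2 = g2(np.concatenate([z, c2], axis=1))   # x_l^(2) = g^(2)([z_l; c_l^(2)])
```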
  14. Optimization problem
     To achieve the aforementioned goal, we consider the following latent correlation maximization problem:
     $$
     \begin{aligned}
     \underset{f^{(1)},\, f^{(2)}}{\text{maximize}} \quad & \operatorname{Tr}\!\left( \frac{1}{N} \sum_{l=1}^{N} f^{(1)}_S\!\left(x^{(1)}_l\right) f^{(2)}_S\!\left(x^{(2)}_l\right)^{\!\top} \right) \\
     \text{subject to} \quad & \frac{1}{N} \sum_{l=1}^{N} f^{(q)}_S\!\left(x^{(q)}_l\right) = 0, \\
     & \frac{1}{N} \sum_{l=1}^{N} f^{(q)}_S\!\left(x^{(q)}_l\right) f^{(q)}_S\!\left(x^{(q)}_l\right)^{\!\top} = I, \\
     & f^{(q)}_S\!\left(x^{(q)}_l\right) \perp\!\!\!\perp f^{(q)}_P\!\left(x^{(q)}_l\right), \\
     & f^{(q)} \text{ is invertible,}
     \end{aligned}
     $$
     where $f^{(q)}$ is the feature extractor of view $q$.
  15. Optimization problem (cont’d)
     We used the following notation:
     $$ f^{(q)}\!\left(x^{(q)}_l\right) = \begin{bmatrix} f^{(q)}_S\!\left(x^{(q)}_l\right) \\ f^{(q)}_P\!\left(x^{(q)}_l\right) \end{bmatrix} $$
     where $f^{(q)}_S(x^{(q)}_l)$ and $f^{(q)}_P(x^{(q)}_l)$ correspond to the shared and private components of the extracted features, respectively.
     Under the above constraints, the following equivalence holds:
     $$ \max_{f^{(q)}} \operatorname{Tr}\!\left( \frac{1}{N} \sum_{l=1}^{N} f^{(1)}_S\!\left(x^{(1)}_l\right) f^{(2)}_S\!\left(x^{(2)}_l\right)^{\!\top} \right) \iff \min_{f^{(q)}} \frac{1}{N} \sum_{l=1}^{N} \left\| f^{(1)}_S\!\left(x^{(1)}_l\right) - f^{(2)}_S\!\left(x^{(2)}_l\right) \right\|_2^2 $$
     The left-hand side is similar to the objective used in Barlow Twins, whereas the right-hand side is similar to the objectives used in BYOL and SimSiam (a numerical check follows below).
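A small numerical check of this equivalence (illustrative only): under the zero-mean and identity-covariance constraints, the squared-distance objective equals a constant minus twice the trace objective, so maximizing one is equivalent to minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3

def whiten(F):
    # Enforce the constraints: zero mean and (1/N) * F^T F = I.
    F = F - F.mean(axis=0)
    cov = (F.T @ F) / N
    L = np.linalg.cholesky(cov)
    return F @ np.linalg.inv(L).T

F1 = whiten(rng.standard_normal((N, d)))              # rows play the role of f_S^(1)(x_l^(1))
F2 = whiten(F1 + 0.3 * rng.standard_normal((N, d)))   # a correlated "second view"

trace_obj = np.trace((F1.T @ F2) / N)
dist_obj = np.mean(np.sum((F1 - F2) ** 2, axis=1))

# Under the constraints, dist_obj = 2 d - 2 * trace_obj, so maximizing the
# trace objective is the same as minimizing the distance objective.
assert np.isclose(dist_obj, 2 * d - 2 * trace_obj)
```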
  16. Theorem 1
     Denote by $\hat{f}^{(q)}$ any solution of the above optimization problem, and assume that the first-order derivative of $\hat{f}^{(q)} \circ g^{(q)}$ exists.
     Then there exists an invertible function $\gamma$ such that $\hat{z} = \hat{f}^{(q)}_S(x^{(q)}) = \gamma(z)$ holds.
     Due to the invertibility of $\gamma$, $\hat{z} = \gamma(z)$ means that $\hat{z}$ retains all the information of $z$.
     Theorem 1 holds whether or not the constraint $f^{(q)}_S(x^{(q)}_l) \perp\!\!\!\perp f^{(q)}_P(x^{(q)}_l)$ is enforced.
  17. Theorem 2
     Assume the same conditions as in Theorem 1.
     Then there exists an invertible function $\delta^{(q)}$ such that $\hat{c}^{(q)} = \hat{f}^{(q)}_P(x^{(q)}) = \delta^{(q)}(c^{(q)})$ holds.
     Due to the invertibility of $\delta^{(q)}$, $\hat{c}^{(q)} = \delta^{(q)}(c^{(q)})$ means that $\hat{c}^{(q)}$ retains all the information of $c^{(q)}$.
     Theorem 2 requires that the constraint $f^{(q)}_S(x^{(q)}_l) \perp\!\!\!\perp f^{(q)}_P(x^{(q)}_l)$ be enforced.
  18. Limitations
     Theorem 1 and Theorem 2 consider the limiting case $N \to \infty$.
     The above generative model and optimization problem are similar, but not identical, to the settings and methods actually used in practice.
  19. References
     Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15750–15758, June 2021.
     Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf.
     Qi Lyu, Xiao Fu, Weiran Wang, and Songtao Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5FUq05QRc5b.
     Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12310–12320. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/zbontar21a.html.