Slide 7
Various pretext tasks

• Solving Jigsaw Puzzles
  Shuffle nine tiled patches and predict the index of the shuffled ordering (permutation)
• Context Encoders [Pathak+, CVPR16]
  Predict the masked region with an encoder-decoder model
• Colorful Image Colorization [Zhang+, ECCV16]
  Predict the ab values in Lab color space from a grayscale image converted from a color image
• Non-Parametric Instance Discrimination [Wu+, CVPR18]
  Learn features so that each image's feature is independent of the others (each data sample is treated as its own class)
• Context Prediction [Doersch+, ICCV15]
  Predict the relative position between patches cropped in a tile layout
• Contrastive Predictive Coding (CPCv2) [Hénaff+, ICML20]
  From a patch's features, predict the features of the patch k positions ahead
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
Fig. 3: Context Free Network. The figure illustrates how a puzzle is generated
and solved. We randomly crop a 225 × 225 pixel window from an image (red
dashed box), divide it into a 3 × 3 grid, and randomly pick a 64 × 64 pixel tile
from each 75 × 75 pixel cell. These 9 tiles are reordered via a randomly chosen
permutation from a predefined permutation set and are then fed to the CFN. The
task is to predict the index of the chosen permutation (technically, we define as
output a probability vector with 1 at the 64-th location and 0 elsewhere). The
CFN is a siamese-ennead CNN. For simplicity, we do not indicate the max-
pooling and ReLU layers. These shared layers are implemented exactly as in
AlexNet [25]. In the transfer learning experiments we show results with
the trained weights transferred on AlexNet (precisely, stride 4 on the
first layer). The training in the transfer learning experiment is the
same as in the other competing methods. Notice instead, that during
the training on the puzzle task, we set the stride of the first layer of
the CFN to 2 instead of 4.
that permutation, and ask the CFN to return a vector with the probability value
for each index. Given 9 tiles, there are 9! = 362,880 possible permutations. From
our experimental validation, we found that the permutation set is an important
factor in the performance of the representation that the network learns. We
perform an ablation study on the impact of the permutation set in subsection 4.2.
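To make the recipe above concrete, here is a minimal sketch (not the authors' code) of generating one jigsaw training example with NumPy. The function name make_jigsaw_sample is illustrative, and `permutations` stands in for the paper's predefined permutation set of 9-element orderings.

```python
import numpy as np

def make_jigsaw_sample(image, permutations, rng=np.random):
    """Build one jigsaw training example following the recipe in Fig. 3:
    crop a 225x225 window, split it into a 3x3 grid of 75x75 cells,
    take a random 64x64 tile from each cell, shuffle the tiles with a
    permutation drawn from a predefined set, and return (tiles, index)."""
    H, W, _ = image.shape
    top = rng.randint(0, H - 225 + 1)
    left = rng.randint(0, W - 225 + 1)
    window = image[top:top + 225, left:left + 225]

    tiles = []
    for row in range(3):
        for col in range(3):
            cell = window[row * 75:(row + 1) * 75, col * 75:(col + 1) * 75]
            dy, dx = rng.randint(0, 75 - 64 + 1, size=2)
            tiles.append(cell[dy:dy + 64, dx:dx + 64])

    perm_index = rng.randint(len(permutations))       # the label the CFN must predict
    shuffled = [tiles[i] for i in permutations[perm_index]]
    return np.stack(shuffled), perm_index
```

How large the permutation set is and how it is chosen are exactly the ablation mentioned above; this sketch treats the set as given.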
3.3 Training the CFN
The output of the CFN can be seen as the conditional probability density func-
tion (pdf) of the spatial arrangement of object parts (or scene parts) in a part-
based model, i.e.,
p(S | A_1, A_2, ..., A_9) = p(S | F_1, F_2, ..., F_9) ∏_{i=1}^{9} p(F_i | A_i)    (1)
where S is the configuration of the tiles, A_i is the i-th part appearance of the
object, and {F_i}_{i=1,...,9} form the intermediate feature representation. Our ob-
jective is to train the CFN so that the features F_i have semantic attributes that
can identify the relative position between parts.
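A minimal PyTorch sketch of the siamese-ennead structure implied by Eq. (1): a shared encoder computes each F_i from its tile A_i independently, and a classifier over the concatenated features models p(S | F_1, ..., F_9). The layer sizes and the small stand-in encoder are assumptions, not the paper's AlexNet-based CFN.

```python
import torch
import torch.nn as nn

class ContextFreeNetwork(nn.Module):
    """Sketch of a siamese-ennead CFN: one shared encoder maps each of the
    9 tiles to a feature F_i; the concatenated features are classified into
    one of the predefined permutation indices (the configuration S)."""
    def __init__(self, num_permutations, feat_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(        # stand-in for the shared AlexNet-style tower
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(9 * feat_dim, num_permutations)

    def forward(self, tiles):                # tiles: (batch, 9, 3, 64, 64)
        feats = [self.encoder(tiles[:, i]) for i in range(9)]   # each F_i computed from A_i alone
        return self.classifier(torch.cat(feats, dim=1))          # logits over permutation indices

# training reduces to cross-entropy against the permutation index:
# loss = nn.functional.cross_entropy(model(tiles), perm_index)
```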
differ in the approach: whereas [7] are solving a discrimina-
tive task (is patch A above patch B or below?), our context
encoder solves a pure prediction problem (what pixel inten-
sities should go in the hole?). Interestingly, a similar distinction
exists in using language context to learn word embeddings:
Collobert and Weston [5] advocate a discriminative
approach, whereas word2vec [30] formulates it as word pre-
diction. One important benefit of our approach is that our
supervisory signal is much richer: a context encoder needs
to predict roughly 15,000 real values per training example,
compared to just 1 option among 8 choices in [7]. Likely
due in part to this difference, our context encoders take far
less time to train than [7]. Moreover, context based predic-
tion is also harder to “cheat” since low-level image features,
such as chromatic aberration, do not provide any meaning-
ful information, in contrast to [7] where chromatic aberra-
tion partially solves the task. On the other hand, it is not yet
clear if requiring faithful pixel generation is necessary for
learning good visual features.
Image generation Generative models of natural images
have enjoyed significant research interest [16, 24, 35]. Re-
cently, Radford et al. [33] proposed new convolutional ar-
chitectures and optimization hyperparameters for Genera-
tive Adversarial Networks (GAN) [16] producing encour-
aging results. We train our context encoders using an ad-
versary jointly with reconstruction loss for generating in-
painting results. We discuss this in detail in Section 3.2.
Dosovitskiy et al. [10] and Rifai et al. [36] demonstrate
that CNNs can learn to generate novel images of particular
object categories (chairs and faces, respectively), but rely on
large labeled datasets with examples of these categories. In
contrast, context encoders can be applied to any unlabeled
image database and learn to generate images based on the
surrounding context.
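As a rough sketch of the joint training mentioned above (a reconstruction loss combined with an adversary; the paper's Section 3.2 gives the real formulation), the combined objective might look like the following. The loss weights, the discriminator interface, and the masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def context_encoder_loss(pred, target, mask, discriminator,
                         lambda_rec=0.999, lambda_adv=0.001):
    """Joint objective sketch: L2 reconstruction on the missing region plus an
    adversarial term that pushes the discriminator to rate the prediction as
    real.  `mask` is 1 inside the hole and 0 elsewhere; the weights are
    illustrative, not necessarily the paper's values."""
    rec = F.mse_loss(pred * mask, target * mask)       # penalize only the filled-in region
    d_out = discriminator(pred)                        # discriminator is assumed to return logits
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return lambda_rec * rec + lambda_adv * adv
```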
Inpainting and hole-filling It is important to point out
that our hole-filling task cannot be handled by classical in-
painting [4, 32] or texture synthesis [2, 11] approaches,
Figure 2: Context Encoder. The context image is passed
through the encoder to obtain features which are connected
to the decoder using a channel-wise fully-connected layer as
described in Section 3.1. The decoder then produces the
missing regions in the image.
3. Context encoders for image generation
We now introduce context encoders: CNNs that predict
missing parts of a scene from their surroundings. We first
give an overview of the general architecture, then provide
details on the learning procedure and finally present various
strategies for image region removal.
3.1. Encoder-decoder pipeline
The overall architecture is a simple encoder-decoder
pipeline. The encoder takes an input image with missing
regions and produces a latent feature representation of that
image. The decoder takes this feature representation and
produces the missing image content. We found it important
to connect the encoder and the decoder through a channel-
wise fully-connected layer, which allows each unit in the
decoder to reason about the entire image content. Figure 2
shows an overview of our architecture.
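A minimal sketch of what a channel-wise fully-connected layer can look like: each channel's spatial map is fully connected to that same channel's output map, with no cross-channel weights. The class name and initialization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Channel-wise fully-connected layer sketch: every channel's spatial map
    (e.g. the 6x6 grid from pool5) is fully connected to the same channel's
    output map; parameters are not shared across channels and there are no
    cross-channel connections."""
    def __init__(self, channels, spatial):             # e.g. channels=256, spatial=6*6
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, spatial, spatial) * 0.01)
        self.bias = nn.Parameter(torch.zeros(channels, spatial))

    def forward(self, x):                              # x: (batch, C, H, W)
        b, c, h, w = x.shape
        flat = x.reshape(b, c, h * w)                  # (B, C, H*W)
        out = torch.einsum('bcs,cst->bct', flat, self.weight) + self.bias
        return out.reshape(b, c, h, w)
```

Compared with a full fully-connected layer over the whole 6 × 6 × 256 volume, this needs only a (6·6) × (6·6) weight matrix per channel while still letting each output unit see the entire spatial extent of its feature map.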
Encoder Our encoder is derived from the AlexNet archi-
tecture [26]. Given an input image of size 227×227, we use
the first five convolutional layers and the following pooling
layer (called pool5) to compute an abstract 6 × 6 × 256
dimensional feature representation. In contrast to AlexNet,
[Figure 2 schematic: images 1, 2, ..., i, ..., n-1, n are encoded by a CNN backbone into 2048-D features, projected to a low-dimensional 128-D embedding, L2-normalized onto the 128-D unit sphere, and scored with a non-parametric softmax against a memory bank.]
Figure 2: The pipeline of our unsupervised feature learning approach. We use a backbone CNN to encode each image as a feature
vector, which is projected to a 128-dimensional space and L2 normalized. The optimal feature embedding is learned via instance-level
discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.
3. Approach
Our goal is to learn an embedding function v = f_θ(x)
without supervision. f_θ is a deep neural network with
parameters θ, mapping image x to feature v. This em-
bedding induces a metric over the image space,
d_θ(x, y) = ‖f_θ(x) - f_θ(y)‖ for instances x and y. A
good embedding should map visually similar images closer
to each other.
Our novel unsupervised feature learning approach is
instance-level discrimination. We treat each image instance
as a distinct class of its own and train a classifier to distin-
guish between individual instance classes (Fig.2).
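A small sketch of the embedding pipeline in Figure 2 and of the induced metric d_θ: a backbone feature is projected to 128-D and L2-normalized onto the unit sphere. The module and function names are illustrative, and `backbone` is assumed to be any CNN returning a flat feature vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceEmbedding(nn.Module):
    """Sketch of v = f_theta(x): a backbone produces a 2048-D vector that is
    projected to 128-D and L2-normalized onto the unit sphere."""
    def __init__(self, backbone, backbone_dim=2048, embed_dim=128):
        super().__init__()
        self.backbone = backbone                  # any CNN returning (B, backbone_dim)
        self.project = nn.Linear(backbone_dim, embed_dim)

    def forward(self, x):
        v = self.project(self.backbone(x))
        return F.normalize(v, dim=1)              # ||v|| = 1, i.e. points on the unit sphere

def induced_distance(f, x, y):
    """d_theta(x, y) = ||f_theta(x) - f_theta(y)||, the metric the embedding induces."""
    return torch.norm(f(x) - f(y), dim=1)
```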
3.1. Non-Parametric Softmax Classifier
Parametric Classifier. We formulate the instance-level
classification objective using the softmax criterion.
where τ is a temperature parameter that controls the con-
centration level of the distribution [11]. τ is important for
supervised feature learning [43], and also necessary for tun-
ing the concentration of v on our unit sphere.
The learning objective is then to maximize the joint prob-
ability ∏_{i=1}^{n} P_θ(i | f_θ(x_i)), or equivalently to minimize the
negative log-likelihood over the training set, as

J(θ) = - ∑_{i=1}^{n} log P(i | f_θ(x_i)).    (3)
Learning with A Memory Bank. To compute the proba-
bility P(i|v) in Eq. (2), {v_j} for all the images are needed.
Instead of exhaustively computing these representations ev-
ery time, we maintain a feature memory bank V for stor-
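A hedged sketch of the non-parametric softmax with a memory bank: P(i | v) is a temperature-scaled softmax over similarities to all stored features, and the loss is the negative log-likelihood of Eq. (3). The function name, the τ value, and the momentum-update comment are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def instance_nll(v, index, memory_bank, tau=0.07):
    """Non-parametric softmax sketch: with v the L2-normalized feature of image
    `index` and `memory_bank` the (n x 128) matrix of stored features,
    P(i | v) = exp(v_i^T v / tau) / sum_j exp(v_j^T v / tau), and the loss is
    -log P(index | v).  tau is the temperature; 0.07 is a typical choice."""
    logits = memory_bank @ v / tau                         # similarity to every stored feature
    target = torch.as_tensor([index], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)

# After each step the stored entry is typically refreshed from the new feature,
# e.g. with a momentum update: V[i] = normalize(m * V[i] + (1 - m) * v).
```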
Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated
conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers.
All changes in resolution are achieved through spatial downsampling or upsampling
between conv blocks.
[29]. In Section 3.1, we provide quantitative comparisons to Larsson et al., and
encourage interested readers to investigate both concurrent papers.
2 Approach
We train a CNN to map from a grayscale input to a distribution over quantized
color value outputs using the architecture shown in Figure 2. Architectural de-
tails are described in the supplementary materials on our project webpage, and
the model is publicly available. In the following, we focus on the design of the
objective function, and our technique for inferring point estimates of color from
the predicted color distribution.
2.1 Objective Function
Given an input lightness channel X ∈ R^{H×W×1}, our objective is to learn a
mapping Ŷ = F(X) to the two associated color channels Y ∈ R^{H×W×2}, where
H, W are image dimensions. (We denote predictions with a hat (ˆ) symbol and
ground truth without.) We perform this task in CIE Lab color space. Because
distances in this space model perceptual distance, a natural objective function,
as used in [1, 2], is the Euclidean loss L2(·, ·) between predicted and ground
truth colors:

L2(Ŷ, Y) = (1/2) ∑_{h,w} ‖Y_{h,w} - Ŷ_{h,w}‖²    (1)
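A small sketch of preparing a training pair in CIE Lab space and of the Euclidean baseline in Eq. (1); it assumes scikit-image for the color conversion, and the function names are illustrative. Note this only mirrors the L2 baseline the excerpt describes, not the quantized-classification objective the paper ultimately advocates.

```python
import torch
from skimage import color   # assumed dependency for RGB -> CIE Lab conversion

def lab_training_pair(rgb):                       # rgb: numpy array (H, W, 3) in [0, 1]
    """Split an RGB image into the network input (lightness L) and the
    regression target (ab chrominance) in CIE Lab space."""
    lab = color.rgb2lab(rgb)
    L = torch.from_numpy(lab[..., :1]).float()    # X in R^{H x W x 1}
    ab = torch.from_numpy(lab[..., 1:]).float()   # Y in R^{H x W x 2}
    return L, ab

def euclidean_loss(pred_ab, true_ab):
    """Eq. (1): half the sum over pixels of the squared L2 distance in ab space."""
    return 0.5 * ((true_ab - pred_ab) ** 2).sum(dim=-1).sum()
```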
[Figure (CPC v2 pipeline): a Patched ResNet-161 feature extractor f_θ maps an image x (e.g. [256, 256, 3]) to a grid of features z, and a Masked ConvNet context network g_φ produces context c. Panels: self-supervised pre-training (100% of images, 0% of labels); linear classification (100% of images and labels); efficient classification (1% to 100% of images and labels); transfer learning (100% of images and labels); supervised training (1% to 100% of images and labels), with the pre-trained extractor fixed or fine-tuned.]
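A rough sketch of the prediction step named in the bullet list above (predict the feature of the patch k positions ahead from the current context), scored with an InfoNCE-style contrastive loss. The shapes, the `predictor_k` module, and the use of in-batch negatives are assumptions, not the CPC v2 implementation.

```python
import torch
import torch.nn.functional as F

def cpc_prediction_loss(context, targets, predictor_k):
    """CPC-style sketch: `context` summarizes the patches seen so far at
    position t, `predictor_k` maps it to a guess for the patch feature k
    steps ahead, and the guess is scored against all candidate patch
    features in the batch with a contrastive (InfoNCE) loss.
      context: (B, D) aggregated feature at position t
      targets: (B, D) true patch features at position t + k
    """
    pred = predictor_k(context)                    # predicted feature k steps ahead
    logits = pred @ targets.t()                    # similarity to every candidate in the batch
    labels = torch.arange(targets.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # the matching patch is the positive
```

Here `predictor_k` could be as simple as a linear map (e.g. nn.Linear(D, D)), one per offset k.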
occur in a specific spatial configuration (if there is no spe-
cific configuration of the parts, then it is “stuff” [1]). We
present a ConvNet-based approach to learn a visual repre-
sentation from this task. We demonstrate that the resulting
visual representation is good for both object detection, pro-
viding a significant boost on PASCAL VOC 2007 compared
to learning from scratch, as well as for unsupervised object
discovery / visual data mining. This means, surprisingly,
that our representation generalizes across images, despite
being trained using an objective function that operates on a
single image at a time. That is, instance-level supervision
appears to improve performance on category-level tasks.
2. Related Work
One way to think of a good image representation is as
the latent variables of an appropriate generative model. An
ideal generative model of natural images would both gener-
ate images according to their natural distribution, and be
concise in the sense that it would seek common causes
for different images and share information between them.
However, inferring the latent structure given an image is in-
tractable for even relatively simple models. To deal with
these computational issues, a number of works, such as
[Figure: a center patch surrounded by its eight neighboring patch positions, numbered 1-8; the input X is the pair of patches and the label Y is the sampled position (here Y = 3).]
Figure 2. The algorithm receives two patches in one of these eight
possible spatial arrangements, without any context, and must then
classify which configuration was sampled.
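A minimal sketch of sampling one training example for this relative-position task: a center patch, one of its eight neighbors, and the neighbor index as the label. The patch size, the gap, and the omission of the paper's jitter and color tricks are illustrative assumptions.

```python
import numpy as np

# offsets (row, col) of the eight neighbours around the centre patch, indexed 0..7;
# the classifier must recover this index from the two patches alone
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     ( 0, -1),          ( 0, 1),
                     ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_relative_position_pair(image, patch=96, gap=48, rng=np.random):
    """Crop a centre patch and one of its 8 neighbours (separated by a gap so
    that trivial low-level cues are weaker) and return (patch_a, patch_b, label).
    Assumes the image is large enough for a full 3x3 neighbourhood."""
    step = patch + gap
    H, W, _ = image.shape
    cy = rng.randint(step, H - step - patch + 1)
    cx = rng.randint(step, W - step - patch + 1)
    label = rng.randint(8)                          # which of the 8 arrangements was sampled
    dy, dx = NEIGHBOUR_OFFSETS[label]
    a = image[cy:cy + patch, cx:cx + patch]
    b = image[cy + dy * step:cy + dy * step + patch,
              cx + dx * step:cx + dx * step + patch]
    return a, b, label
```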
model (e.g. a deep network) to predict, from a single word,
the n preceding and n succeeding words. In principle, sim-
ilar reasoning could be applied in the image domain, a kind
of visual “fill in the blank” task, but, again, one runs into the
Cited from [Noroozi+, ECCV16]
Cited from [Pathak+, CVPR16]
Cited from [Wu+, CVPR18]
Cited from [Zhang+, ECCV16]
Cited from [Hénaff+, ICML20]
Cited from [Doersch+, ICCV15]