Various pretext tasks

Predict the masked region with an Encoder-Decoder model
• Colorful Image Colorization [Zhang+, ECCV16]
  Predict the ab channels of the Lab color space from a grayscale image converted from a color image
• Non-Parametric Instance Discrimination [Wu+, CVPR18]
  Learn a feature for each image that is distinct from all others (every sample is treated as its own class)
• Context Prediction [Doersch+, ICCV15]
  Predict the relative position between patches cropped as a grid of tiles
• Contrastive Predictive Coding (CPC v2) [Hénaff+, ICML20]
  From a patch's feature, predict the feature of the patch located k positions ahead

[Figure (Noroozi and Favaro, ECCV16, Fig. 3): Context Free Network (CFN). A 225×225 window is randomly cropped, divided into a 3×3 grid, and a 64×64 tile is sampled from each 75×75 cell; the 9 tiles are reordered by a permutation drawn from a predefined permutation set and fed to the siamese-ennead CFN (AlexNet-style shared layers), which predicts the index of the chosen permutation.]
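As an illustration of the jigsaw pretext task in the figure above, here is a minimal sketch of how the shuffled tiles and the permutation-index target could be generated. The tile and cell sizes follow the figure; the random permutation set and the function names are assumptions for illustration only, whereas the paper selects a predefined, maximally diverse permutation set.

```python
import torch

# Hypothetical permutation set: the paper hand-picks a diverse predefined set;
# a random one is used here purely for illustration.
num_perms = 100
perm_set = torch.stack([torch.randperm(9) for _ in range(num_perms)])   # [100, 9]

def make_puzzle(image):
    """image: [3, 225, 225] crop -> (shuffled tiles [9, 3, 64, 64], permutation index)."""
    tiles = []
    for r in range(3):
        for c in range(3):
            cell = image[:, r * 75:(r + 1) * 75, c * 75:(c + 1) * 75]   # 75x75 grid cell
            y, x = torch.randint(0, 75 - 64 + 1, (2,)).tolist()         # random 64x64 tile inside the cell
            tiles.append(cell[:, y:y + 64, x:x + 64])
    tiles = torch.stack(tiles)                                          # [9, 3, 64, 64]
    label = torch.randint(0, num_perms, (1,)).item()                    # index the network must predict
    return tiles[perm_set[label]], label

tiles, label = make_puzzle(torch.rand(3, 225, 225))
# A nine-branch (siamese-ennead) CNN would embed each tile, concatenate the
# embeddings, and classify `label` with a standard cross-entropy loss.
```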
[Figure (Pathak+, CVPR16, Fig. 2): Context Encoder. The context image is passed through an encoder (derived from AlexNet) whose features are connected to the decoder by a channel-wise fully-connected layer; the decoder then produces the missing region of the image. The model is trained with a reconstruction loss jointly with an adversarial loss.]

[Figure (Wu+, CVPR18, Fig. 2): Pipeline of unsupervised feature learning by instance-level discrimination. A backbone CNN encodes each image as a feature vector, projected to a 128-D space and L2-normalized; the embedding is learned by maximally scattering the features of the training samples over the 128-D unit sphere, with a memory bank storing one feature per instance.]
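A minimal sketch of the non-parametric softmax over instances with a memory bank, the core of the instance discrimination pipeline above. The 128-D feature size follows the figure; the dataset size n, batch size, temperature, momentum, and the use of the exact full softmax (the paper approximates it with NCE when n is large) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; n would be the number of training images.
n, dim, tau, m = 10_000, 128, 0.07, 0.5
memory_bank = F.normalize(torch.randn(n, dim), dim=1)      # one 128-D slot per image

# Stand-in for backbone features of a batch and the ids of those images.
feats = torch.randn(32, dim, requires_grad=True)
idx = torch.randint(0, n, (32,))

v = F.normalize(feats, dim=1)                              # project onto the unit sphere
logits = v @ memory_bank.t() / tau                         # similarity to every stored instance
loss = F.cross_entropy(logits, idx)                        # each image is its own class
loss.backward()

# After the optimizer step, refresh the memory slots of the batch with momentum.
with torch.no_grad():
    memory_bank[idx] = F.normalize(m * memory_bank[idx] + (1 - m) * v.detach(), dim=1)
```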
[Figure (Zhang+, ECCV16, Fig. 2): Colorization network architecture. Each conv block is 2 or 3 repeated conv/ReLU layers followed by BatchNorm; there are no pooling layers, and all changes in resolution are done by spatial down- or upsampling between blocks. The CNN maps a grayscale input to a distribution over quantized ab color values in CIE Lab space.]

[Figure (Hénaff+, ICML20): Evaluation pipeline for CPC v2. A patched ResNet-161 feature extractor f_theta is pre-trained self-supervised (100% of images, 0% of labels) together with a masked-ConvNet context network g_phi, then evaluated by linear classification, efficient classification with 1% to 100% of images and labels, and transfer learning.]
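Below is a minimal sketch of the k-steps-ahead patch prediction behind CPC: patch features are summarized into a context, a linear head predicts the feature of the patch k grid rows ahead, and a contrastive (InfoNCE-style) loss scores the prediction against all other patches. The grid sizes, the mean-pooled context (standing in for the masked-ConvNet context network), and the single prediction head are simplifying assumptions, not the CPC v2 configuration.

```python
import torch
import torch.nn.functional as F

# z[b, r, c, :] is the (stand-in) encoded feature of the patch at grid cell (r, c).
B, R, C, D, k = 8, 7, 7, 64, 2
z = F.normalize(torch.randn(B, R, C, D), dim=-1)

# Context for row r: mean of all rows up to r (a stand-in for the masked
# ConvNet context network g_phi, which only sees patches above the target).
ctx = torch.cumsum(z, dim=1) / torch.arange(1, R + 1).view(1, R, 1, 1)

W_k = torch.nn.Linear(D, D, bias=False)    # one prediction head per offset k

pred = W_k(ctx[:, : R - k])                # predicted features k rows below, [B, R-k, C, D]
target = z[:, k:]                          # actual features k rows below

# InfoNCE-style objective: each prediction must identify its own target patch
# among all candidate patches in the batch.
pred, target = pred.reshape(-1, D), target.reshape(-1, D)
logits = pred @ target.t()
loss = F.cross_entropy(logits, torch.arange(logits.size(0)))
loss.backward()                            # in practice this also trains the patch encoder
```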
[Figure (Doersch+, ICCV15, Fig. 2): The algorithm receives two patches in one of eight possible spatial arrangements, without any further context, and must classify which configuration was sampled.]

Figures reproduced from [Noroozi and Favaro, ECCV16], [Pathak+, CVPR16], [Wu+, CVPR18], [Zhang+, ECCV16], [Hénaff+, ICML20], and [Doersch+, ICCV15].
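Finally, a minimal sketch of how training pairs for the eight-way relative-position task of Context Prediction could be sampled. The patch size, gap, and image size are illustrative assumptions; the paper also applies jitter and countermeasures against low-level shortcuts such as chromatic aberration, which are omitted here.

```python
import torch

patch, gap = 96, 48                        # assumed sizes, not the paper's exact values
step = patch + gap
# The 8 neighbour positions around the centre patch, indexed 0..7 (the label).
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(image):
    """image: [3, H, W] -> (centre patch, neighbour patch, label in 0..7)."""
    _, H, W = image.shape
    cy = torch.randint(step, H - step - patch, (1,)).item()
    cx = torch.randint(step, W - step - patch, (1,)).item()
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = offsets[label]
    ny, nx = cy + dy * step, cx + dx * step
    centre = image[:, cy:cy + patch, cx:cx + patch]
    neighbour = image[:, ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label

c, nb, y = sample_pair(torch.rand(3, 512, 512))
# A two-branch (siamese) CNN embeds both patches; the concatenated embeddings
# feed an 8-way classifier trained with cross-entropy on `y`.
```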