Slide 7
Various pretext tasks

• Solving Jigsaw Puzzles
  Shuffle nine tiled patches and predict the index of the shuffled ordering (permutation)
• Context Encoders [Pathak+, CVPR16]
  Predict the masked region with an encoder-decoder model
• Colorful Image Colorization [Zhang+, ECCV16]
  Predict the ab values in Lab color space from a grayscale image converted from a color image
• Non-Parametric Instance Discrimination [Wu+, CVPR18]
  Learn features so that each image's feature is independent of the others (each data sample is treated as its own class)
• Context Prediction [Doersch+, ICCV15]
  Predict the relative position between patches cropped in a tile layout
• Contrastive Predictive Coding (CPCv2) [Hénaff+, ICML20]
  From a patch's features, predict the features of the patch k positions ahead
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
Fig. 3: Context Free Network. The figure illustrates how a puzzle is generated
and solved. We randomly crop a 225 × 225 pixel window from an image (red
dashed box), divide it into a 3 × 3 grid, and randomly pick a 64 × 64 pixel tile
from each 75 × 75 pixel cell. These 9 tiles are reordered via a randomly chosen
permutation from a predefined permutation set and are then fed to the CFN. The
task is to predict the index of the chosen permutation (technically, we define as
output a probability vector with 1 at the 64-th location and 0 elsewhere). The
CFN is a siamese-ennead CNN. For simplicity, we do not indicate the max-
pooling and ReLU layers. These shared layers are implemented exactly as in
AlexNet [25]. In the transfer learning experiments we show results with
the trained weights transferred on AlexNet (precisely, stride 4 on the
first layer). The training in the transfer learning experiment is the
same as in the other competing methods. Notice instead, that during
the training on the puzzle task, we set the stride of the first layer of
the CFN to 2 instead of 4.
that permutation, and ask the CFN to return a vector with the probability value
for each index. Given 9 tiles, there are 9! = 362,880 possible permutations. From
our experimental validation, we found that the permutation set is an important
factor in the performance of the representation that the network learns. We
perform an ablation study on the impact of the permutation set in subsection 4.2.
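To make the recipe above concrete, here is a minimal sketch (not the authors' code) of generating one jigsaw training example with NumPy. The function name make_jigsaw_sample is illustrative, and `permutations` stands in for the paper's predefined permutation set of 9-element orderings.

```python
import numpy as np

def make_jigsaw_sample(image, permutations, rng=np.random):
    """Build one jigsaw training example following the recipe in Fig. 3:
    crop a 225x225 window, split it into a 3x3 grid of 75x75 cells,
    take a random 64x64 tile from each cell, shuffle the tiles with a
    permutation drawn from a predefined set, and return (tiles, index)."""
    H, W, _ = image.shape
    top = rng.randint(0, H - 225 + 1)
    left = rng.randint(0, W - 225 + 1)
    window = image[top:top + 225, left:left + 225]

    tiles = []
    for row in range(3):
        for col in range(3):
            cell = window[row * 75:(row + 1) * 75, col * 75:(col + 1) * 75]
            dy, dx = rng.randint(0, 75 - 64 + 1, size=2)
            tiles.append(cell[dy:dy + 64, dx:dx + 64])

    perm_index = rng.randint(len(permutations))       # the label the CFN must predict
    shuffled = [tiles[i] for i in permutations[perm_index]]
    return np.stack(shuffled), perm_index
```

How large the permutation set is and how it is chosen are exactly the ablation mentioned above; this sketch treats the set as given.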
3.3 Training the CFN
The output of the CFN can be seen as the conditional probability density func-
tion (pdf) of the spatial arrangement of object parts (or scene parts) in a part-
based model, i.e.,
p(S | A_1, A_2, ..., A_9) = p(S | F_1, F_2, ..., F_9) ∏_{i=1}^{9} p(F_i | A_i)    (1)
where S is the configuration of the tiles, A_i is the i-th part appearance of the
object, and {F_i}_{i=1,...,9} form the intermediate feature representation. Our ob-
jective is to train the CFN so that the features F_i have semantic attributes that
can identify the relative position between parts.
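A minimal PyTorch sketch of the siamese-ennead structure implied by Eq. (1): a shared encoder computes each F_i from its tile A_i independently, and a classifier over the concatenated features models p(S | F_1, ..., F_9). The layer sizes and the small stand-in encoder are assumptions, not the paper's AlexNet-based CFN.

```python
import torch
import torch.nn as nn

class ContextFreeNetwork(nn.Module):
    """Sketch of a siamese-ennead CFN: one shared encoder maps each of the
    9 tiles to a feature F_i; the concatenated features are classified into
    one of the predefined permutation indices (the configuration S)."""
    def __init__(self, num_permutations, feat_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(        # stand-in for the shared AlexNet-style tower
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(9 * feat_dim, num_permutations)

    def forward(self, tiles):                # tiles: (batch, 9, 3, 64, 64)
        feats = [self.encoder(tiles[:, i]) for i in range(9)]   # each F_i computed from A_i alone
        return self.classifier(torch.cat(feats, dim=1))          # logits over permutation indices

# training reduces to cross-entropy against the permutation index:
# loss = nn.functional.cross_entropy(model(tiles), perm_index)
```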
differ in the approach: whereas [7] are solving a discrimina-
tive task (is patch A above patch B or below?), our context
encoder solves a pure prediction problem (what pixel inten-
sities should go in the hole?). Interestingly, a similar distinction
exists in using language context to learn word embeddings:
Collobert and Weston [5] advocate a discriminative
approach, whereas word2vec [30] formulates it as word pre-
diction. One important benefit of our approach is that our
supervisory signal is much richer: a context encoder needs
to predict roughly 15,000 real values per training example,
compared to just 1 option among 8 choices in [7]. Likely
due in part to this difference, our context encoders take far
less time to train than [7]. Moreover, context based predic-
tion is also harder to “cheat” since low-level image features,
such as chromatic aberration, do not provide any meaning-
ful information, in contrast to [7] where chromatic aberra-
tion partially solves the task. On the other hand, it is not yet
clear if requiring faithful pixel generation is necessary for
learning good visual features.
Image generation Generative models of natural images
have enjoyed significant research interest [16, 24, 35]. Re-
cently, Radford et al. [33] proposed new convolutional ar-
chitectures and optimization hyperparameters for Genera-
tive Adversarial Networks (GAN) [16] producing encour-
aging results. We train our context encoders using an ad-
versary jointly with reconstruction loss for generating in-
painting results. We discuss this in detail in Section 3.2.
Dosovitskiy et al. [10] and Rifai et al. [36] demonstrate
that CNNs can learn to generate novel images of particular
object categories (chairs and faces, respectively), but rely on
large labeled datasets with examples of these categories. In
contrast, context encoders can be applied to any unlabeled
image database and learn to generate images based on the
surrounding context.
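As a rough sketch of the joint training mentioned above (a reconstruction loss combined with an adversary; the paper's Section 3.2 gives the real formulation), the combined objective might look like the following. The loss weights, the discriminator interface, and the masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def context_encoder_loss(pred, target, mask, discriminator,
                         lambda_rec=0.999, lambda_adv=0.001):
    """Joint objective sketch: L2 reconstruction on the missing region plus an
    adversarial term that pushes the discriminator to rate the prediction as
    real.  `mask` is 1 inside the hole and 0 elsewhere; the weights are
    illustrative, not necessarily the paper's values."""
    rec = F.mse_loss(pred * mask, target * mask)       # penalize only the filled-in region
    d_out = discriminator(pred)                        # discriminator is assumed to return logits
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return lambda_rec * rec + lambda_adv * adv
```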
Inpainting and hole-filling It is important to point out
that our hole-filling task cannot be handled by classical in-
painting [4, 32] or texture synthesis [2, 11] approaches,
Figure 2: Context Encoder. The context image is passed
through the encoder to obtain features which are connected
to the decoder using a channel-wise fully-connected layer as
described in Section 3.1. The decoder then produces the
missing regions in the image.
3. Context encoders for image generation
We now introduce context encoders: CNNs that predict
missing parts of a scene from their surroundings. We first
give an overview of the general architecture, then provide
details on the learning procedure and finally present various
strategies for image region removal.
3.1. Encoder-decoder pipeline
The overall architecture is a simple encoder-decoder
pipeline. The encoder takes an input image with missing
regions and produces a latent feature representation of that
image. The decoder takes this feature representation and
produces the missing image content. We found it important
to connect the encoder and the decoder through a channel-
wise fully-connected layer, which allows each unit in the
decoder to reason about the entire image content. Figure 2
shows an overview of our architecture.
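A minimal sketch of what a channel-wise fully-connected layer can look like: each channel's spatial map is fully connected to that same channel's output map, with no cross-channel weights. The class name and initialization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Channel-wise fully-connected layer sketch: every channel's spatial map
    (e.g. the 6x6 grid from pool5) is fully connected to the same channel's
    output map; parameters are not shared across channels and there are no
    cross-channel connections."""
    def __init__(self, channels, spatial):             # e.g. channels=256, spatial=6*6
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, spatial, spatial) * 0.01)
        self.bias = nn.Parameter(torch.zeros(channels, spatial))

    def forward(self, x):                              # x: (batch, C, H, W)
        b, c, h, w = x.shape
        flat = x.reshape(b, c, h * w)                  # (B, C, H*W)
        out = torch.einsum('bcs,cst->bct', flat, self.weight) + self.bias
        return out.reshape(b, c, h, w)
```

Compared with a full fully-connected layer over the whole 6 × 6 × 256 volume, this needs only a (6·6) × (6·6) weight matrix per channel while still letting each output unit see the entire spatial extent of its feature map.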
Encoder Our encoder is derived from the AlexNet archi-
tecture [26]. Given an input image of size 227×227, we use
the first five convolutional layers and the following pooling
layer (called pool5) to compute an abstract 6 × 6 × 256
dimensional feature representation. In contrast to AlexNet,
[Figure 2 schematic: images 1, 2, ..., i, ..., n-1, n are encoded by a CNN backbone into 2048-D features, projected to a low-dimensional 128-D embedding, L2-normalized onto the 128-D unit sphere, and scored with a non-parametric softmax against a memory bank.]
Figure 2: The pipeline of our unsupervised feature learning approach. We use a backbone CNN to encode each image as a feature
vector, which is projected to a 128-dimensional space and L2 normalized. The optimal feature embedding is learned via instance-level
discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.
3. Approach
Our goal is to learn an embedding function v = f_θ(x)
without supervision. f_θ is a deep neural network with
parameters θ, mapping image x to feature v. This em-
bedding induces a metric over the image space,
d_θ(x, y) = ‖f_θ(x) - f_θ(y)‖ for instances x and y. A
good embedding should map visually similar images closer
to each other.
Our novel unsupervised feature learning approach is
instance-level discrimination. We treat each image instance
as a distinct class of its own and train a classifier to distin-
guish between individual instance classes (Fig.2).
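A small sketch of the embedding pipeline in Figure 2 and of the induced metric d_θ: a backbone feature is projected to 128-D and L2-normalized onto the unit sphere. The module and function names are illustrative, and `backbone` is assumed to be any CNN returning a flat feature vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceEmbedding(nn.Module):
    """Sketch of v = f_theta(x): a backbone produces a 2048-D vector that is
    projected to 128-D and L2-normalized onto the unit sphere."""
    def __init__(self, backbone, backbone_dim=2048, embed_dim=128):
        super().__init__()
        self.backbone = backbone                  # any CNN returning (B, backbone_dim)
        self.project = nn.Linear(backbone_dim, embed_dim)

    def forward(self, x):
        v = self.project(self.backbone(x))
        return F.normalize(v, dim=1)              # ||v|| = 1, i.e. points on the unit sphere

def induced_distance(f, x, y):
    """d_theta(x, y) = ||f_theta(x) - f_theta(y)||, the metric the embedding induces."""
    return torch.norm(f(x) - f(y), dim=1)
```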
3.1. Non-Parametric Softmax Classifier
Parametric Classifier. We formulate the instance-level
classification objective using the softmax criterion.
where τ is a temperature parameter that controls the con-
centration level of the distribution [11]. τ is important for
supervised feature learning [43], and also necessary for tun-
ing the concentration of v on our unit sphere.
The learning objective is then to maximize the joint prob-
ability ∏_{i=1}^{n} P_θ(i | f_θ(x_i)), or equivalently to minimize the
negative log-likelihood over the training set, as

J(θ) = - ∑_{i=1}^{n} log P(i | f_θ(x_i)).    (3)
Learning with A Memory Bank. To compute the proba-
bility P(i|v) in Eq. (2), {v_j} for all the images are needed.
Instead of exhaustively computing these representations ev-
ery time, we maintain a feature memory bank V for stor-
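A hedged sketch of the non-parametric softmax with a memory bank: P(i | v) is a temperature-scaled softmax over similarities to all stored features, and the loss is the negative log-likelihood of Eq. (3). The function name, the τ value, and the momentum-update comment are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def instance_nll(v, index, memory_bank, tau=0.07):
    """Non-parametric softmax sketch: with v the L2-normalized feature of image
    `index` and `memory_bank` the (n x 128) matrix of stored features,
    P(i | v) = exp(v_i^T v / tau) / sum_j exp(v_j^T v / tau), and the loss is
    -log P(index | v).  tau is the temperature; 0.07 is a typical choice."""
    logits = memory_bank @ v / tau                         # similarity to every stored feature
    target = torch.as_tensor([index], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)

# After each step the stored entry is typically refreshed from the new feature,
# e.g. with a momentum update: V[i] = normalize(m * V[i] + (1 - m) * v).
```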
Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated
conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers.
All changes in resolution are achieved through spatial downsampling or upsampling
between conv blocks.
[29]. In Section 3.1, we provide quantitative comparisons to Larsson et al., and
encourage interested readers to investigate both concurrent papers.
2 Approach
We train a CNN to map from a grayscale input to a distribution over quantized
color value outputs using the architecture shown in Figure 2. Architectural de-
tails are described in the supplementary materials on our project webpage, and
the model is publicly available. In the following, we focus on the design of the
objective function, and our technique for inferring point estimates of color from
the predicted color distribution.
2.1 Objective Function
Given an input lightness channel X ∈ R^{H×W×1}, our objective is to learn a
mapping Ŷ = F(X) to the two associated color channels Y ∈ R^{H×W×2}, where
H, W are image dimensions. (We denote predictions with a hat (ˆ) symbol and
ground truth without.) We perform this task in CIE Lab color space. Because
distances in this space model perceptual distance, a natural objective function,
as used in [1, 2], is the Euclidean loss L2(·, ·) between predicted and ground
truth colors:

L2(Ŷ, Y) = (1/2) ∑_{h,w} ‖Y_{h,w} - Ŷ_{h,w}‖²    (1)
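A small sketch of preparing a training pair in CIE Lab space and of the Euclidean baseline in Eq. (1); it assumes scikit-image for the color conversion, and the function names are illustrative. Note this only mirrors the L2 baseline the excerpt describes, not the quantized-classification objective the paper ultimately advocates.

```python
import torch
from skimage import color   # assumed dependency for RGB -> CIE Lab conversion

def lab_training_pair(rgb):                       # rgb: numpy array (H, W, 3) in [0, 1]
    """Split an RGB image into the network input (lightness L) and the
    regression target (ab chrominance) in CIE Lab space."""
    lab = color.rgb2lab(rgb)
    L = torch.from_numpy(lab[..., :1]).float()    # X in R^{H x W x 1}
    ab = torch.from_numpy(lab[..., 1:]).float()   # Y in R^{H x W x 2}
    return L, ab

def euclidean_loss(pred_ab, true_ab):
    """Eq. (1): half the sum over pixels of the squared L2 distance in ab space."""
    return 0.5 * ((true_ab - pred_ab) ** 2).sum(dim=-1).sum()
```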
[Figure (CPC v2 pipeline): a Patched ResNet-161 feature extractor f_θ maps an image x (e.g. [256, 256, 3]) to a grid of features z, and a Masked ConvNet context network g_φ produces context c. Panels: self-supervised pre-training (100% of images, 0% of labels); linear classification (100% of images and labels); efficient classification (1% to 100% of images and labels); transfer learning (100% of images and labels); supervised training (1% to 100% of images and labels), with the pre-trained extractor fixed or fine-tuned.]
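A rough sketch of the prediction step named in the bullet list above (predict the feature of the patch k positions ahead from the current context), scored with an InfoNCE-style contrastive loss. The shapes, the `predictor_k` module, and the use of in-batch negatives are assumptions, not the CPC v2 implementation.

```python
import torch
import torch.nn.functional as F

def cpc_prediction_loss(context, targets, predictor_k):
    """CPC-style sketch: `context` summarizes the patches seen so far at
    position t, `predictor_k` maps it to a guess for the patch feature k
    steps ahead, and the guess is scored against all candidate patch
    features in the batch with a contrastive (InfoNCE) loss.
      context: (B, D) aggregated feature at position t
      targets: (B, D) true patch features at position t + k
    """
    pred = predictor_k(context)                    # predicted feature k steps ahead
    logits = pred @ targets.t()                    # similarity to every candidate in the batch
    labels = torch.arange(targets.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # the matching patch is the positive
```

Here `predictor_k` could be as simple as a linear map (e.g. nn.Linear(D, D)), one per offset k.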
occur in a specific spatial configuration (if there is no spe-
cific configuration of the parts, then it is “stuff” [1]). We
present a ConvNet-based approach to learn a visual repre-
sentation from this task. We demonstrate that the resulting
visual representation is good for both object detection, pro-
viding a significant boost on PASCAL VOC 2007 compared
to learning from scratch, as well as for unsupervised object
discovery / visual data mining. This means, surprisingly,
that our representation generalizes across images, despite
being trained using an objective function that operates on a
single image at a time. That is, instance-level supervision
appears to improve performance on category-level tasks.
2. Related Work
One way to think of a good image representation is as
the latent variables of an appropriate generative model. An
ideal generative model of natural images would both gener-
ate images according to their natural distribution, and be
concise in the sense that it would seek common causes
for different images and share information between them.
However, inferring the latent structure given an image is in-
tractable for even relatively simple models. To deal with
these computational issues, a number of works, such as
[Figure: a center patch surrounded by its eight neighboring patch positions, numbered 1-8; the input X is the pair of patches and the label Y is the sampled position (here Y = 3).]
Figure 2. The algorithm receives two patches in one of these eight
possible spatial arrangements, without any context, and must then
classify which configuration was sampled.
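A minimal sketch of sampling one training example for this relative-position task: a center patch, one of its eight neighbors, and the neighbor index as the label. The patch size, the gap, and the omission of the paper's jitter and color tricks are illustrative assumptions.

```python
import numpy as np

# offsets (row, col) of the eight neighbours around the centre patch, indexed 0..7;
# the classifier must recover this index from the two patches alone
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     ( 0, -1),          ( 0, 1),
                     ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_relative_position_pair(image, patch=96, gap=48, rng=np.random):
    """Crop a centre patch and one of its 8 neighbours (separated by a gap so
    that trivial low-level cues are weaker) and return (patch_a, patch_b, label).
    Assumes the image is large enough for a full 3x3 neighbourhood."""
    step = patch + gap
    H, W, _ = image.shape
    cy = rng.randint(step, H - step - patch + 1)
    cx = rng.randint(step, W - step - patch + 1)
    label = rng.randint(8)                          # which of the 8 arrangements was sampled
    dy, dx = NEIGHBOUR_OFFSETS[label]
    a = image[cy:cy + patch, cx:cx + patch]
    b = image[cy + dy * step:cy + dy * step + patch,
              cx + dx * step:cx + dx * step + patch]
    return a, b, label
```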
model (e.g. a deep network) to predict, from a single word,
the n preceding and n succeeding words. In principle, sim-
ilar reasoning could be applied in the image domain, a kind
of visual “fill in the blank” task, but, again, one runs into the
Cited from [Noroozi+, ECCV16]
Cited from [Pathak+, CVPR16]
Cited from [Wu+, CVPR18]
Cited from [Zhang+, ECCV16]
Cited from [Hénaff+, ICML20]
Cited from [Doersch+, ICCV15]