a linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time). A schematic view of the resulting network is depicted in Figure 3.
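As a rough, hypothetical illustration of such an auxiliary head (following the average-pool / 1x1-convolution / fully-connected / softmax layout visible in Figure 3; the specific channel and unit counts below are assumptions, not values taken from this text), a PyTorch sketch might look like:

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Hypothetical auxiliary classifier head (cf. softmax0/softmax1 in Figure 3).
    Channel and unit counts are illustrative assumptions."""

    def __init__(self, in_channels: int, num_classes: int = 1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)        # shrink the feature map
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)   # 1x1 reduction
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)                  # assumes a 4x4 pooled map
        self.drop = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)                  # linear layer + softmax loss

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)   # logits; cross-entropy applies the softmax

# e.g. attached to a 14x14 intermediate feature map with 512 channels:
logits = AuxHead(512)(torch.randn(2, 512, 14, 14))   # -> (2, 1000)
```

During training the losses of such heads would be added, with a discount weight, to the main classifier's loss; at inference time they are simply discarded, as stated above.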
Training Methodology
Our networks were trained using the DistBelief distributed machine learning system using a modest amount of model- and data-parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
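The original runs used asynchronous updates on DistBelief, which is not reproduced here; as a minimal single-process sketch of the same recipe in PyTorch (momentum 0.9, learning rate multiplied by 0.96 every 8 epochs, and a Polyak-style running average of the weights kept for inference), with a stand-in model, an arbitrary base learning rate, and an assumed averaging decay:

```python
import torch
import torch.nn as nn

# Stand-in model and loss; the paper trains GoogLeNet itself.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # base lr is an assumption
# Fixed schedule: multiply the learning rate by 0.96 (i.e. decrease it by 4%) every 8 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)
# Polyak-style weight averaging for the model used at inference time
# (an exponential moving average here; the decay factor is an assumption).
averaged = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, cur, n: 0.999 * avg + 0.001 * cur)

def train_one_epoch(loader):
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
        averaged.update_parameters(model)
    scheduler.step()
```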
Our image sampling methods have changed substantially over the months leading to the competition, and models that had already converged were trained further with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give definitive guidance on the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling patches of the image of various sizes, distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.
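With torchvision, the crop-sampling prescription above corresponds directly to RandomResizedCrop; in the sketch below, the 224-pixel output size and the jitter strengths are assumptions, and ColorJitter stands in for the photometric distortions of [8]:

```python
from torchvision import transforms

# Patch area drawn uniformly between 8% and 100% of the image,
# aspect ratio constrained to [3/4, 4/3]; the 224-pixel target size
# and the jitter strengths are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    # Stand-in for the photometric distortions of [8]:
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```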
ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 for testing. Each image is associated with one ground-truth category, and performance is measured based on the highest-scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5 predictions, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
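As a small illustration of the two reported numbers (not the official evaluation code), top-1 and top-5 error can be computed from a matrix of class scores as follows:

```python
import torch

def topk_error(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of images whose ground truth is NOT among the k highest-scoring classes."""
    topk = scores.topk(k, dim=1).indices                  # (N, k) predicted class ids
    hit = (topk == targets.unsqueeze(1)).any(dim=1)       # ground truth anywhere in the top k
    return 1.0 - hit.float().mean().item()

# Example with random scores over 1000 classes:
scores = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
top1_error = topk_error(scores, targets, k=1)   # 1 - top-1 accuracy rate
top5_error = topk_error(scores, targets, k=5)   # the number used for ranking
```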
Figure 3: GoogLeNet network with all the bells and whistles. (Schematic not reproduced here: a convolution/max-pooling stem followed by stacked Inception modules joined by DepthConcat layers, two auxiliary classifiers, softmax0 and softmax1, branching off via 5x5 average pooling, and the main softmax2 classifier after a 7x7 average pool and a fully connected layer.)
Figure 2: Inception module. (a) Naïve version: 1x1, 3x3 and 5x5 convolutions and 3x3 max pooling applied to the previous layer, with the outputs joined by filter concatenation. (b) Inception module with dimensionality reduction: 1x1 convolutions precede the 3x3 and 5x5 convolutions and follow the 3x3 max pooling branch.
For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
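To make the preceding point concrete, a minimal PyTorch sketch of the dimension-reduced module of Figure 2(b) is given below; the channel counts are placeholders rather than values from the architecture description:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Sketch of the dimension-reduced Inception module (Figure 2b).
    The channel counts passed in are placeholders, not the paper's values."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(                       # 1x1 convolution branch
            nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(                       # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(                       # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(                       # 3x3 max pool, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Filter concatenation: all branches keep the spatial size,
        # so their outputs are stacked along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Example usage with illustrative channel counts:
m = Inception(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
out = m(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)
```

The 1x1 reductions keep the expensive 3x3 and 5x5 convolutions cheap, while the concatenation lets the next stage see all scales at once.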
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can also utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.