A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

A schematic view of the resulting network is depicted in Figure 3.

Figure 3: GoogLeNet network with all the bells and whistles.

6. Training Methodology

Our networks were trained using the DistBelief distributed machine learning system, using a modest amount of model- and data-parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

Image sampling methods have changed substantially over the months leading up to the competition, and already converged models were trained with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give definitive guidance on the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling patches of various sizes, whose area is distributed evenly between 8% and 100% of the image area and whose aspect ratio is constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to the imaging conditions of the training data.

7. ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
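The optimization recipe from the Training Methodology section above maps onto standard components of current frameworks. The following is a minimal single-machine sketch in PyTorch, assuming generic model and train_loader objects; the DistBelief-style asynchrony and model/data parallelism are not reproduced, the base learning rate is an illustrative value rather than one reported here, and the running parameter average stands in for Polyak averaging.

import torch

# Sketch of the recipe described above: SGD with 0.9 momentum, a fixed schedule
# that lowers the learning rate by 4% every 8 epochs, and a running (Polyak-style)
# average of the weights used as the inference-time model.
# model, train_loader and base_lr are placeholders / assumptions of this sketch.
def train(model, train_loader, num_epochs=100, base_lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    # One scheduler step per epoch; StepLR applies the 4% decrease every 8 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)
    # Equal-weight running average of the parameters seen so far.
    averaged = torch.optim.swa_utils.AveragedModel(model)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            averaged.update_parameters(model)
        scheduler.step()
    return averaged   # the averaged weights are what would be used at inference time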
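The patch-sampling prescription above (area distributed evenly between 8% and 100% of the image area, aspect ratio in [3/4, 4/3]) is simple to state in code. In the sketch below, the 10-attempt rejection loop and the whole-image fall-back are implementation details assumed here, not part of the text.

import math
import random

# Sample a crop box whose area is uniform in [8%, 100%] of the image area and
# whose aspect ratio is uniform in [3/4, 4/3]; the crop would then be resized
# to the network input size.
def sample_patch(img_w, img_h, attempts=10):
    area = img_w * img_h
    for _ in range(attempts):
        target_area = random.uniform(0.08, 1.0) * area
        aspect = random.uniform(3.0 / 4.0, 4.0 / 3.0)
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= img_w and 0 < h <= img_h:
            x = random.randint(0, img_w - w)
            y = random.randint(0, img_h - h)
            return x, y, w, h    # crop box in pixel coordinates
    return 0, 0, img_w, img_h    # fall back to the whole image

For reference, torchvision.transforms.RandomResizedCrop uses the same defaults (scale=(0.08, 1.0), ratio=(3/4, 4/3)) for this style of sampling.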
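The two numbers described in the challenge setup above, top-1 accuracy and top-5 error, reduce to a small amount of tensor bookkeeping. A sketch in PyTorch, assuming logits of shape (N, 1000) and integer ground-truth labels:

import torch

# An image counts as correct at rank k if the ground truth appears among the
# k highest-scoring classes, regardless of its position among them.
def top_k_correct(logits, targets, k):
    topk = logits.topk(k, dim=1).indices             # (N, k) best class indices
    return (topk == targets.unsqueeze(1)).any(dim=1)

def classification_metrics(logits, targets):
    top1_accuracy = top_k_correct(logits, targets, 1).float().mean().item()
    top5_error = 1.0 - top_k_correct(logits, targets, 5).float().mean().item()
    return top1_accuracy, top5_error                 # the challenge ranks by top-5 error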
Figure 2: Inception module. (a) Inception module, naïve version; (b) Inception module with dimensionality reduction (1x1 convolutions before the 3x3 and 5x5 convolutions and after the 3x3 max pooling, followed by filter concatenation).

For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.

A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated, so that the next stage can abstract features from the different scales simultaneously.

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can also utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3-10x faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.
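The dimensionality-reduction idea behind Figure 2(b) is concrete enough to sketch: 1x1 convolutions shrink the channel count before the expensive 3x3 and 5x5 convolutions, a 1x1 projection follows the 3x3 max pooling, and the branch outputs are concatenated along the channel axis. The PyTorch module below is an illustration under those assumptions, not the reference implementation; the channel counts in the usage comment are merely example values.

import torch
import torch.nn as nn

# Four parallel branches, all preserving spatial size so their outputs can be
# concatenated along the channel dimension ("filter concatenation").
class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(                      # plain 1x1 convolution
            nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                      # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(                      # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(                  # 3x3 max pool, then 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example with illustrative channel counts:
# module = InceptionModule(in_ch=192, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
# y = module(torch.randn(1, 192, 28, 28))   # -> shape (1, 64+128+32+32, 28, 28)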