a linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time). A schematic view of the resulting network is depicted in Figure 3.
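As a rough, hypothetical illustration of such an auxiliary head (following the average-pool / 1x1-convolution / fully-connected / softmax layout visible in Figure 3; the specific channel and unit counts below are assumptions, not values taken from this text), a PyTorch sketch might look like:

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Hypothetical auxiliary classifier head (cf. softmax0/softmax1 in Figure 3).
    Channel and unit counts are illustrative assumptions."""

    def __init__(self, in_channels: int, num_classes: int = 1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)        # shrink the feature map
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)   # 1x1 reduction
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)                  # assumes a 4x4 pooled map
        self.drop = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)                  # linear layer + softmax loss

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)   # logits; cross-entropy applies the softmax

# e.g. attached to a 14x14 intermediate feature map with 512 channels:
logits = AuxHead(512)(torch.randn(2, 512, 14, 14))   # -> (2, 1000)
```

During training the losses of such heads would be added, with a discount weight, to the main classifier's loss; at inference time they are simply discarded, as stated above.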
Training Methodology
Our networks were trained using the DistBelief distributed machine learning system using a modest amount of model- and data-parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
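The original runs used asynchronous updates on DistBelief, which is not reproduced here; as a minimal single-process sketch of the same recipe in PyTorch (momentum 0.9, learning rate multiplied by 0.96 every 8 epochs, and a Polyak-style running average of the weights kept for inference), with a stand-in model, an arbitrary base learning rate, and an assumed averaging decay:

```python
import torch
import torch.nn as nn

# Stand-in model and loss; the paper trains GoogLeNet itself.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # base lr is an assumption
# Fixed schedule: multiply the learning rate by 0.96 (i.e. decrease it by 4%) every 8 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)
# Polyak-style weight averaging for the model used at inference time
# (an exponential moving average here; the decay factor is an assumption).
averaged = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, cur, n: 0.999 * avg + 0.001 * cur)

def train_one_epoch(loader):
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
        averaged.update_parameters(model)
    scheduler.step()
```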
Our image sampling methods have changed substantially over the months leading to the competition, and models that had already converged were trained further with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give definitive guidance on the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling patches of the image of various sizes, distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.
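With torchvision, the crop-sampling prescription above corresponds directly to RandomResizedCrop; in the sketch below, the 224-pixel output size and the jitter strengths are assumptions, and ColorJitter stands in for the photometric distortions of [8]:

```python
from torchvision import transforms

# Patch area drawn uniformly between 8% and 100% of the image,
# aspect ratio constrained to [3/4, 4/3]; the 224-pixel target size
# and the jitter strengths are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    # Stand-in for the photometric distortions of [8]:
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```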
ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 for testing. Each image is associated with one ground-truth category, and performance is measured based on the highest-scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5 predictions, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
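As a small illustration of the two reported numbers (not the official evaluation code), top-1 and top-5 error can be computed from a matrix of class scores as follows:

```python
import torch

def topk_error(scores: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of images whose ground truth is NOT among the k highest-scoring classes."""
    topk = scores.topk(k, dim=1).indices                  # (N, k) predicted class ids
    hit = (topk == targets.unsqueeze(1)).any(dim=1)       # ground truth anywhere in the top k
    return 1.0 - hit.float().mean().item()

# Example with random scores over 1000 classes:
scores = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
top1_error = topk_error(scores, targets, k=1)   # 1 - top-1 accuracy rate
top5_error = topk_error(scores, targets, k=5)   # the number used for ranking
```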
Figure 3: GoogLeNet network with all the bells and whistles. (Schematic not reproduced here: a convolution/max-pooling stem followed by stacked Inception modules joined by DepthConcat layers, two auxiliary classifiers, softmax0 and softmax1, branching off via 5x5 average pooling, and the main softmax2 classifier after a 7x7 average pool and a fully connected layer.)
Figure 2: Inception module. (a) Naïve version: 1x1, 3x3 and 5x5 convolutions and 3x3 max pooling applied to the previous layer, with the outputs joined by filter concatenation. (b) Inception module with dimensionality reduction: 1x1 convolutions precede the 3x3 and 5x5 convolutions and follow the 3x3 max pooling branch.
For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
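To make the preceding point concrete, a minimal PyTorch sketch of the dimension-reduced module of Figure 2(b) is given below; the channel counts are placeholders rather than values from the architecture description:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Sketch of the dimension-reduced Inception module (Figure 2b).
    The channel counts passed in are placeholders, not the paper's values."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(                       # 1x1 convolution branch
            nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(                       # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(                       # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(                       # 3x3 max pool, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Filter concatenation: all branches keep the spatial size,
        # so their outputs are stacked along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Example usage with illustrative channel counts:
m = Inception(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
out = m(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)
```

The 1x1 reductions keep the expensive 3x3 and 5x5 convolutions cheap, while the concatenation lets the next stage see all scales at once.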
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can also utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.