Slide 24
Slide 24 text
Properties of deep learning
• A huge number of parameters is learned from a huge amount of data
There are many parameters to optimize (compared with other machine learning methods)
　→ often more than one million parameters… (see the parameter-count sketch below)
A large amount of image data is used to train them
　→ 14 million images are publicly available via ImageNet (http://www.image-net.org/)
　→ GPUs are essential for training on this much data
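To make the "more than one million parameters" point concrete, here is a minimal Python sketch that counts the parameters of a small fully connected network; the layer sizes are hypothetical and chosen only for illustration:

    # Parameter count of a small fully connected network
    # (hypothetical layer sizes, for illustration only).
    layer_sizes = [28 * 28, 1000, 1000, 10]  # input, two hidden layers, output

    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector

    print(total)  # 1,796,010 parameters -- already past one million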
Source: http://static.googleusercontent.com/media/research.google.com/ja//archive/unsupervised_icml2012.pdf
Building high-level features using large-scale unsupervised learning (Le et al., ICML 2012)
… the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.
Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.
… biological and computational models (Jarrett et al., 2009; Lyu & Simoncelli, 2008; Pinto et al., 2008).

As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x18 pixels and the second sublayer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps) whereas the neurons in the second sublayer connect to pixels of only one channel (or map). While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore, it is known as L2 pooling.

Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012).

Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 2009; Lee et al., 2009).
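The L2 pooling just described is simple to write down. The following is a minimal Python sketch, using an arbitrary window size and non-overlapping 1D windows for brevity (the paper pools over overlapping 2D neighborhoods):

    import numpy as np

    def l2_pool(features, size=5):
        """Square root of the sum of squares over each pooling window."""
        n = len(features) - len(features) % size   # trim to a multiple of size
        windows = features[:n].reshape(-1, size)   # one row per pooling window
        return np.sqrt((windows ** 2).sum(axis=1))

    print(l2_pool(np.array([3.0, 4.0, 0.0, 0.0, 0.0])))  # -> [5.]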
… and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among 20 thresholds.
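The threshold sweep described above is easy to reproduce. Here is a minimal NumPy sketch; the activation and label arrays are hypothetical placeholders, not the paper's data:

    import numpy as np

    # activations: neuron outputs on the test set; labels: 1 = face, 0 = not.
    activations = np.random.randn(37_000)
    labels = np.random.randint(0, 2, size=37_000)

    # 20 equally spaced thresholds between the min and max activation value.
    thresholds = np.linspace(activations.min(), activations.max(), 20)

    # Report the best classification accuracy among the 20 thresholds.
    accuracies = [np.mean((activations > t) == labels) for t in thresholds]
    print(max(accuracies))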
4.3. Recognition
Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy while the best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%.
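(A quick check of that baseline, ours rather than the paper's: if all-negative guessing scores 64.8%, non-faces make up 64.8% of the test set, so it contains about 13,026 / (1 − 0.648) ≈ 37,000 images in total.)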
To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of best neuron drops to 78.5%. This agrees with previous study showing the importance of local contrast normalization (Jarrett et al., 2009).
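To illustrate what a local contrast normalization sublayer computes, here is a minimal sketch: subtract the local mean (subtractive step), then divide by the local standard deviation (divisive step). The window size and epsilon are our assumptions, not the paper's exact parameters:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def local_contrast_normalize(image, size=9, eps=1e-4):
        local_mean = uniform_filter(image, size=size)
        centered = image - local_mean                    # subtractive step
        local_var = uniform_filter(centered ** 2, size=size)
        return centered / np.sqrt(local_var + eps)       # divisive step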
We visualize histograms of activation values for face images and random images in Figure 2. It can be seen that, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output a value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output a value less than 0.

Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.
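A figure of this kind can be recreated in a few lines of matplotlib; the activation arrays below are hypothetical stand-ins for the neuron's outputs:

    import numpy as np
    import matplotlib.pyplot as plt

    face_acts = np.random.normal(1.0, 0.8, 5000)     # face images (stand-in)
    random_acts = np.random.normal(-1.0, 0.8, 5000)  # random distractors (stand-in)

    plt.hist(face_acts, bins=50, color="red", alpha=0.5, label="faces")
    plt.hist(random_acts, bins=50, color="blue", alpha=0.5, label="no faces")
    plt.axvline(0.0, linestyle="--")                 # the threshold at 0
    plt.xlabel("activation value")
    plt.legend()
    plt.show()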
… tested neuron, by solving:

x* = arg min_x f(x; W, H), subject to ||x||_2 = 1.

Here, f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x. In our experiments, this constraint optimization problem is solved by projected gradient descent with line search.
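A minimal sketch of that procedure follows; the backtracking line search and the toy quadratic standing in for the neuron output f(x; W, H) are our assumptions, not the authors' exact setup:

    import numpy as np

    def project(x):
        # Project onto the constraint set ||x||_2 = 1.
        return x / np.linalg.norm(x)

    def minimize_on_sphere(f, grad, x0, steps=200, t0=1.0, beta=0.5):
        # Projected gradient descent with a backtracking line search.
        x = project(x0)
        for _ in range(steps):
            g = grad(x)
            t = t0
            while f(project(x - t * g)) > f(x) and t > 1e-10:
                t *= beta  # shrink the step until f decreases
            x = project(x - t * g)
        return x

    # Toy stand-in for the neuron output: the quadratic form x'Ax.
    A = np.diag([3.0, 1.0, 0.5])
    x_star = minimize_on_sphere(lambda x: x @ A @ x, lambda x: 2 * A @ x,
                                np.random.randn(3))
    print(x_star)  # approaches the eigenvector of the smallest eigenvalue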
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.

Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization.
This is Google's "cat" paper, in which 10 million YouTube images were used for training.