Applied Information Processing B, Lecture 10 materials /advancedB10

Slide 1

Slide 1 text

Applied Information Processing B, Lecture 10. Kazuhisa Fujita

Slide 12

Slide 12 text

Examples of deep neural networks
• LeNet-5 (1998). [Figure: INPUT 32x32, C1: feature maps 6@28x28, S2: feature maps 6@14x14, C3: feature maps 16@10x10, S4: feature maps 16@5x5, C5: layer 120, F6: layer 84, OUTPUT 10. The stages are convolutions, subsampling, full connections, and Gaussian connections.]
• AlexNet (2012). [Figure 2 of the AlexNet paper: the CNN is split across two GPUs that communicate only at certain layers. The input is 150,528-dimensional, and the remaining layers have 253,440, 186,624, 64,896, 64,896, 43,264, 4096, 4096, and 1000 neurons. The convolutional layers use 256 kernels of size 5x5x48, 384 kernels of size 3x3x256, 384 kernels of size 3x3x192, and 256 kernels of size 3x3x192; the fully connected layers have 4096 neurons each.]
• GoogLeNet (2014). [Figure 3 of the GoogLeNet paper: "GoogLeNet network with all the bells and whistles", a stack of Conv, MaxPool, LocalRespNorm, DepthConcat, and AveragePool blocks ending in three softmax outputs (softmax0, softmax1, softmax2).]
• ResNet34 (2015). The 152-layer version won the image recognition contest. [Figure: 34-layer residual network for ImageNet, consisting of a 7x7 conv (64, /2), a pooling layer, stacks of 3x3 conv layers with 64, 128, 256, and 512 channels, average pooling, and a 1000-way fully connected layer. The accompanying page excerpt from the ResNet paper describes the shortcut connections and the training setup.]
Label on the slide: an artificial neural network that has existed for a long time.
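The LeNet-5 feature-map sizes in the figure (32, 28, 14, 10, 5) follow from the usual output-size arithmetic for convolution and pooling. The following is a minimal sketch, assuming 5x5 convolutions without padding and 2x2 subsampling with stride 2. These kernel sizes are not stated on the slide; they are the values that reproduce its numbers. The last line checks the AlexNet input dimension under the assumption of a 224x224 RGB input.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1


def lenet5_sizes(n=32):
    """Feature-map side lengths for INPUT, C1, S2, C3, S4.

    Assumed layer settings (kernel, stride): 5x5 conv, 2x2 subsampling,
    5x5 conv, 2x2 subsampling. These are assumptions, not slide content.
    """
    sizes = [n]
    for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
        n = out_size(n, kernel, stride)
        sizes.append(n)
    return sizes


print(lenet5_sizes())   # [32, 28, 14, 10, 5]
print(224 * 224 * 3)    # 150528, assuming a 224x224 image with 3 color channels
```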

Slide 15

Slide 15 text

Convolutional neural networks
In layers close to the input, the network extracts features of the line segments that make up the image. Moving toward the higher layers, the information becomes increasingly abstract, combining the information extracted by the lower layers.
[Figure, with "Input" and "Output" labels: Zeiler and Fergus 2013, "Visualizing and Understanding Convolutional Networks".
Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
Figure 4 (panels: Layer 1, Layer 2, Layer 3, Layer 4, Layer 5). Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.]
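To make the idea of "line-segment features in the layers near the input" concrete, here is a minimal sketch of a single convolution filter responding to a vertical edge. The 3x3 filter is hand-made for illustration (in a real network the filter values are learned), the toy image is invented, and the function name `conv2d_valid` is hypothetical. As in most deep learning frameworks, the operation is implemented as cross-correlation (the kernel is not flipped).

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-made vertical-edge filter; an assumption for illustration, not a learned filter.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

# Convolution followed by a rectified linear function, as in the figure caption.
response = np.maximum(conv2d_valid(image, vertical_edge), 0.0)
print(response)
```

The response is nonzero only in the columns whose window straddles the edge, which is the sense in which a low-level filter "detects" a line segment. Higher layers apply the same operation to these response maps rather than to raw pixels.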

Slide 16

Slide 16 text

Convolutional neural networks
In layers close to the input, the network extracts features of the line segments that make up the image. Moving toward the higher layers, the information becomes increasingly abstract, combining the information extracted by the lower layers.
Note: the images are from a convolutional deep belief network (Lee et al. 2009).
[Figure, with "Input" and "Output" labels: Lee et al. 2009, "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations".
Figure 2. The first layer bases (top) and the second layer bases (bottom) learned from natural images. Each second layer basis (filter) was visualized as a weighted linear combination of the first layer bases.
Figure 3. The second layer bases (top) and the third layer bases (bottom) learned from object categories (faces, cars, airplanes, motorbikes).
The page excerpt also includes Table 2, test error for the MNIST dataset:
Labeled training samples          1,000        2,000        3,000        5,000
CDBN                              2.62±0.12%   2.13±0.10%   1.91±0.09%   1.59±0.11%
Ranzato et al. (2007)             3.21%        2.53%        -            1.52%
Hinton and Salakhutdinov (2006)   -            -            -            -
Weston et al. (2008)              2.73%        -            1.83%        -
]

Slide 24

Slide 24 text

Applications to medical imaging
• Skin cancer classification
• Deep-learning image classification of skin cancer produced results that largely agreed with the judgments of dermatologists.
• Pneumonia detection
• Deep-learning detection of pneumonia from X-ray images has reached human-level performance.

(Esteva et al. 2017)
[Figure 1, Deep CNN layout. A skin lesion image (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using the Google Inception v3 CNN architecture, pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on a dataset of 129,450 skin lesions comprising 2,032 different diseases. The 757 training classes are defined using a novel taxonomy of skin disease (for example, acrolentiginous melanoma, amelanotic melanoma, lentigo melanoma). Inference classes are more general and are composed of one or more training classes (for example, malignant melanocytic lesions, the class of melanomas). The probability of an inference class is calculated by summing the probabilities of the training classes according to the taxonomy structure. Example output in the figure: 92% malignant melanocytic lesion, 8% benign melanocytic lesion. Inception v3 CNN architecture reprinted from https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html]

Skin Cancer Classification, validation set:
Classifier        Three-way accuracy   Nine-way accuracy
Dermatologist 1   65.6%                53.3%
Dermatologist 2   66.0%                55.0%
CNN               69.5%                48.9%
CNN - PA          72.0%                55.3%

Disease classes, three-way classification: 0. Benign single lesions; 1. Malignant single lesions; 2. Non-neoplastic lesions.
Disease classes, nine-way classification: 0. Cutaneous lymphoma and lymphoid infiltrates; 1. Benign dermal tumors, cysts, sinuses; 2. Malignant dermal tumor; 3. Benign epidermal tumors, hamartomas, milia, and growths; 4. Malignant and premalignant epidermal tumors; 5. Genodermatoses and supernumerary growths; 6. Inflammatory conditions; 7. Benign melanocytic lesions; 8. Malignant Melanoma.

(Rajpurkar et al. 2017)
[Page excerpt: "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning", arXiv:1711.05225v3 [cs.CV], 25 Dec 2017. CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, which contains over 100,000 frontal-view X-ray images with 14 diseases. It takes a chest X-ray image as input and outputs the probability of a pathology (in the example of Figure 1: pneumonia positive, 85%). Four practicing academic radiologists annotated a test set, on which CheXNet is compared with the radiologists.]

Table 1. Radiologists and CheXNet compared on the F1 metric, the harmonic average of precision and recall:
                   F1 Score (95% CI)
Radiologist 1      0.383 (0.309, 0.453)
Radiologist 2      0.356 (0.282, 0.428)
Radiologist 3      0.365 (0.291, 0.435)
Radiologist 4      0.442 (0.390, 0.492)
Radiologist Avg.   0.387 (0.330, 0.442)
CheXNet            0.435 (0.387, 0.481)
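Two calculations mentioned on this slide can be sketched in a few lines: the F1 metric (the harmonic average of precision and recall) used to compare CheXNet with radiologists, and the rule that an inference-class probability is the sum of its training-class probabilities. This is an illustrative sketch only. The counts and probabilities below are invented, and the helper names `precision_recall`, `f1_score`, and `inference_probability` are hypothetical, not taken from either paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic average of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def inference_probability(training_probs, members):
    """Sum the probabilities of the training classes that make up one inference class."""
    return sum(training_probs[name] for name in members)


# Invented counts for illustration: 40 hits, 60 false alarms, 40 misses.
p, r = precision_recall(tp=40, fp=60, fn=40)
print(p, r, f1_score(p, r))

# Invented training-class probabilities; the grouping follows the class names in the figure.
training_probs = {
    "acral-lentiginous melanoma": 0.50,
    "amelanotic melanoma": 0.30,
    "lentigo melanoma": 0.12,
    "blue nevus": 0.05,
    "halo nevus": 0.03,
}
malignant = ["acral-lentiginous melanoma", "amelanotic melanoma", "lentigo melanoma"]
benign = ["blue nevus", "halo nevus"]
print(inference_probability(training_probs, malignant))
print(inference_probability(training_probs, benign))
```

Because F1 is a harmonic average, it stays low when either precision or recall is low, which is why it is a stricter summary than plain accuracy for a rare condition.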

Slide 34

Slide 34 text

Semantic Segmentation
• Segmentation (region partitioning) using deep learning
• Estimates which object each pixel belongs to

(Shelhamer et al. 2016)
[Fig. 6. Fully convolutional networks improve performance on PASCAL. Columns: FCN-8s, SDS [14], Ground Truth, Image. The left column shows the output of the most accurate net, FCN-8s.]
Table 8. The role of foreground, background, and shape cues. All scores are the mean intersection over union (IU) metric excluding background. The architecture and optimization are fixed to those of FCN-32s (Reference) and only input masking differs.
               train FG   train BG   test FG   test BG   mean IU
Reference      keep       keep       keep      keep      84.8
Reference-FG   keep       keep       keep      mask      81.0
Reference-BG   keep       keep       mask      keep      19.8
FG-only        keep       mask       keep      mask      76.1
BG-only        mask       keep       mask      keep      37.8
Shape          mask       mask       mask      mask      29.1

(Ronneberger et al. 2015)
[Fig. 4. Result on the ISBI cell tracking challenge. (a) Part of an input image of the "PhC-U373" data set. (b) Segmentation result (cyan mask) with manual ground truth (yellow border). (c) Input image of the "DIC-HeLa" data set. (d) Segmentation result (random colored masks) with manual ground truth (yellow border).]
Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.
Name             PhC-U373   DIC-HeLa
IMCB-SG (2014)   0.2669     0.2935
KTH-SE (2014)    0.7953     0.4607
HOUS-US (2014)   0.5323     -

(Novikov et al. 2018)
Table VI. Our best architecture compared with state-of-the-art methods. (*) Single-class algorithms trained and evaluated for different organs separately. "-" means the score was not reported.
Body part               Lungs           Clavicles       Heart
Evaluation metric       D       J       D       J       D       J
Human Observer [5]      -       0.946   -       0.896   -       0.878
ASM Tuned [5] (*)       -       0.927   -       0.734   -       0.814
Hybrid Voting [5] (*)   -       0.949   -       0.736   -       0.860
Ibragimov et al. [9]    -       0.953   -       -       -       -
Seghers et al. [11]     -       0.951   -       -       -       -
InvertedNet with ELU    0.974   0.950   0.929   0.868   0.937   0.882
[Fig. 7. Segmentation results and corresponding Jaccard scores on some images for U-Net (top row) and the proposed InvertedNet with ELUs (bottom row). The contour of the ground truth is shown in green, the segmentation result of the algorithm in red, and the overlap of the two contours in yellow.]
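The tables above report intersection over union (IU or IOU, also called the Jaccard score) and, presumably, the Dice coefficient in the "D" columns. The following minimal sketch shows how both are computed for a pair of binary masks. The two 4x4 masks are invented for illustration, and treating "D" as Dice is an assumption based on the common convention rather than something stated on the slide.

```python
import numpy as np


def jaccard(pred, truth):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, truth).sum() / union


def dice(pred, truth):
    """Dice coefficient: twice the overlap divided by the total mask area."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0
    return 2 * np.logical_and(pred, truth).sum() / total


# Invented example: the ground truth is a 2x2 square, the prediction is one column wider.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

print(jaccard(pred, truth))  # overlap 4 pixels, union 6 pixels
print(dice(pred, truth))     # 2 * 4 / (6 + 4)
```

A mean IU score such as the one in Table 8 would be obtained by computing the per-class value in this way and averaging over classes.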
