Applied Information Processing B, Lecture 10 Materials /advancedB10

Applied Information Processing B, Lecture 10. Kazuhisa Fujita

Deep learning: deep neural networks

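What deep learning can do
• Image classification
  • Overwhelmed other methods in international image classification competitions
  • Said to outperform humans
• Image generation
  • Automatically generates photos, illustrations, paintings, and more
• Generating text from images
• Image segmentation
  • Detecting parking spaces from images
• Robot control
• Game AI
  • The Go AI AlphaGo defeated Ke Jie
• Machine translation
  • Google Translate, DeepL
• Text generation, code generation
  • GPT-3, tabnine
[Figures: (Jin et al. 2017); translation (DeepL)]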

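What is deep learning?
• One of the artificial neural network techniques.
• A method that uses deep neural networks.
• It can learn end to end, from the information representation through to the task.
  • Books often put it this way, but reality is not that simple.
• Comparison with earlier methods
  • Before deep learning became popular, features (the information representation) were usually extracted by hand or by dedicated algorithms, and then learned with machine learning or similar methods.
  • With deep learning, simply providing the input lets the network acquire the information representation and learn the task at the same time.
[Figure: the letter A] An A is a combination of a pointed part and T-junctions. Older methods first applied image-processing techniques to extract features such as points and T-junctions; a deep neural network acquires this feature extraction through learning as well.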

What is a neural network?

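Golgi staining (the black reaction), 1873
• A staining method for nerve cells invented by Golgi.
• This staining method deepened our understanding of the internal structure of the brain.
• Waldeyer-Hartz named the independent nerve unit the neuron (1891); Sherrington named the junction between neurons the synapse (1897).
• Cajal proposed the neuron doctrine, which holds that the brain consists of independent neurons (1880s).
• It began to become clear that the brain is a network of neurons (a neural network).
[Figure: sketch of the hippocampus (Golgi, 1886)]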

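All-or-none law (Adrian E. D., 1914)
• The basic rule of a nerve cell's response.
• When the membrane potential exceeds a threshold, the cell emits an action potential (it fires).
  • Put simply, the cell outputs 1 when its input is larger than a certain value (the threshold).
• The size of the action potential is constant regardless of the size of the input.
  • Put simply, the output of the cell is a constant 1 regardless of the size of the input.
[Figure: input and output over time. When the input exceeds the threshold at time t0, an action potential is emitted; the size of the action potential is constant.]
Real neurons are not that simple.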

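A simple neuron model
• Inputs reach the nerve cell through synapses.
• The strength w_i with which input x_i is transmitted to the cell (the connection strength) is called the connection weight (synaptic weight, weight).
• The cell outputs 1 when the sum of the inputs multiplied by their weights exceeds the threshold h.
• This models the all-or-none law.
• Changing the synaptic weights makes various computations possible.
• For us, or for an artificial intelligence, to learn something, the synapses must change.
[Figure: neuron model. Inputs x_1, x_2, x_3 arrive with connection weights w_1, w_2, w_3; the weighted sum Σ_i w_i x_i is passed through f(·), which gives the output y = 1 if the sum exceeds the threshold h and y = 0 otherwise.]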
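The neuron model on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the weights, and the thresholds below are hypothetical values chosen for the example, not values from the lecture. It shows that the same neuron computes different logical functions when only the weights and threshold change.

```python
def threshold_neuron(x, w, h):
    """Output 1 if the weighted sum of the inputs exceeds the threshold h, else 0."""
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total > h else 0


def truth_table(w, h):
    """Evaluate the neuron on all four binary input pairs."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return [threshold_neuron(x, w, h) for x in inputs]


# Hypothetical parameters chosen for illustration.
and_outputs = truth_table(w=(1.0, 1.0), h=1.5)  # fires only when both inputs are 1
or_outputs = truth_table(w=(1.0, 1.0), h=0.5)   # fires when at least one input is 1

print("AND:", and_outputs)
print("OR: ", or_outputs)
```

Here only the threshold differs between the two cases; learning in a real network adjusts the weights in the same spirit.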

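Artificial neural networks
• Artificial neural networks grew out of imitating the structure and function of the brain.
[Figure: neuroscience (nerve cell, brain) set beside artificial neural networks (neuron model as an equation, artificial neural network), linked by networking.]
A single neuron can do only so much, but building a network of neurons and training its synapses made it possible to realize a wide range of functions.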

Deep learning and deep neural networks

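Deep neural networks are deep
[Figure: a shallow network (networks until now) and a deep network (deep neural network), each running from input to output.]
A neural network is called deep when it consists of many layers. Making artificial neural networks deeper (larger) enabled more accurate classification, and making the network more complex allows various functions to be realized. However, simply making a neural network deeper does not by itself improve performance!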
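To make "deep means many layers" concrete, here is a minimal sketch in plain Python. The layer sizes, the tanh activation, and the random weights are all assumptions made for the example. The forward pass is the same loop for both networks; only the number of layers differs.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one fully connected layer as (weights, biases) with random weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(sizes, seed=0):
    """Build a list of layers from a list of layer sizes, e.g. [4, 8, 2]."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def forward(layers, x):
    """Propagate the input x through every layer in order."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


# Hypothetical layer sizes: same input and output size, different depth.
shallow = build_network([4, 8, 2])           # 2 weight layers
deep = build_network([4, 8, 8, 8, 8, 8, 2])  # 6 weight layers

x = [0.5, -0.2, 0.1, 0.9]
y_shallow = forward(shallow, x)
y_deep = forward(deep, x)
print(len(shallow), "layers ->", y_shallow)
print(len(deep), "layers ->", y_deep)
```

The weights here are untrained, so the outputs carry no meaning; the sketch only shows the structure.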

Examples of deep neural networks
[Figures: architecture diagrams, shown alongside an artificial neural network of the older kind]
• LeNet-5 (1998)
• AlexNet (2012)
• GoogLeNet (2014)
• ResNet34 (2015); the 152-layer version won the image recognition competition

Convolutional neural networks
• Originated from Fukushima's Neocognitron
  • A model of vision
  • Hubel and Wiesel's research on vision
  • Local receptive fields
• LeCun succeeded in applying them to handwriting recognition
• Winning the 2012 image classification competition set off the boom
[Figures: (Fukushima 1980); (LeCun 1989); LeNet-5 (LeCun 1998)]

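Convolutional neural networks
For each location in the image, there are neurons that capture some feature at that location. Among the neurons that extract points, the one responsible for the pointed part of the A responds.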
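The idea of one feature detector applied at every image location can be sketched as follows. The tiny image and the hand-written 3×3 "peak" kernel are hypothetical; in a real convolutional network the kernel values are learned. As is common in deep learning, the operation below is a sliding dot product with no kernel flip.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the responses."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


# A 5x5 binary image containing the pointed top of a letter like "A".
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]

# Hypothetical hand-written kernel that matches a small peak shape.
peak_kernel = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]

response = convolve2d(image, peak_kernel)

# The "neuron" with the largest response sits where the peak is.
positions = [(i, j) for i in range(len(response)) for j in range(len(response[0]))]
best = max(positions, key=lambda ij: response[ij[0]][ij[1]])
print(response)
print("strongest response at window position", best)
```

Each entry of `response` plays the role of one neuron looking at one location; only the neuron covering the tip of the shape responds strongly.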

Convolutional neural networks
Layers close to the input extract features of the line segments that make up the image; toward the upper layers, the information becomes more abstract, combining what the lower layers extracted.
[Figure: features visualized for Layer 1 through Layer 5, from input to output (Zeiler and Fergus 2013, "Visualizing and Understanding Convolutional Networks")]

Convolutional neural networks
Layers close to the input extract features of the line segments that make up the image; toward the upper layers, the information becomes more abstract, combining what the lower layers extracted.
[Figure: first-, second-, and third-layer bases learned from natural images and from object categories (faces, cars, airplanes, motorbikes), from input to output]
Note: the images are from a convolutional deep belief network (Lee et al. 2009).

Human visual processing
[Figure 25-12: possible functions mediated by the two pathways connecting visual processing centers in the cerebral cortex (Kandel; adapted from Van Essen and Gallant 1994)]
Ventral pathway: retina (sensor) → LGN → V1 (orientation) → V2 (combinations) → V4 (combinations) → IT (basic images)
[Figure: structure of V1 (Carlson)]
Are today's artificial neural networks quite far from the brain?

History of artificial neural networks
• 1943: McCulloch-Pitts model
• 1957: Perceptron (Rosenblatt)
• 1973: Self-organizing map (von der Malsburg)
• 1980: Neocognitron (Fukushima)
• 1982: Self-organizing map (Kohonen)
• 1986: Backpropagation (Rumelhart et al.)
• 1989: LeNet (LeCun)
• 2006: Deep belief network (Hinton)
• 2012: AlexNet (Krizhevsky)
The timeline also marks the first, second, and third neural network booms.
The history is longer than one might expect. Neocognitron and LeNet are deep neural networks.

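Why neural networks now?
• The models of Fukushima and LeCun, the prototypes of today's deep neural networks, had already been published by the 1980s. Why did deep learning suddenly develop and spread from around 2010?
• More data
  • Training a deep neural network requires a large amount of data.
  • The growth of the internet and big data supplied the training data.
• Faster computers
  • The computation needed to train huge neural networks on enormous data became feasible in practical time (thanks to advances in GPGPU).
• Easy to use
  • Development environments matured and development became easy.
• Improved methods
• A sense that other methods had reached a dead end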

What can deep learning do?

General object recognition
• Won the ImageNet Large Scale Visual Recognition Challenge 2012
• Trained on 1,000 categories × 1,000 images
• Used a convolutional neural network
[Figure 2: architecture of the CNN, with the work divided between two GPUs (Krizhevsky 2012)]

Image classification ability beyond humans?
[Chart: Classification Results (CLS). Classification error of the model that won the image classification competition each year. Labels: Deep learning, AlexNet, GoogLeNet, ResNet; annotations: 16.7% ↓, 23.3% ↓]

Year  Classification error
2010  0.28
2011  0.26
2012  0.16
2013  0.12
2014  0.07
2015  0.036
2016  0.03
2017  0.023

(http://image-net.org/challenges/talks_2017/ILSVRC2017_overview.pdf)
Human: 0.051 (Karpathy, http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)

Still evolving https://paperswithcode.com/sota/image-classification-on-imagenet?p=deepvit-towards-deeper-vision-transformer ViT

Applications to medical imaging
• Skin cancer diagnosis
  • Deep learning judgments on skin cancer images nearly matched those of dermatologists (Esteva et al. 2017).
• Pneumonia diagnosis
  • Deep learning detection of pneumonia from X-ray images reached human level (Rajpurkar et al. 2017).

[Figure: deep CNN layout for skin lesion classification. An Inception v3 network, pretrained on ImageNet and fine-tuned on 129,450 skin lesion images, maps a skin lesion image to probabilities over 757 training classes, which are summed into inference classes (for example, 92% malignant melanocytic lesion, 8% benign melanocytic lesion) (Esteva et al. 2017)]

Skin cancer classification, validation set (Esteva et al. 2017)
Classifier       Three-way accuracy  Nine-way accuracy
Dermatologist 1  65.6%               53.3%
Dermatologist 2  66.0%               55.0%
CNN              69.5%               48.9%
CNN - PA         72.0%               55.3%

[Figure: CheXNet, a 121-layer CNN, takes a chest X-ray image as input and outputs the probability of a pathology, for example "Pneumonia Positive (85%)" (Rajpurkar et al. 2017, "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning", arXiv:1711.05225v3)]

CheXNet versus radiologists on the F1 metric, the harmonic average of precision and recall (Rajpurkar et al. 2017)
                  F1 score (95% CI)
Radiologist 1     0.383 (0.309, 0.453)
Radiologist 2     0.356 (0.282, 0.428)
Radiologist 3     0.365 (0.291, 0.435)
Radiologist 4     0.442 (0.390, 0.492)
Radiologist avg.  0.387 (0.330, 0.442)
CheXNet           0.435 (0.387, 0.481)
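The F1 metric used in the CheXNet comparison above is the harmonic average of precision and recall. A minimal sketch of how it is computed from prediction counts follows; the counts in the example are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, computed from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 when F1 is undefined (no positives predicted or present).
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
example_f1 = f1_score(tp=40, fp=20, fn=60)
print(f"precision = {40 / 60:.3f}, recall = {40 / 100:.3f}, F1 = {example_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot score well on F1 by maximizing only precision or only recall.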

Generating text from images (Google; Vinyals et al. 2015)

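GAN (Generative Adversarial Network)
• A method used for data generation.
[Figure: the Generator, a neural network, generates images and learns to make them resemble real images; the Discriminator, also a neural network, judges whether an image is real or fake.]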
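The opposing goals of the two networks can be written as two loss functions. The slide does not state a particular loss, so the sketch below assumes one commonly used formulation (binary cross-entropy with the non-saturating generator loss); the function names and the example probabilities are hypothetical. `d_real` and `d_fake` stand for the discriminator's estimated probability that a real image and a generated image, respectively, are real.

```python
import math


def discriminator_loss(d_real, d_fake):
    """The discriminator wants d_real close to 1 and d_fake close to 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """The generator wants the discriminator to output 1 for its fake images."""
    return -math.log(d_fake)


# Hypothetical discriminator outputs for illustration.
# Case A: the discriminator easily spots the fake.
loss_d_a, loss_g_a = discriminator_loss(0.9, 0.1), generator_loss(0.1)
# Case B: the generator fools the discriminator.
loss_d_b, loss_g_b = discriminator_loss(0.9, 0.8), generator_loss(0.8)

print(f"fake detected : D loss {loss_d_a:.3f}, G loss {loss_g_a:.3f}")
print(f"D is fooled   : D loss {loss_d_b:.3f}, G loss {loss_g_b:.3f}")
```

When the generator improves, its loss falls and the discriminator's loss rises, which is the adversarial relationship the figure describes. In training, each network's weights are updated to reduce its own loss in turn.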

GAN
[Figure 1 (NVIDIA; Karras et al. 2018): training starts with both the generator (G) and the discriminator (D) at a low spatial resolution of 4×4 pixels. As training progresses, layers are added to G and D incrementally, up to 1024×1024.]

GAN https://www.youtube.com/watch?v=XOxxPcy5Gr4

GAN (Generative Adversarial Network) https://www.youtube.com/watch?v=9reHvktowLY&feature= Zhu et al. 2017

Deepfake https://www.youtube.com/watch?v=x2g48Q2I2ZQ

PaintsChainer https://www.youtube.com/watch?v=lCoZR5S1btY https://www.youtube.com/watch?v=wud8p9DQwco
• 2018: PaintsChainer won an Excellence Award in the Entertainment Division at the 21st Japan Media Arts Festival.
• 2019: pixiv and Preferred Networks formed a partnership.

Image generation https://www.youtube.com/watch?v=UDT_2lHv8o8
Developed by Yanghua Jin, age 23 (as of 2017)
[Figure 7: generated samples] https://arxiv.org/abs/1708.05509

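Stable Diffusion
• A deep neural network that generates images from text input.
• Composed of a variational autoencoder, a U-Net, and a text encoder.
[Figure: image generated by giving the AI my own name]
[Figure: image generated by giving the AI "Komatsu University"]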

Semantic segmentation
• Segmentation (region partitioning) with deep learning
• Estimates which object each pixel belongs to
[Figures: FCN-8s and SDS outputs compared with the ground truth and the input image on PASCAL, scored by mean intersection over union (Shelhamer et al. 2016); segmentation results (IOU) on the ISBI cell tracking challenge (Ronneberger et al. 2015); lung, clavicle, and heart segmentation with Jaccard scores (Novikov et al. 2018)]
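As a rough sketch of what "estimating which object each pixel belongs to" means in code: a segmentation network outputs one score map per class, the predicted label of a pixel is the class with the highest score, and the result is commonly scored with intersection over union (IoU, also called the Jaccard index). The score values below are made up for the example, and no network is involved.

```python
def predict_labels(scores):
    """scores[c][i][j] is the score of class c at pixel (i, j).

    Returns a label map holding the best-scoring class for each pixel.
    """
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(n_classes), key=lambda c: scores[c][i][j])
             for j in range(width)]
            for i in range(height)]


def iou(pred, truth, cls):
    """Intersection over union for one class between two label maps."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += int(p == cls and t == cls)
            union += int(p == cls or t == cls)
    return inter / union if union else float("nan")


# Hypothetical per-pixel scores for a 2x3 image: class 0 = background, class 1 = object.
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]],  # background scores
    [[0.1, 0.8, 0.9], [0.2, 0.3, 0.7]],  # object scores
]
truth = [[0, 1, 1],
         [0, 1, 1]]

pred = predict_labels(scores)
object_iou = iou(pred, truth, cls=1)
print(pred, object_iou)
```

In this example one object pixel is missed, so the object class has 3 pixels in the intersection and 4 in the union.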

Spatial recognition by segmentation https://www.youtube.com/watch?v=1HJSMR6LW2g

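Deep Q-Network (DQN)
• An algorithm that combines deep learning with reinforcement learning (Q-learning).
• Because it is based on reinforcement learning, it solves problems by trial and error.
• Used for games and robot control.
• Achieves better scores than humans in games.
  • Outperformed humans in 29 of 49 Atari games (Mnih et al. 2015).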
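The trial-and-error learning underlying DQN can be illustrated with plain tabular Q-learning on a toy problem. This is not DQN itself: DQN replaces the table below with a deep neural network. The corridor environment, the reward, and the hyperparameters are hypothetical choices made for the example.

```python
import random

N_STATES = 5       # corridor cells 0..4; reaching cell 4 ends the episode
LEFT, RIGHT = 0, 1


def step(state, action):
    """Move one cell; reward 1 only when the goal cell is reached."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q-values by acting randomly (trial and error) and updating the table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice([LEFT, RIGHT])  # pure exploration
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

Even though the agent only ever acts randomly during training, the learned values favor moving toward the goal from every cell.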

https://www.youtube.com/watch?v=TmPfTpjtdgg

https://www.youtube.com/watch?v=THhUXIhjkCM

2016 https://www.youtube.com/watch?v=L4KBBAwF_bE

2016 https://www.youtube.com/watch?v=iaF43Ze1oeI

https://www.youtube.com/watch?v=l8zKZLqkfII

2016 https://www.youtube.com/watch?v=Q9tDHuidzak

2016 https://www.youtube.com/watch?v=H4V6NZLNu-c

(Rahmatizadeh et al. 2016) https://www.youtube.com/watch?v=9vYlIG2ozaM

(Tai et al. 2017) https://www.youtube.com/watch?v=9AOIwBYIBbs

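AlphaGo, AlphaGo Zero, AlphaZero
• AlphaGo
  • Defeated Lee Sedol in 2016 and Ke Jie in 2017.
  • Go AIs were at amateur dan level until 2015; with the arrival of AlphaGo, Go AI became stronger than the world's top professional players.
• AlphaGo Zero (Silver et al. 2017)
  • Beats AlphaGo using self-learning alone.
• AlphaZero (Silver et al. 2018)
  • Can handle various board games.
  • Becomes strong through self-learning alone; humans do not need to prepare data.
  • Beats AlphaGo at Go.
  • Also beats Stockfish at chess and elmo at shogi.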

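Is it hard to obtain the strongest game AI?
• The principle is simple, and anyone can obtain the source code.
• Even so, obtaining the strongest Go AI is hard.
  • Google develops hardware for machine learning and deep learning.
  • Only by running on that hardware does training become possible in practical time.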

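What is good about deep learning?
• High performance
• Easy to adopt
  • Well-known algorithms are built into deep learning development environments.
  • Those that are not built in are often released as open source.
  • Non-experts can use it.
  • Pretrained models are published, so the latest high-performance models can be used right away.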

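Can you use deep learning?
• To use it, you need to:
  • Be able to use the programming language Python.
  • Know some deep learning terminology.
  • Have a feel for the technical content.
  • Be able to use Google Colaboratory.
    • Just connecting to Google Colaboratory over the internet is enough to build an AI.
  • In short, anyone with motivation can use it.
• For a shallow understanding, you need:
  • Linear algebra, calculus, probability, and statistics.
  • High-school-level mathematics should be enough for all of these (I think).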

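Keeping cats from leaving droppings (2016)
[Figure: an embedded board; when a cat comes, the sprinkler is triggered.]
By Mr. Bond (age 65), an engineer at NVIDIA.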

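Cucumber sorter (2016)
• Mr. Koike, a cucumber farmer, used deep learning software to develop a device that sorts cucumbers.
You can use deep learning even if you are not a machine learning professional!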

Technical problems of deep learning
• Demand for lightweight, fast, high-performance deep learning
  • We want to process at the edge.
• Coping with a shortage of labeled data
  • Can animals learn even from little data?
• Models are so large that the cause is hard to pinpoint when a problem occurs.
  • Why was a person mistaken for a gorilla? (Discovered in 2015 and still unresolved.)
• Changing the input slightly can make it be classified as something else.
• Transfer learning
  • Can a machine that solves one problem be repurposed for another?
• Can it understand abstract instructions?
[Figure 1: a stop sign with real graffiti (left) and a stop sign with a physical perturbation designed to mimic graffiti (right). The perturbed sign is classified as a speed limit sign.]
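The observation that a slight change to the input can change the classification can be reproduced on a toy linear classifier. This is only an illustration of the general idea and is not the physical attack shown in the figure: the weights, the input, and the step size are hypothetical, and the perturbation simply nudges each input value a little in the direction that lowers the classifier's score.

```python
def classify(w, b, x):
    """Linear classifier: class 1 if the score is positive, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perturb(w, x, eps):
    """Shift every input value by at most eps, against the sign of its weight."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]


# Hypothetical classifier weights and input, chosen for illustration.
w = [0.5, -0.4, 0.3, -0.6, 0.2, -0.5, 0.4, -0.3]
b = 0.0
x = [0.6, 0.4, 0.5, 0.3, 0.7, 0.4, 0.5, 0.5]

x_adv = perturb(w, x, eps=0.05)
label_before = classify(w, b, x)
label_after = classify(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x, x_adv))
print(label_before, label_after, max_change)
```

No input value moves by more than 0.05, yet the small changes add up across all inputs and push the score across the decision boundary. With many more input dimensions, as in images, even smaller per-pixel changes can have the same effect.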

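Cautions about deep learning
• Assume that no technique is guaranteed to be right.
  • The technology is still developing.
  • What you learn today may be outdated tomorrow.
  • Staying knowledgeable about deep learning requires daily study.
• Anyone has a chance with the following:
  • Motivation, ideas, speed, free time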

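A final word
• There is no need to fear artificial intelligence more than necessary.
• There is also no need to dismiss it as nothing special.
• What matters is how to make use of it.
  • Make the problem concrete and quantify the phenomena so that a computer can solve it.
• Let's advance the technology with free thinking.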

Exercises

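Exercise
• Which of the following is incorrect as a characteristic of deep learning?
  1. High-accuracy recognition is possible by training on only a small amount of data.
  2. It is recommended to run deep learning on GPUs, which are used for games and the like, rather than on CPUs.
  3. It has ability equal to or better than humans in image recognition.
  4. Because the structure is huge and complex, it can be hard to identify the cause when a problem occurs.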

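Exercise
• Which of the following is incorrect as a characteristic of deep learning?
  1. High-accuracy recognition is possible by training on only a small amount of data. (Training requires a large amount of data.)
  2. It is recommended to run deep learning on GPUs, which are used for games and the like, rather than on CPUs.
  3. It has ability equal to or better than humans in image recognition.
  4. Because the structure is huge and complex, it can be hard to identify the cause when a problem occurs.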

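Exercise
• Which of the following is a characteristic of deep learning in AI? (Adapted from the Fundamental Information Technology Engineer Examination, Spring 2018)
  1. Humans set rules of the form "if A then B" in advance, and a solution is obtained as the result of inference based on rules that express new knowledge as logical formulas.
  2. It is a method that aims for a solution as close to correct as possible even if it is not exact, and it enables broad, general-purpose problem solving without specializing in a particular field.
  3. It uses neural networks to enable high-performance image recognition, image generation, and so on.
  4. Although limited to fields such as medical diagnosis where decision rules can be written, it can make judgments with high probability, such as narrowing symptoms down to a specific disease.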

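Exercise
• Among the following descriptions of improvements in in-vehicle equipment, which one uses deep learning? (Fundamental Information Technology Engineer Examination, Autumn 2017)
  1. An acceleration sensor detected the car's collision with a wall and inflated the airbag, protecting the occupants from injury.
  2. By acquiring and processing a large number of images, the system became able to distinguish pedestrians from cars more reliably.
  3. By installing a device that automatically performs idling stop, fuel economy improved beyond that achieved by a highly experienced driver.
  4. The navigation system updated its software over the mobile phone network and updated the map.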
