Pythonで動かして学ぶ機械学習入門第三回　ディープラーニング理論

σΟʔϓϥʔχϯάཧ࿦   PythonͰಈֶ͔ͯ͠Ϳػցֶशೖ໳ ୈࡾճʢશ࢛ճγϦʔζʣ URL : http://shiroyagi.connpass.com/event/41884/ ٠ా ངฏ  2016/10/27

ߨࢣ঺հ • ٠ా ངฏʢ͖ͨ͘ Α͏΁͍ʣ • ത࢜ʢཧֶʣ • ݱࡏ͸๭ίϯαϧςΟϯάϑΝʔϜʹͯσʔλ෼ੳۀ຿ʹैࣄ •
ಘҙ෼໺  ɾػցֶशͷཧ࿦తଆ໘  ɾਪનΞϧΰϦζϜ  ɾը૾෼ੳʢDeep Learningʣ • ࿈བྷઌ  Կ͔͋Γ·ͨ͠Β͓ؾܰʹ͝࿈བྷ͍ͩ͘͞  Email : [email protected]  Facebook : https://www.facebook.com/yohei.kikuta.3  Linkedin : https://jp.linkedin.com/in/yohei-kikuta-983b29117

໨࣍ • σΟʔϓϥʔχϯάͱ͸Կ͔  σΟʔϓϥʔχϯάͷఆٛͱԡ͓͑ͯ͘͞΂͖جຊతͳੑ࣭  • σΟʔϓϥʔχϯάͷཧ࿦తجૅ  σΟʔϓϥʔχϯάͷίΞͱͳΔཁૉ  • σΟʔϓϥʔχϯάͷ۩ମతϞσϧ  ໨తʹԠͨ͡۩ମతͳϞσϧͷߏ੒
※਺ࣜ͸΄ͱΜͲ࢖Θͳ͍ߏ੒ͱͳ͍ͬͯ·͢

σΟʔϓϥʔχϯάͱ͸Կ͔

σΟʔϓϥʔχϯάͷ࿩୊ ImageNet Large Scale Visual Recognition Challenges (ILSVRC) ग़ॴɿhttp://danielnouri.org/notes/2014/01/10/using-deep-learning-to-listen-for-whales/

σΟʔϓϥʔχϯάͷ࿩୊ ImageNet Large Scale Visual Recognition Challenges (ILSVRC) ग़ॴɿhttp://image-net.org/
Classification error [%] σΟʔϓϥʔχϯάొ৔ !! ਓؒͷࣝผੑೳ

σΟʔϓϥʔχϯάͷ࿩୊ Facebook mobile Question Answering app. ग़ॴɿhttp://venturebeat.com/2015/11/03/facebook-builds-a-mobile-app-that-answers-spoken-questions-about-photos-with-ai/

σΟʔϓϥʔχϯάͷ࿩୊ Google AlphaGo ग़ॴɿhttp://9801.me/?p=1738  ɹɹɹwww.nature.com/nature/journal/v529/n7587/abs/nature16961.html?lang=en ARTICLE RESEARCH learning of
convolutional networks, won 11% of games against Pachi23 and 12% against a slightly weaker program, Fuego24. Reinforcement learning of value networks The final stage of the training pipeline focuses on position evaluation, (s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), and prior probability P(s, a). The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step t of each simulation, an action at is selected from state st Figure 3 | Monte Carlo tree search in AlphaGo. a, Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge. b, The leaf node may be expanded; the new node is processed once by the policy network pσ and the output probabilities are stored as prior probabilities P for each action. c, At the end of a simulation, the leaf node is evaluated in two ways: using the value network vθ ; and by running a rollout to the end of the game with the fast rollout policy pπ , then computing the winner with function r. d, Action values Q are updated to track the mean value of all evaluations r(·) and vθ (·) in the subtree below that action. Selection a b c d Expansion Evaluation Backup p S p V Q + u(P) Q + u(P) Q + u(P) Q + u(P) P P P P Q Q Q Q Q r r r r P max max P Q T Q T Q T Q T Q T Q T

σΟʔϓϥʔχϯάͷ࿩୊ Image creation ग़ॴɿhttp://gigazine.net/news/20150707-deep-dreaming-fear/  ɹɹɹhttps://arxiv.org/pdf/1508.06576v2.pdf Figure 2: Images that
combine the content of a photograph with the style of several well-known artworks. The images were created by ﬁnding an image that simultaneously matches the content representation of the photograph and the style representation of the artwork (see Methods). The original photograph depicting the Neckarfront in T¨ ubingen, Germany, is shown in A (Photo:

σΟʔϓϥʔχϯάͷ׆༂ σΟʔϓϥʔχϯά͸͋ΒΏΔ෼໺ʹਐग़ͯͦ͠ͷҖྗΛൃش͍ͯ͠Δ • ը૾ೝࣝɺಈը෼ੳ • ࣗಈӡస • Ի੠ೝࣝٴͼ຋༁ •
ࣗવݴޠॲཧ • ਪનγεςϜ • ը૾ੜ੒ɺԻָੜ੒ • ήʔϜૢ࡞ɺϩϘοτ੍ޚ • … ͦ΋ͦ΋σΟʔϓϥʔχϯάͷ࣮ମ͸Կͳͷ͔ʁ

σΟʔϓϥʔχϯάͷఆٛ σΟʔϓϥʔχϯά ਂ͍ (Neural Network) (ػց) ֶश ԿΛͲͷΑ͏ʹֶश͢Δͷ͔ʁ   
  ※ͦ΋ͦ΋ػցֶशͱ͸Կ͔ʹؔͯ͠͸  ୈҰճษڧձͷࢿྉΛࢀর  https://speakerdeck.com/diracdiego/pythondedong-kasitexue- buji-jie-xue-xi-ru-men-di-hui-ji-jie-xue-xi-falseli-jie Α͘ԼਤͷΑ͏ͳֆ͕ඳ͔ΕΔ͕ ͜Ε͸ԿΛද͍ͯ͠Δͷ͔ʁ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ

Neural Network (NN) ͱ͸Կ͔ • ೴ͷਆܦܥΛ໛ͨ͠Ϟσϧ  ೴ͷਆܦܥ͸ೖྗ৴߸ʹΑΓγφϓεͷ݁߹ڧ౓͕มԽ͠ɺ  ൃୡͨ͠ωοτϫʔΫߏ଄ʹΑΓ໰୊Λղܾ͢ΔೳྗΛ֫ಘ ग़ॴɿhttp://www.bordalierinstitute.com/target1.html

Neural Network (NN) ͱ͸Կ͔ • ೴ͷਆܦܥΛ໛ͨ͠Ϟσϧ  ݸʑͷχϡʔϩϯ͸୯७ͳ਺ֶతϞσϧͰදݱՄೳ  ͜ͷχϡʔϩϯΛେྔʹͭͳ͗߹Θͤͨ΋ͷ͕ NN
ͱͳΔ ग़ॴɿhttps://openclipart.org/detail/245373/neuron ೖྗ1 ೖྗ2 ग़ྗ ᮢ஋

Neural Network (NN) ͱ͸Կ͔ • େྔͷϞσϧ͕ଘࡏ  ༷ʑͳಈػͰൃలܥ͕ߟҊ  ೴ਆܦܥͱ͸͔͚཭Εͨ΋ͷ΋ଟ͍  ʢartificial
NN ͱݺΜͩΓ΋͢Δʣ • ຊษڧձͰ͸ͦͷҰ෦Λ঺հ  ଟ͘ͷ NN ͷجૅͱͳΔ΋ͷΛ঺հ ग़ॴɿhttp://www.asimovinstitute.org/neural-network-zoo/

୯७ύʔηϓτϩϯ • ߏ଄  ೖྗ஋ͷઢܗ݁߹ͱ׆ੑԽؔ਺ͷม׵ͷΈͷγϯϓϧͳ΋ͷ • ௕ॴ  γϯϓϧ͕ͩ ”ֶश” ͕Մೳ 
ʢॏΈΛߋ৽ͯ͠σʔλΛ෼཭͢Δઢܗฏ໘Λߏஙʣ • ୹ॴ  ઢܗ෼཭ෆՄೳͳ໰୊ʢXOR໰୊ʣ͸ղ͚ͳ͍ ೖྗ1 ೖྗn ग़ྗ ᮢ஋ ೖྗi ɾ  ɾ ɾ ɾ ɾ ɾ ɾ ɾ ॏΈ1 ॏΈi ॏΈn ೖྗ ᮢ஋ ×× × × × × × × × × ׆ੑԽؔ਺ εςοϓؔ਺ γάϞΠυؔ਺

ଟ૚ύʔηϓτϩϯ • ߏ଄  ଟ਺ͷ୯७ύʔηϓτϩϯΛ֊૚తʹੵΈॏͶͨ΋ͷ • ௕ॴ  े෼ͳ਺ͷϊʔυ͕͋Ε͹೚ҙͷؔ਺ΛۙࣅՄೳ  ૚ΛੵΈॏͶΔ͜ͱͰೖग़ྗͷෳࡶͳؔ܎ΛදݱՄೳ •
୹ॴ  ύϥϝλ͕ଟ͘ɺֶश্͕ख͍͔͘ͳ͍͜ͱ͕ଟ͍ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ೖྗ૚ ӅΕ૚ ૚ΛॏͶͯෳ਺ʹ͢Δ͜ͱ͕Ͱ͖Δ ग़ྗ૚ ॏΈ ॏΈ

ֶशʹ͓͚Δࠔ೉ • աֶश͠΍͍͢  ύϥϝλ͕ଟ͘දݱྗ͕ߴ͍෼ɺख࣋ͪͷσʔλΛֶश͠”ա͗Δ”  ͦͷͨΊະ஌σʔλʹର͢Δੑೳʢ൚Խੑೳʣ͕௿͘ͳͬͯ͠·͏ ੑೳ ֶशճ਺ ςετσʔλʹର͢Δ݁Ռ ֶशσʔλʹର͢Δ݁Ռ
աֶश ྖҬ ໨తม਺ આ໌ม਺

ֶशʹ͓͚Δࠔ೉ • ӅΕ૚Λ૿΍͢ͱֶश͕·ͱ΋ʹਐ·ͳ͍  ֶश͸ޡࠩٯ఻೻๏ʢৄࡉ࣍અʣͰߦ͏  ͜Ε͸ NN ͷ༧ଌͱ౴͑ͷ৯͍ҧ͍Λੋਖ਼͢ΔΑ͏ʹॏΈΛߋ৽  ૚͕ਂ͘ͳΔͱߋ৽ͷӨڹ͕౸ୡͤͣʹֶश͕͏·͍͔͘ͳ͍ ɾ
ɾ ɾ ɾ ɾ ɾ ೖྗ૚ ग़ྗ૚ ɾ ɾ ɾ ɾ ɾ ɾ ༧ଌ ༧ଌͱ౴͑ͷͣΕमਖ਼͢ΔΑ͏ֶश ೖྗ૚ʹ͍ۙͱ͜Ζ͸मਖ਼ͷ৴߸͕ಧ͖ʹ͍͘ ༧ଌͱ౴͑Λൺֱ

ࠔ೉ͷղܾ ۙ೥༷ʑͳঢ়گͷมԽ΍ख๏ͷ։ൃͰલทͷ໰୊͸ղܾʢৄࡉ࣍અʣ • σʔλྔ΍ܭࢉࢿݯͷ૿Ճɺ࠷దԽख๏ͷൃୡ  ѻ͑Δσʔλͷྔ΍ܭࢉػॲཧੑೳ͕େ෯ʹ૿େ  ֶशΛޮ཰తʹਐΊΔ࠷దԽख๏΋ൃୡ • ࣄલֶशͷಋೖ  ॏΈͷॳظ஋ͱͯ͠༗༻ͳ΋ͷΛ֫ಘ
• ֶश͕ೖྗ૚ۙ͘·Ͱ͏·͍͘͘Α͏ʹ׆ੑԽؔ਺ΛσβΠϯ  Rectified Liner Unit (ReLU) ͷ։ൃ • աֶशͷճආख๏΍ਖ਼نԽख๏ͷߟҊ  dropout ΍όονਖ਼نԽͳͲͰ NN ͷੑೳ͕޲্ • …

ͦͯ͠σΟʔϓϥʔχϯά΁ • σΟʔϓϥʔχϯάͷຊ࣭తʹ৽͍͠ͱ͜Ζ͸Ͳ͔͜ʁ  ࣮͸ਅ৽͍͜͠ͱ͸͋·Γͳ͍ʢReLU͚ͩͱ͍͏ਓ΋͍Δʣ  લทͷΑ͏ͳൃల͕૊Έ߹Θͬͨ݁͞Ռ্ख͍͘͘Α͏ʹͳͬͨ • ػ͕ख़ͯ͠ ILSVRC 2012
Ͱ࣮݁  σΟʔϓϥʔχϯάΛ࢖Θͳ͍ख๏ΑΓ΋ 10% Ҏ্ਫ਼౓͕ߴ͔ͬͨ ˞࣮͸Ԡ༻Ͱઌʹ੒ՌΛग़͍ͯ͠Δͷ͸Ի੠ೝࣝ෼໺ͩͬͨΓ͢Δ • ϥΠϒϥϦͷॆ࣮ͰҰؾʹ޿͕ͬͨ  Caffe ΍ Tensorflow ͳͲͷొ৔Ͱ୭Ͱ΋ѻ͑ΔΑ͏ʹͳͬͨ  ಛʹඇߏ଄Խσʔλʢը૾΍Ի੠΍ςΩετʣʹର༷ͯ͠ʑͳԠ༻  GitHub ͷ·ͱΊ৘ใɿ https://github.com/ChristosChristofidis/awesome-deep-learning

σΟʔϓϥʔχϯάͷఆٛ σΟʔϓϥʔχϯά ਂ͍ (Neural Network) (ػց) ֶश NN ͷॏΈͷֶश͸೉͍͠
ֶशΛద੾ʹ࣮ߦ͢ΔͨΊ༷ʑͳख๏͕ߟҊ • େྔͷσʔλΛ࢖༻ • ࠷దԽख๏ͷൃୡ • ࣄલֶश • ReLU ͳͲͷ৽͍͠׆ੑԽؔ਺ͷߟҊ • dropout ΍όονਖ਼ଇԽʹΑΔੑೳ޲্ ӅΕ૚͕2૚Ҏ্ͷ NN  NN ͸ύʔηϓτϩϯΛੵΈ্͛ͨ΋ͷ  ֶशͰॏΈΛߋ৽ͯ͠༧ଌਫ਼౓Λ޲্ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ

σΟʔϓϥʔχϯάͷੑ࣭ σΟʔϓϥʔχϯάͷಛච͢΂͖ੑ࣭Λྻڍ • େྔͷύϥϝλΛେྔͷσʔλ͔Βֶश  ֶशσʔλΑΓύϥϝλ਺ͷํ͕ଟ͔ͬͨΓ΋͢Δ • ෼ࢄදݱͱଟ૚ߏ଄ʹΑΔߴ͍දݱྗͷ֫ಘ  ؍ଌͨ͠σʔλͷຊ࣭తͳಛ௃Λଊ͑Δ͜ͱ͕Մೳ •
ಛ௃ྔͷࣗಈநग़  ػցֶशʹ͓͚Δ෼ੳϑϩʔʹ͓͚ΔύϥμΠϜγϑτ • ಛఆͷλεΫͰߴ͍ੑೳ  ಛʹඇߏ଄ԽσʔλΛѻ͏ը૾෼ੳɺԻ੠ೝࣝɺࣗવݴޠॲཧͳͲ

σΟʔϓϥʔχϯάͷੑ࣭ • େྔͷύϥϝλΛେྔͷσʔλ͔Βֶश  ࠷దԽର৅ͱͳΔύϥϝλ͸ʢଞͷػցֶशख๏ͱൺ΂Δͱʣଟ͍  ɹ→ 100ສݸҎ্ͷύϥϝλͱͳΔ͜ͱ΋…  ɹɹ cf.) ਓؒͷେ೴ൽ࣭ͷਆܦࡉ๔͸100ԯݸҎ্
r net with this 3-layer bottleneck block, resulting in yer ResNet (Table 1). We use option B for increasing ions. This model has 3.8 billion FLOPs. -layer and 152-layer ResNets: We construct 101- nd 152-layer ResNets by using more 3-layer blocks 1). Remarkably, although the depth is significantly ed, the 152-layer ResNet (11.3 billion FLOPs) still wer complexity than VGG-16/19 nets (15.3/19.6 bil- OPs). 50/101/152-layer ResNets are more accurate than layer ones by considerable margins (Table 3 and 4). not observe the degradation problem and thus en- nificant accuracy gains from considerably increased The benefits of depth are witnessed for all evaluation (Table 3 and 4). arisons with State-of-the-art Methods. In Table 4 mpare with the previous best single-model results. seline 34-layer ResNets have achieved very compet- curacy. Our 152-layer ResNet has a single-model alidation error of 4.49%. This single-model result orms all previous ensemble results (Table 5). We e six models of different depth to form an ensemble method error (%) Maxout [10] 9.38 NIN [25] 8.81 DSN [24] 8.22 # layers # params FitNet [35] 19 2.5M 8.39 Highway [42, 43] 19 2.3M 7.54 (7.72±0.16) Highway [42, 43] 32 1.25M 8.80 ResNet 20 0.27M 8.75 ResNet 32 0.46M 7.51 ResNet 44 0.66M 7.17 ResNet 56 0.85M 6.97 ResNet 110 1.7M 6.43 (6.61±0.16) ResNet 1202 19.4M 7.93 Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [43]. so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts. We use a weight decay of 0.0001 and momentum of 0.9, ग़ॴɿhttps://www.robots.ox.ac.uk/~vgg/rg/papers/deepres.pdf 110૚Ͱ170ສύϥϝλʂ

σΟʔϓϥʔχϯάͷੑ࣭ • େྔͷύϥϝλΛେྔͷσʔλ͔Βֶश  ࠷దԽର৅ͱͳΔύϥϝλ͸ʢଞͷػցֶशख๏ͱൺ΂Δͱʣଟ͍  ɹ→ 100ສݸҎ্ͷύϥϝλͱͳΔ͜ͱ΋…  ͜ΕΛֶश͢ΔͨΊʹେྔͷը૾σʔλΛ࢖༻͢Δ  ɹ→ ImageNet
(http://www.image-net.org/) Ͱެ։͍ͯ͠Δը૾͸ 1400ສຕ  ɹ→ ͜ͷେྔσʔλͷֶशͷͨΊʹ GPU ͕ඞਢ ग़ॴɿhttp://static.googleusercontent.com/media/research.google.com/ja//archive/unsupervised_icml2012.pdf Building high-level features using large-scale unsupervised the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homoge- neous and belong to one selected category. Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D. logical and computation Lyu & Simoncelli, 2008; As mentioned above, cen of local connectivity bet ments, the first sublayer pixels and the second su lapping neighborhoods o The neurons in the first su input channels (or maps second sublayer connect (or map).3 While the firs responses, the pooling lay the sum of the squares o is known as L2 pooling. Our style of stacking ules, switching betwe ance layers, is remini HMAX (Fukushima & M 1998; Riesenhuber & Po been argued to be an a brain (DiCarlo et al., 201 Although we use local not convolutional: the across different location a stark difference betw vious work (LeCun et a Building high-level features using large-scale unsupervised learning and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among 20 thresholds. 4.3. Recognition Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy while best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%. To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of best neuron drops to 78.5%. This agrees with previous study showing the importance of local contrast normalization (Jarrett et al., 2009). We visualize histograms of activation values for face images and random images in Figure 2. It can be seen, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output value less than 0. Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one. tested neuron, by solving: x∗ = arg min x f(x; W, H), subject to ||x||2 = 1. Here, f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x. In our experiments, this constraint optimization problem is solved by projected gradient descent with line search. These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron in- deed learns the concept of faces. Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization. 4.5. Invariance properties Googleͷೣͷ࿦จͰ͸  1000ສຕͷ Youtube ը૾Λ࢖༻

σΟʔϓϥʔχϯάͷੑ࣭ • ෼ࢄදݱͱଟ૚ߏ଄ʹΑΔߴ͍දݱྗͷ֫ಘ  ෼ࢄදݱͰσʔλΛදݱ  ɹ→ ϞσϧͷදݱྗʢσʔλۭؒΛ෼ׂ͢Δೳྗʣ͕ࢦ਺తʹ্ঢ  ɹɹ Ex.) 3࣍ݩͷόΠφϦม਺Ͱͷදݱྗ
Symbolic rep. Distributed rep. 3ύλʔϯΛදݱ 2 ύλʔϯΛදݱ i j k i, j, k = {0,1} (1,1,1) (1,1,0) (0,0,1) (1,0,1) (0,1,1) (1,0,0) (0,1,0) 3

σΟʔϓϥʔχϯάͷੑ࣭ • ෼ࢄදݱͱଟ૚ߏ଄ʹΑΔߴ͍දݱྗͷ֫ಘ  ෼ࢄදݱͰσʔλΛදݱ  ɹ→ ϞσϧͷදݱྗʢσʔλۭؒΛ෼ׂ͢Δೳྗʣ͕ࢦ਺తʹ্ঢ  ଟ૚ߏ଄ͰσʔλΛදݱ  ɹ→ ஈ֊తʹදݱΛֶशͦ͠ͷ૊Έ߹ΘͤͰෳࡶͳදݱ͕Մೳ
ग़ॴɿhttps://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/

σΟʔϓϥʔχϯάͷੑ࣭ • ෼ࢄදݱͱଟ૚ߏ଄ʹΑΔߴ͍දݱྗͷ֫ಘ  ෼ࢄදݱͰσʔλΛදݱ  ɹ→ ϞσϧͷදݱྗʢσʔλۭؒΛ෼ׂ͢Δೳྗʣ͕ࢦ਺తʹ্ঢ  ଟ૚ߏ଄ͰσʔλΛදݱ  ɹ→ ஈ֊తʹදݱΛֶशͦ͠ͷ૊Έ߹ΘͤͰෳࡶͳදݱ͕Մೳ 
ɹ→ ઙ͍ωοτϫʔΫͱൺ΂ͯগͳ͍ύϥϝλͰ࣮ݱՄೳ

σΟʔϓϥʔχϯάͷੑ࣭ • ಛ௃ྔͷࣗಈநग़  લทͷදݱֶशʹΑΓσʔλͷಛ௃ྔΛࣗಈతʹநग़Մೳ  ɹ→ ͜Ε·Ͱ͸ಛ௃ྔ࡞੒͕ॏཁͩͬͨ  ɹɹ ಛʹඇߏ଄Խσʔλʹରͯ͠͸ݦஶ  ɹ→
σΟʔϓϥʔχϯάͰ͸͜ͷεςοϓ͕͍Βͳ͍ʢগͳ͍ʣ  ɹɹ ͔͠΋ैདྷख๏ΑΓ΋ਫ਼౓͕ߴ͍

σΟʔϓϥʔχϯάͷੑ࣭ • ಛ௃ྔͷࣗಈநग़ ग़ॴɿhttp://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html ैདྷͷػցֶशͰ͸ ಛ௃ྔ࡞੒͸ਓखͰ࣮ߦ σΟʔϓϥʔχϯάͰ͸ ֊૚తʹಛ௃ྔΛࣗಈ࡞੒

σΟʔϓϥʔχϯάͷੑ࣭ • ಛఆͷλεΫͰߴ͍ੑೳ  ಛ௃ྔ࡞੒͕೉͍͠ඇߏ଄Խσʔλʹରͯ͠ಛʹҖྗΛൃش  λεΫΛݶఆ͢Ε͹ਓؒΛ্ճΔੑೳΛ࣋ͭ΋ͷ΋਺ଟ͘ग़ݱ  ɹ→ ը૾෼ੳ  ɹ→ Ի੠ೝࣝ 
ɹ→ ࣗવݴޠॲཧ  ɹ→ ήʔϜૢ࡞΍ϩϘοτ੍ޚ  ɹ→ …

σΟʔϓϥʔχϯάͷཧ࿦తجૅ

ཧ࿦తجૅ ͜ͷઅͰ͸લઅʹొ৔ٕͨ͠ज़త಺༰Λৄ͘͠આ໌͢Δ  શ෦͸આ໌͖͠Εͳ͍ͷͰԼهτϐοΫΛऔΓ্͛Δ • ୯७ύʔηϓτϩϯͷֶश • ଟ૚ύʔηϓτϩϯͷֶशํ๏ͱޡࠩٯ఻೻๏ • ࣄલֶश
• ׆ੑԽؔ਺ Rectified Linear Unit (ReLU) • dropout (ਖ਼ଇԽ) • දݱֶशͱͯ͠ͷଆ໘ • …

୯७ύʔηϓτϩϯͷֶश ༧ଌͱ౴͕͑ҟͳΔ΋ͷ͔Βࣝผฏ໘Λߋ৽͍ͯ͘͠  Ex.) ೖྗม਺͕ೋ࣍ݩͰೋ஋൑ผͷ৔߹ ֶशͷՄࢹԽͳͲʹ͸ http://playground.tensorflow.org/ ͕༗༻ ύʔηϓτϩϯͷֶश͸ઢܗ෼཭ՄೳͳΒ͹༗ݶճͰऩଋ https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/lec2.pdf
͔͠͠ઢܗ෼཭ෆՄೳ໰୊͸ղ͘͜ͱ͕Ͱ͖ͳ͍ × × × × × × × × ߋ৽

ޡࠩٯ఻೻๏ ଟ૚ύʔηϓτϩϯͷֶशํ๏ͰσΟʔϓϥʔχϯάֶशͷجૅ ୯७ύʔηϓτϩϯͱಉ͡Α͏ʹ༧ଌͱ౴͑ͷҧ͍Λ΋ͱʹֶश͢Δ͕ɺ ૚͕ଟ͍ͷͰग़ྗ͔Βॱʑʹ఻೻͍ͤͯ͘͞ ɾ ɾ ɾ ɾ ɾ
ɾ ɾ ɾ ɾ ɾ ɾ ɾ ೖྗ૚ ग़ྗ૚ ༧ଌͱ౴͑ͷͣΕΛੋਖ਼͢ΔͨΊͷॏΈߋ৽ͷ৴߸ΛඥΛҾͬுͬͯ఻͑ΔΠϝʔδ ઢ͕ଠ͍΄ͲॏΈͷߋ৽ྔ͕େ͖͍͜ͱΛදݱ͍ͯ͠Δ  → Ͳ͏ͯ͠΋ೖྗ૚ʹ͍ۙ෦෼͸৴߸͕ऑ͘ͳͬͯ͠·͏ʢޯ഑ফࣦ໰୊ʣ

ɾ ɾ ɾ ɾ ɾ ɾ ɾ ೖྗ૚ ग़ྗ૚ ୯७ύʔηϓτϩϯͱ͸ҟͳΓઢܗ෼཭ෆՄೳ໰୊΋ղ͘͜ͱ͕Մೳ  ͜Ε΋ http://playground.tensorflow.org/ Ͱ֬ೝՄೳ

ɾ ɾ ɾ ɾ ɾ ɾ ɾ ೖྗ૚ ग़ྗ૚ ޡࠩٯ఻೻๏͸େҬత࠷దղʹ͸ͨͲΓண͚ͣہॴత࠷దղʹḷΓͭ͘΋ͷͰ͋Δ ※େҬత࠷దղΛٻΊΔͷ͸ NP ࠔ೉

େҬղͱہॴղ ྫ͑͹͋Δؔ਺ͷ࠷খ஋ΛٻΊΔͱ͖ ɹେҬղɿͦͷؔ਺ͷఆٛҬʹ͓͍ͯ࠷΋஋͕খ͘͞ͳΔ఺ ɹہॴղɿ࠷΋஋͕খ͍͞อূ͸ͳ͍͕ۙ๣ͷ఺ΑΓ͸஋͕খ͍͞఺ ग़ॴɿhttp://mirlab.org/jang/courses/cs4601/ga.htm ہॴղ େҬղ

େҬղ͸෼͔Βͳͯ͘΋ྑ͍ʁ σΟʔϓϥʔχϯάͰ࠷దԽͷ෣୆ͱͳΔͷ͸ඇৗʹߴ࣍ݩͷۭؒ  ͋Δํ޲Ͱ͸ہॴղͰ΋ଞͷํ޲ΛݟΕ͹ΑΓ௿͍ͱ͜Ζʹߦ͖ಘΔ ※ߴ࣍ݩۭؒͰ͸ͦͷΑ͏ͳํ޲͕਺ଟ͘ଘࡏ͢Δ͜ͱ͕ࣔࠦ → ͦ͏ͯ͠ͲΜͲΜہॴղΛ߱Γ͍ͯ͘ͱେҬղʹे෼ۙ͘ͳΔ  ɹ ※͜Ε͸ৗʹ੒ΓཱͭҰൠతੑ࣭Ͱ͸ͳ͍͜ͱʹ஫ҙ ग़ॴɿhttps://arxiv.org/pdf/1406.2572.pdf

ଟ૚ύʔηϓτϩϯͷදݱྗ े෼ͳ਺ͷӅΕ૚͕Ұ૚͋Ε͹೚ҙͷਫ਼౓Ͱ೚ҙͷؔ਺ΛۙࣅͰ͖Δ ʢສೳੑఆཧɿhttp://deeplearning.cs.cmu.edu/pdfs/Cybenko.pdfʣ  ෦෼తͳεςοϓؔ਺ͷΑ͏ͳ΋ͷΛେྔʹ૊Έ߹ΘͤΔΠϝʔδ ࢹ֮తͳղઆαΠτ͸ͪ͜Βɿhttp://neuralnetworksanddeeplearning.com/chap4.html ɾ ɾ ɾ ग़ྗ
ೖྗ Ex.) ೖྗ૚ͱग़ྗ૚ ͕Ұ࣍ݩͷ৔߹ ͨͩ͠લड़ͷΑ͏ʹӅΕ૚ͷ૚਺Λ૿΍ͨ͠ํ͕গͳ͍ύϥϝλͰࡁΉ  ʢͦͯ͠ஈ֊తͳಛ௃ྔநग़΋Մೳʣ

ࣄલֶश ໨తͱ͢Δ൑ผ໰୊΍ճؼ໰୊ʹର͢Δޡࠩٯ఻೻๏Λߦ͏લͷֶश େ͖͘෼͚ͯԼهͷೋͭͷύλʔϯ͕͋Δ • ڭࢣແ͠ࣄલֶश  ॳظͷσΟʔϓϥʔχϯά੒ޭΛݗҾ  Ұ૚ຖͷֶशͰॏΈͷॳظ஋Λ֫ಘ͍ͯ͘͠  ͨͩ͠ݱࡏ͸ࣗવݴޠॲཧҎ֎Ͱ͸͋·Γ࢖ΘΕ͍ͯͳ͍ •
ڭࢣ༗Γࣄલֶश  ը૾෼ੳʹ͓͚Δ pre-trained Ϟσϧ͕ݦஶͳྫ  େྔͷσʔλ͔Βීวతͳදݱֶ͕शͰ͖͍ͯΕ͹ԿʹͰ΋࢖͑Δ

ڭࢣແ͠ࣄલֶश Ұ૚ຖͷֶशͰॏΈΛ֫ಘ  (ϊΠζΛՃ͑ͨೖྗͰܭࢉͨ݁͠Ռ) = (ݩͷೖग़ྗ)ͱͳΔΑ͏ֶश ɾ ɾ ɾ ɾ
ɾ ɾ ೖྗ x ɾ ɾ ɾ ग़ྗ x’’ ϊΠζೖྗ x’ ӅΕ૚ h’ x’’ → x

ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ೖྗ x ϊΠζೖྗ x’ ӅΕ૚ h’ ӅΕ૚ h’ ग़ྗ h’’ h’’ → h 2

ɾ ɾ ɾ ɾ ɾ ग़ྗ y ೖྗ x ӅΕ૚ h ӅΕ૚ h2 ڭࢣແ͠ࣄલֶशͰ֫ಘͨ͠ॏΈ͸ޡࠩٯ఻೻๏ʹ͓͚Δྑ͍ॳظ஋ ۩ମྫ͸ http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf ͳͲ ͔͠͠ݱࡏͰ͸ࣄલֶश͕ඞཁͳ͍ϞσϧʢCNNʣͳͲ΋ߟҊ͞Εɺ ڭࢣແ͠ࣄલֶशΛ࢖͏ͷ͸ࣗવݴޠॲཧ͘Β͍

ڭࢣ༗Γࣄલֶश ໨తͱ͢Δσʔληοτͱ͸ผͷσʔλͰֶश͓ͯ͘͜͠ͱ సҠֶशʢσΟʔϓϥʔχϯάͱ͸ಠཱͷ֓೦ʣͱ͍͏ݺͼํ͕Ұൠత ը૾Ͱͷྫ͕ݦஶ ɹ→ େྔσʔλ͔ΒΤοδ΍ͦͷ૊߹ͤύλʔϯΛֶशͰ͖͍ͯΕ͹ ɹ ɹ໨తͷσʔληοτʹରͯ͠΋༗ޮͩͱߟ͑ΒΕΔ ࣮ࡍʹ
ImageNet ͷσʔλΛ༻͍ͨ pre-trained Ϟσϧ͕Α͘࢖ΘΕΔ  Caffe model zoo (http://caffe.berkeleyvision.org/model_zoo.html) ͳͲ͕༗໊ pre-trained Ϟσϧ͸1000Ϋϥε෼ྨثͰͦͷޙ໨తͷ෼ྨثʹ fine-tuning σʔλ෼෍͔Βಛ௃తͳදݱΛֶशͰ͖Ε͹ࣄલֶश͸༗༻

׆ੑԽؔ਺ ReLU Α͘࢖ΘΕ͍ͯͨγάϞΠυؔ਺͸୺ͷํͰޯ഑͕ग़ʹ͍͘  ɹ→ ֶश͕ਐ·ͳ͘ͳͬͯ͠·͏ ޯ഑͕ग़΍͍͢Α͏ʹؔ਺Λઃܭʢγϯϓϧ͕ͩڧྗʣ ͜ͷลΓͷྖҬ͸܏͖͕ খֶ͘͞श͕ਐΈʹ͍͘ ܏͖͕ҰఆͳͷͰ
ֶश͕ਐΈ΍͍͢ ReLU ෳ਺ͷઢܗؔ਺ͷ ૊Έ߹Θͤ Maxout

dropout ӅΕ૚ͷϊʔυΛϥϯμϜʹऔΓআ͘͜ͱͰਖ਼ଇԽͷޮՌ͕ಘΒΕΔ γϯϓϧ͕ͩڧྗͳҖྗΛൃش͢Δख๏ https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf ʢաֶशΛ๷͙ʣ ɾ ɾ ɾ ɾ
ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ Dropout, on the other hand, prevents overfitting, even in this case. It does not even need early stopping. Goodfellow et al. (2013) showed that results can be further improved to 0.94% by replacing ReLU units with maxout units. All dropout nets use p = 0.5 for hidden units and p = 0.8 for input units. More experimental details can be found in Appendix B.1. Dropout nets pretrained with stacks of RBMs and Deep Boltzmann Machines also give improvements as shown in Table 2. DBM—pretrained dropout nets achieve a test error of 0.79% which is the best performance ever reported for the permutation invariant setting. We note that it possible to obtain better results by using 2-D spatial information and augmenting the training set with distorted versions of images from the standard training set. We demonstrate the e↵ectiveness of dropout in that setting on more interesting data sets. With dropout Without dropout @ R @ @ R In order to test the robustness of dropout, classification experiments were done with networks of many di↵erent architectures keeping all hyperparameters, in- cluding p, fixed. Figure 4 shows the test error rates obtained for these di↵erent architectures as training progresses. The same architectures trained with and without dropout have drastically di↵erent test errors as seen as by the two separate clus- ters of trajectories. Dropout gives a huge improvement across all architectures, without using hyperparameters that were tuned specifically for each architecture.

දݱֶशͱͯ͠ͷଆ໘ σΟʔϓϥʔχϯά͕ࣗಈ֫ಘ͢Δදݱʹؔͯ͠͸໌֬ͳཧ࿦తମܥ͕ ଘࡏ͢ΔΘ͚Ͱͳ͍͕ɺඇৗʹڵຯਂ͍࿩୊͕ଟ͍ ग़ॴɿhttps://devblogs.nvidia.com/parallelforall/deep-learning-computer-vision-caffe-cudnn/ ਂ͍૚Ͱֶशͨ͠දݱͷํ͕Α͘෼཭Ͱ͖͍ͯΔ

දݱֶशͱͯ͠ͷଆ໘ σΟʔϓϥʔχϯά͕ࣗಈ֫ಘ͢Δදݱʹؔͯ͠͸໌֬ͳཧ࿦తମܥ͕ ଘࡏ͢ΔΘ͚Ͱͳ͍͕ɺඇৗʹڵຯਂ͍࿩୊͕ଟ͍ ग़ॴɿhttp://www.aclweb.org/anthology/N13-1090 ୯ޠͷ෼ࢄදݱ  ϕΫτϧԋࢉ͕Մೳʹͳ͍ͬͯΔ Figure 2: Left
panel shows vector offsets for th

දݱֶशͱͯ͠ͷଆ໘ σΟʔϓϥʔχϯά͕ࣗಈ֫ಘ͢Δදݱʹؔͯ͠͸໌֬ͳཧ࿦తମܥ͕ ଘࡏ͢ΔΘ͚Ͱͳ͍͕ɺඇৗʹڵຯਂ͍࿩୊͕ଟ͍ ग़ॴɿhttp://nlp.stanford.edu/pubs/SocherGanjooManningNg_NIPS2013.pdf ଟ༷ମ্ͷϚοϐϯά  Ϛοϐϯά৔ॴ͕៉ྷʹ෼͔Ε͍ͯΔΠϝʔδ Manifold of known
classes auto horse dog truck New test image from unknown class cat Training images Figure 1: Overview of our cross-modal zero-shot model. We ﬁrst map each new testing image into a lower dimensional semantic word vector space. Then, we determine whether it is on the manifold of seen images. If the image is ‘novel’, meaning not on the manifold, we classify it with the help of

ͦͷଞͷτϐοΫ ঺հ͖͠Ε͍ͯͳ͍τϐοΫ͸ͱͯ΋ଟ͍ • ໨తؔ਺΍ੑೳࢦඪͷઃܭ • σʔληοτʹର͢ΔॲཧʢaugmentationͳͲʣ • ֶशʹ͓͚ΔϋΠύʔύϥϝλͷऔΓѻ͍ •
σΟʔϓϥʔχϯάͷֶशʹ͓͚Δ࠷దԽख๏ʢadamͳͲʣ • όονਖ਼نԽͳͲͷֶश࣌ͷ޻෉ • ੜ੒Ϟσϧͷ࿩ʢGenerative Adversarial NetworkͳͲʣ • GPUΛ༻͍ͨॲཧͷඞཁੑͱੑೳ • σΟʔϓϥʔχϯά͕ѻ͑ΔϥΠϒϥϦ • …  ڵຯ͕͋Δํ͸ͥͻࣗ͝਎Ͱௐ΂ͯΈ͍ͯͩ͘͞ʂ

σΟʔϓϥʔχϯάͷ۩ମతϞσϧ

۩ମతͳ NN ͷϞσϧ NN ͷجૅͱͳΔ෦෼͸લઅ·Ͱʹݟͨ௨Γ͕ͩɺ ߴ͍ੑೳΛൃش͢ΔͨΊʹ͸໨తʹԠͨ͡ಛผͳߏ੒ͷϞσϧ͕ඞཁ    ͜͜Ͱ͸ಛʹΑ͘࢖ΘΕΔೋͭͷϞσϧΛ঺հ •
Convolutional Neural Network (CNNɿ৞ΈࠐΈ NN)  ը૾෼ੳͰΑ͘༻͍ΒΕΔϞσϧ  ը૾தͷΦϒδΣΫτʹର͢ΔزԿֶతม׵ʹରͯ͠ڧ͍  Ex.) ը૾ͷࠨ্ʹ͍ࣸͬͯͯ΋ӈԼʹ͍ࣸͬͯͯ΋ࣝผՄೳ • Recurrent Neural Network (RNNɿ࠶ؼܕ NN)  ࣗવݴޠॲཧͳͲͰΑ͘༻͍ΒΕΔϞσϧ  ॱ൪ʢ࣌ؒॱংʣ͕ॏཁʹͳΔσʔλʹରͯ͠ڧ͍  Ex.) “ࢲ” → “͕” ΍ ”͸” ͕དྷ΍͍͢͜ͱΛ༧ଌՄೳ

CNN ͷϞνϕʔγϣϯ ը૾ͷྫΛߟ͑Δ ը૾தͷΦϒδΣΫτ͸Ͳ͜ʹ͍ࣸͬͯͯ΋ಉ͡΋ͷͱೝ͍ࣝͨ͠ ग़ॴɿhttp://yann.lecun.com/exdb/mnist/ Ґஔ͸ҟͳΔ͕ ಉ͡਺ࣈͷ ”4” ਓؒͷ໨ͰݟΕ͹໌Β͔ʹಉ͕ͩ͡ɺσʔλͱͯ͠͸݁ߏҟͳΔ 
ʢϐΫηϧ஋Λ࣋ͭ৔ॴ͕ҟͳΔͨΊʣ ͜Ε·Ͱʹݟͨ NN Ͱ͸ಉҰͱ൑ผ͢Δͷ͸؆୯Ͱ͸ͳ͍ ɹ→ ฏߦҠಈ΍ճసͷΑ͏ͳزԿֶม׵ʹڧ͍ϞσϧΛ࡞Γ͍ͨ

CNN ͷߏ଄ CNN ͷه೦ൾతϞσϧ ग़ॴɿhttp://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf INPUT 32x32 Convolutions Subsampling
Convolutions C1: feature maps 6@28x28 Subsampling S2: f. maps 6@14x14 S4: f. maps 16@5x5 C5: layer 120 C3: f. maps 16@10x10 F6: layer 84 Full connection Full connection Gaussian connections OUTPUT 10 Full connection (Fully Connected) ͸͜Ε·Ͱʹݟ͖ͯͨී௨ͷॏΈ  ৽͘͠ొ৔ͨ͠ͷ͸Լهೋͭ • Convolutionsɿ৞ΈࠐΈ૚ • Subsampling (ݱ୅Ͱ͸ Pooling)ɿϓʔϦϯά૚

৞ΈࠐΈ૚ ฏߦҠಈͳͲΛͤͯ͞΋ೝࣝΛͤ͞ΔͨΊʹ͸ɺ  ӅΕ૚ϊʔυ͕ہॴతʹ൓Ԡ͢ΔΑ͏ͳߏ଄͕ඞཁʢہॴड༰໺ʣ ࠨ্ͷӅΕ૚ϊʔυ͸ݩը૾ͷࠨ্ͷۣܗͷΈʹ൓Ԡ ӈԼͷӅΕ૚ϊʔυ͸ݩը૾ͷӈԼͷۣܗͷΈʹ൓Ԡ ॏΈΛֶश͢Δ͜ͱͰը૾ͷہॴతಛ௃Λநग़  ɹ→ ͜ͷॏΈ͸ಛ௃நग़ͷϑΟϧλͱͯ͠ߟ͑ΒΕΔ ͞Βʹ͜ͷॏΈ͸ӅΕ૚ϊʔυؒͰڞ༗͢Δ
ɹ→ ฏߦҠಈͳͲʹରͯ͠ߴ͍ੑೳΛൃش ͜ͷॏΈ΋ޡࠩٯ఻೻๏Ͱֶश͢Δ͜ͱ͕Մೳ ࣮ࡍͷྫͰ͸RGBͷ֤৭ʹෳ਺ݸͷϑΟϧλΛ४උֶͯ͠श͢Δ ʢϑΟϧλͷ਺΍αΠζͳͲ͕ύϥϝλͱͳΔʣ

ʢิ଍ʣ৞ΈࠐΈԋࢉ ৞ΈࠐΈࣗମ͸σΟʔϓϥʔχϯάͱ͸ಠཱͳ਺ֶతԋࢉ ͋Δؔ਺ʢԼਤͷ੨ʣʹରͯ͠ผͷؔ਺ʢԼਤͷ੺ʣΛͣΒ͠ͳ͕Β ॏͶ߹Θͤͯͦͷ໘ੵ஋Λऔಘ͢Δ ɹ→ ಛఆͷ৘ใΛൈ͖ग़͢ϑΟϧλͱͳΔ ɹ→ લทͷ৞ΈࠐΈ૚ͷॏΈ͸͜ͷϑΟϧλͱͯ͠ಇ͘ ग़ॴɿhttps://en.wikipedia.org/wiki/Convolution

ϓʔϦϯά૚ ৞ΈࠐΈ૚Ͱऔಘͨ͠ಛ௃ը૾Λѹॖ͢Δ ͜ͷૢ࡞Ͱ΋زԿֶม׵ʹڧ͘ͳΔ͜ͱʹՃ͑σʔλྔ΋ݮΒͤΔ ܾΊΒΕͨྖҬ಺Ͱͷ࠷େ஋ΛऔΔ max pooling ͕࠷΋Α͘࢖ΘΕΔ ʢͦͷଞʹ΋ฏۉ஋ΛऔΔ average
pooling ͳͲ΋͋Δʣ ग़ॴɿhttps://cs231n.github.io/convolutional-networks/

CNN ৞ΈࠐΈ૚΍ϓʔϦϯά૚ΛੵΈॏͶΔ͜ͱͰ CNN ͕ߏங͞ΕΔ ଟ૚ߏ଄ͷϑΟϧλʹΑΓஈ֊తʹಛ௃ྔ͕நग़͞Εɺ ͦΕ͸ Τοδ→ύʔπ→ΦϒδΣΫτ ͷΑ͏ͳߏ଄ͱͳ͍ͬͯΔ ը૾෼ੳʹ͓͍ͯ͸
CNN ͸ѹ౗తͳੑೳΛތΓ୭͠΋͕࢖͍ͬͯΔ

RNN ͷϞνϕʔγϣϯ ʢ࣌ؒతʣॱং͕ॏཁʹͳΔσʔλ͸਺ଟ͍ • จষ • Իָ • גՁ
• … ͜ΕΒΛѻ͏ʹ͸໌ࣔతʹ࣌ؒॱংΛೖΕͨߏ଄Λಋೖ͢Δඞཁ༗ ͜Ε·Ͱʹݟͨ NN Ͱ͸೉͍͠ ɹ→ ࣌ؒॱংΛ໌ࣔతʹؚΜͩϞσϧ͕Λ࡞Γ͍ͨ ɹɹ ʢߋʹจষͷΑ͏ʹ௕͕͞Մม௕Ͱ͋ͬͯ΋ѻ͑ΔϞσϧʣ

RNN ͷߏ଄ ࣌ࠁ t-1 ͷӅΕ૚ͷ஋͕࣌ࠁ t ͷೖྗʹ࢖ΘΕΔߏ଄Λಋೖ͢Δ ɾ ɾ
ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ ɾ xt yt ht ࣌ؒํ޲ʹల։͢Δ͜ͱͰ௨ৗͷଟ૚ NN ͱͯ͠ղऍͰ͖Δ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾ time x1 y1 h1 t = 1 t = 2 t = T x2 y2 h2 xT yT hT Ex.) ࢲ ͸ ɻ ɾ ɾ ɾ

࣌ؒํ޲ͷޡࠩٯ఻೻๏ ల։ͨ͠ܗΛݟΕ͹ޡࠩٯ఻೻๏Λద༻Ͱ͖Δ͜ͱ͕෼͔Δ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ
ɾɾɾɾɾɾ ɾɾɾ ɾɾɾ x1 y1 h1 x2 y2 h2 xT yT hT ͜ΕͰݪཧతʹֶशՄೳ ඞཁͳ৘ใؒͷ࣌ؒతΪϟοϓ͕খ͍͞৔߹͸͜ΕͰ΋͏·͍͘͘ Ex.) ۭ͕੨͘੖Ε͍ͯΔ “ۭ”, “͕”, ”੨͘” ͱ͍͏ۙ͘ͷ৘ใ͔Β “੖Ε” ͱ͍͏୯ޠ͕དྷΔ͜ͱ͸༧ଌ͠΍͍͢

࣌ؒํ޲ͷޡࠩٯ఻೻๏ ల։ͨ͠ܗΛݟΕ͹ޡࠩٯ఻೻๏Λద༻Ͱ͖Δ͜ͱ͕෼͔Δ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ ɾɾɾɾɾɾ ɾɾɾ ɾɾɾɾɾɾ
ɾɾɾɾɾɾ ɾɾɾ ɾɾɾ x1 y1 h1 x2 y2 h2 xT yT hT ඞཁͳ৘ใؒͷ࣌ؒతΪϟοϓ͕େ͖͘ͳΔͱɺ ޡ͕ࠩ఻೻͢Δܦ࿏͕௕͘ͳΓޯ഑ͷ໰୊͕ੜͯ͡͠·͏ • ޯ഑ফࣦ໰୊ • ޯ഑രൃ໰୊ ɹ→ ͜ΕΒͷ໰୊Λղܾ͠े෼ͳ “هԱ” ͕Ͱ͖Δ΋ͷΛ࡞Γ͍ͨ

Long Short Term Memory (LSTM) ήʔτͱ͍͏֓೦Λಋೖ͠ඞཁͳ৘ใΛબ୒తʹબผ xt-1 yt-1 LSTM
xt yt xt+1 yt+1 ct-1 yt-1 xt yt ct yt tanh sigm tanh ੵ ࿨ ੵ ੵ sigm sigm ๨٫ ήʔτ ೖྗ ήʔτ ग़ྗ ήʔτ ಺෦ͷϝϞϦͷঢ়ଶͷ{๨ΕΔ, ॻ׵͑Δ, ग़ྗ} Λ੍ޚ͢Δ͜ͱͰޡࠩ৴߸Λద੾ʹ఻೻ͤ͞Δ LSTM LSTM

RNN ͜͜Ͱ঺հͨ͠΋ͷ͸࠷΋جຊతͳ΋ͷͰ਺ଟ͘ͷϞσϧ͕ఏҊ  ʢ https://github.com/kjw0612/awesome-rnn ͳͲʣ ࣗવݴޠͷྖҬͰಛʹ੒ޭΛऩΊ͍ͯΔ Իָͷੜ੒ͳͲ΋࣮ݱʢ https://maraoz.com/2016/02/02/abc-rnn/ ͳͲʣ
CNN ͱͷ૊߹ͤͳͲ΋੝Μʹݚڀ͞Ε͍ͯΔ  ग़ॴɿhttps://arxiv.org/pdf/1411.4555.pdf ɹɹ https://arxiv.org/pdf/1604.04573.pdf al Image Caption Generator Samy Bengio Google [email protected] Dumitru Erhan Google [email protected] A group of people shopping at an outdoor market. ! There are many vegetables at the fruit stand. Vision! Deep CNN Language ! Generating! RNN Figure 1. NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image, as shown on the example above. CNN-RNN: A Unified Framework for Multi-label Image Classification Jiang Wang1 Yi Yang1 Junhua Mao2 Zhiheng Huang3∗ Chang Huang4∗ Wei Xu1 Baidu Research 2University of California at Los Angles 3Facebook Speech 4 Horizon Robotics Abstract While deep convolutional neural networks (CNNs) have own a great success in single-label image classification, s important to note that real world images generally con- n multiple labels, which could correspond to different jects, scenes, actions and attributes in an image. Tradi- nal approaches to multi-label image classification learn dependent classifiers for each category and employ rank- g or thresholding on the classification results. These tech- ques, although working well, fail to explicitly exploit the Airplane Great Pyrenees Archery Sky, Grass, Runway Dog, Person, Room Person, Hat, Nike Figure 1. We show three images randomly selected from ImageNet 2012 classification dataset. The second row shows their corre- sponding label annotations. For each image, there is only one label (i.e. Airplane, Great Pyrenees, Archery) annotated in the Im-

·ͱΊ

·ͱΊ • σΟʔϓϥʔχϯάͱ͸Կ͔  େྔσʔλ΍ֶश࣌ͷ༷ʑͳ޻෉Ͱద੾ʹֶशͰ͖ΔΑ͏ʹͳͬͨ ෳ਺ӅΕ૚ͷ Neural Netwok   ࣗಈతͳಛ௃நग़΍ಛఆλεΫʹର͢Δѹ౗తͳੑೳ͕ಛ௃త  •
σΟʔϓϥʔχϯάͷཧ࿦తجૅ  ޡࠩٯ఻೻๏ʹجֶͮ͘शͱ ReLU ΍ dropout ͳͲͷٕज़తൃల  • σΟʔϓϥʔχϯάͷ۩ମతϞσϧ  ը૾σʔλʹରͯ͠ੑೳΛൃش͢Δ Convolutional Neural Network  ॱং෇͖σʔλʹରͯ͠ੑೳΛൃش͢Δ Recurrent Neural Network

Pythonで動かして学ぶ機械学習入門第三回 ディープラーニング理論

Pythonで動かして学ぶ機械学習入門第三回 ディープラーニング理論

More Decks by yoppe

Other Decks in Technology

Featured

Transcript

Pythonで動かして学ぶ機械学習入門第三回　ディープラーニング理論

Pythonで動かして学ぶ機械学習入門第三回　ディープラーニング理論