"UUFOUJPOػߏͲ͏ͬͯྲྀߦͬͨͷ͔ʁ
Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight $\alpha_{ij}$ of the annotation of the j-th source word for the i-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.
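As a rough illustration of how such alignment weights can be computed, the sketch below implements additive (concat) scoring followed by a softmax in NumPy; the function and parameter names (alignment_weights, W_a, U_a, v_a) are illustrative stand-ins, not the paper's exact parameterization.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_weights(s_prev, annotations, W_a, U_a, v_a):
    """Soft alignment of one target step against all source annotations.

    s_prev      : (n,)    previous decoder state s_{i-1}
    annotations : (Tx, m) encoder annotations h_1 .. h_Tx
    W_a, U_a    : projection matrices, v_a : scoring vector
    Returns alpha : (Tx,) weights that sum to 1 (one row of the heat map).
    """
    # Additive score e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    e = np.tanh(annotations @ U_a.T + s_prev @ W_a.T) @ v_a
    return softmax(e)

# Toy usage: 6 source words, annotation size 8, decoder state size 4.
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))
s = rng.normal(size=4)
W_a, U_a, v_a = rng.normal(size=(5, 4)), rng.normal(size=(5, 8)), rng.normal(size=5)
alpha = alignment_weights(s, h, W_a, U_a, v_a)
print(alpha, alpha.sum())  # weights over the 6 source words, summing to 1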
One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may make the basic encoder–decoder approach underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more. This superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).
Neural Machine Translation by Jointly Learning to Align and Translate
Effective Approaches to Attention-based Neural Machine Translation
Figure 3: Local attention model – the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and the source states $\bar{h}_s$ in the window.
The soft attention refers to the global attention approach in which weights are placed "softly" over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.
Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. Concretely, the model first generates an aligned position $p_t$ for each target word at time t. The context vector $c_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; D is empirically selected.
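A minimal NumPy sketch of the windowing described above, under simplifying assumptions: the aligned position p_t is taken as given, and the weights are a Gaussian taper centered at p_t rather than the scores the model actually infers from the hidden states; local_context and its arguments are hypothetical names.

import numpy as np

def local_context(source_states, p_t, D, sigma=None):
    """Weighted average of source hidden states inside [p_t - D, p_t + D].

    source_states : (S, d) array of encoder hidden states h_bar_s
    p_t           : aligned source position for the current target word
    D             : half window size (chosen empirically)
    Returns (context vector, weights over the window positions).
    """
    S = source_states.shape[0]
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    positions = np.arange(lo, hi)
    window = source_states[lo:hi]
    # Simplified weights: a Gaussian centered at p_t (the paper derives a_t
    # from the target and source states; only the windowing is sketched here).
    sigma = sigma if sigma is not None else D / 2.0
    a_t = np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    a_t /= a_t.sum()
    c_t = a_t @ window                  # (d,) context vector
    return c_t, a_t

# Toy usage: 12 source positions, hidden size 4, window half-width D = 3.
rng = np.random.default_rng(1)
h_bar = rng.normal(size=(12, 4))
c_t, a_t = local_context(h_bar, p_t=5.3, D=3)
print(c_t.shape, a_t.round(3))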
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding words).
We describe approaches to caption generation with two variants: a "hard" attention mechanism and a "soft" attention mechanism. We also show how one advantage of including attention is the ability to visualize what the model "sees". Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient parts of an image while generating its caption.
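To make the soft/hard distinction concrete, the toy sketch below (NumPy, with illustrative names) computes soft attention as a weighted sum over all patch features and hard attention as a sample of a single patch from the same distribution; the sampling step is what makes the hard variant non-differentiable.

import numpy as np

def soft_attend(patch_feats, weights):
    """Soft attention: expected feature under the attention distribution."""
    return weights @ patch_feats            # (d,)

def hard_attend(patch_feats, weights, rng):
    """Hard attention: sample one patch to look at (non-differentiable step)."""
    idx = rng.choice(len(weights), p=weights)
    return patch_feats[idx], idx

rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 16))            # e.g. a 3x3 grid of patch features
logits = rng.normal(size=9)
w = np.exp(logits - logits.max()); w /= w.sum()

z_soft = soft_attend(feats, w)              # differentiable, uses every patch
z_hard, chosen = hard_attend(feats, w, rng) # stochastic, needs REINFORCE-style training
print(z_soft.shape, chosen)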
2. Related Work
In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recurrent neural networks and inspired by the successful use of sequence to sequence training with neural networks for machine translation (Cho et al., 2014; Bahdanau et al., 2014).
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Residual Attention Network for Image Classification
Figure 3: The receptive field comparison between mask branch and trunk branch. (Diagram: the soft mask branch runs down-sampling and up-sampling stages alongside the trunk branch's convolutions, and the receptive fields of the two branches are compared.)
The output is normalized to the range [0, 1] after two consecutive 1 × 1 convolution layers. We also added skip connections between bottom-up and top-down parts to capture information from different scales. The full module is illustrated in Fig. 2.
The bottom-up top-down structure has been applied to image segmentation and human pose estimation. However, the difference between our structure and the previous one lies in its intention: our mask branch aims at improving the features of the trunk branch.
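A toy sketch of the attention-residual gating this section builds toward, assuming the (1 + M(x)) · F(x) form suggested by the figure, with a sigmoid-squashed soft mask M(x); the trunk and mask features below are random stand-ins rather than the actual branches.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_residual(trunk_feat, mask_logits):
    """H(x) = (1 + M(x)) * F(x): the mask modulates trunk features but,
    because of the '1 +', can never wipe out good trunk features entirely."""
    M = sigmoid(mask_logits)        # soft mask in [0, 1]
    return (1.0 + M) * trunk_feat

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 14, 14))    # stand-in trunk branch features (C, H, W)
m = rng.normal(size=(8, 14, 14))    # stand-in mask branch pre-activations
H = attention_residual(F, m)
print(H.shape)                      # same shape as the trunk features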
Activation function variants f1(·), f2(·), f3(·).
Table 1: The network architecture, layer by layer: Conv1, max pooling, Residual Unit, Attention Module, Residual Unit, Attention Module, Residual Unit, Attention Module, Residual Unit, average pooling, FC + Softmax.
NLP
NLP · CV
CV
The papers' main claims ...
Stacked Hourglass Networks for
Human Pose Estimation
Alejandro Newell, Kaiyu Yang, and Jia Deng
University of Michigan, Ann Arbor
{alnewell,yangky,jiadeng}@umich.edu
Abstract. This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
Keywords: Human Pose Estimation
Fig. 1. Our network for pose estimation consists of multiple stacked hourglass modules
which allow for repeated bottom-up, top-down inference.
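The repeated bottom-up, top-down processing can be sketched structurally as below (pure NumPy, no learned convolutions): each hourglass pools down, recurses, upsamples, and adds a skip branch at every scale, and several hourglasses are stacked; feature_block, pool2 and upsample2 are placeholder stand-ins for the real residual modules.

import numpy as np

def feature_block(x):
    # Placeholder for the residual/conv modules used in the real network.
    return np.tanh(x)

def pool2(x):
    # 2x2 max pooling on a (C, H, W) tensor (H and W assumed even).
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def upsample2(x):
    # Nearest-neighbour upsampling back to twice the spatial size.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def hourglass(x, depth):
    """One bottom-up/top-down pass with a skip connection at every scale."""
    skip = feature_block(x)
    if depth == 0:
        return skip
    down = feature_block(pool2(x))
    inner = hourglass(down, depth - 1)
    return skip + upsample2(feature_block(inner))

def stacked_hourglass(x, num_stacks=2, depth=3):
    # Stacking lets later modules re-evaluate earlier intermediate estimates.
    for _ in range(num_stacks):
        x = hourglass(x, depth)
    return x

out = stacked_hourglass(np.random.default_rng(0).normal(size=(4, 64, 64)))
print(out.shape)  # (4, 64, 64): full resolution is restored by the top-down pass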
1 Introduction
A key step toward understanding people in images and video is accurate pose estimation. Given a single RGB image, we wish to determine the precise pixel location of important keypoints of the body. Achieving an understanding of a person's posture and limb articulation is useful for higher level tasks like action recognition, and also serves as a fundamental tool in fields such as human-computer interaction and animation.
Stacked Hourglass Networks for Human Pose Estimation
Highway Networks
1.1. Notation
We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}, \; x \in \mathbb{R}$.
2. Highway Networks
A plain feedforward neural network typically consists of L layers where the l-th layer (l ∈ {1, 2, ..., L}) applies a non-linear transform H (parameterized by $W_{H,l}$) on its input $x_l$ to produce its output $y_l$. Thus, $x_1$ is the input to the network and $y_L$ is the network's output. Omitting the layer index and biases for clarity,

$y = H(x, W_H)$.  (1)

H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.
For a highway network, we additionally define two non-linear transforms $T(x, W_T)$ and $C(x, W_C)$ such that

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$.  (2)

We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$.  (3)

The dimensionality of x, y, $H(x, W_H)$ and $T(x, W_T)$ must be the same for Equation (3) to be valid. Just as a plain network consists of multiple computing units such that the i-th unit computes $y_i = H_i(x)$, a highway network consists of multiple blocks such that the i-th block computes a block state $H_i(x)$ and transform gate output $T_i(x)$. Finally, it produces the block output $y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x))$, which is connected to the next layer.
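A minimal NumPy sketch of one fully connected highway layer implementing Equation (3); tanh is chosen here as the nonlinearity inside H, the sizes are arbitrary, and the negative transform-gate bias anticipates the initialization discussed in Section 2.2.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x))  (Equation (3)).

    x is (d,); W_H and W_T are (d, d) so that x, y, H(x), T(x) share a dimension.
    """
    H = np.tanh(W_H @ x + b_H)      # block state (affine transform + nonlinearity)
    T = sigmoid(W_T @ x + b_T)      # transform gate in (0, 1)
    return H * T + x * (1.0 - T)    # blend transformed input and carried input

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W_H = rng.normal(size=(d, d)) * 0.1
W_T = rng.normal(size=(d, d)) * 0.1
b_H = np.zeros(d)
b_T = np.full(d, -3.0)              # negative bias: layer starts out mostly carrying x
y = highway_layer(x, W_H, b_H, W_T, b_T)
print(np.abs(y - x).max())          # small: the layer is close to the identity at init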
2.1. Constructing Highway Networks
As mentioned earlier, Equation (3) requires that the dimensionality of x, y, $H(x, W_H)$ and $T(x, W_T)$ be the same. In cases when it is desirable to change the size of the representation, one can replace x with $\hat{x}$ obtained by suitably sub-sampling or zero-padding x. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.
Convolutional highway layers are constructed similar to fully connected layers. Weight-sharing and local receptive fields are utilized for both H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input.
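A sketch of the alternative used in the study, with arbitrary illustrative sizes: an ordinary (plain) affine layer changes the representation size once, after which highway layers of that fixed size are stacked; the layer form is the same Equation (3) as in the previous sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def plain_layer(x, W, b):
    # Ordinary (non-highway) layer, used here only to change the representation size.
    return np.tanh(W @ x + b)

def highway_layer(x, W_H, b_H, W_T, b_T):
    T = sigmoid(W_T @ x + b_T)
    return np.tanh(W_H @ x + b_H) * T + x * (1.0 - T)

rng = np.random.default_rng(0)
d_in, d = 20, 8                        # change size 20 -> 8, then keep 8 throughout
x = rng.normal(size=d_in)
h = plain_layer(x, rng.normal(size=(d, d_in)) * 0.1, np.zeros(d))
for _ in range(5):                     # stack of highway layers at the fixed size d
    params = (rng.normal(size=(d, d)) * 0.1, np.zeros(d),
              rng.normal(size=(d, d)) * 0.1, np.full(d, -1.0))
    h = highway_layer(h, *params)
print(h.shape)                         # (8,): dimensionality changed once, up front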
2.2. Training Deep Highway Networks
For plain deep networks, training with SGD stalls at the beginning unless a specific weight initialization scheme is used such that the variance of the signals during forward and backward propagation is preserved initially (Glorot & Bengio, 2010; He et al., 2015). This initialization depends on the exact functional form of H.
For highway layers, we use the transform gate defined as $T(x) = \sigma(W_T^{\top} x + b_T)$, where $W_T$ is the weight matrix and $b_T$ the bias vector for the transform gates. This suggests a simple initialization scheme which is independent of the nature of H: $b_T$ can be initialized with a negative value (e.g. −1, −3, etc.) such that the network is initially biased towards carry behavior.
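To see why a negative b_T biases the layer toward carrying its input, the short check below (arbitrary sizes, roughly unit-variance pre-activations) compares the average transform-gate activation for b_T = 0, −1 and −3.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 64
W_T = rng.normal(size=(d, d)) / np.sqrt(d)   # roughly unit-variance pre-activations
x = rng.normal(size=(1000, d))               # a batch of random inputs

for b in (0.0, -1.0, -3.0):
    T = sigmoid(x @ W_T.T + b)               # transform gate T(x) = sigma(W_T x + b_T)
    print(f"b_T = {b:4.1f}  mean gate = {T.mean():.2f}")
# With b_T = -3 the average gate is small (well below 0.1), so y is approximately x
# and the layer initially just carries its input, regardless of the form of H.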