
Attention Mechanisms in Recent Deep Learning - Nagoya CVPRML Study Group ver. -

Slides for the Nagoya CVPRML study group. To the attention-mechanism talk given at the Kanto CV study group, I added recent research examples and work related to the Attention Branch Network, plus the story of the struggles of submitting to CVPR.

Hiroshi Fukui

July 06, 2019

Transcript

  1. Nagoya CVPRML Study Group
    Attention Mechanisms in Recent Deep Learning
    - Nagoya CVPRML Study Group ver. -
    Hiroshi Fukui

  2. Self-introduction
    • Name: Hiroshi Fukui
      Affiliation: NEC Biometrics Research Laboratories
      Former organizer of the Nagoya CVPRML study group
    • SNS:
      Personal page: https://sites.google.com/site/fhiroresearchhome
      Twitter: https://twitter.com/Catechine
      Facebook: https://www.facebook.com/greentea

  3. I presented at CVPR!
    • First CVPR submission
    • First oral presentation at an international conference

  4. Today's contents
    • I will talk about the currently popular attention mechanisms
      - Main: attention mechanisms in Computer Vision
      - Sub: attention mechanisms in Natural Language Processing
    (Figure: CV with attention mechanism / NLP with attention mechanism)

  5. What is an attention mechanism?
    • Improves feature extraction by weighting the features
      - A technique that applies the human attention mechanism to machine learning
    f′(x) = M(x) ⋅ f(x)
      - f(x): feature vector or feature map
      - M(x): weight of the attention mechanism
    (Figure: regions marked "Ignore" vs. "Attention")

  6. What is an attention mechanism?
    • Improves feature extraction by weighting the features
      - A technique that applies the human attention mechanism to machine learning
    • The attention weights differ from sample to sample (per element)
      → variable for each sample
      The network's parameter values are the same for all samples
      → fixed for all samples after training
    f′(x) = M(x) ⋅ f(x)
      - f(x): feature vector or feature map
      - M(x): weight of the attention mechanism
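
    As a concrete illustration of f′(x) = M(x) ⋅ f(x), here is a minimal NumPy sketch; the map shapes and the sigmoid used to produce M(x) are illustrative assumptions, not taken from any particular paper:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        # f(x): a feature map of shape (channels, height, width)
        feature_map = np.random.randn(64, 14, 14)

        # M(x): a spatial attention weight in [0, 1], one value per location,
        # broadcast over the channel axis (shape (1, 14, 14))
        attention = sigmoid(np.random.randn(1, 14, 14))

        # f'(x) = M(x) * f(x): element-wise re-weighting of the features
        weighted = attention * feature_map
        print(weighted.shape)  # (64, 14, 14)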

  7. How did attention mechanisms become popular?
    (Figure: excerpts from the papers that drove the trend, labeled NLP / NLP→CV / CV —
     "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015),
     "Effective Approaches to Attention-based Neural Machine Translation",
     "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention",
     "Residual Attention Network for Image Classification",
     "Stacked Hourglass Networks for Human Pose Estimation",
     "Highway Networks")
    According to what these papers claim...

  8. Trends in attention mechanisms
    • Attention in NLP: represents the relations between words (or characters)
    • Attention in CV: represents the image regions that should be attended to
      - SENet is a slight exception
    (Figure: word-level attention over "彼 は 福井さん です ．" vs. spatial attention on an image;
     attention mechanism in NLP / attention mechanism in CV)

  9. How attention is formulated in NLP
    • The attention mechanism is a dictionary-like object
      - The weight for a Query is computed from the Query and the Keys
      - The computed weight is used to pick up the Values that contribute to the Query
    (Figure: Query / Key / Value, weight = Weight(Q, K), Pickup of V;
     encoder (source) "A B C D E" and decoder (target) "W X Y Z EOS")
    A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, "Key-Value Memory Networks for Directly Reading Documents", ACL.
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is All You Need", NIPS.
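
    The dictionary-object view above can be written down directly: compute a weight from the query and the keys, then take a weighted sum of the values. A minimal NumPy sketch, with dot-product scoring and a softmax chosen here only as one common instance:

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        d = 8                                # feature dimension
        query = np.random.randn(d)           # one decoder-side query
        keys = np.random.randn(5, d)         # 5 encoder-side keys
        values = np.random.randn(5, d)       # values paired with the keys

        weight = softmax(keys @ query)       # weight per key, computed from the query and the keys
        context = weight @ values            # pick up the values that contribute to the query
        print(weight.round(2), context.shape)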

  10. Neural Machine Translation (NMT)
    • Translation with an Encoder-Decoder (Seq2Seq) model
      - Encoder (Source): takes the source sentence as input
      - Decoder (Target): outputs the translated sentence
    • The work that established the concept of attention in deep learning
    D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR.
    (Figure 1 of the paper: the model generating the t-th target word from annotations h_1…h_T with weights α_t,1…α_t,T; plus the alignment visualizations of Figure 3)

  11. Visualizing attention in NMT
    (Figure: source-to-target attention visualization from "Attention and Augmented Recurrent Neural Networks", distill.pub)
    → Around the time Self-Attention appeared, this form came to be called Source-Target Attention

  12. Global Attention and Local Attention
    (Figures 2 and 3 of the paper: the global attentional model infers a variable-length alignment
     weight vector a_t from the current target state h_t and all source states h̄_s; the local
     attention model first predicts an aligned position p_t and computes the context vector c_t as a
     weighted average of the source hidden states in the window [p_t − D, p_t + D])
    • Global Attention
      - Computes the weights from the features of all the encoder hidden states
      - Attention that considers every word of the input sentence
    • Local Attention
      - Computes the weights from the features of particular encoder hidden states
      - Attention that considers the relations between particular words
    M.-T. Luong, H. Pham, and C. D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", ACL.

  13. Transformer
    • An NMT model that translates with attention mechanisms only
      - A network architecture that uses neither CNNs nor RNNs
      - The network is built from Multi-Head Attention
    • A structure that introduces multiple Self-Attention mechanisms
    (Figure 1: The Transformer - model architecture.)
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is All You Need", NIPS.

  14. Self-Attention
    • The attention weights are computed from the input features alone
      - Characterized by having no notion of Source and Target
    (Figure: in Source-Target Attention the weights relate a target sentence to a source sentence,
     e.g. "彼 は 福井さん です ．" and its translation; in Self-Attention they relate the words of a
     single sentence to each other)

  15. Self-Attention
    • The attention weights are computed from the input features alone
      - Characterized by having no notion of Source and Target
    (Figure: Source-Target Attention uses a sentence pair (Source and Target); Self-Attention uses a single sentence)

  16. The attention mechanisms in the Transformer
    • Scaled Dot-Product Attention (Source-Target Attention):
      Attention(Q, K, V) = softmax(QKᵀ / √d_k) ⋅ V
    • Multi-Head Attention (Self-Attention):
      MultiHead(Q, K, V) = concat(head_1, …, head_h) Wᴼ
      head_i = Attention(Q W_iᵠ, K W_iᴷ, V W_iⱽ)
    (Figure: Scaled Dot-Product Attention (left) and Multi-Head Attention (right))
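
    A minimal NumPy sketch of the two formulas above; the token count, model width, and random projection matrices are placeholders for illustration only:

        import numpy as np

        def softmax(z, axis=-1):
            e = np.exp(z - z.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def scaled_dot_product_attention(Q, K, V):
            d_k = K.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)      # QK^T / sqrt(d_k)
            return softmax(scores) @ V           # softmax(...) . V

        def multi_head_attention(Q, K, V, num_heads, d_model):
            d_head = d_model // num_heads
            heads = []
            for _ in range(num_heads):
                # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
                Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
                heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
            Wo = np.random.randn(num_heads * d_head, d_model)
            return np.concatenate(heads, axis=-1) @ Wo   # concat(head_1..head_h) W^O

        x = np.random.randn(6, 32)                       # 6 tokens, d_model = 32
        out = multi_head_attention(x, x, x, num_heads=4, d_model=32)  # self-attention: Q = K = V = x
        print(out.shape)  # (6, 32)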

  17. Memory Network
    • A network that introduces an external memory
      - Built from several modules
      - By memorizing long texts it achieves highly accurate text summarization / question answering
    (Figure 1 of the paper: (a) a single-layer and (b) a three-layer End-To-End Memory Network, with
     embeddings A, B, C, softmax weights p_i over memories m_i, and output o)
    S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-To-End Memory Networks", NIPS.

  18. The role of attention in MemNet
    • The attention mechanism selects which memory slot to access
      - The Query / Key / Value formulation actually originates from Memory Networks
    (Figure 1: The Key-Value Memory Network model for question answering.)
    A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, "Key-Value Memory Networks for Directly Reading Documents", ACL.

  19. The role of attention in MemNet
    • The attention mechanism selects which memory slot to access
      - The Query / Key / Value formulation actually originates from Memory Networks
    (Figure: Query, Key, Value and the resulting Weight, overlaid on the Key-Value MemNet model)
    A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, "Key-Value Memory Networks for Directly Reading Documents", ACL.

  20. Attention mechanisms in CV
    • In CV, the structure of the attention mechanism differs depending on whether it handles sequences
      - Attention mechanisms that use sequences
        • Show, Attend and Tell
      - Attention mechanisms that do not use sequences
        • Residual Attention Network
        • Squeeze-and-Excitation Network
        • Non-local Network
        • Attention Branch Network

  21. Attention with and without sequences
    • Attention with sequences (captioning, VQA, …):
      the attention weights are computed per element (per word)
    • Attention without sequences (image classification, detection, …):
      there is only a single image as the element
      → introduce multiple attention mechanisms into a single network
    (Figure: word-by-word attention over "あれ は フクロウ です ．" vs. spatial attention on an image)

  22. Sequence-based attention in CV
    • Show, Attend and Tell
      - A captioning model that incorporates two attention mechanisms:
        • Deterministic soft attention: softmax-based attention
        • Stochastic hard attention: attention trained with reinforcement learning (Monte Carlo based)
    (Figures 2 and 3 of the paper: "soft" (top row) vs. "hard" (bottom row) attention over time, and
     examples of attending to the correct object)
    K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML.

  23. Deterministic soft attention
    • A spatial weight over the image is computed for the features of each word
      - Difference from Source-Target Attention: Key and Value come from convolution layers, Query comes from the LSTM
    (Figure: a CNN turns an H×W×3 image into L×D features; at each step the LSTM state yields a
     distribution over the L locations, a weighted combination of features, and a distribution over the vocabulary)

  24. Deterministic soft attention
    • A spatial weight over the image is computed for the features of each word
      - Difference from Source-Target Attention: Key and Value come from convolution layers, Query comes from the LSTM
    (Same figure, with the LSTM state marked as the Query and the convolutional features marked as the Key and Value)
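
    A minimal NumPy sketch of one step of this soft attention: the LSTM hidden state plays the role of the query and the L×D convolutional features play the roles of key and value. The plain dot-product scoring used here is a simplification of the attention function in the paper:

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        L, D = 196, 512                       # L spatial locations (e.g. 14x14), D-dim features
        features = np.random.randn(L, D)      # key/value: convolutional feature map
        h = np.random.randn(D)                # query: LSTM hidden state for the current word

        a = softmax(features @ h)             # distribution over the L locations
        z = a @ features                      # weighted combination of features fed back to the LSTM
        print(a.shape, z.shape)               # (196,) (512,)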

  25. Types of attention without sequences
    • Space-wise attention: weights the whole spatial extent of the feature map
    • Channel-wise attention: weights the channels of the feature map
    • Self-attention: computes Self-Attention over the obtained feature map
    (Figure: the feature map and the corresponding weight; a sketch of each variant follows below)
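
    The three variants differ mainly in the shape of the weight they produce. A minimal NumPy sketch of the shapes only; the sigmoid/softmax choices and sizes are assumptions for illustration:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        C, H, W = 64, 14, 14
        f = np.random.randn(C, H, W)                 # feature map

        # Space-wise: one weight per spatial location, shared over channels
        space_w = sigmoid(np.random.randn(1, H, W))
        f_space = space_w * f

        # Channel-wise: one weight per channel, shared over locations
        chan_w = sigmoid(np.random.randn(C, 1, 1))
        f_chan = chan_w * f

        # Self-attention: an (HW x HW) weight relating every location to every other one
        flat = f.reshape(C, H * W)                   # (C, HW)
        scores = flat.T @ flat / np.sqrt(C)          # (HW, HW)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        f_self = (flat @ attn.T).reshape(C, H, W)
        print(f_space.shape, f_chan.shape, f_self.shape)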

  26. Representative work on space-wise attention
    • Residual Attention Network
      - A model that introduces attention mechanisms on top of a ResNet
      - Proposes an attention mechanism that incorporates residual learning
    (Figure 1 of the paper: soft attention masks for a low-level color feature (sky mask) and a
     high-level part feature (balloon instance mask))
    F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual Attention Network for Image Classification", CVPR.

  27. Stacked network structure
    • A single attention mechanism is not enough to make the network highly accurate
      - The cause is that the feature representation becomes too sparse
      - So an attention mechanism is inserted after each ResNet block
    (Figure 2 of the paper: the ImageNet architecture with three stacked Attention Modules, each
     containing a bottom-up/top-down Soft Mask Branch (down-sampling, up-sampling, 1×1 convolutions,
     sigmoid) applied to the trunk by element-wise product and sum)

  28. Attention Residual Learning
    • In addition to scaling by the attention weights, residual learning is introduced
      - Even if the attention weights all become 0, the feature map does not vanish
    • This trick is in fact used by many recent attention mechanisms
      - Transformer, SENet (ResNet type), Non-local NN, ABN, …
    Standard attention mechanism:    f′(x) = M(x) ⋅ f(x)
    Attention Residual Learning:     f′(x) = (1 + M(x)) ⋅ f(x)
      - M(x): weight, f(x): feature map
    (Figure 3 of the paper: the receptive field comparison between mask branch and trunk branch)
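
    A minimal NumPy sketch contrasting the two formulas above; the shapes are placeholders. It shows the point made on the slide: with the residual form the output never collapses even if M(x) is all zeros:

        import numpy as np

        f = np.random.randn(64, 14, 14)   # f(x): feature map
        M = np.zeros((1, 14, 14))         # extreme case: attention weights all zero

        plain = M * f                     # f'(x) = M(x) * f(x)        -> all zeros, features vanish
        residual = (1.0 + M) * f          # f'(x) = (1 + M(x)) * f(x)  -> f(x) is preserved
        print(np.abs(plain).max(), np.allclose(residual, f))  # 0.0 True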

  29. Representative work on channel-wise attention
    • Squeeze-and-Excitation Network
      - Introduces an attention mechanism over the channels of the feature map
    • Improves image-recognition performance with only a small increase in parameters
    • A simple, easy-to-add attention mechanism, so it has been adopted in many methods
    (Figure 1: A Squeeze-and-Excitation block.)
    J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks", CVPR.

  30. Squeeze-and-Excitation module
    • Two variants are proposed: an Inception type and a ResNet type
      - Inception type: AlexNet, VGGNet, etc.
      - ResNet type: ResNet, ResNeXt, etc.
    • The attention mechanism is built from Global Average Pooling and two FC layers
      - Squeeze: Global Average Pooling
      - Excitation: two FC layers
    (Figures 2 and 3 of the paper: the original Inception / Residual modules (left) and the
     SE-Inception / SE-ResNet modules (right), i.e. global pooling → FC → ReLU → FC → Sigmoid →
     channel-wise scale)
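
    A minimal PyTorch-style sketch of the SE block described above (global average pooling followed by two FC layers and a sigmoid, then a channel-wise scale). The reduction ratio of 16 follows the paper; everything else is a simplified assumption, not the exact implementation:

        import torch
        import torch.nn as nn

        class SEBlock(nn.Module):
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.pool = nn.AdaptiveAvgPool2d(1)          # Squeeze: global average pooling
                self.fc = nn.Sequential(                     # Excitation: two FC layers
                    nn.Linear(channels, channels // reduction),
                    nn.ReLU(inplace=True),
                    nn.Linear(channels // reduction, channels),
                    nn.Sigmoid(),
                )

            def forward(self, x):
                b, c, _, _ = x.shape
                w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
                return x * w                                  # Scale: channel-wise re-weighting

        x = torch.randn(2, 64, 14, 14)
        print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])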

  31. Self-Attention in CV
    • Applying the Transformer's attention mechanism to images (CNNs)
    (Figures from the three papers: the Non-local block; a slice of one layer of the Image
     Transformer with local self-attention; and attention-augmented convolution, where N_h attention
     maps and weighted averages of the values are computed in parallel to a standard convolution and concatenated)
    X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local Neural Networks", CVPR.
    N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran, "Image Transformer", ICML.
    I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention Augmented Convolutional Networks", arXiv.

  32. Non-local Neural Network
    • A network that applies Self-Attention to CNNs
      - It can reference the input features over a longer span than CNNs or RNNs
        • CNN: the reference region is limited to the kernel size
        • RNN: the reference region runs along the sequence (up to a given time step)
    (Figure: receptive fields of a CNN (kernels) and an RNN (inputs at t, t+1, … per layer) vs. the
     non-local operation Σ_j f(x_i, x_j) ⋅ g(x_j))
    X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local Neural Networks", CVPR.

  33. The key observation behind Non-local Neural Networks
    • Both the non-local filter and scaled dot-product (SDP) attention multiply the input by a probability distribution
      - Non-local filter: computes a similarity w(·) between a target region and its neighborhood
      - SDP attention: computes a similarity softmax(QKᵀ / √d_k) from Q and the keys K
    Scaled Dot-Product Attention:  Attention(Q, K, V) = softmax(QKᵀ / √d_k) ⋅ V
    Non-local mean filter:         y(i) = Σ_{j ∈ I} w(i, j) ⋅ x(j)
    → Carrying the non-local filter algorithm over is what brings Self-Attention into CNNs

  34. Structure of the Non-local block
    • The response is computed by multiplying the input features by a probability distribution
      - The final response z is computed with residual learning
    • Non-local block (Embedded Gaussian)
      - θ and φ are used to build the Self-Attention structure
      - Using a softmax makes the normalization easy
      - The variant that most closely resembles the Transformer's self-attention
    y = (1 / C(x)) Σ f(x) ⋅ g(x) = (1 / C(x)) softmax(xᵀ W_θᵀ W_φ x) ⋅ g(x)
    (Figure: the block with θ, φ, g as 1×1×1 convolutions (Query, Key, Value), a THW × THW softmax
     similarity, and an output convolution W_g; Similarity / Value)
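
    A minimal PyTorch-style sketch of the embedded-Gaussian non-local block sketched above (θ, φ, g as 1×1 convolutions, softmax similarity, residual connection). The 2D image case and the halved intermediate channel count are simplifying assumptions:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class NonLocalBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                inter = channels // 2
                self.theta = nn.Conv2d(channels, inter, 1)   # query
                self.phi = nn.Conv2d(channels, inter, 1)     # key
                self.g = nn.Conv2d(channels, inter, 1)       # value
                self.w = nn.Conv2d(inter, channels, 1)       # W_g: back to the original channels

            def forward(self, x):
                b, c, h, w = x.shape
                q = self.theta(x).flatten(2).transpose(1, 2)   # (b, HW, c/2)
                k = self.phi(x).flatten(2)                     # (b, c/2, HW)
                v = self.g(x).flatten(2).transpose(1, 2)       # (b, HW, c/2)
                attn = F.softmax(q @ k, dim=-1)                # (b, HW, HW) similarity f(.)
                y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
                return x + self.w(y)                           # residual: z = W_g y + x

        x = torch.randn(2, 64, 14, 14)
        print(NonLocalBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])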

  35. Example attention maps of the Non-local Neural Network
    (Figure: example responses for several query positions x_i)

  36. Attention Augmented Convolutional Network
    • Introduces into a CNN an attention mechanism that extends SDP attention to two dimensions
      - Multi-head attention is reproduced with {standard, depthwise} convolutions and concatenation
      - The module is built from feature maps only, so it has fewer parameters than SENet
    (Figure: N_h attention maps and weighted averages of the values are computed in parallel to a
     standard convolution and the outputs are concatenated; Softmax, Global Average Pooling,
     Depthwise Convolution)

  37. Differences between the three attention mechanisms
    (Figure: Space-wise Attention, Channel-wise Attention, and Self-Attention — the feature map and
     the attention weight each one produces)

  38. From visual explanation to an attention mechanism
    • Use the regions a CNN attends to as the weights of an attention mechanism
      - Applying them in an attention mechanism improves recognition performance
      - Obtains the attention map of visual explanation and improves accuracy at the same time
    (Figure: input image → feature extractor → attention mechanism with a weight; input image and its
     attention map for visual explanation)

  39. Visual explanation
    • Represents, as a heatmap, the regions a deep network attended to at inference time
      - Class Activation Mapping (CAM), Grad-CAM, etc.
    (Figure: an input image (GT 'Miniature_schnauzer') and heatmaps from ResNet + Grad-CAM and
     ResNet + CAM for classes such as 'Giant_schnauzer', 'Miniature_schnauzer', 'Standard_schnauzer',
     'Irish_terrier')
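
    A minimal NumPy sketch of Class Activation Mapping as used above: the heatmap for a class is the final feature maps weighted by that class's fully-connected weights. The shapes are placeholders, and the ReLU/normalization at the end are illustrative conventions rather than part of every formulation:

        import numpy as np

        C, H, W, num_classes = 512, 7, 7, 1000
        feature_maps = np.random.randn(C, H, W)       # last convolutional feature maps
        fc_weights = np.random.randn(num_classes, C)  # weights of the final FC layer after GAP

        target_class = 3
        # CAM: weighted sum of the C feature maps with the target class's FC weights
        cam = np.tensordot(fc_weights[target_class], feature_maps, axes=1)   # (H, W)
        cam = np.maximum(cam, 0)                      # keep positive evidence (as Grad-CAM does)
        cam /= cam.max() + 1e-8                       # normalize to [0, 1] for visualization
        print(cam.shape)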

  40. Attention Branch Network
    • Applies the attention map of visual explanation as the weight of an attention mechanism
      - The attention weight is computed from the attention map obtained via Class Activation Mapping
      - Achieves the accuracy gain of an attention mechanism and visual explanation at the same time
    (Figure (a), overview of the Attention Branch Network: a feature extractor feeds an attention
     branch (convolution layers, 1×1 conv (K channels), GAP, softmax → L_att(x_i), plus 1×1 conv (1 channel),
     batch normalization, sigmoid → attention map M(x_i)) and a perception branch (feature map g(x_i),
     attention mechanism, classifier, softmax → L_per(x_i)))
    H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, "Attention Branch Network: Learning of Attention Mechanism for Visual Explanation", CVPR.
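
    A minimal PyTorch-style sketch of the ABN idea described above: an attention branch produces a CAM-like map with 1×1 convolutions, and the perception branch re-weights the feature map with it using the residual form (1 + M(x)) ⋅ g(x). The tiny backbone and layer sizes are placeholders, not the architecture of the paper:

        import torch
        import torch.nn as nn

        class TinyABN(nn.Module):
            def __init__(self, channels=64, num_classes=10):
                super().__init__()
                self.extractor = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
                # Attention branch: K-channel 1x1 conv (class responses), then a 1-channel attention map
                self.att_conv = nn.Conv2d(channels, num_classes, 1)
                self.att_map = nn.Sequential(nn.Conv2d(num_classes, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())
                self.att_gap = nn.AdaptiveAvgPool2d(1)
                # Perception branch: classification from the attention-weighted feature map
                self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                                nn.Linear(channels, num_classes))

            def forward(self, x):
                g = self.extractor(x)                       # g(x): feature map
                k = self.att_conv(g)
                att_score = self.att_gap(k).flatten(1)      # attention-branch output -> L_att
                M = self.att_map(k)                         # M(x): attention map (visual explanation)
                g_att = (1.0 + M) * g                       # attention mechanism (residual form)
                per_score = self.classifier(g_att)          # perception-branch output -> L_per
                return att_score, per_score, M

        x = torch.randn(2, 3, 32, 32)
        att, per, M = TinyABN()(x)
        print(att.shape, per.shape, M.shape)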

  41. Attention map examples (image classification)
    (Figure: attention maps for images with GT labels 'Norfolk_terrier', 'Pool_table', 'Flute',
     'Kimono', 'Scoreboard')

  42. Attention map examples (facial attribute recognition)
    (Figure: a multi-task ABN — a shared feature extractor, an attention branch producing T attention
     maps M_t(x_i) from the feature maps g(x_i), and weight-shared perception branches per label;
     attention maps for attributes such as Smiling, Wearing_necklace, Young, Blond_Hair,
     Wearing_necktie on the original image)

  43. Attention map examples (playing Atari)
    (Figure: input images and attention maps for Space Invaders and Ms. Pacman)

  44. Differences from other attention mechanisms
    • Unlike the other attention mechanisms, ABN uses only a single weight at a single place
      - The others use multiple weights within one attention mechanism,
        or insert attention mechanisms into multiple blocks or modules
        (e.g. into every residual unit)
    (Figure: Residual Attention Network, SENet (the SE-Inception / SE-ResNet module schemas), and
     Non-local Network, contrasted with ABN)

  45. Differences from other attention mechanisms
    • Unlike the other attention mechanisms, ABN uses only a single weight at a single place
      - The others use multiple weights within one attention mechanism,
        or insert attention mechanisms into multiple blocks or modules
    (Figure: the feature maps and weights of Residual Attention Network, SENet, and Non-local Network
     vs. ABN)

  46. A property of ABN's attention mechanism
    • The recognition result can be adjusted by manually editing the attention map
    (Figure: for an image with GT 'dalmatian' that is predicted as 'soccer ball' with high confidence,
     deleting the unnecessary region and adding the region of interest in the attention map raises the
     'dalmatian' confidence; the ABN pipeline with feature extractor, attention branch (1×1 conv, BN,
     sigmoid, GAP, softmax), attention map, attention mechanism, and perception branch)
    M. Mitsuhara, H. Fukui, Y. Sakashita, T. Ogata, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, "Embedding Human Knowledge in Deep Neural Network via Attention Map", arXiv.

  47. A property of ABN's attention mechanism
    (Figure: examples of deleting an attended region and of adding an attended region)

  48. Attention Branch Network with human knowledge
    • Re-train the ABN using the manually adjusted attention map
      - The adjusted map M′(x_i) is matched with a map loss:
        L_map(x_j) = || M(x_j) − M′(x_j) ||_2
    (Figure (a): the ABN overview again, with the adjusted attention map M′(x_i) fed into training
     alongside L_att(x_i) and L_per(x_i))
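
    A minimal PyTorch sketch of the map-matching term L_map(x) = ||M(x) − M′(x)||_2 used when re-training ABN with a human-adjusted attention map M′(x); how this term is weighted against L_att and L_per is an assumption here, not specified by the slide:

        import torch

        def map_loss(m, m_adjusted):
            # L_map(x) = || M(x) - M'(x) ||_2 between the predicted and human-adjusted maps
            return torch.norm(m - m_adjusted, p=2)

        M = torch.rand(1, 1, 14, 14)            # attention map produced by ABN
        M_adj = M.clone()
        M_adj[..., :7] = 0.0                    # e.g. a human deletes attention on the left half
        # During fine-tuning this would be added to L_att and L_per (relative weighting assumed)
        print(map_loss(M, M_adj).item())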

  49. Improvement of the attention maps
    (Figure: attention maps from ABN vs. after fine-tuning, for classes such as 'spotted salamander',
     'hammerhead shark', 'bulbul', 'hen' / 'house finch', 'tailed frog')

  50. MIRU advertisement
    • Long oral
      - "Embedding human knowledge into deep neural networks via attention maps"
        Presenter: Mitsuhara
    • Short oral
      - "Visualizing the decision grounds of end-to-end-learning-based automated driving with a regression-type Attention Branch"
        Presenter: Mori

  51. MIRU advertisement
    (Figure: excerpts from two extended abstracts — a Bayesian ABN that adds uncertainty sampling
     (MC dropout) to the attention and perception branches and is evaluated on CIFAR-100, and a
     CycleGAN that introduces an attention mechanism and an attention consistency loss for style
     transfer that stays focused on the recognition target)
    • Short oral
      - "Improving the reliability of the Attention Branch Network by introducing uncertainty"
        Presenter: Tsukahara
    • Short oral
      - "Style transfer effective for recognition with a CycleGAN that incorporates an attention mechanism"
        Presenter: Imaeda

  52. Summary
    • Introduced attention mechanisms in NLP and CV
      - NLP: attention mechanisms built around Encoder-Decoder models
        • Source-Target Attention, Self-Attention, …
      - CV: attention mechanisms built around CNNs
        • Attention with sequences
        • Attention without sequences
          - Space-wise Attention
          - Channel-wise Attention
          - Self-Attention

  53. Summary
    • Introduced attention mechanisms in NLP and CV
      - NLP: attention mechanisms built around Encoder-Decoder models
        • Source-Target Attention, Self-Attention, …
      - CV: attention mechanisms built around CNNs
        • Attention with sequences
        • Attention without sequences
          - Space-wise Attention
          - Channel-wise Attention
          - Self-Attention
    But before that...

  54. Since there is a little time left...
    • I will talk a bit about what I learned from the CVPR submission
      - Centered on the backstory of how ABN was born, the submission, and what I learned from the reviews
      - Let's increase the number of top-conference acceptances from the Chubu area ✊
    (Figure: the first page of "Attention Branch Network: Learning of Attention Mechanism for Visual
     Explanation", Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Chubu
     University, with Figure 1 comparing the network structures of class activation mapping and the
     proposed attention branch network)

  55. How was ABN born?
    • The idea came while I was playing with Global Average Pooling
      - I learned how powerful GAP is while working on the PRMU algorithm contest
      - After that I learned about attention mechanisms and came up with the Attention Branch Network
    (Figure: an example detection result from the PRMU algorithm contest, and the attention-mechanism
     figure from Residual Attention Network)

  56. Timeline from the idea to the presentation
    • Introduced ABN at the interim report meeting of an internship
    • Submitted to ECCV → rejected
    • Submitted to CVPR
      - The review results arrived shortly before the thesis defense
      - Acceptance notification
      - Oral notification
    (The road to CVPR acceptance)

  57. Why was the ECCV paper rejected?
    • Not enough experiments
      - We used public datasets, but evaluated only on minor ones
        • Fine-grained recognition: Comprehensive Cars dataset
        • Pedestrian detection: CityPersons dataset
        • Multi-task learning: CelebA dataset
      - The models were also limited to VGG, ResNet, and Faster R-CNN
      - The ABN architecture itself was still insufficient
        • It could not reach good enough performance on CIFAR and ImageNet
    • Writing
      - I could only get a limited amount of correction from my professor
      - We could only afford a limited number of English proofreading passes

  58. What we did for CVPR
    • Improved the Attention Branch Network
      - Surveyed attention-mechanism papers and improved ABN's attention mechanism
      - Accuracy also improved on CIFAR, SVHN, and ImageNet
    • Simply increased the number of experiments
      - Evaluation including major datasets: CIFAR, SVHN, ImageNet, CompCars, CelebA, GTA V, COCO, CityPersons
      - Increased the number of models used in the experiments
        • VGG, ResNet, WideResNet, DenseNet, ResNeXt, Faster R-CNN
    • Presented in many places and collected comments
      - MIRU, GTC Japan, RSJ (and the main GTC), etc.
      - Also introduced in my professor's invited talks, e.g. SSII
        ← Thanks to ABN, the number of awards went up!

  59. What I learned from the CVPR submission
    • The evaluation experiments matter enormously
      - Evaluate on major datasets whenever possible
        • "Because the dataset is public" by itself does not carry much weight
        • "It improves performance on several major datasets" is what finally becomes convincing
      - "Accuracy went up" alone is hard to get accepted
        • Without some improvement in addition to the accuracy gain, the work is hard to value
        • For ABN: "accuracy gain" & "visual explanation (explainable AI)"
        • Other patterns: "accuracy gain" & "fewer parameters", "accuracy gain" & "speed-up", etc.
      - Evaluate as comprehensively as possible across models
        • "Applies only to one specific model" is hard to value
    • Does including a current trend make acceptance easier?
      - ABN may have been rated as highly novel because it combined explainable AI and attention mechanisms?

  60. What I learned from the CVPR submission
    • How to write the paper
      - Make the figures simple and easy to read
        • Especially the first figure and the overall diagram of the method
      - Narrow what you want to claim down to one point and write the paper around it
        • For ABN: applying visual explanation to an attention mechanism
        • With too many claims, it becomes unclear what you want to say
    • Get opinions from many researchers
    • Actively accumulate knowledge and experience!
      - ABN would not have been born without knowing the properties of GAP and attention mechanisms
      - And even knowing both, without the experience of building methods I would not have arrived at ABN

  61. Summary
    • Introduced attention mechanisms in NLP and CV
      - NLP: attention mechanisms built around Encoder-Decoder models
        • Source-Target Attention, Self-Attention, …
      - CV: attention mechanisms built around CNNs
        • Attention with sequences
        • Attention without sequences
          - Space-wise Attention
          - Channel-wise Attention
          - Self-Attention
    • Behind-the-scenes stories of CVPR
    Thank you for your "Attention"
    (Attention map: glasses)