
Semantic Segmentation by Deep Learning and Its Latest Trends

The Japanese Society of Microscopy
Biological Function Volume Data Analysis Research Group, 6th Workshop
March 24, 2022
Hironobu Fujiyoshi (Chubu University)

Hironobu Fujiyoshi

March 23, 2022

Transcript

  1. The Japanese Society of Microscopy
     Biological Function Volume Data Analysis Research Group Workshop
     Semantic Segmentation by Deep Learning and Its Latest Trends
     Hironobu Fujiyoshi (Chubu University, Machine Perception and Robotics Group)
     http://mprg.jp

  2. What is generic object recognition?

  3. The ultimate goal of computer vision (image recognition)?
     Generic object recognition: given an image of an unconstrained real-world scene, a computer
     recognizes the objects it contains by their generic category names.
     - Before deep learning: regarded as an extremely difficult problem
     - After deep learning: the problem began to be solved

  4. Subdividing the generic object recognition problem
     Specific object recognition
     - Matching: "Is this that building?", "Is this the 'Stop' sign?"
     Generic object recognition
     - Image classification: "What is this an image of?"
     - Object detection: "Where are the people?"
     - Scene understanding: "What kind of scene is this?"

  5. Semantic segmentation task
     - The problem of assigning an object category to every pixel of an image
     - Semantic texton forests [Shotton, CVPR]: handcrafted features (simple pixel comparisons)
       combined with a machine-learning algorithm (a random forest)
     (Figure from the paper: (a) decision forests — a forest of T decision trees classifies a feature
     vector by descending each tree, giving a root-to-leaf path and a class distribution at the leaf;
     (b) semantic texton forest features — split nodes use simple functions of raw pixels within a
     d x d patch, such as the value of a single pixel or the sum, difference, or absolute difference of
     a pair of pixels. Per-class segmentation results are shown for categories such as building, grass,
     tree, cow, sheep, sky, airplane, water, car, bicycle, flower, sign, bird, book, chair, road, cat,
     dog, body, and boat.)

  6. CNN architectures for each image recognition task
     - The CNN architecture is designed to match the task (legend in the figure: convolution layers,
       pooling layers, upsampling layers; outputs are visualized on the right)
     - Image classification: input image -> CNN -> class probabilities (e.g. "Person")
     - Object detection: input image -> CNN -> class probabilities and detection regions per grid cell
       (C + B outputs)
     - Semantic segmentation: input image (W x H) -> CNN -> per-pixel class probabilities
       (a C-channel map of size W x H)

  7. Semantic segmentation with CNNs
     - SegNet [Badrinarayanan, PAMI]: an encoder-decoder architecture
     - Pooling indices: the positions chosen by max pooling are stored and reused at upsampling time
       (see the sketch below)
     (Fig. 1 of the paper: SegNet predictions on urban and highway scene test samples from the wild;
     the class colour codes can be obtained from Brostow et al. Online demo:
     http://mi.eng.cam.ac.uk/projects/segnet/)

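     As a concrete illustration of the pooling-indices idea, the sketch below uses PyTorch (assumed
     here purely for illustration, not the SegNet implementation): MaxPool2d can return the argmax
     positions, and MaxUnpool2d writes the values back at exactly those positions during upsampling.

        import torch
        import torch.nn as nn

        # SegNet-style unpooling: the max-pool positions saved on the encoder side are reused on the
        # decoder side, so values are written back to the locations they originally came from.
        pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
        unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

        x = torch.randn(1, 64, 32, 32)        # an encoder feature map
        pooled, indices = pool(x)             # downsample and remember the argmax positions
        upsampled = unpool(pooled, indices)   # sparse upsampling guided by the stored indices
        print(pooled.shape, upsampled.shape)  # (1, 64, 16, 16) and (1, 64, 32, 32)
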
  8. Network architectures for semantic segmentation
     - FCN (fully convolutional network): J. Long, "Fully Convolutional Networks for Semantic
       Segmentation", CVPR
     - U-Net: O. Ronneberger, "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI
     - PSPNet: H. Zhao, "Pyramid Scene Parsing Network", CVPR

  9. Performance of semantic segmentation
     - Semantic segmentation has become more accurate alongside progress in object recognition
     Cityscapes val set benchmark (an urban street-scene dataset): Mean IoU (class) plotted against the
     year of publication for SegNet, DeepLab, the DeepLabV series, PSPNet, HRNet, OCR (the CNN-based
     state of the art), and SegFormer

  10. DeepLabv3+ [Chen, ECCV]
      - Atrous convolution keeps the feature map from being shrunk
      - A standard network vs. a network built with atrous convolution
      - Atrous spatial pyramid pooling; atrous (dilated) convolution

  11. DeepLabv3+ [Chen, ECCV]
      - Atrous Spatial Pyramid Pooling (ASPP): convolutions with different dilation rates are applied in
        parallel, so that a variety of receptive fields is taken into account (a sketch follows below)
      - Encoder-decoder structure: improves recognition accuracy around object boundaries
      (Architecture figure: the encoder is a DCNN with atrous convolution followed by ASPP — a 1x1 conv,
      3x3 convs with rates 6, 12, and 18, and image pooling — and a 1x1 conv; the decoder concatenates
      low-level features (after a 1x1 conv) with the encoder output upsampled by 4, applies a 3x3 conv,
      and upsamples by 4 again to produce the prediction.)
      L. Chen, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", ECCV

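      To make the ASPP idea concrete, the following is a minimal PyTorch sketch of the module as drawn
      in the figure (parallel branches with dilation rates 6, 12, and 18 plus image pooling); the
      channel sizes are assumptions, not the exact DeepLabv3+ configuration.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ASPP(nn.Module):
            """Parallel atrous convolutions with different dilation rates, fused by a 1x1 conv."""
            def __init__(self, in_ch=2048, out_ch=256):
                super().__init__()
                self.branches = nn.ModuleList([
                    nn.Conv2d(in_ch, out_ch, 1),                           # 1x1 conv
                    nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6),    # 3x3 conv, rate 6
                    nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12),  # 3x3 conv, rate 12
                    nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18),  # 3x3 conv, rate 18
                ])
                self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
                self.project = nn.Conv2d(out_ch * 5, out_ch, 1)            # fuse the five branches

            def forward(self, x):
                h, w = x.shape[-2:]
                feats = [branch(x) for branch in self.branches]
                pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                                       align_corners=False)
                return self.project(torch.cat(feats + [pooled], dim=1))

        y = ASPP()(torch.randn(1, 2048, 33, 33))   # e.g. backbone features at output stride 16
        print(y.shape)                             # torch.Size([1, 256, 33, 33])
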
  12. HRNet [Wang, PAMI]
      - Parallel processing with high-resolution and low-resolution subnetworks
        -> both local and global features can be captured
      - Connections between the subnetworks: the feature maps of each scale are summed so that
        information is shared

  13. OCR [Yuan, ECCV]
      - Proposes object-contextual representations
      - Intermediate object-region features are used as soft attention
      - Features are represented by a weighted aggregation of the object-region representations
      (Pipeline, applied on top of ASPP: backbone -> pixel representations; soft object regions ->
      object-region representations, obtained by aggregating the representations of the pixels that
      belong to each object region; the pixel-region relations act as soft attention and produce the
      object contextual representations, which yield the augmented representations fed to the loss.
      Per-pixel features are thus obtained using the representations of the object classes.)

  14. Applications of semantic segmentation
      - Scene analysis: autonomous driving, robot vision, satellite image analysis
        (H. Alemohammad, "LandCoverNet: A global benchmark land cover classification training dataset",
        NeurIPS)
      - Medical image analysis: organ segmentation
        (H. Roth, "An application of cascaded 3D fully convolutional networks for medical image
        segmentation", CMIG) and lesion segmentation
        (K. Kamnitsas, "Efficient Multi-Scale 3D CNN with Fully Connected CRF for Accurate Brain Lesion
        Segmentation", MedIA)
      - Industrial inspection: detection of anomalous regions
        (Saito, "Development of a deep-learning-based system for detecting deteriorated regions of
        concrete revetments", Digital Practice)

  15. Semantic segmentation performance across domains
      - Limits of CNN generalization: a CNN is effective only in the domain it was trained on
      - Training data: captured in Europe (Cityscapes dataset)
      - Test data: captured in Japan

  16. Semantic segmentation performance across domains
      - Limits of CNN generalization: a CNN is effective only in the domain it was trained on
      - Training data: captured in Japan
      - Test data: captured in Japan

  17. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - A multi-head structure enables semantic segmentation over multiple datasets
      - Shared network: ResNet101 with a Domain Attention (DA) module applied
      (Architecture: shared network = ResNet101 + DA module -> ASPP -> 1x1 conv / concat, feeding a
      multi-head structure with one head per dataset — Head 1: Cityscapes, Head 2: A2D2, Head N:
      Mapillary — where each head is 3x3 conv -> 3x3 conv -> 1x1 conv.)

  18. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes attention over the domains and weights the branches accordingly
      (Module diagram: a residual block feeds SE modules A, B, and C — one per domain, each GAP -> FC
      producing a C x 1 descriptor, concatenated into C x N — while a Domain Assignment path
      (GAP -> FC -> softmax, N x 1) produces the attention over domains that weights the SE Adapter
      outputs applied to the C x H x W feature map.)

  19. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module (module diagram enlarged)
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes the attention over domains and weights the SE Adapter outputs

  20. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes attention over the domains and weights the branches
        -> for each domain, the most suitable features are obtained from the SE Adapters (a sketch of
        such a module follows below)

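      The slide describes the DA module only at the block-diagram level, so the following PyTorch
      sketch is an interpretation of that description (SE-style branches mixed by a softmax domain
      assignment); the sigmoid gating and all layer sizes are assumptions, not the authors'
      implementation.

        import torch
        import torch.nn as nn

        class SEBranch(nn.Module):
            """One SE Adapter branch: global average pooling followed by an FC layer (a C x 1 descriptor)."""
            def __init__(self, channels):
                super().__init__()
                self.fc = nn.Linear(channels, channels)

            def forward(self, x):                    # x: (B, C, H, W)
                return self.fc(x.mean(dim=(2, 3)))   # GAP -> FC, shape (B, C)

        class DomainAttention(nn.Module):
            """Domain-specific SE branches weighted by a softmax attention over the domains."""
            def __init__(self, channels, num_domains=3):
                super().__init__()
                self.branches = nn.ModuleList(SEBranch(channels) for _ in range(num_domains))
                self.assign = nn.Linear(channels, num_domains)        # Domain Assignment head

            def forward(self, x):
                pooled = x.mean(dim=(2, 3))                           # (B, C)
                attn = torch.softmax(self.assign(pooled), dim=1)      # (B, N): attention over domains
                branch_out = torch.stack([b(x) for b in self.branches], dim=1)        # (B, N, C)
                scale = torch.sigmoid((attn.unsqueeze(-1) * branch_out).sum(dim=1))   # (B, C)
                return x * scale[:, :, None, None]                    # re-weight the feature map

        y = DomainAttention(channels=256)(torch.randn(2, 256, 64, 64))
        print(y.shape)  # torch.Size([2, 256, 64, 64])
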
  21. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Multi-head structure
        A dedicated output head (3x3 conv -> 3x3 conv -> 1x1 conv) is prepared for each dataset
        (Head 1: Cityscapes, Head 2: A2D2, Head N: Mapillary), each producing the class outputs of its
        own dataset
        -> training is possible even when the datasets have different sets of object classes

  22. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Training method
        MixLoss: the per-dataset losses are accumulated and backpropagated at the same time
        Updating the shared network's parameters simultaneously reduces the variation between domains
      MixLoss sums the cross-entropy losses of the heads (a sketch of one training step follows below):
      L = L_CE(x_1) + L_CE(x_2) + ... + L_CE(x_N) = sum_{n=1}^{N} L_CE(x_n)

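      A minimal sketch of one MixLoss training step is shown below; `model(images, head=k)` is an
      assumed interface that returns the logits of head k, and the batch layout is illustrative. The
      point is simply that the cross-entropy losses of all heads are summed and backpropagated in a
      single step, so the shared parameters are updated simultaneously.

        import torch.nn as nn

        criterion = nn.CrossEntropyLoss(ignore_index=255)   # ignore unlabeled pixels

        def mix_loss_step(model, optimizer, batches):
            """batches: a list of (images, labels) pairs, one mini-batch per dataset/head."""
            optimizer.zero_grad()
            total = 0.0
            for head_idx, (images, labels) in enumerate(batches):
                logits = model(images, head=head_idx)       # dataset-specific output head
                total = total + criterion(logits, labels)   # accumulate L_CE(x_n)
            total.backward()                                # one backward pass through the shared network
            optimizer.step()
            return total.item()
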
  23. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Experiment overview
        Conditions: fixed input size (pixels), fixed number of training epochs, optimizer: Momentum SGD
        Dataset combination: Cityscapes + BDD + Synscapes
        Evaluation metric: Mean IoU
      - Cityscapes: in-vehicle images captured in German cities
      - BDD: images captured in US cities (New York, Berkeley, San Francisco, and the Bay Area)
      - Synscapes: a dataset generated with photorealistic rendering techniques

  24. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      (Table: comparison by Mean IoU for each train/val combination of Cityscapes, BDD, and Synscapes,
      including the proposed method without DA.)

  25. Issues in semantic segmentation
      - Annotation: consistent annotation is required
        -> when annotation labels differ (for example, annotator A and annotator B labelling trees
        differently), training does not go well

  26. Issues in semantic segmentation
      - Annotation: consistent annotation is costly
        Annotation cost = (number of images) x (annotation time per image)
        In the case of Cityscapes, annotating a single image takes a considerable number of minutes
        M. Cordts, "The Cityscapes Dataset for Semantic Urban Scene Understanding", CVPR2016

  27. SPIGAN: Privileged Adversarial Learning from Simulation [Lee, ICLR]
      - Training with unannotated real images and a simulator (unlabeled real images + simulation)
        The simulator generates CG images, annotations, and depth
        The CG images are style-transferred so that they look like real images
        Target task 1: semantic segmentation
        Target task 2: depth estimation

  28. Semi-supervised learning
      - Semi-supervised Semantic Segmentation with Directional Context-aware Consistency [X. Lai, CVPR]
        Contrastive learning on two patches randomly cropped from unlabeled data (labeled images are
        used as usual)
        Positive pair: the overlapping region — the two features phi_{o1} and phi_{o2} at the same
        position are pulled closer together
        Negative pair: non-overlapping regions — the two features phi_{u1} and phi_{u2} at different
        positions are pushed apart

  29. Performance of semantic segmentation
      - Semantic segmentation has become more accurate alongside progress in object recognition
      Cityscapes val set benchmark: Mean IoU (class) plotted against the year of publication for SegNet,
      DeepLab, the DeepLabV series, PSPNet, HRNet, OCR, and the Transformer-based SegFormer

  30. Transformer [Vaswani, NeurIPS]
      - A model that uses only the attention mechanism
        Achieved SoTA in text generation and translation, replacing RNNs and CNNs
      - Built only from attention
        Parallel computation is possible, as with CNNs
        Long-range dependencies can be modelled, as with RNNs
      - Positional encoding: position information is embedded at every time step
      - Built from self-attention: long-range correspondences between input and output can be captured
      (Encoder-decoder architecture)

  31. Positional encoding
      - Embeds position information at every time step
        Keeps the sequence in its correct order
        Conceptually, it adds the relative and absolute position information that RNNs and CNNs obtain
        implicitly
      An RNN reads the tokens of "私 / は / リンゴ / が / 好き / です" one per time step t; the
      Transformer has no notion of t, so position must be given explicitly.
      Formulation (a code sketch follows below):
      PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
      PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
      where d_model is the dimensionality of the positional encoding, pos is the position in the
      sequence, and i indexes the PE dimensions.
      -> The wavelengths of the positional encoding form a geometric progression from 2*pi to
      10000 * 2*pi.
      (Visualization: position in the sequence vs. PE dimension)

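      A direct implementation of these two formulas, using NumPy purely for illustration:

        import numpy as np

        def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
            """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...); d_model assumed even."""
            pos = np.arange(max_len)[:, None]                   # (max_len, 1)
            i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
            angle = pos / np.power(10000.0, 2 * i / d_model)    # wavelengths form a geometric progression
            pe = np.zeros((max_len, d_model))
            pe[:, 0::2] = np.sin(angle)                         # even dimensions
            pe[:, 1::2] = np.cos(angle)                         # odd dimensions
            return pe

        print(positional_encoding(max_len=50, d_model=128).shape)   # (50, 128)
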
  32. Scaled dot-product attention
      - The key component of the Transformer
        A module inside multi-head attention, used in both the encoder and the decoder
        Self-attention is built from Query, Key, and Value
      Formulation (a code sketch follows below): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
      where d_k is the dimensionality of the queries.
      Why scale by sqrt(d_k)? For example, if the entries of Q and K have mean 0 and variance 1, the
      entries of their matrix product have mean 0 and variance d_k. When some dot products are very
      large, the softmax gradient becomes very small for every element other than the largest one.
      Scaling by sqrt(d_k) brings the values back to mean 0 and variance 1, which gives smoother
      gradients.
      (Figure 2 of the paper: (left) scaled dot-product attention; (right) multi-head attention consists
      of several attention layers running in parallel.)

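      The formula translates almost line by line into code; the PyTorch sketch below is for
      illustration.

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, seq_len, d_k)."""
            d_k = q.size(-1)
            scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k): scaled dot products
            weights = F.softmax(scores, dim=-1)             # attention weights over the keys
            return weights @ v, weights                     # weighted sum of the Values

        q = k = v = torch.randn(2, 5, 64)                   # e.g. 5 tokens, d_k = 64
        out, attn = scaled_dot_product_attention(q, k, v)
        print(out.shape, attn.shape)                        # (2, 5, 64) and (2, 5, 5)
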
  33. Details of self-attention
      Each input token x_i is embedded (with the positional encoding added) into a feature e_i, and the
      Query, Key, and Value features are obtained from e_i by separate linear transformations:
      q_i = W_q e_i,  k_i = W_k e_i,  v_i = W_v e_i

  34. Details of self-attention
      The dot products between the Query and Key features are passed through a softmax to obtain the
      relevance between sequence positions (the attention weights alpha_hat):
      alpha_hat = softmax(Q K^T / sqrt(d_k))
      (alpha_{i,j} is the raw score between positions i and j, and alpha_hat_{i,j} is its
      softmax-normalized weight)

  35. Details of self-attention
      The attention weights are multiplied with the Value features; attaching this information captures
      the relations between the time steps, giving one output vector per position:
      Attention(Q, K, V) = alpha_hat V
      (A sketch of a single self-attention head combining these three steps follows below.)

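      The three steps of slides 33-35 (linear projections to Q/K/V, softmax of the scaled dot products,
      weighted sum of the Values) fit into one small module; this single-head PyTorch sketch is for
      illustration and omits the multi-head splitting and output projection.

        import torch
        import torch.nn as nn

        class SingleHeadSelfAttention(nn.Module):
            def __init__(self, d_model: int):
                super().__init__()
                self.w_q = nn.Linear(d_model, d_model, bias=False)   # q_i = W_q e_i
                self.w_k = nn.Linear(d_model, d_model, bias=False)   # k_i = W_k e_i
                self.w_v = nn.Linear(d_model, d_model, bias=False)   # v_i = W_v e_i

            def forward(self, e):                                    # e: (batch, seq_len, d_model)
                q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)
                scores = q @ k.transpose(-2, -1) / e.size(-1) ** 0.5
                weights = torch.softmax(scores, dim=-1)              # alpha_hat: (batch, seq_len, seq_len)
                return weights @ v                                   # one output vector per position

        out = SingleHeadSelfAttention(64)(torch.randn(2, 5, 64))     # e.g. 5 tokens of dimension 64
        print(out.shape)   # torch.Size([2, 5, 64])
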
  36. Overall processing of a Transformer block
      Input: a word sequence (e.g. "私 / は / 犬 / が / 好き / だ / 。"), embedded into feature vectors
      of shape (number of words) x (dimensionality).
      - self-attention: mixes and transforms the vectors along the time direction. The features are
        projected with W_q, W_k, and W_v; the matrix product between query and (transposed) key, with a
        softmax over the key dimension, gives a (words x words) matrix alpha_hat in which each row
        stores the importance of every other word for that word; alpha_hat is multiplied with the
        Values and the result is projected with W_out.
      - feed-forward: transforms each vector individually along the depth direction (W_feed1, W_feed2).
      Each sub-layer is followed by normalization, and the dashed block is repeated N times (a sketch of
      one encoder block follows below).

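      The whole block can be sketched with PyTorch's built-in multi-head attention; the layer sizes
      below are the defaults from the paper, but the code is an illustration, not a full Transformer.

        import torch
        import torch.nn as nn

        class TransformerEncoderBlock(nn.Module):
            """Self-attention (mixing along the time axis) and a position-wise feed-forward layer
            (per token, along the depth axis), each followed by a residual connection and normalization."""
            def __init__(self, d_model=512, n_heads=8, d_ff=2048):
                super().__init__()
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

            def forward(self, x):                      # x: (batch, seq_len, d_model)
                a, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
                x = self.norm1(x + a)                  # residual connection + normalization
                return self.norm2(x + self.ff(x))      # per-token feed-forward + normalization

        y = TransformerEncoderBlock()(torch.randn(2, 7, 512))   # e.g. 7 tokens; stack the block N times
        print(y.shape)   # torch.Size([2, 7, 512])
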
  37. Decoder processing at inference time
      The encoder reads the input word sequence ("私 / は / 犬 / が / 好き / だ / 。") through N
      repetitions of multi-head attention, normalization, and feed-forward layers.
      The decoder generates the output autoregressively: starting from the EOS token it predicts "I";
      from "EOS I" it predicts "like"; and so on until "I like dogs ." and the end token. Each step runs
      masked multi-head attention, multi-head attention over the encoder output (K and V come from the
      encoder, Q from the decoder), and a feed-forward layer, each followed by normalization and
      repeated N times.
      (From the paper: the encoder and decoder are each a stack of N = 6 identical layers; a residual
      connection and layer normalization follow every sub-layer; the decoder's self-attention is masked
      so that the prediction for position i can depend only on the known outputs at positions before i.)

  38. Vision Transformer [Dosovitskiy, ICLR]
      - An image classification method that applies the Transformer to the vision domain
        The image is split into fixed-size patches
        The patches are flattened and a Transformer encoder produces a feature for each patch
        SoTA on classification tasks such as ImageNet
      A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR

  39. What makes it stronger than a CNN?
      - ViT captures features of the kind a CNN's receptive field captures — and beyond [1]
        CNN: captures the features of a limited region (the receptive field reached after a few
        convolutions)
        ViT: splits the image into patches and learns the relations between patches with the
        Transformer, capturing whole-image features that a CNN cannot fully capture
      [1] J. Cordonnier, "On the Relationship between Self-Attention and Convolutional Layers", ICLR

  40. Network overview
      - The input image is split into fixed-size patches, and a Transformer encoder extracts the
        features of each patch
      (Model overview figure from the paper: the patches go through a linear projection of flattened
      patches, patch + position embedding with an extra learnable [class] embedding, the Transformer
      encoder — L blocks of norm, multi-head attention, norm, and MLP with residual connections — and
      finally an MLP head that outputs the class, e.g. bird, ball, car.)

  41. CLS token
      - A CLS token is newly added for the classification problem
        It is a learnable parameter
        Classification is performed from the Transformer encoder's output at the CLS token position,
        through the MLP head

  42. Embedding / position embedding
      - The input image is split into P x P patches; each patch is flattened into a token
        x_p^n in R^(P^2 * C) and embedded with a linear map E in R^((P^2 * C) x D), after which position
        information is added before entering the ViT
      - The position embedding E_pos in R^((N+1) x D) is a differentiable parameter: positions are not
        taught explicitly but learned automatically
      - The CLS token is prepended as the starting token; the image features aggregated by the ViT are
        classified at the final layer
      - The patch tokens are treated in the same way as the Transformer's embedded features
      (N: number of patches, D: embedding dimensionality, C: number of channels, P: patch size; the
      figures refer to the base model. A sketch of the patch embedding follows below.)

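      A sketch of the patch embedding described above (split into P x P patches, flatten, project with
      E, prepend the CLS token, add E_pos); the sizes follow the ViT base model but are assumptions
      here.

        import torch
        import torch.nn as nn

        class PatchEmbedding(nn.Module):
            def __init__(self, img_size=224, patch=16, channels=3, dim=768):
                super().__init__()
                n_patches = (img_size // patch) ** 2
                self.patch = patch
                self.proj = nn.Linear(patch * patch * channels, dim)               # E
                self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # learnable CLS token
                self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # learnable E_pos

            def forward(self, x):                                                  # x: (B, C, H, W)
                b, c, _, _ = x.shape
                p = self.patch
                # cut the image into P x P patches and flatten each one into a P^2*C vector
                patches = (x.unfold(2, p, p).unfold(3, p, p)
                            .permute(0, 2, 3, 1, 4, 5).reshape(b, -1, p * p * c))
                tokens = self.proj(patches)                                        # (B, N, D)
                cls = self.cls_token.expand(b, -1, -1)
                return torch.cat([cls, tokens], dim=1) + self.pos_embed            # (B, N+1, D)

        out = PatchEmbedding()(torch.randn(2, 3, 224, 224))
        print(out.shape)   # torch.Size([2, 197, 768]): 14 x 14 patches plus the CLS token
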
  43. Embedding / position embedding
      - What does the patch embedding learn? Visualizing the filters of the initial linear embedding of
        RGB values (ViT-L/32; Figure 7, left, of the paper) shows that something resembling the filters
        of a CNN's early layers is learned.

  44. Embedding / position embedding
      - What does the position embedding learn? Visualizing the similarity between the position
        embedding of one patch and the position embeddings of all the others (ViT-L/32; Figure 7,
        center, of the paper) shows high similarity with nearby positions and low similarity with
        distant positions: nearby position embeddings are learned to take similar values.

  45. Overall processing of a Vision Transformer block
      - Transformer: the input is a word sequence ("私 / は / 犬 / が / 好き / だ / 。"); self-attention
        mixes the vectors along the time direction and the feed-forward layer transforms each vector
        individually along the depth direction, each followed by normalization, with the block repeated
        N times.
      - Vision Transformer: the input is the sequence of patch features plus the cls token;
        self-attention mixes the vectors along the spatial direction, and classification is performed
        using only the cls token output.

  46. Performance of the Vision Transformer
      - Comparison with the SoTA on image classification: pre-trained on JFT-300M and fine-tuned on each
        dataset
      Table 2 of the paper (mean and standard deviation over three fine-tuning runs):

                          Ours-JFT    Ours-JFT    Ours-I21k   BiT-L          Noisy Student
                          (ViT-H/14)  (ViT-L/16)  (ViT-L/16)  (ResNet152x4)  (EfficientNet-L2)
      ImageNet            88.55±0.04  87.76±0.03  85.30±0.02  87.54±0.02     88.4/88.5*
      ImageNet ReaL       90.72±0.05  90.54±0.03  88.62±0.05  90.54          90.55
      CIFAR-10            99.50±0.06  99.42±0.03  99.15±0.03  99.37±0.06     -
      CIFAR-100           94.55±0.04  93.90±0.05  93.25±0.05  93.51±0.08     -
      Oxford-IIIT Pets    97.56±0.03  97.32±0.11  94.67±0.15  96.62±0.23     -
      Oxford Flowers-102  99.68±0.02  99.74±0.00  99.61±0.02  99.63±0.03     -
      VTAB (19 tasks)     77.63±0.23  76.28±0.46  72.72±0.21  76.29±1.70     -
      TPUv3-core-days     2.5k        0.68k       0.23k       9.9k           12.3k

      ViT models pre-trained on JFT-300M outperform the ResNet-based baselines on all datasets while
      taking substantially less compute to pre-train; ViT pre-trained on the smaller public ImageNet-21k
      also performs well. *Slightly improved 88.5% result reported in Touvron et al. (2020).
      (The parameter counts of these models are on the order of hundreds of millions; TPUv3-core-days is
      the number of TPUv3 cores used for training multiplied by the training time in days.)
      -> SoTA on all of the datasets

  47. Feature representations acquired by ViT
      - Evaluating what kind of features ViT captures [1]
        Images are style-transferred with a GAN (e.g. a cat given elephant texture) and classified by a
        CNN or a ViT: answering "cat" means the model captures shape, answering "elephant" means it
        captures texture
        Evaluation targets: ViT, CNNs, humans
        Metric: shape fraction = (decisions for the correct shape class) / (decisions for the correct
        shape class + decisions for the correct texture class)
      [1] S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?", arXiv

  48. Feature representations acquired by ViT
      - CNNs put more weight on texture, whereas ViT puts more weight on object shape
      (Figure 2 of R. Geirhos, "ImageNet-trained CNNs are biased towards texture; increasing shape bias
      improves accuracy and robustness", ICLR: accuracies and example stimuli for five experiments
      without cue conflict — original, greyscale, silhouette, edges, and texture — for AlexNet,
      GoogLeNet, VGG-16, ResNet-50, and humans.)
      (Fig. 5 of S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?",
      arXiv: shape bias on the Stylized ImageNet (SIN) dataset per shape category — the fraction of
      "shape" versus "texture" decisions for ResNet-50, AlexNet, VGG-16, GoogLeNet, ViT-B/16, ViT-L/32,
      and humans, with vertical lines indicating averages; ViT and humans lean towards shape, the CNNs
      towards texture.)

  49. ViT-based segmentation: SegFormer [Xie, arXiv]
      - Applies the Transformer to the segmentation task
      - MixTransformer (MiT) encoder
        A hierarchical Transformer that captures multi-level (coarse and fine) features
        A structure that reduces the computational cost (efficient self-attention, Mix-FFN, overlapped
        patch merging)
      - A lightweight and simple All-MLP decoder
        The Transformer captures both local and global features
        The MLP complements the local features, yielding a powerful representation
      The decoder is formulated as (a code sketch follows below):
      F_hat_i = Linear(C_i, C)(F_i)  for all i      (unify the channel dimension of each stage)
      F_hat_i = Upsample(H/4 x W/4)(F_hat_i)  for all i
      F = Linear(4C, C)(Concat(F_hat_i))            (fuse the concatenated multi-level features)
      M = Linear(C, N_cls)(F)                       (predict the mask at H/4 x W/4 x N_cls resolution)
      where M is the predicted mask and Linear(C_in, C_out)(.) is a linear layer with C_in input and
      C_out output dimensions.
      Effective receptive field (ERF) analysis on Cityscapes (Figure 3 of the paper, averaged over 100
      images; top row: DeepLabv3+, bottom row: SegFormer; red box: ERF of the Stage-4 self-attention,
      blue box: ERF of the MLP in the MLP layer):
      -> the CNN-based model captures only local features, whereas the Transformer-based model captures
      both local and global features
      -> the MLP's receptive field is denser in local regions, so accurate segmentation of small objects
      can be expected
      E. Xie, "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
      arXiv

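      The four decoder equations map directly onto a small module; the sketch below uses 1x1
      convolutions as position-wise linear layers, and the stage channel counts are assumed from a
      typical MiT configuration rather than taken from the paper's exact models.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AllMLPDecoder(nn.Module):
            def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256, num_classes=19):
                super().__init__()
                self.linears = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)  # Linear(C_i, C)
                self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)              # Linear(4C, C)
                self.classify = nn.Conv2d(embed_dim, num_classes, 1)                           # Linear(C, N_cls)

            def forward(self, feats):                      # feats: stage outputs at strides 4, 8, 16, 32
                target = feats[0].shape[-2:]               # the H/4 x W/4 resolution
                ups = [F.interpolate(lin(f), size=target, mode="bilinear", align_corners=False)
                       for lin, f in zip(self.linears, feats)]
                return self.classify(self.fuse(torch.cat(ups, dim=1)))   # mask logits, H/4 x W/4 x N_cls

        feats = [torch.randn(1, c, 128 // s, 128 // s)
                 for c, s in zip((64, 128, 320, 512), (1, 2, 4, 8))]     # dummy multi-level features
        print(AllMLPDecoder()(feats).shape)                              # torch.Size([1, 19, 128, 128])
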
  50. ViT-based segmentation: SegFormer [Xie, arXiv]
      Ablations reported in the paper (mIoU on ADE20K unless noted otherwise):
      - Decoder channel dimension C (Flops / Params / mIoU): 256: 25.7 / 24.7 / 44.9; 512: 39.8 / 25.8 /
        45.0; 768: 62.4 / 27.5 / 45.4; 1024: 93.6 / 29.6 / 45.2; 2048: 304.4 / 43.4 / 45.6. Performance
        plateaus for channel dimensions wider than 768, so C = 256 is chosen for the real-time models.
      - Positional encoding vs. Mix-FFN (Cityscapes mIoU by inference resolution): PE at 768x768: 77.3;
        PE at 1024x2048: 74.0; Mix-FFN at 768x768: 80.5; Mix-FFN at 1024x2048: 79.8.
      - Encoder comparison (Flops / Params / mIoU): ResNet50 (S1-4): 69.2 / 29.0 / 34.7; ResNet101
        (S1-4): 88.7 / 47.9 / 38.7; ResNeXt101 (S1-4): 127.5 / 86.8 / 39.8; MiT-B2 (S4): 22.3 / 24.7 /
        43.1; MiT-B2 (S1-4): 62.4 / 27.7 / 45.4; MiT-B3 (S1-4): 79.0 / 47.3 / 48.6.
      Table 2 of the paper: comparison with the state of the art on ADE20K and Cityscapes (Params;
      ADE20K Flops / FPS / mIoU; Cityscapes Flops / FPS / mIoU). For SegFormer-B0, the short side of the
      image is scaled to {1024, 768, 640, 512} to obtain speed-accuracy trade-offs.
      Real-time:
      - FCN, MobileNetV2: 9.8; 39.6 / 64.4 / 19.7; 317.1 / 14.2 / 61.5
      - ICNet: -; -; - / 30.3 / 67.7
      - PSPNet, MobileNetV2: 13.7; 52.9 / 57.7 / 29.6; 423.4 / 11.2 / 70.2
      - DeepLabV3+, MobileNetV2: 15.4; 69.4 / 43.1 / 34.0; 555.4 / 8.4 / 75.2
      - SegFormer (Ours), MiT-B0: 3.8; 8.4 / 50.5 / 37.4; 125.5 / 15.2 / 76.2 (and, at smaller input
        scales on Cityscapes: 51.7 / 26.3 / 75.3; 31.5 / 37.1 / 73.7; 17.7 / 47.6 / 71.9)
      Non real-time:
      - FCN, ResNet-101: 68.6; 275.7 / 14.8 / 41.4; 2203.3 / 1.2 / 76.6
      - EncNet, ResNet-101: 55.1; 218.8 / 14.9 / 44.7; 1748.0 / 1.3 / 76.9
      - PSPNet, ResNet-101: 68.1; 256.4 / 15.3 / 44.4; 2048.9 / 1.2 / 78.5
      - CCNet, ResNet-101: 68.9; 278.4 / 14.1 / 45.2; 2224.8 / 1.0 / 80.2
      - DeeplabV3+, ResNet-101: 62.7; 255.1 / 14.1 / 44.1; 2032.3 / 1.2 / 80.9
      - OCRNet, HRNet-W48: 70.5; 164.8 / 17.0 / 45.6; 1296.8 / 4.2 / 81.1
      - GSCNN, WideResNet38: -; -; - / - / 80.8
      - Axial-DeepLab, AxialResNet-XL: -; -; 2446.8 / - / 81.1
      - Dynamic Routing, Dynamic-L33-PSP: -; -; 270.0 / - / 80.7
      - Auto-Deeplab, NAS-F48-ASPP: -; - / - / 44.0; 695.0 / - / 80.3
      - SETR, ViT-Large: 318.3; - / 5.4 / 50.2; - / 0.5 / 82.2
      - SegFormer (Ours), MiT-B4: 64.1; 95.7 / 15.4 / 51.1; 1240.6 / 3.0 / 83.8
      - SegFormer (Ours), MiT-B5: 84.7; 183.3 / 9.8 / 51.8; 1447.6 / 2.5 / 84.0
      (Figure 4 of the paper — qualitative results on Cityscapes: compared with SETR, SegFormer predicts
      masks with substantially finer detail near object boundaries; compared with DeepLabV3+, it reduces
      long-range errors, highlighted in red.)
      E. Xie, "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
      arXiv

  51. The possibilities SegFormer opens up
      Demo video: https://www.youtube.com/watch?v=JMoRQzZeU
      Under noise, the CNN-based model's performance degrades, while SegFormer remains robust
      -> because the Transformer learns object shape, it is less affected by noise

  52. Multi-head SegFormer
      - SegFormer is given multiple heads so that it can handle multiple datasets
        A decoder (Concat + MLP) is prepared for each dataset (domains A, B, and C)
        A DA module is added to each Transformer block (efficient self-attention, Mix-FFN, DA module,
        overlap patch merging, repeated N times)
      (Encoder: Transformer blocks 1-4 with MLP layers; Decoder: one head per domain.)

  53. Multi-head SegFormer
      - Comparison of the multi-domain SegFormer (DA module added to the Transformer blocks) with the
        CNN-based method
      (Tables: Mean IoU for each train/val combination of Cityscapes, BDD, and Synscapes, for the
      DeepLab-based model and for SegFormer.)
      -> SegFormer is highly effective in the multi-domain setting as well

  54. Summary: semantic segmentation by deep learning
      - CNN-based state of the art: DeepLabV3+, HRNet, OCR
      - Handling multiple domains: introducing the DA module enables multi-domain training
        (a shared ResNet101 + DA module with ASPP and per-dataset heads for Cityscapes, A2D2, and
        Mapillary)
      - Effect of SegFormer: the Vision Transformer improves robustness to noise
        (a hierarchical Transformer encoder extracting coarse and fine features plus a lightweight
        All-MLP decoder)

  55. References
      [Shotton, CVPR] Jamie Shotton, Matthew Johnson, Roberto Cipolla, "Semantic texton forests for
      image categorization and segmentation", CVPR.
      [Badrinarayanan, PAMI] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, "SegNet: A deep
      convolutional encoder-decoder architecture for image segmentation", PAMI.
      [Long, CVPR] Jonathan Long, Evan Shelhamer, Trevor Darrell, "Fully convolutional networks for
      semantic segmentation", CVPR.
      [Ronneberger, MICCAI] Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional
      networks for biomedical image segmentation", MICCAI.
      [Zhao, CVPR] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia, "Pyramid scene
      parsing network", CVPR.
      [Chen, ECCV] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam,
      "Encoder-decoder with atrous separable convolution for semantic image segmentation", ECCV.
      [Wang, PAMI] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong
      Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao, "Deep high-resolution
      representation learning for visual recognition", PAMI.
      [Yuan, ECCV] Yuhui Yuan, Xilin Chen, Jingdong Wang, "Object-contextual representations for
      semantic segmentation", ECCV.
      [Cordts, CVPR] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
      Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele, "The Cityscapes dataset for semantic
      urban scene understanding", CVPR.
      [Lee, ICLR] Kuan-Hui Lee, German Ros, Jie Li, Adrien Gaidon, "SPIGAN: Privileged adversarial
      learning from simulation", ICLR.
      [Lai, CVPR] Xin Lai, Zhuotao Tian, Li Jiang, Shu Liu, Hengshuang Zhao, Liwei Wang, Jiaya Jia,
      "Semi-supervised semantic segmentation with directional context-aware consistency", CVPR.

  56. Machine Perception and Robotics Group, Chubu University
      Professor Hironobu Fujiyoshi  E-mail: [email protected]
      Completed the doctoral program at Chubu University in 1997; Postdoctoral Fellow at the Robotics
      Institute, Carnegie Mellon University, 1997; Lecturer, Department of Computer Science, Chubu
      University, 2000; Associate Professor, Chubu University, 2004; Visiting Researcher, Robotics
      Institute, Carnegie Mellon University, 2005-2006; Professor, Chubu University, 2010; Visiting
      Professor, Nagoya University, 2014. Research interests: computer vision, video processing, and
      pattern recognition and understanding. Awards include the RoboCup Research Award (2005), the IPSJ
      Transactions on CVIM Outstanding Paper Award (2009), the IPSJ Yamashita Memorial Research Award
      (2009), the SSII Best Academic Award (2010, 2013, 2014), and the IEICE Information and Systems
      Society Paper Award (2013), among others.
      Professor Takayoshi Yamashita  E-mail: [email protected]
      Completed the master's program at the Nara Institute of Science and Technology in 2002; joined
      OMRON Corporation in 2002; completed the doctoral program at Chubu University in 2009 (working
      doctorate); Lecturer, Chubu University, 2014; Associate Professor, 2017; Professor, 2021. Research
      interests: video processing, pattern recognition, and machine learning for human understanding.
      Awards include the SSII Takagi Award (2009), the IEICE Information and Systems Society Paper Award
      (2013), and the IEICE PRMU Research Encouragement Award (2013).
      Lecturer Tsubasa Hirakawa  E-mail: [email protected]
      Completed the master's program at Hiroshima University in 2013 and entered the doctoral program in
      2014; Researcher, Chubu University, 2017-2019; completed the doctoral program at Hiroshima
      University in 2017; Research Assistant Professor, Chubu University, 2019; Lecturer, Chubu
      University, 2021. JSPS Research Fellow DC1, 2014; Visiting Researcher, ESIEE Paris, 2014-2015.
      Research interests: computer vision, pattern recognition, and medical image processing.