
Semantic Segmentation by Deep Learning and Its Latest Trends

The Japanese Society of Microscopy
Biological Function Volume Data Analysis Research Group, 6th Workshop
March 24, 2022
Hironobu Fujiyoshi (Chubu University)

Hironobu Fujiyoshi

March 23, 2022

Transcript

  1. The Japanese Society of Microscopy
     Biological Function Volume Data Analysis Research Group Workshop
     Semantic Segmentation by Deep Learning and Its Latest Trends
     Hironobu Fujiyoshi (Chubu University, Machine Perception and Robotics Group)
     http://mprg.jp

  2. What is generic object recognition?

  3. The ultimate goal of computer vision (image recognition)?
     Generic object recognition: given an image of an unconstrained real-world scene, a computer
     recognizes the objects it contains by their generic category names.
     - Before deep learning: regarded as an extremely difficult problem
     - After deep learning: the problem began to be solved

  4. Subdividing the generic object recognition problem
     Specific object recognition
     - Matching: "Is this that building?", "Is this the 'Stop' sign?"
     Generic object recognition
     - Image classification: "What is this an image of?"
     - Object detection: "Where are the people?"
     - Scene understanding: "What kind of scene is this?"

  5. Semantic segmentation task
     - The problem of assigning an object category to every pixel of an image
     - Semantic texton forests [Shotton, CVPR]: handcrafted features (simple pixel comparisons)
       combined with a machine-learning algorithm (a random forest)
     (Figure from the paper: (a) decision forests — a forest of T decision trees classifies a feature
     vector by descending each tree, giving a root-to-leaf path and a class distribution at the leaf;
     (b) semantic texton forest features — split nodes use simple functions of raw pixels within a
     d x d patch, such as the value of a single pixel or the sum, difference, or absolute difference of
     a pair of pixels. Per-class segmentation results are shown for categories such as building, grass,
     tree, cow, sheep, sky, airplane, water, car, bicycle, flower, sign, bird, book, chair, road, cat,
     dog, body, and boat.)

  6. CNN architectures for each image recognition task
     - The CNN architecture is designed to match the task (legend in the figure: convolution layers,
       pooling layers, upsampling layers; outputs are visualized on the right)
     - Image classification: input image -> CNN -> class probabilities (e.g. "Person")
     - Object detection: input image -> CNN -> class probabilities and detection regions per grid cell
       (C + B outputs)
     - Semantic segmentation: input image (W x H) -> CNN -> per-pixel class probabilities
       (a C-channel map of size W x H)

  7. Semantic segmentation with CNNs
     - SegNet [Badrinarayanan, PAMI]: an encoder-decoder architecture
     - Pooling indices: the positions chosen by max pooling are stored and reused at upsampling time
       (see the sketch below)
     (Fig. 1 of the paper: SegNet predictions on urban and highway scene test samples from the wild;
     the class colour codes can be obtained from Brostow et al. Online demo:
     http://mi.eng.cam.ac.uk/projects/segnet/)

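     As a concrete illustration of the pooling-indices idea, the sketch below uses PyTorch (assumed
     here purely for illustration, not the SegNet implementation): MaxPool2d can return the argmax
     positions, and MaxUnpool2d writes the values back at exactly those positions during upsampling.

        import torch
        import torch.nn as nn

        # SegNet-style unpooling: the max-pool positions saved on the encoder side are reused on the
        # decoder side, so values are written back to the locations they originally came from.
        pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
        unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

        x = torch.randn(1, 64, 32, 32)        # an encoder feature map
        pooled, indices = pool(x)             # downsample and remember the argmax positions
        upsampled = unpool(pooled, indices)   # sparse upsampling guided by the stored indices
        print(pooled.shape, upsampled.shape)  # (1, 64, 16, 16) and (1, 64, 32, 32)
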
  8. Network architectures for semantic segmentation
     - FCN (fully convolutional network): J. Long, "Fully Convolutional Networks for Semantic
       Segmentation", CVPR
     - U-Net: O. Ronneberger, "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI
     - PSPNet: H. Zhao, "Pyramid Scene Parsing Network", CVPR

  9. Performance of semantic segmentation
     - Semantic segmentation has become more accurate alongside progress in object recognition
     Cityscapes val set benchmark (an urban street-scene dataset): Mean IoU (class) plotted against the
     year of publication for SegNet, DeepLab, the DeepLabV series, PSPNet, HRNet, OCR (the CNN-based
     state of the art), and SegFormer

  10. DeepLabv3+ [Chen, ECCV]
      - Atrous convolution keeps the feature map from being shrunk
      - A standard network vs. a network built with atrous convolution
      - Atrous spatial pyramid pooling; atrous (dilated) convolution

  11. DeepLabv3+ [Chen, ECCV]
      - Atrous Spatial Pyramid Pooling (ASPP): convolutions with different dilation rates are applied in
        parallel, so that a variety of receptive fields is taken into account (a sketch follows below)
      - Encoder-decoder structure: improves recognition accuracy around object boundaries
      (Architecture figure: the encoder is a DCNN with atrous convolution followed by ASPP — a 1x1 conv,
      3x3 convs with rates 6, 12, and 18, and image pooling — and a 1x1 conv; the decoder concatenates
      low-level features (after a 1x1 conv) with the encoder output upsampled by 4, applies a 3x3 conv,
      and upsamples by 4 again to produce the prediction.)
      L. Chen, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", ECCV

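      To make the ASPP idea concrete, the following is a minimal PyTorch sketch of the module as drawn
      in the figure (parallel branches with dilation rates 6, 12, and 18 plus image pooling); the
      channel sizes are assumptions, not the exact DeepLabv3+ configuration.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ASPP(nn.Module):
            """Parallel atrous convolutions with different dilation rates, fused by a 1x1 conv."""
            def __init__(self, in_ch=2048, out_ch=256):
                super().__init__()
                self.branches = nn.ModuleList([
                    nn.Conv2d(in_ch, out_ch, 1),                           # 1x1 conv
                    nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6),    # 3x3 conv, rate 6
                    nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12),  # 3x3 conv, rate 12
                    nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18),  # 3x3 conv, rate 18
                ])
                self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
                self.project = nn.Conv2d(out_ch * 5, out_ch, 1)            # fuse the five branches

            def forward(self, x):
                h, w = x.shape[-2:]
                feats = [branch(x) for branch in self.branches]
                pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                                       align_corners=False)
                return self.project(torch.cat(feats + [pooled], dim=1))

        y = ASPP()(torch.randn(1, 2048, 33, 33))   # e.g. backbone features at output stride 16
        print(y.shape)                             # torch.Size([1, 256, 33, 33])
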
  12. HRNet [Wang, PAMI]
      - Parallel processing with high-resolution and low-resolution subnetworks
        -> both local and global features can be captured
      - Connections between the subnetworks: the feature maps of each scale are summed so that
        information is shared

  13. OCR [Yuan, ECCV]
      - Proposes object-contextual representations
      - Intermediate object-region features are used as soft attention
      - Features are represented by a weighted aggregation of the object-region representations
      (Pipeline, applied on top of ASPP: backbone -> pixel representations; soft object regions ->
      object-region representations, obtained by aggregating the representations of the pixels that
      belong to each object region; the pixel-region relations act as soft attention and produce the
      object contextual representations, which yield the augmented representations fed to the loss.
      Per-pixel features are thus obtained using the representations of the object classes.)

  14. Applications of semantic segmentation
      - Scene analysis: autonomous driving, robot vision, satellite image analysis
        (H. Alemohammad, "LandCoverNet: A global benchmark land cover classification training dataset",
        NeurIPS)
      - Medical image analysis: organ segmentation
        (H. Roth, "An application of cascaded 3D fully convolutional networks for medical image
        segmentation", CMIG) and lesion segmentation
        (K. Kamnitsas, "Efficient Multi-Scale 3D CNN with Fully Connected CRF for Accurate Brain Lesion
        Segmentation", MedIA)
      - Industrial inspection: detection of anomalous regions
        (Saito, "Development of a deep-learning-based system for detecting deteriorated regions of
        concrete revetments", Digital Practice)

  15. Semantic segmentation performance across domains
      - Limits of CNN generalization: a CNN is effective only in the domain it was trained on
      - Training data: captured in Europe (Cityscapes dataset)
      - Test data: captured in Japan

  16. Semantic segmentation performance across domains
      - Limits of CNN generalization: a CNN is effective only in the domain it was trained on
      - Training data: captured in Japan
      - Test data: captured in Japan

  17. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - A multi-head structure enables semantic segmentation over multiple datasets
      - Shared network: ResNet101 with a Domain Attention (DA) module applied
      (Architecture: shared network = ResNet101 + DA module -> ASPP -> 1x1 conv / concat, feeding a
      multi-head structure with one head per dataset — Head 1: Cityscapes, Head 2: A2D2, Head N:
      Mapillary — where each head is 3x3 conv -> 3x3 conv -> 1x1 conv.)

  18. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes attention over the domains and weights the branches accordingly
      (Module diagram: a residual block feeds SE modules A, B, and C — one per domain, each GAP -> FC
      producing a C x 1 descriptor, concatenated into C x N — while a Domain Assignment path
      (GAP -> FC -> softmax, N x 1) produces the attention over domains that weights the SE Adapter
      outputs applied to the C x H x W feature map.)

  19. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module (module diagram enlarged)
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes the attention over domains and weights the SE Adapter outputs

  20. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Domain Attention (DA) module
        SE Adapter: each branch extracts information specific to one domain
        Domain Assignment: computes attention over the domains and weights the branches
        -> for each domain, the most suitable features are obtained from the SE Adapters (a sketch of
        such a module follows below)

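      The slide describes the DA module only at the block-diagram level, so the following PyTorch
      sketch is an interpretation of that description (SE-style branches mixed by a softmax domain
      assignment); the sigmoid gating and all layer sizes are assumptions, not the authors'
      implementation.

        import torch
        import torch.nn as nn

        class SEBranch(nn.Module):
            """One SE Adapter branch: global average pooling followed by an FC layer (a C x 1 descriptor)."""
            def __init__(self, channels):
                super().__init__()
                self.fc = nn.Linear(channels, channels)

            def forward(self, x):                    # x: (B, C, H, W)
                return self.fc(x.mean(dim=(2, 3)))   # GAP -> FC, shape (B, C)

        class DomainAttention(nn.Module):
            """Domain-specific SE branches weighted by a softmax attention over the domains."""
            def __init__(self, channels, num_domains=3):
                super().__init__()
                self.branches = nn.ModuleList(SEBranch(channels) for _ in range(num_domains))
                self.assign = nn.Linear(channels, num_domains)        # Domain Assignment head

            def forward(self, x):
                pooled = x.mean(dim=(2, 3))                           # (B, C)
                attn = torch.softmax(self.assign(pooled), dim=1)      # (B, N): attention over domains
                branch_out = torch.stack([b(x) for b in self.branches], dim=1)        # (B, N, C)
                scale = torch.sigmoid((attn.unsqueeze(-1) * branch_out).sum(dim=1))   # (B, C)
                return x * scale[:, :, None, None]                    # re-weight the feature map

        y = DomainAttention(channels=256)(torch.randn(2, 256, 64, 64))
        print(y.shape)  # torch.Size([2, 256, 64, 64])
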
  21. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Multi-head structure
        A dedicated output head (3x3 conv -> 3x3 conv -> 1x1 conv) is prepared for each dataset
        (Head 1: Cityscapes, Head 2: A2D2, Head N: Mapillary), each producing the class outputs of its
        own dataset
        -> training is possible even when the datasets have different sets of object classes

  22. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Training method
        MixLoss: the per-dataset losses are accumulated and backpropagated at the same time
        Updating the shared network's parameters simultaneously reduces the variation between domains
      MixLoss sums the cross-entropy losses of the heads (a sketch of one training step follows below):
      L = L_CE(x_1) + L_CE(x_2) + ... + L_CE(x_N) = sum_{n=1}^{N} L_CE(x_n)

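      A minimal sketch of one MixLoss training step is shown below; `model(images, head=k)` is an
      assumed interface that returns the logits of head k, and the batch layout is illustrative. The
      point is simply that the cross-entropy losses of all heads are summed and backpropagated in a
      single step, so the shared parameters are updated simultaneously.

        import torch.nn as nn

        criterion = nn.CrossEntropyLoss(ignore_index=255)   # ignore unlabeled pixels

        def mix_loss_step(model, optimizer, batches):
            """batches: a list of (images, labels) pairs, one mini-batch per dataset/head."""
            optimizer.zero_grad()
            total = 0.0
            for head_idx, (images, labels) in enumerate(batches):
                logits = model(images, head=head_idx)       # dataset-specific output head
                total = total + criterion(logits, labels)   # accumulate L_CE(x_n)
            total.backward()                                # one backward pass through the shared network
            optimizer.step()
            return total.item()
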
  23. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      - Experiment overview
        Conditions: fixed input size (pixels), fixed number of training epochs, optimizer: Momentum SGD
        Dataset combination: Cityscapes + BDD + Synscapes
        Evaluation metric: Mean IoU
      - Cityscapes: in-vehicle images captured in German cities
      - BDD: images captured in US cities (New York, Berkeley, San Francisco, and the Bay Area)
      - Synscapes: a dataset generated with photorealistic rendering techniques

  24. Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
      (Table: comparison by Mean IoU for each train/val combination of Cityscapes, BDD, and Synscapes,
      including the proposed method without DA.)

  25. Issues in semantic segmentation
      - Annotation: consistent annotation is required
        -> when annotation labels differ (for example, annotator A and annotator B labelling trees
        differently), training does not go well

  26. Issues in semantic segmentation
      - Annotation: consistent annotation is costly
        Annotation cost = (number of images) x (annotation time per image)
        In the case of Cityscapes, annotating a single image takes a considerable number of minutes
        M. Cordts, "The Cityscapes Dataset for Semantic Urban Scene Understanding", CVPR2016

  27. SPIGAN: Privileged Adversarial Learning from Simulation [Lee, ICLR]
      - Training with unannotated real images and a simulator (unlabeled real images + simulation)
        The simulator generates CG images, annotations, and depth
        The CG images are style-transferred so that they look like real images
        Target task 1: semantic segmentation
        Target task 2: depth estimation

  28. Semi-supervised learning
      - Semi-supervised Semantic Segmentation with Directional Context-aware Consistency [X. Lai, CVPR]
        Contrastive learning on two patches randomly cropped from unlabeled data (labeled images are
        used as usual)
        Positive pair: the overlapping region — the two features phi_{o1} and phi_{o2} at the same
        position are pulled closer together
        Negative pair: non-overlapping regions — the two features phi_{u1} and phi_{u2} at different
        positions are pushed apart

  29. Performance of semantic segmentation
      - Semantic segmentation has become more accurate alongside progress in object recognition
      Cityscapes val set benchmark: Mean IoU (class) plotted against the year of publication for SegNet,
      DeepLab, the DeepLabV series, PSPNet, HRNet, OCR, and the Transformer-based SegFormer

  30. Transformer [Vaswani, NeurIPS]
      - A model that uses only the attention mechanism
        Achieved SoTA in text generation and translation, replacing RNNs and CNNs
      - Built only from attention
        Parallel computation is possible, as with CNNs
        Long-range dependencies can be modelled, as with RNNs
      - Positional encoding: position information is embedded at every time step
      - Built from self-attention: long-range correspondences between input and output can be captured
      (Encoder-decoder architecture)

  31. Positional encoding
      - Embeds position information at every time step
        Keeps the sequence in its correct order
        Conceptually, it adds the relative and absolute position information that RNNs and CNNs obtain
        implicitly
      An RNN reads the tokens of "私 / は / リンゴ / が / 好き / です" one per time step t; the
      Transformer has no notion of t, so position must be given explicitly.
      Formulation (a code sketch follows below):
      PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
      PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
      where d_model is the dimensionality of the positional encoding, pos is the position in the
      sequence, and i indexes the PE dimensions.
      -> The wavelengths of the positional encoding form a geometric progression from 2*pi to
      10000 * 2*pi.
      (Visualization: position in the sequence vs. PE dimension)

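      A direct implementation of these two formulas, using NumPy purely for illustration:

        import numpy as np

        def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
            """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...); d_model assumed even."""
            pos = np.arange(max_len)[:, None]                   # (max_len, 1)
            i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
            angle = pos / np.power(10000.0, 2 * i / d_model)    # wavelengths form a geometric progression
            pe = np.zeros((max_len, d_model))
            pe[:, 0::2] = np.sin(angle)                         # even dimensions
            pe[:, 1::2] = np.cos(angle)                         # odd dimensions
            return pe

        print(positional_encoding(max_len=50, d_model=128).shape)   # (50, 128)
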
  32. Scaled dot-product attention
      - The key component of the Transformer
        A module inside multi-head attention, used in both the encoder and the decoder
        Self-attention is built from Query, Key, and Value
      Formulation (a code sketch follows below): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
      where d_k is the dimensionality of the queries.
      Why scale by sqrt(d_k)? For example, if the entries of Q and K have mean 0 and variance 1, the
      entries of their matrix product have mean 0 and variance d_k. When some dot products are very
      large, the softmax gradient becomes very small for every element other than the largest one.
      Scaling by sqrt(d_k) brings the values back to mean 0 and variance 1, which gives smoother
      gradients.
      (Figure 2 of the paper: (left) scaled dot-product attention; (right) multi-head attention consists
      of several attention layers running in parallel.)

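      The formula translates almost line by line into code; the PyTorch sketch below is for
      illustration.

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, seq_len, d_k)."""
            d_k = q.size(-1)
            scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k): scaled dot products
            weights = F.softmax(scores, dim=-1)             # attention weights over the keys
            return weights @ v, weights                     # weighted sum of the Values

        q = k = v = torch.randn(2, 5, 64)                   # e.g. 5 tokens, d_k = 64
        out, attn = scaled_dot_product_attention(q, k, v)
        print(out.shape, attn.shape)                        # (2, 5, 64) and (2, 5, 5)
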
  33. Details of self-attention
      Each input token x_i is embedded (with the positional encoding added) into a feature e_i, and the
      Query, Key, and Value features are obtained from e_i by separate linear transformations:
      q_i = W_q e_i,  k_i = W_k e_i,  v_i = W_v e_i

  34. Details of self-attention
      The dot products between the Query and Key features are passed through a softmax to obtain the
      relevance between sequence positions (the attention weights alpha_hat):
      alpha_hat = softmax(Q K^T / sqrt(d_k))
      (alpha_{i,j} is the raw score between positions i and j, and alpha_hat_{i,j} is its
      softmax-normalized weight)

  35. Details of self-attention
      The attention weights are multiplied with the Value features; attaching this information captures
      the relations between the time steps, giving one output vector per position:
      Attention(Q, K, V) = alpha_hat V
      (A sketch of a single self-attention head combining these three steps follows below.)

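      The three steps of slides 33-35 (linear projections to Q/K/V, softmax of the scaled dot products,
      weighted sum of the Values) fit into one small module; this single-head PyTorch sketch is for
      illustration and omits the multi-head splitting and output projection.

        import torch
        import torch.nn as nn

        class SingleHeadSelfAttention(nn.Module):
            def __init__(self, d_model: int):
                super().__init__()
                self.w_q = nn.Linear(d_model, d_model, bias=False)   # q_i = W_q e_i
                self.w_k = nn.Linear(d_model, d_model, bias=False)   # k_i = W_k e_i
                self.w_v = nn.Linear(d_model, d_model, bias=False)   # v_i = W_v e_i

            def forward(self, e):                                    # e: (batch, seq_len, d_model)
                q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)
                scores = q @ k.transpose(-2, -1) / e.size(-1) ** 0.5
                weights = torch.softmax(scores, dim=-1)              # alpha_hat: (batch, seq_len, seq_len)
                return weights @ v                                   # one output vector per position

        out = SingleHeadSelfAttention(64)(torch.randn(2, 5, 64))     # e.g. 5 tokens of dimension 64
        print(out.shape)   # torch.Size([2, 5, 64])
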
  36. Overall processing of a Transformer block
      Input: a word sequence (e.g. "私 / は / 犬 / が / 好き / だ / 。"), embedded into feature vectors
      of shape (number of words) x (dimensionality).
      - self-attention: mixes and transforms the vectors along the time direction. The features are
        projected with W_q, W_k, and W_v; the matrix product between query and (transposed) key, with a
        softmax over the key dimension, gives a (words x words) matrix alpha_hat in which each row
        stores the importance of every other word for that word; alpha_hat is multiplied with the
        Values and the result is projected with W_out.
      - feed-forward: transforms each vector individually along the depth direction (W_feed1, W_feed2).
      Each sub-layer is followed by normalization, and the dashed block is repeated N times (a sketch of
      one encoder block follows below).

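      The whole block can be sketched with PyTorch's built-in multi-head attention; the layer sizes
      below are the defaults from the paper, but the code is an illustration, not a full Transformer.

        import torch
        import torch.nn as nn

        class TransformerEncoderBlock(nn.Module):
            """Self-attention (mixing along the time axis) and a position-wise feed-forward layer
            (per token, along the depth axis), each followed by a residual connection and normalization."""
            def __init__(self, d_model=512, n_heads=8, d_ff=2048):
                super().__init__()
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

            def forward(self, x):                      # x: (batch, seq_len, d_model)
                a, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
                x = self.norm1(x + a)                  # residual connection + normalization
                return self.norm2(x + self.ff(x))      # per-token feed-forward + normalization

        y = TransformerEncoderBlock()(torch.randn(2, 7, 512))   # e.g. 7 tokens; stack the block N times
        print(y.shape)   # torch.Size([2, 7, 512])
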
  37. Decoder processing at inference time
      The encoder reads the input word sequence ("私 / は / 犬 / が / 好き / だ / 。") through N
      repetitions of multi-head attention, normalization, and feed-forward layers.
      The decoder generates the output autoregressively: starting from the EOS token it predicts "I";
      from "EOS I" it predicts "like"; and so on until "I like dogs ." and the end token. Each step runs
      masked multi-head attention, multi-head attention over the encoder output (K and V come from the
      encoder, Q from the decoder), and a feed-forward layer, each followed by normalization and
      repeated N times.
      (From the paper: the encoder and decoder are each a stack of N = 6 identical layers; a residual
      connection and layer normalization follow every sub-layer; the decoder's self-attention is masked
      so that the prediction for position i can depend only on the known outputs at positions before i.)

  38. Vision Transformer [Dosovitskiy, ICLR]
      - An image classification method that applies the Transformer to the vision domain
        The image is split into fixed-size patches
        The patches are flattened and a Transformer encoder produces a feature for each patch
        SoTA on classification tasks such as ImageNet
      A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR

  39. What makes it stronger than a CNN?
      - ViT captures features of the kind a CNN's receptive field captures — and beyond [1]
        CNN: captures the features of a limited region (the receptive field reached after a few
        convolutions)
        ViT: splits the image into patches and learns the relations between patches with the
        Transformer, capturing whole-image features that a CNN cannot fully capture
      [1] J. Cordonnier, "On the Relationship between Self-Attention and Convolutional Layers", ICLR

  40. Network overview
      - The input image is split into fixed-size patches, and a Transformer encoder extracts the
        features of each patch
      (Model overview figure from the paper: the patches go through a linear projection of flattened
      patches, patch + position embedding with an extra learnable [class] embedding, the Transformer
      encoder — L blocks of norm, multi-head attention, norm, and MLP with residual connections — and
      finally an MLP head that outputs the class, e.g. bird, ball, car.)

  41. CLS token
      - A CLS token is newly added for the classification problem
        It is a learnable parameter
        Classification is performed from the Transformer encoder's output at the CLS token position,
        through the MLP head

  42. Embedding / position embedding
      - The input image is split into P x P patches; each patch is flattened into a token
        x_p^n in R^(P^2 * C) and embedded with a linear map E in R^((P^2 * C) x D), after which position
        information is added before entering the ViT
      - The position embedding E_pos in R^((N+1) x D) is a differentiable parameter: positions are not
        taught explicitly but learned automatically
      - The CLS token is prepended as the starting token; the image features aggregated by the ViT are
        classified at the final layer
      - The patch tokens are treated in the same way as the Transformer's embedded features
      (N: number of patches, D: embedding dimensionality, C: number of channels, P: patch size; the
      figures refer to the base model. A sketch of the patch embedding follows below.)

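      A sketch of the patch embedding described above (split into P x P patches, flatten, project with
      E, prepend the CLS token, add E_pos); the sizes follow the ViT base model but are assumptions
      here.

        import torch
        import torch.nn as nn

        class PatchEmbedding(nn.Module):
            def __init__(self, img_size=224, patch=16, channels=3, dim=768):
                super().__init__()
                n_patches = (img_size // patch) ** 2
                self.patch = patch
                self.proj = nn.Linear(patch * patch * channels, dim)               # E
                self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # learnable CLS token
                self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # learnable E_pos

            def forward(self, x):                                                  # x: (B, C, H, W)
                b, c, _, _ = x.shape
                p = self.patch
                # cut the image into P x P patches and flatten each one into a P^2*C vector
                patches = (x.unfold(2, p, p).unfold(3, p, p)
                            .permute(0, 2, 3, 1, 4, 5).reshape(b, -1, p * p * c))
                tokens = self.proj(patches)                                        # (B, N, D)
                cls = self.cls_token.expand(b, -1, -1)
                return torch.cat([cls, tokens], dim=1) + self.pos_embed            # (B, N+1, D)

        out = PatchEmbedding()(torch.randn(2, 3, 224, 224))
        print(out.shape)   # torch.Size([2, 197, 768]): 14 x 14 patches plus the CLS token
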
  43. Embedding / position embedding
      - What does the patch embedding learn? Visualizing the filters of the initial linear embedding of
        RGB values (ViT-L/32; Figure 7, left, of the paper) shows that something resembling the filters
        of a CNN's early layers is learned.

  44. Embedding / position embedding
      - What does the position embedding learn? Visualizing the similarity between the position
        embedding of one patch and the position embeddings of all the others (ViT-L/32; Figure 7,
        center, of the paper) shows high similarity with nearby positions and low similarity with
        distant positions: nearby position embeddings are learned to take similar values.

  45. Overall processing of a Vision Transformer block
      - Transformer: the input is a word sequence ("私 / は / 犬 / が / 好き / だ / 。"); self-attention
        mixes the vectors along the time direction and the feed-forward layer transforms each vector
        individually along the depth direction, each followed by normalization, with the block repeated
        N times.
      - Vision Transformer: the input is the sequence of patch features plus the cls token;
        self-attention mixes the vectors along the spatial direction, and classification is performed
        using only the cls token output.

  46. Performance of the Vision Transformer
      - Comparison with the SoTA on image classification: pre-trained on JFT-300M and fine-tuned on each
        dataset
      Table 2 of the paper (mean and standard deviation over three fine-tuning runs):

                          Ours-JFT    Ours-JFT    Ours-I21k   BiT-L          Noisy Student
                          (ViT-H/14)  (ViT-L/16)  (ViT-L/16)  (ResNet152x4)  (EfficientNet-L2)
      ImageNet            88.55±0.04  87.76±0.03  85.30±0.02  87.54±0.02     88.4/88.5*
      ImageNet ReaL       90.72±0.05  90.54±0.03  88.62±0.05  90.54          90.55
      CIFAR-10            99.50±0.06  99.42±0.03  99.15±0.03  99.37±0.06     -
      CIFAR-100           94.55±0.04  93.90±0.05  93.25±0.05  93.51±0.08     -
      Oxford-IIIT Pets    97.56±0.03  97.32±0.11  94.67±0.15  96.62±0.23     -
      Oxford Flowers-102  99.68±0.02  99.74±0.00  99.61±0.02  99.63±0.03     -
      VTAB (19 tasks)     77.63±0.23  76.28±0.46  72.72±0.21  76.29±1.70     -
      TPUv3-core-days     2.5k        0.68k       0.23k       9.9k           12.3k

      ViT models pre-trained on JFT-300M outperform the ResNet-based baselines on all datasets while
      taking substantially less compute to pre-train; ViT pre-trained on the smaller public ImageNet-21k
      also performs well. *Slightly improved 88.5% result reported in Touvron et al. (2020).
      (The parameter counts of these models are on the order of hundreds of millions; TPUv3-core-days is
      the number of TPUv3 cores used for training multiplied by the training time in days.)
      -> SoTA on all of the datasets

  47. Feature representations acquired by ViT
      - Evaluating what kind of features ViT captures [1]
        Images are style-transferred with a GAN (e.g. a cat given elephant texture) and classified by a
        CNN or a ViT: answering "cat" means the model captures shape, answering "elephant" means it
        captures texture
        Evaluation targets: ViT, CNNs, humans
        Metric: shape fraction = (decisions for the correct shape class) / (decisions for the correct
        shape class + decisions for the correct texture class)
      [1] S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?", arXiv

  48. Feature representations acquired by ViT
      - CNNs put more weight on texture, whereas ViT puts more weight on object shape
      (Figure 2 of R. Geirhos, "ImageNet-trained CNNs are biased towards texture; increasing shape bias
      improves accuracy and robustness", ICLR: accuracies and example stimuli for five experiments
      without cue conflict — original, greyscale, silhouette, edges, and texture — for AlexNet,
      GoogLeNet, VGG-16, ResNet-50, and humans.)
      (Fig. 5 of S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?",
      arXiv: shape bias on the Stylized ImageNet (SIN) dataset per shape category — the fraction of
      "shape" versus "texture" decisions for ResNet-50, AlexNet, VGG-16, GoogLeNet, ViT-B/16, ViT-L/32,
      and humans, with vertical lines indicating averages; ViT and humans lean towards shape, the CNNs
      towards texture.)

  49. ViT-based segmentation: SegFormer [Xie, arXiv]
      - Applies the Transformer to the segmentation task
      - MixTransformer (MiT) encoder
        A hierarchical Transformer that captures multi-level (coarse and fine) features
        A structure that reduces the computational cost (efficient self-attention, Mix-FFN, overlapped
        patch merging)
      - A lightweight and simple All-MLP decoder
        The Transformer captures both local and global features
        The MLP complements the local features, yielding a powerful representation
      The decoder is formulated as (a code sketch follows below):
      F_hat_i = Linear(C_i, C)(F_i)  for all i      (unify the channel dimension of each stage)
      F_hat_i = Upsample(H/4 x W/4)(F_hat_i)  for all i
      F = Linear(4C, C)(Concat(F_hat_i))            (fuse the concatenated multi-level features)
      M = Linear(C, N_cls)(F)                       (predict the mask at H/4 x W/4 x N_cls resolution)
      where M is the predicted mask and Linear(C_in, C_out)(.) is a linear layer with C_in input and
      C_out output dimensions.
      Effective receptive field (ERF) analysis on Cityscapes (Figure 3 of the paper, averaged over 100
      images; top row: DeepLabv3+, bottom row: SegFormer; red box: ERF of the Stage-4 self-attention,
      blue box: ERF of the MLP in the MLP layer):
      -> the CNN-based model captures only local features, whereas the Transformer-based model captures
      both local and global features
      -> the MLP's receptive field is denser in local regions, so accurate segmentation of small objects
      can be expected
      E. Xie, "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
      arXiv

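      The four decoder equations map directly onto a small module; the sketch below uses 1x1
      convolutions as position-wise linear layers, and the stage channel counts are assumed from a
      typical MiT configuration rather than taken from the paper's exact models.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AllMLPDecoder(nn.Module):
            def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256, num_classes=19):
                super().__init__()
                self.linears = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)  # Linear(C_i, C)
                self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)              # Linear(4C, C)
                self.classify = nn.Conv2d(embed_dim, num_classes, 1)                           # Linear(C, N_cls)

            def forward(self, feats):                      # feats: stage outputs at strides 4, 8, 16, 32
                target = feats[0].shape[-2:]               # the H/4 x W/4 resolution
                ups = [F.interpolate(lin(f), size=target, mode="bilinear", align_corners=False)
                       for lin, f in zip(self.linears, feats)]
                return self.classify(self.fuse(torch.cat(ups, dim=1)))   # mask logits, H/4 x W/4 x N_cls

        feats = [torch.randn(1, c, 128 // s, 128 // s)
                 for c, s in zip((64, 128, 320, 512), (1, 2, 4, 8))]     # dummy multi-level features
        print(AllMLPDecoder()(feats).shape)                              # torch.Size([1, 19, 128, 128])
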
  50. ViT-based segmentation: SegFormer [Xie, arXiv]
      Ablations reported in the paper (mIoU on ADE20K unless noted otherwise):
      - Decoder channel dimension C (Flops / Params / mIoU): 256: 25.7 / 24.7 / 44.9; 512: 39.8 / 25.8 /
        45.0; 768: 62.4 / 27.5 / 45.4; 1024: 93.6 / 29.6 / 45.2; 2048: 304.4 / 43.4 / 45.6. Performance
        plateaus for channel dimensions wider than 768, so C = 256 is chosen for the real-time models.
      - Positional encoding vs. Mix-FFN (Cityscapes mIoU by inference resolution): PE at 768x768: 77.3;
        PE at 1024x2048: 74.0; Mix-FFN at 768x768: 80.5; Mix-FFN at 1024x2048: 79.8.
      - Encoder comparison (Flops / Params / mIoU): ResNet50 (S1-4): 69.2 / 29.0 / 34.7; ResNet101
        (S1-4): 88.7 / 47.9 / 38.7; ResNeXt101 (S1-4): 127.5 / 86.8 / 39.8; MiT-B2 (S4): 22.3 / 24.7 /
        43.1; MiT-B2 (S1-4): 62.4 / 27.7 / 45.4; MiT-B3 (S1-4): 79.0 / 47.3 / 48.6.
      Table 2 of the paper: comparison with the state of the art on ADE20K and Cityscapes (Params;
      ADE20K Flops / FPS / mIoU; Cityscapes Flops / FPS / mIoU). For SegFormer-B0, the short side of the
      image is scaled to {1024, 768, 640, 512} to obtain speed-accuracy trade-offs.
      Real-time:
      - FCN, MobileNetV2: 9.8; 39.6 / 64.4 / 19.7; 317.1 / 14.2 / 61.5
      - ICNet: -; -; - / 30.3 / 67.7
      - PSPNet, MobileNetV2: 13.7; 52.9 / 57.7 / 29.6; 423.4 / 11.2 / 70.2
      - DeepLabV3+, MobileNetV2: 15.4; 69.4 / 43.1 / 34.0; 555.4 / 8.4 / 75.2
      - SegFormer (Ours), MiT-B0: 3.8; 8.4 / 50.5 / 37.4; 125.5 / 15.2 / 76.2 (and, at smaller input
        scales on Cityscapes: 51.7 / 26.3 / 75.3; 31.5 / 37.1 / 73.7; 17.7 / 47.6 / 71.9)
      Non real-time:
      - FCN, ResNet-101: 68.6; 275.7 / 14.8 / 41.4; 2203.3 / 1.2 / 76.6
      - EncNet, ResNet-101: 55.1; 218.8 / 14.9 / 44.7; 1748.0 / 1.3 / 76.9
      - PSPNet, ResNet-101: 68.1; 256.4 / 15.3 / 44.4; 2048.9 / 1.2 / 78.5
      - CCNet, ResNet-101: 68.9; 278.4 / 14.1 / 45.2; 2224.8 / 1.0 / 80.2
      - DeeplabV3+, ResNet-101: 62.7; 255.1 / 14.1 / 44.1; 2032.3 / 1.2 / 80.9
      - OCRNet, HRNet-W48: 70.5; 164.8 / 17.0 / 45.6; 1296.8 / 4.2 / 81.1
      - GSCNN, WideResNet38: -; -; - / - / 80.8
      - Axial-DeepLab, AxialResNet-XL: -; -; 2446.8 / - / 81.1
      - Dynamic Routing, Dynamic-L33-PSP: -; -; 270.0 / - / 80.7
      - Auto-Deeplab, NAS-F48-ASPP: -; - / - / 44.0; 695.0 / - / 80.3
      - SETR, ViT-Large: 318.3; - / 5.4 / 50.2; - / 0.5 / 82.2
      - SegFormer (Ours), MiT-B4: 64.1; 95.7 / 15.4 / 51.1; 1240.6 / 3.0 / 83.8
      - SegFormer (Ours), MiT-B5: 84.7; 183.3 / 9.8 / 51.8; 1447.6 / 2.5 / 84.0
      (Figure 4 of the paper — qualitative results on Cityscapes: compared with SETR, SegFormer predicts
      masks with substantially finer detail near object boundaries; compared with DeepLabV3+, it reduces
      long-range errors, highlighted in red.)
      E. Xie, "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
      arXiv

  51. The possibilities SegFormer opens up
      Demo video: https://www.youtube.com/watch?v=JMoRQzZeU
      Under noise, the CNN-based model's performance degrades, while SegFormer remains robust
      -> because the Transformer learns object shape, it is less affected by noise

  52. Multi-head SegFormer
      - SegFormer is given multiple heads so that it can handle multiple datasets
        A decoder (Concat + MLP) is prepared for each dataset (domains A, B, and C)
        A DA module is added to each Transformer block (efficient self-attention, Mix-FFN, DA module,
        overlap patch merging, repeated N times)
      (Encoder: Transformer blocks 1-4 with MLP layers; Decoder: one head per domain.)

  53. Multi-head SegFormer
      - Comparison of the multi-domain SegFormer (DA module added to the Transformer blocks) with the
        CNN-based method
      (Tables: Mean IoU for each train/val combination of Cityscapes, BDD, and Synscapes, for the
      DeepLab-based model and for SegFormer.)
      -> SegFormer is highly effective in the multi-domain setting as well

  54. Summary: semantic segmentation by deep learning
      - CNN-based state of the art: DeepLabV3+, HRNet, OCR
      - Handling multiple domains: introducing the DA module enables multi-domain training
        (a shared ResNet101 + DA module with ASPP and per-dataset heads for Cityscapes, A2D2, and
        Mapillary)
      - Effect of SegFormer: the Vision Transformer improves robustness to noise
        (a hierarchical Transformer encoder extracting coarse and fine features plus a lightweight
        All-MLP decoder)

  55. References
      [Shotton, CVPR] Jamie Shotton, Matthew Johnson, Roberto Cipolla, "Semantic texton forests for
      image categorization and segmentation", CVPR.
      [Badrinarayanan, PAMI] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, "SegNet: A deep
      convolutional encoder-decoder architecture for image segmentation", PAMI.
      [Long, CVPR] Jonathan Long, Evan Shelhamer, Trevor Darrell, "Fully convolutional networks for
      semantic segmentation", CVPR.
      [Ronneberger, MICCAI] Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional
      networks for biomedical image segmentation", MICCAI.
      [Zhao, CVPR] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia, "Pyramid scene
      parsing network", CVPR.
      [Chen, ECCV] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam,
      "Encoder-decoder with atrous separable convolution for semantic image segmentation", ECCV.
      [Wang, PAMI] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong
      Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao, "Deep high-resolution
      representation learning for visual recognition", PAMI.
      [Yuan, ECCV] Yuhui Yuan, Xilin Chen, Jingdong Wang, "Object-contextual representations for
      semantic segmentation", ECCV.
      [Cordts, CVPR] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
      Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele, "The Cityscapes dataset for semantic
      urban scene understanding", CVPR.
      [Lee, ICLR] Kuan-Hui Lee, German Ros, Jie Li, Adrien Gaidon, "SPIGAN: Privileged adversarial
      learning from simulation", ICLR.
      [Lai, CVPR] Xin Lai, Zhuotao Tian, Li Jiang, Shu Liu, Hengshuang Zhao, Liwei Wang, Jiaya Jia,
      "Semi-supervised semantic segmentation with directional context-aware consistency", CVPR.

  56. Machine Perception and Robotics Group, Chubu University
      Professor Hironobu Fujiyoshi  E-mail: [email protected]
      Completed the doctoral program at Chubu University in 1997; Postdoctoral Fellow at the Robotics
      Institute, Carnegie Mellon University, 1997; Lecturer, Department of Computer Science, Chubu
      University, 2000; Associate Professor, Chubu University, 2004; Visiting Researcher, Robotics
      Institute, Carnegie Mellon University, 2005-2006; Professor, Chubu University, 2010; Visiting
      Professor, Nagoya University, 2014. Research interests: computer vision, video processing, and
      pattern recognition and understanding. Awards include the RoboCup Research Award (2005), the IPSJ
      Transactions on CVIM Outstanding Paper Award (2009), the IPSJ Yamashita Memorial Research Award
      (2009), the SSII Best Academic Award (2010, 2013, 2014), and the IEICE Information and Systems
      Society Paper Award (2013), among others.
      Professor Takayoshi Yamashita  E-mail: [email protected]
      Completed the master's program at the Nara Institute of Science and Technology in 2002; joined
      OMRON Corporation in 2002; completed the doctoral program at Chubu University in 2009 (working
      doctorate); Lecturer, Chubu University, 2014; Associate Professor, 2017; Professor, 2021. Research
      interests: video processing, pattern recognition, and machine learning for human understanding.
      Awards include the SSII Takagi Award (2009), the IEICE Information and Systems Society Paper Award
      (2013), and the IEICE PRMU Research Encouragement Award (2013).
      Lecturer Tsubasa Hirakawa  E-mail: [email protected]
      Completed the master's program at Hiroshima University in 2013 and entered the doctoral program in
      2014; Researcher, Chubu University, 2017-2019; completed the doctoral program at Hiroshima
      University in 2017; Research Assistant Professor, Chubu University, 2019; Lecturer, Chubu
      University, 2021. JSPS Research Fellow DC1, 2014; Visiting Researcher, ESIEE Paris, 2014-2015.
      Research interests: computer vision, pattern recognition, and medical image processing.