Slide 1

Slide 1 text

Research Group on Biological-Function Volume Data Analysis, Japanese Society of Microscopy
Semantic Segmentation by Deep Learning and Its Latest Trends
Hironobu Fujiyoshi (Chubu University, Machine Perception and Robotics Group)
http://mprg.jp

Slide 2

Slide 2 text

What is generic object recognition?

Slide 3

Slide 3 text

The ultimate goal of computer vision (image recognition)?
Generic object recognition: given an image of an unconstrained real-world scene, a computer recognizes the objects contained in it by their generic category names.
- Before deep learning: regarded as an extremely difficult problem.
- After deep learning: the problem has started to be solved.

Slide 4

Slide 4 text

Subdividing the generic object recognition problem:
- Is this a building? / Is this a "Stop" sign? → matching (specific object recognition)
- Where are the people? → object detection
- What is this an image of? → image classification
- What kind of scene is this? → scene understanding
Specific object recognition vs. generic object recognition.

Slide 5

Slide 5 text

The semantic segmentation task
- The problem of estimating an object category for every pixel of an image.
- Before deep learning: hand-crafted features + a machine-learning algorithm, e.g. Semantic texton forests [Shotton, CVPR'08] (random forests over simple pixel-comparison features).
(Figure, from the original paper: (a) a decision forest consists of T decision trees; a feature vector descends each tree, giving a root-to-leaf path and a class distribution at the leaf, and the method exploits both the hierarchical clustering implicit in the tree structure and the node class distributions. (b) Split nodes in semantic texton forests use simple functions of raw image pixels within a d×d patch: the raw value of a single pixel, or the sum, difference, or absolute difference of a pair of pixels.)

Slide 6

Slide 6 text

CNN architectures for each image-recognition task
- The CNN structure is designed to match the task:
  - Image classification: the CNN outputs class probabilities for the whole image (e.g. "Person").
  - Object detection: the CNN outputs class probabilities and detection regions per grid cell.
  - Semantic segmentation: the CNN outputs class probabilities for every pixel.
(Legend: convolution layers, pooling layers, upsampling layers; input → CNN → output, with a visualization of each output.)

Slide 7

Slide 7 text

Semantic segmentation by CNNs
- SegNet [Badrinarayanan, PAMI'17]: an encoder-decoder architecture.
- Pooling indices: the positions of the pooling maxima are memorized in the encoder and reused during upsampling in the decoder.
(Figure, from the original paper: SegNet predictions on urban and highway scene test samples from the wild; class colour codes follow Brostow et al.; online demo at http://mi.eng.cam.ac.uk/projects/segnet/.)
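The pooling-indices mechanism can be sketched in plain Python (an illustrative toy, not the authors' implementation): 2×2 max pooling records the argmax position of each window, and unpooling scatters the pooled values back to exactly those positions, filling the rest with zeros.

```python
def max_pool_2x2_with_indices(x):
    """x: 2D list (H x W, both even). Returns (pooled, indices)."""
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for i in range(0, h, 2):
        prow, irow = [], []
        for j in range(0, w, 2):
            # Each 2x2 window keeps its max value and the position it came from.
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            val, pos = max(window)
            prow.append(val)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def max_unpool_2x2(pooled, indices, h, w):
    """Scatter pooled values back to their recorded positions (zeros elsewhere)."""
    out = [[0.0] * w for _ in range(h)]
    for prow, irow in zip(pooled, indices):
        for val, (i, j) in zip(prow, irow):
            out[i][j] = val
    return out

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
p, idx = max_pool_2x2_with_indices(x)
y = max_unpool_2x2(p, idx, 4, 4)
```

This is the key difference from bilinear upsampling: the decoder restores values at the exact locations the encoder took them from, which preserves boundary detail cheaply.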

Slide 8

Slide 8 text

Network architectures for semantic segmentation: FCN (fully convolutional network), U-Net, PSPNet.
- J. Long, "Fully Convolutional Networks for Semantic Segmentation", CVPR.
- O. Ronneberger, "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI.
- H. Zhao, "Pyramid Scene Parsing Network", CVPR.

Slide 9

Slide 9 text

Semantic segmentation performance
- Accuracy has improved alongside progress in object-recognition tasks.
- Cityscapes val-set benchmark (mean IoU per class vs. publication year): SegNet, DeepLab, DeepLabV2, PSPNet, DeepLabV3, DeepLabV3+, HRNet+OCR (the CNN-based state of the art), SegFormer.
- Cityscapes is a dataset of urban driving frames.

Slide 10

Slide 10 text

DeepLabv3+ [Chen, ECCV]
- Atrous convolution avoids shrinking the feature-map size.
(Figure: a standard network vs. a network with atrous convolution; Atrous Spatial Pyramid Pooling; atrous convolution.)
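A minimal 1D sketch of atrous (dilated) convolution: with dilation rate r the kernel taps are spaced r samples apart, so the receptive field grows without adding parameters or reducing the output resolution (assuming "same"-style zero padding; the function name and toy values are illustrative).

```python
def atrous_conv1d(x, kernel, rate):
    """'Same' dilated convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    span = (k - 1) * rate          # receptive field minus one
    pad = span // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t in range(k):
            j = i - pad + t * rate  # taps spaced `rate` samples apart
            if 0 <= j < len(x):
                acc += kernel[t] * x[j]
        out.append(acc)
    return out

x = [0, 0, 0, 1, 0, 0, 0]
y1 = atrous_conv1d(x, [1, 1, 1], rate=1)  # taps at offsets -1, 0, +1
y2 = atrous_conv1d(x, [1, 1, 1], rate=2)  # taps at offsets -2, 0, +2
```

With rate 1 the impulse spreads to its immediate neighbours; with rate 2 it reaches two samples away, i.e. the same 3-tap kernel covers a wider context, which is exactly why ASPP runs several rates in parallel.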

Slide 11

Slide 11 text

DeepLabv3+ [Chen, ECCV]
- Atrous Spatial Pyramid Pooling (ASPP): convolutions with different dilation rates are processed in parallel → a variety of receptive fields is taken into account.
- Encoder-decoder structure: improves recognition accuracy around object boundaries.
(Figure: encoder with an atrous-convolution DCNN and ASPP branches of 1×1 conv, 3×3 conv at rates 6/12/18, and image pooling; decoder that concatenates the ×4-upsampled encoder output with low-level features, applies a 3×3 conv, and upsamples by 4 for the prediction.)
L. Chen, "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", ECCV.

Slide 12

Slide 12 text

HRNet [Wang, PAMI]
- Parallel processing with high-resolution and low-resolution subnetworks, capturing both local and global features.
- Connections between the subnetworks: feature maps at each scale are summed so that information is shared across resolutions.

Slide 13

Slide 13 text

OCR [Yuan, ECCV'20]
- Proposes object-contextual representations: intermediate object-region features are used as a soft attention, and the representations of the object regions are aggregated with weighting.
(Pipeline: backbone → pixel representations → pixel-region relation → object contextual representations → augmented representations; the soft object regions and object-region representations are supervised with an auxiliary loss.)
- Per-pixel features are obtained using the representations of object classes; the representations of pixels belonging to each object region are aggregated; the object-contextual representation is the weighted aggregation of object-region representations (a soft attention, in contrast to ASPP).

Slide 14

Slide 14 text

Application examples of semantic segmentation
- Scene analysis: autonomous driving, robot vision, satellite image analysis.
- Medical image analysis: organ segmentation, lesion segmentation.
- Industrial inspection: detection of anomalous regions.
References:
- H. Alemohammad, "LandCoverNet: A global benchmark land cover classification training dataset", NeurIPS (satellite image analysis).
- H. Roth, "An application of cascaded 3D fully convolutional networks for medical image segmentation", CMIG (organ segmentation).
- K. Kamnitsas, "Efficient Multi-Scale 3D CNN with Fully Connected CRF for Accurate Brain Lesion Segmentation", MedIA (lesion segmentation).
- Saito, "Development of a deep-learning system for detecting deteriorated regions of concrete revetments", Digital Practice (anomaly detection).

Slide 15

Slide 15 text

Semantic segmentation performance across different domains
- The limit of CNN generalization: a CNN is effective only in the domain it was trained on.
- Training data: captured in Europe (Cityscapes dataset). Test data: captured in Japan.

Slide 16

Slide 16 text

Semantic segmentation performance across different domains
- The limit of CNN generalization: a CNN is effective only in the domain it was trained on.
- Training data: captured in Japan. Test data: captured in Japan.

Slide 17

Slide 17 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- A multi-head design realizes semantic segmentation over multiple datasets.
- Shared network: ResNet101 with a Domain Attention (DA) module applied.
(Architecture: shared network = ResNet101 + DA module → ASPP → concat → a multi-head structure of per-dataset heads, each 3×3 conv → 3×3 conv → 1×1 conv, for Cityscapes, A2D2, and Mapillary.)

Slide 18

Slide 18 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Domain Attention (DA) module:
  - SE Adapter: each branch extracts information specific to one domain.
  - Domain Assignment: computes an attention over the domains and weights the branches accordingly.
(Module diagram: a residual block feeds parallel SE modules A/B/C, one per domain (C×1 outputs, concatenated to C×N), and a Domain Assignment path (GAP → FC → softmax, N×1); the softmax attention over the domains combines the SE outputs into a C×1 gate applied to the C×H×W features.)
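The gating performed by each SE Adapter branch can be sketched as follows (a toy squeeze-and-excitation step with hypothetical shapes and weights, not the proposed method's exact code): global average pooling per channel, a small fully connected transform, a sigmoid, then channel-wise reweighting of the feature map.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def se_gate(features, fc_weights):
    """features: list of C channels, each an HxW 2D list.
    fc_weights: CxC matrix acting on the pooled channel descriptor."""
    # Squeeze: global average pooling per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in features]
    # Excite: FC + sigmoid produces one gate value per channel.
    gates = [sigmoid(sum(w * d for w, d in zip(wrow, desc)))
             for wrow in fc_weights]
    # Scale: reweight each channel by its gate.
    return [[[g * v for v in row] for row in ch]
            for g, ch in zip(gates, features)]

# Toy example: C = 2 channels of size 1x1, identity FC weights.
features = [[[2.0]], [[0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gated = se_gate(features, identity)
```

In the DA module, several such branches run in parallel (one per domain) and the Domain Assignment softmax decides how much each branch's gate contributes.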

Slide 19

Slide 19 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Domain Attention (DA) module (same diagram as the previous slide): the SE Adapter branches extract domain-specific information, and Domain Assignment computes an attention over the domains to weight them.

Slide 20

Slide 20 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Domain Attention (DA) module: the SE Adapter branches extract domain-specific information, and Domain Assignment weights them by an attention over the domains.
→ For each domain, the optimal features are obtained from the SE Adapters.

Slide 21

Slide 21 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Multi-head structure: a dedicated output head is prepared for each dataset, so datasets with different object-class sets can be trained together (each head outputs its own number of classes).

Slide 22

Slide 22 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Training method: the per-head losses are accumulated and backpropagated simultaneously (MixLoss):
  L = Σ_{n=1}^{N} L_CE(x_n) = L_CE(x_1) + L_CE(x_2) + … + L_CE(x_N)
- Updating the shared-network parameters on all domains at once reduces the variance between domains.
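The MixLoss accumulation above can be sketched numerically (an illustrative toy under stated assumptions: one cross-entropy term per dataset head, summed into a single scalar that would then be backpropagated once through the shared network):

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def mix_loss(head_outputs, targets):
    """head_outputs: per-head class-probability vectors; targets: labels.
    Accumulates the per-head cross-entropy into one scalar."""
    return sum(cross_entropy(p, t) for p, t in zip(head_outputs, targets))

# Three heads with different class counts (e.g. Cityscapes / BDD / Synscapes).
heads = [[0.7, 0.2, 0.1], [0.5, 0.5], [0.1, 0.8, 0.05, 0.05]]
targets = [0, 1, 1]
loss = mix_loss(heads, targets)
```

Because the heads have different class counts, each term uses its own output vector; only the summed scalar reaches the shared backbone.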

Slide 23

Slide 23 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Experiment overview / conditions: input size … × … pixels; training for … epochs; optimizer: Momentum SGD; dataset combination: Cityscapes + BDD + Synscapes; evaluation metric: mean IoU.
- Cityscapes: in-vehicle images captured in German cities.
- BDD: images captured in US cities (New York, Berkeley, San Francisco, Bay Area).
- Synscapes: a dataset generated with photorealistic rendering.

Slide 24

Slide 24 text

Proposed method: Multi-Domain Semantic Segmentation using a Multi-Head Model
- Comparison by mean IoU: train/val on Cityscapes, BDD, and Synscapes, for the proposed method with and without the DA module.

Slide 25

Slide 25 text

Problems of semantic segmentation: annotation
- Annotation consistency: if annotation labels differ between annotators (e.g. annotator A and annotator B labeling trees differently), training does not go well.

Slide 26

Slide 26 text

Problems of semantic segmentation: annotation
- Annotation cost: for Cityscapes, annotating a single image reportedly takes on the order of 90 minutes, so the full dataset amounts to thousands of annotation hours.
M. Cordts, "The Cityscapes Dataset for Semantic Urban Scene Understanding", CVPR2016.

Slide 27

Slide 27 text

SPIGAN: Privileged Adversarial Learning from simulation [Lee, ICLR'19]
- Training with unannotated real images and a simulator:
  - Unlabeled real images + simulated (CG) images are input, and the CG images are style-transferred toward real images.
  - The simulator generates the CG images, annotations, and depth.
  - Target task 1: semantic segmentation. Target task 2: depth estimation.

Slide 28

Slide 28 text

Semi-supervised learning
- Semi-supervised Semantic Segmentation with Directional Context-aware Consistency [X. Lai, CVPR'21]
- Contrastive learning over two patches randomly cropped from unlabeled data (used together with labeled images):
  - Positive pairs: the features φ(o1) and φ(o2) at the same position in the overlapping region are pulled together.
  - Negative pairs: the features φ(u1) and φ(u2) at different positions are pushed apart.
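A generic sketch of the pull-together / push-apart idea (not the paper's exact directional loss; the margin formulation and values here are illustrative): positive pairs are penalized by their squared distance, negative pairs only while they are closer than a margin.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_loss(f1, f2, positive, margin=1.0):
    d = l2(f1, f2)
    if positive:
        return d ** 2                    # pull positives together
    return max(0.0, margin - d) ** 2     # push negatives beyond the margin

# Features at the same position of two overlapping crops (positive pair)
pos = pair_loss([0.9, 0.1], [1.0, 0.0], positive=True)
# Features at different positions (negative pair), already far apart
neg = pair_loss([0.9, 0.1], [0.0, 1.0], positive=False)
```

Once a negative pair is farther than the margin its loss is zero, so the gradient concentrates on pairs that actually violate the constraint.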

Slide 29

Slide 29 text

Semantic segmentation performance
- Accuracy has improved alongside progress in object-recognition tasks.
- Cityscapes val-set benchmark (mean IoU per class vs. publication year): SegNet, DeepLab, DeepLabV2, PSPNet, DeepLabV3, DeepLabV3+, HRNet+OCR, and the Transformer-based SegFormer.
- Cityscapes is a dataset of urban driving frames.

Slide 30

Slide 30 text

Transformer [Vaswani, NeurIPS'17]
- A model using only the attention mechanism; replaced RNNs and CNNs as the SoTA for text generation and translation.
- Built solely from attention: parallel computation like a CNN, and long-range dependency modeling like an RNN.
- Positional encoding: position information is embedded at each time step.
- Composed of self-attention blocks: correspondences between input and output can be captured over long ranges.
(Architecture: encoder and decoder stacks.)

Slide 31

Slide 31 text

Positional encoding
- Embeds position information at each time step, keeping the sequence in its correct order; intuitively, it adds the relative and absolute position information that RNNs and CNNs obtain implicitly (an RNN sees the words of "私 / は / リンゴ / が / 好き / です" at steps t = 1…6, but the Transformer does not know t, so it must be given explicitly).
- Formulation: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where d_model is the PE dimensionality, pos the position in the sequence, and i the index of the PE dimension.
→ The wavelengths of the positional encoding form a geometric progression from 2π to 10000·2π.
(Visualization: PE value by sequence position and PE dimension.)
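The formula above transcribes directly into plain Python (interleaved sin/cos over the d_model dimensions):

```python
import math

def positional_encoding(pos, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model));
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pe = []
    for k in range(d_model):
        i = k // 2  # even/odd dimensions share the same wavelength
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 8)  # position 0: alternating sin(0)=0, cos(0)=1
pe1 = positional_encoding(1, 8)
```

Each position thus gets a unique, smoothly varying vector, and because sin/cos of shifted arguments are linear combinations of each other, relative offsets are easy for the model to express.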

Slide 32

Slide 32 text

Scaled dot-product attention
- The key component of the Transformer: the module inside multi-head attention, used in both the encoder and the decoder; self-attention is built from Query, Key, and Value.
- Formulation: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the query dimensionality.
- Why the scaling: assuming Q and K have mean 0 and variance 1, their matrix product has mean 0 and variance d_k; when some dot products become very large, the softmax gradient for all but the largest element becomes very small. Scaling the Q·K values by √d_k restores mean 0 and variance 1, giving smoother gradients.
(From the original paper, Figure 2: (left) scaled dot-product attention; (right) multi-head attention consists of several attention layers running in parallel.)
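The formula can be transcribed in plain Python for a single head (a toy sketch with no batching or masking; Q, K, V are lists of d_k-dimensional row vectors):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)     # attention weights over the positions
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
y = attention(Q, K, V)  # a convex combination of the rows of V
```

Since the softmax weights sum to 1, each output row is a convex combination of the Value rows, weighted toward the Keys most similar to the Query.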

Slide 33

Slide 33 text

Self-attention in detail (1)
- From each embedded feature e_i (word embedding of x_i plus positional encoding), the Query, Key, and Value features are obtained by separate linear transformations: q_i = W_q e_i, k_i = W_k e_i, v_i = W_v e_i.
(Diagram: five input tokens x_1…x_5 are embedded to e_1…e_5, each producing (q_i, k_i, v_i).)

Slide 34

Slide 34 text

Self-attention in detail (2)
- The dot product of each Query with every Key is taken, and a softmax over each row yields the relevance between sequence positions, i.e. the attention weights: α̂ = softmax(QK^T / √d_k).
(Diagram: each q_i against k_1…k_5 gives a row [α_{i,1} … α_{i,5}], normalized by softmax to [α̂_{i,1} … α̂_{i,5}].)

Slide 35

Slide 35 text

Self-attention in detail (3)
- The attention weights are multiplied with the Value features and summed, so each output encodes the relationships between the features at different time steps: Attention(Q, K, V) = α̂V.
(Diagram: output_i = Σ_j α̂_{i,j} v_j.)

Slide 36

Slide 36 text

Overall processing of a Transformer block
- Self-attention: mixes the vectors along the time (sequence) direction, followed by normalization.
- Feed-forward: transforms each vector individually along the depth direction (W_feed1, W_feed2), followed by normalization.
- The dashed block is repeated N times.
(Walkthrough with the input sentence "私は犬が好きだ。": the word features (#words × #dims) are mapped to query/key/value via W_q, W_k, W_v; the query-key matrix product (#words × #words, softmax along the columns) gives α̂, where each row stores the importance of every other word for that word; α̂ is multiplied with the Values and projected by W_out.)

Slide 37

Slide 37 text

Decoder processing at inference time
- The encoder processes the input word sequence ("私は犬が好きだ。") into feature vectors.
- The decoder generates the output autoregressively: starting from EOS it predicts "I", then from "EOS I" predicts "like", and so on until END ("EOS I like dogs ."); each step runs masked multi-head self-attention, encoder-decoder multi-head attention (Q from the decoder, K and V from the encoder), feed-forward, normalization, and a classifier, with the dashed block repeated N times.
(From the original paper: the encoder and decoder are each a stack of N = 6 identical layers; a residual connection and layer normalization follow each sub-layer; masking in the decoder self-attention ensures that the prediction for position i depends only on the known outputs at positions less than i.)

Slide 38

Slide 38 text

Vision Transformer [Dosovitskiy, ICLR'21]
- An image-classification method applying the Transformer to the vision domain:
  - The image is split into fixed-size patches.
  - Each patch is flattened, and the Transformer encoder produces a feature for each patch.
  - SoTA on classification tasks such as ImageNet.
A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR.

Slide 39

Slide 39 text

What makes ViT stronger than a CNN?
- ViT captures features analogous to a CNN's receptive field [1]: the image is split into patches and the Transformer learns features between patches, capturing whole-image structure that a CNN cannot fully capture.
- CNN: captures features within a limited n×n region (the receptive field, grown by stacking convolutions). ViT: splits the image into patches and captures features over the entire image with the Transformer.
[1] J. Cordonnier, "On the Relationship between Self-Attention and Convolutional Layers", ICLR.

Slide 40

Slide 40 text

Network overview
- The input image is split into fixed-size patches, and the Transformer encoder extracts a feature for each patch.
(From the original paper, Figure 1: patches are linearly embedded, position embeddings are added (patch + position embedding, with an extra learnable [class] embedding), the sequence is fed to the Transformer encoder (L× blocks of Norm → multi-head attention → residual, Norm → MLP → residual), and an MLP head predicts the class.)

Slide 41

Slide 41 text

CLS token
- A new CLS token is added for the classification problem; it is a learnable parameter.
- The class prediction is made from the Transformer encoder's output at the CLS token.

Slide 42

Slide 42 text

Embedding / position embedding
- The input image is split into P×P patches; each flattened patch region x_p^i ∈ R^{P²·C} is embedded by E ∈ R^{(P²·C)×D} into a token, and the position embedding E_pos ∈ R^{(N+1)×D} is added before entering the ViT (N: number of patches, D: embedding dimensionality, C: number of channels, P: patch size; dimensions follow the base model).
- E_pos is a differentiable parameter: positions are not taught explicitly but learned automatically; the tokens are handled exactly like Transformer embedding features, and the image features aggregated by the ViT are classified at the final layer.
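The input pipeline just described can be sketched end to end with toy sizes (hypothetical weights; for simplicity the projection matrix is stored with rows as output dimensions, i.e. the transpose of E as written on the slide):

```python
def image_to_patches(img, P):
    """img: H x W grayscale image (C = 1). Returns N flattened P*P patches."""
    H, W = len(img), len(img[0])
    patches = []
    for i in range(0, H, P):
        for j in range(0, W, P):
            patches.append([img[i + di][j + dj]
                            for di in range(P) for dj in range(P)])
    return patches

def embed(patches, E, cls_token, E_pos):
    """Linear projection, prepend class token, add position embeddings."""
    tokens = [cls_token] + [
        [sum(e * x for e, x in zip(row, p)) for row in E] for p in patches]
    return [[t + pe for t, pe in zip(tok, pos)]
            for tok, pos in zip(tokens, E_pos)]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
patches = image_to_patches(img, 2)   # N = 4 patches, each of length P*P = 4
E = [[1, 0, 0, 0], [0, 0, 0, 1]]     # toy projection to D = 2 dimensions
E_pos = [[0, 0]] * 5                 # (N + 1) x D; zeros for simplicity
tokens = embed(patches, E, cls_token=[0, 0], E_pos=E_pos)
```

The result is a sequence of N + 1 D-dimensional tokens (class token first), exactly the input format of the Transformer encoder.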

Slide 43

Slide 43 text

Embedding / position embedding (continued)
- The learned linear embedding E resembles the filters of the low-level layers of a CNN (Figure 7, left, of the original paper: filters of the initial linear embedding of RGB values of ViT-L/32).

Slide 44

Slide 44 text

Embedding / position embedding (continued)
- Visualizing the cosine similarity between each position embedding and all the others (Figure 7, center, of the original paper, ViT-L/32): similarity is high to nearby positions and low to distant ones, i.e. nearby position embeddings are learned to take similar values.

Slide 45

Slide 45 text

Overall processing of a Vision Transformer block
- Same structure as the Transformer block: self-attention mixes the vectors along the spatial direction, the feed-forward layer transforms each vector individually along the depth direction, each followed by normalization, and the dashed block is repeated N times.
- Only the cls token is used to perform the final class prediction.
 Ϋϥε෼ྨΛߦ͏

Slide 46

Slide 46 text

Performance of the Vision Transformer
- Comparison with the SoTA in image classification: pre-training on JFT-300M, then transfer learning to each dataset.
(Table 2 of the original paper; top-1 accuracy for Ours-JFT (ViT-H/14) / Ours-JFT (ViT-L/16) / Ours-I21k (ViT-L/16) / BiT-L (ResNet152x4) / Noisy Student (EfficientNet-L2):)
- ImageNet: 88.55 / 87.76 / 85.30 / 87.54 / 88.4-88.5*
- ImageNet ReaL: 90.72 / 90.54 / 88.62 / 90.54 / 90.55
- CIFAR-10: 99.50 / 99.42 / 99.15 / 99.37 / -
- CIFAR-100: 94.55 / 93.90 / 93.25 / 93.51 / -
- Oxford-IIIT Pets: 97.56 / 97.32 / 94.67 / 96.62 / -
- Oxford Flowers-102: 99.68 / 99.74 / 99.61 / 99.63 / -
- VTAB (19 tasks): 77.63 / 76.28 / 72.72 / 76.29 / -
- TPUv3-core-days: 2.5k / 0.68k / 0.23k / 9.9k / 12.3k
(*Slightly improved 88.5% result reported in Touvron et al. (2020). TPUv3-core-days: the number of TPUv3 cores used for training multiplied by the training time in days.)
→ ViT pre-trained on JFT-300M achieves SoTA on all datasets while requiring substantially less compute to pre-train.

Slide 47

Slide 47 text

Feature representations in ViT
- Evaluating what kind of features ViT captures: images are style-transferred with a GAN (e.g. a cat image given elephant texture); the evaluation targets are ViT, CNNs, and humans.
- Metric, the shape fraction: (decisions for the correct shape class) / (decisions for the correct shape class + decisions for the correct texture class).
- If the model classifies the image as "cat" it is capturing shape; if as "elephant" it is capturing texture.
S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?", arXiv.

Slide 48

Slide 48 text

Feature representations in ViT
- CNNs emphasize texture, whereas ViT emphasizes object shape, closer to human vision.
(Figures from the cited papers: Geirhos et al. report accuracies for original, greyscale, silhouette, edge, and texture stimuli, where CNNs such as AlexNet, GoogLeNet, VGG-16, and ResNet-50 drop sharply on silhouettes and edges while humans remain near 100%; the shape-bias plot of Tuli et al. on the Stylized-ImageNet (SIN) dataset places ViT-B/16 and ViT-L/32 closer to the human average fraction of "shape" decisions than the CNNs.)
R. Geirhos, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", ICLR.
S. Tuli, "Are Convolutional Neural Networks or Transformers more like human vision?", arXiv.

Slide 49

Slide 49 text

w 5SBOTGPSNFSΛηάϝϯςʔγϣϯλεΫ΁Ԡ༻ .JY5SBOTGPSNFS w ϚϧνϨϕϧͳಛ௃Λ֫ಘՄೳͳ֊૚ܕ5SBOTGPSNFS w ܭࢉίετΛ࡟ݮ͢Δߏ଄ ܰྔ͔ͭγϯϓϧͳ.-1σίʔμΛ࠾༻ w 5SBOTGPSNFS͸ہॴత͔ͭ޿Ҭతͳಛ௃ΛऔಘՄೳ w .-1͸ہॴతͳಛ௃Λิ׬Ͱ͖ɼڧྗͳදݱΛ֫ಘՄೳ 7J5ϕʔεͷηάϝϯςʔγϣϯɿ4FH'PSNFS<9JF BS9JW> &9JF l4FH'PSNFS4JNQMFBOE& ff i DJFOU%FTJHOGPS4FNBOUJD4FHNFOUBUJPOXJUI5SBOTGPSNFST zBS9JW Overlap Patch Embeddings Transformer Block 1 MLP Layer ! " × # " ×"$ ! % × # % ×"& ! '& × # '& ×"" ! $( × # $( ×"' ! " × # " ×4" MLP ! " × # " ×$)*+ Transformer Block 2 Transformer Block 3 Transformer Block 4 Overlap Patch Merging Efficient Self-Attn Mix-FFN ×" UpSample MLP ! "!"# × # "!"# ×"$ ! "!"# × # "!"# ×" ! % × # % ×" Encoder Decoder Figure 2: The proposed SegFormer framework consists of two main modules: A hierarchical Transformer encoder to extract coarse and fine features; and a lightweight All-MLP decoder to directly fuse these multi-level features and predict the semantic segmentation mask. “FFN” indicates feed-forward network. i the MiT encoder go through an MLP layer to unify the channel dimension. Then, in a second step, features are up-sampled to 1/4th and concatenated together. Third, a MLP layer is adopted to fuse the concatenated features F. Finally, another MLP layer takes the fused feature to predict the segmentation mask M with a H 4 ⇥ W 4 ⇥ N cls resolution, where N cls is the number of categories. This lets us formulate the decoder as: ˆ F i = Linear(C i , C)(F i ), 8i ˆ F i = Upsample( W 4 ⇥ W 4 )( ˆ F i ), 8i F = Linear(4C, C)(Concat( ˆ F i )), 8i M = Linear(C, N cls )(F), (4) where M refers to the predicted mask, and Linear(C in , C out )(·) refers to a linear layer with C in and C out as input and output vector dimensions respectively. DeepLabv3+ SegFormer Stage-1 Stage-2 Stage-3 Head Stage-4 Figure 3: Effective Receptive Field (ERF) on Cityscapes (aver- age over 100 images). Top row: Deeplabv3+. Bottom row: Seg- Former. ERFs of the four stages and the decoder heads of both Effective Receptive Field Analysis. 
For semantic segmentation, maintaining a large receptive field that includes context information has been a central issue [5, 19, 20]. The authors use the effective receptive field (ERF) [70] as a toolkit to visualize and interpret why the MLP decoder design is so effective on Transformers, visualizing in Figure 3 the ERFs of the four encoder stages and the decoder heads of both models.

• Red box: size of the receptive field of the Stage-4 self-attention
• Blue box: size of the receptive field of the MLP in the MLP layer
• The effect of the receptive field is analyzed on Cityscapes
→ The CNN-based model captures only local features, while the Transformer-based model captures both local and global features
→ The MLP's receptive field is denser in local regions, so accurate segmentation of small objects can be expected
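One way to see why the CNN ERFs in Figure 3 stay local is that the theoretical receptive field of stacked convolutions grows only linearly in depth, whereas self-attention spans the whole feature map at every stage. A small sketch of the standard receptive-field recurrence (my own helper, not from the paper):

```python
def receptive_field(layers):
    """Theoretical receptive field of a convolution stack.

    `layers` is a sequence of (kernel_size, stride) pairs; each layer adds
    (k - 1) times the product of all preceding strides to the field size.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten stride-1 3x3 convolutions still cover only a 21x21 input window...
print(receptive_field([(3, 1)] * 10))  # 21
# ...and even five strided 3x3 convolutions reach just 63x63,
print(receptive_field([(3, 2)] * 5))   # 63
# while one self-attention layer already attends over the entire image.
```

The effective receptive field [70] is smaller still than these theoretical bounds, which is what the red boxes in Figure 3 visualize.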

Slide 50

Slide 50 text

ViT-based segmentation: SegFormer [Xie+, arXiv'21]
E. Xie et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", arXiv, 2021.

Ablation: MLP decoder channel dimension C

  C    | Flops ↓ | Params ↓ | mIoU ↑
  256  |   25.7  |   24.7   |  44.9
  512  |   39.8  |   25.8   |  45.0
  768  |   62.4  |   27.5   |  45.4
  1024 |   93.6  |   29.6   |  45.2
  2048 |  304.4  |   43.4   |  45.6

Wider channel dimensions lead to larger and less efficient models, and performance plateaus beyond C = 768; C = 256 is therefore chosen for the real-time SegFormer models.

Ablation: Mix-FFN vs. positional encoding (PE) under changed inference resolution

  Inference res. | Enc. type | mIoU ↑
  768×768        | PE        |  77.3
  1024×2048      | PE        |  74.0
  768×768        | Mix-FFN   |  80.5
  1024×2048      | Mix-FFN   |  79.8

Ablation: encoder backbone

  Encoder           | Flops ↓ | Params ↓ | mIoU ↑
  ResNet50 (S1-4)   |   69.2  |   29.0   |  34.7
  ResNet101 (S1-4)  |   88.7  |   47.9   |  38.7
  ResNeXt101 (S1-4) |  127.5  |   86.8   |  39.8
  MiT-B2 (S4)       |   22.3  |   24.7   |  43.1
  MiT-B2 (S1-4)     |   62.4  |   27.7   |  45.4
  MiT-B3 (S1-4)     |   79.0  |   47.3   |  48.6

[Table 2 of the paper: comparison to state-of-the-art methods on ADE20K and Cityscapes. Among real-time methods, SegFormer (MiT-B0, 3.8M params) reaches 37.4 mIoU on ADE20K and 76.2 on Cityscapes, ahead of MobileNetV2-based FCN, PSPNet, and DeepLabV3+; for SegFormer-B0 the short image side is scaled to {1024, 768, 640, 512} to obtain speed–accuracy trade-offs. Among non-real-time methods, SegFormer with MiT-B4/MiT-B5 reaches 51.1/51.8 mIoU on ADE20K and 83.8/84.0 on Cityscapes, surpassing SETR (ViT-Large, 50.2/82.2) with far fewer parameters and Flops. SegFormer has significant advantages in params, Flops, speed, and accuracy.]

[Figure 4: qualitative results on Cityscapes. Compared to SETR, SegFormer predicts masks with substantially finer details near object boundaries; compared to DeepLabV3+, SegFormer reduces long-range errors (highlighted in red). Best viewed on screen.]

Slide 51

Slide 51 text

Possibilities opened up by SegFormer

Demo video: https://www.youtube.com/watch?v=JMoRQzZeU

• CNN-based models: performance degrades under the influence of noise
• SegFormer: robust to noise
→ Because the Transformer learns object shape, it is less affected by noise
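Robustness claims like this are measured by re-computing mIoU on corrupted inputs (e.g. Cityscapes-C). A minimal NumPy mIoU over dense label maps (my own helper; the toy "noisy" prediction stands in for a model's output on a corrupted image):

```python
import numpy as np

def mean_iou(pred, gt, n_cls):
    """Mean intersection-over-union from dense label maps, skipping absent classes."""
    # Confusion matrix via bincount over flattened (gt, pred) index pairs.
    cm = np.bincount(n_cls * gt.ravel() + pred.ravel(),
                     minlength=n_cls * n_cls).reshape(n_cls, n_cls)
    inter = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - inter
    present = union > 0
    return (inter[present] / union[present]).mean()

gt = np.array([[0, 0, 1],
               [1, 2, 2]])
perfect = gt.copy()
noisy = np.array([[0, 1, 1],
                  [1, 2, 0]])            # prediction on a corrupted input
print(mean_iou(perfect, gt, 3))  # 1.0
print(mean_iou(noisy, gt, 3))    # 0.5
```

The robustness gap between architectures is then just the difference of this score on clean vs. corrupted versions of the same validation set.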

Slide 52

Slide 52 text

Multi-head SegFormer

• SegFormer is given multiple heads so that one model handles multiple datasets
  - One decoder is prepared per dataset (domain A / domain B / domain C)
  - A DA module is added to each Transformer block

[Architecture: a shared encoder of four Transformer blocks, each repeating (efficient self-attention → Mix-FFN → DA module) ×N with overlap patch merging, followed by a multi-head decoder: for each of domains A, B, and C, per-level MLP layers whose outputs are concatenated and fused by an MLP.]
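The shared-encoder / per-domain-decoder idea can be sketched independently of the backbone. Here a random projection stands in for the MiT encoder and a linear head for each domain's MLP decoder; all names and dimensions are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadSegmenter:
    """Shared encoder with one decoder head per dataset/domain."""

    def __init__(self, domains, feat_dim=8):
        # Stand-in for the shared encoder: a fixed random projection.
        self.enc_w = rng.standard_normal((3, feat_dim)) * 0.1
        # One stand-in decoder head per domain, each with its own class count.
        self.heads = {name: rng.standard_normal((feat_dim, n_cls)) * 0.1
                      for name, n_cls in domains.items()}

    def forward(self, img, domain):
        feats = np.tanh(img @ self.enc_w)    # shared features for all domains
        logits = feats @ self.heads[domain]  # domain-specific prediction
        return logits.argmax(-1)             # per-pixel class map

domains = {"cityscapes": 19, "bdd": 19, "synscapes": 19}
model = MultiHeadSegmenter(domains)
img = rng.standard_normal((64, 64, 3))      # toy HxWx3 "image"
for name in domains:
    assert model.forward(img, name).shape == (64, 64)
```

During training, each batch would be routed to the head matching its source dataset while gradients from every domain update the shared encoder; that is the part the DA module refines in the actual method.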

Slide 53

Slide 53 text

Multi-head SegFormer vs. a CNN-based method

• Comparison of the multi-domain SegFormer (with a DA module added to each Transformer block) against a CNN method (DeepLabv3+)
• Each model is trained and evaluated across combinations of Cityscapes, BDD, and Synscapes (rows: training set of the proposed method; columns: evaluation on Cityscapes / BDD / Synscapes)
→ SegFormer is highly effective in the multi-domain setting as well

Slide 54

Slide 54 text

Summary: semantic segmentation with deep learning

• CNN-based state of the art: DeepLabv3+, HRNet, OCR
• Multi-domain support: multi-domain learning by introducing a DA module
• Effectiveness of SegFormer: the Vision Transformer improves robustness to noise

[Figures: the multi-domain CNN architecture (ResNet101 + DA module with ASPP and a multi-head structure — a shared network with per-dataset heads for Cityscapes, A2D2, and Mapillary) and the SegFormer framework (hierarchical Transformer encoder plus lightweight All-MLP decoder, Figure 2 of the paper).]

Slide 55

Slide 55 text

References
[Shotton, CVPR'08] Jamie Shotton, Matthew Johnson, Roberto Cipolla, "Semantic texton forests for image categorization and segmentation", CVPR, 2008.
[Badrinarayanan, PAMI'17] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation", PAMI, 2017.
[Long, CVPR'15] Jonathan Long, Evan Shelhamer, Trevor Darrell, "Fully convolutional networks for semantic segmentation", CVPR, 2015.
[Ronneberger, MICCAI'15] Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation", MICCAI, 2015.
[Zhao, CVPR'17] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia, "Pyramid scene parsing network", CVPR, 2017.
[Chen, ECCV'18] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation", ECCV, 2018.
[Wang, PAMI'20] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao, "Deep high-resolution representation learning for visual recognition", PAMI, 2020.
[Yuan, ECCV'20] Yuhui Yuan, Xilin Chen, Jingdong Wang, "Object-contextual representations for semantic segmentation", ECCV, 2020.
[Cordts, CVPR'16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele, "The Cityscapes dataset for semantic urban scene understanding", CVPR, 2016.
[Lee, ICLR'19] Kuan-Hui Lee, German Ros, Jie Li, Adrien Gaidon, "SPIGAN: Privileged adversarial learning from simulation", ICLR, 2019.
[Lai, CVPR'21] Xin Lai, Zhuotao Tian, Li Jiang, Shu Liu, Hengshuang Zhao, Liwei Wang, Jiaya Jia, "Semi-supervised semantic segmentation with directional context-aware consistency", CVPR, 2021.

Slide 56

Slide 56 text

Machine Perception and Robotics Group, Chubu University

Professor Hironobu Fujiyoshi (E-mail: [email protected])
1997: completed the doctoral program at Chubu University. 1997: Postdoctoral Fellow, Robotics Institute, Carnegie Mellon University (USA). 2000: Lecturer, Department of Computer Science, Chubu University. 2004: Associate Professor, Chubu University. 2005: Visiting Researcher, Robotics Institute, Carnegie Mellon University (until 2006). 2010: Professor, Chubu University. 2014: Visiting Professor, Nagoya University.
Research interests: computer vision, video processing, and pattern recognition and understanding.
Awards: RoboCup Research Award (2005), IPSJ Transactions on CVIM Outstanding Paper Award (2009), IPSJ Yamashita Memorial Research Award (2009), SSII Best Academic Award (2010, 2013, 2014), IEICE Information and Systems Society Paper Award (2013), and others.

Professor Takayoshi Yamashita (E-mail: [email protected])
2002: completed the master's program at the Nara Institute of Science and Technology. 2002: joined OMRON Corporation. 2009: completed the doctoral program at Chubu University (working doctorate). 2014: Lecturer, Chubu University. 2017: Associate Professor, Chubu University. 2021: Professor, Chubu University.
Research interests: video processing toward human understanding, pattern recognition, and machine learning.
Awards: SSII Takagi Award (2009), IEICE Information and Systems Society Paper Award (2013), IEICE PRMU Research Encouragement Award (2013).

Lecturer Tsubasa Hirakawa (E-mail: [email protected])
2013: completed the master's program at Hiroshima University. 2014: entered the doctoral program at Hiroshima University. 2017: Researcher, Chubu University (until 2019). 2017: completed the doctoral program at Hiroshima University. 2019: Research Assistant Professor, Chubu University. 2021: Lecturer, Chubu University. 2014: JSPS Research Fellow (DC1). 2014: Visiting Researcher, ESIEE Paris (until 2015).
Research interests: computer vision, pattern recognition, and medical image processing.