A Survey of Recent Trends in Computer Vision
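Purpose of this survey
I want to understand recent trends in computer vision (CV) research!
• Learn about training methods
• Learn how network architectures have evolved
→ A broad, shallow introduction to how CV has changed since the shift to neural networks
Not covered this time
• Image and video generation in general
• Adversarial learning
• Semi-supervised learning
• Self-supervised learning
• Classical computer vision
and so on.
Today's outline
1. Trends in task-agnostic models (image recognition models)
2. Trends in models specialized for individual tasks
3. Summary
Before we start
• An outstanding collection of reference material
• This survey draws heavily on the following resources:
http://xpaperchallenge.org/cv/
https://github.com/hirokatsukataoka16/cvpaper.challenge-summary
1. Trends in task-agnostic models
Architectures and training methods (image recognition)
Timeline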
AlexNet [Krizhevsky+ NeurIPS 2012]
• Won ILSVRC 2012, an image recognition competition, by a wide margin
• Marked the start of the era of deep convolutional neural networks (CNNs)
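ResNet [He+ CVPR 2016]
• Winning model of ILSVRC 2015
• Introducing skip connections made it possible to train deep CNNs with as many as 152 layers
• Later image recognition models are, for the most part, refinements of ResNet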
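The skip connection can be illustrated with a toy residual block. This is only a sketch: it uses small fully connected maps in place of the convolutions and batch normalization of the actual ResNet, and the names `relu`, `linear`, and `residual_block` are made up for this example.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v, w1, w2):
    """Compute relu(v + F(v)), where F is two linear maps with a ReLU between them."""
    f = linear(w2, relu(linear(w1, v)))
    return relu([a + b for a, b in zip(v, f)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights the residual branch outputs zeros,
# so the block reduces to relu(x) and the input passes straight through.
print(residual_block(x, zeros, zeros))
```

Because the input is added back, a block whose residual branch contributes nothing behaves like the identity, which is one common intuition for why very deep stacks remain trainable.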
ResNeXt [Xie+ CVPR 2017]
• Splits the input into branches, processes each branch with its own network, and sums the results
WideResNet [Zagoruyko+ 2017]
• A ResNet made shallower and wider
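PyramidNet [Han+ CVPR 2017]
• To prevent the accuracy loss caused by abruptly increasing the width at downsampling points, widens the layers little by little across the whole network
SENet [Hu+ CVPR 2018]
• Compresses a layer's input, transforms the compressed descriptor with a small neural network, and uses the result to reweight the input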
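The squeeze, transform, and reweight steps of SENet can be sketched as follows. This is a minimal illustration rather than the paper's implementation: channels are plain Python lists, the weights are hand-picked, and `se_block`, `w_reduce`, and `w_expand` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def se_block(feature_maps, w_reduce, w_expand):
    """feature_maps: list of C channels, each a flat list of activations."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: a bottleneck MLP (ReLU, then sigmoid) yields one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Scale: every channel is multiplied by its gate.
    return [[g * x for x in ch] for g, ch in zip(gates, feature_maps)]


fmap = [[1.0, 3.0], [2.0, 2.0]]   # 2 channels, 2 positions each
w_reduce = [[0.5, 0.5]]           # 2 channels -> 1 hidden unit
w_expand = [[0.0], [0.0]]         # 1 hidden unit -> 2 gates; zero weights give gates of 0.5
print(se_block(fmap, w_reduce, w_expand))
```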
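DenseNet [Huang+ CVPR 2017 (best paper)]
• Each layer is connected to all preceding layers through skip connections
MobileNet v1-3 [Howard+ 2017, Sandler+ 2018, Howard+ 2019]
• Makes convolution lightweight by combining a depthwise convolution, which convolves only along the spatial dimensions, with a pointwise convolution, which convolves only along the channel dimension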
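To see why this factorization is lighter, one can count weights. The sketch below compares a standard convolution with a depthwise plus pointwise pair; the function names are made up for this example, biases are ignored, and the layer sizes are arbitrary.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution mapping c_in channels to c_out channels."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))
```

For this example layer the separable version needs 8,768 weights instead of 73,728, roughly 8.4 times fewer.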
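PNASNet [Liu+ 2017]
• A model obtained through neural architecture search (NAS)
• Searches for a "cell" made up of several CNN blocks rather than for the whole CNN
• Searches progressively, from simple structures toward more complex ones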
EfficientNet [Tan&Le ICML 2019]
• Combines all of the model scale-up techniques used so far
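One way to read "combining the scale-up techniques" is EfficientNet's compound scaling, where depth, width, and input resolution grow together under a single coefficient. The sketch below assumes the base constants reported in the paper (1.2, 1.1, 1.15); treat them, and the function names, as assumptions of this example.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (assumed from the paper)


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    print(phi, [round(m, 3) for m in compound_scale(phi)], round(flops_multiplier(phi), 2))
```

With these constants each unit increase of the coefficient multiplies the estimated FLOPs by about 1.92, close to a doubling.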
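Noisy Student Training [Xie+ CVPR 2020]
• A self-training (semi-supervised) method in which a trained student becomes the teacher for the next, progressively larger student
• Adding noise to the student improves robustness as well as accuracy
BiT [Kolesnikov+ 2019]
• Pre-trains a large-scale model with about one billion parameters
• Works well even when the target task has little data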
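Vision Transformer (ViT) [Dosovitskiy+ ICLR 2021]
• State-of-the-art image recognition with a Transformer
2. Trends in models specialized for individual tasks
Object detection
Generic object detection [https://pjreddie.com/media/files/papers/YOLOv3.pdf]
• Predict the class and location of each object in an image
Timeline [Zou+ 2020, Object Detection in 20 Years: A Survey]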
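R-CNN [Girshick+ CVPR 2014]
• Crops candidate regions that may contain an object and extracts features from each with a CNN
Fast R-CNN [Girshick ICCV 2015]
• First computes a feature map for the whole image, then projects the candidate regions (ROIs) onto that feature map
• Object classification and bounding box regression are also done by the neural network
• Convolution runs once per image instead of once per candidate region, which makes it faster
Faster R-CNN [Ren+ NeurIPS 2015]
• Trains end to end, including the proposal of candidate regions (ROIs)
YOLO v1-4 [Redmon+ CVPR 2016, CVPR 2017, 2018, Bochkovskiy+ 2020]
• A one-stage method that performs object localization and classification in a single pass
• Outputs class probabilities, confidence scores, and bounding box information
SSD [Liu+ ECCV 2016]
• A one-stage method, like YOLO
• Runs inference for each of a set of bounding boxes prepared in advance
• Detects objects at various scales by extracting features from the feature maps of multiple layers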
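RetinaNet [Lin+ ICCV 2017]
• Points out that the class imbalance between foreground and background is the reason one-stage methods trail two-stage methods in accuracy
• Proposes Focal Loss to handle the class imbalance, achieving high detection accuracy with a one-stage model
• Uses the Feature Pyramid Network (described later) as its base architecture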
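The effect of Focal Loss on easy examples can be checked numerically. The sketch below implements the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), with the commonly cited defaults gamma = 2 and alpha = 0.25; the function names are chosen for this example.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled down for well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p in (0.1, 0.5, 0.9):
    print(p, round(cross_entropy(p, 1), 4), round(focal_loss(p, 1), 6))
```

A confidently correct example (p = 0.9) keeps only a fraction of a percent of its cross-entropy loss, so the many easy background boxes no longer dominate training.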
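FCOS [Tian+ ICCV 2019]
• An improved version of RetinaNet
• Additionally estimates the object center, achieving anchor-free object detection
Bridging the Gap Between Anchor-based and Anchor-free Detection [Zhang+ 2019]
• The difference between anchor-based and anchor-free detectors lies in how positive and negative samples are selected
Segmentation
Segmentation [https://arxiv.org/pdf/1706.05587.pdf]
• Classify every pixel as an object class or background
Timeline [Minaee+ 2020, Image Segmentation Using Deep Learning: A Survey]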
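FCN [Long+ CVPR 2015]
• Makes the output layer of the CNN convolutional as well, so that the network outputs a heatmap
SegNet [Badrinarayanan+ 2015]
• A network consisting of an encoder and a decoder, both fully convolutional
• Using a decoder allows deconvolution to be carried out in stages
U-Net [Ronneberger+ MICCAI 2015]
• Copies the encoder's feature representations to the decoder through skip connections
DeepLab v1-3 [Chen+ TPAMI 2017]
• Removes downsampling and combines dilated convolution with bilinear interpolation to achieve high-resolution segmentation
[Cui+ Remote Sens. 2019]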
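Dilated convolution enlarges the receptive field without downsampling by spacing out the kernel taps. The one-dimensional sketch below is illustrative only (the function name is made up, and, as in most deep learning frameworks, it computes a cross-correlation without padding).

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D cross-correlation with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation  # distance between the first and last tap
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
print(dilated_conv1d(signal, kernel, dilation=1))  # each output covers 3 adjacent samples
print(dilated_conv1d(signal, kernel, dilation=2))  # same 3 weights, but each output spans 5 samples
```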
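FastFCN [Wu+ 2019]
• Introduces Joint Pyramid Upsampling (JPU), which greatly reduces the computational cost compared with dilated convolution
Mask R-CNN [He+ ICCV 2017]
• A Faster R-CNN that predicts a per-class mask in addition to the bounding box
• Replacing RoIPool with RoIAlign makes tasks such as segmentation possible
PSPNet [Zhao+ CVPR 2017]
• Obtains multi-scale feature representations through pooling at various scales
FPN [Lin+ CVPR 2017]
• Exploits the hierarchical structure of a CNN, predicting at each level to obtain multi-scale features
• Passing features from near the output back toward the input side lets even shallow layers extract semantically meaningful features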
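The top-down pathway of FPN, where coarse features are passed back to finer levels, can be sketched in one dimension. This is a simplified illustration: the 1 x 1 lateral convolutions and the smoothing convolutions of the actual architecture are left out, and the function names are hypothetical.

```python
def upsample2x(v):
    """Nearest-neighbour upsampling: repeat every element twice."""
    return [x for x in v for _ in range(2)]


def top_down_merge(pyramid):
    """pyramid: feature lists ordered fine to coarse, each half the length of the previous one."""
    merged = [pyramid[-1]]  # the coarsest level is used as it is
    for lateral in reversed(pyramid[:-1]):
        up = upsample2x(merged[0])
        # Add the upsampled coarse features to the finer level.
        merged.insert(0, [a + b for a, b in zip(lateral, up)])
    return merged


pyramid = [[1, 1, 1, 1], [2, 2], [3]]
print(top_down_merge(pyramid))
```

Every level of the result contains a contribution from the coarsest level, which is how the shallow, high-resolution maps receive semantic information.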
Visual Question Answering
Visual Question Answering [https://arxiv.org/pdf/1505.00468.pdf]
• Answer a natural-language question about an image
Timeline [Srivastava+ 2020, Visual Question Answering using Deep Learning: A Survey and Performance Analysis]
Datasets [Srivastava+ 2020, Visual Question Answering using Deep Learning: A Survey and Performance Analysis]
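VQA [Agrawal+ ICCV 2015]
• Embeds the question with an LSTM and the image with a CNN to build a joint feature representation
Stacked Attention Networks [Yang+ CVPR 2016]
• Applies attention to the CNN features in multiple stages, narrowing down the target step by step
Embodied Question Answering [Das+ CVPR 2018]
• Given a question, an agent acts in a simulated environment to find the answer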
CLEVR [Johnson+ CVPR 2017]
• A dataset for VQA
• Requires logical reasoning
Video recognition
Timeline [Zhu+ 2020, A Comprehensive Study of Deep Video Action Recognition]
Datasets [Zhu+ 2020, A Comprehensive Study of Deep Video Action Recognition]
Taxonomy [Zhu+ 2020, A Comprehensive Study of Deep Video Action Recognition]
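3D CNN (C3D) [Tran+ ICCV 2015]
• Uses 3D convolutions so that feature representations also capture the time dimension
(2+1)D CNN [Tran+ CVPR 2018]
• Instead of convolving over space and time at once in a single layer, first convolves over space and then over time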
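The factorization can be made concrete by counting weights. The sketch below compares a full 3D kernel with a spatial convolution followed by a temporal one. The rule for choosing the intermediate channel count so that both versions have about the same number of weights is written from memory of the paper and should be treated as an assumption, as should the function names.

```python
def conv3d_params(t, d, c_in, c_out):
    """Weights in a full t x d x d 3D convolution (biases ignored)."""
    return t * d * d * c_in * c_out


def conv2plus1d_params(t, d, c_in, c_out, mid):
    """Weights in a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal one."""
    spatial = d * d * c_in * mid
    temporal = t * mid * c_out
    return spatial + temporal


def matched_mid_channels(t, d, c_in, c_out):
    """Intermediate width that keeps the parameter count close to the 3D version."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


t, d, c_in, c_out = 3, 3, 64, 64
mid = matched_mid_channels(t, d, c_in, c_out)
print(mid, conv3d_params(t, d, c_in, c_out), conv2plus1d_params(t, d, c_in, c_out, mid))
```

With a matched budget, the factorized block gains an extra nonlinearity between the spatial and temporal steps while using the same number of weights.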
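I3D [Carreira&Zisserman CVPR 2017]
• A network built by stacking 3D convolutions
Non-local [Wang+ CVPR 2018]
• Takes global information into account through attention-based weighting
• Represents the feature at one position as a weighted sum of the features at all other positions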
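The weighted-sum idea can be written down in a few lines. The sketch below uses dot-product similarity with a softmax and no learned projections or residual connection, so it is a stripped-down stand-in for the actual non-local block; `non_local` is a name chosen for this example.

```python
import math


def non_local(features):
    """features: list of N feature vectors. Each output is a softmax-weighted sum over all positions."""
    outputs = []
    for query in features:
        # Similarity between this position and every position (including itself).
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, features)) for d in range(len(query))]
        )
    return outputs


features = [[0.0, 0.0], [2.0, 4.0]]
print(non_local(features))
```

The first position has zero similarity to everything, so it attends uniformly and its output is the average of all features; the second position attends almost entirely to itself.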
SlowFast Networks [Feichtenhofer+ ICCV 2019]
• Captures spatial features at a low frame rate and temporal features at a high frame rate
Pose estimation
Taxonomy [Chen+ 2020, Monocular Human Pose Estimation: A Survey of Deep Learning-based Methods] [Zheng+ 2020, Deep Learning-Based Human Pose Estimation: A Survey]
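Convolutional Pose Machines [Wei+ CVPR 2016]
• Multi-stage prediction raises the estimation accuracy for each body part
Part Affinity Fields [Cao+ CVPR 2017]
• Pose estimation using vector fields that encode the position and orientation of limbs
HRNet [Sun+ CVPR 2019]
• Adding sub-networks allows pose estimation without lowering the overall resolution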
3D
Taxonomy [Ahmed+ 2020, A survey on Deep Learning Advances on Different 3D Data Representations]
3D point clouds
Timeline [Guo+ 2020, Deep Learning for 3D Point Clouds: A Survey]
Taxonomy [Guo+ 2020, Deep Learning for 3D Point Clouds: A Survey]
Datasets [Guo+ 2020, Deep Learning for 3D Point Clouds: A Survey]
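PointNet [Qi+ CVPR 2017]
• A network that takes point cloud data as input and outputs features that are invariant to operations such as rotation and reordering of the points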
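Invariance to point order comes from aggregating per-point features with a symmetric function. The sketch below shows only that part: `point_feature` is a hypothetical stand-in for the shared per-point network, and rotation is not handled here.

```python
def point_feature(point):
    """Hypothetical per-point feature map standing in for a shared MLP."""
    x, y, z = point
    return [x + y + z, x * y, max(0.0, z - x)]


def global_feature(points):
    """Channel-wise max over all per-point features."""
    features = [point_feature(p) for p in points]
    # max is symmetric in its arguments, so the order of the points cannot matter.
    return [max(channel) for channel in zip(*features)]


cloud = [(0.0, 1.0, 2.0), (1.0, 1.0, 1.0), (2.0, 0.0, 0.5)]
shuffled = [cloud[2], cloud[0], cloud[1]]
print(global_feature(cloud))
print(global_feature(shuffled))  # same result for a different ordering
```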
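PointNet++ [Qi+ NeurIPS 2017]
• PointNet did not capture local information well; applying PointNet hierarchically addresses this
Dynamic Graph CNN [Wang+ ACM TOG 2019]
• Proposes a convolution that builds edge features expressing the relation between each point and its neighbors
VoxelNet [Zhou+ CVPR 2018]
• Divides point cloud data into voxels and embeds a feature representation for each voxel
• Improves the accuracy of object recognition on 3D point clouds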
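3D meshes
Heat Diffusion Equation
• Considers diffusion on a curved surface (a Riemannian manifold)
[Bronstein+ 2016, Geometric deep learning: going beyond Euclidean data]
Geodesic CNN [Masci+ ICCV 2015]
• Proposes a CNN that can handle non-Euclidean manifolds
• Uses polar coordinates at each point
Anisotropic CNN [Boscaini+ NeurIPS 2016]
• Extracts local representations better by using anisotropic kernels
[Bronstein+ 2016, Geometric deep learning: going beyond Euclidean data]
MoNet [Monti+ CVPR 2017]
• A generalization of earlier non-Euclidean CNNs
• Generalizes the coordinate system
• Generalizes the kernel by using learnable kernels instead of fixed ones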
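3D differentiable renderers
Differentiable renderers (figure: rendering from 3D to 2D)
Perspective Transformer Nets [Yan+ NeurIPS 2016]
• A differentiable renderer for voxels
Neural 3D Mesh Renderer [Kato+ CVPR 2018]
• A high-accuracy differentiable renderer for meshes
• Making the rasterization step differentiable enables backpropagation
[https://www.slideshare.net/100001653434308/23d-neural-3d-mesh-renderer-cvpr-2018]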
Transformers/Attention
Timeline [Han+ 2021, A Survey on Visual Transformer]
Taxonomy [Han+ 2021, A Survey on Visual Transformer] [Khan+ 2021, Transformers in Vision: A Survey]
DETR [Carion+ ECCV 2020]
• Extracts image features with a CNN, then detects objects with a Transformer
iGPT [Chen+ ICML 2020]
• Unsupervised learning of image features with GPT-2
Vision Transformer (ViT) [Dosovitskiy+ ICLR 2021]
• State-of-the-art image recognition with a pure Transformer (shown again)
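The first step of a ViT-style model, cutting the image into fixed-size patches that are then treated as tokens, might look like the sketch below. The function name and the single-channel list-of-lists image are assumptions for illustration; the learned linear projection and position embeddings of the actual model are omitted.

```python
def image_to_patches(image, patch):
    """image: H x W list of lists. Returns flattened patch x patch blocks in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return patches


image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, 2)
print(len(tokens), tokens[0], tokens[3])
```

Each flattened patch plays the role that a word embedding plays in a language Transformer.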
IPT [Chen+ 2020]
• A Transformer that handles multiple tasks at the same time
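[https://twitter.com/jaguring1/status/1377710003377725441]
[https://www.slideshare.net/cvpaperchallenge/transformer-247407256]
3. Summary
Summary
• Models have developed from a ResNet base toward greater complexity, larger scale, and higher efficiency
• Vision Transformers are appearing one after another
• For models specialized for the basic computer vision tasks, the benchmarks appear to have settled
• A shift from 2D to 3D
• Incorporating multi-scale information seems to be a common pattern
• Fine-grained techniques seem to matter
[https://www.slideshare.net/cvpaperchallenge/cvpr-2020-237139930]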
References and related material
References
• [cvpaper.challenge-summary](https://github.com/hirokatsukataoka16/cvpaper.challenge-summary)
• [CVPR 2016 quick report](https://www.slideshare.net/HirokatsuKataoka/cvpr-2016)
• [CVPR 2017 quick report](https://www.slideshare.net/cvpaperchallenge/cvpr-2017-78294211)
• [CVPR 2018 quick report](https://www.slideshare.net/cvpaperchallenge/cvpr-2018-102878612)
• [CVPR 2019 quick report](https://www.slideshare.net/cvpaperchallenge/cvpr-2019)
• [CVPR 2020 quick report](https://www.slideshare.net/cvpaperchallenge/cvpr-2020-237139930)
• [Video recognition survey v1 (meta-survey)](https://www.slideshare.net/cvpaperchallenge/v1-232973484)
• [Vision and Language (meta-survey)](https://www.slideshare.net/cvpaperchallenge/vision-and-language-232926110)
• [Research trends in convolutional neural networks](https://www.slideshare.net/ren4yu/ss-84282514)
• [The history of ConvNets, ResNet variants, and best practices](https://www.slideshare.net/ren4yu/convnetresnet)
• [Making convolutional neural networks more accurate and faster](https://www.slideshare.net/ren4yu/ss-145689425)
• [Paper introduction: Fast R-CNN & Faster R-CNN](https://www.slideshare.net/takashiabe338/fast-rcnnfaster-rcnn)
• [[Object detection] An explanation of SSD (Single Shot MultiBox Detector)](https://www.acceluniverse.com/blog/developers/2020/02/SSD.html)
• [[History of object detection methods: an introduction to YOLO]](https://qiita.com/cv_carnavi/items/68dcda71e90321574a2b)
• [Image recognition and deep learning](https://www.slideshare.net/ren4yu/ss-234439652)
• [Semantic segmentation survey](https://www.slideshare.net/yoheiokawa/semantic-segmentation-141471958)
• [Semantic segmentation: a retrospective](https://speakerdeck.com/motokimura/semantic-segmentation-zhen-rifan-ri)
• [[DL reading group] SlowFast Networks for Video Recognition](https://www.slideshare.net/DeepLearningJP2016/dlslowfast-networks-for-video-recognition-202057397)
• [A survey of neural networks for 3D point clouds](https://www.slideshare.net/naoyachiba18/ss-120302579)
• [A survey of neural networks for 3D point clouds, Ver. 2](https://speakerdeck.com/nnchiba/point-cloud-deep-learning-survey-ver-2)
• [Point cloud deep learning: Meta-study](https://www.slideshare.net/naoyachiba18/metastudy)
• [1st reading group on recent ML, CV, and NLP papers: PointNet](https://www.slideshare.net/FujimotoKeisuke/point-net)
• [[DL reading group] Mesh and Deep Learning: Surface Networks & AtlasNet](https://www.slideshare.net/DeepLearningJP2016/dlmeshdeep-learning-surface-networks-atlasnet)
• [Paper summary: Convolutional Pose Machines](https://qiita.com/masataka46/items/88f1a375ce8a485d9454)
• [Survey of recent computer vision papers: 2D Human Pose Estimation](https://engineer.dena.com/posts/2019.11/cv-papers-19-2d-human-pose-estimation/)
• [[2nd 3D study group, research introduction] Neural 3D Mesh Renderer (CVPR 2018)](https://www.slideshare.net/100001653434308/23d-neural-3d-mesh-renderer-cvpr-2018)
• [Paper walkthrough of FastFCN (JPU), the current SOTA replacing DeepLab](https://qiita.com/kamata1729/items/1b495658a63d76904ac3)