Slide 1

Slide 1 text

Paper Introduction: Latest Trends in Instance Segmentation
ISHII TOMOAKI

Slide 2

Slide 2 text

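Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation (December 2020)
Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph
Google Research, Brain Team; UC Berkeley; Cornell University

What is it?
A paper on Copy-Paste augmentation. It is a data augmentation method that randomly pastes objects from one image onto another. This simple mechanism yields solid gains without increasing training cost or inference time. Object-aware augmentation helps instance segmentation, and where conventional methods fall short, Copy-Paste augmentation is well suited to the task.

What is impressive compared with prior work?
Data augmentations such as random crop, color jitter, and AutoAugment / RandAugment are general-purpose by nature, while mixup, CutMix, and Mosaic are not object-aware and were not designed specifically for instance segmentation. Copy-Paste, in contrast, copies only the exact pixels belonging to an object rather than every pixel inside its bounding box. The key difference from Contextual Copy-Paste and InstaBoost is that it does not need to model the surrounding visual context in order to place the copied object instances. A simple random placement strategy works well and reliably improves strong baseline models. It also handles the recent long-tailed setting.

What is the key to the technique?
Two images are selected at random, and random scale jittering and random horizontal flipping are applied to each. A random subset of objects is then selected from one image and pasted onto the other. Finally, the ground-truth annotations are adjusted.

How was it validated?
On the COCO dataset, Mask R-CNN with an EfficientNet-B7 FPN was compared under three settings: no extra augmentation, mixup, and Copy-Paste. Copy-Paste helped most in the low-data regime (10% of COCO), improving box AP by 6.9 on top of SSJ (standard scale jittering) and by 4.8 on top of LSJ (large scale jittering). mixup, by contrast, helped only when little data was available. Applying Copy-Paste on top of a strong 54.8 box AP COCO model also improved data efficiency on COCO substantially; with EfficientNet-B7 and the NAS-FPN architecture, it reached 57.3 box AP and 49.1 mask AP on COCO test-dev without test-time augmentation.

Is there any discussion?
Not stated.

Which papers should be read next?
Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection."
Zhaowei Cai and Nuno Vasconcelos. "Cascade R-CNN: Delving into High Quality Object Detection."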
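The paste-and-adjust step of Copy-Paste described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: images are assumed to be same-sized nested lists, masks are 0/1 nested lists, the function name copy_paste is hypothetical, and the scale jittering and horizontal flipping that precede pasting are omitted.

```python
import random


def copy_paste(src_img, src_masks, dst_img, dst_masks, rng=random):
    """Paste a random subset of source objects onto the destination image.

    Assumptions (illustrative only): both images are H x W nested lists of
    the same size, and every mask is an H x W nested list of 0/1 values.
    Returns the augmented image and the adjusted list of instance masks.
    """
    h, w = len(dst_img), len(dst_img[0])
    if not src_masks:
        return [row[:] for row in dst_img], [[r[:] for r in m] for m in dst_masks]

    # Choose a random, non-empty subset of the source objects.
    k = rng.randint(1, len(src_masks))
    chosen = rng.sample(src_masks, k)

    # Union of all pasted object pixels (exact mask pixels, not boxes).
    pasted = [[any(m[y][x] for m in chosen) for x in range(w)] for y in range(h)]

    # Copy only the object pixels from the source image.
    out_img = [
        [src_img[y][x] if pasted[y][x] else dst_img[y][x] for x in range(w)]
        for y in range(h)
    ]

    # Adjust ground truth: remove occluded pixels from destination masks
    # and drop destination objects that are now fully hidden.
    out_masks = []
    for m in dst_masks:
        visible = [
            [1 if m[y][x] and not pasted[y][x] else 0 for x in range(w)]
            for y in range(h)
        ]
        if any(any(row) for row in visible):
            out_masks.append(visible)

    # The pasted objects become new annotated instances.
    out_masks.extend([row[:] for row in m] for m in chosen)
    return out_img, out_masks
```

Bounding boxes, if needed, would be recomputed from the adjusted masks; that step is left out here for brevity.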

Slide 3

Slide 3 text

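"YOLOv4: Optimal Speed and Accuracy of Object Detection" (April 2020)
Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

What is it?
The main goal of the paper is to design an object detector that runs fast in production systems (in real time on a GPU) and is optimized for parallel computation, rather than to minimize a theoretical low-computation indicator (BFLOP). It achieves state-of-the-art results by combining several features said to improve the accuracy of convolutional neural networks (CNNs): Weighted-Residual-Connections (WRC), Cross-Stage-Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), and Mish activation, together with Mosaic data augmentation, DropBlock regularization, and CIoU loss. It demonstrates the viability of the one-stage anchor-based detector concept.

What is impressive compared with prior work?
Modern detectors usually consist of two parts, a backbone and a head, and heads fall into two kinds: one-stage and two-stage object detectors. Detectors developed in recent years often insert several layers between the backbone and the head; these can be called the neck of the detector. Conventional object detectors are usually trained offline. Methods that only change the training strategy or only increase the training cost (data augmentation, for example) are called "Bag of freebies". Plug-in modules and post-processing methods that raise inference cost only slightly while improving detection accuracy substantially are called "Bag of specials". The SPP module in its original form outputs a one-dimensional feature vector and therefore cannot be applied to a fully convolutional network (FCN), and attention modules, despite adding only about 2% computation, lengthen inference time on GPUs.

What is the key to the technique?
It provides a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50:95 and AP50) than all available alternative detectors. Because it can be trained and used on a conventional GPU with 8 to 16 GB of VRAM, it is broadly usable.

How was it validated?
The YOLOv4 architecture was chosen as a CSPDarknet53 backbone, an additional SPP module, a PANet path-aggregation neck, and a YOLOv3 (anchor-based) head. Using a 1080 Ti or 2080 Ti GPU, the influence of various training improvement techniques (data augmentation, BoF/BoS, backbone, batch size) was tested on classifier accuracy on ImageNet (ILSVRC 2012 val) and on detector accuracy on MS COCO (test-dev 2017). YOLOv4 outperformed other fast and accurate detectors in both speed and accuracy.

Is there any discussion?
Not stated.

Which papers should be read next?
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. "Soft-NMS: Improving Object Detection with One Line of Code."
Zhaowei Cai and Nuno Vasconcelos. "Cascade R-CNN: Delving into High Quality Object Detection."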
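Of the components listed for YOLOv4, the Mish activation is the easiest to show concretely. It is commonly defined as mish(x) = x * tanh(softplus(x)). The scalar sketch below uses only the Python standard library; the function names are illustrative, and a real network would apply the same formula element-wise to tensors.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))
```

Unlike ReLU, the function is smooth and lets small negative values pass through, while behaving almost like the identity for large positive inputs.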

Slide 4

Slide 4 text

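"Cascade R-CNN: Delving into High Quality Object Detection" (December 2017)
Zhaowei Cai and Nuno Vasconcelos

What is it?
Cascade R-CNN is a multi-stage object detection framework. Object detection needs an intersection over union (IoU) threshold to define positives and negatives, but raising the IoU threshold tends to degrade detection performance. The two main causes are overfitting during training, because positive samples vanish exponentially, and an inference-time mismatch between the IoUs for which the detector is optimal and the IoUs of the input hypotheses. To address these problems, the paper proposes Cascade R-CNN, a multi-stage object detection architecture made up of a sequence of detectors trained with increasing IoU thresholds, so that each stage is progressively more selective against close false positives.

What is impressive compared with prior work?
Following the success of the R-CNN architecture, the two-stage formulation that combines a proposal detector with a region-wise classifier has become dominant. To reduce the redundant CNN computation in R-CNN, SPP-Net and Fast R-CNN introduced region-wise feature extraction, which sped up the whole detector considerably. Faster R-CNN then achieved a further speed-up by introducing the Region Proposal Network (RPN) and became the leading object detection framework. It has since been extended: R-FCN proposed efficient region-wise fully convolutional computation, avoiding the heavy region-wise CNN computation of Faster R-CNN without losing accuracy, while MS-CNN and FPN detect proposals at multiple output layers, which reduces the scale mismatch between the RPN receptive fields and the actual object sizes and yields high-recall proposals. The one-stage YOLO, which is close to a sliding-window approach, has also produced good results. Cascade R-CNN extends the two-stage architecture of Faster R-CNN.

What is the key to the technique?
Cascaded bounding box regression.
Cascaded detection.

How was it validated?
To test the generality of Cascade R-CNN, experiments were run on MS-COCO 2017 with three common baseline detectors: Faster R-CNN with a VGG-Net backbone, R-FCN, and FPN with a ResNet backbone. Cascade R-CNN outperformed all single-model detectors by a large margin under every evaluation metric.

Is there any discussion?
Not stated.

Which papers should be read next?
K. He, G. Gkioxari, P. Dollár, and R. Girshick. "Mask R-CNN."
J. Dai, Y. Li, K. He, and J. Sun. "R-FCN: Object Detection via Region-based Fully Convolutional Networks."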
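The effect that motivates Cascade R-CNN, positives disappearing as the IoU threshold rises, can be illustrated with a small sketch. The box format (x1, y1, x2, y2), the helper names, and the thresholds 0.5, 0.6, 0.7 are illustrative choices for this example, not taken from the slide.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positives(proposals, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """Count proposals labelled positive at each IoU threshold."""
    return {u: sum(iou(p, gt_box) >= u for p in proposals) for u in thresholds}


# Four proposals shifted progressively further from one ground-truth box.
gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (2, 0, 12, 10), (3, 0, 13, 10), (5, 0, 15, 10)]
counts = count_positives(proposals, gt)
```

With these boxes the number of positives shrinks at each stricter threshold, which is why a single detector trained at a high threshold sees few positive samples. A cascade instead feeds each stage the refined boxes of the previous one, so later stages receive proposals of higher IoU.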

Slide 5

Slide 5 text

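"Mask R-CNN" (March 2017)
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

What is it?
The method, called Mask R-CNN, extends Faster R-CNN by adding a branch that predicts a segmentation mask on each region of interest (RoI), in parallel with the existing branches for classification and bounding box regression.

What is impressive compared with prior work?
The problem with Faster R-CNN is that it was not designed for pixel-to-pixel alignment between network inputs and outputs.

What is the key to the technique?
The paper proposes RoIAlign, a simple quantization-free layer that faithfully preserves exact spatial locations. Although it looks like a minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics. The paper also finds it essential to decouple mask prediction from class prediction. A binary mask is predicted for each class independently, without competition among classes, and the category is predicted by the network's RoI classification branch.

How was it validated?
On human pose estimation with the COCO keypoint dataset, Mask R-CNN surpassed the winner of the 2016 COCO keypoint competition. Ablation experiments showed that it also excels on the COCO object detection task.

Is there any discussion?
Not stated.

Which papers should be read next?
M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. "2D Human Pose Estimation: New Benchmark and State of the Art Analysis."
P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marqués, and J. Malik. "Multiscale Combinatorial Grouping."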
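The difference between quantized sampling and the quantization-free sampling that RoIAlign relies on can be sketched with bilinear interpolation on a tiny feature map. This is a simplified illustration under stated assumptions: the feature map is a nested list, the function names are hypothetical, quantization is modelled as snapping to the floor cell, and a full RoIAlign would additionally average several sampled points per output bin.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so border samples stay defined.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feat, y, x):
    """Snap the location to an integer cell, discarding the fraction."""
    return feat[int(math.floor(y))][int(math.floor(x))]


feat = [[0.0, 10.0], [20.0, 30.0]]
aligned = bilinear_sample(feat, 0.5, 0.5)
snapped = quantized_sample(feat, 0.5, 0.5)
```

Here the snapped sample returns the value of a single neighbouring cell, while the bilinear sample blends all four neighbours according to the exact location, which is the alignment property that matters for pixel-accurate masks.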

Slide 6

Slide 6 text

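"R-FCN: Object Detection via Region-based Fully Convolutional Networks" (May 2016)
Jifeng Dai, Yi Li, Kaiming He, Jian Sun

What is it?
A region-based fully convolutional network for accurate and efficient object detection.

What is impressive compared with prior work?
In contrast to previous region-based detectors, R-FCN's region-based detector is fully convolutional, and almost all computation is shared across the entire image.

What is the key to the technique?
Following R-CNN, it adopts the common two-stage object detection strategy consisting of (i) region proposal and (ii) region classification. Candidate regions are extracted by the Region Proposal Network (RPN), which is itself a fully convolutional architecture. Given the proposed regions (RoIs), the R-FCN architecture is designed to classify the RoIs into object categories and background. In R-FCN, all learnable weight layers are convolutional and are computed over the entire image.

How was it validated?
Experiments were run on the PASCAL VOC datasets. Results are achieved at a test-time speed of 170 ms per image, 2.5 to 20 times faster than the Faster R-CNN counterpart.

Is there any discussion?
Not stated.

Which papers should be read next?
S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. "Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks."
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs."
R. Girshick. "Fast R-CNN."
R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation."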