Explanatory slides for the Facebook AI Research paper "Deep Clustering for Unsupervised Learning of Visual Features" https://arxiv.org/abs/1807.05520
Deep Clustering for Unsupervised Learning of Visual Features
2018.08.08 (Okinawa) AI-related paper reading group #4
@hoto17296
• ちゅらデータ株式会社 (Chura Data Inc.)
• Web / infrastructure / data analysis
This session's paper: Deep Clustering for Unsupervised Learning of Visual Features
• https://arxiv.org/abs/1807.05520
• Facebook AI Research
• Accepted at ECCV 2018
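Overview
• A method for clustering images with a CNN
• The CNN outputs are clustered with k-means, the cluster assignments are treated as "pseudo-labels", and the network weights are updated against them
• It performs very well
  • Outperforms other algorithms in the Pascal VOC evaluation
• It is robust
  • Still works when the dataset is changed (ImageNet → YFCC100M)
  • Still works when the network architecture is changed (AlexNet → VGG16)
  • Still works with clustering algorithms other than k-means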
1. Background
ImageNet
• A (well-known) image dataset
• More than 1 million images
• Labeled with 1,000 classes
• Widely used, for example to evaluate image classification algorithms
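Problems with ImageNet
• It has "only" 1 million images
• Some of the human-assigned labels are wrong
↓
The recent plateau in image classification performance is thought to be caused in part by the dataset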
Addressing the problems with ImageNet
• An internet-scale image dataset
• No human labeling
↓
The goal is to achieve this with unsupervised learning
2. Preliminaries
Supervised / unsupervised learning
• Supervised Learning
  • The training data comes with ground-truth labels
  • Classification, regression, etc.
• Unsupervised Learning
  • The training data has no ground-truth labels
  • Clustering
  • AutoEncoder
Self-Supervised Learning
• A kind of unsupervised learning (citation needed)
• "Pseudo-labels" are produced by some procedure and then used as if they were ground-truth labels for training
3. Method
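Flow of one epoch
1. Forward the inputs through the CNN
2. Compress the CNN outputs with PCA
3. Cluster them with k-means
4. Compute the loss, treating the clustering result as "pseudo-labels"
5. Update the network weights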
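The five steps above can be sketched as a single function. This is only an illustration of the order of operations, not the authors' implementation: `deepcluster_epoch` and its four callables (`forward`, `reduce_dim`, `cluster`, `train_step`) are hypothetical names standing in for the CNN forward pass, PCA, k-means and the supervised weight update.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def deepcluster_epoch(
    images: Sequence,
    forward: Callable[[Sequence], List[Vector]],
    reduce_dim: Callable[[List[Vector]], List[Vector]],
    cluster: Callable[[List[Vector]], List[int]],
    train_step: Callable[[Sequence, List[int]], float],
) -> float:
    """Run one epoch of the alternating scheme and return the training loss.

    All four callables are placeholders (assumptions) for the real components.
    """
    features = forward(images)          # 1. forward every input through the CNN
    reduced = reduce_dim(features)      # 2. compress the CNN outputs (PCA in the slides)
    pseudo_labels = cluster(reduced)    # 3. cluster the compressed features (k-means)
    # 4 and 5. use the cluster ids as labels, compute the loss, update the weights
    return train_step(images, pseudo_labels)
```

Passing the components in as callables keeps the sketch independent of any deep learning library; in practice each one would wrap the corresponding library routine.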
Question
Surely the clustering cannot work well???
In its initial state the network has not been trained at all.
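Apparently it can
"The good performance of random convnets is intimately tied to their convolutional structure which gives a strong prior on the input signal."
In other words, even a random (untrained) CNN can perform well, and this is closely tied to its convolutional structure, which imposes a strong prior on the input signal. ❓❓❓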
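Another study [26]
• Classification on the ImageNet dataset was run with an AlexNet whose parameters were random
• If the outputs were purely random, the expected accuracy would be 0.1% (ImageNet has 1,000 classes, so 1/1000)
• In practice it reached 12% accuracy, far above that chance level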
Probably what this means
Even with random parameters, the structure of a CNN itself has the ability to "output something that looks plausible"
4. Miscellaneous
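Avoiding some pitfalls
• Everything ends up in a single cluster
  • Solved by moving the centroid of any cluster that becomes empty and recomputing the clusters
• The pseudo-labels are imbalanced
  • The same problem that occurs in supervised learning when the number of examples per label is imbalanced
  • Solved by sampling uniformly over the pseudo-labels during training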
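Both workarounds can be illustrated in a few lines. The sketch below makes some assumptions: the function names are hypothetical, and the empty-cluster fix follows the paper's description (copy the centroid of a randomly chosen non-empty cluster with a small perturbation), with the perturbation size `noise` chosen arbitrarily here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def reseed_empty_centroids(
    centroids: Sequence[Sequence[float]],
    counts: Sequence[int],
    rng: random.Random,
    noise: float = 1e-3,  # assumed perturbation size, not taken from the paper
) -> List[List[float]]:
    """Replace each empty cluster's centroid with a jittered copy of a non-empty one."""
    non_empty = [i for i, n in enumerate(counts) if n > 0]
    result = [list(c) for c in centroids]
    for i, n in enumerate(counts):
        if n == 0:
            donor = rng.choice(non_empty)
            result[i] = [x + rng.uniform(-noise, noise) for x in centroids[donor]]
    return result


def sample_uniform_over_pseudo_labels(
    pseudo_labels: Sequence[int], num_samples: int, rng: random.Random
) -> List[int]:
    """Draw image indices so that every non-empty cluster is equally likely."""
    members: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(pseudo_labels):
        members[label].append(index)
    clusters = sorted(members)
    picked = []
    for _ in range(num_samples):
        label = rng.choice(clusters)                # pick a cluster uniformly
        picked.append(rng.choice(members[label]))   # then an image inside it
    return picked
```

After reseeding, the points would be reassigned to the nearest centroid as in an ordinary k-means iteration. With the sampler, a cluster holding 2% of the images contributes about as many training examples as one holding 98%.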
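Implementation details (1/2)
• A standard AlexNet was used as the CNN
  • Local Response Normalization was replaced with Batch Normalization
• Handling raw color information directly is difficult
  • A linear transformation based on the Sobel filter (※ edge extraction) removes color and emphasizes contrast
• ImageNet images were fed in after Data Augmentation
• The mini batch size was 256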
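Implementation details (2/2)
• Training for 500 epochs took 12 days on a P100 GPU
• One third of the total run time is spent in k-means
  • All the data has to be forwarded through the network before clustering, so this step is unavoidably slow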
5. Experiments and discussion
Note: Normalized Mutual Information (NMI)
• Expresses how similar one clustering result A is to another clustering result B
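The paper defines NMI(A; B) = I(A; B) / sqrt(H(A) H(B)), where I is mutual information and H is entropy. A minimal pure-Python sketch of that formula follows; the function names are hypothetical, and returning 1.0 when both assignments consist of a single cluster is a convention assumed here for the degenerate case.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def entropy(counts: Iterable[int], total: int) -> float:
    """Shannon entropy (in nats) of a distribution given by raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts)


def normalized_mutual_information(
    a: Sequence[Hashable], b: Sequence[Hashable]
) -> float:
    """NMI between two cluster assignments over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    h_a = entropy(count_a.values(), n)
    h_b = entropy(count_b.values(), n)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one side is a single cluster (assumed convention).
        return 1.0 if h_a == h_b else 0.0
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    return mutual_info / math.sqrt(h_a * h_b)
```

The score does not depend on how clusters are numbered: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same grouping and score 1, while two independent groupings score 0.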
Comparison with the ImageNet labels
• How the NMI between the DeepCluster clustering result and the ImageNet labels evolves
• The two become more similar as the epochs progress
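Cluster stability
• Stability: how consistently an image is assigned to the same cluster in the next epoch (the paper measures this as the NMI between the assignments at epochs t-1 and t)
• Stability increases as the epochs progress
• It saturates below 0.8
• Training beyond that point brings no further benefit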
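Effect of the number of clusters on performance
• Classification performance was measured with mAP (mean Average Precision)
• k = 10,000 gave the best performance
• For ImageNet one might expect k = 1,000 to be best, but over-segmentation gave better results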
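For readers unfamiliar with mAP, the sketch below computes average precision for one class from ranked scores and averages it over classes. It uses the plain (non-interpolated) definition as an assumption for illustration; the official Pascal VOC 2007 protocol uses an interpolated variant, so numbers would not match the benchmark exactly. The function names are hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    scores_per_class: Sequence[Sequence[float]],
    positives_per_class: Sequence[Sequence[bool]],
) -> float:
    """Average the per-class AP values (one score list and one label list per class)."""
    aps = [
        average_precision(scores, positives)
        for scores, positives in zip(scores_per_class, positives_per_class)
    ]
    return sum(aps) / len(aps)
```

A class whose positives are all ranked first gets an AP of 1, and every negative ranked above a positive lowers the score.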
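Effect of color removal on discriminative ability
• The figure visualizes the first layer of the CNN
• When raw color information is fed in as is (left), the filters only pick up color information
• When the color information is transformed with the Sobel filter (right), the filters pick up edges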
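Observations for each CNN layer
• The figure shows the top 9 images that gave the strongest response at each layer
• Deeper layers recognize larger patterns (as expected)
• The last layer (conv5) appears to be recognizing again what earlier layers have already recognized
• This supports another study [43], which found that (in AlexNet) the last layer (conv5) has different characteristics from the other layers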
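Classification performance of each layer (1/3)
• Train a linear classifier on the outputs of the network up to the n-th layer
• Evaluate classification performance on the ImageNet and Places datasets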
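The idea of a linear classifier on top of fixed features can be sketched with a small softmax regression. This is a toy illustration under stated assumptions: the feature vectors are taken as given (in the real evaluation they would be outputs of a frozen layer), training is plain full-batch gradient descent, and the names `train_linear_probe` and `predict` as well as the learning rate and epoch count are hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, biases, x) -> List[float]:
    return [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    num_classes: int,
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[List[float]], List[float]]:
    """Fit only a linear layer; the features themselves are never updated."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, biases, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                grad_b[c] += err / n
                for d in range(dim):
                    grad_w[c][d] += err * x[d] / n
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c]
            for d in range(dim):
                weights[c][d] -= lr * grad_w[c][d]
    return weights, biases


def predict(weights, biases, x) -> int:
    scores = logits_for(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the classes already are in the features of that layer.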
Classification performance of each layer (2/3)
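Classification performance of each layer (3/3)
• DeepCluster performs well at the higher layers
  • conv3 performs very well
  • Surprisingly, better than conv5
• On the other hand, conv1 does not perform well at all
• In DeepCluster, conv3-conv4 may be recognizing something that corresponds to the ImageNet labels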
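Evaluation on Pascal VOC (1/3)
• Pascal VOC: a competition covering classification, object detection, and pixel labeling (semantic segmentation)
• Performance is evaluated by solving these tasks with DeepCluster
• Fast R-CNN was used to implement object detection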
Evaluation on Pascal VOC (2/3)
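Evaluation on Pascal VOC (3/3)
• Performance is good on all of classification, object detection, and pixel labeling
• Interestingly, a fine-tuned random network reaches reasonable accuracy, but performs considerably worse when only the fully connected layers 6-8 are trained
• The setting where fine-tuning is not possible is closer to real applications
• In that setting the gap to the previous state-of-the-art methods becomes even larger (up to 9% on classification)
( ˘ω˘) .oO ( I did not quite follow what this part was saying )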
6. Summary
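Overview (recap)
• A method for clustering images with a CNN
• The CNN outputs are clustered with k-means, the cluster assignments are treated as "pseudo-labels", and the network weights are updated against them
• It performs very well
  • Outperforms other algorithms in the Pascal VOC evaluation
• It is robust
  • Still works when the dataset is changed (ImageNet → YFCC100M)
  • Still works when the network architecture is changed (AlexNet → VGG16)
  • Still works with clustering algorithms other than k-means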