
Vision Transformer / pyml-niigata-20210220-vision-transformer


Slides presented at the Python Machine Learning Study Group in Niigata on 2021/02/20.


kasacchiful

February 20, 2021

Transcript

  1. Vision Transformer. Python Machine Learning Study Group in Niigata #12, 2021-02-20, @kasacchiful

  2. Hiroshi Kasahara (@kasacchiful), Software Developer. Favorite communities: • JAWS-UG Niigata • Python ML in Niigata (New!!) • JaSST Niigata • ASTER • SWANII • etc.
  3. JAWS-UG Niigata #9 https://jawsug-niigata.connpass.com/event/

  4. Agenda 1. What is Vision Transformer? 2. A Transformer refresher 3. Merits and demerits of Vision Transformer 4. My thoughts at this point 5. Transformer applications beyond image classification
  5. Vision Transformer https://github.com/google-research/vision_transformer

  6. What is Vision Transformer? • Applies the "Transformer", now the basis of modern natural language processing, to image classification • Does not use a CNN, the standard for image classification • Matches or exceeds the performance of various SoTA models • Needs less compute for training (but requires a huge amount of data) • Currently under review at ICLR 2021
  7. How Vision Transformer works 1. Split the image into N patches 2. Flatten each patch and apply a linear projection • The projection parameters are learned during training 3. Create position information for the original patches 4. Feed the projected patches and the position information into the Transformer encoder 5. Classify the encoder output with an MLP https://github.com/lucidrains/vit-pytorch
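The patch pipeline above (split into patches, flatten, linearly project, add position information) can be sketched in plain NumPy. This is a toy illustration under assumed sizes (32x32 RGB image, 8x8 patches, model dimension 64), not the reference implementation, and it omits the class token and the encoder itself; in the real model the projection and position embeddings are learned.

```python
import numpy as np

# Toy image: 32x32 RGB, split into 8x8 patches -> 16 patches.
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
P = 8

# Steps 1-2: split the image into patches and flatten each one.
patches = (img.reshape(32 // P, P, 32 // P, P, 3)
              .swapaxes(1, 2)              # group the patch grid together
              .reshape(-1, P * P * 3))     # one flat vector per patch
print(patches.shape)                       # (16, 192)

# Step 2 (cont.): linear projection to the model dimension.
# These weights would be learned during training; random here.
rng = np.random.default_rng(0)
W = rng.normal(size=(P * P * 3, 64))
tokens = patches @ W                       # (16, 64)

# Step 3: add position embeddings so patch order is not lost
# (also learned in the real model).
pos = rng.normal(size=tokens.shape)
encoder_input = tokens + pos               # Step 4: this goes to the encoder
print(encoder_input.shape)                 # (16, 64)
```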
  8. Vision Transformer performance (Source: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale")

  9. A Transformer refresher

  10. Transformer • A machine translation model built from Attention • Attention can be read as a Dictionary Object (Query, Key, Value) • Given a Query, you get the place to look (the Key) and the value stored there (the Value) • The Keys and Values come from prior knowledge, so they act as a Memory • Self-Attention: captures the relations between words within a sentence; Query/Key/Value are all generated from the same words • Source-Target Attention: captures the correspondence between two sequences; the Query comes from the decoder side, the Key/Value from the encoder side • Vision Transformer uses a modified version of the Transformer's encoder part
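The Query/Key/Value view described above can be sketched as scaled dot-product self-attention in NumPy. This is a minimal single-head sketch with random weights and made-up dimensions (5 tokens, embedding size 8), not the full multi-head layer: each row of the weight matrix says how much one token attends to every other token, and the output is the corresponding weighted sum of Values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, V all come from the same sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # token-to-token similarity
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                          # (5, 8): one output vector per token
```

For Source-Target Attention, the only change is that Q is computed from the decoder sequence while K and V come from the encoder sequence.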
  11. Transformer model (Source: "Attention Is All You Need")

  12. Attention examples https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb Self-Attention / Source-Target Attention

  13. Encoder comparison: Vision Transformer vs. Transformer

  14. Merits and demerits of Vision Transformer

  15. Merits of Vision Transformer (Source: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale") • High performance • Matches or exceeds various SoTA models • Needs less compute • Pre-training BiT or Noisy Student takes about 10,000 TPU-core-days, while ViT-Huge takes about 2,500 TPU-core-days, roughly a quarter
  16. Demerits of Vision Transformer • Needs a huge amount of data • A model pre-trained on the huge "JFT-300M" dataset is fine-tuned • Training on the ImageNet dataset alone does not beat existing SoTA models ➡ Does not work well on small datasets ➡ Shows its true strength on huge datasets
  17. My thoughts at this point

  18. How to use Vision Transformer well • How do you prepare a large amount of data? • If a model pre-trained on a large dataset is publicly available, fine-tune it • Prepare the data yourself & generate it yourself • Research on self-supervised learning is advancing, so delegating part of the labeling to self-supervised learning is another option
  19. How to use Vision Transformer well • What if you cannot prepare a large amount of data? • If the accuracy is acceptable for the job, use an existing model • For image classification, EfficientNet is now in tf.keras.applications, so it is easy to use
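As a rough sketch of how easy the existing-model route is, this is how EfficientNetB0 can be pulled from tf.keras.applications, assuming TensorFlow 2.x. `weights=None` is used here only to avoid downloading weights; pass `weights="imagenet"` to get the pretrained model for fine-tuning.

```python
import tensorflow as tf

# EfficientNetB0 from tf.keras.applications; use weights="imagenet" for
# the ImageNet-pretrained model (weights=None builds an untrained one).
model = tf.keras.applications.EfficientNetB0(weights=None)

# EfficientNet in Keras expects raw pixel values in [0, 255].
img = tf.random.uniform((1, 224, 224, 3), maxval=255.0)
preds = model(img)
print(preds.shape)  # (1, 1000): one score per ImageNet class
```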

  20. Outlook for Vision Transformer • Even before Vision Transformer, there were cases of applying the Transformer outside natural language processing • Improved versions of Vision Transformer that reach good accuracy on smaller datasets will likely appear ➡ Transformer-based approaches to various tasks are worth watching ➡ It may also pay to keep up with self-supervised learning methods
  21. Transformer applications beyond image classification

  22. Transformer applications beyond image classification. Examples of applying the Transformer to various tasks: • DETR: Transformer for object detection • Axial-Attention: Transformer for segmentation • Image Transformer: Transformer for image generation • VideoBERT: Transformer for video understanding • Set Transformer: Transformer for clustering
  23. Summary

  24. Summary • "Vision Transformer" applies the Transformer to image classification • High performance, with relatively little compute needed for training • However, it requires a huge dataset • To use Vision Transformer, either fine-tune a model pre-trained on a large dataset or prepare the data yourself • When preparing your own data, it may be worth considering self-supervised learning for labeling • The Transformer is spreading beyond vision too, so keep it on your radar as a trend
  25. The End

  26. References • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale • https://arxiv.org/abs/2010.11929 • google-research/vision_transformer • https://github.com/google-research/vision_transformer • emla2805/vision-transformer: Tensorflow implementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) • https://github.com/emla2805/vision-transformer • lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch • https://github.com/lucidrains/vit-pytorch • "A revolution in image recognition: an explanation of the much-discussed Vision Transformer" - Qiita • https://qiita.com/omiita/items/0049ade809c4817670d7 • "Trying image recognition with Transformer: Vision Transformer" | GMO Internet Next-Generation System Research Lab • https://recruit.gmo.jp/engineer/jisedai/blog/vision_transformer/
  27. References • Attention Is All You Need • https://arxiv.org/abs/1706.03762 •

    End-to-End Object Detection with Transformers • https://arxiv.org/abs/2005.12872 • Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation • https://arxiv.org/abs/2003.07853 • Image Transformer • https://arxiv.org/abs/1802.05751 • VideoBERT: A Joint Model for Video and Language Representation Learning • https://arxiv.org/abs/1904.01766 • Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks • https://arxiv.org/abs/1810.00825