


[Journal club] CRIS: CLIP-Driven Referring Image Segmentation



Transcript

  1. CRIS: CLIP-Driven Referring Image Segmentation

    Zhaoqing Wang1,2* Yu Lu3* Qiang Li4* Xunqiang Tao2 Yandong Guo2 Mingming Gong5 Tongliang Liu1. 1University of Sydney; 2OPPO Research Institute; 3Beijing University of Posts and Telecommunications; 4Kuaishou Technology; 5University of Melbourne. Presenter: 畑中駿平, Sugiura Laboratory, Keio University. Wang, Zhaoqing, et al. "CRIS: CLIP-Driven Referring Image Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  2. ▸ Referring Image Segmentation (RIS) ▹ First proposed by [Hu+, ECCV16]

    ▸ Research trends in RIS: 1. Fusing features from CNNs and LSTMs (e.g. [Long+, ECCV16]) 2. Introducing attention mechanisms (e.g. [Shi+, ECCV18]) 3. Cross-modal attention mechanisms (e.g. CMSA [Ye+, CVPR19]). Background: cross-modal attention mechanisms are increasingly adopted for the RIS task. CMSA [Ye+, CVPR19] learns the interactions between the image and text modalities.
  3. ▸ Existing RIS methods build on pre-trained models ▹ Image encoders: e.g. ResNet, ViT ▹ Text encoders: e.g. LSTM, BERT ▸ Problem: the pre-trained models are independently trained on a single modality each

    ▹ Multimodal representation learning is therefore insufficient. Problem: existing methods apply single-modality pre-trained models, so multimodal representation learning is insufficient. Example of [Chen+, BMVC19]: text encoder LSTM, image encoder ResNet-101.
  4. ▸ SimVLM [Wang+, 21] and CLIP [Radford+, PMLR21] ▹ Large-scale vision-language pre-training makes learning multimodal representations possible

    ▸ Problem: still not optimal for the RIS task ▹ CLIP attends to global information of the input image. Problem: multimodal representation-learning models such as CLIP [Radford+, PMLR21] and SimVLM [Wang+, 21] are not optimal for the RIS task.
  5. ▸ CRIS (CLIP-Driven Referring Image Segmentation) ▹ A CLIP-driven RIS framework

    ▸ Exploits CLIP's multimodal knowledge for text-to-pixel alignment. Proposed method: adapt the CLIP model to RIS. Following CLIP's recipe, it strengthens the consistency of the multimodal information and improves cross-modal matching performance.
  6. ▸ Visual Encoder: ResNet pre-trained with CLIP ▸ Input: image I ∈ ℝ^{H×W×3} ▹ H: image height,

    W: image width ▸ Output: feature maps of stages 2-4, F_v2, F_v3, F_v4 ▹ F_v2 ∈ ℝ^{(H/8)×(W/8)×C_2} ▹ F_v3 ∈ ℝ^{(H/16)×(W/16)×C_3} ▹ F_v4 ∈ ℝ^{(H/32)×(W/32)×C_4} ▹ C_2, C_3, C_4: numbers of channels. Visual Encoder: F_v2, F_v3, F_v4 = f_VE(I)
  7. ▸ Text Encoder: Transformer pre-trained with CLIP ▸ Input: text T ∈ ℝ^L ▹ L:

    length of the text ▸ Output ▹ Text features: F_t ∈ ℝ^{L×C} ▹ Text representation: F_s ∈ ℝ^{C'} ▹ Tokenized with byte pair encoding (BPE) ▹ C, C': numbers of channels. Text Encoder: F_t, F_s = f_TE(T)
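The two text-encoder outputs can be sketched at the shape level. This toy NumPy snippet assumes L = 17 tokens, C = 512 token channels, and C' = 1024 for the pooled representation (hypothetical sizes); in CLIP the global representation is taken from the transformer state at the end-of-text token and projected.

```python
import numpy as np

# Toy stand-in for the text encoder outputs F_t and F_s (shapes only;
# L, C, C' are assumed values, not the ones used in the paper).
rng = np.random.default_rng(3)
L, C, Cp = 17, 512, 1024

tokens = rng.standard_normal((L, C))   # per-token states = F_t in R^{L x C}
W_proj = rng.standard_normal((C, Cp))  # hypothetical text projection
Fs = tokens[-1] @ W_proj               # pooled representation F_s in R^{C'}
print(tokens.shape, Fs.shape)  # (17, 512) (1024,)
```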
  8. ▸ Neck: fuses F_vi and F_s to obtain the visual features ▸ Input: F_v2,

    F_v3, F_v4, F_s ▸ Output: visual features F_v = Conv([F_m, F_coord]) ∈ ℝ^{(H/16)×(W/16)×C} ▹ F_coord: features that embed positional information into F_m ▹ F_m = Conv([F_m2, F_m3, F_m4]) ▹ F_m4 = Up(σ(F_v4 W_v4) · σ(F_s W_s)) ▹ F_m3 = σ(F_m4 W_m4) · σ(F_v3 W_v3) ▹ F_m2 = σ(F_m3 W_m3) · σ(Avg(F_v2) W_v2). Cross-modal Neck: F_v = Conv([F_m, F_coord])
  9. ▸ Input ▹ Text features: F_t ∈ ℝ^{L×C} ▹ Visual features: F_v

    ∈ ℝ^{(H/16)×(W/16)×C} ▸ Output: multimodal features F_c ∈ ℝ^{(H/16)×(W/16)×C} ▹ F_c = MLP(LN(F_c')) + F_c' ▹ F_c' = MHCA(LN(F_v'), F_t) + F_v' ▹ MHCA: multi-head cross-attention ▹ F_v' = MHSA(LN(F_v)) + F_v ▹ MHSA: multi-head self-attention. Vision-Language Decoder: F_c = MLP(LN(F_c')) + F_c'
  10. ▸ Input ▹ Text representation: F_s ∈ ℝ^{C'} ▹ Multimodal features: F_c

    ∈ ℝ^{(H/16)×(W/16)×C} ▸ Output ▹ Text features: z_t ∈ ℝ^D ▹ z_t = F_s W_t + b_t ▹ Pixel-level features: z_v ∈ ℝ^{(H/4)×(W/4)×D} ▹ z_v = Up(F_c) W_v + b_v. Text-to-Pixel Contrastive Learning: z_t = F_s W_t + b_t, z_v = Up(F_c) W_v + b_v
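The two projections are plain affine maps. A shape sketch, assuming C = C' = 128 and a common embedding dimension D = 64 (hypothetical values), with 4× nearest-neighbour upsampling standing in for Up(·):

```python
import numpy as np

# Shape sketch of the text-to-pixel projections (assumed toy channel sizes).
rng = np.random.default_rng(2)
H = W = 416
C = Cp = 128
D = 64

Fs = rng.standard_normal(Cp)                     # text representation F_s
Fc = rng.standard_normal((H // 16, W // 16, C))  # multimodal features F_c

Wt, bt = rng.standard_normal((Cp, D)), np.zeros(D)
Wv, bv = rng.standard_normal((C, D)), np.zeros(D)

zt = Fs @ Wt + bt                                # z_t in R^D
Fc_up = Fc.repeat(4, axis=0).repeat(4, axis=1)   # Up(.): H/16 -> H/4
zv = Fc_up @ Wv + bv                             # z_v in R^{(H/4) x (W/4) x D}
print(zt.shape, zv.shape)  # (64,) (104, 104, 64)
```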
  11. ā–ø ęå¤±é–¢ę•°ć®å…„åŠ› ā–¹ ćƒ†ć‚­ć‚¹ćƒˆē‰¹å¾“é‡ļ¼šš‘§š‘” ā–¹ ćƒ”ć‚Æć‚»ćƒ«å˜ä½ć®ē‰¹å¾“é‡ļ¼š š‘§š‘£ ā–ø šæš‘š‘œš‘› š‘§š‘”

    , š‘§š‘£ = 1 š’«āˆŖš’© Ļƒš‘–āˆˆš’«āˆŖš’© šæš‘š‘œš‘› š‘– (š‘§š‘” , š‘§š‘£ ) ā–ø šæš‘š‘œš‘› š‘– š‘§š‘” , š‘§š‘£ = ቐ āˆ’ log šœŽ š‘§š‘” āˆ™ š‘§š‘£ š‘– , š‘– ∈ š’« āˆ’ log(1 āˆ’ šœŽ š‘§š‘” āˆ™ š‘§š‘£ š‘– ), š‘– ∈ š’© ā–¹ š’«ļ¼šclass ā€œ1ā€ ā–¹ š’©ļ¼šclass ā€œ0ā€ 13 Text-to-pixel contrastive loss š‘§š‘” = š¹š‘  š‘Šš‘” + š‘š‘” š‘§š‘£ = š‘ˆš‘(š¹ š‘ ) š‘Š š‘£ + š‘š‘£ https://github.com/DerrickWang005/CRIS.pytorch/blob/0df39f073acfb9 e6e17d83536a916548905ecfc3/model/segmenter.py#L59 ć‚·ć‚°ćƒ¢ć‚¤ćƒ‰é–¢ę•°ćøå…„åŠ›ć—ćŸć®ć” Upsamplingć§ć‚‚ćØć®ē”»åƒć‚µć‚¤ć‚ŗćø やっていることはBCEと同じでは? ļ¼ˆć‚³ćƒ¼ćƒ‰ć®ęå¤±é–¢ę•°ćÆBCEć§ć‚ć£ćŸļ¼‰
  12. ▸ Benchmarks 1. RefCOCO: (train, valid) = (120,624, 10,834

    expressions) 2. RefCOCO+: RefCOCO with words for absolute locations excluded 3. G-Ref: average sentence length of 8.4 words, with many words about location and appearance ▸ Training ▹ Image size: 416×416 ▹ Batch size: 64 ▹ Hardware: 8 × Tesla V100 (16 GB VRAM). Experiments: performance comparison on three benchmarks
  13. ▸ Baseline model = CRIS w/o the vision-language decoder and w/o text-to-pixel contrastive learning

    ▸ Setting the number of decoder layers to 4 causes overfitting and degrades performance. Ablation study: verifying the effectiveness of the vision-language decoder and text-to-pixel contrastive learning