Wang, Jorma Laaksonen
Department of Computer Science, Aalto University, Finland
AACL 2022
Keio University, Komei Sugiura Laboratory, 後神美結
Guo, Z., Wang, T. J., & Laaksonen, J. (2022, November). CLIP4IDC: CLIP for Image Difference Captioning. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 33-42).
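Real-world image pairs with captions describing the changes between the images
o Image-Editing-Request: image pairs from before and after an image edit, with the corresponding editing instructions
• Training environment
o adaptation: V100 GPU ×2
o captioning: V100 GPU ×1
"The person walking is no longer there"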
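CLIP4IDC: the person walking in the parking lot is gone
⇒ The content matches, and the generated caption even contains more information.
GT: there is a smaller group of people in the lot
CLIP4IDC: there are two people in the right image
⇒ The phrasing does not describe a change, but the content and the point of focus are correct.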
⇒ Use multi-level visual representations
GT: the person walking is no longer there
CLIP4IDC: the person in the parking lot is gone
GT 1: the people by the building have moved and joined others
GT 2: the people in the parking lot have left
CLIP4IDC: the people are in the parking lot
o Adaptation (Image-Text Retrieval)
o Fine-tuning
• Results
o Achieved SOTA in BLEU, METEOR, CIDEr, and ROUGE on CLEVR-Change, Spot-the-Diff, and Image-Editing-Request
o A method extending this approach has been proposed and is under review at CoRL 2024
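The adaptation stage above is framed as image-text retrieval. A common way to train this kind of retrieval is a symmetric contrastive (InfoNCE-style) loss over a batch of matched (image pair, caption) embeddings. The sketch below illustrates that general idea only. It assumes the embeddings are already computed, and the function names, the temperature of 0.07, and the random inputs are all illustrative assumptions, not details taken from the slides or from the authors' code.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_retrieval_loss(pair_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched embeddings.

    pair_emb: (batch, dim) embeddings of image pairs (hypothetical input).
    text_emb: (batch, dim) embeddings of the matching captions.
    Row i of pair_emb is assumed to match row i of text_emb.
    """
    pair_emb = l2_normalize(pair_emb)
    text_emb = l2_normalize(text_emb)
    logits = pair_emb @ text_emb.T / temperature  # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    # Image pair -> text: each row should rank its own caption highest.
    loss_p2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image pair: each column should rank its own image pair highest.
    loss_t2p = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_p2t + loss_t2p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_pairs = rng.normal(size=(4, 8))  # stand-in image-pair embeddings
    demo_texts = rng.normal(size=(4, 8))  # stand-in caption embeddings
    print("loss on random embeddings:", symmetric_retrieval_loss(demo_pairs, demo_texts))
```

With correctly matched embeddings the loss is lower than with mismatched ones, which is what pushes the visual and text representations toward each other before the captioning model is fine-tuned.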
blue ball became yellow
⇒ The content matches.
GT: the big purple metal block behind the green thing changed to rubber
CLIP4IDC: the large purple metal block that is behind the big purple metal sphere became rubber
⇒ The positional description differs from the GT, but the content is correct.
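Caption pairs like the GT and CLIP4IDC outputs above are scored with n-gram overlap metrics such as BLEU. As a rough illustration of why a caption can get the content right and still lose points, the sketch below computes clipped n-gram precision, the core quantity inside BLEU, for the CLEVR-Change example. It is not full BLEU: there is no brevity penalty and no averaging over n-gram orders. Whitespace tokenization and the function names are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against references.

    Each candidate n-gram is credited at most as many times as it appears
    in any single reference, as in the modified precision used by BLEU.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    gt = "the big purple metal block behind the green thing changed to rubber"
    pred = ("the large purple metal block that is behind "
            "the big purple metal sphere became rubber")
    print("unigram precision:", clipped_precision(pred, [gt], n=1))
    print("bigram precision:", clipped_precision(pred, [gt], n=2))
```

Synonyms such as "large" for "big" or "became" for "changed to" receive no credit under exact n-gram matching, so the score understates how well the content agrees.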