Having Fun with Reasoning Segmentation Using LISA

ImVisionLabs Inc., Hirofumi Matsuda
LLMs and Image Segmentation

What is LISA?
"LISA: Reasoning Segmentation via Large Language Model" (Lai et al., 2023, arXiv:2308.00692)
→ Reasoning segmentation using a large language model
LISA stands for Large Language Instructed Segmentation Assistant.

What is LISA? Example 1
INPUT: "Can you segment the camera lens that is more suitable for photographing nearby objects in this image?"
OUTPUT: "Sure, [SEG]."
Source: https://github.com/dvlab-research/LISA

What is LISA? Example 2
INPUT: "Who was the president of the US in this image? Please output segmentation mask and explain the reason."
OUTPUT: "Sure, the segmentation result is [SEG]. The President of the United States in the image is President Obama."
Source: https://github.com/dvlab-research/LISA
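Both example replies carry the mask as a `[SEG]` placeholder in the text, optionally followed by an explanation. As a minimal sketch of how a caller might split such a reply, the helper below checks for the token and extracts whatever text follows it. The function name `parse_response` and the returned dictionary layout are hypothetical, chosen only for illustration; they are not part of the LISA repository.

```python
SEG_TOKEN = "[SEG]"  # placeholder token seen in the example replies


def parse_response(text: str) -> dict:
    """Split a reply into a mask flag and the explanation after [SEG].

    Hypothetical helper for illustration; not an API of the LISA repo.
    """
    has_mask = SEG_TOKEN in text
    # partition() returns (before, separator, after); 'after' is empty
    # when the token is absent.
    _, _, tail = text.partition(SEG_TOKEN)
    if has_mask:
        # Drop the period and spaces that directly follow the token.
        explanation = tail.lstrip(". ").strip()
    else:
        explanation = text.strip()
    return {"has_mask": has_mask, "explanation": explanation}


reply = ("Sure, the segmentation result is [SEG]. The President of the "
         "United States in the image is President Obama.")
print(parse_response("Sure, [SEG]."))
print(parse_response(reply))
```

A reply without the token is returned unchanged as the explanation, with `has_mask` set to `False`, so the caller can decide whether to request a mask again.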

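What is LISA?
Conventional: INPUT image → NN → OUTPUT segmentation mask
• Target objects must be categorized in advance and specified explicitly
• Implicit instructions cannot be understood
LISA: INPUT image + text → LMM → OUTPUT segmentation mask
• Complex reasoning is possible, using the input text as the query
Example: "That TV thing, where did it go...? The thing that changes the channel, you know, that one" → Answer: the remote control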
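To make the limitation of the conventional setup concrete, here is a toy sketch of a segmenter front end that only accepts categories from a fixed, predefined list. The category set and the function `resolve_category` are hypothetical and stand in for any closed-vocabulary pipeline; the sketch shows only why an implicit description finds no target, not how LISA resolves it.

```python
from typing import Optional

# Hypothetical closed vocabulary of a conventional, category-based segmenter.
KNOWN_CATEGORIES = {"remote", "tv", "sofa"}


def resolve_category(query: str) -> Optional[str]:
    """Return a category only if the query names it explicitly."""
    words = query.lower().replace("?", " ").replace(",", " ").split()
    for word in words:
        if word in KNOWN_CATEGORIES:
            return word
    return None  # nothing in the query matches the fixed vocabulary


explicit = resolve_category("segment the remote")
implicit = resolve_category("the thing that changes channels")
print(explicit, implicit)
```

The explicit query resolves to a known category, while the implicit one resolves to nothing, because mapping "the thing that changes channels" to a remote control requires reasoning that a vocabulary lookup does not perform.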

Complex query example: "Segment the cow"
INPUT: Can you segment the cow?
OUTPUT: ASSISTANT: Sure, [SEG].

Complex query example: "Segment the animal"
INPUT: Can you segment the animal?
OUTPUT: ASSISTANT: Sure, the segmentation result is [SEG].

Complex query example: "Explain what animals are there"
INPUT: Can you segment the animal? Please explain what animals are there.
OUTPUT: ASSISTANT: Sure, [SEG]. In the image, there is a cow statue and a calf statue outside a building.

Complex query example: "Show me the building entrance and how to get there"
INPUT: Can you segment the entrance of building? Please explain how we can get to there.
OUTPUT: ASSISTANT: Sure, the segmentation result is [SEG]. In the image, there is a large white cow statue located outside of a building. To get to the entrance of the building, one would need to walk around the cow statue and approach the building from the front. The cow statue is quite prominent, so it should be easy to spot and navigate around while approaching the building.

LISA architecture
Source: https://github.com/dvlab-research/LISA
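As described in the LISA paper, the multimodal LLM's last-layer hidden state at the `[SEG]` token is projected and handed to a mask decoder together with the image features from the vision backbone. The pure-Python toy below sketches only that data flow under loud assumptions: random numbers replace real model outputs, a single linear layer replaces the paper's MLP projection, a thresholded dot product replaces the real mask decoder, and `SEG_TOKEN_ID` and all dimensions are made up.

```python
import random

SEG_TOKEN_ID = 32000  # hypothetical vocabulary id for the [SEG] token


def linear(vec, weights):
    """Multiply a vector (length d_in) by a d_in x d_out weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weights))
            for j in range(len(weights[0]))]


def seg_embedding(token_ids, hidden_states):
    """Pick the hidden state at the position of the [SEG] token."""
    position = token_ids.index(SEG_TOKEN_ID)
    return hidden_states[position]


def toy_mask_decoder(prompt, features):
    """Toy stand-in for a mask decoder: mark pixels whose feature vector
    has a positive dot product with the prompt embedding."""
    return [[1 if sum(p * f for p, f in zip(prompt, pixel)) > 0 else 0
             for pixel in row]
            for row in features]


random.seed(0)
d_llm, d_vis, height, width = 8, 4, 3, 5  # made-up toy dimensions

# Random stand-ins for the LLM output and the vision backbone features.
token_ids = [1, 15, 27, SEG_TOKEN_ID, 2]
hidden_states = [[random.gauss(0, 1) for _ in range(d_llm)]
                 for _ in token_ids]
projection = [[random.gauss(0, 1) for _ in range(d_vis)]
              for _ in range(d_llm)]
features = [[[random.gauss(0, 1) for _ in range(d_vis)]
             for _ in range(width)]
            for _ in range(height)]

# [SEG] hidden state -> projection -> prompt for the mask decoder.
h_seg = linear(seg_embedding(token_ids, hidden_states), projection)
mask = toy_mask_decoder(h_seg, features)
for row in mask:
    print(row)
```

The point of the sketch is the wiring: the text side contributes one embedding per `[SEG]` token, the vision side contributes per-pixel features, and the mask comes from combining the two.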

Segment Anything Model (Meta): Vision Backbone
Source: https://github.com/facebookresearch/segment-anything

LLaVA: Large Language and Vision Assistant (multimodal LLM)
Source: https://github.com/haotian-liu/LLaVA https://llava.hliu.cc/

VisionLLM
Source: https://github.com/OpenGVLab/VisionLLM

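Summary
• Thanks to the LLM's reasoning, LISA can segment a wide variety of objects even from rough, loosely worded queries
• Internally it combines two components: a Vision Backbone and a multimodal LLM
• As a different approach, there are also LMMs that can perform segmentation end-to-end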

References
• LISA: Reasoning Segmentation via Large Language Model https://github.com/dvlab-research/LISA https://arxiv.org/abs/2308.00692
• LLaVA: Large Language and Vision Assistant https://github.com/haotian-liu/LLaVA
• Segment Anything https://github.com/facebookresearch/segment-anything
• VisionLLM https://github.com/OpenGVLab/VisionLLM https://arxiv.org/abs/2305.11175
