Slide 1

Having Fun with Reasoning Segmentation Using LISA
ImVisionLabs株式会社  松田 宏文
LLMs and Image Segmentation

Slide 2

What is LISA?
"LISA: Reasoning Segmentation via Large Language Model" (Lai et al., 2023, arXiv:2308.00692)
→ Reasoning segmentation using a large language model
LISA = Large Language Instructed Segmentation Assistant

Slide 3

What is LISA: Example 1
INPUT: "Can you segment the camera lens that is more suitable for photographing nearby objects in this image?"
OUTPUT: "Sure, [SEG]."
Source: https://github.com/dvlab-research/LISA

Slide 4

What is LISA: Example 2
INPUT: "Who was the president of the US in this image? Please output segmentation mask and explain the reason."
OUTPUT: "Sure, the segmentation result is [SEG]. The President of the United States in the image is President Obama."
Source: https://github.com/dvlab-research/LISA

Slide 5

What is LISA?
Conventional approach: INPUT image → NN → OUTPUT segmentation mask
• Target objects must be categorized in advance and specified explicitly
• Implicit instructions cannot be understood
LISA: INPUT image + text → LMM → OUTPUT segmentation mask
• Complex reasoning is possible, using the input text as the query
  Example: "Where's that thing for the TV...? You know, the one that changes channels, that thing" → Answer: the remote control
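
To make the contrast concrete, here is a minimal conceptual sketch in Python. The function names, class list, and signatures are purely illustrative assumptions, not LISA's actual API.

    from typing import Tuple
    import numpy as np

    # Conventional segmentation: the label set is fixed up front.
    CLASSES = ["person", "car", "tv", "remote"]

    def conventional_segment(image: np.ndarray, class_name: str) -> np.ndarray:
        """Return a binary mask, but only for one of the predefined classes."""
        if class_name not in CLASSES:
            raise ValueError("unknown class")  # "the thing that changes channels" fails here
        ...

    def lisa_style_segment(image: np.ndarray, query: str) -> Tuple[str, np.ndarray]:
        """Return (text reply, binary mask) for a free-form query.
        The multimodal LLM resolves implicit references (e.g. "the thing
        that changes TV channels" -> the remote control) before emitting [SEG]."""
        ...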

Slide 6

Complex query example: "Segment the cow"
INPUT: Can you segment the cow?
OUTPUT: ASSISTANT: Sure, [SEG].

Slide 7

Complex query example: "Segment the animal"
INPUT: Can you segment the animal?
OUTPUT: ASSISTANT: Sure, the segmentation result is [SEG].

Slide 8

Complex query example: "Explain what animals are there"
INPUT: Can you segment the animal? Please explain what animals are there.
OUTPUT: ASSISTANT: Sure, [SEG]. In the image, there is a cow statue and a calf statue outside a building.

Slide 9

Complex query example: "Show me the building entrance and how to get there"
INPUT: Can you segment the entrance of building? Please explain how we can get to there.
OUTPUT: ASSISTANT: Sure, the segmentation result is [SEG]. In the image, there is a large white cow statue located outside of a building. To get to the entrance of the building, one would need to walk around the cow statue and approach the building from the front. The cow statue is quite prominent, so it should be easy to spot and navigate around while approaching the building.
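
Whatever front end is used to run these queries, the pattern in the replies is that each [SEG] token corresponds to one predicted mask. The helper below is a generic visualization sketch, not part of the LISA codebase; it assumes the model returns an RGB uint8 image, the text reply, and one boolean mask per [SEG] token.

    import numpy as np

    def overlay_masks(image: np.ndarray, reply: str, masks: list, alpha: float = 0.5) -> np.ndarray:
        """Blend one colored overlay per [SEG] token onto an RGB uint8 image."""
        assert reply.count("[SEG]") == len(masks), "expect one mask per [SEG] token"
        colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
        out = image.astype(np.float32)
        for i, mask in enumerate(masks):  # mask: boolean array with the same H x W as the image
            color = np.array(colors[i % len(colors)], dtype=np.float32)
            out[mask] = (1.0 - alpha) * out[mask] + alpha * color
        return out.astype(np.uint8)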

Slide 10

LISA architecture
Source: https://github.com/dvlab-research/LISA
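
The core idea in the paper is "embedding as mask": the multimodal LLM's vocabulary is extended with a [SEG] token, the last-layer hidden state at that token's position is projected into SAM's prompt-embedding space, and SAM's mask decoder turns that embedding into a mask. Below is a simplified PyTorch-style sketch of the forward pass; module and argument names are illustrative, not those of the actual repository.

    import torch.nn as nn

    class LisaSketch(nn.Module):
        """Simplified 'embedding-as-mask' forward pass (illustrative names only)."""

        def __init__(self, mm_llm, sam_image_encoder, sam_mask_decoder,
                     llm_hidden_dim, sam_prompt_dim, seg_token_id):
            super().__init__()
            self.mm_llm = mm_llm                        # LLaVA-style multimodal LLM with an extra [SEG] token
            self.sam_image_encoder = sam_image_encoder  # SAM vision backbone (kept frozen in the paper)
            self.sam_mask_decoder = sam_mask_decoder    # SAM decoder, prompted here by a text-derived embedding
            self.seg_token_id = seg_token_id
            self.proj = nn.Linear(llm_hidden_dim, sam_prompt_dim)  # projects the [SEG] hidden state

        def forward(self, image, token_ids):
            # token_ids already contain the (teacher-forced or generated) [SEG] token(s).
            out = self.mm_llm(images=image, input_ids=token_ids, output_hidden_states=True)
            last_hidden = out.hidden_states[-1]                   # (batch, seq_len, llm_hidden_dim)
            seg_positions = token_ids == self.seg_token_id        # where the [SEG] token(s) sit
            prompt_emb = self.proj(last_hidden[seg_positions])    # one prompt embedding per [SEG]
            image_emb = self.sam_image_encoder(image)             # SAM image features
            masks = self.sam_mask_decoder(image_emb, prompt_emb)  # one binary mask per [SEG]
            return out.logits, masks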

Slide 11

Segment Anything Model (Meta): Vision Backbone
Source: https://github.com/facebookresearch/segment-anything
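
Standalone SAM usage follows the pattern shown in the segment-anything README, as sketched below (the checkpoint file, image path, and click coordinates are placeholders). LISA reuses this image encoder and mask decoder, but replaces the point/box prompt with the projected [SEG] embedding.

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    # Load a SAM checkpoint (ViT-H here) and wrap it in a predictor.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # SAM expects RGB; OpenCV loads BGR, so convert.
    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # Prompt with a single foreground click (label 1) at a placeholder coordinate.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    print(masks.shape, scores)  # (3, H, W) candidate masks and their quality scores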

Slide 12

LLaVA: Large Language and Vision Assistant (the multimodal LLM)
Source: https://github.com/haotian-liu/LLaVA
Demo: https://llava.hliu.cc/
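
LLaVA itself can be queried as an ordinary image-plus-text chat model. The sketch below uses the Hugging Face transformers port (assuming a recent transformers version with LLaVA support and the llava-hf/llava-1.5-7b-hf checkpoint); the original repository also provides its own CLI and the web demo linked above.

    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # LLaVA-1.5 expects the <image> placeholder inside a USER/ASSISTANT-style prompt.
    image = Image.open("example.jpg")
    prompt = "USER: <image>\nWhat animals are in this picture? ASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(output_ids[0], skip_special_tokens=True))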

Slide 13

VisionLLM
Source: https://github.com/OpenGVLab/VisionLLM

Slide 14

Summary
• Even with rough, casual queries, LISA can segment a wide variety of objects thanks to the LLM's reasoning
• Internally it combines two components: a vision backbone (SAM) and a multimodal LLM (LLaVA)
• As a different approach, there are also LMMs that perform segmentation end to end (e.g. VisionLLM)

Slide 15

References
• LISA: Reasoning Segmentation via Large Language Model
  https://github.com/dvlab-research/LISA
  https://arxiv.org/abs/2308.00692
• LLaVA: Large Language and Vision Assistant
  https://github.com/haotian-liu/LLaVA
• Segment Anything
  https://github.com/facebookresearch/segment-anything
• VisionLLM
  https://github.com/OpenGVLab/VisionLLM
  https://arxiv.org/abs/2305.11175