EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing

EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing
修　浩毅
9th SatAI.challenge Study Session

Contents
• Self-introduction
• One-page summary of the study
• Background (Introduction)
• Method
• Experiments
• Conclusion

About the presenter

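Self-introduction
修　浩毅, Data Platform Research Team, AIST
• 3D point cloud analysis
• Computer graphics
• Building damage detection
Figures: point cloud segmentation; normal estimation from point clouds; building damage detection from airborne LiDAR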

One-page summary of the study

EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing
EarthMarker is built as the first multimodal large language model (MLLM) in the RS field that supports visual prompting.
• Proposes the first MLLM for RS that supports visual prompting; it can interpret satellite imagery at the point, region, and image levels.
• Proposes a visual prompting training framework.
• Proposes an RS visual prompting dataset (RSVP).
• Achieves SoTA on downstream tasks.
Source: Zhang et al. (2025), “EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing”, IEEE TGRS.

Background

Background: Prompting
• Once pre-trained, a large language model (LLM) can handle tasks and data distributions it never saw during training when it is given a prompt.
Source: Brown et al. (2020), “Language Models are Few-Shot Learners”, arXiv.
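As a rough illustration of the prompting idea, the sketch below assembles a few-shot prompt as plain text. The function name `build_few_shot_prompt`, the "Q:/A:" layout, and the translation pairs are assumptions made for illustration; they are not taken from the slides.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, solved examples, then the new query."""
    lines = [task_description, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    # The final answer slot is left empty for the model to complete.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


# Hypothetical demonstration pairs. No weights are updated: the task is
# specified only through the text of the prompt.
prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The same pretrained model can be pointed at a different task by changing only the description and the examples.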

Background: Visual prompting
• Building a model that supports visual prompting yields a vision model that can handle tasks and data distributions it never saw during training.
Source: Kirillov et al. (2023), “Segment Anything”, arXiv.

Background: Visual prompting for remote sensing
• An RS model that supports visual prompting can:
  ◦ handle unseen tasks and data
  ◦ analyze only the areas of interest
Source: Zhang et al. (2025), IEEE TGRS.

Background: Visual prompting for remote sensing
• Problems when applying visual prompting to RS images:
  ◦ Remote sensing images are hard to interpret
    ▪ scale variations
    ▪ cross-category diversity
    ▪ complex contextual semantic information
  ◦ Natural language alone cannot define a region precisely
  ◦ (Prior work) Image-level interpretation is possible, but finer-grained interpretation (region-level, point-level) is difficult
With these problems in mind, the work builds an RS MLLM that supports visual prompting.
Source: Zhang et al. (2025), IEEE TGRS.

Method

Model architecture
Sharing visual encoding module
• The image and the visual prompt are processed by a common encoder, which produces features that better capture the relationship between the two
• MoV: parallel encoders (DINOv2-ViT L/14 & CLIP-ConvNeXt)
• The image is converted to multiple resolutions
• The visual prompt is rendered as an image
Source: Zhang et al. (2025), IEEE TGRS.
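Two ideas from this slide can be sketched in a few lines: rendering a visual prompt (a box or a point) as an image-sized mask so that it can pass through the same encoder as the image, and producing several resolutions of an input. The helper names, the binary-mask rendering, and the 2x average-pooling pyramid are assumptions for illustration; the slides do not say how EarthMarker rasterizes prompts or which resolutions it uses.

```python
def render_box_prompt(height, width, box):
    """Rasterize a box prompt (x1, y1, x2, y2), inclusive pixel coords, as a binary mask."""
    x1, y1, x2, y2 = box
    return [[1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
             for x in range(width)] for y in range(height)]


def render_point_prompt(height, width, point, radius=1):
    """Rasterize a point prompt (x, y) as a small square blob in a binary mask."""
    px, py = point
    return [[1.0 if (abs(x - px) <= radius and abs(y - py) <= radius) else 0.0
             for x in range(width)] for y in range(height)]


def downsample_2x(image):
    """Halve the resolution by averaging each 2x2 block (height and width must be even)."""
    h, w = len(image), len(image[0])
    return [[(image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)] for y in range(0, h, 2)]


def multi_resolution(image, levels=3):
    """Return [full, half, quarter, ...] versions of a single-channel image."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


box_mask = render_box_prompt(8, 8, (2, 2, 5, 5))
point_mask = render_point_prompt(8, 8, (4, 4), radius=0)
pyramid = multi_resolution(box_mask, levels=3)
print([(len(p), len(p[0])) for p in pyramid])
```

Because the prompt mask has the same height and width as the image, both can be fed to one encoder without a separate prompt-specific branch.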

Model architecture
Modality-align projection
• Projects the visual tokens (image and visual prompt features) into a feature space that the language model can handle
Text tokenizer
• The text tokenizer embeds the text instruction
The two are combined into a multimodal input sequence, which is fed to the LLM.
Source: Zhang et al. (2025), IEEE TGRS.
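The projection and concatenation steps can be sketched with toy dimensions. The single linear layer, the sizes, the random weights, the whitespace tokenizer, and the ordering (visual tokens first, then text tokens) are all assumptions for illustration; the slides only state that visual tokens are projected into the language model's feature space and combined with the embedded text instruction.

```python
import random

random.seed(0)

VISUAL_DIM = 6   # hypothetical width of the visual encoder output
LLM_DIM = 4      # hypothetical width of the LLM embedding space
VOCAB = {"<unk>": 0, "please": 1, "describe": 2, "the": 3, "region": 4}


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Apply a linear map to every token: (n, d_in) x (d_in, d_out) -> (n, d_out)."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(len(weight[0]))]
            for t in tokens]


def embed_text(text, table):
    """Whitespace-tokenize the instruction and look up one embedding row per token."""
    ids = [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]
    return [table[i] for i in ids]


projection = random_matrix(VISUAL_DIM, LLM_DIM)       # stands in for the modality-align projection
embedding_table = random_matrix(len(VOCAB), LLM_DIM)  # stands in for the LLM text embeddings

image_tokens = random_matrix(5, VISUAL_DIM)   # placeholder image features
prompt_tokens = random_matrix(5, VISUAL_DIM)  # placeholder visual-prompt features

visual = project(image_tokens + prompt_tokens, projection)
text = embed_text("Please describe the region", embedding_table)
sequence = visual + text  # multimodal input sequence passed to the LLM
print(len(sequence), len(sequence[0]))
```

After projection every token, visual or textual, has the same width, which is what allows them to share one input sequence.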

Model architecture
LLM decoder
• Combining the image with the visual prompt lets the model handle image-, region-, and point-level representations
• Adding the text instruction lets the model generate an interpretation that follows the text
Source: Zhang et al. (2025), IEEE TGRS.

Cross-domain training
Cross-domain training is proposed to bridge the domain gap between natural images and remote sensing (RS) data and to adapt general-domain knowledge effectively to the RS domain.
Source: Zhang et al. (2025), IEEE TGRS.

Cross-domain training
Phase 1: Multi-domain image-text alignment
Natural images and RS images are used together to align images with text and acquire a comprehensive vision-language representation.
• Datasets
  ◦ COCO Caption
  ◦ RSVP (proposed in this work)
• Only the projection is trained, so only the image-text alignment is learned
Source: Zhang et al. (2025), IEEE TGRS.

Cross-domain training
Phase 2: Spatial Perception Tuning
Acquires spatial, object-level representations.
• Datasets
  ◦ RefCOCO
  ◦ RefCOCO+
• The LLM is trained so that it can handle finer-grained spatial concepts
Source: Zhang et al. (2025), IEEE TGRS.

Cross-domain training
Phase 3: RS Visual Prompting Tuning
Enables the model to follow text instructions and carry out region-level and point-level tasks.
• Dataset: RSVP (proposed in this work)
  ◦ region-text and point-text pairs
• Parameter-efficient training with LoRA
Source: Zhang et al. (2025), IEEE TGRS.
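Phase 3 relies on LoRA. As a reminder of what that means in general, the sketch below shows the standard low-rank formulation, in which a frozen weight W is augmented by a trainable product B A. This is not EarthMarker's actual configuration: the layer size, rank, scaling factor, and initialization scale are hypothetical.

```python
import random

random.seed(0)

D_OUT, D_IN, RANK, ALPHA = 64, 64, 4, 8.0  # hypothetical sizes and scaling


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def rand(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


W = rand(D_OUT, D_IN)                     # frozen pretrained weight
A = rand(RANK, D_IN)                      # trainable, small random init
B = [[0.0] * RANK for _ in range(D_OUT)]  # trainable, zero init so the update starts at zero


def effective_weight(W, A, B, alpha=ALPHA, rank=RANK):
    """W + (alpha / rank) * B @ A, the weight used in the forward pass."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


full_params = D_OUT * D_IN            # parameters updated by full fine-tuning
lora_params = RANK * (D_OUT + D_IN)   # parameters updated by LoRA
print(full_params, lora_params)
```

With these toy sizes LoRA trains 512 parameters instead of 4096 for the layer, and before any training step the effective weight equals W because B starts at zero.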

RS visual prompting dataset construction
Conversion from existing RS datasets
• Tasks: scene classification, referring object classification, image captioning, region captioning, and relationship analyses
• Resolution
• Global
Data structure: each item consists of
• Visual prompts
• User instructions
• (Answers)
• An image
Source: Zhang et al. (2025), IEEE TGRS.
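One way to picture the per-item structure is a small record type. The class and field names (`RSVPItem`, `VisualPrompt`, `image_path`, and so on), the file path, and the sample values are hypothetical; the slides only list the four components. Treating the answer as optional is also an assumption, prompted by the parentheses around "answers" on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisualPrompt:
    kind: str             # "box" or "point" (assumed encoding)
    coords: List[float]   # [x1, y1, x2, y2] for a box, [x, y] for a point


@dataclass
class RSVPItem:
    image_path: str                     # hypothetical: the image referenced by path
    visual_prompts: List[VisualPrompt]
    instruction: str
    answer: Optional[str] = None        # assumption: may be missing, e.g. at inference time


item = RSVPItem(
    image_path="images/example_0001.png",  # hypothetical path
    visual_prompts=[
        VisualPrompt("box", [10, 20, 110, 90]),
        VisualPrompt("box", [200, 40, 240, 80]),
    ],
    instruction="please provide the brief caption of each marked region in the image",
    answer="<Region 1>: A big airplane on the left\n<Region 2>: A small vehicle on the top",
)
print(len(item.visual_prompts), item.visual_prompts[0].kind)
```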

RS visual prompting dataset construction
Image-level
• Converted from image classification and image captioning datasets
• Text instruction: “Please provide a detailed description of the ⟨Region i⟩ in the image”
• The bounding box covers the whole image
Region-level
• Converted from object detection datasets
• GT bounding boxes are used directly as regions
• Text instruction: “please provide the brief caption of each marked region in the image”
• Answer format: “⟨Region 1⟩ : A big airplane on the left\n < ⟨Region 2⟩ : A small vehicle on the top\n, . . . ,‘bbox’: [x1, y1, x2, y2], . . .”
Point-level
• Converted from instance and semantic segmentation datasets
  ◦ Instance segmentation: representative points (?)
  ◦ Semantic segmentation: the image is divided into 32x32 patches and one point is sampled at random from each patch
• Text instruction: “please identify the category of each marked point in the image.”
• Answer format: “⟨Mark 1⟩ : Label 1\n⟨Mark 2⟩ : Label 2\n, . . . ,‘points’: [x1, y1], [x2, y2], . . .”
Source: Zhang et al. (2025), IEEE TGRS.
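The three conversions can be sketched as follows. The function names, the pixel-coordinate convention, and the reading of "32x32 patches" as tiles of 32x32 pixels are assumptions. The answer strings follow the quoted formats only loosely: ASCII angle brackets are used, and the trailing coordinate lists are omitted.

```python
import random


def image_level_box(width, height):
    """Image-level prompt: a bounding box that covers the whole image."""
    return [0, 0, width - 1, height - 1]


def region_level_answer(captions):
    """Format one caption per ground-truth box as '<Region i>: caption' lines."""
    return "\n".join(f"<Region {i}>: {c}" for i, c in enumerate(captions, start=1))


def sample_points(label_map, patch=32, rng=None):
    """Point-level prompts from a semantic mask: one random pixel per patch, with its label."""
    rng = rng or random.Random(0)
    height, width = len(label_map), len(label_map[0])
    points, labels = [], []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Clamp the patch to the image border so edge patches stay in range.
            y = rng.randrange(top, min(top + patch, height))
            x = rng.randrange(left, min(left + patch, width))
            points.append([x, y])
            labels.append(label_map[y][x])
    return points, labels


def point_level_answer(labels):
    """Format one label per sampled point as '<Mark i>: label' lines."""
    return "\n".join(f"<Mark {i}>: {label}" for i, label in enumerate(labels, start=1))


# Toy 64x64 semantic mask: left half "water", right half "building".
label_map = [["water" if x < 32 else "building" for x in range(64)] for y in range(64)]
points, labels = sample_points(label_map)
print(image_level_box(64, 64))
print(point_level_answer(labels))
```

Sampling one point per patch spreads the prompts over the whole image instead of clustering them in large classes.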

RS visual prompting dataset construction
GPT4V-Assisted Visual Prompting Data Generation
• Class labels and the captions in existing datasets are simple
• Existing data is augmented with GPT4V
  ◦ Uses Set-of-Mark (SoM) prompting
Source: Yang et al. (2023), “Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V”, arXiv.

Experiments

Zero-shot scene classification
• Prompt:
  ◦ image-level bounding box
  ◦ “Please identify the object category of each marked region in the image”
• Findings
  ◦ Substantially outperforms non-RS MLLMs
  ◦ More accurate than GeoChat
(Results table, grouped into non-RS and RS models.)
Source: Zhang et al. (2025), IEEE TGRS.

Image Captioning on NWPU-Captions dataset
• Prompt:
  ◦ image-level bounding box
  ◦ “Please provide a brief caption of each marked region in the image.”
• Findings
  ◦ Achieves higher accuracy than existing expert models
(Results table, grouped into non-RS and RS models.)
Source: Zhang et al. (2025), IEEE TGRS.

Referring Object Classification on DIOR-RSVG dataset
• Prompt:
  ◦ region-level bounding box
  ◦ “Please identify the category of the marked region in the image”
• Findings
  ◦ Higher performance than existing non-RS models
  ◦ A large performance gain over existing RS models as well
(Results table, grouped into non-RS and RS models.)
Source: Zhang et al. (2025), IEEE TGRS.

Region Captioning on DIOR-RSVG dataset
• Prompt:
  ◦ region-level bounding box
  ◦ “Please provide a brief caption of each marked region in the image”
• Findings
  ◦ Higher performance than existing non-RS models
  ◦ A large performance gain over existing RS models as well
Source: Zhang et al. (2025), IEEE TGRS.

Ablation studies and computation analysis
Ablation studies
• Shared encoder: combining a CNN with a ViT works best
• Dataset combination: using all the datasets together works best
Computation cost
• GPU: an NVIDIA RTX A6000
• The computation cost is relatively high
Source: Zhang et al. (2025), IEEE TGRS.

Visualization
Source: Zhang et al. (2025), IEEE TGRS.

Visualization (failure cases)
Source: Zhang et al. (2025), IEEE TGRS.

Conclusion

• Built the first visual prompting MLLM for RS
• Proposed a framework and a dataset for building visual prompting MLLMs
• Large performance gains over non-RS and RS models on image-level, region-level, and point-level downstream tasks
Source: Zhang et al. (2025), IEEE TGRS.
