
Robot Vision and Decision-Making with Computer Vision: Adaptation and Challenges in Space


August 1, 2025: Sphear #1 "Space × AI"
Hironobu Fujiyoshi, Department of AI and Robotics, College of Science and Engineering, Chubu University

Sphear #1: JAXA researchers and leading robotics experts take the stage, covering the front lines of rockets, robots, and lunar exploration.

[About Sphear]
Sphear is a community built around "Space × AI" that aims to jointly accelerate migration to the Moon and Mars. Crossing the boundaries of AI, robotics, and space engineering, it brings together people with diverse perspectives, including engineers, designers, architects, educators, psychologists, and artists, to envision new futures through open thinking and dialogue.


Hironobu Fujiyoshi

July 31, 2025



Transcript

1. Self-introduction: Hironobu Fujiyoshi (fujiyoshi@fsc.chubu.ac.jp), Chubu University.
- Education: technical high school (electronics program), then B.Eng. in electronic engineering, M.Eng., and the doctoral program at Chubu University (Ph.D.).
- Research career: postdoctoral researcher at the Robotics Institute, Carnegie Mellon University; lecturer, associate professor, and then professor in the College of Engineering, Chubu University; visiting researcher at the CMU Robotics Institute; currently heads the Machine Perception and Robotics Group.
- Activities outside the university: director of the Japan Deep Learning Association; cross appointment with DENSO; talk video vol.162, Hironobu Fujiyoshi (Chubu University), "A future changed by '+AI'", https://www.youtube.com/watch?v=... (with guests including a member of Nogizaka46).

2. The evolution of image-recognition technology over a quarter century (timeline figure).
- First generation: hand-crafted features. Face detection with Haar-like features + AdaBoost (light-dark differences from box filters); pedestrian detection with HOG + SVM (histograms of oriented gradients) and co-occurrence variants (Joint HOG, CoHOG); image matching with SIFT (scale-invariant keypoint detection and description) and its faster or binary descendants (SURF, FAST, ORB, BRIEF, CARD); image classification with SIFT + Bag-of-Features, Fisher Vector, and VLAD; texture and pixel-difference methods (Texton filter banks, CHLAC local auto-correlation, Random Forests).
- Second generation: feature representations learned by CNNs. Handwritten-digit classification with early CNNs, then AlexNet, VGG, GoogLeNet, ResNet (residual connections), SENet, and EfficientNet for classification; multi-class object detection with Faster R-CNN (region proposals), YOLO and SSD (single shot), FPN, M2Det, CenterNet (anchor-free), and Mask R-CNN (end-to-end instance segmentation); semantic segmentation with FCN, U-Net, SegNet, PSPNet, and DeepLabv3; plus LIFT (CNN-based local features) and ABN (incorporating human knowledge).
- The right-hand side of the timeline continues into ViT and self-supervised learning (DINO, MAE, SimCLR, MoCo, DETR, SegFormer, SuperGlue), language-image alignment and open-vocabulary recognition (CLIP, YOLO-World, OVSeg), generative and foundation models (DALL-E, SAM), and vision-language(-action) models (BERT/GPT, LLaVA, Gemini, EMMA, GR00T N).

3. Feature extraction in a CNN (convolutional neural network).
- Pipeline: input layer -> conv layer -> pooling layer -> conv layer -> pooling layer -> flatten -> FC layer -> FC layer -> output layer (each conv layer defined by its kernels, kernel size, stride, and padding).
- By repeating convolution and pooling in multiple stages, the network aggregates important features over progressively wider regions, and the fully connected layers acquire position-independent features (local -> middle -> global).
A minimal code sketch follows.

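To make the conv-pool-conv-pool-FC pipeline above concrete, here is a minimal PyTorch sketch; the channel counts, kernel sizes, and the 32x32 input are illustrative assumptions, not the exact configuration on the slide.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),   # local features
            nn.ReLU(),
            nn.MaxPool2d(2),                                        # aggregate over a wider area
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),  # mid-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # flatten the feature maps
            nn.Linear(32 * 8 * 8, 128),          # position-independent (global) features
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))    # e.g. one 32x32 RGB input image
```
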
4. Applying CNNs to diverse tasks by designing and training the architecture for each task.
- Image classification: output class probabilities for the whole image (e.g. "Person"). Representative methods: AlexNet, VGG, GoogLeNet, ResNet, SENet.
- Object detection: output class probabilities and detection regions per grid cell. Representative methods: Faster R-CNN, YOLO, SSD, FPN, M2Det, CenterNet, Mask R-CNN.
- Semantic segmentation: add upsampling layers and output class probabilities per pixel. Representative methods: FCN, U-Net, SegNet, PSPNet, DeepLabv3.
(Diagram legend: convolution layers, pooling layers, upsampling layers; input -> CNN -> output.)

5. The same quarter-century timeline as slide 2, now highlighting the third generation: feature representation acquired by ViT (Vision Transformer).

6. Vision Transformer (ViT) [Dosovitskiy+, ICLR]: an image-classification method that applies the Transformer to the vision domain.
- The image is decomposed into fixed-size patches.
- Self-Attention captures the relationships between patches.
- Achieved state-of-the-art results on classification tasks such as ImageNet.
(The slide reproduces the Transformer architecture figure and the Scaled Dot-Product / Multi-Head Attention figures from the original Transformer paper; the key formula is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.)

7. How Self-Attention works: learning and representing the importance of, and correlations between, pieces of information in an image.
- (1) Each patch embedding e_i is converted into Query, Key, and Value vectors: q_i = W_q e_i, k_i = W_k e_i, v_i = W_v e_i.
- (2) The relevance between patches is computed as attention weights alpha_hat = softmax(QK^T / sqrt(d_k)); each weight expresses how much one patch "attends" to another patch.
(The patch-embedding front end that produces the e_i is sketched below.)

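The patch embeddings e_i that feed step (1) come from the ViT front end of slide 6: the image is split into fixed-size patches and each patch is linearly projected to a token. A minimal sketch, assuming a 224x224 input, 16x16 patches, and a 768-dimensional embedding (the usual ViT-Base settings):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "crop each patch, flatten, linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # positional embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, dim): a sequence of patch tokens e_i
        return x + self.pos

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 768)
```
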
8. How Self-Attention works (continued).
- (3) The output is computed as the weighted sum of the Value vectors: Attention(Q, K, V) = alpha_hat V.
Through steps (1)-(3), the model learns and represents the importance of, and correlations between, regions of the image. A single-head code sketch of these three steps follows.

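A minimal single-head sketch of steps (1)-(3), following the formulas above; the embedding dimension and the use of five tokens (as in the slide's diagram) are just illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)   # (1) project tokens to Query ...
        self.wk = nn.Linear(dim, dim, bias=False)   #     ... Key ...
        self.wv = nn.Linear(dim, dim, bias=False)   #     ... and Value

    def forward(self, e: torch.Tensor) -> torch.Tensor:    # e: (B, N, dim) patch embeddings
        q, k, v = self.wq(e), self.wk(e), self.wv(e)
        d_k = q.size(-1)
        attn = q @ k.transpose(-2, -1) / d_k ** 0.5         # (2) patch-to-patch relevance
        attn = F.softmax(attn, dim=-1)                      #     attention weights alpha_hat
        return attn @ v                                     # (3) weighted sum with V

out = SelfAttention(dim=768)(torch.randn(1, 5, 768))        # five tokens, as in the slide
```
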
9. Evaluating what kind of features ViT captures [Tuji+, CogSci].
(Figure: fraction of "shape" vs. "texture" decisions across shape categories, comparing ResNet-50, AlexNet, VGG-16, GoogLeNet, ViT-B/16, ViT-L/32, and humans.)
- Humans rely primarily on shape.
- CNNs rely primarily on texture.
- ViT relies on shape more strongly than CNNs do.

10. The same quarter-century timeline as slide 2, now highlighting the fourth generation: VLMs (vision-and-language models).

11. Vision-and-Language Models (VLMs): multimodal models that can process both visual and linguistic information.
- Learning from image-text pairs (CLIP): e.g. "a photo of {car}", "a photo of {road}", "a photo of {bike}".
- Generating text (LLaVA, Gemini): Q "Please describe what you see in the image" -> A "This photo shows a blue car driving down a road in the mountains."
- Multimodal: modality-specific encoders and decoders (image, video, audio, text) connected to a large language model.

12. Feature representation learning with CLIP (Contrastive Language-Image Pre-training).
- Contrastive learning ties image features and language features together.
- About 400 million image-text pairs collected from the web are used for training.
(Figure from the CLIP paper: (1) contrastive pre-training builds an N x N cosine-similarity matrix between image features I_i and text features T_j and trains with a cross-entropy loss toward the ideal (diagonal) similarity pattern; (2) a classifier is created from label text such as "A photo of a {object}."; (3) the model is used for zero-shot prediction.)
A sketch of this contrastive loss follows.

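A sketch of the contrastive objective described above, assuming the image and text features have already been produced by some pair of encoders; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    img_feat = F.normalize(img_feat, dim=-1)         # unit-length features, so dot = cosine
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature   # (N, N) similarity matrix I_i . T_j
    targets = torch.arange(len(logits))              # ideal pattern: diagonal pairs match
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```
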
13. Feature representation with CLIP: because images and language are tied together, class labels can be predicted zero-shot.
- Extract text features from prompt templates filled with each class name (e.g. "A photo of a plane / car / dog / bird.").
- Extract image features from the image to be classified.
- Compute the cosine similarity between the image and text features; the class of the most similar text is the prediction (in the example, the image is judged to be the "dog" class).
A zero-shot classification sketch follows.

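Zero-shot classification as described above can be sketched with the reference openai/CLIP package; the image file name and the class list are placeholders.

```python
import torch
import clip                      # reference implementation: github.com/openai/CLIP
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["plane", "car", "dog", "bird"]
texts = clip.tokenize([f"A photo of a {c}." for c in classes]).to(device)
image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)   # placeholder path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)          # cosine similarity via
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)          # normalized dot products
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(classes[probs.argmax().item()])    # class of the most similar prompt
```
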
14. The effect of tying images to language (captions). For the airplane class of CIFAR, the similarity between each image and several prompts is computed, and each image is outlined in the color of its most similar prompt.
- Blue: "a photo of a airplane flying through the blue sky"
- Red: "a black and white photo of a airplane"
- Green: "a photo of a airplane that landed on the ground"
-> Even within a single class, the model acquires flexible features and knowledge aligned with language.

15. (Recap of slide 11) Vision-and-Language Models: multimodal models that process visual and linguistic information, either trained on image-text pairs (CLIP) or generating text from images (LLaVA, Gemini).

16. A VLM that processes visual and text information jointly to generate text (ViT + LLM).
- Self-Attention: represents features based on the relationships among the image patches (input tokens).
- Cross-Attention: represents features based on the relationship between the image output tokens and the question text.
(Diagram: image input -> Image Encoder (ViT) -> output tokens -> Cross-Attention inside the Text Decoder (LLM); the text input is the question (prompt) and the text output is the answer. The slide reproduces the ViT model-overview figure: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence is fed to a standard Transformer encoder.)

17. Main VLM tasks and examples.
- Visual Question Answering (VQA): answer a question posed in text about an image. Example: "What is the dog doing in this image?" -> "It is chasing a ball."
- Image Captioning: given an image, describe its content concisely in text. Example: an image of a boy looking at a waterwheel through a magnifying glass -> "A boy is observing a waterwheel."
- Text-to-Image Generation: generate an image from scratch from a text description of objects or a scene. Example: "a dog wearing a colorful suit in front of the ocean" -> a corresponding generated image.
- Referring Expression Comprehension: localize the object in an image referred to by an instruction. Example: "the dog to the right of the person in white clothes" -> a bounding box around the specified dog.

18. The development of large language models (LLMs).
- Pre-training on large-scale text data enables adaptation to a wide range of tasks.
- Context-aware language understanding and generation.
- General-purpose models based on the Transformer have appeared: Transformer [Vaswani+, NIPS], BERT [Devlin+, ACL], the GPT series, RoBERTa [Liu+, ICLR], ALBERT [Lan+, ICLR], DeBERTa [He+, arXiv], LLaMA [Touvron+, arXiv], Vicuna, and others.

19. LLaVA [Liu+, CVPR]: a VLM that combines CLIP's image encoder with a LLaMA-based language decoder (LLM).
- Introduces a vision-language connector that bridges image features and language features.
- Low computational cost and short training time (on the order of a day on A100 GPUs).
(Example: User: "If there are factual errors in the questions, point it out; if not, proceed to answering the question. What's happening in the desert?" LLaVA: "There are no deserts in the image. The image features a beach with palm trees, a city skyline, and a large body of water." Architecture: image encoder (CLIP ViT-L) -> vision-language connector (MLP) -> language decoder (LLaMA), with a tokenizer and embedding layer for the text input.)
A sketch of the connector follows.

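A minimal sketch of the vision-language connector idea, not LLaVA's actual implementation; the feature dimensions (CLIP ViT-L patch features of size 1024, an LLM embedding size of 4096, 576 patch tokens) follow the LLaVA-1.5 setup but are assumptions here.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(            # a small MLP projector, as in LLaVA-1.5
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_features)      # (B, N_patches, llm_dim) "visual tokens"

connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 576, 1024))        # CLIP ViT patch features
text_tokens = torch.randn(1, 32, 4096)                      # embedded prompt tokens (placeholder)
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # sequence fed to the language decoder
```
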
20. How LLaVA is trained.
- (1) Image-language alignment (pre-training): image-text pairs are used to align visual features with the LLM's word-embedding space; only the vision-language connector (MLP) is trained.
- (2) Learning to respond to user instructions grounded in images (instruction tuning): instruction-following data is used; everything except the image encoder is trained (Tuned vs. Frozen in the diagram).

21. How LLaVA is trained (continued). The same two stages as the previous slide: the CLIP image encoder is reused as-is, and the image features are adapted to the language space. A freeze/unfreeze sketch of the two stages follows.

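The two-stage recipe above can be summarized as a freeze/unfreeze schedule. This is a hedged sketch around a hypothetical `model` wrapper with `image_encoder`, `connector`, and `language_decoder` attributes (not LLaVA's real class names).

```python
def set_trainable(module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: int) -> None:
    set_trainable(model.image_encoder, False)       # CLIP ViT stays frozen in both stages
    if stage == 1:                                  # (1) vision-language alignment
        set_trainable(model.connector, True)
        set_trainable(model.language_decoder, False)
    else:                                           # (2) visual instruction tuning
        set_trainable(model.connector, True)
        set_trainable(model.language_decoder, True)
```
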
22. (Recap of slide 11) Vision-and-Language Models: from CLIP-style image-text pair training to text-generating multimodal models such as LLaVA and Gemini, built around modality-specific encoders/decoders and a large language model.

23. Gemini [Anil+, arXiv].
- A multimodal LLM (MLLM) developed by Google that spans images, audio, video, and text.
- Beyond image understanding, it has evolved toward audio understanding, video analysis, code generation, and multilingual support.
- Each modality is processed by a dedicated encoder and integrated into a shared Transformer decoder for inference.
Multimodal: input and output across a variety of modalities.

24. Autonomous driving with a VLM: EMMA [Hwang+, arXiv].
- Uses Gemini, the MLLM developed by Google: an autoregressive model (Transformer decoder) that takes images and text as input and outputs text.
- EMMA is built by casting the driving task into Gemini's input-output interface.
- Input: information about the surroundings (images), driving navigation (text), and past ego-vehicle states (text).
- Output: text describing the driving-task outputs.

25. Planning with EMMA: chain-of-thought prompting. The model reasons through steps R1-R4 before generating the future behavior (waypoints).
- Scene (R1): a description covering weather, time of day, traffic, and road conditions. Example: "The weather is clear and it is daytime. The road is a street without lane separation, with a crosswalk in the middle and cars parked on both sides."
- Critical objects (R2): identify objects that may affect the ego vehicle's driving, expressed in 3D BEV coordinates. Example: "A pedestrian is at [ ]. A vehicle is at [ ]."
- Behavior of critical objects (R3): describe the current state and intent of the critical objects. Example: "The pedestrian is currently standing on the sidewalk looking at the road and may be about to cross."
- Driving decision (R4): summarize the driving plan based on the observations (the decision is classified into a fixed set of categories). Example: "The vehicle should maintain its current low speed."

26. Planning examples with EMMA. (Panels: 3D object detection, road graph, planning.)
- A trash bag (an object outside the detection classes): EMMA plans a path around it.
- A squirrel (also outside the detection classes): EMMA slows down to avoid hitting it.
- A yellow traffic light: EMMA decelerates.
- A red traffic light: EMMA stops.
-> EMMA generates safe and efficient plans across diverse driving scenarios. A prompt-template sketch of the R1-R4 structure follows.

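A hedged sketch of a chain-of-thought driving prompt in the spirit of slides 25-26; this is not EMMA's actual prompt, and `build_prompt` and its arguments are illustrative.

```python
COT_DRIVING_PROMPT = """You are the planner of an autonomous vehicle.
Inputs: camera images, the navigation command, and past ego-vehicle states.
Reason step by step before answering:
R1 Scene description: weather, time of day, traffic and road conditions.
R2 Critical objects: objects that may affect driving, as 3D BEV coordinates.
R3 Behavior of critical objects: their current state and likely intent.
R4 Driving decision: summarize the plan as one decision category.
Finally, output the future trajectory as a list of waypoints (x, y)."""

def build_prompt(nav_command: str, ego_history: str) -> str:
    # Images are passed to the VLM separately; this only assembles the text part.
    return f"{COT_DRIVING_PROMPT}\n\nNavigation: {nav_command}\nEgo history: {ego_history}"
```
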
27. Robot action generation with a VLA model: Gemini Robotics (Google).
- A Vision-Language-Action (VLA) model obtained by fine-tuning Gemini to generate robot actions.
- A cloud-hosted VLA backbone plus a local action decoder that runs on the robot's onboard computer.
(Figure 1 of "Gemini Robotics: Bringing AI into the Physical World": overview of the Gemini Robotics family of embodied AI models; Gemini 2.0 already exhibits capabilities relevant to robotics such as semantic safety understanding and long contexts, and robotics-specific training plus optional specialization enables a variety of behaviors. Video: https://www.youtube.com/watch?v=...)

28. A VLM adapted to space: Space-LLaVA [Foutter+, arXiv]. A pre-trained LLaVA 13B model fine-tuned for applications in the space environment.
(Figure 2: Space-LLaVA is initialized from a pre-trained LLaVA 13B model and fine-tuned on a synthetically generated dataset of, e.g., instruction-following conversations.)

29. The Space-LLaVA data-generation pipeline.
- Goals: generate the high-quality vision-language data needed to build a VLM adapted to space, and provide a basis for strengthening zero-shot performance and for fine-tuning.
- Datasets used:
  - AI4Mars: Mars imagery with terrain masks; used for terrain recognition and generating context-grounded question answering.
  - MICD: Mars imagery with natural-language captions; captions are converted into questions.
  - SpaceScienceQA: QA built from space-science publications; used for scientific knowledge understanding and reasoning.

30. The AI4Mars dataset [Swan+, CVPR Workshops].
- Tens of thousands of images of Martian terrain collected by NASA's Curiosity rover (MSL NAVCAM), with crowd-sourced segmentation masks for four terrain classes: regolith, sand, bedrock, and large rock(s); terrain beyond 30 m is left unlabeled.
- The masks are superimposed on the images and fed to GPT-4o to generate QA pairs for tasks such as terrain description and grain characterization.
(Accompanying text from the Space-LLaVA paper explains the pipeline: the segmentation masks are translated into seven distinct semantic-reasoning tasks; for each task, ten differently phrased questions are designed to discourage over-fitting to one prompt style; the color-coded mask is provided to GPT-4o only as visual context for data curation and is withheld from Space-LLaVA's training; in total about 150k QA tuples are generated, and a separate SpaceScienceQA dataset is built from arXiv astrophysics publications, inspired by cosmosage.)
A sketch of this annotation loop follows.

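The GPT-assisted annotation loop described above can be sketched as follows; `ask_vlm` is a placeholder for whatever VLM API is used (e.g. GPT-4o), and the task names and question phrasings are illustrative rather than the paper's full list of seven tasks and ten paraphrases.

```python
import random
from PIL import Image

TASK_QUESTIONS = {   # several paraphrases per task, to avoid over-fitting to one prompt style
    "terrain_description": [
        "Describe the landscape in view.",
        "What do you see in this image?",
    ],
}

def color_code(image: Image.Image, mask: Image.Image) -> Image.Image:
    # Blend the terrain segmentation mask over the NAVCAM image (assumed same size)
    # so the terrain classes are visible to the annotating VLM.
    return Image.blend(image.convert("RGB"), mask.convert("RGB"), alpha=0.4)

def make_qa_pair(image, mask, task: str, ask_vlm):
    question = random.choice(TASK_QUESTIONS[task])
    answer = ask_vlm(image=color_code(image, mask), prompt=question)
    return {"question": question, "answer": answer}   # one synthetic QA tuple
```
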
31. The Martian Image Caption Dataset (MICD) [Qiu+, Planet. Space Sci.].
- Image-caption pairs describing Martian terrain features: geological structure (e.g. sedimentary rock, rocks, sand, regolith), landforms, and visual characteristics (shape, arrangement, etc.).
- QA pairs are generated by solving the "inverse problem" of deciding a question Q that appropriately precedes an existing caption A.
  Example questions: "Provide a short caption for this image"; "Summarize the relevant features in your view".
- Example caption: "conglomerate outcrops and float rocks and regolith".

32. SpaceScienceQA.
- Designed to deepen a language model's understanding of astrophysics.
- GPT generates QA pairs from recent publications in arXiv's astrophysics categories (cosmology and nongalactic astrophysics; Earth and planetary astrophysics; astrophysics of galaxies; instrumentation and methods for astrophysics; solar and stellar astrophysics).
- Example Q: "What are the main objectives of the Parker Solar Probe mission, and what science instruments does it carry to achieve them?" A: "The PSP mission's main objectives are to investigate the mechanism of coronal heating and the subsequent release of the solar wind. To achieve these objectives, PSP carries science instrument suites such as the Integrated Science Investigation of the Sun (ISOIS) and the electromagnetic fields investigation (FIELDS)."
- Example Q: "How does the initial mass ratio in high-mass X-ray binaries (HMXBs) affect the spin period of the neutron star formed inside?" A: "The initial mass ratio strongly affects the neutron-star spin period. When the mass ratio is close to unity or the binary is very massive, the HMXB's lifetime is shorter; this leads to a strong magnetic field and a relatively large quasi-equilibrium spin period, so the neutron star's spin period becomes longer."

33. How Space-LLaVA is trained: the existing LLaVA 13B model is fine-tuned into a VLM specialized for space.
- Fine-tuning strategy: the vision encoder is frozen; the language model (Vicuna 13B) and the multimodal adapter layer are updated; the loss is cross-entropy over the response tokens (token-by-token prediction).
- Guarding against catastrophic forgetting: training only on space-specific data risks losing general capability, so a portion of the original LLaVA instruction-tuning data is mixed in; this preserves instruction-following ability while adding domain expertise. A sketch of this data mixing follows.

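A hedged sketch of the data-mixing countermeasure; how exactly the "e.g., 20%" figure quoted on slide 30 is applied (as a fraction of the mixed set vs. of the original pre-training data) is an assumption here, and `space_samples` / `general_samples` are placeholders.

```python
import random

def build_finetune_set(space_samples: list, general_samples: list,
                       general_fraction: float = 0.2) -> list:
    # Mix enough general-purpose instruction data that it makes up roughly
    # `general_fraction` of the final fine-tuning set (assumed interpretation).
    n_general = int(len(space_samples) * general_fraction / (1.0 - general_fraction))
    mixed = space_samples + random.sample(general_samples,
                                          min(n_general, len(general_samples)))
    random.shuffle(mixed)
    return mixed
```
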
34. A lunar mobility scenario with Space-LLaVA.
- Input: images (the rover's onboard camera views plus a top-down view of the lunar surface) and text (a task description and a list of proposed waypoints).
- Output: a textual analysis of the proposed path (including an assessment of safety and feasibility) and, if necessary, an alternative set of waypoints.
(Figure 11: a lunar rover and lander in a virtual lunar environment; the rover, equipped with multiple onboard cameras, must navigate from its starting position to the lander, guided by a candidate path plan such as one provided by a hypothetical ground team; the FM serves as a high-level path planner and runtime monitor.)
Task description: "The ground team proposes a plan for the rover to navigate toward the lander. Check the proposed path for hazards and, if necessary, propose an alternative route. The proposed waypoints are: [ ]."
Response: "This path passes through a heavily shadowed area, so hazards may be missed. An alternative navigation plan for the rover to return safely to the lander is proposed below: [ ]."
A prompt sketch follows.

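A hedged sketch of how the runtime-monitor query above might be assembled; `ask_vlm`, the waypoint format, and the hazard wording are placeholders rather than Space-LLaVA's actual interface.

```python
def review_path(candidate_waypoints, images, ask_vlm) -> str:
    task = (
        "The ground team proposes a plan for the rover to navigate to the lander. "
        "Check the proposed path for hazards (e.g., deep shadow, large rocks) and, "
        "if necessary, propose an alternative route.\n"
        f"Proposed waypoints: {candidate_waypoints}"
    )
    # The VLM receives the onboard and top-down images alongside the text prompt and
    # returns a free-form safety/feasibility analysis, optionally with new waypoints.
    return ask_vlm(images=images, prompt=task)

# Example call with hypothetical data:
# review_path([(3, 5), (7, 9)], [onboard_view, topdown_view], ask_vlm=my_vlm_client)
```
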