[Journal club] PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Semantic Machine Intelligence Lab., Keio Univ.
May 27, 2026
Transcript
PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning (ICLR26)
Keio University, Komei Sugiura Laboratory
Paper authors: Daiki Yoshikawa (1), Takashi Matsubara (1, 2). (1) Hokkaido University, (2) CyberAgent
Daiki Yoshikawa, et al. PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning. ICLR 2026.
2 PHyCLIP: Hyperbolic embeddings that account for hierarchy and compositionality (Overview)
▪ Background
  ▪ VLMs handle both hierarchy and compositionality
  ▪ CLIP [Radford+, ICML21] embeds into a single Euclidean space → hard to represent hierarchy and compositionality at the same time
  ▪ Hyperbolic space is well suited to representing hierarchy, but represents compositionality poorly
▪ Proposed method: PHyCLIP
  ▪ Embeds into the ℓ1-product space of multiple hyperbolic factors
  ▪ Represents compositionality through the simultaneous activation of multiple factors
▪ Results
  ▪ Outperforms existing methods on zero-shot classification / retrieval
  ▪ Improves both the representation of hierarchy and the understanding of compositionality
3 Representing hierarchy and compositionality at the same time is difficult (Background 1/3)
Two kinds of semantic structure that a VLM should handle:
▪ Hierarchy
  ▪ Linguistic concepts can be organized as a tree (e.g., WordNet [Miller, 95])
  ▪ Example: dog ⪯ mammal ⪯ animal
  ▪ Concepts lower in the order are more specific
▪ Compositionality
  ▪ Example: "a dog in a car"
  ▪ Images and sentences are co-occurrences of multiple concepts
CLIP [Radford+, ICML21] represents each input as a single vector in a single Euclidean space
✗ Hierarchy and compositionality cannot both be represented in that one space
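4 Hyperbolic space represents hierarchy naturally (Background 2/3)
Poincaré Embeddings [Nickel+, NeurIPS17]
▪ Background
  ▪ Words and graphs contain a latent hierarchy
  ▪ A low-dimensional Euclidean space cannot represent a deep hierarchy (∵ Euclidean volume grows polynomially, while a hierarchy grows exponentially)
▪ Proposal: embed into the Poincaré model
  ▪ Hyperbolic space expands exponentially → a hierarchy is represented naturally as a continuous tree
  ▪ The norm $\|u\|$ encodes the level in the hierarchy, and the distance $d(u, v)$ encodes similarity
▪ Results
  ▪ Embedding large taxonomies such as WordNet [Miller, 95]
  ► Surpasses prior methods in both representational capacity and generalization
  ► Maintains high accuracy even in low dimensions
[Figure: the WordNet mammal subtree trained in a 2-dimensional hyperbolic space]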
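5 Compositionality has a Boolean-like structure (Background 3/3)
▪ Images and sentences can be expressed as co-occurrences of multiple concepts
  ▪ "a dog in a car" → {dog, car}
  ▪ "a cat and a bike" → {cat, bike}
✗ A single hyperbolic space that represents hierarchy cannot represent the co-occurrence of multiple concepts
▪ Interpretation as a Boolean algebra
  ▪ Atomic concepts: $C = \{c_1, c_2, \dots, c_n\}$
  ▪ Compound concept: $S \subseteq C$
  ▪ One bit records whether each atomic concept is present → the distance between compound concepts $S$ and $T$ is the Hamming distance, which corresponds to the ℓ1-product (the sum of the distances in the individual hyperbolic spaces)
Example: $C = \{\text{dog}, \text{cat}, \text{car}, \text{bike}\}$, $S = \{\text{dog}, \text{car}\}$, $T = \{\text{dog}\}$
  $b(S) = (1, 0, 1, 0)$, $b(T) = (1, 0, 0, 0)$
  Hamming distance: $d_{\mathrm{Ham}}(b(S), b(T)) = 1$
  ℓ1-product: $d_1(x, y) = \sum_{i=1}^{k} d_{\mathbb{H}_i}(x^{(i)}, y^{(i)})$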
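The bit-vector example on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `to_bits` and `hamming` are made up here and are not from the paper or the slides.

```python
def to_bits(concepts, vocabulary):
    """Encode a set of atomic concepts as a 0/1 vector over the vocabulary."""
    return [1 if c in concepts else 0 for c in vocabulary]


def hamming(a, b):
    """Count the positions where two equal-length bit vectors differ.

    For 0/1 vectors this equals the l1 distance sum(|a_i - b_i|).
    """
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


vocabulary = ["dog", "cat", "car", "bike"]
s = to_bits({"dog", "car"}, vocabulary)  # compound concept {dog, car}
t = to_bits({"dog"}, vocabulary)         # compound concept {dog}
distance = hamming(s, t)                 # only the "car" bit differs
```

Replacing each bit with a point in its own hyperbolic factor, and each bit difference with a hyperbolic distance, gives the ℓ1-product distance that the slide contrasts with the Hamming distance.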
6 Vision-Language Representation Learning (Related work)
Method: overview and characteristics
▪ CLIP [Radford+, ICML21]
  ▪ Maps images and text into a single Euclidean space
  ✗ Does not handle hierarchy or compositionality explicitly
▪ MERU [Desai+, ICML23]
  ▪ Extends CLIP's embedding space to hyperbolic space
  ► Represents the tree structure of a hierarchy
  ✗ Does not consider compositionality
▪ HyCoCLIP [Pal+, ICLR25]
  ▪ Introduces bounding box supervision
  ▪ Introduces hyperbolic entailment cones
  ► Learns the object-level hierarchy explicitly
  ✗ Handles compositionality only to a limited extent
[Figures: MERU, HyCoCLIP]
7 Overall picture of PHyCLIP (Proposed method 1/3)
▪ Embeds into a space made of multiple hyperbolic factors
▪ $k$ hyperbolic spaces $\mathbb{H}^d_i$, each $d$-dimensional → $kd$ dimensions in total
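8 Embedding images and text into hyperbolic space (Proposed method 2/3)
▪ Embedding into hyperbolic space
  ▪ Split the $kd$-dimensional feature into $k$ parts
  ▪ Map each part $v^{(i)}$ into a hyperbolic space: $v^{(i)} \in \mathbb{R}^d \mapsto x^{(i)} \in \mathbb{H}^d_i$
▪ Definition of the distance (ℓ1-product metric)
  $d_1(X, Y) = \sum_{i=1}^{k} d_{\mathbb{H}_i}(x^{(i)}, y^{(i)})$
  $d_{\mathrm{avg}}(X, Y) = \frac{1}{k} d_1(X, Y)$
▪ Cropped images and text are used at the object level
  ▪ Inputs: $I$, $T$, $I^{\mathrm{box}}$, $T^{\mathrm{box}}$
  ▪ An image is more specific than its text
  ▪ The original image/text is more specific than the cropped one
Entailment relations: $I \preceq T$, $I^{\mathrm{box}} \preceq T^{\mathrm{box}}$, $I \preceq I^{\mathrm{box}}$, $T \preceq T^{\mathrm{box}}$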
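The split, map, and sum steps on this slide can be sketched in plain Python. This is a hedged illustration rather than the authors' implementation: it assumes every factor uses the Lorentz model with one fixed curvature parameter `alpha` (the appendix states that curvature is learnable), and the function names are hypothetical.

```python
import math


def exp_map_origin(v, alpha):
    """Lift a tangent vector v in R^d onto the hyperboloid of curvature -alpha.

    Returns (x0, x_space) satisfying -x0^2 + |x_space|^2 = -1/alpha.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return 1.0 / math.sqrt(alpha), list(v)
    scale = math.sinh(math.sqrt(alpha) * norm) / (math.sqrt(alpha) * norm)
    space = [scale * c for c in v]
    x0 = math.sqrt(1.0 / alpha + sum(c * c for c in space))
    return x0, space


def lorentz_distance(x, y, alpha):
    """Geodesic distance between two hyperboloid points given as (x0, x_space)."""
    x0, xs = x
    y0, ys = y
    inner = -x0 * y0 + sum(a * b for a, b in zip(xs, ys))
    # Clamp at 1 to guard against rounding just below the domain of acosh.
    return math.acosh(max(1.0, -alpha * inner)) / math.sqrt(alpha)


def split(v, k):
    """Split a k*d-dimensional feature into k chunks of d dimensions."""
    d, rest = divmod(len(v), k)
    if rest:
        raise ValueError("feature length must be divisible by k")
    return [v[i * d:(i + 1) * d] for i in range(k)]


def l1_product_distance(u, v, k, alpha=1.0):
    """Sum of per-factor hyperbolic distances (assumed shared curvature)."""
    return sum(
        lorentz_distance(exp_map_origin(cu, alpha), exp_map_origin(cv, alpha), alpha)
        for cu, cv in zip(split(u, k), split(v, k))
    )


def avg_distance(u, v, k, alpha=1.0):
    """The l1-product distance divided by the number of factors."""
    return l1_product_distance(u, v, k, alpha) / k
```

With `k = 2` and `alpha = 1`, a feature that differs from the zero vector only in its first chunk contributes distance only through the first factor, which is the additive behaviour the ℓ1-product is meant to provide.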
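9 Loss function: learn correspondence and hierarchy at the same time (Proposed method 3/3)
Loss: $\mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{cont}} + \gamma \mathcal{L}_{\mathrm{ent}}$
Contrastive Loss
▪ Standard InfoNCE
  $\mathcal{L}_{\mathrm{cont}}(\{X_i\}, \{Y_i\}) = -\sum_{i \in B} \log \frac{\exp(-d_{\mathrm{avg}}(X_i, Y_i)/\tau)}{\sum_{j \in B} \exp(-d_{\mathrm{avg}}(X_i, Y_j)/\tau)}$
▪ Averaged over all pairings
  $\mathcal{L}_{\mathrm{cont}} = \frac{1}{4}\Big( \mathcal{L}_{\mathrm{cont}}(\{I_i\}, \{T_i\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i\}, \{I_i\}) + \mathcal{L}_{\mathrm{cont}}(\{I_i^{\mathrm{box}}\}, \{T_i^{\mathrm{box}}\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i^{\mathrm{box}}\}, \{I_i^{\mathrm{box}}\}) \Big)$
Entailment Loss
▪ Entailment cones express the order relation
  $x^{(i)} \in C(y^{(i)}) \iff x^{(i)} \preceq y^{(i)}$
▪ Falling outside the entailment cone is penalized
  $\mathcal{L}_{\mathrm{ent},i}(X, Y) = \max\big(0,\ \phi(x^{(i)}, y^{(i)}) - \eta\,\omega(y^{(i)})\big)$
  $\mathcal{L}_{\mathrm{ent}}(X, Y) = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}_{\mathrm{ent},i}(X, Y)$
  $\phi(x^{(i)}, y^{(i)})$: angle of $x$ as seen from $y$
  $\omega(y^{(i)})$: half-aperture of the cone
  $\eta$: margin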
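The loss on this slide combines an InfoNCE term over averaged distances with a hinge on the cone angles. The sketch below is a rough, framework-free illustration of that structure, not the training code: it assumes the pairwise distances and the per-factor angles have already been computed, and the names `info_nce`, `entailment_loss`, and `overall_loss` are invented for this example.

```python
import math


def info_nce(dist, tau):
    """InfoNCE over a square distance matrix, summed over the batch.

    dist[i][j] is the averaged distance between sample i of one modality
    and sample j of the other; matched pairs sit on the diagonal.
    """
    loss = 0.0
    for i, row in enumerate(dist):
        logits = [-d / tau for d in row]
        m = max(logits)  # subtract the maximum for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_z
    return loss


def entailment_loss(angles, apertures, eta):
    """Mean over factors of max(0, angle_i - eta * half_aperture_i)."""
    terms = [max(0.0, phi - eta * omega) for phi, omega in zip(angles, apertures)]
    return sum(terms) / len(terms)


def overall_loss(l_cont, l_ent, gamma=0.2):
    """Contrastive loss plus the weighted entailment loss (weight 0.2 per the settings slide)."""
    return l_cont + gamma * l_ent
```

The hinge is zero whenever the angle falls inside the scaled aperture, so only pairs that violate the order relation contribute gradient in a real implementation.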
10 Training with GRIT (Experimental settings)
▪ Training dataset
  ▪ GRIT [Peng+, ICCV23]: automatically annotated image-text pairs + bbox
  ▪ 14.0M image-text pairs / 26.6M box annotations
▪ PHyCLIP settings
  ▪ $k = 64$, $d = 8$ (512 dimensions in total)
  ▪ $\gamma = 0.2$
  ▪ optimizer: AdamW
▪ Experimental environment
  ▪ GPU: A100 ×4
  ▪ iterations: 500,000
  ▪ batch size: 768
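11 PHyCLIP outperforms existing methods on image classification (Quantitative results 1/3)
▪ Zero-shot Image Classification
  ► PHyCLIP outperforms existing methods overall (the specialized datasets are out of distribution for GRIT)
  ► Scores are especially high on General → understanding concept families through multiple hyperbolic spaces is effective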
12 PHyCLIP outperforms existing methods on retrieval and hierarchical classification (Quantitative results 2/3)
▪ Zero-shot Retrieval & Hierarchical Classification
  ► PHyCLIP outperforms existing methods on most retrieval metrics
  ► It outperforms existing methods on every Hierarchical Classification metric (how close the predicted label is to the GT on WordNet)
13 PHyCLIP improves the understanding of compositionality (Quantitative results 3/3)
▪ Compositional Understanding
  ▪ Task: pick the GT caption out from hard negatives in which part of the caption has been altered
  ▪ VL-CheckList-Object: an object in the caption is replaced with a different object
  ▪ SugarCrepe: replace/swap/add applied to object/attribute/relation
  ► On VL-CheckList-Object, PHyCLIP outperforms existing methods on every subset → it represents the presence of objects robustly with respect to position and size
  ✗ Performance drops on relation replacement and object swapping → the Boolean-like design is weak at understanding relations between objects
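14 Individual factors make effective use of the embedding space (Qualitative results 1/2)
▪ Image norms are larger than text norms and concentrate in a narrow range (∵ images are more specific than text: $I_i \preceq T_i$)
▪ Within individual factors, the two norm distributions overlap and spread widely
  ► PHyCLIP uses a wide region of the embedding space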
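15 Each hyperbolic factor represents the hierarchy of one concept family (Qualitative results 2/2)
▪ dog activates $\mathbb{H}^d_{39}$, and car activates $\mathbb{H}^d_{9}$
▪ "dog and car" activates both at the same time
▪ $\mathbb{H}^d_{39}$ shows a hierarchy of mammals, and $\mathbb{H}^d_{9}$ shows a hierarchy of vehicles / everyday items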
18 PHyCLIP: Hyperbolic embeddings that account for hierarchy and compositionality (Summary)
▪ Background
  ▪ VLMs handle both hierarchy and compositionality
  ▪ CLIP [Radford+, ICML21] embeds into a single Euclidean space → hard to represent hierarchy and compositionality at the same time
  ▪ Hyperbolic space is well suited to representing hierarchy, but represents compositionality poorly
▪ Proposed method: PHyCLIP
  ▪ Embeds into the ℓ1-product space of multiple hyperbolic factors
  ▪ Represents compositionality through the simultaneous activation of multiple factors
▪ Results
  ▪ Outperforms existing methods on zero-shot classification / retrieval
  ▪ Improves both the representation of hierarchy and the understanding of compositionality
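19 Details of Poincaré Embeddings [Nickel+, NeurIPS17] (Appendix 1/4)
▪ Poincaré model
  ▪ Riemannian metric tensor: $g_x = \left( \frac{2}{1 - \|x\|^2} \right)^2 g^E$ ($g^E$: Euclidean metric tensor)
  ▪ Distance between points $u, v \in \mathcal{B}^d$: $d(u, v) = \operatorname{arcosh}\left( 1 + 2 \frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)} \right)$
▪ Optimization: $\theta_{t+1} \leftarrow \operatorname{proj}\left( \theta_t - \eta_t \frac{(1 - \|\theta_t\|^2)^2}{4} \nabla_E \right)$
▪ Loss: $\mathcal{L}(\Theta) = \sum_{(u, v) \in \mathcal{D}} \log \frac{e^{-d(u, v)}}{\sum_{v' \in \mathcal{N}(u)} e^{-d(u, v')}}$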
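The distance and the update rule on this slide translate directly into code. The following is a small sketch under stated assumptions: the projection simply rescales a point that leaves the unit ball back to norm `1 - eps`, which is one common choice and not necessarily the exact projection used in the original work, and `rsgd_step` is a hypothetical name.

```python
import math


def _sq_norm(x):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - _sq_norm(u)) * (1.0 - _sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def rsgd_step(theta, euclid_grad, lr, eps=1e-5):
    """One Riemannian SGD step: rescale the Euclidean gradient, then project.

    The factor (1 - |theta|^2)^2 / 4 is the inverse of the Poincaré metric.
    """
    scale = (1.0 - _sq_norm(theta)) ** 2 / 4.0
    new = [t - lr * scale * g for t, g in zip(theta, euclid_grad)]
    norm = math.sqrt(_sq_norm(new))
    if norm >= 1.0:
        # Assumed projection: pull the point back just inside the ball.
        new = [c / norm * (1.0 - eps) for c in new]
    return new
```

Points near the boundary get a very small rescaling factor, so updates shrink as an embedding moves outward, which keeps it inside the ball in most steps even before the projection applies.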
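20 Implementation details of PHyCLIP (Appendix 2/4)
PHyCLIP implements each hyperbolic factor with the Lorentz model [Nickel+, ICML18] (the curvature $-\alpha_i$ is learnable)
▪ Minkowski inner product: only the time component contributes with a negative sign
  $\hat{x} = (x_0, x_1, \dots, x_d)$, $x = (x_1, \dots, x_d) \in \mathbb{R}^d$
  $\langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}} = -x_0 y_0 + \langle x, y \rangle_{\mathbb{R}^d}$
▪ Hyperbolic space represented as a hyperboloid
  $\mathbb{H}^d_\alpha = \{ \hat{x} \in \mathbb{R}^{d,1} \mid \langle \hat{x}, \hat{x} \rangle_{\mathbb{R}^{d,1}} = -\alpha^{-1},\ x_0 > 0 \}$
▪ Lorentz distance
  $d_{\mathbb{H}^d_\alpha}(\hat{x}, \hat{y}) = \alpha^{-1/2} \operatorname{arccosh}\left( -\alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}} \right)$
▪ Exponential map: maps $v$ to a point in hyperbolic space
  $\hat{x} = \exp^{\alpha}_{\hat{o}}(v) = \cosh(\sqrt{\alpha}\,\|v\|)\,\hat{o} + \frac{\sinh(\sqrt{\alpha}\,\|v\|)}{\sqrt{\alpha}\,\|v\|}\, v$
▪ Entailment cones in the Lorentz model
  $\omega(y) = \sin^{-1}\left( \min\left( 1, \frac{2K}{\sqrt{\alpha}\,\|y\|} \right) \right)$
  $\phi(x, y) = \cos^{-1}\left( \frac{x_0 + \alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\, y_0}{\|y\| \sqrt{\left( \alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}} \right)^2 - 1}} \right)$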
21 Ablation Study (Appendix 3/4)
22 Additional Visualizations (Appendix 4/4)