

YY. PU
June 17, 2025

[Paper Introduction] I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

2025/06/19

Paper introduction@TanichuLab

https://sites.google.com/view/tanichu-lab-ku/


Transcript

  1. I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors. Symbol Emergence System Lab, Journal Club. Presenter: Yongyu Pu
  2. Paper Information
     • Title: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
     • Authors: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn et al.
     • Pub. Date: 2023.07
     • Link: https://aclanthology.org/2023.findings-acl.465/
  3. Background
     • Visual Metaphor
       • A powerful rhetorical device that uses images to communicate creative or persuasive ideas.
     • The Research Task
       • To generate visual metaphors from linguistic metaphors, e.g. turning the text "My bedroom is a pig sty" into a representative image.
  4. Background
     • The Core Challenge for AI
       • Standard diffusion models (like DALL-E 2) find this task difficult because it requires understanding implicit meaning and compositionality.
       • For the "pig sty" metaphor, a model might just create a clean, pink-colored room, completely missing the implicit meaning of "messy".
  5. Problem
     • Core Problem: Standard text-to-image models, like DALL-E 2, perform poorly when processing figurative language, especially metaphors.
       • Failure to Grasp Implicit Meaning
       • Failure in Compositionality
       • Under-specification of Prompts
  6. Research Contribution
     • Novel Approach: LLM-Diffusion Model Collaboration
       • A new method where a Large Language Model (LLM) collaborates with a text-to-image diffusion model.
     • High-Quality Dataset: HAIVMet
       • The creation of HAIVMet (Human-AI Visual Metaphor), a new, large-scale dataset.
     • Thorough Evaluation Framework
       • The paper presents a comprehensive evaluation of this new approach.
  7. Method: A Human-AI Collaboration Framework
     HAIVMet Dataset (Human-AI Visual Metaphor):
     1. Select Visually Grounded Metaphors
     2. CoT Prompting to Generate Visual Elaborations
     3. Human Filtering
  8. Method: A Human-AI Collaboration Framework
     • Step 1: Selecting Visually Grounded Metaphors
       • Visually grounded metaphors: metaphors with concrete subjects, or abstract subjects with conventional visual depictions, such as 'confusion' as a question mark or 'idea' as a lightbulb over someone's head.
  9. Method: A Human-AI Collaboration Framework
     • Step 2: Generating Visual Elaborations via Chain-of-Thought (CoT) and Few-Shot Prompting
       • Human experts then validate and, if necessary, perform minor edits on these elaborations (this occurred in 29% of cases).
     • Step 3: Image Generation and Human Filtering
       • The detailed "visual elaboration" from Step 2 is used as a prompt for a diffusion model (DALL-E 2) to generate images.
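The two-stage pipeline in Steps 2-3 can be sketched as follows. This is a minimal illustration, not the paper's actual prompts or APIs: the few-shot example, the prompt wording, and the `llm_generate`/`image_generate` callables are all hypothetical stand-ins.

```python
# Sketch of the LLM -> diffusion-model pipeline: the LLM first turns a
# linguistic metaphor into a detailed "visual elaboration", which then
# serves as the prompt for the image model. Illustrative only.

# A made-up few-shot example (not from the HAIVMet dataset).
FEW_SHOT = [
    ("My bedroom is a pig sty",
     "A messy bedroom with clothes scattered everywhere, an overflowing "
     "trash can, and a pig rooting through the clutter, conveying messiness."),
]

def build_cot_prompt(metaphor: str) -> str:
    """Assemble a few-shot prompt that asks the LLM to spell out the
    implicit meaning before describing a concrete, depictable scene."""
    lines = ["Rewrite each metaphor as a detailed visual scene. "
             "First infer the implicit meaning, then describe objects that depict it.\n"]
    for src, elaboration in FEW_SHOT:
        lines.append(f"Metaphor: {src}\nVisual elaboration: {elaboration}\n")
    lines.append(f"Metaphor: {metaphor}\nVisual elaboration:")
    return "\n".join(lines)

def visual_metaphor_pipeline(metaphor, llm_generate, image_generate):
    """Stage 1: the LLM elaborates the metaphor; Stage 2: the diffusion
    model renders the elaboration. Both callables are injected stubs here."""
    elaboration = llm_generate(build_cot_prompt(metaphor))
    return image_generate(elaboration)
```

In the paper's setup the elaborations are additionally validated and, where needed, edited by human experts before image generation; that human-in-the-loop step is omitted from this sketch.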
  10. Evaluation
     • Goal: To assess the impact of the LLM-Diffusion Model collaboration
     • Evaluation Setup:
       • Conducted by three professional illustrators and designers recruited via the Upwork platform.
       • The illustrators ranked the outputs from five different system setups on 100 randomly sampled metaphors.
       • Rankings were based on how well the image represented the metaphor. Raters also provided natural language instructions to improve imperfect images.
  11. Evaluation
     LLM-DALL-E2 is the winner, with the best average rank, the lowest Lost Cause rate, and the fewest instructions on average.
     • Avg Rank: the average ranking given by the 3 human evaluators
     • % Lost Cause: the percentage of images labelled 'Lost Cause'; these are counted as needing 5 edits when computing Avg # of Instructions, to keep that metric fair
     • Avg # of Instructions: the average number of edits needed to make the image perfect otherwise
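The three metrics above are straightforward to compute. The sketch below uses made-up sample data and assumes the convention stated on the slide, namely that a Lost Cause image contributes 5 edits to the instruction count.

```python
# Illustrative computation of the evaluation metrics: Avg Rank,
# % Lost Cause, and Avg # of Instructions. Sample data is invented.

LOST_CAUSE_EDITS = 5  # fairness convention: a Lost Cause counts as 5 edits

def avg_rank(ranks_per_rater):
    """Mean rank over all raters and all images (lower is better)."""
    flat = [r for rater in ranks_per_rater for r in rater]
    return sum(flat) / len(flat)

def pct_lost_cause(labels):
    """Percentage of images labelled 'lost_cause'."""
    return 100.0 * sum(1 for l in labels if l == "lost_cause") / len(labels)

def avg_instructions(edit_counts, labels):
    """Average edits needed to perfect an image, substituting the fixed
    penalty for images labelled Lost Cause."""
    counts = [LOST_CAUSE_EDITS if l == "lost_cause" else c
              for c, l in zip(edit_counts, labels)]
    return sum(counts) / len(counts)
```

For example, with two raters ranking two images `[[1, 2], [2, 1]]`, the average rank is 1.5; with one Lost Cause among four images, `% Lost Cause` is 25.0.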
  12. Human-AI Collaboration Evaluation
     • Intrinsic Evaluation: The images from HAIVMet obtained the highest preference and the lowest Lost Cause rate compared to other models.
     • Extrinsic Evaluation (Visual Entailment Task): Models' performance on the visual entailment task rose after adding HAIVMet to the training set.
  13. Compositionality in Visual Metaphors
     • Some researchers argue that metaphor arises through cross-domain composition, and that it is a visual and material phenomenon rather than a purely conceptual one.
     • Many images from the HAIVMet dataset showcase the compositional nature of visual metaphors.
  14. Overall Discussion
     • Key to Success: Role-sharing
       • LLM (Reasoning): a 'translator', interpreting abstract metaphors into concrete instructions.
       • Diffusion Model (Depicting): an 'artist', generating high-quality images from those specific instructions.
       • Human (Evaluating): managing context and quality that AI alone cannot.
  15. Conclusion
     • Contributions
       • The key finding is that using an LLM with CoT prompting to generate detailed "visual elaborations" significantly improves the quality of visual metaphors produced by diffusion models.
       • The resulting HAIVMet dataset.
     • Limitations
       • The human-AI collaboration process can be time-consuming.
       • The best-performing models used in the study are not open-source and are accessed via paid APIs.
       • The current work is limited to English-only metaphors.