

YY. PU
June 17, 2025

[Paper Introduction] I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

2025/06/19

Paper introduction@TanichuLab

https://sites.google.com/view/tanichu-lab-ku/


Transcript

  1. I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors. Symbol Emergence System Lab, Journal Club. Presenter: Yongyu Pu
  2. Paper Information
     • Title: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
     • Authors: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn et al.
     • Pub. Date: 2023.07
     • Link: https://aclanthology.org/2023.findings-acl.465/
  3. Background
     • Visual Metaphor
       • A powerful rhetorical device that uses images to communicate creative or persuasive ideas.
     • The Research Task
       • To generate visual metaphors from linguistic metaphors, e.g. turning the text "My bedroom is a pig sty" into a representative image.
  4. Background
     • The Core Challenge for AI
       • Standard diffusion models (like DALL-E 2) find this task difficult because it requires understanding implicit meaning and compositionality.
       • For the "pig sty" metaphor, a model might just create a clean, pink-colored room, completely missing the implicit meaning of "messy".
  5. Problem
     • Core Problem: Standard text-to-image models, like DALL-E 2, perform poorly when processing figurative language, especially metaphors.
       • Failure to Grasp Implicit Meaning
       • Failure in Compositionality
       • Under-specification of Prompts
  6. Research Contribution
     • Novel Approach: LLM-Diffusion Model Collaboration
       • A new method where a Large Language Model (LLM) collaborates with a text-to-image diffusion model.
     • High-Quality Dataset: HAIVMet
       • The creation of HAIVMet (Human-AI Visual Metaphor), a new, large-scale dataset.
     • Thorough Evaluation Framework
       • The paper presents a comprehensive evaluation of this new approach.
  7. Method: A Human-AI Collaboration Framework
     HAIVMet Dataset (Human-AI Visual Metaphor):
     1. Select Visually Grounded Metaphors
     2. CoT Prompting to Generate Visual Elaborations
     3. Human Filtering
  8. Method: A Human-AI Collaboration Framework
     • Step 1: Selecting Visually Grounded Metaphors
       • Visually grounded metaphors: metaphors with concrete subjects, or abstract subjects with conventional visual depictions, such as 'confusion' as a question mark or 'idea' as a lightbulb over someone's head.
  9. Method: A Human-AI Collaboration Framework
     • Step 2: Generating Visual Elaborations via Chain-of-Thought (CoT) and Few-Shot Prompting
       • Human experts then validate and, if necessary, perform minor edits on these elaborations (this occurred in 29% of cases).
     • Step 3: Image Generation and Human Filtering
       • The detailed "visual elaboration" from Step 2 is used as a prompt for a diffusion model (DALL-E 2) to generate images.
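The two-stage pipeline in Steps 2-3 can be sketched as follows. This is a minimal illustration, not the paper's actual prompts or APIs: the few-shot example, the prompt wording, and the `llm_generate`/`image_generate` callables are all hypothetical stand-ins.

```python
# Sketch of the LLM -> diffusion-model pipeline: the LLM first turns a
# linguistic metaphor into a detailed "visual elaboration", which then
# serves as the prompt for the image model. Illustrative only.

# A made-up few-shot example (not from the HAIVMet dataset).
FEW_SHOT = [
    ("My bedroom is a pig sty",
     "A messy bedroom with clothes scattered everywhere, an overflowing "
     "trash can, and a pig rooting through the clutter, conveying messiness."),
]

def build_cot_prompt(metaphor: str) -> str:
    """Assemble a few-shot prompt that asks the LLM to spell out the
    implicit meaning before describing a concrete, depictable scene."""
    lines = ["Rewrite each metaphor as a detailed visual scene. "
             "First infer the implicit meaning, then describe objects that depict it.\n"]
    for src, elaboration in FEW_SHOT:
        lines.append(f"Metaphor: {src}\nVisual elaboration: {elaboration}\n")
    lines.append(f"Metaphor: {metaphor}\nVisual elaboration:")
    return "\n".join(lines)

def visual_metaphor_pipeline(metaphor, llm_generate, image_generate):
    """Stage 1: the LLM elaborates the metaphor; Stage 2: the diffusion
    model renders the elaboration. Both callables are injected stubs here."""
    elaboration = llm_generate(build_cot_prompt(metaphor))
    return image_generate(elaboration)
```

In the paper's setup the elaborations are additionally validated and, where needed, edited by human experts before image generation; that human-in-the-loop step is omitted from this sketch.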
  10. Evaluation
     • Goal: To assess the impact of the LLM-Diffusion Model collaboration
     • Evaluation Setup:
       • Conducted by three professional illustrators and designers recruited via the Upwork platform.
       • The illustrators ranked the outputs from five different system setups on 100 randomly sampled metaphors.
       • Rankings were based on how well the image represented the metaphor. Raters also provided natural language instructions to improve imperfect images.
  11. Evaluation
     LLM-DALL-E2 is the winner, with the best average rank, the lowest Lost Cause rate, and the fewest instructions on average.
     • Avg Rank: the average ranking given by the 3 human evaluators
     • % Lost Cause: the percentage of images labelled 'Lost Cause'; these are counted as needing 5 edits when computing Avg # of Instructions, to keep that metric fair
     • Avg # of Instructions: the average number of edits needed to make the image perfect otherwise
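The three metrics above are straightforward to compute. The sketch below uses made-up sample data and assumes the convention stated on the slide, namely that a Lost Cause image contributes 5 edits to the instruction count.

```python
# Illustrative computation of the evaluation metrics: Avg Rank,
# % Lost Cause, and Avg # of Instructions. Sample data is invented.

LOST_CAUSE_EDITS = 5  # fairness convention: a Lost Cause counts as 5 edits

def avg_rank(ranks_per_rater):
    """Mean rank over all raters and all images (lower is better)."""
    flat = [r for rater in ranks_per_rater for r in rater]
    return sum(flat) / len(flat)

def pct_lost_cause(labels):
    """Percentage of images labelled 'lost_cause'."""
    return 100.0 * sum(1 for l in labels if l == "lost_cause") / len(labels)

def avg_instructions(edit_counts, labels):
    """Average edits needed to perfect an image, substituting the fixed
    penalty for images labelled Lost Cause."""
    counts = [LOST_CAUSE_EDITS if l == "lost_cause" else c
              for c, l in zip(edit_counts, labels)]
    return sum(counts) / len(counts)
```

For example, with two raters ranking two images `[[1, 2], [2, 1]]`, the average rank is 1.5; with one Lost Cause among four images, `% Lost Cause` is 25.0.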
  12. Human-AI Collaboration Evaluation
     • Intrinsic Evaluation: The images from HAIVMet obtained the highest preference and the lowest Lost Cause rate compared to other models.
     • Extrinsic Evaluation (Visual Entailment Task): Models' performance on the visual entailment task rose after adding HAIVMet to the training set.
  13. Compositionality in Visual Metaphors
     • Some researchers argue that metaphor arises through cross-domain composition, and that it is a visual and material phenomenon rather than a purely conceptual one.
     • Many images from the HAIVMet dataset showcase the compositional nature of visual metaphors.
  14. Overall Discussion
     • Key to Success: Role-sharing
       • LLM (Reasoning): a 'translator', interpreting abstract metaphors into concrete instructions.
       • Diffusion Model (Depicting): an 'artist', generating high-quality images from those specific instructions.
       • Human (Evaluating): managing context and quality that AI alone cannot.
  15. Conclusion
     • Contributions
       • The key finding is that using an LLM with CoT prompting to generate detailed "visual elaborations" significantly improves the quality of visual metaphors produced by diffusion models.
       • The resulting HAIVMet dataset.
     • Limitations
       • The human-AI collaboration process can be time-consuming.
       • The best-performing models used in the study are not open-source and are accessed via paid APIs.
       • The current work is limited to English-only metaphors.