[Paper Introduction] I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Slide 1

Slide 1 text

I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors Symbol Emergence System Lab. Journal Club Calendar Presenter: Yongyu Pu 1

Slide 2

Slide 2 text

Paper Information • Title: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors • Author: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn et al. • Pub. Date: 2023.07 • Link: https://aclanthology.org/2023.findings-acl.465/ 2

Slide 3

Slide 3 text

Background • Visual Metaphor • A powerful rhetorical device that uses images to communicate creative or persuasive ideas. • The Research Task • To generate visual metaphors from linguistic metaphors. For example, turning the text "My bedroom is a pig sty" into a representative image 3

Slide 4

Slide 4 text

Background • The Core Challenge for AI • Standard diffusion models (like DALL- E 2) find this task difficult because it requires understanding implicit meaning and compositionality. • For the "pig sty" metaphor, a model might just create a clean, pink-colored room, completely missing the implicit meaning of "messy". 4

Slide 5

Slide 5 text

Problem • Core Problem: Standard text-to-image models, like DALL-E 2, perform poorly when processing figurative language, especially metaphors. • Failure to Grasp Implicit Meaning • Failure in Compositionality • Under-specification of Prompts 5

Slide 6

Slide 6 text

Research Contribution • Novel Approach: LLM-Diffusion Model Collaboration • New method where a Large Language Model (LLM) collaborates with a text-to-image diffusion model • High-Quality Dataset: HAIVMet • The creation of HAIVMet (Human-AI Visual Metaphor), a new, large- scale dataset. • Thorough Evaluation Framework • The paper presents a comprehensive evaluation of this new approach. 6

Slide 7

Slide 7 text

Method: A Human-AI Collaboration Framework 7 HAIVMet Dataset (Human-AI Visual Metaphor) 1. Select Visual Grounded Metaphors 2. CoT Prompting to Generate Visual Elaborations 3. Human Filtering

Slide 8

Slide 8 text

Method: A Human-AI Collaboration Framework • Step 1: Selecting Visually Grounded Metaphors • Visually Grounded Metaphors: Such as metaphors of concrete subjects or some abstract subjects like ‘confusion’ as a question mark or ‘idea’ as a lightbulb over someone’s head 8 ‘idea’ as a lightbulb

Slide 9

Slide 9 text

• Step 2: Generating Visual Elaborations via Chain-of-Thought (CoT) and Few-Shot • Human experts then validate and, if necessary, perform minor edits on these elaborations (this occurred in 29% of cases). • Step 3: Image Generation and Human Filtering • The detailed "visual elaboration" from Step 2 is used as a prompt for a diffusion model (DALL-E 2) to generate images. 9 Method: A Human-AI Collaboration Framework

Slide 10

Slide 10 text

Chain-of-Thought Prompt and Output: 10 Method: A Human-AI Collaboration Framework …………

Slide 11

Slide 11 text

Evaluation • Goal: To assess the impact of the LLM-Diffusion Model collaboration • Evaluation Setup: • Conducted by three professional illustrators and designers recruited via the Upwork platform. • The illustrators ranked the outputs from five different system setups (Among 100 random metaphors, ) • Rankings were based on the image representation of the metaphor. Raters also provided natural language instructions to improve imperfect images. 11

Slide 12

Slide 12 text

Evaluation 12 LLM-DALL-E２ is the winner with the highest average rank and the least lost cause and average number of instructions. Avg Rank: the average ranking given by 3 human evaluators % Lost Cause: the percentage of the images labelled as ‘Lost Cause’, which are considered to have 5 edits to ensure fairness in this computation Avg # of Instructions: the average number of edits needed to make the image perfect otherwise

Slide 13

Slide 13 text

Human-AI Collaboration Evaluation • Intrinsic Evaluation • Extrinsic Evaluation: Visual Entailment Task 13 The images from HAIVMet obtained the highest preference and lowest lost cause rate compared to other models Model’s performance in visual entailment results task rose after adding HAIVMet into the training set.

Slide 14

Slide 14 text

Compositionality in Visual Metaphors • Some researchers argue that metaphors arises through cross- domain composition and a visual material rather than conceptual. • Many images from HAIVMet dataset showcase the compositional nature of visual metaphors 14

Slide 15

Slide 15 text

Overall Discussion • Key to Success: Role-sharing • LLM(Reasoning): A ‘translator’, interpreting abstract metaphors into concrete instructions. • Diffusion Model(Depicting): An ‘artist’, generating high-quality images from the specific instructions • Human(Evaluating): Managing context and quality that AI alone cannot. 15

Slide 16

Slide 16 text

Conclusion • Contributions • The key finding is that using a LLM with CoT prompting to generate detailed "visual elaborations" significantly improves the quality of visual metaphors produced by diffusion models • The resulting HAIVMet dataset • Limitations: • The Human-AI collaboration process can be time-consuming. • The best-performing models used in the study are not open-source and are accessed via paid APIs. • The current work is limited to English-only metaphors. 16